Part 1: Technology Stack & Architecture Foundations
Payment & Chargeback Fraud Platform - Principal-Level Design Document
Phase 1: Real-Time Streaming & Decisioning
1. Battle-Hardened Technology Stack
| Component | Technology Choice | Why (Mapped to Practitioner Constraints) |
|---|---|---|
| Event Streaming | Apache Kafka (Confluent Cloud or self-managed) | Constraint 5 (Velocity): Sub-100ms publish latency, exactly-once semantics via idempotent producers combined with transactions, log compaction for entity state recovery. Partitioning by card_token ensures ordered processing per card. |
| Stream Processing | Apache Flink (stateful, exactly-once) | Constraint 5 (Velocity): Native sliding/tumbling windows for velocity features, checkpoint-based recovery, <50ms processing latency. Keyed state for entity aggregates without external lookups in hot path. |
| Fast State Store | Redis Cluster (Redis Enterprise or ElastiCache) | Constraint 5 (Velocity): <5ms p99 reads for entity profiles, atomic increments for counters, TTL-based expiration for sliding windows. Lua scripting for complex atomic operations. |
| Feature Store | Feast (online) + Delta Lake (offline) | Constraint 7 (Economic Loop): Point-in-time correct feature retrieval for training prevents leakage. Online store backed by Redis for <10ms serving. Offline store enables historical replay for threshold simulation. |
| Policy Engine | Open Policy Agent (OPA) + Custom Decision Service | Constraint 2 (Policy > ML): Rego policies are version-controlled, hot-reloadable without deployment. Decision service wraps OPA with profit-threshold logic, experiment routing, and audit logging. |
| ML Model Serving | Seldon Core / KServe on Kubernetes | Constraint 8 (Model Lifecycle): Native canary deployments, traffic splitting for champion/challenger, A/B routing by request headers. Model registry integration (MLflow). Stateless inference <20ms. |
| Evidence Vault | PostgreSQL (immutable append) + S3 (blob evidence) | Constraint 4 (Evidence = Revenue): WORM-style table with INSERT-only permissions, cryptographic hashing for tamper detection. S3 for device fingerprints, screenshots, 3DS payloads. 7-year retention. |
| Model Registry & Training | MLflow + Kubeflow Pipelines | Constraint 8 (Model Lifecycle): Versioned model artifacts, experiment tracking, automated retraining DAGs. Label lag handling via configurable maturity windows. |
| Economic Optimization Service | Custom Python service + Metabase/Superset | Constraint 7 (Economic Loop): Batch job computes approval-rate vs loss curves nightly. Exposes threshold recommendations via API. Dashboards for finance/fraud ops to adjust risk budgets. |
| Governance & Audit | Apache Atlas + PostgreSQL audit tables | Constraint 9 (Governance): Data lineage tracking, policy change history with changed_by, changed_at, previous_value. Role-based access control. PII tagging for automated masking. |
| Observability | Prometheus + Grafana + OpenTelemetry + PagerDuty | Constraint 6 (Failure-Resilient): Distributed tracing for latency debugging, custom metrics for feature drift, alert routing with escalation. Safe-mode triggers based on metric thresholds. |
| PCI Tokenization | Basis Theory / Skyflow / VGS (Vault-as-a-Service) | Constraint 9 (PCI Boundaries): Fraud platform never sees raw PAN. Receives card_token and BIN (first 6-8 digits). Tokenization happens at PSP/gateway layer. |
2. Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                         PAYMENT & CHARGEBACK FRAUD PLATFORM                         │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────┐     ┌──────────────────┐       ┌─────────────────────────────────────┐
│ PSP/Gateway  │────▶│ Webhook Ingress  │──────▶│ Kafka: raw-payment-events           │
│ (Stripe,     │     │ (API Gateway +   │       │ Partitioned by card_token           │
│ Adyen, etc)  │     │ Idempotency      │       └─────────────────────────────────────┘
└──────────────┘     │ Check)           │                          │
                     └──────────────────┘                          ▼
                                                ┌─────────────────────────────────────┐
┌──────────────┐                                │ Flink Job: Event Normalization      │
│ Issuer/      │                                │ - Schema validation                 │
│ Network      │───────────────────────────────▶│ - Deduplication (idempotency key)   │
│ Alerts       │                                │ - Event type classification         │
└──────────────┘                                └─────────────────────────────────────┘
                                                                   │
                                                                   ▼
                                                ┌─────────────────────────────────────┐
                                                │ Kafka: normalized-events            │
                                                └─────────────────────────────────────┘
                                                                   │
             ┌───────────────────────────┬─────────────────────────┤
             │                           │                         │
             ▼                           ▼                         ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────┐
│ Flink Job: Entity       │ │ Flink Job: Velocity     │ │ Flink Job: Label    │
│ Feature Aggregation     │ │ Counter Updates         │ │ & Outcome Joiner    │
│ (user, device, card,    │ │ (Redis writes)          │ │ (chargebacks →      │
│ service profiles)       │ │                         │ │ training data)      │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────┘
             │                           │                         │
             ▼                           ▼                         ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────┐
│ Feature Store (Feast)   │ │ Redis Cluster           │ │ Delta Lake          │
│ Online: Redis           │ │ (velocity counters,     │ │ (labeled training   │
│ Offline: Delta Lake     │ │ entity profiles)        │ │ datasets)           │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────┘
             │                           │
             └─────────────┬─────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │ RISK SCORING SERVICE                │
        │ ┌─────────────────────────────────┐ │
        │ │ Feature Assembly (<10ms)        │ │
        │ │ - Redis lookups (parallel)      │ │
        │ │ - BIN/issuer enrichment         │ │
        │ │ - External signal merge         │ │
        │ └─────────────────────────────────┘ │
        │ ┌─────────────────────────────────┐ │
        │ │ Model Inference (<20ms)         │ │
        │ │ - Criminal fraud score          │ │
        │ │ - Friendly fraud score          │ │
        │ │ - Confidence intervals          │ │
        │ └─────────────────────────────────┘ │
        └─────────────────────────────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │ POLICY ENGINE                       │
        │ ┌─────────────────────────────────┐ │
        │ │ OPA: Business Rules             │ │
        │ │ - Velocity limits               │ │
        │ │ - Blocklists/allowlists         │ │
        │ │ - Service-specific rules        │ │
        │ └─────────────────────────────────┘ │
        │ ┌─────────────────────────────────┐ │
        │ │ Decision Service                │ │
        │ │ - Profit-threshold evaluation   │ │
        │ │ - Experiment routing            │ │
        │ │ - Override handling             │ │
        │ │ - Audit logging                 │ │
        │ └─────────────────────────────────┘ │
        └─────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┬───────────────┐
           ▼               ▼               ▼               ▼
      ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
      │ ALLOW    │    │ FRICTION │    │ REVIEW   │    │ BLOCK    │
      │          │    │ (3DS/MFA)│    │ (Queue)  │    │          │
      └──────────┘    └──────────┘    └──────────┘    └──────────┘
           │               │               │               │
           └───────────────┼───────────────┴───────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │ EVIDENCE VAULT                      │
        │ - Transaction context snapshot      │
        │ - Feature values at decision time   │
        │ - Policy rule trace                 │
        │ - Device/IP fingerprints            │
        └─────────────────────────────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │ Kafka: decision-events              │
        │ (for downstream systems, analytics) │
        └─────────────────────────────────────┘
3. Event & Idempotency Design
3.1 Event Types
| Event Type | Source | Timing | Key Fields |
|---|---|---|---|
| AUTHORIZATION | PSP webhook | Real-time (<1s) | auth_id, card_token, amount, currency, service_id, device_fingerprint, ip_address, billing_address, avs_result, cvv_result |
| CAPTURE | PSP webhook | Minutes to days after auth | capture_id, auth_id, captured_amount, capture_timestamp |
| REFUND | PSP webhook / Internal | Varies | refund_id, auth_id, refund_amount, refund_reason, initiated_by |
| CHARGEBACK_INITIATED | PSP / Network | 1-120 days post-transaction | chargeback_id, auth_id, reason_code, amount, initiated_date |
| CHARGEBACK_OUTCOME | PSP / Network | Days to months after initiation | chargeback_id, outcome (won/lost/partial), final_amount, evidence_submitted |
| DISPUTE_EVIDENCE_REQUEST | PSP / Network | During dispute window | chargeback_id, deadline, required_evidence_types |
| ISSUER_ALERT | Network (TC40, SAFE) | 1-7 days post-fraud | alert_id, card_token, service_id, alert_type, fraud_amount |
3.2 Idempotency Key Design
idempotency_key = SHA256(
    source_system +      # e.g., "stripe", "adyen"
    event_type +         # e.g., "AUTHORIZATION"
    source_event_id +    # PSP's unique ID
    event_timestamp      # ISO8601 with ms precision
)
Storage: Redis with 72-hour TTL
Key: idempotency:{idempotency_key}
Value: {
    "processed_at": "2025-01-15T10:30:00.123Z",
    "internal_event_id": "evt_abc123",
    "decision": "ALLOW"  // for auth events
}
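The key derivation above can be sketched in Python as follows. This is a minimal illustration, not the platform's actual implementation: the function names `build_idempotency_key` and `redis_key_for` are hypothetical, and the `|` field separator is an added assumption (plain concatenation would let `"ab" + "c"` and `"a" + "bc"` hash to the same value).

```python
import hashlib


def build_idempotency_key(source_system: str, event_type: str,
                          source_event_id: str, event_timestamp: str) -> str:
    """Return the hex SHA-256 digest over the four identifying fields."""
    # Assumption: fields are joined with "|" so that ("ab", "c") and ("a", "bc")
    # cannot produce the same hash input.
    material = "|".join([source_system, event_type, source_event_id, event_timestamp])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def redis_key_for(idempotency_key: str) -> str:
    """Build the Redis key in the idempotency:{idempotency_key} layout."""
    return f"idempotency:{idempotency_key}"


key = build_idempotency_key(
    "stripe", "AUTHORIZATION", "evt_123", "2025-01-15T10:30:00.123Z"
)
print(redis_key_for(key))
```

Because the digest depends only on the four input fields, a redelivered webhook carrying the same fields maps to the same Redis key.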
3.3 Exactly-Once Business Effects
┌─────────────────────────────────────────────────────────────────────┐
│                      IDEMPOTENCY HANDLING FLOW                      │
└─────────────────────────────────────────────────────────────────────┘
  Webhook arrives
         │
         ▼
┌─────────────────┐
│ Compute         │
│ idempotency_key │
└─────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐
│ Redis GET       │────▶│ Key exists?     │
│ idempotency:key │     └─────────────────┘
└─────────────────┘              │
                        ┌────────┴────────┐
                        │                 │
                       YES                NO
                        │                 │
                        ▼                 ▼
                 ┌─────────────┐ ┌─────────────────┐
                 │ Return      │ │ Redis SET NX    │
                 │ cached      │ │ idempotency:key │
                 │ decision    │ │ with 72h TTL    │
                 └─────────────┘ └─────────────────┘
                                          │
                                  ┌───────┴───────┐
                                  │               │
                               SET OK        SET FAILED
                              (we won)       (race lost)
                                  │               │
                                  ▼               ▼
                           ┌─────────────┐ ┌─────────────┐
                           │ Process     │ │ Retry GET   │
                           │ event       │ │ return      │
                           │ (full flow) │ │ cached      │
                           └─────────────┘ └─────────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ Update Redis    │
                         │ with decision   │
                         │ result          │
                         └─────────────────┘
Pseudo-code for Idempotency Check:
import asyncio
import json

# redis, logger, now_iso, compute_idempotency_key, process_event_full_flow and
# DecisionResponse are provided by the surrounding service.
async def handle_webhook(raw_event: dict) -> DecisionResponse:
    # 1. Compute idempotency key
    idem_key = compute_idempotency_key(raw_event)
    redis_key = f"idempotency:{idem_key}"

    # 2. Return the cached decision only if an earlier attempt completed;
    #    a record still marked "processing" carries no decision yet.
    cached = await redis.get(redis_key)
    if cached:
        record = json.loads(cached)
        if record.get("status") == "completed":
            logger.info(f"Duplicate event detected: {idem_key}")
            return DecisionResponse.from_cache(record)

    # 3. Attempt to claim processing rights (SET NX = set only if the key does not exist)
    claimed = await redis.set(
        redis_key,
        json.dumps({"status": "processing", "started_at": now_iso()}),
        nx=True,       # Only set if not exists
        ex=72 * 3600,  # 72-hour TTL
    )
    if not claimed:
        # Another instance holds the claim: wait briefly, then re-read
        await asyncio.sleep(0.1)
        cached = await redis.get(redis_key)
        record = json.loads(cached) if cached else None
        if record and record.get("status") == "completed":
            return DecisionResponse.from_cache(record)
        return DecisionResponse.pending()

    # 4. Process the event (we hold the claim)
    try:
        decision = await process_event_full_flow(raw_event)
        # 5. Store result for future duplicates
        await redis.set(
            redis_key,
            json.dumps({
                "status": "completed",
                "decision": decision.to_dict(),
                "processed_at": now_iso(),
            }),
            ex=72 * 3600,
        )
        return decision
    except Exception:
        # Release the claim so retries can attempt again
        await redis.delete(redis_key)
        raise
3.4 Out-of-Order Event Handling
Transaction State Machine:
VALID_TRANSITIONS = {
    None: ["AUTHORIZATION"],
    "AUTHORIZED": ["CAPTURE", "VOID", "CHARGEBACK_INITIATED"],
    "CAPTURED": ["REFUND", "CHARGEBACK_INITIATED"],
    "PARTIALLY_REFUNDED": ["REFUND", "CHARGEBACK_INITIATED"],
    "FULLY_REFUNDED": ["CHARGEBACK_INITIATED"],
    "VOIDED": [],
    "CHARGEBACK_INITIATED": ["CHARGEBACK_OUTCOME"],
    "CHARGEBACK_WON": [],
    "CHARGEBACK_LOST": [],
}

async def handle_event_with_ordering(event: NormalizedEvent) -> ProcessingResult:
    txn_key = f"txn_state:{event.auth_id}"
    async with redis.lock(f"lock:{txn_key}", timeout=5):
        current_state = await redis.hgetall(txn_key)
        if event.event_type == "AUTHORIZATION":
            if current_state:
                return ProcessingResult.already_processed(current_state)
            # Process authorization...
        elif event.event_type in ["CAPTURE", "REFUND", "CHARGEBACK_INITIATED"]:
            if not current_state:
                # Parent transaction not yet processed - queue for retry
                await enqueue_delayed_event(event, delay_seconds=30)
                return ProcessingResult.deferred()
            if not is_valid_transition(current_state["status"], event.event_type):
                logger.warning(f"Invalid transition: {current_state['status']} -> {event.event_type}")
                await store_invalid_transition(event, current_state)
                return ProcessingResult.invalid_sequence()
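The handler above calls `is_valid_transition` without defining it. One plausible reading, sketched here as an assumption rather than the document's own definition, is a lookup of the incoming event type in the list allowed for the current state. The table below is a subset of the one above, and rejecting unknown states is an illustrative choice.

```python
from typing import Optional

# Subset of the transition table: current state -> event types accepted next.
VALID_TRANSITIONS = {
    None: ["AUTHORIZATION"],
    "AUTHORIZED": ["CAPTURE", "VOID", "CHARGEBACK_INITIATED"],
    "CAPTURED": ["REFUND", "CHARGEBACK_INITIATED"],
    "VOIDED": [],
}


def is_valid_transition(current_status: Optional[str], event_type: str) -> bool:
    """Return True if event_type is allowed from current_status."""
    # Assumption: a state missing from the table accepts no events.
    return event_type in VALID_TRANSITIONS.get(current_status, [])


print(is_valid_transition("AUTHORIZED", "CAPTURE"))  # True
print(is_valid_transition("VOIDED", "REFUND"))       # False
```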
4. Feedback Loop & Label Hygiene
4.1 Label Taxonomy
| Label Category | Definition | Source Signals | Model/Policy Treatment |
|---|---|---|---|
| CRIMINAL_FRAUD | Third-party fraud: stolen credentials, account takeover, SIM farm attacks, device resale fraud | Issuer-confirmed fraud (TC40/SAFE), chargeback reason codes 10.1-10.5 (Visa) / 4837, 4840, 4849, 4863, 4870, 4871 (Mastercard), law enforcement reports | Hard blocks, device/IP blocklists, aggressive velocity limits |
| FRIENDLY_FRAUD | First-party abuse: legitimate cardholder disputes valid transaction, refund abuse | Chargeback reason 13.x (services/subscription), high dispute rate per user, pattern of disputes after service activation | Friction (3DS), soft denies with appeal path, evidence-heavy representment |
| SERVICE_ERROR | Operational issues: duplicate charges, wrong amounts, service provisioning failures | Chargeback reason 12.x (processing errors), internal ops flags, customer service tickets | Process fixes, no model penalty, operational alerts |
| LEGITIMATE | Valid transaction, no fraud | No chargeback within 120 days, successful service activation | Baseline for approval optimization |
| UNKNOWN | Insufficient signal to classify | Transaction too recent (<120 days), chargeback pending | Excluded from training until resolved |
4.2 Reason Code Mapping
VISA_REASON_CODE_MAP = {
    # Criminal Fraud (10.x)
    "10.1": "CRIMINAL_FRAUD",  # EMV Liability Shift Counterfeit
    "10.2": "CRIMINAL_FRAUD",  # EMV Liability Shift Non-Counterfeit
    "10.3": "CRIMINAL_FRAUD",  # Other Fraud - Card Present
    "10.4": "CRIMINAL_FRAUD",  # Other Fraud - Card Absent
    "10.5": "CRIMINAL_FRAUD",  # Visa Fraud Monitoring Program
    # Processing Errors (12.x) -> Service Error
    "12.1": "SERVICE_ERROR",   # Late Presentment
    "12.2": "SERVICE_ERROR",   # Incorrect Transaction Code
    "12.3": "SERVICE_ERROR",   # Incorrect Currency
    "12.4": "SERVICE_ERROR",   # Incorrect Account Number
    "12.5": "SERVICE_ERROR",   # Incorrect Amount
    "12.6": "SERVICE_ERROR",   # Duplicate Processing
    "12.7": "SERVICE_ERROR",   # Invalid Data
    # Consumer Disputes (13.x) -> Friendly Fraud
    "13.1": "FRIENDLY_FRAUD",  # Merchandise/Services Not Received
    "13.2": "FRIENDLY_FRAUD",  # Cancelled Recurring
    "13.3": "FRIENDLY_FRAUD",  # Not as Described
    "13.4": "FRIENDLY_FRAUD",  # Counterfeit Merchandise
    "13.5": "FRIENDLY_FRAUD",  # Misrepresentation
    "13.6": "FRIENDLY_FRAUD",  # Credit Not Processed
    "13.7": "FRIENDLY_FRAUD",  # Cancelled Merchandise/Services
}

def classify_chargeback(
    reason_code: str,
    user_dispute_history: dict,
    delivery_confirmed: bool,
    issuer_alert_exists: bool,
    customer_service_contact: bool,
) -> str:
    base_label = VISA_REASON_CODE_MAP.get(reason_code, "UNKNOWN")
    # Override rules
    if issuer_alert_exists and base_label != "CRIMINAL_FRAUD":
        return "CRIMINAL_FRAUD"
    if base_label == "FRIENDLY_FRAUD":
        if not delivery_confirmed and reason_code == "13.1":
            return "SERVICE_ERROR"  # Genuine non-receipt
        if user_dispute_history.get("dispute_count_12m", 0) > 3:
            return "FRIENDLY_FRAUD"  # Serial disputer confirmed
        if customer_service_contact and not delivery_confirmed:
            return "SERVICE_ERROR"  # Customer tried to resolve
    return base_label
4.3 Label Maturity Windows
┌─────────────────────────────────────────────────────────────────────┐
│                       LABEL MATURITY TIMELINE                       │
└─────────────────────────────────────────────────────────────────────┘
Transaction Date ─────────────────────────────────────────────────────▶
│
│  Day 0-30:    Most criminal fraud chargebacks arrive
│  Day 30-60:   Friendly fraud chargebacks peak
│  Day 60-90:   Late chargebacks, representment outcomes
│  Day 90-120:  Final stragglers, network adjustments
│  Day 120+:    Transaction considered "mature" for training
│
▼
TRAINING DATA WINDOWS:
┌──────────────────────────────────────────────────────────────────────┐
│ Include transactions where:                                          │
│   transaction_date < (current_date - 120 days)                       │
│                                                                      │
│ For "accelerated" training (fraud signal only):                      │
│   Include CRIMINAL_FRAUD labels from transactions 30-120 days old    │
│   (high-confidence positives even if window incomplete)              │
│                                                                      │
│ Exclude from training:                                               │
│   - Transactions < 30 days old (too immature)                        │
│   - Labels marked UNKNOWN                                            │
│   - Transactions under active dispute                                │
└──────────────────────────────────────────────────────────────────────┘
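The inclusion and exclusion rules above can be expressed as a single eligibility check. The sketch below is illustrative: the function name `training_eligibility` and its return values are hypothetical, and the boundary handling (strictly older than 120 days for "mature", 30 to 120 days inclusive for "accelerated") is one reading of the rules.

```python
from datetime import date, timedelta

MATURITY_DAYS = 120        # full maturity window
ACCELERATED_MIN_DAYS = 30  # minimum age for accelerated fraud-only labels


def training_eligibility(transaction_date: date, label: str,
                         under_active_dispute: bool, today: date) -> str:
    """Return 'mature', 'accelerated', or 'excluded' for one transaction."""
    age_days = (today - transaction_date).days
    if label == "UNKNOWN" or under_active_dispute:
        return "excluded"
    if age_days < ACCELERATED_MIN_DAYS:
        return "excluded"      # too immature
    if age_days > MATURITY_DAYS:
        return "mature"        # transaction_date < today - 120 days
    if label == "CRIMINAL_FRAUD":
        return "accelerated"   # high-confidence positive, window incomplete
    return "excluded"


today = date(2025, 6, 1)
print(training_eligibility(today - timedelta(days=200), "LEGITIMATE", False, today))
print(training_eligibility(today - timedelta(days=60), "CRIMINAL_FRAUD", False, today))
```

A transaction between 30 and 120 days old with a non-fraud label stays excluded, since a chargeback could still arrive and change its label.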
5. Model Update & Rollback Paths
5.1 End-to-End Model Lifecycle
STAGE 1: TRAINING
─────────────────
Training Data (Delta Lake) → Feature Engineering → Model Training (XGBoost/LGB)
                                                          │
                                                          ▼
                                          MLflow Registry (Stage: "None")
STAGE 2: OFFLINE VALIDATION
───────────────────────────
- Re-score 30 days of historical transactions with new model
- Compare decisions to actual outcomes
- Compute: precision, recall, false positive rate
- Compute: simulated revenue impact (approval rate × avg ticket)
- Compute: simulated loss (fraud rate × chargeback cost)
Pass Criteria:
- Fraud detection rate ≥ champion - 2%
- False positive rate ≤ champion + 0.5%
- Simulated net revenue ≥ champion - $X threshold
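A minimal sketch of the pass criteria as a gate function follows. It rests on assumptions: the "2%" and "0.5%" tolerances are read as absolute differences in rates (0.02 and 0.005), the unspecified $X threshold is left as the caller-supplied `revenue_tolerance` parameter, and `OfflineMetrics` and `failed_criteria` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OfflineMetrics:
    fraud_detection_rate: float   # fraction of fraud caught, e.g. 0.80
    false_positive_rate: float    # fraction of good traffic declined
    simulated_net_revenue: float  # currency units over the replay window


def failed_criteria(champion: OfflineMetrics, challenger: OfflineMetrics,
                    revenue_tolerance: float,
                    detection_tolerance: float = 0.02,
                    fpr_tolerance: float = 0.005) -> List[str]:
    """Return the names of criteria the challenger fails (empty list = pass)."""
    failures = []
    if challenger.fraud_detection_rate < champion.fraud_detection_rate - detection_tolerance:
        failures.append("fraud_detection_rate")
    if challenger.false_positive_rate > champion.false_positive_rate + fpr_tolerance:
        failures.append("false_positive_rate")
    if challenger.simulated_net_revenue < champion.simulated_net_revenue - revenue_tolerance:
        failures.append("simulated_net_revenue")
    return failures


champion = OfflineMetrics(0.80, 0.020, 1_000_000.0)
challenger = OfflineMetrics(0.81, 0.021, 995_000.0)
print(failed_criteria(champion, challenger, revenue_tolerance=10_000.0))
```

Returning the list of failed criteria, rather than a bare boolean, makes the promotion decision easier to record in an audit log.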
STAGE 3: SHADOW MODE (Optional, 3-7 days)
─────────────────────────────────────────
- New model scores live traffic but decisions NOT used
- Log predictions alongside champion predictions
- Compare prediction distributions, look for anomalies
STAGE 4: A/B TESTING (14-30 days)
─────────────────────────────────
- Champion: 90% of traffic
- Challenger: 10% of traffic
- Assignment: Deterministic hash of auth_id for consistency
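Deterministic assignment by hashing auth_id could look like the sketch below. The function name `assign_variant` and the `salt` parameter are illustrative assumptions; SHA-256 is used because Python's built-in `hash()` is randomized per process for strings and would not give a stable assignment across instances.

```python
import hashlib


def assign_variant(auth_id: str, challenger_percent: int = 10,
                   salt: str = "experiment-1") -> str:
    """Map auth_id to 'champion' or 'challenger' via a stable hash bucket."""
    # Assumption: a per-experiment salt, so that different experiments
    # do not always select the same slice of traffic.
    digest = hashlib.sha256(f"{salt}:{auth_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "challenger" if bucket < challenger_percent else "champion"


# The same auth_id always lands in the same arm, including on webhook retries.
print(assign_variant("auth_12345") == assign_variant("auth_12345"))
```

Retried or duplicated authorization events therefore receive the same model, which keeps decisions consistent with the idempotency design.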
STAGE 5: PRODUCTION
───────────────────
- Continuous monitoring for feature/concept drift
- Alert if PSI > 0.2 on any top-10 feature
- Alert if approval rate changes > 3% week-over-week
- Alert if fraud rate changes > 20% week-over-week
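The PSI alert above relies on the Population Stability Index, commonly computed as the sum over bins of (actual − expected) × ln(actual / expected), where each term uses the fraction of samples in that bin. The following is a minimal pure-Python sketch; the bin edges and the `eps` floor for empty bins are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               bin_edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI between a baseline sample and a current sample of one feature."""
    def fractions(values: Sequence[float]) -> List[float]:
        if not values:
            raise ValueError("sample must not be empty")
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [0.1] * 25 + [0.3] * 25 + [0.6] * 25 + [0.9] * 25
shifted = [0.1] * 10 + [0.3] * 10 + [0.6] * 10 + [0.9] * 70
psi = population_stability_index(baseline, shifted, bin_edges=[0.25, 0.5, 0.75])
print(psi > 0.2)  # True: this shift would trigger the alert
```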
5.2 Rollback Mechanism
model_deployment_config = {
    "champion": {
        # MLflow only accepts its built-in stage names in the
        # models:/<name>/<stage> form; custom names are registered aliases
        # and use the models:/<name>@<alias> form.
        "model_uri": "models:/fraud_model@champion",
        "version": "v2.3.1",
        "traffic_percentage": 100,
        "rollback_to": "v2.2.0"
    },
    "challenger": {
        "model_uri": "models:/fraud_model@challenger",
        "version": "v2.4.0-rc1",
        "traffic_percentage": 0,
        "auto_rollback_triggers": {
            "false_positive_rate_increase": 0.01,
            "approval_rate_decrease": 0.03,
            "latency_p99_increase_ms": 50
        }
    }
}
async def check_auto_rollback(metrics: dict, config: dict) -> bool:
    triggers = config["challenger"]["auto_rollback_triggers"]
    champion_metrics = metrics["champion"]
    challenger_metrics = metrics["challenger"]

    # Check false positive rate
    fpr_increase = challenger_metrics["false_positive_rate"] - champion_metrics["false_positive_rate"]
    if fpr_increase > triggers["false_positive_rate_increase"]:
        await execute_rollback("challenger", reason="fpr_threshold_exceeded")
        return True

    # Check approval rate
    apr_decrease = champion_metrics["approval_rate"] - challenger_metrics["approval_rate"]
    if apr_decrease > triggers["approval_rate_decrease"]:
        await execute_rollback("challenger", reason="approval_rate_threshold")
        return True

    # Check p99 latency (the "latency_p99_ms" metric key is an assumed name)
    latency_increase = challenger_metrics["latency_p99_ms"] - champion_metrics["latency_p99_ms"]
    if latency_increase > triggers["latency_p99_increase_ms"]:
        await execute_rollback("challenger", reason="latency_threshold_exceeded")
        return True

    return False
6. Failure & Attack Response Controls
6.1 Safe Mode Configuration
SAFE_MODE_CONFIG = {
    "enabled": False,
    "auto_triggers": {
        "model_error_rate_threshold": 0.05,
        "model_latency_p99_threshold_ms": 500,
        "redis_error_rate_threshold": 0.10,
        "consecutive_failures_threshold": 10
    },
    "fallback_rules": {
        "default_decision": "ALLOW",
        "hard_blocks": [
            {"condition": "card_on_blocklist", "decision": "BLOCK"},
            {"condition": "device_on_blocklist", "decision": "BLOCK"},
            {"condition": "ip_on_blocklist", "decision": "BLOCK"},
            {"condition": "amount_usd > 5000", "decision": "REVIEW"},
            {"condition": "country_high_risk AND amount_usd > 500", "decision": "FRICTION"}
        ],
        "velocity_limits": {
            "card_attempts_1h": {"limit": 5, "action": "FRICTION"},
            "device_attempts_1h": {"limit": 10, "action": "FRICTION"},
            "ip_attempts_1h": {"limit": 20, "action": "REVIEW"}
        }
    }
}
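One way the auto_triggers block could be evaluated against live health metrics is sketched below. The metric names in `TRIGGER_TO_METRIC`, the function `tripped_triggers`, and the "reaches or exceeds" comparison are all assumptions made for illustration; the configuration above does not specify them.

```python
from typing import Dict, List

AUTO_TRIGGERS = {
    "model_error_rate_threshold": 0.05,
    "model_latency_p99_threshold_ms": 500,
    "redis_error_rate_threshold": 0.10,
    "consecutive_failures_threshold": 10,
}

# Hypothetical mapping from each trigger to the health metric it watches.
TRIGGER_TO_METRIC = {
    "model_error_rate_threshold": "model_error_rate",
    "model_latency_p99_threshold_ms": "model_latency_p99_ms",
    "redis_error_rate_threshold": "redis_error_rate",
    "consecutive_failures_threshold": "consecutive_failures",
}


def tripped_triggers(metrics: Dict[str, float],
                     triggers: Dict[str, float] = AUTO_TRIGGERS) -> List[str]:
    """Return the triggers whose metric has reached or exceeded its threshold."""
    tripped = []
    for trigger_name, threshold in triggers.items():
        value = metrics.get(TRIGGER_TO_METRIC[trigger_name])
        # A missing metric is skipped here; treating it as a failure
        # is an equally defensible choice.
        if value is not None and value >= threshold:
            tripped.append(trigger_name)
    return tripped


healthy = {"model_error_rate": 0.01, "model_latency_p99_ms": 40,
           "redis_error_rate": 0.0, "consecutive_failures": 0}
print(tripped_triggers(healthy))                                   # []
print(tripped_triggers({**healthy, "model_latency_p99_ms": 750}))
```

A non-empty result would switch decisioning to the fallback_rules; returning the trigger names lets the alert state which condition fired.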
6.2 Threshold Rotation (Anti-Gaming)
THRESHOLD_ROTATION_CONFIG = {
    "enabled": True,
    "rotation_interval_hours": 4,
    "jitter_percentage": 0.10,  # ±10% random variation
    "base_thresholds": {
        "criminal_fraud_block": 0.85,
        "criminal_fraud_friction": 0.60,
        "friendly_fraud_friction": 0.70,
        "review_queue": 0.50
    },
    "constraints": {
        "criminal_fraud_block": {"min": 0.75, "max": 0.95},
        "criminal_fraud_friction": {"min": 0.50, "max": 0.70},
        "friendly_fraud_friction": {"min": 0.60, "max": 0.80},
        "review_queue": {"min": 0.40, "max": 0.60}
    }
}
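A sketch of how rotated thresholds might be derived from this configuration follows. It assumes the jitter is multiplicative (±10% of the base value), that each threshold is clamped to its min/max constraint, and that the random draw is seeded from a shared secret plus the rotation window index so every service instance computes the same value. The names `window_index`, `rotated_thresholds`, and `secret` are hypothetical.

```python
import hashlib
import random
from typing import Dict

BASE = {"criminal_fraud_block": 0.85, "criminal_fraud_friction": 0.60,
        "friendly_fraud_friction": 0.70, "review_queue": 0.50}
CONSTRAINTS = {"criminal_fraud_block": {"min": 0.75, "max": 0.95},
               "criminal_fraud_friction": {"min": 0.50, "max": 0.70},
               "friendly_fraud_friction": {"min": 0.60, "max": 0.80},
               "review_queue": {"min": 0.40, "max": 0.60}}


def window_index(epoch_seconds: float, rotation_interval_hours: int = 4) -> int:
    """Index of the rotation window containing the given Unix timestamp."""
    return int(epoch_seconds // (rotation_interval_hours * 3600))


def rotated_thresholds(base: Dict[str, float], constraints: Dict[str, dict],
                       jitter: float, window: int, secret: str) -> Dict[str, float]:
    """Jitter each base threshold for one window, clamped to its constraints."""
    result = {}
    for name, value in base.items():
        # Seed per (secret, window, threshold) so all instances agree.
        seed = hashlib.sha256(f"{secret}:{window}:{name}".encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(seed[:8], "big"))
        jittered = value * (1 + rng.uniform(-jitter, jitter))
        result[name] = min(max(jittered, constraints[name]["min"]),
                           constraints[name]["max"])
    return result


current = rotated_thresholds(BASE, CONSTRAINTS, 0.10,
                             window_index(1_700_000_000), secret="rotate-me")
print(current)
```

Python's `random.Random` is not a cryptographic generator; if the secret-derived values must resist prediction, a keyed derivation such as HMAC would be a safer basis.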
7. Latency Budget Breakdown
Target: < 200ms end-to-end for online decisioning
| Component | Budget | P50 | P99 | Notes |
|---|---|---|---|---|
| API Gateway + Auth | 10ms | 3ms | 8ms | JWT validation |
| Idempotency Check (Redis) | 10ms | 2ms | 7ms | Single GET |
| Event Normalization | 5ms | 1ms | 3ms | In-memory |
| Feature Assembly | 50ms | 25ms | 45ms | Parallel Redis calls |
| Model Inference | 30ms | 12ms | 25ms | Seldon/KServe |
| Policy Evaluation (OPA) | 15ms | 5ms | 12ms | Rego eval |
| Decision Logging | 5ms | 2ms | 4ms | Async Kafka produce |
| Evidence Capture | 10ms | 4ms | 8ms | Async S3 + PG |
| Response Serialization | 5ms | 2ms | 4ms | - |
| TOTAL | 140ms | 56ms | 116ms | 60ms buffer against the 200ms target |
Key Optimizations:
- Parallel Redis calls for all entity profiles
- Async evidence capture (does not block response)
- Cached BIN data with 24h TTL
- Pre-warmed connection pools
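The totals in the latency table can be checked with a few lines of arithmetic. The sketch below copies the per-component figures from the table. Summing per-component P99 values does not give the true end-to-end P99 (tail latencies of different components rarely coincide), so the summed figure is best read as a rough, usually pessimistic, bound.

```python
# (budget_ms, p50_ms, p99_ms) per component, copied from the table above.
LATENCY_MS = {
    "API Gateway + Auth":        (10, 3, 8),
    "Idempotency Check (Redis)": (10, 2, 7),
    "Event Normalization":       (5, 1, 3),
    "Feature Assembly":          (50, 25, 45),
    "Model Inference":           (30, 12, 25),
    "Policy Evaluation (OPA)":   (15, 5, 12),
    "Decision Logging":          (5, 2, 4),
    "Evidence Capture":          (10, 4, 8),
    "Response Serialization":    (5, 2, 4),
}
TARGET_MS = 200

budget_total = sum(budget for budget, _, _ in LATENCY_MS.values())
p50_total = sum(p50 for _, p50, _ in LATENCY_MS.values())
p99_total = sum(p99 for _, _, p99 in LATENCY_MS.values())
headroom = TARGET_MS - budget_total

print(budget_total, p50_total, p99_total, headroom)  # 140 56 116 60
```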
8. Sprint-1 Architecture Decisions Summary
| Decision | Choice | Rationale |
|---|---|---|
| Streaming | Kafka + Flink | Exactly-once semantics, stateful processing |
| Feature Store | Redis (online) + Feast + Delta Lake (offline) | <10ms online, point-in-time correct offline |
| Policy Engine | OPA + custom wrapper | Hot-reloadable, audit logging, experiments |
| Model Serving | Seldon Core | Canary deployments, traffic splitting |
| Evidence Storage | PostgreSQL + S3 | Immutable, compliant, cost-effective |
| PCI Boundary | Tokenization at PSP | Fraud platform never sees raw PAN |
| Failure Mode | Safe mode with fallback rules | Graceful degradation |
Next: Part 2: External Entities, Data Schemas & Feature Engineering