Skip to main content

Part 1: Technology Stack & Architecture Foundations

Payment & Chargeback Fraud Platform - Principal-Level Design Document

Phase 1: Real-Time Streaming & Decisioning


1. Battle-Hardened Technology Stack

ComponentTechnology ChoiceWhy (Mapped to Practitioner Constraints)
Event StreamingApache Kafka (Confluent Cloud or self-managed)Constraint 5 (Velocity): Sub-100ms publish latency, exactly-once semantics with idempotent producers, log compaction for entity state recovery. Partitioning by card_token ensures ordered processing per card.
Stream ProcessingApache Flink (stateful, exactly-once)Constraint 5 (Velocity): Native sliding/tumbling windows for velocity features, checkpoint-based recovery, <50ms processing latency. Keyed state for entity aggregates without external lookups in hot path.
Fast State StoreRedis Cluster (Redis Enterprise or Elasticache)Constraint 5 (Velocity): <5ms p99 reads for entity profiles, atomic increments for counters, TTL-based expiration for sliding windows. Lua scripting for complex atomic operations.
Feature StoreFeast (online) + Delta Lake (offline)Constraint 7 (Economic Loop): Point-in-time correct feature retrieval for training prevents leakage. Online store backed by Redis for <10ms serving. Offline store enables historical replay for threshold simulation.
Policy EngineOpen Policy Agent (OPA) + Custom Decision ServiceConstraint 2 (Policy > ML): Rego policies are version-controlled, hot-reloadable without deployment. Decision service wraps OPA with profit-threshold logic, experiment routing, and audit logging.
ML Model ServingSeldon Core / KServe on KubernetesConstraint 8 (Model Lifecycle): Native canary deployments, traffic splitting for champion/challenger, A/B routing by request headers. Model registry integration (MLflow). Stateless inference <20ms.
Evidence VaultPostgreSQL (immutable append) + S3 (blob evidence)Constraint 4 (Evidence = Revenue): WORM-style table with INSERT-only permissions, cryptographic hashing for tamper detection. S3 for device fingerprints, screenshots, 3DS payloads. 7-year retention.
Model Registry & TrainingMLflow + Kubeflow PipelinesConstraint 8 (Model Lifecycle): Versioned model artifacts, experiment tracking, automated retraining DAGs. Label lag handling via configurable maturity windows.
Economic Optimization ServiceCustom Python service + Metabase/SupersetConstraint 7 (Economic Loop): Batch job computes approval-rate vs loss curves nightly. Exposes threshold recommendations via API. Dashboards for finance/fraud ops to adjust risk budgets.
Governance & AuditApache Atlas + PostgreSQL audit tablesConstraint 9 (Governance): Data lineage tracking, policy change history with changed_by, changed_at, previous_value. Role-based access control. PII tagging for automated masking.
ObservabilityPrometheus + Grafana + OpenTelemetry + PagerDutyConstraint 6 (Failure-Resilient): Distributed tracing for latency debugging, custom metrics for feature drift, alert routing with escalation. Safe-mode triggers based on metric thresholds.
PCI TokenizationBasis Theory / Skyflow / VGS (Vault-as-a-Service)Constraint 9 (PCI Boundaries): Fraud platform never sees raw PAN. Receives card_token and BIN (first 6-8 digits). Tokenization happens at PSP/gateway layer.

2. Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────────────┐
│ PAYMENT & CHARGEBACK FRAUD PLATFORM │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌──────────────┐ ┌──────────────────┐ ┌─────────────────────────────────────┐
│ PSP/Gateway │────▶│ Webhook Ingress │────▶│ Kafka: raw-payment-events │
│ (Stripe, │ │ (API Gateway + │ │ Partitioned by card_token │
│ Adyen, etc) │ │ Idempotency │ └─────────────────────────────────────┘
└──────────────┘ │ Check) │ │
└──────────────────┘ ▼
┌─────────────────────────────────────┐
┌──────────────┐ │ Flink Job: Event Normalization │
│ Issuer/ │ │ - Schema validation │
│ Network │────────────────────────────────▶│ - Deduplication (idempotency key) │
│ Alerts │ │ - Event type classification │
└──────────────┘ └─────────────────────────────────────┘


┌─────────────────────────────────────┐
│ Kafka: normalized-events │
└─────────────────────────────────────┘

┌─────────────────────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────┐
│ Flink Job: Entity │ │ Flink Job: Velocity │ │ Flink Job: Label │
│ Feature Aggregation │ │ Counter Updates │ │ & Outcome Joiner │
│ (user, device, card, │ │ (Redis writes) │ │ (chargebacks → │
│ service profiles) │ │ │ │ training data) │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────┐
│ Feature Store (Feast) │ │ Redis Cluster │ │ Delta Lake │
│ Online: Redis │ │ (velocity counters, │ │ (labeled training │
│ Offline: Delta Lake │ │ entity profiles) │ │ datasets) │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────┘
│ │
└────────────────────┬────────────────────┘


┌─────────────────────────────────────┐
│ RISK SCORING SERVICE │
│ ┌─────────────────────────────────┐│
│ │ Feature Assembly (&lt;10ms) ││
│ │ - Redis lookups (parallel) ││
│ │ - BIN/issuer enrichment ││
│ │ - External signal merge ││
│ └─────────────────────────────────┘│
│ ┌─────────────────────────────────┐│
│ │ Model Inference (&lt;20ms) ││
│ │ - Criminal fraud score ││
│ │ - Friendly fraud score ││
│ │ - Confidence intervals ││
│ └─────────────────────────────────┘│
└─────────────────────────────────────┘


┌─────────────────────────────────────┐
│ POLICY ENGINE │
│ ┌─────────────────────────────────┐│
│ │ OPA: Business Rules ││
│ │ - Velocity limits ││
│ │ - Blocklists/allowlists ││
│ │ - Service-specific rules ││
│ └─────────────────────────────────┘│
│ ┌─────────────────────────────────┐│
│ │ Decision Service ││
│ │ - Profit-threshold evaluation ││
│ │ - Experiment routing ││
│ │ - Override handling ││
│ │ - Audit logging ││
│ └─────────────────────────────────┘│
└─────────────────────────────────────┘

┌───────────────┼───────────────┬───────────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ ALLOW │ │ FRICTION │ │ REVIEW │ │ BLOCK │
│ │ │ (3DS/MFA)│ │ (Queue) │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
└───────────────┴───────────────┴───────────────┘


┌─────────────────────────────────────┐
│ EVIDENCE VAULT │
│ - Transaction context snapshot │
│ - Feature values at decision time │
│ - Policy rule trace │
│ - Device/IP fingerprints │
└─────────────────────────────────────┘


┌─────────────────────────────────────┐
│ Kafka: decision-events │
│ (for downstream systems, analytics)│
└─────────────────────────────────────┘

3. Event & Idempotency Design

3.1 Event Types

Event TypeSourceTimingKey Fields
AUTHORIZATIONPSP webhookReal-time (<1s)auth_id, card_token, amount, currency, service_id, device_fingerprint, ip_address, billing_address, avs_result, cvv_result
CAPTUREPSP webhookMinutes to days after authcapture_id, auth_id, captured_amount, capture_timestamp
REFUNDPSP webhook / InternalVariesrefund_id, auth_id, refund_amount, refund_reason, initiated_by
CHARGEBACK_INITIATEDPSP / Network1-120 days post-transactionchargeback_id, auth_id, reason_code, amount, initiated_date
CHARGEBACK_OUTCOMEPSP / NetworkDays to months after initiationchargeback_id, outcome (won/lost/partial), final_amount, evidence_submitted
DISPUTE_EVIDENCE_REQUESTPSP / NetworkDuring dispute windowchargeback_id, deadline, required_evidence_types
ISSUER_ALERTNetwork (TC40, SAFE)1-7 days post-fraudalert_id, card_token, service_id, alert_type, fraud_amount

3.2 Idempotency Key Design

idempotency_key = SHA256(
source_system + # e.g., "stripe", "adyen"
event_type + # e.g., "AUTHORIZATION"
source_event_id + # PSP's unique ID
event_timestamp # ISO8601 with ms precision
)

Storage: Redis with 72-hour TTL

Key:    idempotency:{idempotency_key}
Value: {
"processed_at": "2025-01-15T10:30:00.123Z",
"internal_event_id": "evt_abc123",
"decision": "ALLOW" // for auth events
}
TTL: 72 hours

3.3 Exactly-Once Business Effects

┌─────────────────────────────────────────────────────────────────────┐
│ IDEMPOTENCY HANDLING FLOW │
└─────────────────────────────────────────────────────────────────────┘

Webhook arrives


┌─────────────────┐
│ Compute │
│ idempotency_key │
└─────────────────┘


┌─────────────────┐ ┌─────────────────┐
│ Redis GET │────▶│ Key exists? │
│ idempotency:key │ └─────────────────┘
└─────────────────┘ │
┌─────┴─────┐
│ │
YES NO
│ │
▼ ▼
┌─────────────┐ ┌─────────────────┐
│ Return │ │ Redis SET NX │
│ cached │ │ idempotency:key │
│ decision │ │ with 72h TTL │
└─────────────┘ └─────────────────┘

┌─────┴─────┐
│ │
SET OK SET FAILED
(we won) (race lost)
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Process │ │ Retry GET │
│ event │ │ return │
│ (full flow) │ │ cached │
└─────────────┘ └─────────────┘


┌─────────────────┐
│ Update Redis │
│ with decision │
│ result │
└─────────────────┘

Pseudo-code for Idempotency Check:

async def handle_webhook(raw_event: dict) -> DecisionResponse:
# 1. Compute idempotency key
idem_key = compute_idempotency_key(raw_event)

# 2. Check if already processed
cached = await redis.get(f"idempotency:{idem_key}")
if cached:
logger.info(f"Duplicate event detected: {idem_key}")
return DecisionResponse.from_cache(cached)

# 3. Attempt to claim processing rights (SET NX = set if not exists)
claimed = await redis.set(
f"idempotency:{idem_key}",
json.dumps({"status": "processing", "started_at": now_iso()}),
nx=True, # Only set if not exists
ex=72 * 3600 # 72 hour TTL
)

if not claimed:
# Race condition: another instance claimed it
await asyncio.sleep(0.1) # Brief wait for other instance
cached = await redis.get(f"idempotency:{idem_key}")
if cached and cached.get("status") == "completed":
return DecisionResponse.from_cache(cached)
return DecisionResponse.pending()

# 4. Process the event (we have the lock)
try:
decision = await process_event_full_flow(raw_event)

# 5. Store result for future duplicates
await redis.set(
f"idempotency:{idem_key}",
json.dumps({
"status": "completed",
"decision": decision.to_dict(),
"processed_at": now_iso()
}),
ex=72 * 3600
)
return decision

except Exception as e:
# Clear the lock so retries can attempt again
await redis.delete(f"idempotency:{idem_key}")
raise

3.4 Out-of-Order Event Handling

Transaction State Machine:

VALID_TRANSITIONS = {
None: ["AUTHORIZATION"],
"AUTHORIZED": ["CAPTURE", "VOID", "CHARGEBACK_INITIATED"],
"CAPTURED": ["REFUND", "CHARGEBACK_INITIATED"],
"PARTIALLY_REFUNDED": ["REFUND", "CHARGEBACK_INITIATED"],
"FULLY_REFUNDED": ["CHARGEBACK_INITIATED"],
"VOIDED": [],
"CHARGEBACK_INITIATED": ["CHARGEBACK_OUTCOME"],
"CHARGEBACK_WON": [],
"CHARGEBACK_LOST": [],
}

async def handle_event_with_ordering(event: NormalizedEvent) -> ProcessingResult:
txn_key = f"txn_state:{event.auth_id}"

async with redis.lock(f"lock:{txn_key}", timeout=5):
current_state = await redis.hgetall(txn_key)

if event.event_type == "AUTHORIZATION":
if current_state:
return ProcessingResult.already_processed(current_state)
# Process authorization...

elif event.event_type in ["CAPTURE", "REFUND", "CHARGEBACK_INITIATED"]:
if not current_state:
# Parent transaction not yet processed - queue for retry
await enqueue_delayed_event(event, delay_seconds=30)
return ProcessingResult.deferred()

if not is_valid_transition(current_state["status"], event.event_type):
logger.warning(f"Invalid transition: {current_state['status']} -> {event.event_type}")
await store_invalid_transition(event, current_state)
return ProcessingResult.invalid_sequence()

4. Feedback Loop & Label Hygiene

4.1 Label Taxonomy

Label CategoryDefinitionSource SignalsModel/Policy Treatment
CRIMINAL_FRAUDThird-party fraud: stolen credentials, account takeover, SIM farm attacks, device resale fraudIssuer-confirmed fraud (TC40/SAFE), chargeback reason codes 10.1-10.5 (Visa) / 4837-4863 (MC), law enforcement reportsHard blocks, device/IP blacklists, aggressive velocity limits
FRIENDLY_FRAUDFirst-party abuse: legitimate cardholder disputes valid transaction, refund abuseChargeback reason 13.x (services/subscription), high dispute rate per user, pattern of disputes after service activationFriction (3DS), soft denies with appeal path, evidence-heavy representment
SERVICE_ERROROperational issues: duplicate charges, wrong amounts, service provisioning failuresChargeback reason 12.x (processing errors), internal ops flags, customer service ticketsProcess fixes, no model penalty, operational alerts
LEGITIMATEValid transaction, no fraudNo chargeback within 120 days, successful service activationBaseline for approval optimization
UNKNOWNInsufficient signal to classifyTransaction too recent (<120 days), chargeback pendingExcluded from training until resolved

4.2 Reason Code Mapping

VISA_REASON_CODE_MAP = {
# Criminal Fraud (10.x)
"10.1": "CRIMINAL_FRAUD", # EMV Liability Shift Counterfeit
"10.2": "CRIMINAL_FRAUD", # EMV Liability Shift Non-Counterfeit
"10.3": "CRIMINAL_FRAUD", # Other Fraud - Card Present
"10.4": "CRIMINAL_FRAUD", # Other Fraud - Card Absent
"10.5": "CRIMINAL_FRAUD", # Visa Fraud Monitoring Program

# Processing Errors (12.x) -> Service Error
"12.1": "SERVICE_ERROR", # Late Presentment
"12.2": "SERVICE_ERROR", # Incorrect Transaction Code
"12.3": "SERVICE_ERROR", # Incorrect Currency
"12.4": "SERVICE_ERROR", # Incorrect Account Number
"12.5": "SERVICE_ERROR", # Incorrect Amount
"12.6": "SERVICE_ERROR", # Duplicate Processing
"12.7": "SERVICE_ERROR", # Invalid Data

# Consumer Disputes (13.x) -> Friendly Fraud
"13.1": "FRIENDLY_FRAUD", # Merchandise/Services Not Received
"13.2": "FRIENDLY_FRAUD", # Cancelled Recurring
"13.3": "FRIENDLY_FRAUD", # Not as Described
"13.4": "FRIENDLY_FRAUD", # Counterfeit Merchandise
"13.5": "FRIENDLY_FRAUD", # Misrepresentation
"13.6": "FRIENDLY_FRAUD", # Credit Not Processed
"13.7": "FRIENDLY_FRAUD", # Cancelled Merchandise/Services
}

def classify_chargeback(
reason_code: str,
user_dispute_history: dict,
delivery_confirmed: bool,
issuer_alert_exists: bool,
customer_service_contact: bool
) -> str:

base_label = VISA_REASON_CODE_MAP.get(reason_code, "UNKNOWN")

# Override rules
if issuer_alert_exists and base_label != "CRIMINAL_FRAUD":
return "CRIMINAL_FRAUD"

if base_label == "FRIENDLY_FRAUD":
if not delivery_confirmed and reason_code == "13.1":
return "SERVICE_ERROR" # Genuine non-receipt

if user_dispute_history.get("dispute_count_12m", 0) > 3:
return "FRIENDLY_FRAUD" # Serial disputer confirmed

if customer_service_contact and not delivery_confirmed:
return "SERVICE_ERROR" # Customer tried to resolve

return base_label

4.3 Label Maturity Windows

┌─────────────────────────────────────────────────────────────────────┐
│ LABEL MATURITY TIMELINE │
└─────────────────────────────────────────────────────────────────────┘

Transaction Date ─────────────────────────────────────────────────────▶

│ Day 0-30: Most criminal fraud chargebacks arrive
│ Day 30-60: Friendly fraud chargebacks peak
│ Day 60-90: Late chargebacks, representment outcomes
│ Day 90-120: Final stragglers, network adjustments
│ Day 120+: Transaction considered "mature" for training



TRAINING DATA WINDOWS:
┌──────────────────────────────────────────────────────────────────────┐
│ Include transactions where: │
│ transaction_date < (current_date - 120 days) │
│ │
│ For "accelerated" training (fraud signal only): │
│ Include CRIMINAL_FRAUD labels from transactions 30-120 days old │
│ (high-confidence positives even if window incomplete) │
│ │
│ Exclude from training: │
│ - Transactions < 30 days old (too immature) │
│ - Labels marked UNKNOWN │
│ - Transactions under active dispute │
└──────────────────────────────────────────────────────────────────────┘

5. Model Update & Rollback Paths

5.1 End-to-End Model Lifecycle

STAGE 1: TRAINING
─────────────────
Training Data (Delta Lake) → Feature Engineering → Model Training (XGBoost/LGB)


MLflow Registry (Stage: "None")

STAGE 2: OFFLINE VALIDATION
───────────────────────────
- Re-score 30 days of historical transactions with new model
- Compare decisions to actual outcomes
- Compute: precision, recall, false positive rate
- Compute: simulated revenue impact (approval rate × avg ticket)
- Compute: simulated loss (fraud rate × chargeback cost)

Pass Criteria:
- Fraud detection rate ≥ champion - 2%
- False positive rate ≤ champion + 0.5%
- Simulated net revenue ≥ champion - $X threshold

STAGE 3: SHADOW MODE (Optional, 3-7 days)
─────────────────────────────────────────
- New model scores live traffic but decisions NOT used
- Log predictions alongside champion predictions
- Compare prediction distributions, look for anomalies

STAGE 4: A/B TESTING (14-30 days)
─────────────────────────────────
- Champion: 90% of traffic
- Challenger: 10% of traffic
- Assignment: Deterministic hash of auth_id for consistency

STAGE 5: PRODUCTION
───────────────────
- Continuous monitoring for feature/concept drift
- Alert if PSI > 0.2 on any top-10 feature
- Alert if approval rate changes > 3% week-over-week
- Alert if fraud rate changes > 20% week-over-week

5.2 Rollback Mechanism

model_deployment_config = {
"champion": {
"model_uri": "models:/fraud_model/Champion",
"version": "v2.3.1",
"traffic_percentage": 100,
"rollback_to": "v2.2.0"
},
"challenger": {
"model_uri": "models:/fraud_model/Challenger",
"version": "v2.4.0-rc1",
"traffic_percentage": 0,
"auto_rollback_triggers": {
"false_positive_rate_increase": 0.01,
"approval_rate_decrease": 0.03,
"latency_p99_increase_ms": 50
}
}
}

async def check_auto_rollback(metrics: dict, config: dict) -> bool:
triggers = config["challenger"]["auto_rollback_triggers"]
champion_metrics = metrics["champion"]
challenger_metrics = metrics["challenger"]

# Check false positive rate
fpr_increase = challenger_metrics["false_positive_rate"] - champion_metrics["false_positive_rate"]
if fpr_increase > triggers["false_positive_rate_increase"]:
await execute_rollback("challenger", reason="fpr_threshold_exceeded")
return True

# Check approval rate
apr_decrease = champion_metrics["approval_rate"] - challenger_metrics["approval_rate"]
if apr_decrease > triggers["approval_rate_decrease"]:
await execute_rollback("challenger", reason="approval_rate_threshold")
return True

return False

6. Failure & Attack Response Controls

6.1 Safe Mode Configuration

SAFE_MODE_CONFIG = {
"enabled": False,
"auto_triggers": {
"model_error_rate_threshold": 0.05,
"model_latency_p99_threshold_ms": 500,
"redis_error_rate_threshold": 0.10,
"consecutive_failures_threshold": 10
},
"fallback_rules": {
"default_decision": "ALLOW",

"hard_blocks": [
{"condition": "card_on_blocklist", "decision": "BLOCK"},
{"condition": "device_on_blocklist", "decision": "BLOCK"},
{"condition": "ip_on_blocklist", "decision": "BLOCK"},
{"condition": "amount_usd > 5000", "decision": "REVIEW"},
{"condition": "country_high_risk AND amount_usd > 500", "decision": "FRICTION"}
],

"velocity_limits": {
"card_attempts_1h": {"limit": 5, "action": "FRICTION"},
"device_attempts_1h": {"limit": 10, "action": "FRICTION"},
"ip_attempts_1h": {"limit": 20, "action": "REVIEW"}
}
}
}

6.2 Threshold Rotation (Anti-Gaming)

THRESHOLD_ROTATION_CONFIG = {
"enabled": True,
"rotation_interval_hours": 4,
"jitter_percentage": 0.10, # ±10% random variation

"base_thresholds": {
"criminal_fraud_block": 0.85,
"criminal_fraud_friction": 0.60,
"friendly_fraud_friction": 0.70,
"review_queue": 0.50
},

"constraints": {
"criminal_fraud_block": {"min": 0.75, "max": 0.95},
"criminal_fraud_friction": {"min": 0.50, "max": 0.70},
"friendly_fraud_friction": {"min": 0.60, "max": 0.80},
"review_queue": {"min": 0.40, "max": 0.60}
}
}

7. Latency Budget Breakdown

Target: < 200ms end-to-end for online decisioning

ComponentBudgetP50P99Notes
API Gateway + Auth10ms3ms8msJWT validation
Idempotency Check (Redis)10ms2ms7msSingle GET
Event Normalization5ms1ms3msIn-memory
Feature Assembly50ms25ms45msParallel Redis calls
Model Inference30ms12ms25msSeldon/KServe
Policy Evaluation (OPA)15ms5ms12msRego eval
Decision Logging5ms2ms4msAsync Kafka produce
Evidence Capture10ms4ms8msAsync S3 + PG
Response Serialization5ms2ms4ms-
TOTAL140ms56ms128ms60ms buffer

Key Optimizations:

  1. Parallel Redis calls for all entity profiles
  2. Async evidence capture (does not block response)
  3. Cached BIN data with 24h TTL
  4. Pre-warmed connection pools

8. Sprint-1 Architecture Decisions Summary

DecisionChoiceRationale
StreamingKafka + FlinkExactly-once semantics, stateful processing
Feature StoreRedis (online) + Feast + Delta Lake (offline)<10ms online, point-in-time correct offline
Policy EngineOPA + custom wrapperHot-reloadable, audit logging, experiments
Model ServingSeldon CoreCanary deployments, traffic splitting
Evidence StoragePostgreSQL + S3Immutable, compliant, cost-effective
PCI BoundaryTokenization at PSPFraud platform never sees raw PAN
Failure ModeSafe mode with fallback rulesGraceful degradation

Next: Part 2: External Entities, Data Schemas & Feature Engineering