Part 1: Technology Stack & Architecture Foundations

Payment & Chargeback Fraud Platform - Principal-Level Design Document

Phase 1: Real-Time Streaming & Decisioning

1. Battle-Hardened Technology Stack

Component	Technology Choice	Why (Mapped to Practitioner Constraints)
Event Streaming	Apache Kafka (Confluent Cloud or self-managed)	Constraint 5 (Velocity): Sub-100ms publish latency, exactly-once semantics via idempotent producers and transactions, log compaction for entity state recovery. Partitioning by `card_token` ensures ordered processing per card.
Stream Processing	Apache Flink (stateful, exactly-once)	Constraint 5 (Velocity): Native sliding/tumbling windows for velocity features, checkpoint-based recovery, <50ms processing latency. Keyed state for entity aggregates without external lookups in hot path.
Fast State Store	Redis Cluster (Redis Enterprise or ElastiCache)	Constraint 5 (Velocity): <5ms p99 reads for entity profiles, atomic increments for counters, TTL-based expiration for sliding windows. Lua scripting for complex atomic operations.
Feature Store	Feast (online) + Delta Lake (offline)	Constraint 7 (Economic Loop): Point-in-time correct feature retrieval for training prevents leakage. Online store backed by Redis for <10ms serving. Offline store enables historical replay for threshold simulation.
Policy Engine	Open Policy Agent (OPA) + Custom Decision Service	Constraint 2 (Policy > ML): Rego policies are version-controlled, hot-reloadable without deployment. Decision service wraps OPA with profit-threshold logic, experiment routing, and audit logging.
ML Model Serving	Seldon Core / KServe on Kubernetes	Constraint 8 (Model Lifecycle): Native canary deployments, traffic splitting for champion/challenger, A/B routing by request headers. Model registry integration (MLflow). Stateless inference <20ms.
Evidence Vault	PostgreSQL (immutable append) + S3 (blob evidence)	Constraint 4 (Evidence = Revenue): WORM-style table with `INSERT`-only permissions, cryptographic hashing for tamper detection. S3 for device fingerprints, screenshots, 3DS payloads. 7-year retention.
Model Registry & Training	MLflow + Kubeflow Pipelines	Constraint 8 (Model Lifecycle): Versioned model artifacts, experiment tracking, automated retraining DAGs. Label lag handling via configurable maturity windows.
Economic Optimization Service	Custom Python service + Metabase/Superset	Constraint 7 (Economic Loop): Batch job computes approval-rate vs loss curves nightly. Exposes threshold recommendations via API. Dashboards for finance/fraud ops to adjust risk budgets.
Governance & Audit	Apache Atlas + PostgreSQL audit tables	Constraint 9 (Governance): Data lineage tracking, policy change history with `changed_by`, `changed_at`, `previous_value`. Role-based access control. PII tagging for automated masking.
Observability	Prometheus + Grafana + OpenTelemetry + PagerDuty	Constraint 6 (Failure-Resilient): Distributed tracing for latency debugging, custom metrics for feature drift, alert routing with escalation. Safe-mode triggers based on metric thresholds.
PCI Tokenization	Basis Theory / Skyflow / VGS (Vault-as-a-Service)	Constraint 9 (PCI Boundaries): Fraud platform never sees raw PAN. Receives `card_token` and BIN (first 6-8 digits). Tokenization happens at PSP/gateway layer.
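
The Evidence Vault row mentions cryptographic hashing for tamper detection without saying how. One common approach is a hash chain, in which each record's hash also covers the previous record's hash. The sketch below assumes that approach; the field names `prev_hash` and `record_hash` and the helpers `append_record` and `verify_chain` are hypothetical and not part of the design above.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(payload: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys) so the same payload always hashes identically
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> dict:
    """Append an evidence record whose hash covers the previous record's hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    record = {
        "payload": payload,
        "prev_hash": prev_hash,
        "record_hash": _digest(payload, prev_hash),
    }
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edited or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False
        if record["record_hash"] != _digest(record["payload"], prev_hash):
            return False
        prev_hash = record["record_hash"]
    return True
```

In a PostgreSQL deployment the two hash fields would be ordinary columns on the `INSERT`-only table, and verification could run as a periodic audit job.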

2. Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              PAYMENT & CHARGEBACK FRAUD PLATFORM                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────────┐     ┌─────────────────────────────────────┐
│   PSP/Gateway │────▶│  Webhook Ingress │────▶│  Kafka: raw-payment-events          │
│   (Stripe,    │     │  (API Gateway +  │     │  Partitioned by card_token          │
│   Adyen, etc) │     │  Idempotency     │     └─────────────────────────────────────┘
└──────────────┘     │  Check)          │                      │
                     └──────────────────┘                      ▼
                                                 ┌─────────────────────────────────────┐
┌──────────────┐                                 │  Flink Job: Event Normalization     │
│  Issuer/     │                                 │  - Schema validation                │
│  Network     │────────────────────────────────▶│  - Deduplication (idempotency key)  │
│  Alerts      │                                 │  - Event type classification        │
└──────────────┘                                 └─────────────────────────────────────┘
                                                               │
                                                               ▼
                                                 ┌─────────────────────────────────────┐
                                                 │  Kafka: normalized-events            │
                                                 └─────────────────────────────────────┘
                                                               │
                     ┌─────────────────────────────────────────┼─────────────────────────┐
                     │                                         │                         │
                     ▼                                         ▼                         ▼
       ┌─────────────────────────┐           ┌─────────────────────────┐   ┌─────────────────────┐
       │  Flink Job: Entity      │           │  Flink Job: Velocity    │   │  Flink Job: Label   │
       │  Feature Aggregation    │           │  Counter Updates        │   │  & Outcome Joiner   │
       │  (user, device, card,   │           │  (Redis writes)         │   │  (chargebacks →     │
       │   service profiles)     │           │                         │   │   training data)    │
       └─────────────────────────┘           └─────────────────────────┘   └─────────────────────┘
                     │                                         │                         │
                     ▼                                         ▼                         ▼
       ┌─────────────────────────┐           ┌─────────────────────────┐   ┌─────────────────────┐
       │  Feature Store (Feast)  │           │  Redis Cluster          │   │  Delta Lake         │
       │  Online: Redis          │           │  (velocity counters,    │   │  (labeled training  │
       │  Offline: Delta Lake    │           │   entity profiles)      │   │   datasets)         │
       └─────────────────────────┘           └─────────────────────────┘   └─────────────────────┘
                     │                                         │
                     └────────────────────┬────────────────────┘
                                          │
                                          ▼
                           ┌─────────────────────────────────────┐
                           │  RISK SCORING SERVICE               │
                           │  ┌─────────────────────────────────┐│
                           │  │ Feature Assembly (<10ms)        ││
                           │  │ - Redis lookups (parallel)      ││
                           │  │ - BIN/issuer enrichment         ││
                           │  │ - External signal merge         ││
                           │  └─────────────────────────────────┘│
                           │  ┌─────────────────────────────────┐│
                           │  │ Model Inference (<20ms)         ││
                           │  │ - Criminal fraud score          ││
                           │  │ - Friendly fraud score          ││
                           │  │ - Confidence intervals          ││
                           │  └─────────────────────────────────┘│
                           └─────────────────────────────────────┘
                                          │
                                          ▼
                           ┌─────────────────────────────────────┐
                           │  POLICY ENGINE                      │
                           │  ┌─────────────────────────────────┐│
                           │  │ OPA: Business Rules             ││
                           │  │ - Velocity limits               ││
                           │  │ - Blocklists/allowlists         ││
                           │  │ - Service-specific rules        ││
                           │  └─────────────────────────────────┘│
                           │  ┌─────────────────────────────────┐│
                           │  │ Decision Service                ││
                           │  │ - Profit-threshold evaluation   ││
                           │  │ - Experiment routing            ││
                           │  │ - Override handling             ││
                           │  │ - Audit logging                 ││
                           │  └─────────────────────────────────┘│
                           └─────────────────────────────────────┘
                                          │
                          ┌───────────────┼───────────────┬───────────────┐
                          ▼               ▼               ▼               ▼
                    ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
                    │  ALLOW   │   │ FRICTION │   │  REVIEW  │   │  BLOCK   │
                    │          │   │ (3DS/MFA)│   │  (Queue) │   │          │
                    └──────────┘   └──────────┘   └──────────┘   └──────────┘
                          │               │               │               │
                          └───────────────┴───────────────┴───────────────┘
                                          │
                                          ▼
                           ┌─────────────────────────────────────┐
                           │  EVIDENCE VAULT                     │
                           │  - Transaction context snapshot     │
                           │  - Feature values at decision time  │
                           │  - Policy rule trace                │
                           │  - Device/IP fingerprints           │
                           └─────────────────────────────────────┘
                                          │
                                          ▼
                           ┌─────────────────────────────────────┐
                           │  Kafka: decision-events             │
                           │  (for downstream systems, analytics)│
                           └─────────────────────────────────────┘

3. Event & Idempotency Design

3.1 Event Types

Event Type	Source	Timing	Key Fields
`AUTHORIZATION`	PSP webhook	Real-time (<1s)	`auth_id`, `card_token`, `amount`, `currency`, `service_id`, `device_fingerprint`, `ip_address`, `billing_address`, `avs_result`, `cvv_result`
`CAPTURE`	PSP webhook	Minutes to days after auth	`capture_id`, `auth_id`, `captured_amount`, `capture_timestamp`
`REFUND`	PSP webhook / Internal	Varies	`refund_id`, `auth_id`, `refund_amount`, `refund_reason`, `initiated_by`
`CHARGEBACK_INITIATED`	PSP / Network	1-120 days post-transaction	`chargeback_id`, `auth_id`, `reason_code`, `amount`, `initiated_date`
`CHARGEBACK_OUTCOME`	PSP / Network	Days to months after initiation	`chargeback_id`, `outcome` (won/lost/partial), `final_amount`, `evidence_submitted`
`DISPUTE_EVIDENCE_REQUEST`	PSP / Network	During dispute window	`chargeback_id`, `deadline`, `required_evidence_types`
`ISSUER_ALERT`	Network (TC40, SAFE)	1-7 days post-fraud	`alert_id`, `card_token`, `service_id`, `alert_type`, `fraud_amount`

3.2 Idempotency Key Design

idempotency_key = SHA256(
    source_system +      # e.g., "stripe", "adyen"
    event_type +         # e.g., "AUTHORIZATION"
    source_event_id +    # PSP's unique ID
    event_timestamp      # ISO8601 with ms precision
)

Storage: Redis with 72-hour TTL

Key:    idempotency:{idempotency_key}
Value:  {
          "processed_at": "2025-01-15T10:30:00.123Z",
          "internal_event_id": "evt_abc123",
          "decision": "ALLOW"  // for auth events
        }
TTL:    72 hours
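
A minimal sketch of the key computation, assuming the normalized webhook exposes the four inputs under the dictionary keys shown (the key names are illustrative). The field separator is also an assumption: plain concatenation would let `("ab", "c")` and `("a", "bc")` collide, so the fields are joined with a delimiter that should not occur in the values.

```python
import hashlib


def compute_idempotency_key(raw_event: dict) -> str:
    """SHA-256 over the four identifying fields, joined with a separator."""
    parts = [
        raw_event["source_system"],    # e.g. "stripe", "adyen"
        raw_event["event_type"],       # e.g. "AUTHORIZATION"
        raw_event["source_event_id"],  # PSP's unique ID
        raw_event["event_timestamp"],  # ISO8601 with ms precision
    ]
    # ASCII unit separator (0x1f) keeps field boundaries unambiguous
    joined = "\x1f".join(str(part) for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

The result is a 64-character hex string, which becomes the suffix of the `idempotency:{idempotency_key}` Redis key.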

3.3 Exactly-Once Business Effects

┌─────────────────────────────────────────────────────────────────────┐
│                    IDEMPOTENCY HANDLING FLOW                         │
└─────────────────────────────────────────────────────────────────────┘

Webhook arrives
      │
      ▼
┌─────────────────┐
│ Compute         │
│ idempotency_key │
└─────────────────┘
      │
      ▼
┌─────────────────┐     ┌─────────────────┐
│ Redis GET       │────▶│ Key exists?     │
│ idempotency:key │     └─────────────────┘
└─────────────────┘            │
                         ┌─────┴─────┐
                         │           │
                        YES         NO
                         │           │
                         ▼           ▼
              ┌─────────────┐  ┌─────────────────┐
              │ Return      │  │ Redis SET NX    │
              │ cached      │  │ idempotency:key │
              │ decision    │  │ with 72h TTL    │
              └─────────────┘  └─────────────────┘
                                     │
                               ┌─────┴─────┐
                               │           │
                           SET OK      SET FAILED
                           (we won)    (race lost)
                               │           │
                               ▼           ▼
                    ┌─────────────┐  ┌─────────────┐
                    │ Process     │  │ Retry GET   │
                    │ event       │  │ return      │
                    │ (full flow) │  │ cached      │
                    └─────────────┘  └─────────────┘
                               │
                               ▼
                    ┌─────────────────┐
                    │ Update Redis    │
                    │ with decision   │
                    │ result          │
                    └─────────────────┘

Pseudo-code for Idempotency Check:

import asyncio
import json
import logging

logger = logging.getLogger(__name__)

IDEMPOTENCY_TTL_SECONDS = 72 * 3600  # 72-hour TTL


async def handle_webhook(raw_event: dict) -> DecisionResponse:
    # 1. Compute idempotency key
    idem_key = compute_idempotency_key(raw_event)
    redis_key = f"idempotency:{idem_key}"

    # 2. Check if already processed. Redis returns a JSON string, so it must
    #    be parsed before the status can be inspected.
    cached = await redis.get(redis_key)
    if cached:
        record = json.loads(cached)
        if record.get("status") == "completed":
            logger.info(f"Duplicate event detected: {idem_key}")
            return DecisionResponse.from_cache(record)
        # Still "processing" on another instance: there is no decision to return yet
        return DecisionResponse.pending()

    # 3. Attempt to claim processing rights (SET NX = set if not exists)
    claimed = await redis.set(
        redis_key,
        json.dumps({"status": "processing", "started_at": now_iso()}),
        nx=True,  # Only set if not exists
        ex=IDEMPOTENCY_TTL_SECONDS,
    )

    if not claimed:
        # Race condition: another instance claimed it
        await asyncio.sleep(0.1)  # Brief wait for other instance
        cached = await redis.get(redis_key)
        record = json.loads(cached) if cached else None
        if record and record.get("status") == "completed":
            return DecisionResponse.from_cache(record)
        return DecisionResponse.pending()

    # 4. Process the event (we hold the claim). Note: if this instance crashes
    #    here, the "processing" marker stays until its TTL expires.
    try:
        decision = await process_event_full_flow(raw_event)

        # 5. Store result for future duplicates
        await redis.set(
            redis_key,
            json.dumps({
                "status": "completed",
                "decision": decision.to_dict(),
                "processed_at": now_iso(),
            }),
            ex=IDEMPOTENCY_TTL_SECONDS,
        )
        return decision

    except Exception:
        # Clear the claim so retries can attempt again
        await redis.delete(redis_key)
        raise

3.4 Out-of-Order Event Handling

Transaction State Machine:

VALID_TRANSITIONS = {
    None: ["AUTHORIZATION"],
    "AUTHORIZED": ["CAPTURE", "VOID", "CHARGEBACK_INITIATED"],
    "CAPTURED": ["REFUND", "CHARGEBACK_INITIATED"],
    "PARTIALLY_REFUNDED": ["REFUND", "CHARGEBACK_INITIATED"],
    "FULLY_REFUNDED": ["CHARGEBACK_INITIATED"],
    "VOIDED": [],
    "CHARGEBACK_INITIATED": ["CHARGEBACK_OUTCOME"],
    "CHARGEBACK_WON": [],
    "CHARGEBACK_LOST": [],
}

async def handle_event_with_ordering(event: NormalizedEvent) -> ProcessingResult:
    txn_key = f"txn_state:{event.auth_id}"

    async with redis.lock(f"lock:{txn_key}", timeout=5):
        current_state = await redis.hgetall(txn_key)

        if event.event_type == "AUTHORIZATION":
            if current_state:
                return ProcessingResult.already_processed(current_state)
            # Process authorization...

        elif event.event_type in ["CAPTURE", "REFUND", "CHARGEBACK_INITIATED"]:
            if not current_state:
                # Parent transaction not yet processed - queue for retry
                await enqueue_delayed_event(event, delay_seconds=30)
                return ProcessingResult.deferred()

            if not is_valid_transition(current_state["status"], event.event_type):
                logger.warning(f"Invalid transition: {current_state['status']} -> {event.event_type}")
                await store_invalid_transition(event, current_state)
                return ProcessingResult.invalid_sequence()
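
The handler above calls `is_valid_transition` without defining it. A minimal sketch, assuming it is a plain lookup in the transition table keyed by the current status, with `None` meaning no state has been stored yet (the table is repeated here only to keep the example self-contained):

```python
from typing import Optional

VALID_TRANSITIONS = {
    None: ["AUTHORIZATION"],
    "AUTHORIZED": ["CAPTURE", "VOID", "CHARGEBACK_INITIATED"],
    "CAPTURED": ["REFUND", "CHARGEBACK_INITIATED"],
    "PARTIALLY_REFUNDED": ["REFUND", "CHARGEBACK_INITIATED"],
    "FULLY_REFUNDED": ["CHARGEBACK_INITIATED"],
    "VOIDED": [],
    "CHARGEBACK_INITIATED": ["CHARGEBACK_OUTCOME"],
    "CHARGEBACK_WON": [],
    "CHARGEBACK_LOST": [],
}


def is_valid_transition(current_status: Optional[str], event_type: str) -> bool:
    """True if event_type may be applied to a transaction in current_status."""
    # Unknown statuses fall back to an empty list, so they reject every event
    return event_type in VALID_TRANSITIONS.get(current_status, [])
```

For example, `is_valid_transition("CAPTURED", "REFUND")` is accepted, while `is_valid_transition("VOIDED", "CAPTURE")` is rejected because a voided transaction is terminal.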

4. Feedback Loop & Label Hygiene

4.1 Label Taxonomy

Label Category	Definition	Source Signals	Model/Policy Treatment
CRIMINAL_FRAUD	Third-party fraud: stolen credentials, account takeover, SIM farm attacks, device resale fraud	Issuer-confirmed fraud (TC40/SAFE), chargeback reason codes 10.1-10.5 (Visa) / fraud codes such as 4837, 4840, 4849, 4863 (MC), law enforcement reports	Hard blocks, device/IP blacklists, aggressive velocity limits
FRIENDLY_FRAUD	First-party abuse: legitimate cardholder disputes valid transaction, refund abuse	Chargeback reason 13.x (services/subscription), high dispute rate per user, pattern of disputes after service activation	Friction (3DS), soft denies with appeal path, evidence-heavy representment
SERVICE_ERROR	Operational issues: duplicate charges, wrong amounts, service provisioning failures	Chargeback reason 12.x (processing errors), internal ops flags, customer service tickets	Process fixes, no model penalty, operational alerts
LEGITIMATE	Valid transaction, no fraud	No chargeback within 120 days, successful service activation	Baseline for approval optimization
UNKNOWN	Insufficient signal to classify	Transaction too recent (<120 days), chargeback pending	Excluded from training until resolved

4.2 Reason Code Mapping

VISA_REASON_CODE_MAP = {
    # Criminal Fraud (10.x)
    "10.1": "CRIMINAL_FRAUD",  # EMV Liability Shift Counterfeit
    "10.2": "CRIMINAL_FRAUD",  # EMV Liability Shift Non-Counterfeit
    "10.3": "CRIMINAL_FRAUD",  # Other Fraud - Card Present
    "10.4": "CRIMINAL_FRAUD",  # Other Fraud - Card Absent
    "10.5": "CRIMINAL_FRAUD",  # Visa Fraud Monitoring Program

    # Processing Errors (12.x) -> Service Error
    "12.1": "SERVICE_ERROR",  # Late Presentment
    "12.2": "SERVICE_ERROR",  # Incorrect Transaction Code
    "12.3": "SERVICE_ERROR",  # Incorrect Currency
    "12.4": "SERVICE_ERROR",  # Incorrect Account Number
    "12.5": "SERVICE_ERROR",  # Incorrect Amount
    "12.6": "SERVICE_ERROR",  # Duplicate Processing
    "12.7": "SERVICE_ERROR",  # Invalid Data

    # Consumer Disputes (13.x) -> Friendly Fraud
    "13.1": "FRIENDLY_FRAUD",  # Merchandise/Services Not Received
    "13.2": "FRIENDLY_FRAUD",  # Cancelled Recurring
    "13.3": "FRIENDLY_FRAUD",  # Not as Described
    "13.4": "FRIENDLY_FRAUD",  # Counterfeit Merchandise
    "13.5": "FRIENDLY_FRAUD",  # Misrepresentation
    "13.6": "FRIENDLY_FRAUD",  # Credit Not Processed
    "13.7": "FRIENDLY_FRAUD",  # Cancelled Merchandise/Services
}

def classify_chargeback(
    reason_code: str,
    user_dispute_history: dict,
    delivery_confirmed: bool,
    issuer_alert_exists: bool,
    customer_service_contact: bool
) -> str:

    base_label = VISA_REASON_CODE_MAP.get(reason_code, "UNKNOWN")

    # Override rules
    if issuer_alert_exists and base_label != "CRIMINAL_FRAUD":
        return "CRIMINAL_FRAUD"

    if base_label == "FRIENDLY_FRAUD":
        if not delivery_confirmed and reason_code == "13.1":
            return "SERVICE_ERROR"  # Genuine non-receipt

        if user_dispute_history.get("dispute_count_12m", 0) > 3:
            return "FRIENDLY_FRAUD"  # Serial disputer confirmed

        if customer_service_contact and not delivery_confirmed:
            return "SERVICE_ERROR"  # Customer tried to resolve

    return base_label

4.3 Label Maturity Windows

┌─────────────────────────────────────────────────────────────────────┐
│                    LABEL MATURITY TIMELINE                           │
└─────────────────────────────────────────────────────────────────────┘

Transaction Date ─────────────────────────────────────────────────────▶
     │
     │  Day 0-30: Most criminal fraud chargebacks arrive
     │  Day 30-60: Friendly fraud chargebacks peak
     │  Day 60-90: Late chargebacks, representment outcomes
     │  Day 90-120: Final stragglers, network adjustments
     │  Day 120+: Transaction considered "mature" for training
     │
     ▼

TRAINING DATA WINDOWS:
┌──────────────────────────────────────────────────────────────────────┐
│  Include transactions where:                                          │
│    transaction_date < (current_date - 120 days)                       │
│                                                                       │
│  For "accelerated" training (fraud signal only):                      │
│    Include CRIMINAL_FRAUD labels from transactions 30-120 days old    │
│    (high-confidence positives even if window incomplete)              │
│                                                                       │
│  Exclude from training:                                               │
│    - Transactions < 30 days old (too immature)                        │
│    - Labels marked UNKNOWN                                            │
│    - Transactions under active dispute                                │
└──────────────────────────────────────────────────────────────────────┘
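
The windowing rules above can be sketched as a single eligibility function. This is an illustration under stated assumptions: the function name, the return values (`"full"`, `"accelerated"`, `"excluded"`), and the inclusive treatment of the 30-day and 120-day boundaries are not specified in the design.

```python
from datetime import date

FULL_MATURITY_DAYS = 120    # transaction considered mature after this age
ACCELERATED_MIN_DAYS = 30   # younger transactions are always excluded


def training_eligibility(
    txn_date: date, label: str, under_active_dispute: bool, today: date
) -> str:
    """Classify one transaction as "full", "accelerated", or "excluded"."""
    age_days = (today - txn_date).days

    # Hard exclusions apply regardless of age
    if label == "UNKNOWN" or under_active_dispute:
        return "excluded"
    if age_days < ACCELERATED_MIN_DAYS:
        return "excluded"

    # transaction_date < (current_date - 120 days)
    if age_days > FULL_MATURITY_DAYS:
        return "full"

    # 30-120 days old: only high-confidence fraud positives are usable early
    if label == "CRIMINAL_FRAUD":
        return "accelerated"
    return "excluded"
```

A `LEGITIMATE` label on a 60-day-old transaction is excluded because a chargeback could still arrive, whereas a `CRIMINAL_FRAUD` label of the same age is kept for accelerated training.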

5. Model Update & Rollback Paths

5.1 End-to-End Model Lifecycle

STAGE 1: TRAINING
─────────────────
Training Data (Delta Lake) → Feature Engineering → Model Training (XGBoost/LightGBM)
                                                          │
                                                          ▼
                                               MLflow Registry (Stage: "None")

STAGE 2: OFFLINE VALIDATION
───────────────────────────
- Re-score 30 days of historical transactions with new model
- Compare decisions to actual outcomes
- Compute: precision, recall, false positive rate
- Compute: simulated revenue impact (approval rate × avg ticket)
- Compute: simulated loss (fraud rate × chargeback cost)

Pass Criteria:
  - Fraud detection rate ≥ champion - 2%
  - False positive rate ≤ champion + 0.5%
  - Simulated net revenue ≥ champion - $X threshold

STAGE 3: SHADOW MODE (Optional, 3-7 days)
─────────────────────────────────────────
- New model scores live traffic but decisions NOT used
- Log predictions alongside champion predictions
- Compare prediction distributions, look for anomalies

STAGE 4: A/B TESTING (14-30 days)
─────────────────────────────────
- Champion: 90% of traffic
- Challenger: 10% of traffic
- Assignment: Deterministic hash of auth_id for consistency

STAGE 5: PRODUCTION
───────────────────
- Continuous monitoring for feature/concept drift
- Alert if PSI > 0.2 on any top-10 feature
- Alert if approval rate changes > 3% week-over-week
- Alert if fraud rate changes > 20% week-over-week
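
The Stage 5 drift alert relies on the Population Stability Index (PSI). A minimal sketch of the calculation for one numeric feature, assuming equal-width bins derived from the baseline sample and a small floor on empty buckets; production systems often use quantile bins instead, and the function name is illustrative.

```python
import math
from typing import Sequence

PSI_ALERT_THRESHOLD = 0.2  # alert level from Stage 5


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a baseline sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bucket_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share so empty buckets do not cause log(0) or division by zero
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    # PSI = sum over buckets of (actual - expected) * ln(actual / expected)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An alert would fire when the returned value exceeds `PSI_ALERT_THRESHOLD` for any monitored feature.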

5.2 Rollback Mechanism

model_deployment_config = {
    "champion": {
        "model_uri": "models:/fraud_model@champion",  # MLflow registered-model alias URI
        "version": "v2.3.1",
        "traffic_percentage": 100,
        "rollback_to": "v2.2.0"
    },
    "challenger": {
        "model_uri": "models:/fraud_model@challenger",
        "version": "v2.4.0-rc1",
        "traffic_percentage": 0,
        "auto_rollback_triggers": {
            "false_positive_rate_increase": 0.01,
            "approval_rate_decrease": 0.03,
            "latency_p99_increase_ms": 50
        }
    }
}

async def check_auto_rollback(metrics: dict, config: dict) -> bool:
    triggers = config["challenger"]["auto_rollback_triggers"]
    champion_metrics = metrics["champion"]
    challenger_metrics = metrics["challenger"]

    # Check false positive rate
    fpr_increase = challenger_metrics["false_positive_rate"] - champion_metrics["false_positive_rate"]
    if fpr_increase > triggers["false_positive_rate_increase"]:
        await execute_rollback("challenger", reason="fpr_threshold_exceeded")
        return True

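    # Check p99 latency (assumes metrics expose a "latency_p99_ms" key)
    latency_increase = challenger_metrics["latency_p99_ms"] - champion_metrics["latency_p99_ms"]
    if latency_increase > triggers["latency_p99_increase_ms"]:
        await execute_rollback("challenger", reason="latency_threshold_exceeded")
        return True

    # Check approval rate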
    apr_decrease = champion_metrics["approval_rate"] - challenger_metrics["approval_rate"]
    if apr_decrease > triggers["approval_rate_decrease"]:
        await execute_rollback("challenger", reason="approval_rate_threshold")
        return True

    return False
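
Stage 4 calls for deterministic assignment by hashing `auth_id`, and the deployment config carries a `traffic_percentage` per arm. A minimal sketch of how the two could fit together, assuming a SHA-256 hash reduced to 100 buckets; the function name and bucketing scheme are illustrative.

```python
import hashlib


def assign_arm(auth_id: str, challenger_percentage: int) -> str:
    """Map auth_id to a stable bucket in [0, 100) and choose a model arm."""
    digest = hashlib.sha256(auth_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets below the challenger share go to the challenger, the rest to champion
    return "challenger" if bucket < challenger_percentage else "champion"
```

Because the bucket depends only on `auth_id`, retries of the same authorization always reach the same model, and rolling back amounts to setting the challenger percentage to 0.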

6. Failure & Attack Response Controls

6.1 Safe Mode Configuration

SAFE_MODE_CONFIG = {
    "enabled": False,
    "auto_triggers": {
        "model_error_rate_threshold": 0.05,
        "model_latency_p99_threshold_ms": 500,
        "redis_error_rate_threshold": 0.10,
        "consecutive_failures_threshold": 10
    },
    "fallback_rules": {
        "default_decision": "ALLOW",

        "hard_blocks": [
            {"condition": "card_on_blocklist", "decision": "BLOCK"},
            {"condition": "device_on_blocklist", "decision": "BLOCK"},
            {"condition": "ip_on_blocklist", "decision": "BLOCK"},
            {"condition": "amount_usd > 5000", "decision": "REVIEW"},
            {"condition": "country_high_risk AND amount_usd > 500", "decision": "FRICTION"}
        ],

        "velocity_limits": {
            "card_attempts_1h": {"limit": 5, "action": "FRICTION"},
            "device_attempts_1h": {"limit": 10, "action": "FRICTION"},
            "ip_attempts_1h": {"limit": 20, "action": "REVIEW"}
        }
    }
}
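
A minimal sketch of how the `auto_triggers` block might be evaluated. The metric key names (`model_error_rate`, `model_latency_p99_ms`, `redis_error_rate`, `consecutive_failures`) and the choice of `>=` as the comparison are assumptions; the config above only defines the thresholds.

```python
AUTO_TRIGGERS = {
    "model_error_rate_threshold": 0.05,
    "model_latency_p99_threshold_ms": 500,
    "redis_error_rate_threshold": 0.10,
    "consecutive_failures_threshold": 10,
}


def should_enter_safe_mode(metrics: dict, auto_triggers: dict) -> list:
    """Return the names of tripped triggers; an empty list means stay in normal mode."""
    # Pair each threshold with the (hypothetical) metric it guards
    observed = {
        "model_error_rate_threshold": metrics.get("model_error_rate", 0.0),
        "model_latency_p99_threshold_ms": metrics.get("model_latency_p99_ms", 0.0),
        "redis_error_rate_threshold": metrics.get("redis_error_rate", 0.0),
        "consecutive_failures_threshold": metrics.get("consecutive_failures", 0),
    }
    return [name for name, value in observed.items() if value >= auto_triggers[name]]
```

Returning the list of tripped triggers, rather than a bare boolean, leaves a reason that can be attached to the audit log entry when safe mode is enabled.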

6.2 Threshold Rotation (Anti-Gaming)

THRESHOLD_ROTATION_CONFIG = {
    "enabled": True,
    "rotation_interval_hours": 4,
    "jitter_percentage": 0.10,  # ±10% random variation

    "base_thresholds": {
        "criminal_fraud_block": 0.85,
        "criminal_fraud_friction": 0.60,
        "friendly_fraud_friction": 0.70,
        "review_queue": 0.50
    },

    "constraints": {
        "criminal_fraud_block": {"min": 0.75, "max": 0.95},
        "criminal_fraud_friction": {"min": 0.50, "max": 0.70},
        "friendly_fraud_friction": {"min": 0.60, "max": 0.80},
        "review_queue": {"min": 0.40, "max": 0.60}
    }
}
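
One way to apply this config is to derive the jitter from the rotation window and a secret, so every service instance computes the same thresholds without coordination while outsiders cannot predict them. The sketch below assumes that approach; the HMAC-based seeding and the `secret` parameter are not part of the config above.

```python
import hashlib
import hmac
import random

BASE_THRESHOLDS = {
    "criminal_fraud_block": 0.85,
    "criminal_fraud_friction": 0.60,
    "friendly_fraud_friction": 0.70,
    "review_queue": 0.50,
}
CONSTRAINTS = {
    "criminal_fraud_block": {"min": 0.75, "max": 0.95},
    "criminal_fraud_friction": {"min": 0.50, "max": 0.70},
    "friendly_fraud_friction": {"min": 0.60, "max": 0.80},
    "review_queue": {"min": 0.40, "max": 0.60},
}


def rotated_thresholds(
    epoch_seconds: int, secret: bytes, interval_hours: int = 4, jitter: float = 0.10
) -> dict:
    """Thresholds for the rotation window containing epoch_seconds."""
    window = epoch_seconds // (interval_hours * 3600)
    result = {}
    for name, base in BASE_THRESHOLDS.items():
        # Keyed hash of (threshold name, window) seeds a per-threshold RNG
        seed = hmac.new(secret, f"{name}:{window}".encode("utf-8"), hashlib.sha256).digest()
        rng = random.Random(int.from_bytes(seed, "big"))
        value = base * (1 + rng.uniform(-jitter, jitter))  # ±jitter around base
        bounds = CONSTRAINTS[name]
        result[name] = min(max(value, bounds["min"]), bounds["max"])  # clamp
    return result
```

With the values above, ±10% jitter never reaches the configured bounds, so the clamp acts only as a safety net if the base thresholds or jitter are later changed.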

7. Latency Budget Breakdown

Target: < 200ms end-to-end for online decisioning

Component	Budget	P50	P99	Notes
API Gateway + Auth	10ms	3ms	8ms	JWT validation
Idempotency Check (Redis)	10ms	2ms	7ms	GET, then SET NX on a miss
Event Normalization	5ms	1ms	3ms	In-memory
Feature Assembly	50ms	25ms	45ms	Parallel Redis calls
Model Inference	30ms	12ms	25ms	Seldon/KServe
Policy Evaluation (OPA)	15ms	5ms	12ms	Rego eval
Decision Logging	5ms	2ms	4ms	Async Kafka produce
Evidence Capture	10ms	4ms	8ms	Async S3 + PG
Response Serialization	5ms	2ms	4ms	-
TOTAL	140ms	56ms	116ms	60ms buffer against the 200ms target; summed P99s are a conservative upper bound

Key Optimizations:

- Parallel Redis calls for all entity profiles
- Async evidence capture (does not block response)
- Cached BIN data with 24h TTL
- Pre-warmed connection pools
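
The first optimization, parallel lookups, can be sketched with `asyncio.gather`. To keep the example self-contained, `fetch_profile` is a stand-in that simulates a lookup with a short sleep; in the real service it would be a Redis client call. The function names and the 50ms default timeout (taken from the Feature Assembly budget) are assumptions.

```python
import asyncio


async def fetch_profile(entity_type: str, entity_id: str) -> dict:
    """Stand-in for one Redis profile lookup."""
    await asyncio.sleep(0.01)  # simulate a ~10ms round trip
    return {"entity_type": entity_type, "entity_id": entity_id}


async def assemble_features(entities: dict, timeout_s: float = 0.05) -> dict:
    """Fetch all entity profiles concurrently; latency is roughly the slowest call."""
    names = list(entities)
    lookups = [fetch_profile(name, entities[name]) for name in names]
    # gather preserves input order, so results line up with names
    results = await asyncio.wait_for(asyncio.gather(*lookups), timeout=timeout_s)
    return dict(zip(names, results))
```

Usage would look like `asyncio.run(assemble_features({"user": "u1", "device": "d1", "card": "c1", "service": "s1"}))`. On timeout, `asyncio.wait_for` raises `TimeoutError`, which is a natural point to fall back to the safe-mode rules.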

8. Sprint-1 Architecture Decisions Summary

Decision	Choice	Rationale
Streaming	Kafka + Flink	Exactly-once semantics, stateful processing
Feature Store	Redis (online) + Feast + Delta Lake (offline)	<10ms online, point-in-time correct offline
Policy Engine	OPA + custom wrapper	Hot-reloadable, audit logging, experiments
Model Serving	Seldon Core	Canary deployments, traffic splitting
Evidence Storage	PostgreSQL + S3	Immutable, compliant, cost-effective
PCI Boundary	Tokenization at PSP	Fraud platform never sees raw PAN
Failure Mode	Safe mode with fallback rules	Graceful degradation

Next: Part 2: External Entities, Data Schemas & Feature Engineering
