Systems

Case studies by Efe Ertugrul. Real constraints, architecture decisions, and what actually shipped.

Simmunome — AI Bioinformatics Platform

Full-stack + DevOps

What It Does

Platform for protein-protein interaction (PPI) network analysis, drug target validation, and biomarker discovery. Core workloads:

Target validation: Complex pathfinding algorithms on PPI networks to identify therapeutic targets and predict drug efficacy.
Biomarker discovery: ML classifiers (XGBoost) trained on multi-omics data to identify patient responders and stratify clinical populations.
Network simulations: Graph-based algorithms (PageRank variants, network diffusion) for scoring protein interactions and pathway enrichment.

Ingests proteomics, genomics, and clinical metadata (hundreds of GBs per analysis). Exposes interactive tools for researchers via React frontend.

Infrastructure Design

Node groups: GKE/EKS with 3 node pools: (1) general-purpose (n2-standard-4) for API/DB, (2) compute-optimized (c2-standard-8) for graph algorithms, (3) spot instances (c2-standard-16) for batch ML training. Spot saves ~60% on compute-heavy jobs where failure is tolerable.

When to use what: API traffic → general-purpose (stable, low-latency). PPI pathfinding → compute-optimized (CPU-bound, need consistent performance). XGBoost training → spot (long-running, checkpointed every 15min, can tolerate evictions).

Key Decisions

K8s over serverless: Jobs run 10min–2hr, well past Lambda's 15-minute execution limit. K8s gives workload isolation, autoscaling, and per-service resource tuning.

MongoDB: Hierarchical data model (orgs → projects → analyses → results) with nested metadata. Scientific teams add fields weekly. Indexed on user IDs, project IDs, timestamps. Queries < 100ms with millions of documents.

Storage tiering: Hot data (user queries, recent results) in MongoDB. Cold data (multi-GB PPI networks, raw omics files) in S3/GCS with signed URLs. Keeps queries fast and costs low.

CI/CD: GitHub Actions for linting, tests, Docker builds. Auto-deploy to staging on merge. Deployment time: ~1 hour → 8 minutes.

Issues Fixed

Connection pooling: Initially each job opened a new DB connection, which hit connection limits. Fix: shared pool with max size tuned to available memory. Latency dropped 40%.
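
A minimal sketch of the pooling idea, using only the standard library. `ConnectionPool`, `make_conn`, and `max_size` are hypothetical names for illustration; with MongoDB the driver's own pool (for example PyMongo's `maxPoolSize` option) would normally do this job.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: jobs borrow a connection instead of opening a new one."""

    def __init__(self, make_conn, max_size):
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(make_conn())  # open all connections up front

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while the pool is exhausted
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


# Usage with a stand-in connection factory (hypothetical).
opened = []

def make_conn():
    opened.append(object())
    return opened[-1]

pool = ConnectionPool(make_conn, max_size=4)
for _ in range(100):                 # 100 jobs share the same 4 connections
    with pool.connection() as conn:
        pass                         # run queries with conn here
print(len(opened))                   # total connections ever opened
```

The connection count is bounded by `max_size` no matter how many jobs run, which is what keeps the database under its connection limit.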

Spot evictions: Lost 3 hours of compute when GCP reclaimed nodes. Fix: checkpoint every 15min. Evictions now cost ~2min instead of full restart.
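
The checkpoint-and-resume pattern could look roughly like the sketch below. It writes to a local JSON file as a stand-in for object storage; `run_with_checkpoints`, the state layout, and the summing "work" are all hypothetical.

```python
import json
import os
import tempfile
import time


def _save(state, path):
    # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_with_checkpoints(items, path, interval_s=15 * 60, clock=time.monotonic):
    """Process items in order, persisting progress so a restart can resume."""
    state = {"next": 0, "total": 0}
    if os.path.exists(path):                  # resume after an eviction
        with open(path) as f:
            state = json.load(f)
    last_save = clock()
    for i in range(state["next"], len(items)):
        state["total"] += items[i]            # stand-in for the real work
        state["next"] = i + 1
        if clock() - last_save >= interval_s:
            _save(state, path)
            last_save = clock()
    _save(state, path)                        # final checkpoint on completion
    return state["total"]
```

A restarted job calls the same function with the same `path` and picks up at `state["next"]` rather than at item zero.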

Observability: Unindexed query caused latency spike, took 2 hours to find. Added Prometheus metrics (latency, queue depth, job duration, errors). Alerts fire before users notice.

Tech Stack

Python (Flask, FastAPI), React, MongoDB, Redis, Docker, Kubernetes (GKE/EKS), GCP, AWS, Auth0, GitHub Actions, Prometheus, S3/GCS

Platform Company

Compute Orchestration for Scientific Workloads

Simmunome platform component | Infra-heavy system design

Problem

Batch system for CPU-bound graph algorithms on PPI networks: PageRank-style scoring (100K+ nodes), Monte Carlo pathway enrichment, network diffusion for target ranking. Jobs take 10min–2hr, arrive in bursts (5–20 at once, then nothing for an hour). Budget: ~$200/mo compute.

Design

Queue: MongoDB-backed (already in stack for job metadata). Status field: `pending` → `running` → `complete`/`failed`. Workers poll and claim jobs under a TTL lock, so a job held by a crashed worker becomes claimable again once the lock expires. <50ms latency.
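
To illustrate the claim semantics, here is an in-memory sketch. The `JobQueue` class and its field names are hypothetical; the mutex stands in for the database's atomic update (with MongoDB this would typically be a single `find_one_and_update` call).

```python
import threading
import time


class JobQueue:
    """In-memory stand-in for the jobs collection (hypothetical schema)."""

    def __init__(self, lock_ttl_s=60.0):
        self._jobs = {}                  # job_id -> status, lock expiry, owner
        self._mutex = threading.Lock()   # plays the role of an atomic DB update
        self._ttl = lock_ttl_s

    def submit(self, job_id):
        with self._mutex:
            self._jobs[job_id] = {"status": "pending", "lock_expires": 0.0, "worker": None}

    def claim(self, worker, now=None):
        """Claim one pending job, or a running job whose lock has expired."""
        now = time.time() if now is None else now
        with self._mutex:
            for job_id, job in self._jobs.items():
                stale = job["status"] == "running" and job["lock_expires"] <= now
                if job["status"] == "pending" or stale:
                    job.update(status="running", worker=worker,
                               lock_expires=now + self._ttl)
                    return job_id
        return None

    def finish(self, job_id, worker, ok=True):
        with self._mutex:
            job = self._jobs[job_id]
            if job["worker"] != worker:
                return False             # lock was lost to another worker
            job["status"] = "complete" if ok else "failed"
            return True
```

The ownership check in `finish` keeps a slow worker from overwriting the status of a job that has since been reclaimed.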

Autoscaling: K8s HPA scales workers 2–10 based on queue depth (custom metric from MongoDB). Min=2 (low cold-start latency), Max=10 (cost control).
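
For a per-worker queue-depth target, the HPA's replica calculation reduces to roughly the following. The `jobs_per_worker` target of 2 is an assumed value for illustration; only the 2–10 bounds come from the design above.

```python
import math


def desired_workers(queue_depth, jobs_per_worker=2, min_workers=2, max_workers=10):
    """Replica count for a queue-depth target, clamped to the configured range."""
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


for depth in (0, 5, 20, 40):
    print(depth, desired_workers(depth))
```

An empty queue still keeps two workers warm, and a burst of 20 or more jobs saturates at ten.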

Resilience: Retries with increasing backoff (1min, 5min, 15min). Jobs >30min checkpoint to S3 every 15min. After a spot eviction, jobs resume from the last checkpoint (~40% less compute wasted on failures).
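
The retry schedule can be sketched as a small wrapper. `run_with_retries` is a hypothetical helper; the delays are the ones listed above, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time

RETRY_DELAYS_S = (60, 300, 900)   # 1min, 5min, 15min


def run_with_retries(job, delays=RETRY_DELAYS_S, sleep=time.sleep):
    """Run job(); on failure wait for the next delay and try again."""
    for attempt in range(len(delays) + 1):
        try:
            return job()
        except Exception:
            if attempt == len(delays):
                raise                 # retries exhausted: surface the error
            sleep(delays[attempt])
```

A job that keeps failing is attempted four times in total (one initial run plus three retries) before the error propagates.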

Isolation: Each job in own subprocess with memory/CPU limits. One OOM doesn't crash worker or affect other jobs.
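
One way to get per-job limits on Linux is to set rlimits in the child process before it starts. This is a sketch under that assumption, not necessarily the mechanism used here; `run_isolated` and its default limits are hypothetical.

```python
import resource
import subprocess
import sys


def run_isolated(code, mem_bytes=1024**3, cpu_seconds=60):
    """Run Python source in a child process under memory and CPU limits (Linux)."""

    def apply_limits():
        # Runs in the child just before exec; only the soft limits are lowered.
        for kind, wanted in ((resource.RLIMIT_AS, mem_bytes),
                             (resource.RLIMIT_CPU, cpu_seconds)):
            _, hard = resource.getrlimit(kind)
            soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
            resource.setrlimit(kind, (soft, hard))

    proc = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    print(run_isolated("print('ok')"))
    # An over-sized allocation fails inside the child; the parent keeps running.
    print(run_isolated("x = bytearray(4 * 1024**3)")[0])
```

Because the limits apply only to the child, a job that exceeds them exits with a non-zero code while the worker process survives to pick up the next job.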

Issues Fixed

Polling storms: 10 workers polling every 500ms (2 QPS each, 20 QPS total) caused MongoDB CPU spikes. Fix: long-polling (5s) + Redis "job available" flag. Polling load -90%.
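
The 90% figure follows directly from the poll intervals, as this short calculation shows (`poll_qps` is a hypothetical helper):

```python
def poll_qps(workers, interval_s):
    """Aggregate polling rate when each worker polls once per interval."""
    return workers / interval_s


before = poll_qps(10, 0.5)         # 500ms polls: 20 queries per second
after = poll_qps(10, 5.0)          # 5s long-polls: 2 queries per second
reduction = 1 - after / before     # fraction of polling load removed
print(before, after, round(reduction, 2))
```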

Memory leaks: NetworkX/NumPy leak on large graphs. Workers OOM after ~50 jobs. Fix: auto-restart after 20 jobs.
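
The recycling fix amounts to a bounded worker loop. In this sketch `fetch_job` and `run_job` are hypothetical callables, and it is assumed that a supervisor (such as the Kubernetes restart policy) starts a fresh process once the loop returns.

```python
def worker_loop(fetch_job, run_job, max_jobs=20):
    """Process up to max_jobs, then return so a fresh process can take over."""
    done = 0
    while done < max_jobs:
        job = fetch_job()
        if job is None:
            break            # queue drained
        run_job(job)
        done += 1
    return done              # caller exits; leaked memory is released with the process
```

Restarting at 20 jobs leaves a wide margin below the ~50 jobs at which workers were running out of memory.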

Duplicate results: Worker crashes after writing results but before updating status → next worker re-runs. Fix: check result existence (by job_id) before starting.
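
A sketch of the existence check, with plain dicts standing in for the results and status collections (`run_once` and both stores are hypothetical):

```python
def run_once(job_id, compute, results, statuses):
    """Skip recomputation when a result for job_id already exists."""
    if job_id in results:                # an earlier worker crashed after writing
        statuses[job_id] = "complete"    # repair the stale status
        return results[job_id]
    results[job_id] = compute()
    statuses[job_id] = "complete"
    return results[job_id]
```

The second worker to pick up the job returns the stored result and only fixes the status, so the expensive computation runs once.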

Tradeoffs

MongoDB vs SQS: Simpler (one fewer service) but gives up SQS's built-in visibility-timeout and redelivery guarantees. Good for <1K jobs/day, wouldn't scale to 100K+ without re-arch.

Checkpointing cost: +2% runtime overhead, -40% wasted compute on failures. Net win.

Custom queue vs Airflow/Celery: Less operational complexity for single-step jobs. Would switch if DAGs needed.

Core Banking Services — BNP Paribas

Software Architect Assistant

Problem

Core banking at BNP Paribas: account operations, transactions, funds transfers, loan routing. Millions of retail customers. Strict regulatory compliance (financial audits, PCI-DSS, EU data residency). Performance target: P95 < 200ms for reads, < 1s for transactions. Zero downtime deploys.

Design

Microservices: Separate Spring Boot apps for accounts, transactions, loans. Each owns its DB schema (no shared DB) for independent deploys and failure isolation.

Security: JWT auth + RBAC (customer/teller/admin). Custom @RequiresRole annotations. Failed auth → security audit trail.

Transactions: @Transactional with isolation levels (READ_COMMITTED for queries, SERIALIZABLE for transfers). Multi-step workflows (debit → credit → notify) wrapped in single transaction.

Validation: 3 layers: (1) OpenAPI schema, (2) Spring Validator business rules, (3) DB constraints. Blocks injection attacks and malformed requests.

Monitoring: Spring Actuator + Prometheus. Alerts on error rate > 1% or P95 > 500ms.

Services Built

Funds transfer: REST API with 3-layer validation (balance check, account active, user authorized). DB transaction prevents race conditions. Idempotency keys prevent duplicates on retry.
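
The interplay of a DB transaction and an idempotency key can be illustrated in a few lines. This is a Python/SQLite sketch of the pattern only, not the Spring Boot service; the table layout and the `transfer` function are hypothetical.

```python
import sqlite3


def init(db):
    db.executescript("""
        CREATE TABLE accounts (
            id TEXT PRIMARY KEY,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        CREATE TABLE transfers (
            idem_key TEXT PRIMARY KEY, src TEXT, dst TEXT, amount INTEGER
        );
    """)


def transfer(db, idem_key, src, dst, amount):
    """Move amount from src to dst at most once per idempotency key."""
    try:
        with db:  # one transaction: commit on success, roll back on any exception
            seen = db.execute(
                "SELECT 1 FROM transfers WHERE idem_key = ?", (idem_key,)
            ).fetchone()
            if seen:
                return "duplicate"       # a retry of an already-applied request
            db.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)",
                       (idem_key, src, dst, amount))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                       (amount, src))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                       (amount, dst))
        return "applied"
    except sqlite3.IntegrityError:
        return "rejected"                # balance check failed; nothing was written
```

Recording the key inside the same transaction as the debit and credit means a rejected transfer leaves no trace, while a client retry with the same key is a no-op.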

Loan routing: GraphQL API. Routes by amount (small → auto-approve, large → underwriter) + credit score (external service via RestTemplate + circuit breaker).
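
The routing rule and circuit breaker could be sketched as below. This is a Python illustration of the pattern rather than the Java service; the thresholds (`auto_limit`, `min_score`), the breaker settings, and the choice to send loans to an underwriter when no score is available are all assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for reset_s seconds after max_failures errors."""

    def __init__(self, max_failures=3, reset_s=30.0, clock=time.monotonic):
        self.max_failures, self.reset_s, self.clock = max_failures, reset_s, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_s:
                return fallback          # open: skip the call entirely
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0                # success closes the breaker
        return result


def route_loan(amount, fetch_score, breaker, auto_limit=10_000, min_score=700):
    """Auto-approve only small loans with a good score; the rest go to a human."""
    score = breaker.call(fetch_score, fallback=None)
    if score is None or amount > auto_limit or score < min_score:
        return "underwriter"
    return "auto-approve"
```

Falling back to the underwriter when the score service is down keeps the failure mode conservative: an outage slows approvals but never auto-approves a loan without a score.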

Audit log: Immutable append-only table. Every transaction logged: user ID, timestamp, operation, before/after state.
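
One way to enforce append-only behaviour at the database level is with triggers that reject updates and deletes. The sketch below uses SQLite as a stand-in; the column and trigger names are hypothetical, and the production table may have been protected differently (for example through permissions).

```python
import sqlite3


def create_audit_log(db):
    db.executescript("""
        CREATE TABLE audit_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id TEXT NOT NULL,
            ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            operation TEXT NOT NULL,
            before_state TEXT,
            after_state TEXT
        );
        CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
        CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    """)


def log_event(db, user_id, operation, before_state, after_state):
    """Append one audit row; existing rows can no longer be changed."""
    with db:
        db.execute(
            "INSERT INTO audit_log (user_id, operation, before_state, after_state) "
            "VALUES (?, ?, ?, ?)",
            (user_id, operation, before_state, after_state),
        )
```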

Lessons

DB performance: One missing index on "account_id" pushed P95 from 50ms to 800ms. EXPLAIN ANALYZE, indexes on FKs, and batched queries are essential.
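
The effect of a missing index shows up directly in the query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a lightweight stand-in for PostgreSQL's `EXPLAIN ANALYZE`; the `transactions` table and index name are hypothetical.

```python
import sqlite3


def query_plan(db, sql, params=()):
    """Return the planner's description of how it will execute sql."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column holds the detail text


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER)"
)
sql = "SELECT amount FROM transactions WHERE account_id = ?"

plan_before = query_plan(db, sql, (42,))         # no index: full table scan
db.execute("CREATE INDEX idx_tx_account ON transactions (account_id)")
plan_after = query_plan(db, sql, (42,))          # index lookup on account_id
print(plan_before)
print(plan_after)
```

Checking the plan for a scan on a large table before shipping a query is a cheap way to catch this class of regression.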

Defensive programming: Assume malicious input, failing services, edge cases. Multi-layer validation, explicit exception handling, tests for failure modes.

Code review rigor: 2+ reviewers check security, correctness, maintainability. Prevents bugs reaching production.

Tech Stack

Java 11, Spring Boot, Spring Data JPA, Spring Security, Hibernate, GraphQL Java, Maven, SQL (PostgreSQL), Docker, JUnit/Mockito, Git