Posts

What Big-Bank Engineering Taught Me About System Design

I joined the large financial institution expecting to find bureaucracy that slowed down engineering. I did find that. I also found something I didn’t expect: certain constraints imposed by regulation, scale, and risk aversion produced genuinely better engineering decisions than I’d been making at the smaller trading firm. This is about the non-obvious lessons. ...

Event Sourcing in Financial Systems: Real Benefits, Real Costs

Financial systems are natural candidates for event sourcing. Regulators want to know the state of positions at any point in time. Audit trails are not optional. The need to replay a day’s events to debug a pricing anomaly comes up regularly. These requirements — which other domains treat as optional — map directly onto event sourcing’s core properties. That said, event sourcing in production has costs that the enthusiast literature systematically underplays. Here’s an honest accounting. ...
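
To make the point-in-time requirement concrete, here is a minimal sketch of rebuilding positions by replaying an event log up to a chosen instant. It is an illustration only, not the system described in the post: `TradeEvent`, `positions_as_of`, the instrument codes, and the sample log are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TradeEvent:
    """Hypothetical immutable event; a real log would carry far more fields."""
    at: datetime      # when the event was recorded
    instrument: str   # made-up instrument identifier
    quantity: int     # signed: positive for buys, negative for sells


def positions_as_of(events, as_of):
    """Rebuild positions by folding every event recorded at or before as_of."""
    positions = {}
    for event in sorted(events, key=lambda e: e.at):
        if event.at > as_of:
            break  # later events are not part of the requested state
        positions[event.instrument] = (
            positions.get(event.instrument, 0) + event.quantity
        )
    return positions


# Sample log, invented for illustration.
LOG = [
    TradeEvent(datetime(2018, 1, 3, 9, 0), "XYZ", 100),
    TradeEvent(datetime(2018, 1, 3, 11, 30), "XYZ", -40),
    TradeEvent(datetime(2018, 1, 3, 15, 45), "ABC", 25),
]

midday = positions_as_of(LOG, datetime(2018, 1, 3, 12, 0))
end_of_day = positions_as_of(LOG, datetime(2018, 1, 3, 16, 0))
```

Because state is derived from the log rather than stored, the same function answers both "where were we at noon?" and "where are we now?". The price is a full replay on every query, which is one of the costs a production design has to manage, typically with periodic snapshots.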

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers

The system that prompted this post was a trade enrichment pipeline. The input was a Kafka topic receiving ~50,000 trade events per minute during market hours. The enrichment step required a database lookup (pulling counterparty and instrument data) that averaged 2ms per trade. 50,000 trades/minute = ~833 trades/second. At 2ms per lookup, a single thread can handle 500 lookups/second. To keep up, we needed at least two threads and ideally a small pool. We had six threads, good for about 3,000 lookups/second, and a queue in front of them. During a market event that pushed the rate to 200,000 trades/minute (~3,333 trades/second, more than the pool could drain), the queue grew without bound, memory climbed, and the service eventually OOM’d. Classic backpressure failure. ...
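
The arithmetic above can be checked in a few lines, along with the simplest form of the remedy: a bounded buffer that refuses work instead of growing. This is a sketch under assumptions, not the pipeline's actual code. The rates, lookup time, and thread count come from the post; the `offer` helper and the buffer size of 3 are made up for illustration.

```python
import queue

LOOKUP_MS = 2   # average enrichment lookup time, from the post
THREADS = 6     # worker threads, from the post


def per_second(trades_per_minute):
    """Convert a per-minute arrival rate to a per-second rate."""
    return trades_per_minute / 60


def capacity_per_second(threads, lookup_ms):
    """Upper bound on lookups/second if every thread is always busy."""
    return threads * 1000 / lookup_ms


def offer(buffer, trade):
    """Try to enqueue without blocking; False tells the caller to slow down."""
    try:
        buffer.put_nowait(trade)
        return True
    except queue.Full:
        return False


normal = per_second(50_000)     # about 833 trades/second
burst = per_second(200_000)     # about 3,333 trades/second
capacity = capacity_per_second(THREADS, LOOKUP_MS)  # 3,000 lookups/second

# Hypothetical, deliberately tiny bound so the rejection is easy to see.
buffer = queue.Queue(maxsize=3)
accepted = [offer(buffer, n) for n in range(5)]
```

With nothing draining the buffer, the first three offers succeed and the last two are refused. The refusal is the backpressure signal: the producer side has to react to it (for example, by not fetching more input until there is room) rather than letting memory absorb the difference between 3,333 arrivals and 3,000 completions per second.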

Project Loom Preview: Virtual Threads and What They Mean for Server Code

Java’s threading model has a fundamental scalability problem: OS threads are expensive. Creating thousands of them consumes gigabytes of stack memory and causes significant scheduling overhead. This is why reactive programming (Netty, Project Reactor, RxJava) became popular — it avoids the thread-per-request model by using event loops and async callbacks. Project Loom, announced in 2017 with early previews arriving in 2018, proposed a different solution: make threads cheap. Virtual threads — JVM-managed threads that are not 1:1 with OS threads — could make the thread-per-request model scalable again. ...

Two Years of Clojure in Production: Honest Retrospective

Two years. Long enough that the novelty is gone and what’s left is the actual experience of living with the decision. Here’s the retrospective I’d want to have read before starting. ...

Distributed Transactions Are a Lie (And What to Do Instead)

Every discussion of distributed systems eventually reaches the question: “can we just wrap this in a transaction?” The answer is technically yes and practically no. Understanding why — and what to do instead — is one of the more important shifts in distributed systems thinking. ...

From Java 8 to Java 11 in a Regulated Environment: What Actually Broke

Java 11 was the first long-term support release after Java 8. Oracle’s announcement that commercial Java 8 support would end pushed the bank’s architecture committee to approve a migration. In theory: update the JDK, update the build files, done. In practice: six months of discovery. This is a frank account of what broke. ...

Building MiFID II Trade Reporting Infrastructure: An Engineer’s View

MiFID II went live on January 3, 2018. The preparation started in 2016. Two years for a set of regulatory requirements that, from the outside, looked straightforward: report each trade to a trade repository within 15 minutes of execution. From the inside, “report each trade” requires answering: which trades? From which systems? In what format? To which trade repository? What constitutes a trade for the purposes of reporting vs. booking vs. settlement? What do you do when the reporting service is unavailable? What happens when the trade repository rejects a report? This is the engineering story of building a system to answer those questions. ...

Stream Processing with Kafka Streams vs Flink: A Real Comparison

By mid-2017, the institution had two competing proposals on the table for the next generation of real-time analytics infrastructure: one team advocating Kafka Streams, another advocating Apache Flink. Both solve the same problem. Both use Kafka as input and output. Both provide stateful stream processing with windowing and exactly-once semantics. The evaluation took eight weeks. Here’s what we found. ...

Persistent Data Structures Are Not Just for Functional Purists

When I joined the bank’s risk team, Clojure was already in production for risk calculation. The code I inherited used Clojure’s persistent maps and vectors everywhere — not as a philosophical statement but because the team had found them practically useful in a specific way. The specific way: concurrent reads and occasional writes to a shared state snapshot, with no locks. ...
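
The access pattern (many readers, an occasional writer, no locks) can be sketched outside Clojure as a copy-on-write snapshot swap. This is a simplified stand-in, not the risk team's code: it copies the whole dictionary on each write, whereas Clojure's persistent maps share structure between versions, and it assumes a single writer. `SnapshotHolder` and the sample keys are hypothetical.

```python
from types import MappingProxyType


class SnapshotHolder:
    """Holds a read-only snapshot that is replaced, never mutated in place."""

    def __init__(self, initial):
        self._current = MappingProxyType(dict(initial))

    def read(self):
        # Readers take whatever snapshot is current; it never changes
        # underneath them, so no lock is needed.
        return self._current

    def publish(self, **changes):
        # Assumes one writer. Build a new version, then swap the reference.
        updated = dict(self._current)
        updated.update(changes)
        self._current = MappingProxyType(updated)


holder = SnapshotHolder({"XYZ": 60})
before = holder.read()     # a reader takes a snapshot
holder.publish(ABC=25)     # the writer publishes a new version
after = holder.read()
```

A reader that took `before` keeps seeing exactly the state it started with, even after the writer publishes, which is the property that makes lock-free reads safe. With several writers, the swap would need a compare-and-set step, which Clojure's `atom` provides.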