Java Chronicle: Off-Heap Persistence Without Serialisation Overhead

Every trade needs to be journaled. You need a durable, ordered record of every order, fill, and state change — for risk, reconciliation, and regulatory purposes. The naive solution is a database write on the hot path. That’s a roundtrip to an external process, a network call, and often a disk fsync. It’s also hundreds of milliseconds of latency per event. Chronicle Queue gave us persistent journaling at sub-microsecond overhead. Here’s how. ...

September 5, 2013 · 4 min · MW

JVM JIT Compilation: What the C2 Compiler Does to Your Loops

Java’s “write once, run anywhere” promise is kept by the JVM. Its performance is kept by the JIT compiler. The gap between “Java is slow” (the 1998 opinion) and “Java is competitive with C++ for many workloads” (the 2013 reality, and more so now) is almost entirely the C2 compiler. Understanding what C2 does — and when it stops doing it — matters if you’re writing performance-sensitive Java. ...

July 30, 2013 · 6 min · MW

Market Connectivity: Building a Low-Latency Feed Handler

The feed handler is where the external world becomes internal data. A FIX or binary protocol stream arrives over the network, gets parsed into typed events, and gets handed to the internal processing pipeline. Nothing downstream can be faster than the feed handler’s latency. This is the design I evolved over three iterations at the trading firm. ...

June 11, 2013 · 6 min · MW

Comparing ArrayBlockingQueue to the Disruptor: Numbers Don't Lie

After writing about the Disruptor’s design, the obvious question is: how much faster is it, really? “Faster” is not a useful answer. Let’s look at actual numbers under controlled conditions. This is a benchmarking exercise, not a recommendation. The right data structure depends on your use case. The goal here is to understand the performance characteristics of each under different contention patterns. ...

May 22, 2013 · 4 min · MW

Disruptor Deep Dive: Memory Layout, Cache Lines, and False Sharing

The Disruptor’s performance isn’t magic. It’s the consequence of a set of deliberate memory layout decisions, each targeting a specific cache coherency problem. This post goes through those decisions one by one. ...

April 9, 2013 · 5 min · MW

The LMAX Disruptor: How a Ring Buffer Changed My Mental Model of Queues

In mid-2013 we replaced our internal LinkedBlockingQueue-based event bus with the LMAX Disruptor. Median latency dropped by 30%. The 99th percentile dropped by more than half. The change touched about 400 lines of code. This post is about the conceptual model you need to understand why the Disruptor is fast — not just “it uses a ring buffer,” but what that actually means for your hardware. ...

February 28, 2013 · 4 min · MW

Introduction to Lock-Free Programming in Java

Locks work. synchronized in Java is correct, well-understood, and wrong for our use case. A contended lock causes a thread to block: the OS parks it, context-switches to something else, and eventually context-switches back. Each of those transitions costs microseconds. When your SLA is sub-millisecond and your hot path is called 200,000 times per second, locks are not an option. Lock-free programming replaces locks with atomic CPU instructions. The CPU handles the synchronisation at the hardware level, without OS involvement. ...

January 17, 2013 · 5 min · MW

Mechanical Sympathy: Writing Java That Respects the Hardware

Martin Thompson coined the term “mechanical sympathy” — the idea that to write fast software you need to understand the machine it runs on. Not at the assembly level necessarily, but well enough to reason about what the CPU, memory hierarchy, and OS are actually doing with your code. This post is what that looks like in practice, writing Java for a system where microseconds matter. ...

December 4, 2012 · 4 min · MW

Stop-the-World GC Pauses Killed Our SLA — And What We Did About It

The incident happened at 08:31 on a Tuesday, at the Frankfurt open during a high-volatility session. Our tick-to-quote latency spiked to 340ms for about 2 seconds. The SLA was 1ms at p99. The trading desk noticed before our monitoring did. The culprit: a full GC triggered by a promotion failure. We had a 12GB heap, the CMS collector, and no one had looked at GC logs since the initial deployment. ...

November 13, 2012 · 3 min · MW

Latency vs Throughput: The False Dichotomy I Learned the Hard Way

In my first performance review at the trading firm, I described a component I’d optimised as “high throughput.” My manager asked what the p99 latency was. I didn’t know. He asked what happened to latency during peak throughput. I didn’t know that either. The conversation went downhill from there. That exchange forced me to be precise about what I was actually optimising for — and why throughput and latency, while related, are fundamentally different properties. ...

September 25, 2012 · 5 min · MW