Bits, Trades & Systems

Reading GC Logs Like a Detective

GC logs are always-on, low-overhead diagnostic data that the JVM will produce for you. They tell you the timing, cause, duration, and effect of every collection — if you know how to read them. Most Java engineers can tell you what GC does. Far fewer can look at a GC log and immediately see why the p99 latency spiked at 14:37 last Tuesday. ...

Column Stores for Analytics: Why Row-Based Is Wrong for This Problem

The analytics team’s query: “Give me total notional, average spread, and fill rate for every instrument over the last 90 days, broken down by hour.” On our Postgres trade history table with ~2 billion rows: 4 hours, 23 minutes. After the columnar rewrite: 8 seconds. This post is about why, not how to install Parquet. ...

Spec-Driven Development in Clojure: Validating Financial Data at the Edge

Before clojure.spec, our FIX message parser had a test suite with 40 hand-written test cases. We’d been running it in production for 18 months without incident. After we added spec and ran the property-based tests overnight, it found 7 edge cases we hadn’t written tests for — including one where a negative zero value (-0.0) in a price field caused the downstream risk calculation to produce NaN, which propagated silently through the pipeline and ended up in the regulatory report as a blank field. That was the end of hand-written validation tests for external data. ...

Kafka in Finance: What 'Exactly Once' Actually Costs You

Kafka 0.11 landed with exactly-once semantics and a lot of marketing. We were running trade event pipelines in a regulated environment and the promise was appealing: no duplicate trades in the downstream risk system, no idempotency logic sprinkled through consuming services. After three months with it in production, the honest summary is: EOS (exactly-once semantics) works as advertised within its scope, and that scope is narrower than it sounds. ...

Clojure Data Pipelines: Transducers in Production Risk Processing

The risk calculation pipeline processed end-of-day positions: take all the day’s trades, aggregate them to net positions, apply mark-to-market prices, and compute risk metrics. The input was ~800,000 trade records; the output was ~12,000 position records with P&L and Greeks. The initial implementation used standard Clojure sequence operations: 1 2 3 4 5 6 (->> trades (filter open-trade?) (map enrich-with-market-data) (group-by :currency-pair) (map-vals aggregate-position) (map-vals compute-risk-metrics)) Clean. Readable. And it created four intermediate collections of 800,000 elements each before producing the final output. That’s 3.2M intermediate objects for a 12K result. Transducers changed this. ...

Threading Models in Java: Which One Does Your System Actually Need?

The move from a small trading firm to a large financial institution meant working with codebases an order of magnitude larger, maintained by dozens of engineers across multiple teams. It also meant encountering the full spectrum of Java threading models in production — some appropriate, some inherited from a different era, and some that were actively causing problems. This is a survey of what those models look like, what they’re good at, and how you tell which one a system needs. ...

KDB+/Q for Java Developers: Reading the Matrix

KDB+ is used in risk analytics, trade surveillance, and market data storage across most tier-1 financial institutions. If you work in finance long enough, you will encounter it. Nothing in your Java background prepares you for it. ...

Heap Dumps and Flight Recorder: Diagnosing JVM Memory Problems in Production

At the large financial institution where I worked from 2016, the JVM services were larger and longer-running than anything I’d dealt with in the previous role. Old generation sizes in the hundreds of gigabytes. Services running for months between restarts. Memory problems that took days or weeks to manifest. The debugging approach that worked in trading — small heaps, frequent restarts, aggressive allocation control — didn’t apply here. You had to diagnose production JVM state without stopping it. ...

Time-Series Data at a Bank: Why Relational Databases Break and What Comes Next

When I moved to the large financial institution, the team I joined managed the market data and trade data storage layer. The engineering problem was deceptively simple to state: store every price tick, every trade execution, and every risk calculation — billions of records per day — and answer analytical queries over them quickly. The existing system was PostgreSQL. It worked, technically. Queries that needed to run in seconds for trading decisions took minutes. Operational costs for storage were climbing. The database team was spending more time running VACUUM than building features. Understanding why required understanding what time-series data actually is and why it’s different. ...

Why the Risk Team Chose Clojure (And Why It Made Sense)

The first time someone told me the risk calculation system was written in Clojure, I assumed it was a prototype or a skunkworks project. It wasn’t. It processed end-of-day risk for a significant portion of the firm’s trading book and had been in production for two years. Here’s why that decision made sense, and what it was actually like to work in. ...