Data-at-Scale on Bits, Trades & Systems

https://blog.turboawesome.win/series/data-at-scale/
Recent content in Data-at-Scale on Bits, Trades & Systems
Wed, 16 Apr 2025 13:12:00 +0000

Distributed Consistency Models: What Your Service Actually Guarantees
https://blog.turboawesome.win/2025/04/distributed-consistency-models-what-your-service-actually-guarantees/
Wed, 16 Apr 2025 13:12:00 +0000
Linearisability, serialisability, eventual consistency, causal consistency: these terms are used loosely and understood imprecisely. Knowing what your data store actually guarantees determines whether your distributed system is correct.

Tail-Based Trace Sampling: Why Head Sampling Is Usually Wrong
https://blog.turboawesome.win/2024/10/tail-based-trace-sampling-why-head-sampling-is-usually-wrong/
Wed, 09 Oct 2024 13:00:00 +0000
Head-based sampling decides whether to trace a request at the start. Tail-based sampling decides after the request completes. For finding latency outliers and errors, tail-based sampling is almost always what you want, and almost never what gets implemented.

Observability at Scale: What 'Good' Looks Like When You Have Too Much Data
https://blog.turboawesome.win/2024/05/observability-at-scale-what-good-looks-like-when-you-have-too-much-data/
Wed, 29 May 2024 09:47:00 +0000
Observability problems at large scale are different from small-scale ones. Too little signal is replaced by too much signal, and the engineering challenge inverts.

Cache Design as a Reliability Practice, Not an Optimisation
https://blog.turboawesome.win/2024/03/cache-design-as-a-reliability-practice-not-an-optimisation/
Wed, 27 Mar 2024 11:47:00 +0000
Most engineers add caches to make things faster. At scale, the more important reason to design caches carefully is reliability: a cache failure should not cascade into a system failure. The patterns that prevent that are different from the patterns that optimise for speed.

ClickHouse for Application Analytics: Fast Aggregations Without Spark
https://blog.turboawesome.win/2023/05/clickhouse-for-application-analytics-fast-aggregations-without-spark/
Wed, 17 May 2023 14:08:00 +0000
When you need sub-second aggregations over billions of rows and don't want to run a Spark cluster, ClickHouse is often the answer. Notes from a year in production.

Kafka at Startup Scale
https://blog.turboawesome.win/2022/05/kafka-at-startup-scale/
Wed, 18 May 2022 14:00:00 +0000
Running Kafka at a startup is different from running it at enterprise scale. The operational complexity is real, the defaults are wrong for small clusters, and the failure modes are different from what the documentation implies.

Choosing a Time-Series Database in 2020
https://blog.turboawesome.win/2020/11/choosing-a-time-series-database-in-2020/
Wed, 18 Nov 2020 11:00:00 +0000
In 2020 the time-series database landscape had fragmented into specialised options with very different tradeoffs. InfluxDB, TimescaleDB, ClickHouse, Prometheus: each suited to different access patterns, retention requirements, and operational models.

Schema Evolution in Avro: The Hard Lessons from Production
https://blog.turboawesome.win/2018/10/schema-evolution-in-avro-the-hard-lessons-from-production/
Thu, 04 Oct 2018 11:29:00 +0000
Avro's schema evolution rules sound simple. In production with multiple services and a regulated data retention requirement, the edges are sharp. Here are the cases that burned us and the practices that prevented future ones.

Event Sourcing in Financial Systems: Real Benefits, Real Costs
https://blog.turboawesome.win/2018/07/event-sourcing-in-financial-systems-real-benefits-real-costs/
Wed, 11 Jul 2018 11:03:00 +0000
Event sourcing is a natural fit for financial systems that require audit trails and point-in-time reconstruction. The costs are real too: projections, eventual consistency, and the event schema evolution problem.

Backpressure in Practice: Keeping Fast Producers from Killing Slow Consumers
https://blog.turboawesome.win/2018/06/backpressure-in-practice-keeping-fast-producers-from-killing-slow-consumers/
Thu, 14 Jun 2018 10:33:00 +0000
Every system has components that produce faster than consumers can handle under some conditions. Backpressure is the mechanism by which fast producers are slowed rather than dropping data or consuming unbounded memory. Here's what the options look like in practice.

Distributed Transactions Are a Lie (And What to Do Instead)
https://blog.turboawesome.win/2018/01/distributed-transactions-are-a-lie-and-what-to-do-instead/
Wed, 17 Jan 2018 10:55:00 +0000
Two-phase commit promises ACID semantics across distributed systems. In practice it's slow, fragile, and blocks under failure. The patterns that actually work (sagas, idempotency, and compensating transactions) are more complex but more reliable.

Building MiFID II Trade Reporting Infrastructure: An Engineer's View
https://blog.turboawesome.win/2017/10/building-mifid-ii-trade-reporting-infrastructure-an-engineers-view/
Tue, 03 Oct 2017 11:45:00 +0000
MiFID II required every trade to be reported within 15 minutes of execution. Building the infrastructure to meet that requirement across a large, heterogeneous estate taught us about the gap between regulatory requirements and production reality.

Stream Processing with Kafka Streams vs Flink: A Real Comparison
https://blog.turboawesome.win/2017/09/stream-processing-with-kafka-streams-vs-flink-a-real-comparison/
Wed, 27 Sep 2017 14:02:00 +0000
We evaluated Kafka Streams and Apache Flink for a real-time trade enrichment and aggregation pipeline. The technical comparison produced a clear result; the operational comparison was more nuanced.

Column Stores for Analytics: Why Row-Based Is Wrong for This Problem
https://blog.turboawesome.win/2017/04/column-stores-for-analytics-why-row-based-is-wrong-for-this-problem/
Wed, 05 Apr 2017 14:33:00 +0000
When the data team started complaining about multi-hour Postgres queries on trade history, we rewrote the analytics layer around columnar storage. The why is interesting.

Kafka in Finance: What 'Exactly Once' Actually Costs You
https://blog.turboawesome.win/2017/01/kafka-in-finance-what-exactly-once-actually-costs-you/
Tue, 10 Jan 2017 11:22:00 +0000
Kafka's exactly-once semantics arrived in 0.11 with significant caveats. Using it in a regulated financial context forced a clear-eyed view of what the guarantee actually covers and what it doesn't.

KDB+/Q for Java Developers: Reading the Matrix
https://blog.turboawesome.win/2016/10/kdb-/q-for-java-developers-reading-the-matrix/
Tue, 11 Oct 2016 14:17:00 +0000
KDB+ is the database of choice for time-series analytics in investment banks. It's fast, alien, and worth understanding. A Java developer's field guide.

Time-Series Data at a Bank: Why Relational Databases Break and What Comes Next
https://blog.turboawesome.win/2016/07/time-series-data-at-a-bank-why-relational-databases-break-and-what-comes-next/
Wed, 06 Jul 2016 11:34:00 +0000
Financial institutions generate millions of time-stamped data points every day. The relational database model, designed for transactional workloads, breaks down spectacularly for this use case. Here's why, and what replaces it.