A Software Engineer Plays Quant: Building a Market-Data Research Stack in Python

For most of my career I built the plumbing of finance from the engineering side — feed handlers, order books, low-latency messaging, the systems that carry trading decisions without ever making them. I knew the microstructure cold and the strategy not at all. That division of labour is normal and mostly correct: the quant researches the signal, the engineer builds the system that executes it at scale.

I’m starting a series of side projects that deliberately cross that line. Not to become a quant — I have no illusions about competing with people who do this full-time with better data and better maths. The goal is narrower and more interesting to me: what does market-data analysis look like when a software engineer does it? What does the engineering discipline I’ve spent fourteen years accumulating buy you when you point it at financial time series in Python?

This first post is about the part nobody writes about because it isn’t exciting: the research foundation. Get this wrong and everything built on top is sand.

Why Start With the Boring Layer

Every quant tutorial opens with a strategy. Moving-average crossover, mean reversion, a Sharpe ratio that looks great. Almost all of them are quietly broken, and they’re broken in the foundation, not the strategy — lookahead bias, survivorship bias, a backtest that silently uses information it couldn’t have had at the time.

This is exactly the class of bug an engineer is trained to hate: the kind that doesn’t throw an exception, doesn’t fail a test, and produces a beautiful, completely wrong result. So I’m doing what I’d do in any system where correctness is subtle and silent: build the foundation carefully, make the invariants explicit, and instrument the things that lie.

The stack is deliberately boring and standard: numpy, pandas, pyarrow for storage. No exotic libraries yet. The discipline is the point, not the tooling.

Invariant 1: A Bar Is Closed Before You Can Use It

The single most common bug in amateur backtests is lookahead: using information from a bar to make a decision at that bar, when in reality you’d only know the bar’s values after it closed.

Concretely. You have daily OHLC data. You compute a 20-day moving average and decide to buy when price crosses above it. The naive code:

df["ma20"] = df["close"].rolling(20).mean()
df["signal"] = df["close"] > df["ma20"]
df["return"] = df["close"].pct_change()
df["strategy_return"] = df["signal"] * df["return"]   # BUG

The bug is in the last line. signal on row t is computed from close[t] — which you only know at the end of day t. But return on row t is the return from t-1 to t, which already happened. You’re using the day-t signal to capture the day-t return. You’re trading on information from the future.

The fix is one shift, and it’s the most important shift in the file:

# The signal you can act on today was decided using yesterday's close.
df["strategy_return"] = df["signal"].shift(1) * df["return"]
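To see how much the missing shift matters, here is a minimal sketch on synthetic data. The driftless random walk, the seed, and the 5000-bar length are assumptions chosen for illustration; by construction there is nothing to predict, so any apparent edge comes from the alignment alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic driftless random walk in log price: pure noise, no real signal.
log_steps = rng.normal(loc=0.0, scale=0.01, size=5000)
df = pd.DataFrame({"close": 100.0 * np.exp(np.cumsum(log_steps))})

df["ma20"] = df["close"].rolling(20).mean()
df["signal"] = (df["close"] > df["ma20"]).astype(float)
df["return"] = df["close"].pct_change()

# Same-bar alignment (the bug) versus a one-bar lag (the fix).
lookahead = (df["signal"] * df["return"]).mean()
lagged = (df["signal"].shift(1) * df["return"]).mean()

print(f"same-bar signal: {lookahead:.5f} mean daily return")
print(f"lagged signal:   {lagged:.5f} mean daily return")
```

The same-bar version should report a clearly positive average return on noise, because a close above its own moving average is more likely on a day that went up. The lagged version should sit near zero, which is the honest answer for this data.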

I don’t want this to be a thing I “remember to do.” I want it to be structurally hard to get wrong. So the foundation has a rule: signals and returns live in separate, explicitly-aligned frames, and the alignment is one function with a name.

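import pandas as pd

def realised_return(signal: pd.Series, returns: pd.Series, lag: int = 1) -> pd.Series:
    """Return the strategy return series with the signal lagged so it only
    uses information available before the return period. lag=1 means the
    signal is decided on the prior bar's close."""
    if lag < 1:
        raise ValueError("lag must be >= 1, otherwise the signal sees its own bar")
    # Cast before shifting: a shifted bool series would become object dtype.
    aligned = signal.astype(float).shift(lag)
    return aligned * returns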

A named function with a docstring stating the invariant is worth more than a comment, because it shows up at every call site and it’s the thing you grep for in review. This is just “make illegal states unrepresentable” applied to time-series alignment.

Invariant 2: The Index Is Time, and Time Is Not Negotiable

pandas will happily let you do arithmetic between two series whose indices don’t line up, filling the gaps with NaN, and it will not warn you. For market data — where a missing day, a holiday, or a mismatched timezone is routine — this is a landmine.
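A small sketch of that silent behaviour, using two made-up three-day series whose dates only partly overlap. The values and dates are illustrative assumptions.

```python
import pandas as pd

a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
b = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2024-01-03", "2024-01-04", "2024-01-05"]),
)

# Arithmetic aligns on the union of both indices; unmatched dates become NaN.
total = a + b
print(total)

# Aggregations skip NaN by default, so the gap never surfaces as an error.
print(total.sum())  # 35.0

# Making the join explicit keeps only the dates both series actually have.
a_in, b_in = a.align(b, join="inner")
explicit = a_in + b_in
print(explicit)
```

The result has four rows, two of them NaN, and the sum quietly covers only the two overlapping days. Nothing in that sequence raises or warns.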

The rules I enforce in the loader:

The index is a DatetimeIndex, timezone-aware, always. Naive timestamps are banned. A US equity bar timestamped 16:00 means nothing until you know it’s America/New_York. I localise at the boundary and never let a naive timestamp into the analysis layer.
The index is monotonic and unique. A duplicate timestamp or an out-of-order row is a data-quality bug, and I assert against it at load time rather than discovering it as a weird result three transformations later.

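import pandas as pd

def load_bars(path: str, tz: str = "America/New_York") -> pd.DataFrame:
    df = pd.read_parquet(path)
    idx = pd.DatetimeIndex(df.index)
    # Assumption: naive timestamps are exchange-local wall-clock time, so they
    # are localised; tz-aware ones are converted. to_datetime(..., utc=True)
    # would instead read a naive 16:00 New York close as 16:00 UTC.
    df.index = idx.tz_localize(tz) if idx.tz is None else idx.tz_convert(tz)
    # Fail loud at the boundary, not silently three steps downstream.
    assert df.index.is_monotonic_increasing, "bars not sorted by time"
    assert df.index.is_unique, "duplicate timestamps in bar data"
    assert not df[["open", "high", "low", "close"]].isna().any().any(), \
        "NaNs in OHLC: handle gaps explicitly before analysis"
    return df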

This is just precondition-checking — the same assert-at-the-boundary discipline I’d put on any function that takes external data. Financial data sources are dirty; the assertions are how the dirt announces itself instead of corrupting a result.
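One boundary detail worth spelling out is that localising and converting are different operations, and which one applies depends on how the source stores its timestamps. The sketch below assumes a single naive 16:00 timestamp on a January date; the date itself is arbitrary.

```python
import pandas as pd

naive_close = pd.DatetimeIndex(["2024-01-02 16:00"])  # no timezone attached

# Localising says: this wall-clock time was recorded in New York.
localised = naive_close.tz_localize("America/New_York")

# Treating the same naive value as UTC and then converting moves the wall clock.
converted = pd.to_datetime(naive_close, utc=True).tz_convert("America/New_York")

print(localised[0])  # 2024-01-02 16:00:00-05:00
print(converted[0])  # 2024-01-02 11:00:00-05:00
```

Both results are timezone-aware and both look plausible, but they are five hours apart. If the feed delivers naive exchange-local times, only the first is the actual close.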

Invariant 3: Vectorise, But Know What the Vector Hides

The engineering instinct from other domains is “write the loop, make it correct, optimise later.” In pandas that instinct fails on both counts: the explicit Python loop over rows is too slow to use on real data, and it tends to be more bug-prone, not less, because you hand-manage indices.

The idiom is vectorised operations over whole columns. Computing returns, rolling statistics, and signals as array operations is both faster (numpy does it in C) and clearer:

import numpy as np

# Log returns — additive over time, symmetric, the quant default.
df["log_ret"] = np.log(df["close"]).diff()

# Rolling annualised volatility (252 trading days), vectorised.
df["vol_20d"] = df["log_ret"].rolling(20).std() * np.sqrt(252)

# A z-score of price vs its own recent mean — vectorised, no loop.
roll = df["close"].rolling(60)
df["zscore"] = (df["close"] - roll.mean()) / roll.std()

But here is the engineer’s contribution: vectorisation hides the time-ordering, which is exactly where the bugs are. rolling(20) is a trailing window by default: it covers the current row and the 19 before it, which is correct as long as the result is only acted on after that bar closes. But df["close"].rolling(20, center=True) looks forward too, which in a backtest is lookahead bias wearing a respectable function name. An .expanding() statistic used to decide on the very row it already includes, or a .bfill() (fillna(method="bfill") in older pandas, now deprecated) that pulls future values backward: all of these are vectorised, fast, clean-looking, and wrong.

So the rule on top of “vectorise everything” is: every windowed or fill operation gets reviewed for direction. Does it look only at the past? bfill, center=True, and any forward-looking fill are banned in the analysis layer and require an explicit, commented exception. This is the kind of checklist that belongs in a review, and since I’m reviewing my own side-project code, it belongs in a written rule I can’t argue my way out of at 11pm.
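A written rule can be backed by a mechanical check. One possible approach is a poison test: corrupt everything after a cut point and confirm the output up to that point does not change. The helper name is_causal, the cut position, and the toy series are all hypothetical; this is a sketch of the idea, not a complete detector.

```python
import numpy as np
import pandas as pd

def is_causal(transform, series: pd.Series, cut: int) -> bool:
    """Hypothetical helper: True if the transform's output up to and including
    position `cut` is unchanged when every later value is replaced with garbage."""
    poisoned = series.copy()
    poisoned.iloc[cut + 1:] = 1e9  # corrupt the "future"
    before = transform(series).iloc[: cut + 1]
    after = transform(poisoned).iloc[: cut + 1]
    # Series.equals treats NaNs in the same positions as equal.
    return before.equals(after)

clean = pd.Series(np.arange(100, dtype=float))
gappy = clean.copy()
gappy.iloc[50] = np.nan  # a gap sitting exactly at the cut point
cut = 50

print(is_causal(lambda s: s.rolling(20).mean(), clean, cut))               # True
print(is_causal(lambda s: s.rolling(20, center=True).mean(), clean, cut))  # False
print(is_causal(lambda s: s.ffill(), gappy, cut))                          # True
print(is_causal(lambda s: s.bfill(), gappy, cut))                          # False
```

A single cut point only proves the absence of leakage at that position, so in practice it would make sense to run the check at several cuts, including ones that land on gaps.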

Invariant 4: Storage Is Columnar and Immutable

The data layer reuses something I’m certain about from the day job: columnar storage. Market data is the canonical columnar workload — you scan one column (close) across millions of rows far more often than you read whole rows. Parquet via pyarrow is the obvious choice:

df.to_parquet("data/SPY_1d.parquet", engine="pyarrow", compression="zstd")

Two engineering habits carry straight over. First, raw data is immutable — the downloaded bars are written once and never mutated; every transformation produces a new artifact. This makes the pipeline reproducible and makes “did my cleaning step corrupt something” a diff rather than a mystery. Second, the schema is explicit — dtypes are pinned, not inferred, so a column that arrives as object because one value was a string fails at load rather than silently poisoning a mean().
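The pinned-schema habit can be sketched with pandas alone. BAR_DTYPES and check_schema are hypothetical names, and float64 for every OHLC column is an assumption about the data, not something the source files guarantee.

```python
import pandas as pd

# Hypothetical pinned schema: every OHLC column must arrive as float64.
BAR_DTYPES = {"open": "float64", "high": "float64", "low": "float64", "close": "float64"}

def check_schema(df: pd.DataFrame, expected: dict[str, str] = BAR_DTYPES) -> pd.DataFrame:
    """Raise if a pinned column is missing or has drifted to another dtype."""
    missing = sorted(set(expected) - set(df.columns))
    if missing:
        raise TypeError(f"missing columns: {missing}")
    wrong = {c: str(df[c].dtype) for c, t in expected.items() if str(df[c].dtype) != t}
    if wrong:
        raise TypeError(f"unexpected dtypes: {wrong}")
    return df

good = pd.DataFrame({"open": [1.0, 2.0], "high": [1.5, 2.5], "low": [0.5, 1.5], "close": [1.2, 2.2]})
check_schema(good)  # passes silently

# One stray string turns the whole column into a non-float dtype.
bad = good.assign(close=[1.2, "n/a"])
try:
    check_schema(bad)
except TypeError as exc:
    print(f"rejected at load: {exc}")
```

Checking rather than casting is a deliberate choice here: a cast would either raise a less specific error or, with errors="coerce", turn the bad value into a NaN and hide the problem.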

It’s the same Arrow/Parquet substrate I’ve written about from the data-engineering side — here it’s just running on my laptop against a few gigabytes instead of in a cluster against petabytes. The principles don’t change with the scale.

What This Is Groundwork For

This foundation — lagged-signal alignment, asserted time index, direction-reviewed vectorisation, immutable columnar storage — is deliberately strategy-agnostic. It’s the substrate the rest of the series builds on, and I wanted it written down and defensible before doing anything that produces a tempting-looking equity curve.

Where this is heading, across the coming posts:

Honest backtesting: transaction costs, slippage, and why a backtest without them is a fantasy; building a vectorised backtester that can’t accidentally see the future.
The data problems that break everything: survivorship bias, look-ahead on fundamentals, point-in-time correctness for anything that gets restated after the fact.
Where ML actually helps and where it’s cargo cult: feature engineering on time series without leaking the label, walk-forward validation instead of random k-fold (which is lookahead bias with a respectable name), and being honest about how little signal there is.
The engineering-to-quant translation: what my background genuinely transfers (correctness discipline, reproducibility, data infrastructure) and what it absolutely does not (the maths, the market intuition, the humility about how hard this is).

I’m under no illusion that careful data hygiene generates alpha. It doesn’t. What it does is make sure that when I see a result, I can trust that it’s real before I get excited about it — which, given how many backtests are quietly broken at the foundation, is most of the battle. That’s the bet of this whole series: that the unglamorous engineering virtues are worth more in this domain than they get credit for.

The boring layer is the whole point. A quant researcher and a software engineer looking at the same broken backtest see different things — the researcher reaches for a better signal, the engineer reaches for the invariant that’s being violated. I only have the second instinct, so that’s the one I’m leaning on. Next in the series: building a backtester that charges you for your trades and refuses to look at the future.
