Market Data and Feed Software

You can trade for years and still be stuck waiting on someone else’s data pipeline. The numbers arrive late, formats change without warning, and suddenly your “simple” import script turns into a small job. Building your own market data and feed software fixes that—at least partially—because you control collection, normalization, storage, and delivery to downstream systems.

This article walks through what “making your own market data and feed software” actually means in practical terms: architecture options, the data you’ll need, how to process it, how to deliver it reliably, and what tends to break once you move from a proof-of-concept to something that runs daily. The tone is intentionally practical. If you’ve written a few ETL jobs or built a streaming consumer before, you’ll feel at home; if not, you’ll still get enough structure to start planning without guesswork.

What “market data and feed software” covers

Before you start writing code, it helps to separate responsibilities. People often lump everything into one bucket—data ingestion, cleaning, storage, and distribution—but the pieces behave differently and fail differently.

Data ingestion

This is the part that talks to a source. Depending on your setup, sources can include:

– Exchange-provided streaming feeds (often via FIX, WebSocket, or proprietary gateways)
– Vendor data APIs (REST or streaming)
– Public feeds (less common for production use)
– Your own internal events (for example, trades you execute, fills, or reference prices)

Ingestion needs to handle authentication, reconnection, message ordering (if promised), and backpressure.

Normalization and enrichment

Even if a source claims to be “clean,” it rarely matches your internal schema. Normalization turns incoming messages into a consistent structure: timestamps, symbol mapping, instrument identifiers, numeric types, and fields with consistent naming.

Enrichment adds what your downstream consumers expect but the source may not provide directly. Examples: mapping vendor symbols to your internal instrument IDs, tagging venue codes, or deriving additional metrics.

Storage and replay

You’ll eventually want to replay. Whether you’re debugging a strategy, backtesting, investigating a mispricing, or running audits, replay is where “owning your feed” starts to matter.

Storage usually means both raw message storage (sometimes) and normalized snapshots or event logs. You also need retention policies that won’t bankrupt you.

Distribution to consumers

Your feed software should provide data to strategies, dashboards, risk systems, and perhaps other services. Distribution options range from pushing to in-process components to exposing a lightweight internal API or streaming service.

This layer is often where performance and failure handling show up. Consumers subscribe and assume certain semantics: ordering, completeness, and delivery timing.

A good mental model

Think of it like a small power plant. The ingestion side turns raw “fuel” (messages) into usable electricity (normalized events). Storage is the battery that lets you replay. Distribution is the power grid that reaches consumers without frying them.

Decide what your system must guarantee

If you don’t specify guarantees early, you’ll end up rewriting major parts after you learn the hard way. Different guarantees imply different design choices.

Latency vs. correctness

Low latency is a common goal, but it can conflict with correctness guarantees like de-duplication, strict ordering, or waiting for late-arriving messages.

A typical compromise: keep the pipeline fast, but allow “event-time correction” where possible. For example, you can accept quotes immediately while computing a slightly corrected view later.

Ordering semantics

Ask what ordering matters:

– Ordering per instrument (most common)
– Ordering across instruments (less common)
– Ordering per message type (quotes vs trades vs order book deltas)

If the exchange provides sequence numbers, you can use them. If it doesn’t, you’ll rely on timestamp ordering and sequence stability where you can.
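When sequence numbers are available, per-instrument gap tracking is only a few lines. Here is a minimal sketch (the class and method names are invented for illustration); it reports missing sequence numbers and deliberately leaves duplicates to a separate de-duplication stage:

```python
class SequenceTracker:
    """Tracks the last seen sequence number per instrument and flags gaps."""

    def __init__(self):
        self.last_seq = {}  # instrument_id -> last seen sequence number

    def observe(self, instrument_id, seq):
        """Return the list of missing sequence numbers, empty if contiguous."""
        last = self.last_seq.get(instrument_id)
        if last is not None and seq <= last:
            return []  # duplicate or replay; leave that to de-duplication
        self.last_seq[instrument_id] = seq
        if last is None or seq == last + 1:
            return []
        return list(range(last + 1, seq))

tracker = SequenceTracker()
tracker.observe("AAPL", 1)
tracker.observe("AAPL", 2)
missing = tracker.observe("AAPL", 5)  # messages 3 and 4 never arrived
```

In practice you would feed `missing` into whatever recovery path your provider supports (retransmit request or snapshot resync).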

Delivery semantics

In distributed systems, “at least once” and “exactly once” are the classics. In practice, you’ll aim for:

– At least once with idempotent processing (duplicates possible, but harmless)
– Exactly once only for limited subsets (usually harder)

You want your feed consumers to cope with retries and duplicates without corrupting state.

Time semantics (event time vs processing time)

Many bugs hide in the difference between the time a quote occurred and the time your system processed it. You should store both:

– event_time: from the message source (or derived)
– ingest_time / processing_time: when your system received/processed it

This matters for backtesting too. If you only store processing timestamps, replay becomes inaccurate.
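Carrying both timestamps is cheap if you attach the ingest time at the pipeline boundary. A sketch of the idea, with invented field names rather than any fixed schema:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class QuoteEvent:
    """Minimal normalized quote carrying both time semantics."""
    instrument_id: str
    bid: int              # price in integer ticks
    ask: int
    event_time_ns: int    # when the quote occurred at the source
    ingest_time_ns: int   # when our pipeline received it

def normalize(raw, now_ns=None):
    """Attach ingest time the moment the raw message enters the pipeline."""
    now_ns = now_ns if now_ns is not None else time.time_ns()
    return QuoteEvent(raw["sym"], raw["bid"], raw["ask"], raw["ts_ns"], now_ns)

q = normalize({"sym": "EURUSD", "bid": 108500, "ask": 108502,
               "ts_ns": 1_700_000_000_000_000_000})
delay_ns = q.ingest_time_ns - q.event_time_ns  # feeds your latency metrics
```

Replay then orders on `event_time_ns`, while live monitoring watches the `delay_ns` distribution.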

Choose your ingestion approach

Your ingestion design depends on the feed you’re consuming.

Streaming feeds

Streaming is the usual choice for real-time trading. Common patterns include:

– Persistent TCP/WebSocket connections
– Heartbeats and reconnection logic
– Sequence number tracking (when available)
– Snapshot + delta model (especially for order books)

With snapshot + delta, your software must detect when it’s out of sync and refresh from a new snapshot.
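The out-of-sync check often reduces to one decision function. This sketch assumes the common convention that each delta carries the sequence number it advances the book to; real feeds document their own rules, so treat this as a template:

```python
def apply_or_resync(book_seq: int, delta_seq: int) -> str:
    """Decide what to do with an incoming delta given the book's sequence.

    Returns "apply" (next in order), "ignore" (stale or duplicate),
    or "resync" (a gap was detected and a fresh snapshot is needed).
    """
    if delta_seq <= book_seq:
        return "ignore"
    if delta_seq == book_seq + 1:
        return "apply"
    return "resync"
```

The value of isolating this rule is that you can unit-test it exhaustively, which is hard to do for a whole connection handler.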

Polling APIs

If you use vendor REST APIs, polling introduces gaps unless you poll frequently and handle pagination carefully. Most teams end up building a hybrid: polling for occasional reference data, streaming for high-frequency updates.

Polling also tends to produce “time jumps” due to rate limits and request batching. If you do this, store raw responses so you can interpret the timeline later.

Bulk history ingestion

If your feed software also supports backfills, you need a careful plan for how you align historical data with live streams.

A common setup:
– Bulk load historical snapshots and deltas (if available)
– Start live stream at a known time boundary
– Merge with event-time based rules
– Keep a “gap detector” that flags missing sequences
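The merge step above can be sketched as a boundary cut plus an ordered merge. This assumes both inputs are already sorted by event time and that events are `(event_time_ns, payload)` tuples, which is an invented shape for illustration:

```python
import heapq

def merge_streams(historical, live, boundary_ns):
    """Merge a historical backfill with a live stream at a time boundary.

    Historical events at or after the boundary are dropped in favor of
    the live stream, so no region of time is double-counted.
    """
    hist = (e for e in historical if e[0] < boundary_ns)
    live_part = (e for e in live if e[0] >= boundary_ns)
    # key= avoids comparing payloads when two events share a timestamp
    return list(heapq.merge(hist, live_part, key=lambda e: e[0]))
```

A real backfill also needs the gap detector mentioned above, since sorted inputs alone don't prove completeness.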

Design your internal data model

This is the part that determines whether your system stays maintainable after the first “small change” request.

Symbols and instruments: don’t treat them casually

External symbol formats change. Corporate actions happen. Vendors use slightly different tickers. Your internal model should separate:

– Instrument ID (your internal stable identifier)
– Venue (exchange or trading venue code)
– External symbol mappings (one-to-many possible across vendors)

You’ll also want instrument metadata cached locally: tick size, lot size, contract multiplier, and any trading session rules you care about.

A practical trick: version your symbol mapping. When mappings change, you don’t want old data to “change under your feet” during replay.
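A versioned mapping can be as simple as an append-only list of mapping snapshots. This is a hypothetical sketch; a production system would persist versions durably and record the active version with each stored event:

```python
class SymbolMap:
    """Versioned vendor-symbol -> internal-instrument mapping.

    Replays pin a version number so historical data resolves through
    the mapping that was active when it was recorded.
    """

    def __init__(self):
        self.versions = []  # list of dicts; index doubles as version id

    def publish(self, mapping):
        self.versions.append(dict(mapping))
        return len(self.versions) - 1  # new version id

    def resolve(self, vendor_symbol, version=None):
        v = len(self.versions) - 1 if version is None else version
        return self.versions[v][vendor_symbol]

m = SymbolMap()
v0 = m.publish({"BRKB": "INST-1001"})
v1 = m.publish({"BRK.B": "INST-1001"})  # vendor renamed the ticker
```

Live processing resolves at the latest version; a replay of old data passes the pinned `version` explicitly.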

Message types and event schemas

Define a small set of event types you actually use downstream. For most systems:

– Trade events
– Quote events (top-of-book)
– Order book updates (delta updates or levels)
– Reference events (instrument status, corporate actions, listing/delisting)

Keep schemas consistent. If you represent an order book level, be explicit about fields: price, size, side, and level index if you use it.

Numeric types and precision

For prices and sizes, floating point types can cause subtle rounding differences. You can use decimals (where performance permits) or fixed-point integers.

A common approach:
– Store prices as integer ticks (price / tick_size)
– Store sizes as integer units (or lot-adjusted units)

If you only know tick size after initialization, fetch it early—before the first real-time processing window.
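Conversion to integer ticks belongs at the boundary, so the hot path never touches floats. A sketch using Python's `decimal` module for the boundary conversion:

```python
from decimal import Decimal

def price_to_ticks(price: str, tick_size: str) -> int:
    """Convert a decimal price string to integer ticks, exactly.

    Rejects prices that don't align to the tick grid instead of
    silently rounding them.
    """
    ticks = Decimal(price) / Decimal(tick_size)
    if ticks != ticks.to_integral_value():
        raise ValueError(f"price {price} not aligned to tick {tick_size}")
    return int(ticks)

def ticks_to_price(ticks: int, tick_size: str) -> Decimal:
    """Inverse conversion for display and external interfaces."""
    return ticks * Decimal(tick_size)
```

Raising on misaligned prices is a policy choice; some feeds legitimately publish off-grid values (auction prices, midpoints), in which case you'd flag rather than reject.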

Timestamp precision and normalization

Normalize timestamps to a single internal format. Store them as integers (e.g., nanoseconds since epoch) or as 64-bit epoch milliseconds with enough precision for your needs.

If the source time resolution is lower (common), don’t invent precision; just store what you got and handle ordering accordingly.

Implement normalization and cleaning

Normalization is where systems quietly accumulate tech debt. You can either handle the edge cases now—or your on-call person will handle them later. (No offense, but they’ll have a lot of work if you don’t.)

De-duplication

Duplicates happen due to reconnections and retry logic. De-duplication can use:

– Sequence numbers per instrument or feed
– Message hashes (trade_id + price + size + timestamp bucket)
– Vendor-provided unique IDs when present

If you select duplicates based on time alone, you’ll sometimes keep multiple near-identical messages. Use IDs or sequence markers when they exist.
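When stable IDs exist, de-duplication is a bounded seen-set. The sketch below uses LRU-style eviction to keep memory flat; the window size is an assumption you'd tune from your actual reconnect and replay behavior:

```python
from collections import OrderedDict

class Deduper:
    """Bounded de-duplication keyed by a stable message ID."""

    def __init__(self, max_entries=100_000):
        self.seen = OrderedDict()
        self.max_entries = max_entries

    def is_duplicate(self, msg_id) -> bool:
        if msg_id in self.seen:
            self.seen.move_to_end(msg_id)  # refresh recency
            return True
        self.seen[msg_id] = None
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)  # evict the oldest ID
        return False
```

Note the failure mode: an ID evicted from the window is treated as new again, so the window must be larger than the longest plausible replay.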

Out-of-order handling

You’ll see messages arrive late. Reasons include network jitter, reconnection replay, and upstream buffering.

Approaches:
– Maintain a small reorder buffer per instrument for a fixed time window (e.g., tens of milliseconds to a few seconds)
– Use event-time ordering for historical replays
– Use processing-time ordering for live streams when event-time is unreliable (but label it)

You need to decide what “late” means for each event type.
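A fixed-window reorder buffer can be sketched as a small heap with a watermark. The window size and the tuple shape are assumptions for illustration:

```python
import heapq
import itertools

class ReorderBuffer:
    """Holds events for a fixed event-time window, releasing them in order."""

    def __init__(self, window_ns):
        self.window_ns = window_ns
        self.heap = []
        self._tiebreak = itertools.count()  # stable order for equal timestamps

    def push(self, event_time_ns, payload):
        heapq.heappush(self.heap, (event_time_ns, next(self._tiebreak), payload))

    def drain(self, now_ns):
        """Release everything older than now - window, in event-time order."""
        out = []
        while self.heap and self.heap[0][0] <= now_ns - self.window_ns:
            t, _, payload = heapq.heappop(self.heap)
            out.append((t, payload))
        return out
```

Anything arriving later than the window is, by this policy, "late" and needs a separate decision: drop it, log it, or emit a correction event.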

Gap detection and recovery

If your feed provider includes sequence numbers, gaps are detectable. Without them, gaps are harder—the system can only infer missing data by checking monotonic fields or expected frequency.

When you detect a gap:
– For order books, resync from latest snapshot
– For trades and quotes, decide whether to request backfill or wait for next valid segment

This is where you should log enough data to diagnose later without spelunking through raw logs.

Storage: choose what you actually need

Storage is the part that seems simple until you estimate your volume. Market data grows fast, and “we’ll handle retention later” becomes a lifestyle.

Raw vs normalized storage

A common pattern:
– Keep raw messages for a short period (for debugging and audits)
– Keep normalized events longer (for replay and analysis)

If you only store normalized events, you may struggle to explain discrepancies when a vendor later changes formatting or you mis-modeled something.

Event logs vs time series databases

Market data is event-driven. An event log (append-only) fits nicely for replay.

Time series databases can work too, especially for metrics like OHLCV bars or top-of-book snapshots. But if your consumers need level-by-level book reconstruction, you’ll likely prefer event storage with explicit deltas or reconstructed snapshots.

There’s no universal winner. The decision usually comes down to what queries you run:
– Query by time range and reconstruct state? Event logs or snapshots.
– Query aggregated bars and indicators? Time series DB works.

Indexing and query patterns

If you plan to replay:
– Index by instrument and event_time
– Keep a fast way to retrieve contiguous ranges
– Store sequence numbers if available

For live usage:
– Storage might not be touched frequently (you stream to consumers directly)
– But it’s your safety net when consumers fail or you need to backfill

Compression and schema evolution

Data compression matters. Most message payloads compress well, especially for repeated fields and consistent schemas.

Also plan for schema evolution:
– Version your event schemas
– Write migration logic or adapt consumers to multiple versions for a while
– Store the schema version with each event

When you change a field type, don’t pretend it didn’t happen. Track it.
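Supporting multiple schema versions usually means one adapter function at the read boundary. The versions and field names below are invented to show the shape of the idea:

```python
from decimal import Decimal

def read_trade(event: dict) -> dict:
    """Adapt multiple stored schema versions to the current internal shape."""
    v = event.get("schema_version", 1)
    if v == 1:
        # hypothetical v1: decimal price string plus tick size
        ticks = int(Decimal(event["price"]) / Decimal(event["tick_size"]))
        return {"instrument_id": event["sym"], "price_ticks": ticks}
    if v == 2:
        # hypothetical v2: already integer ticks under stable field names
        return {"instrument_id": event["instrument_id"],
                "price_ticks": event["price_ticks"]}
    raise ValueError(f"unknown schema version {v}")
```

The point of raising on unknown versions is that silent fallthrough is exactly how "the replay changed under our feet" incidents start.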

Feed delivery: how consumers get the data

Once you have normalized events, you need to deliver them to strategies and analytics tools. This layer is where “small mistakes” become “expensive incidents.”

In-process vs distributed delivery

If you run strategies inside the same application:
– You can use in-memory queues
– You avoid serialization overhead
– You share state faster

But if you want multiple services and independent scaling:
– You need inter-process communication (IPC)
– You need a messaging layer (or an HTTP/gRPC streaming interface)

Distributed delivery adds complexity. You also gain resilience, but you must handle serialization cost and failure semantics.

Streaming protocol choices

Common options:
– WebSocket for external clients (fine for dashboards)
– gRPC streaming for internal services
– Kafka-like systems for durable event streaming
– Custom UDP/TCP where you control everything (high performance, high responsibility)

If you need durable replay and multiple consumers, a durable log-style system often makes sense. If you need minimal latency inside one machine, shared memory or in-process queues can be simpler.

Backpressure and slow consumers

A fast producer plus a slow consumer will eventually mean one of three things:
– You buffer until you run out of memory
– You drop messages
– You slow down the producer

Your strategy likely expects certain update frequency. If you drop too much, it becomes effectively blind. If you buffer too much, latency rises.

So you should define policies per consumer:
– “Drop oldest and continue” for dashboards
– “Block / throttle” for a risk system that can’t tolerate missed updates
– “Snapshot + catch up” for strategies, if you can reconstruct state quickly
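The "drop oldest and continue" policy for a dashboard-style consumer is a bounded queue with an explicit drop counter. A minimal sketch:

```python
from collections import deque

class DropOldestQueue:
    """Bounded queue: when full, the oldest update is discarded so the
    consumer always sees recent data at the cost of completeness."""

    def __init__(self, maxlen):
        self.q = deque(maxlen=maxlen)
        self.dropped = 0  # expose this as a metric

    def publish(self, event):
        if len(self.q) == self.q.maxlen:
            self.dropped += 1  # deque will evict the oldest on append
        self.q.append(event)

    def poll(self):
        return self.q.popleft() if self.q else None
```

Counting drops matters: the policy is acceptable only as long as someone can see how often it fires.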

Heartbeats and health signals

Your feed software must expose health status:
– Connection status
– Last received sequence number (optional but helpful)
– Last event_time processed
– Consumer lag (if using a log system)

This prevents the classic scenario: everything is “running,” but nobody noticed it stopped receiving quotes an hour ago.

Order book reconstruction and book state management

If your feed includes level 2 or full order book levels, you need to reconstruct book state from snapshots and deltas.

Snapshot + delta flow

Most order book feeds work like:
1) Receive a full snapshot
2) Receive incremental updates (deltas)
3) Apply deltas until next resync

Your software needs to:
– Apply deltas deterministically
– Track snapshot version or sequence
– Detect when you missed updates and resync

If deltas include price-level updates (change size, add/remove levels), you need consistent rules for deletions and zero-size levels.
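The zero-size-means-delete rule is a common convention (not universal, so check your feed's documentation). One side of a book with that rule applied can be sketched as:

```python
class BookSide:
    """One side of an order book as price_ticks -> size, with explicit
    zero-size deletion. Applying the same delta twice is harmless."""

    def __init__(self, is_bid: bool):
        self.levels = {}  # price_ticks -> size
        self.is_bid = is_bid

    def apply(self, price_ticks: int, size: int):
        if size == 0:
            self.levels.pop(price_ticks, None)  # delete the level
        else:
            self.levels[price_ticks] = size     # add or update

    def best(self):
        if not self.levels:
            return None
        return max(self.levels) if self.is_bid else min(self.levels)

bids = BookSide(is_bid=True)
bids.apply(100, 5)
bids.apply(101, 3)
bids.apply(101, 0)  # zero size: the level is removed
```

Because `apply` is a pure set/delete, it is idempotent, which is exactly the property the reliability section below asks for.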

Representation: levels and their ordering

Choose a representation that reflects how you’ll query:
– A sorted structure for best bid/ask retrieval
– A map from price->size for quick updates
– A cached best levels list for fast reads

If you constantly recompute sorted order after each delta, you’ll waste CPU. Cache best levels and update them incrementally.

How to handle partial refresh and drift

Feeds sometimes behave imperfectly. You can see drift due to:
– missed deltas
– out-of-order messages
– provider bugs

So it’s wise to periodically validate:
– Compare reconstructed book against occasional snapshots
– Detect level count mismatches or best price mismatches beyond tolerance

Validation is not free, but it can save you from subtle strategy degradation.
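A periodic validation pass can be a plain comparison that returns readable mismatches. The tolerance policy here is a local choice, not a standard:

```python
def validate_book(reconstructed: dict, snapshot: dict, tolerance_levels=0):
    """Compare a reconstructed side (price -> size) against a fresh
    snapshot; return a list of human-readable mismatches (empty if clean)."""
    problems = []
    if abs(len(reconstructed) - len(snapshot)) > tolerance_levels:
        problems.append(
            f"level count {len(reconstructed)} vs snapshot {len(snapshot)}")
    for price, size in snapshot.items():
        if reconstructed.get(price) != size:
            problems.append(
                f"level {price}: have {reconstructed.get(price)}, "
                f"snapshot says {size}")
    return problems
```

The non-empty result is what you alert on; logging the mismatch strings gives on-call enough to diagnose without replaying raw logs.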

Backtesting and replay: build it as a first-class feature

A surprising number of teams treat replay as an afterthought. Then they discover that the “backtest” they get is not the one they wanted.

Replay needs deterministic behavior

If you want replay to match live behavior, your replay engine should:
– Recreate timing semantics (at least event-time ordering)
– Handle duplicates the same way
– Apply normalization and reconstruction the same way

Even a small mismatch in timestamp interpretation can change your results if you use “time since last trade” logic.

Bridging history and live

To test a strategy that assumes a certain state at start time:
– Load historical events up to the start timestamp
– Reconstruct state (order book, last trades, rolling indicators)
– Start applying live events from there

This means your internal state update logic should be shared between live and replay where possible. Copy/paste logic is where bugs breed.

Event-time vs simulated time

For backtesting you usually simulate using event-time, not processing-time, because that represents market reality better.

But if you want to model latencies, you can add delay to delivery and see how your strategy reacts when updates arrive late. If you care about this, store both event_time and ingest_time so you can compute a realistic delay distribution.

Performance and scaling realities

Market data feed systems often look straightforward—until you hit peak volume and CPU usage climbs like a bad habit.

Where time goes

Common bottlenecks:
– Serialization/deserialization overhead
– Excessive object allocation in high-frequency processing
– Logging too much at debug/info levels
– Re-sorting data structures after each update
– Synchronous I/O in the hot path

You want a hot path that does the minimum work per event and pushes slower work out to background threads.

Threading model and scheduling

A simple approach:
– One ingestion thread/process per connection
– A normalization pipeline stage
– A distribution stage
– Separate storage writer workers

But be careful: locking shared structures (especially order book state) can kill throughput. Use partitioning by instrument where possible: one thread owns state for a subset of instruments, reducing contention.

Memory management

Buffering and queues can hide memory growth. Track:
– queue sizes
– average and max lag
– retention buffers for reorder windows
– snapshot caches

If you use reorder buffers for out-of-order messages, put a hard cap on buffer size.

Reliability: the boring part that keeps you employed

Market data systems don’t just fail. They fail in specific patterns that look harmless until your P&L takes the day off.

Retries and reconnections

Implement reconnection with:
– exponential backoff
– clear logging
– state reset policies

When you reconnect, decide whether you:
– resume from last known sequence (preferred if provider supports it)
– request a fresh snapshot
– temporarily mark data as “degraded” until you catch up

Idempotent consumers

If your system can ever redeliver events, consumers need idempotency. That means each event should have a stable ID or sequence key consumers can use to ignore duplicates.

At minimum, you should ensure that applying the same delta twice doesn’t mess up the order book. This is less about perfection and more about not letting “retry storms” corrupt state.

State persistence and restart recovery

On restart, you should be able to recover without waiting for a long warmup. You can:
– persist last processed sequence numbers per instrument/connection
– persist snapshots of order book state periodically
– store enough to determine whether you’re safe to process deltas immediately

Otherwise, every restart turns into a slow catch-up exercise.

Security and operational hygiene

This section won’t win you awards, but it prevents avoidable incidents.

Authentication and secrets

Store API keys and feed credentials in a secrets manager, not in config files checked into Git. Use rotation policies with clear runbooks.

Auditability

Record:
– which feed sources you connected to
– when you connected/disconnected
– the version of your normalization code used
– any schema/field mapping versions

If something goes wrong, you’ll want to know whether it’s a data issue or a pipeline issue.

Monitoring and alerting

At minimum monitor:
– message rate
– last received time
– consumer lag
– gap detection events
– error rates in parsing/normalization
– CPU and memory usage

You don’t need fancy dashboards at first. You need alerts that match real failure modes.

Testing: how to avoid shipping a feed that mostly lies

Testing market data software requires different strategies than typical web apps.

Replay-based tests

Use recorded feeds to run integration tests:
– feed the raw messages into your pipeline
– compare output events to expected normalized results
– validate order book reconstruction against known snapshots

This works well because it turns nondeterministic live behavior into repeatable test runs.

Property-based tests for normalization

You can test invariants instead of exact event outputs. Examples:
– prices always non-negative and correctly ticked
– sizes never negative
– order book best bid <= best ask (for sane feeds)
– event IDs are unique within a defined scope

This catches edge cases without enumerating every scenario.
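A lightweight version of this needs no property-testing library: write the invariants as plain assertions and run them over generated or recorded events. Field names here are invented for illustration:

```python
import random

def check_invariants(event: dict):
    """Invariants that should hold for any sane normalized event."""
    assert event["price_ticks"] >= 0, "negative price"
    assert event["size"] >= 0, "negative size"
    if event.get("best_bid") is not None and event.get("best_ask") is not None:
        assert event["best_bid"] <= event["best_ask"], "crossed book"

# Generate random-but-valid events and confirm the checker accepts them.
rng = random.Random(42)
for _ in range(1000):
    bid = rng.randrange(0, 10_000)
    check_invariants({
        "price_ticks": rng.randrange(0, 10_000),
        "size": rng.randrange(0, 1_000),
        "best_bid": bid,
        "best_ask": bid + rng.randrange(0, 100),
    })
```

A dedicated property-testing library adds shrinking (minimizing failing inputs), which is worth having once the invariant set grows.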

Chaos tests for reconnections

Simulate dropped connections, partial message sequences, and delayed delivery. Ensure your system recovers and resumes correctly without corrupting state.

If you never test this, you’ll learn it on a live trading day. That’s a bad school.

Choosing an implementation stack

There isn’t a single correct language or framework. The best stack is the one you can properly operate and debug under pressure.

Common language choices

– Java / Scala: strong ecosystem for streaming, robust concurrency tools
– C#: solid performance and tooling, good for .NET shops
– Python: great for prototyping and transformation logic, but use care for ultra-low-latency hot paths
– Go: good concurrency model and deployment simplicity
– C++: fastest option, most demanding to implement safely

Even in high-performance systems, teams often split responsibilities: Python or Go for ingestion glue and normalization logic; a faster service for book reconstruction and distribution.

Data processing patterns

You’ll likely implement one of these patterns:

– Single service with internal stages (simplest)
– Multi-service pipeline (more maintainable, more operational overhead)
– Hybrid: one ingestion service, one processing service, one distribution/storage service

Using a message log vs direct streaming

A dedicated durable log makes replay easier and decouples producers and consumers. Direct streaming is lower complexity for small deployments.

If you’re expecting multiple consumers and want long-term replay, durable logging often pays off.

Practical architecture examples

This section gives a few plausible architectures. They aren’t “the one true design,” but they reflect how teams typically start and then evolve.

Example 1: Single-node feed for strategy prototyping

– Ingestion: stream connection in one process
– Normalization: in the same process
– Storage: rolling file storage for raw + normalized events
– Distribution: in-process pub/sub to strategies

Pros: fast to build, minimal operational overhead. Cons: limited scaling, restart handling can be annoying.

Example 2: Two-stage pipeline with durable event log

– Stage A (ingestion/normalization): consumes vendor feed, normalizes events, writes to durable log
– Stage B (distribution/storage): reads from log, performs order book reconstruction if needed, serves consumers

Pros: decoupling, replay-friendly, easier to debug. Cons: extra serialization and infrastructure.

Example 3: Microservice feed with consumer-specific subscriptions

– Ingestion service per source
– Normalization service shared
– Order book reconstruction service per instrument partition
– Consumer gateway that exposes subscriptions

Pros: scalability and customizable delivery. Cons: more operational plumbing and monitoring needs.

Common pitfalls (and how to avoid them)

These are the issues that repeatedly show up in real projects.

Assuming “timestamp equals reality”

Event-time can be delayed, mislabeled, or inconsistent across sources. Store both event_time and processing_time. Use event_time for replay; use processing_time for live latency metrics.

Trying to reconstruct full books without resync logic

If your order book reconstruction lacks strict resync triggers, you’ll drift from reality quietly. Then your strategies start trading based on a book that isn’t the one the market is making.

Always implement snapshot recovery and gap detection where possible.

Over-optimizing before validation

Performance tuning is useful, but not as useful as correctness. Start by ensuring:
– normalized event fields are correct
– symbol mappings are stable
– duplicates are handled safely
– reconstructed order books match snapshots

Only then measure latency hotspots.

Not versioning your transformations

If you change normalization logic, replay results can change. Version your normalization code and schema mapping, and record versions with produced events.

Logging too much in the hot path

Debug logs inside per-message processing can destroy throughput and distort timing. Use sampling and structured logging in non-hot paths.

How to measure whether your feed is “good”

“Good” is a measurable concept. At a minimum, you want correctness checks and operational checks.

Correctness checks

– Best bid/ask alignment with snapshots (within tolerance)
– Trade counts and volume totals for a time window
– Sequence gap detection rate (should be rare; if frequent, resync logic is failing)
– Duplicate handling: duplicates exist but should not change state

Operational metrics

– Ingestion reconnect frequency
– End-to-end latency distribution (event_time to consumer delivery time)
– Consumer lag (if using durable logs)
– CPU usage per pipeline stage
– Memory usage and queue sizes
– Error counts grouped by error type (parsing, mapping, storage writes)

Cost planning: storage, compute, and staffing

You don’t need to be a fortune teller, but you do need a rough budget.

Compute requirements

Compute scales with:
– message rate
– number of instruments
– complexity of normalization and reconstruction
– serialization overhead
– storage writes

Even if messages are small, rates are high. Plan for sustained processing, not just average throughput.

Storage requirements

Storage depends on:
– raw vs normalized retention
– whether you store full order book snapshots frequently
– compression ratios
– indexing overhead

Many teams end up storing:
– raw for short retention
– normalized for longer
– reconstructed snapshots for specific checkpoints needed for quick replay

Staffing reality

Building a feed is one thing. Operating it is another. You’ll spend time on:
– source changes
– schema changes
– reconnection edge cases
– incident response
– version upgrades

If you don’t have operational coverage, keep scope smaller at first. Build the parts you can maintain.

Example workflow: from prototype to production

Here’s a workflow that avoids the most common traps.

Step 1: Pick a narrow scope

For example:
– one data source
– one asset class (or one exchange)
– quotes + trades only at first
– top-of-book reconstruction instead of full depth

Prototypes should be small enough that you can trust them after a day of watching logs.

Step 2: Build normalization with a schema version

Even one-to-one mapping benefits from versioning. When you validate, you’ll know exactly which transformation logic produced your results.

Step 3: Add replay and state reconstruction

Record a feed segment and build a replay tool that uses the same normalization and reconstruction logic as live.

Then compare outputs to expectations:
– order book best levels
– trade events within time windows
– duplicate behavior

Step 4: Add durable storage and consumer delivery

Introduce durable logging if you need multiple consumers or long replay windows.

Keep consumer interfaces stable. If you change event formats, do it in a versioned way.

Step 5: Operational hardening

That means alerts, backoff logic, reconnection tests, and restart recovery. Also add gap detection.

After hardening, run it for at least a week with a known-good set of data and check metrics.

When you should not build this (or not build all of it)

Sometimes “build your own market data and feed software” becomes “invent new problems” when you actually need a backtest or a dashboard.

Consider using vendor software or a third-party feed when:
– you only need simple aggregated data
– you don’t need replay accuracy
– your strategy tolerates occasional missing updates
– you can’t justify operational support

That said, even then you might build parts of the pipeline: symbol normalization, storage for replay, or a consumer gateway. You don’t have to rebuild the entire feed origin story to get value.

Frequently asked questions

Do I need full raw message storage?

Not always. If you can fully trust your normalization and reconstruct state accurately from normalized events, you can store normalized events longer and raw messages for debugging only. However, raw storage helps when vendors change payload details or when you need to answer “why did this trade end up here?”

Should my feed software run on one machine or multiple?

Start with one machine for a prototype. If you need high scale, low latency across many instruments, or multiple independent consumers, split stages. Keep the system manageable—distributed complexity costs real time.

How do I ensure consumers don’t break when the feed restarts?

Use idempotent event IDs and state recovery logic. Expose health metrics so consumers can detect degraded feed state and pause safely if needed.

What’s the most common “first real bug”?

Normalization mistakes: symbol mapping errors, timestamp conversion bugs, and price tick rounding. These are rarely obvious until you compare your reconstructed outputs against reference snapshots or basic aggregates like volume-per-minute.

Final thoughts

Making your own market data and feed software is less about writing a fancy stream handler and more about building a dependable pipeline that does three things well: convert inconsistent incoming data into a stable internal format, replay with correct semantics, and deliver predictable updates to consumers.

If you respect time semantics, sequence/gap issues, and operational monitoring from the start, you can end up with a system that saves you time and causes fewer headaches than the average “just one script” approach. If you skip those parts, you’ll still build something—but it’ll mostly teach you new ways to get surprised during trading hours. Which, to be fair, is a learning experience, just not one you’d spreadsheet on purpose.