Data Pipelines

A business's marketing and sales data lives in a half-dozen dashboards that don't talk to each other. These pipelines are the ingestion layer that fixes that. They pull each source into one place automatically, every day, and feed the client dashboards I build on top. This is infrastructure. It runs unattended for months, and everything downstream depends on it being right.

Systems

The problem

The data exists. Nobody can see it in one place.

A typical local business has Google showing how people find them, Meta showing how their posts perform, and a point-of-sale system showing what actually sold. Three logins, three formats, three different ideas of what a "day" even means. Pulling it together by hand is the report that's always a week behind, or never gets made at all.

The fix isn't another dashboard to log into. It's an ingestion layer that does the pulling automatically and lands clean, deduplicated data in one place the dashboards can read. Each source gets its own pipeline, but they're all built the same way. So the system is repeatable. A new client or a new source is just a new instance of a pattern I've already proven, not a from-scratch build. The dashboards are only ever as trustworthy as the data underneath them. That's why this layer gets the engineering attention it does.

The three sources below (Google Business, Meta, and Square) are the ones I've built and shipped for this. They're not the limit of it. They're here because each one breaks differently, so together they show how the same pattern bends to fit whatever it's pulling from. The one rule that doesn't bend: I can only pull what a platform actually exposes through an API. If the data's reachable, this is how I reach it.

The pattern

One architecture, reused per source

Every pipeline follows the same spine. Credentials live in the platform's secret store, never in the code, and they're checked up front, so a missing key fails right away with a clear message instead of halfway through a run. A single fetch wrapper per source handles authentication and surfaces any API error instead of swallowing it. A scheduled trigger runs the whole thing once a day. And every write to the destination is idempotent. Running it twice produces the same result as running it once.
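The first two pieces of that spine can be sketched in a few lines. This is an illustration under assumptions, not the production code: the key names, the `store` mapping standing in for the platform's secret store, and the bearer-token header are all hypothetical.

```python
import json
import urllib.error
import urllib.request

# Hypothetical key names; a real pipeline lists whatever its source needs.
REQUIRED_KEYS = ("SOURCE_API_TOKEN", "DESTINATION_ID")


class MissingCredentials(RuntimeError):
    """Raised before any work starts when a required secret is absent."""


class SourceApiError(RuntimeError):
    """Raised whenever the source API answers with an error."""


def load_credentials(store, required=REQUIRED_KEYS):
    # `store` is any mapping that stands in for the secret store (assumption).
    missing = [key for key in required if not store.get(key)]
    if missing:
        raise MissingCredentials("Missing credentials: " + ", ".join(missing))
    return {key: store[key] for key in required}


def fetch_json(url, token, opener=urllib.request.urlopen):
    # One wrapper per source: attach auth, then surface errors, never swallow them.
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with opener(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        body = err.read().decode("utf-8", "replace")[:500]
        raise SourceApiError(f"{url} returned HTTP {err.code}: {body}") from err
    except urllib.error.URLError as err:
        raise SourceApiError(f"{url} unreachable: {err.reason}") from err
```

Calling `load_credentials` as the first line of a run is what turns a missing key into an immediate, named failure instead of a half-finished write.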

Three sources feed one ingestion layer, which lands clean data in storage and feeds the client dashboards. Storage is a spreadsheet per client today; as the client count grows, the same pipelines will feed a BigQuery warehouse.

Zoomed in, every individual pipeline follows the same internal spine:

source API → fetch + auth → parse + shape → idempotent write → storage

That last step is the load-bearing one. Pulling data on a schedule means overlapping windows and retries are going to happen, so the writes are built to dedupe rather than duplicate. How each pipeline pulls that off depends on the quirks of its source. That's where the three below get interesting.
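In its simplest form, an idempotent write is a keyed upsert. The sketch below assumes the destination can be modeled as a list of dicts; `upsert_rows` and the field names are illustrative, not taken from the real pipelines.

```python
def upsert_rows(table, new_rows, key_fields):
    """Merge new_rows into table (a list of dicts) in place.

    A row whose key already exists is replaced; a new key is appended.
    Applying the same batch twice leaves the table unchanged.
    """
    index = {
        tuple(row[field] for field in key_fields): position
        for position, row in enumerate(table)
    }
    for row in new_rows:
        key = tuple(row[field] for field in key_fields)
        if key in index:
            table[index[key]] = dict(row)
        else:
            index[key] = len(table)
            table.append(dict(row))
    return table


table = []
monday_run = [{"date": "2024-03-01", "sales": 10}, {"date": "2024-03-02", "sales": 7}]
tuesday_run = [{"date": "2024-03-02", "sales": 7}, {"date": "2024-03-03", "sales": 4}]

upsert_rows(table, monday_run, ["date"])
upsert_rows(table, tuesday_run, ["date"])   # overlapping window
upsert_rows(table, tuesday_run, ["date"])   # retry of the same run
# table now holds exactly one row per date
```

The overlap on 2024-03-02 and the repeated run both collapse into the same three rows, which is the property every source-specific variant has to preserve.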

Source 01 · Google Business Profile

Built around a source that reports late

Google Business Profile data arrives on a lag, somewhere between one and seven days, and not consistently. A naive daily pull would grab yesterday, see nothing yet, and record a zero that's actually just "not in yet." A week later the real number lands and the zero never gets corrected.

So this pipeline re-pulls a 14-day rolling window every run. The write logic does something specific for each date. If there's no row yet, it appends one. If there's a row that's all zeros, it overwrites it with the fresh data Google has since released. And if there's a row with real values already, it trusts it and skips. The lag no longer matters. The pipeline just keeps filling in the gaps until the truth arrives, and never replaces a real number with a stale one. It also normalizes whatever date format a spreadsheet cell throws at it, so a hand-edited cell can't break the dedupe.
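The three-way write rule is easier to see as code. This is a minimal sketch under assumptions: rows are dicts with a `date` field, the accepted date formats are a guess at what a hand-edited cell might contain, and the window is assumed to end yesterday.

```python
from datetime import date, datetime, timedelta

# Assumed formats a spreadsheet cell might hold; extend as needed.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def normalize_date(value):
    """Return an ISO date string for a date object or a date-like string."""
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date value: {value!r}")


def rolling_window(today, days=14):
    """The last `days` dates before `today`, oldest first."""
    return [(today - timedelta(days=n)).isoformat() for n in range(days, 0, -1)]


def merge_window(sheet_rows, fetched, metric_fields):
    """Apply append / overwrite-zeros / skip per date; return what was done."""
    by_date = {normalize_date(row["date"]): row for row in sheet_rows}
    actions = {}
    for day, metrics in fetched.items():
        existing = by_date.get(day)
        if existing is None:
            new_row = {"date": day, **metrics}
            sheet_rows.append(new_row)
            by_date[day] = new_row
            actions[day] = "append"
        elif all(not existing.get(field) for field in metric_fields):
            existing.update(metrics)      # placeholder zeros, safe to replace
            actions[day] = "overwrite"
        else:
            actions[day] = "skip"         # real values already recorded
    return actions
```

Because existing dates are normalized before the lookup, a cell edited by hand to `3/1/2024` still matches the fetched key `2024-03-01`.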

Source 02 · Meta (Facebook + Instagram)

Two platforms, one normalized table

Meta is really two sources behind one login: a Facebook Page and an Instagram business account. Each has its own metrics, its own quirks, and per-post insights that have to be fetched one post at a time and flattened from a deeply nested response into flat rows. The pipeline pulls both, tags each row by platform, and lands them in one table with a shared shape.
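As a rough illustration of the flattening step, the sketch below assumes each post's insights come back as a list of metrics, each with a name and a nested list of values. The metric names and the `flatten_post` helper are hypothetical.

```python
def flatten_post(post, insights, platform):
    """Turn one post plus its nested insights into a single flat row."""
    row = {
        "platform": platform,                 # tag so both sources share a table
        "post_id": post["id"],
        "created_time": post.get("created_time"),
    }
    for metric in insights.get("data", []):
        values = metric.get("values") or [{}]
        # Keep the most recent value reported for each metric.
        row[metric["name"]] = values[-1].get("value", 0)
    return row


example_post = {"id": "123_456", "created_time": "2024-03-01T15:00:00+0000"}
example_insights = {
    "data": [
        {"name": "impressions", "values": [{"value": 340}]},
        {"name": "engaged_users", "values": [{"value": 21}]},
    ]
}
row = flatten_post(example_post, example_insights, "facebook")
```

Running the same function with `platform="instagram"` yields rows of the same shape, which is what lets both sources land in one table.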

Posts are deduped by post ID. A re-run updates the metrics on an existing post rather than adding a duplicate, because engagement keeps climbing for days after something is published. For the same reason the daily run uses a two-day window, to catch late numbers on yesterday's posts. Account-level rows are deduped on a composite date-and-platform key, and follower change is computed by diffing against the most recent prior row. So a single daily number quietly becomes a trend line.
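The account-level logic might look something like this. It is a sketch, assuming ISO date strings (so string comparison orders them correctly) and hypothetical field names such as `followers` and `follower_change`.

```python
def upsert_account_row(rows, new_row):
    """Insert or replace the row for (date, platform) and compute follower change."""
    key = (new_row["date"], new_row["platform"])

    # Most recent earlier row for the same platform, if any.
    earlier = [
        row for row in rows
        if row["platform"] == new_row["platform"] and row["date"] < new_row["date"]
    ]
    previous = max(earlier, key=lambda row: row["date"], default=None)
    change = None if previous is None else new_row["followers"] - previous["followers"]

    record = {**new_row, "follower_change": change}
    for position, row in enumerate(rows):
        if (row["date"], row["platform"]) == key:
            rows[position] = record        # re-run: update in place
            return record
    rows.append(record)
    return record
```

Diffing against the most recent prior row, not strictly yesterday's, means a skipped day produces a larger delta instead of a missing one.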

Source 03 · Square (POS + Appointments)

A backfill that survives the clock

Square is the source of truth for money: orders, refunds, tips, and appointments. The daily pull aggregates a day of orders into one row, converting every amount from cents to dollars because the API stores money in cents. It also resolves bookings against the staff and service catalogs, so a booking reads as a real service and staff name instead of a pair of opaque IDs.
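A simplified sketch of the daily rollup and the catalog lookup follows. The field names are assumptions modeled loosely on an order with money amounts in cents, and `staff_id` / `service_id` are hypothetical stand-ins for whatever IDs a booking carries.

```python
from decimal import Decimal


def cents_to_dollars(cents):
    # Decimal avoids float rounding drift when summing money.
    return Decimal(cents) / Decimal(100)


def summarize_day(day, orders):
    """Collapse one day's orders into a single row, in dollars."""
    total_cents = sum(o.get("total_money", {}).get("amount", 0) for o in orders)
    tip_cents = sum(o.get("total_tip_money", {}).get("amount", 0) for o in orders)
    return {
        "date": day,
        "order_count": len(orders),
        "gross_sales": cents_to_dollars(total_cents),
        "tips": cents_to_dollars(tip_cents),
    }


def resolve_booking(booking, staff_by_id, service_by_id):
    """Swap opaque IDs for names; fall back to the raw ID if it is unknown."""
    return {
        "start": booking["start"],
        "staff": staff_by_id.get(booking["staff_id"], booking["staff_id"]),
        "service": service_by_id.get(booking["service_id"], booking["service_id"]),
    }


orders = [
    {"total_money": {"amount": 1250}, "total_tip_money": {"amount": 200}},
    {"total_money": {"amount": 399}},
]
summary = summarize_day("2024-03-01", orders)
```

Summing in cents first and converting once at the end keeps the arithmetic in integers for as long as possible.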

The engineering I'm most proud of here is the 90-day backfill. The platform kills any script that runs longer than six minutes, and ninety days of orders, bookings, and catalog lookups doesn't finish in six. So the backfill processes one day at a time and checkpoints after each. It records the last date it completed, watches its own clock, and stops cleanly before the limit. The next run reads the checkpoint and picks up exactly where it left off. A job too big to run in one shot finishes safely across several, and it's safe to rerun if it ever gets interrupted, because it always knows where it was.
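The checkpoint-and-resume loop can be sketched as below. It assumes `state` is any persistent key-value mapping, that a single day always finishes well inside the safety margin, and that a 300-second budget leaves enough headroom under the six-minute limit. All names are illustrative.

```python
import time
from datetime import date, timedelta


def run_backfill(state, start, end, process_day,
                 budget_seconds=300, clock=time.monotonic):
    """Process days from the checkpoint up to `end`.

    Returns True when the backfill is complete, False when it stopped
    early to stay inside the time budget.
    """
    started = clock()
    last_done = state.get("last_completed")
    current = (
        date.fromisoformat(last_done) + timedelta(days=1) if last_done else start
    )
    while current <= end:
        if clock() - started >= budget_seconds:
            return False                     # stop cleanly; next run resumes
        process_day(current)
        # Checkpoint only after the day has fully succeeded.
        state["last_completed"] = current.isoformat()
        current += timedelta(days=1)
    return True
```

A scheduler would call `run_backfill` repeatedly until it returns `True`. Because the checkpoint advances only after a day completes, an interrupted run repeats at most one day, and the idempotent write makes that repeat harmless.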

Reliability

Built to run unattended

The whole point is that no one has to watch these. That only works if they fail loud instead of silent. Credentials are checked before any work starts. Each source's master run isolates its steps, so a failure in one pull doesn't take down the others, and any failure sends an email with the specific error instead of disappearing into a log nobody reads. Idempotent writes mean a retry after a hiccup is always safe. Never a double-count, never a corrupted row.
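Step isolation with loud failure might be sketched like this. The `notify` callable is a hypothetical stand-in for whatever sends the email, and batching all failures into one message per run is an assumption.

```python
def run_all(steps, notify):
    """Run each (name, callable) step; one failure never blocks the rest."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as err:  # isolate: record the error and keep going
            failures.append((name, f"{type(err).__name__}: {err}"))
    if failures:
        body = "\n".join(f"{name} failed with {detail}" for name, detail in failures)
        notify(f"Pipeline run: {len(failures)} step(s) failed", body)
    return failures


def broken_pull():
    raise ValueError("API returned no rows")


completed = []
sent = []
result = run_all(
    [
        ("orders", broken_pull),
        ("bookings", lambda: completed.append("bookings")),
    ],
    notify=lambda subject, body: sent.append((subject, body)),
)
```

Here the `orders` step fails, `bookings` still runs, and the notification carries the specific error text instead of a generic "something went wrong".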

The result is infrastructure that earns trust by being boring. It runs at the same time every day, it tells you the moment something breaks, and the number it wrote last month still means exactly what it said. That reliability is the whole foundation. The client dashboards read straight from this layer, so when the ingestion is right, everything above it is right by default. The next step, as the client count grows, is changing where the data lands. Instead of a spreadsheet per client, the same pipelines feed one central warehouse, so the data can be queried at scale without changing how it's collected. The ingestion layer stays. The destination grows up.

The synthesis

What it demonstrates

The judgment to treat each messy API on its own terms (a reporting lag, a late-metrics window, a runtime limit) and the engineering to fold all three into one repeatable pattern that runs itself. This isn't a one-off script. It's a service. Idempotent, fail-loud, and built so the next client or the next source is just a new instance of something I've already proven.

And it's the same pattern under most of what I build, not just these three sources and not just dashboards. Point it at whatever API a project needs, handle that source's quirks, write the result somewhere safe. The destination changes. The discipline doesn't.
