Silencing the Chaos: Unifying Paid Media Data Across 20+ Ad Accounts

June 22, 202611 min read

Picture this: 20+ ad accounts spread across TikTok, Meta, and YouTube. Each with its own naming conventions, metric definitions, and reporting cadences. Add a handful of third-party tools that promised "single source of truth" but delivered yet another silo.

The marketing team needed to answer a simple question: Which channels are driving awareness and engagement?

But simple questions rarely have simple answers when your data lives in five different places with five different schemas. Platform A calls it "impressions." Platform B calls it "views." Platform C has three different metrics that all sound like "reach" but measure slightly different things.

Meanwhile, the agency was spending more time wrestling with Excel pivots than actually deriving insights. They had the data. They just couldn't use it.

The Partner That Made It Work

Before we talk about tech, we need to talk about people.

We partnered with one key regional partner who became the backbone of this entire initiative. They didn't just execute. They absorbed. Different stakeholders across local markets had different voices, different priorities, different timelines. Our regional partner took on the burden of consolidating that noise. They translated between camps. They fed us clean, structured requirements instead of us chasing ten different inboxes.

They took feedback from local teams, pressure-tested it against what was actually feasible, and pushed back when something didn't make sense. That's not a small thing. In a project where the politics were harder than the data engineering, having one trusted partner who could speak both languages, technical and business, was the difference between stalling and shipping.

Nadiah and Daniel deserve special recognition. When leadership wanted status updates, they had them. When the agency needed clarification on data definitions, they got it same day. When a local market pushed back, Nadiah and Daniel absorbed that friction so we could focus on building instead of navigating politics.

This article is as much about that partnership as it is about the pipeline.

The Politics of Data

The technical challenge was real, but the politics were harder than any API quirk.

Every stakeholder had a different definition of "success" tied to their own objectives. Some cared about CPM. Others fixated on engagement rates. The agency had their own formulas, the internal teams had theirs. Two groups looking at the same campaign sometimes got different numbers.

We went through 12 rounds of alignment before we locked down:

Dimension	The Debate	What We Settled On
Key Metrics	"Which 5 metrics do we report?"	Awareness and engagement KPIs, segmented by funnel stage
Dimensions	"Do we break down by campaign, ad set, or creative?"	All three, with drill-down hierarchy
Goals	"What does 'good' look like?"	Platform-specific benchmarks + internal targets
Usage	"Who sees what?"	Leadership wants the 30,000-foot view. Working teams and the agency need the deep dive. Same data, different layers.
Naming	"Why is this called 'spend' here and 'cost' there?"	Harmonized taxonomy across all sources
Geography	"Country-level or city-level?"	Depends on the market. Some needed regional and city breakdowns. Others didn't. We built both.

Twelve rounds. That's not a typo.

The hardest part wasn't the data modeling. It was getting everyone to agree on a shared vocabulary. Our regional partner was instrumental here. They facilitated alignment sessions, pressure-tested assumptions, and kept the conversations moving when they threatened to stall.

The Framework

Here's the mental model that made it work:

Three layers, three purposes:

Layer 1: Raw Ingestion Native APIs pulled data from TikTok, Meta, and YouTube. Each platform returned different formats, different granularities, different refresh rates. We normalized everything into a consistent staging layer.

Layer 2: Harmonization This is where the real work happened. PySpark jobs matched columns, aligned time zones, converted currencies, and applied business rules. Regex patterns caught platform-specific quirks. ML classification handled the messy edge cases where manual rules couldn't keep up.

Layer 3: Decision Layer Clean data fed BI dashboards for daily review, while the same pipeline prepared feature tables for ML modeling and LLM-powered analysis. One pipeline, two outputs.

Why These Tools

Nothing else fit:

Tool	Why It Won	Why Alternatives Lost
Airflow	DAG orchestration with built-in retry logic	Prefect was overkill for our scale
Native APIs	Direct access, real-time data, full fidelity	Third-party connectors lagged or charged per account
PySpark	Scales across 20+ accounts, DataFrame familiarity	Pandas choked on partition counts
Streamlit	Quick prototype for agency Excel uploads with validation	Flask was too slow to iterate
BI Dashboards	Pre-designed templates aligned with stakeholders	Custom frontends were too slow to maintain
Chat Channel	Real-time issue resolution during reporting windows	Email was too slow for urgent data breaks

The Streamlit piece deserves special mention. The agency had third-party platforms with no API access, requiring manual Excel exports. Rather than ignore those sources, we built a Streamlit app where the agency uploads their Excel sheets. Our validation logic checks the data, scrapes it with custom rules we developed on the fly, and pushes it into our data lake to blend with the automated pipelines. When the agency uploads, it triggers a group of Airflow DAGs automatically. They can refresh our data pipelines on demand. Not elegant, but it closed the gap.

The chat channel also deserves credit. During reporting windows, data issues pop up fast. Having a dedicated communication channel where the agency, our team, and stakeholders could resolve issues in real time instead of waiting on email threads saved us hours every cycle. Our regional partner kept that channel active and structured, which made all the difference.

Where the Work Actually Happened

The Ingestion Layer

# Logic core: Pull from native platforms, normalize, stage
def ingest_platform_data(platform, account_ids, date_range):
    """Pull raw data, align schemas, push to staging."""
    raw_data = platform.api.fetch(account_ids, date_range)
    normalized = apply_schema_mapping(raw_data, PLATFORM_SCHEMAS[platform])
    quality_check(normalized)
    return normalized

Each platform had its quirks. Meta's API returned nested JSON. TikTok's metrics naming changed quarterly. YouTube's reporting latency meant we were always a day behind.

We handled it with:

Regex patterns to catch naming drift
ML classification for edge cases where rules broke
Delta tables to track changes without full reloads

The Harmonization Layer

This is where 20+ ad accounts became a single, queryable dataset.

Challenges:

Different currencies (SGD, MYR, USD, IDR) across Southeast Asian markets
Different time zones across the region
Different attribution windows (1-day, 7-day, 28-day)
Different naming conventions (campaign vs. adset vs. creative)
Inconsistent geographic granularity: some markets needed regional and city-level breakdowns, others only needed country-level. We built the schema to handle both.

Solution: PySpark jobs applied business rules sequentially at scale, processing all 20+ accounts in under 15 minutes:

Currency conversion — A broadcast join with daily FX rates converted every row in one pass while spend data scaled across partitions:
```
spend_df.join(broadcast(fx_rates), "currency", "left") \
        .withColumn("spend_usd", col("spend") * col("rate"))
```
Time zone normalization to UTC — Each account's time zone was stored as metadata. A single from_utc_timestamp call aligned every timestamp in parallel.
Attribution window standardization — 1-day, 7-day, and 28-day windows were collapsed into a consistent 7-day window after comparing volume patterns across platforms.
Naming resolution — Regex patterns caught platform-specific drift. TikTok renames metrics quarterly, so patterns like r'^(impression|reach|unique_reach|total_reach)' flagged mismatches before they broke the pipeline. A lightweight ML classifier (trained on historical naming conventions using TF-IDF features) resolved edge cases regex couldn't match.
Geographic hierarchy mapping based on market-specific requirements.

The result: a unified fact table where "reach" means the same thing regardless of source.

The Dashboard Layer

Convincing the agency to abandon manual PowerPoint decks was its own battle. Every reporting cycle meant hours of copying charts and formatting slides. Our regional partner handled the handholding — coached them through the first few cycles, showed them how to export reports directly from the dashboard. It took patience. But it worked.

We empowered the agency and working teams with dashboards built on designed, fixed templates. These templates were agreed upon with stakeholders during those 12 alignment rounds. No ad-hoc SQL. No "pull your own data." The structure is set. The flexibility is in the dates and the ability to slice and dice by platform, campaign, market, and creative.

But here's the key: the dashboard gives you the numbers. Deriving human insight is still the team's responsibility. We built a bespoke tool where analysts can directly type their observations and recommendations alongside the data. The numbers tell you what happened. The human tells you why.

And we're building something better. An LLM-powered insights tool is in progress that will surface anomalies, trends, and recommendations automatically. Until then, the bespoke tool keeps the human in the loop where they belong.

The backend pipeline runs on Airflow. The frontend runs on the BI platform. The Streamlit app handles the agency uploads. Three tools, one truth.

Results

Metric	Before	After
Time to Report	3-5 days	Same day (4-hour refresh)
Data Reconciliation	Manual (~30 hrs/week)	Automated (<0.5% error)
Stakeholder Alignment	Endless debates	Locked taxonomy
Platform Coverage	Fragmented views	20+ accounts unified
Insight Quality	Backward-looking	Forward-looking

The real win wasn't the speed. It was the trust. When everyone looks at the same numbers, conversations shift from "are these right?" to "what do we do about it?" Those 30 engineering hours reclaimed weekly went directly into analysis and optimization work that was previously impossible.

What's Next

The data layer we built isn't just for dashboards. It's the foundation for:

LLM-Powered Insights: Anomaly detection, trend identification, and automated recommendations — weekly briefs without a human combing through 20 ad accounts. The prototype surfaces spend anomalies and engagement outliers in plain English. We're iterating on accuracy before taking it live.
ML Feature Store: Harmonized tables feed audience scoring and spend optimization models. No more feature engineering from scratch per model.
Automated Alerting: When a campaign's CPA spikes or impressions drop below threshold, Slack gets notified within the next refresh cycle instead of waiting for a Monday morning debrief.

The boring plumbing is done. Now we get to see what the data is really telling us.

Key Takeaways

Partner smart. The right regional partner absorbs stakeholder noise so you can focus on building.
Align before you build. Twelve rounds of stakeholder alignment saved months of rework.
Change management matters as much as the tech. Convincing the agency to abandon manual decks required handholding, not documentation.
Close gaps with human-in-the-loop. A Streamlit app for agency uploads beat ignoring data sources.
One pipeline, multiple outputs. Dashboards and ML features from the same source of truth.
Real-time communication wins. A chat channel during reporting resolved issues faster than any ticket system.
Boring plumbing enables exciting possibilities. The data layer is the foundation for everything that comes next.

Acknowledgements

The Partner That Made It Work: Our regional partner, who consolidated ten different voices into one clear signal. They translated between technical and business teams, pressure-tested requirements, and kept the project moving when alignment threatened to stall.

Nadiah and Daniel (Social Media Manager and Specialist): For taking every initiative to communicate both up and down the chain. Their effort to relay stakeholder feedback and keep everyone aligned made it possible for us to focus on the technicals instead of the politics.

Strategic Visionaries: Haikal, Head of Digital, who championed this initiative from the start and helped the leadership team recognise the problem we were solving. His support gave us the mandate to build.

The Engine Room: The data engineers who built the pipeline, the analysts who validated the outputs, and the Streamlit devs who closed the gaps no API could reach.

Tags: paid-media, data-integration, airflow, pyspark, harmonization, stakeholder-alignment, data-politics, regional-partner, southeast-asia

Comments (0)

Loading comments...