2026-06-12 posthog.tracks duplicate event_id

Report date: 2026-06-12 · Last update: 2026-06-12
Duplicate event_ids in posthog.tracks failing the source unique test in the bq-marketing DAG. Investigation + dedup plan.

TL;DR. posthog.tracks carries a legacy pool of ~28.2M duplicate event_ids (~56.9M rows, ~28.8M excess, 0.92%) from historical backfills. The hourly dbt MERGE does not create new duplicates — it refreshes the existing pairs (re-delivered UUIDs match both copies → both updated → re-dated into a recent partition), which makes old dupes look new. Fix = one-time dedup + stop re-dating created_at on re-delivery + re-enable the owning model's unique test.

Summary

On 2026-06-11 the source unique test on posthog.tracks.event_id (dbt/sources/sources.yml, scoped created_at >= '2026-03-04') started failing in the bq-marketing DAG. It surfaced because the new marketing model installs_subscribers_by_channel_platform (tag:marketing) added a source('posthog','tracks') dependency, which pulled source_unique_posthog_tracks_event_id into dbt test --select +tag:marketing for the first time — the test count went 90 → 93 and ERROR 0 → 2. The job had been green daily before that: the data was always duplicated, it was simply untested in that DAG (and the owning model's unique test is commented out).

Measured in production: ~28.2M duplicate event_ids across ~56.9M rows, i.e. ~28.8M excess rows (0.92%) to remove. Current status of tracks: Degraded / under review — dedup in progress.

Impact

Lineage / failure point

graph LR ph["PostHog re-delivers
old event UUIDs"] raw["posthog.raw_tracks"] merge["posthog_tracks (dbt MERGE)
ON event_id (global)"] legacy["legacy dup pairs
2025-12 backfill + Mar merges"] tracks["posthog.tracks
~28.8M excess rows"] test["source unique(event_id)
bq-marketing"] ph --> raw --> merge legacy --> tracks merge -->|"matches BOTH copies
updates both → re-dated"| tracks tracks --> test classDef broken fill:#ffd6d6,stroke:#c00,color:#000 classDef partial fill:#fff2cc,stroke:#c90,color:#000 class legacy,tracks broken class merge,test partial

Root cause

Two distinct parts, previously conflated:

Part 1 — how the duplicates were created: legacy backfills (NOT the live merge)

posthog.tracks carries a large legacy pool of duplicated event_ids written by historical backfills. By bq_ingested_timestamp month of the duplicate rows:

Ingested monthDuplicate rowsSource
2025-12~37.5MBulk backfill on 2025-12-18 (the dominant pool)
2026-03~17.2Mmerge_backfill_to_tracks.py runs on 2026-03-03 / 2026-03-16
2026-01 / 02smallerOther backfill activity

merge_backfill_to_tracks.py contributed because both its source dedup and its MERGE ... ON are scoped to a single created_at day:

ON t.event_id = s.event_id
   AND t.created_at >= @start_ts AND t.created_at < @end_ts   -- one day

so the same event_id re-sent on a later day was not matched and was inserted again. But the bulk of the pool is the 2025-12 backfill, not the March per-day issue.

Part 2 — why they look "recent and ongoing": the live merge refreshes old pairs

The hourly model does not create duplicates. It is an incremental merge, unique_key='event_id', ON (event_id) only (global — verified from the compiled MERGE, no partition predicate). PostHog re-delivers old event UUIDs constantly (~370–555 WHEN MATCHED updates per hourly merge). When a re-delivered UUID is one of the legacy pairs:

  1. ON event_id matches both existing target rows.
  2. BigQuery legally updates multiple target rows from one source row — both copies are overwritten with the source values → two byte-identical rows with microsecond-equal bq_ingested_timestamp.
  3. created_at is rewritten to the re-delivery date, so the pair teleports into a recent partition and surfaces in "recent duplicate" queries.
  4. Row count stays 2 — nothing is inserted.

Hence the table-wide duplicate count is flat week-over-week while ~15 "new" dupes/day appear in recent partitions: they migrate, they are not created.

Pitfall: bq_ingested_timestamp is a column value, not an insert time

bq_ingested_timestamp is a column value copied on UPDATE — load_posthog_tracks.py stamps one datetime.now() per loader run and writes it to every row in that batch. It is not a row insertion time. The assumption "identical bq_ingested ⇒ inserted together" is false and initially misled the investigation (it looks like the two copies were freshly inserted as a pair, when in fact they are an old pair that was updated together in one merge).

Secondary data-quality bug — re-dating created_at across partitions

The same WHEN MATCHED THEN UPDATE rewrites created_at on every re-delivery, silently re-dating historical events across partitions — an Oct-2025 event can move into a June-2026 partition. This corrupts time-based analysis independent of the duplicate question, and is what Part 2 of the fix addresses.

Evidence

MetricValue
Total rows3,115,774,101
Distinct event_id3,087,004,577
Duplicate rows / excess to remove56,925,724 / 28,769,524 (0.92%)
Duplicate event_ids~28,156,200
Null event_id0
Genuine collisions (one event_id → >1 (event_name, user_id))0

Because there are 0 genuine collisions, every duplicate is a true re-send and deduping on event_id is lossless.

Fix and follow-ups

Part 1 — one-time dedup (dedup_tracks.py)

Zero-copy snapshot (14-day rollback), then a single atomic CREATE OR REPLACE preserving partition / cluster / description, keeping:

ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY bq_ingested_timestamp DESC) = 1

Removes ~28.8M rows; asserts rows == distinct event_ids == distinct-before and 0 nulls. Validated on staging with synthetic dupes (exact copies + cross-day re-sends): 7000 → 5000, 2000 removed, each of 1000 dup ids collapsed to the latest-ingest winner. Production dry-run confirms counts, generated DDL, and a ~7.2 TB scan; expected after = 3,087,004,577 rows. This work is read-only here — no dedup was run.

Part 2 — stop re-delivery re-dating history (models/posthog_tracks.sql)

Add merge_exclude_columns=['created_at'] so a matched (re-delivered) row keeps its original created_at and never changes partition. Ends the teleporting and keeps any future legacy straggler from resurfacing as a "recent" dupe. Safe to ship before the dedup.

Part 3 — re-enable the owning test

After dedup, re-enable the model-level unique test in models/posthog_tracks.yml (scoped created_at >= '2026-03-04') and fix the misleading comment, so the owning posthog-tracks DAG catches any recurrence — not just bq-marketing.

Cautions & upstream follow-up

References