2026-06-12 posthog.tracks duplicate event_id
TL;DR.posthog.trackscarries a legacy pool of ~28.2M duplicateevent_ids (~56.9M rows, ~28.8M excess, 0.92%) from historical backfills. The hourly dbt MERGE does not create new duplicates — it refreshes the existing pairs (re-delivered UUIDs match both copies → both updated → re-dated into a recent partition), which makes old dupes look new. Fix = one-time dedup + stop re-datingcreated_aton re-delivery + re-enable the owning model'suniquetest.
Summary
On 2026-06-11 the source unique test on
posthog.tracks.event_id (dbt/sources/sources.yml, scoped
created_at >= '2026-03-04') started failing in the
bq-marketing DAG. It surfaced because the new marketing model
installs_subscribers_by_channel_platform (tag:marketing) added a
source('posthog','tracks') dependency, which pulled
source_unique_posthog_tracks_event_id into
dbt test --select +tag:marketing for the first time — the
test count went 90 → 93 and ERROR 0 → 2. The job had
been green daily before that: the data was always duplicated, it was simply
untested in that DAG (and the owning model's unique test is commented out).
Measured in production: ~28.2M duplicate event_ids across
~56.9M rows, i.e. ~28.8M excess rows (0.92%) to remove.
Current status of tracks: Degraded / under review — dedup in progress.
Impact
- Affected test:
source_unique_posthog_tracks_event_id— FAIL (~69,880 failing rows in the scoped window at detection). - Affected DAG:
bq-marketing/test_marketing_tables(daily 06:00 UTC). The owningposthog-tracksDAG does not fail because its model-leveluniquetest is commented out. - Affected models: any consumer of
source('posthog','tracks')— marketinginstalls_subscribers_by_channel_platform, growthint_events_growth_union_events. Duplicates can double-count events in aggregates that don't dedup onevent_id. - Severity: ~0.92% of rows table-wide; not a single-day spike. Plus a secondary correctness bug (re-dating
created_at) that corrupts time-based analysis.
Lineage / failure point
old event UUIDs"] raw["posthog.raw_tracks"] merge["posthog_tracks (dbt MERGE)
ON event_id (global)"] legacy["legacy dup pairs
2025-12 backfill + Mar merges"] tracks["posthog.tracks
~28.8M excess rows"] test["source unique(event_id)
bq-marketing"] ph --> raw --> merge legacy --> tracks merge -->|"matches BOTH copies
updates both → re-dated"| tracks tracks --> test classDef broken fill:#ffd6d6,stroke:#c00,color:#000 classDef partial fill:#fff2cc,stroke:#c90,color:#000 class legacy,tracks broken class merge,test partial
Root cause
Two distinct parts, previously conflated:
Part 1 — how the duplicates were created: legacy backfills (NOT the live merge)
posthog.tracks carries a large legacy pool of duplicated
event_ids written by historical backfills. By bq_ingested_timestamp
month of the duplicate rows:
| Ingested month | Duplicate rows | Source |
|---|---|---|
| 2025-12 | ~37.5M | Bulk backfill on 2025-12-18 (the dominant pool) |
| 2026-03 | ~17.2M | merge_backfill_to_tracks.py runs on 2026-03-03 / 2026-03-16 |
| 2026-01 / 02 | smaller | Other backfill activity |
merge_backfill_to_tracks.py contributed because both its source dedup
and its MERGE ... ON are scoped to a single created_at day:
ON t.event_id = s.event_id
AND t.created_at >= @start_ts AND t.created_at < @end_ts -- one day
so the same event_id re-sent on a later day was not matched and was
inserted again. But the bulk of the pool is the 2025-12 backfill, not the
March per-day issue.
Part 2 — why they look "recent and ongoing": the live merge refreshes old pairs
The hourly model does not create duplicates. It is an incremental
merge, unique_key='event_id', ON (event_id) only
(global — verified from the compiled MERGE, no partition predicate). PostHog re-delivers
old event UUIDs constantly (~370–555 WHEN MATCHED updates per hourly merge).
When a re-delivered UUID is one of the legacy pairs:
ON event_idmatches both existing target rows.- BigQuery legally updates multiple target rows from one source row — both copies are overwritten with the source values → two byte-identical rows with microsecond-equal
bq_ingested_timestamp. created_atis rewritten to the re-delivery date, so the pair teleports into a recent partition and surfaces in "recent duplicate" queries.- Row count stays 2 — nothing is inserted.
Hence the table-wide duplicate count is flat week-over-week while ~15 "new" dupes/day appear in recent partitions: they migrate, they are not created.
Pitfall: bq_ingested_timestamp is a column value, not an insert time
bq_ingested_timestamp is a column value copied on UPDATE —
load_posthog_tracks.py stamps one datetime.now() per loader run
and writes it to every row in that batch. It is not a row insertion time.
The assumption "identical bq_ingested ⇒ inserted together" is false
and initially misled the investigation (it looks like the two copies were freshly inserted
as a pair, when in fact they are an old pair that was updated together in one merge).
Secondary data-quality bug — re-dating created_at across partitions
The same WHEN MATCHED THEN UPDATE rewrites created_at on every
re-delivery, silently re-dating historical events across partitions — an Oct-2025 event can
move into a June-2026 partition. This corrupts time-based analysis independent of the
duplicate question, and is what Part 2 of the fix addresses.
Evidence
| Metric | Value |
|---|---|
| Total rows | 3,115,774,101 |
Distinct event_id | 3,087,004,577 |
| Duplicate rows / excess to remove | 56,925,724 / 28,769,524 (0.92%) |
Duplicate event_ids | ~28,156,200 |
Null event_id | 0 |
Genuine collisions (one event_id → >1 (event_name, user_id)) | 0 |
- Time-travel proof (UPDATE, not INSERT) on
0199fda0-4b6e-7cdb-baa1-2c34cd39c4c6:- AS OF
2026-06-07 22:00+00(pre-merge): 2 rows, not identical —created_at 2025-10-19(ingested 2025-12-18) +created_at 2025-12-10(ingested 2026-03-16). - After the 06-07 merge: 2 rows, both
created_at 2026-06-07 21:38:46, both ingested2026-06-07 22:42:35.650218. Count 2 → 2; values rewritten.
- AS OF
- DML stats: every hourly merge shows
inserted ~300–480k, updated ~370–555, deleted 0— re-delivery + update is constant background behaviour. - Flat invariant: table-wide duplicate count identical now vs a 7-day-prior snapshot — zero net duplicate creation.
Because there are 0 genuine collisions, every duplicate is a true re-send
and deduping on event_id is lossless.
Fix and follow-ups
Part 1 — one-time dedup (dedup_tracks.py)
Zero-copy snapshot (14-day rollback), then a single atomic CREATE OR REPLACE
preserving partition / cluster / description, keeping:
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY bq_ingested_timestamp DESC) = 1
Removes ~28.8M rows; asserts rows == distinct event_ids == distinct-before and
0 nulls. Validated on staging with synthetic dupes (exact copies + cross-day
re-sends): 7000 → 5000, 2000 removed, each of 1000 dup ids collapsed to the latest-ingest
winner. Production dry-run confirms counts, generated DDL, and a ~7.2 TB scan; expected after
= 3,087,004,577 rows. This work is read-only here — no dedup was run.
Part 2 — stop re-delivery re-dating history (models/posthog_tracks.sql)
Add merge_exclude_columns=['created_at'] so a matched (re-delivered) row keeps
its original created_at and never changes partition. Ends the teleporting and
keeps any future legacy straggler from resurfacing as a "recent" dupe. Safe to ship before
the dedup.
Part 3 — re-enable the owning test
After dedup, re-enable the model-level unique test in
models/posthog_tracks.yml (scoped created_at >= '2026-03-04') and
fix the misleading comment, so the owning posthog-tracks DAG
catches any recurrence — not just bq-marketing.
Cautions & upstream follow-up
- Do not re-run
merge_backfill_to_tracks.pyas-is — its per-dayONmust match onevent_idglobally, or it reintroduces dupes. - Pause the DAG for the rewrite:
CREATE OR REPLACEreads a start-time snapshot; an hourly merge committing mid-rewrite could be overwritten. Rows loaded toraw_tracksduring the pause are merged by the next run (watermark onbq_ingested_timestamp) — no loss. Recommended window 08:00–09:00 UTC (low traffic, clear of 06:00/18:00 test runs). - Upstream ticket: PostHog batch export re-emits months-old UUIDs with a re-stamped
created_at; worth investigating at the source.
References
- Branch / PR:
NOREF/posthog-tracks-dedup - Migration runbook + write-up:
dags/data_platform/posthog_tracks/migrations/2026-06-12-dedup-event-id/README.md(on the dedup branch) - Scripts:
dedup_tracks.py,validate_dedup_staging.sql(dedup branch);merge_backfill_to_tracks.py,models/posthog_tracks.sql - Investigation by Claude Code (Opus 4.8) with an independent Claude Fable 5 verification session.
- qmb archived query sessions:
posthog-tracks-dedup-2026-06-12,fable-posthog-dup - posthog.tracks source page · PostHog Tracks home