GA4 + GSC Joins in BigQuery: Building a Query-to-Revenue View for Programmatic SEO
Most programmatic SEO teams track impressions and revenue in separate silos. Here's how we stitch GA4 and GSC together in BigQuery to get a real query-to-revenue view that actually drives roadmap decisions.

Every programmatic SEO team we've worked with hits the same wall around month six: GSC shows queries, GA4 shows revenue, and nobody can answer "which queries actually made us money this quarter." The keyword-not-provided era never really ended — it just moved to a different layer of the stack. Here's the BigQuery pattern we use to stitch the two sides back together without lying to ourselves about attribution.
Why the join is hard in the first place
Google deliberately doesn't give you a query field on the GA4 session. GSC has the query but not the user, the conversion, or the revenue. The two systems also disagree on almost everything: time zones, sampling, deduplication, what counts as a "click" versus a "session," and how late-arriving data gets reconciled.
So any join you build is a probabilistic bridge, not a primary key match. The honest goal is: for a given landing page on a given day, distribute GSC query impressions and clicks across the GA4 sessions and revenue that actually landed there. That's the model. Anyone promising you a true 1:1 query→user join is selling something.
What you need turned on
Before any of this works:
- GA4 → BigQuery export enabled (daily, ideally streaming too)
- GSC → BigQuery bulk data export enabled (this gives you
searchdata_url_impressionandsearchdata_site_impression) - Both datasets in the same region, or you'll pay egress on every query
- A
dim_pagestable you control, with canonical URL, template type, and content cluster
If you don't have GSC bulk export on yet, stop reading and go turn it on. The sampled API will not get you there at programmatic scale.
The data model: three layers, not one giant view
We build this as three stacked layers in BigQuery. Mixing them in one query is how you end up with 40-minute dashboards and wrong numbers.
Layer 1: normalised daily facts
Two tables, both keyed on (date, page_url):
fct_gsc_daily— impressions, clicks, avg position, and an array of top queries per URL per dayfct_ga4_daily— sessions, engaged sessions, conversions, revenue, by landing page per day
The trick is URL normalisation. GSC reports the URL Google indexed; GA4 reports page_location as the browser saw it. Trailing slashes, query strings, fragments, AMP variants, and locale prefixes all break the join. We run both through the same SQL UDF.
CREATE OR REPLACE FUNCTION `analytics.normalize_url`(u STRING) AS (
REGEXP_REPLACE(
REGEXP_REPLACE(
LOWER(SPLIT(u, '#')[OFFSET(0)]),
r'\?.*$', ''
),
r'/$', ''
)
);
Not fancy. It just needs to be the same function called on both sides. Version it, and never "improve" it without a backfill.
Layer 2: the query-to-page allocation
This is where the modelling happens. For each (date, page_url), GSC gives you N queries with their own click counts. GA4 gives you M sessions on that page. We allocate sessions and revenue across queries proportionally to GSC clicks for that page on that day.
WITH gsc AS (
SELECT
data_date AS date,
`analytics.normalize_url`(url) AS page_url,
query,
SUM(clicks) AS gsc_clicks,
SUM(impressions) AS gsc_impressions
FROM `project.searchconsole.searchdata_url_impression`
WHERE is_anonymized_query = FALSE
GROUP BY 1,2,3
),
page_clicks AS (
SELECT date, page_url, SUM(gsc_clicks) AS total_clicks
FROM gsc GROUP BY 1,2
),
ga4 AS (
SELECT date, page_url, sessions, revenue, conversions
FROM `project.analytics.fct_ga4_daily`
)
SELECT
g.date,
g.page_url,
g.query,
g.gsc_clicks,
g.gsc_impressions,
SAFE_DIVIDE(g.gsc_clicks, p.total_clicks) AS click_share,
ga.sessions * SAFE_DIVIDE(g.gsc_clicks, p.total_clicks) AS allocated_sessions,
ga.revenue * SAFE_DIVIDE(g.gsc_clicks, p.total_clicks) AS allocated_revenue,
ga.conversions * SAFE_DIVIDE(g.gsc_clicks, p.total_clicks) AS allocated_conversions
FROM gsc g
JOIN page_clicks p USING (date, page_url)
JOIN ga4 ga USING (date, page_url);
A few things to flag here, because they bite people:
is_anonymized_query = FALSE— GSC hides low-volume queries. Those rows have noqueryvalue. If you include them, your click_share denominators get weird. Decide once: do you allocate only across known queries, or do you bucket the anonymised tail as(anonymized)and treat it as its own "query"? We prefer the second; it keeps the totals honest.- Direct/referral sessions on the same URL get included. If a page gets organic and paid traffic, you're over-attributing revenue to organic queries. Filter
ga4tosession_source = 'google'andsession_medium = 'organic'before the join. Most teams forget this and inflate SEO revenue by 20–40% in our experience. - Click share is a model, not a measurement. Document it. Put it in the table description.
Layer 3: the rollups people actually look at
Nobody opens BigQuery to scan a query-level table with 80M rows. Build the aggregates the team will use:
agg_revenue_by_query_cluster— group queries into intent clusters (transactional, informational, branded)agg_revenue_by_template— join todim_pagesand roll up by template typeagg_query_opportunity— high impressions, low click share, decent allocated revenue per click → expansion targets
This third one is where programmatic SEO teams find their next 90 days of work.
Reconciling with the GA4 UI (so finance doesn't yell)
Your SUM(allocated_revenue) for organic Google will not exactly match GA4's "Organic Search" channel revenue. Expected reasons:
- GA4 thresholding and modelling on small segments
- Late conversions arriving after the daily export cut
- Cross-device modelled sessions that don't have a
page_locationin the raw export - Your organic filter being stricter than GA4's channel grouping
We aim for within 3–5%. If you're off by more, the URL normaliser is usually the culprit. Log the unmatched URLs from both sides daily — that diff catalogue is the single most useful debugging tool in this whole pipeline.
What this unlocks
Once the table exists, the questions change. Instead of "which pages get traffic," the product team starts asking:
- Which query clusters have a revenue-per-click above our content production cost?
- Which templates convert organic clicks at half the site average — and should we deindex them?
- Which queries are growing in impressions but flat in allocated revenue (intent mismatch)?
- Which pages have one breakout query carrying 80% of the revenue (fragility risk)?
That last one matters more than people think. A single algorithm update on a single query can erase a quarter. The query-to-revenue view makes that concentration visible before it bites.
Things we got wrong the first few times
- We allocated by impressions instead of clicks. Position 8 impressions don't generate sessions. Use clicks.
- We didn't lock the URL normaliser. Someone "fixed" it, the join recall jumped, and dashboards moved 12% overnight with no real-world change. Version it like code.
- We let the query table run unpartitioned. Costs spiralled. Partition by
date, cluster bypage_url. Set a 25-month expiration to match GSC's own retention. - We treated the allocation as truth in exec decks. Always show the model assumption inline. "Revenue allocated proportionally to GSC clicks per URL per day" — one sentence, every chart.
Where we'd start
If you're standing this up next week: turn on both BigQuery exports today, write the URL normaliser as a UDF before you write any SQL, and build Layer 1 with a hard organic filter on the GA4 side. Don't touch Layer 3 dashboards until Layer 2 reconciles within 5% of the GA4 UI for four consecutive weeks. The teams that skip that reconciliation step end up rebuilding the whole pipeline a quarter later — we've seen it three times, and we've done it once ourselves.
If you want a hand wiring this into an existing data warehouse, our data and analytics engineering team does this kind of work, and we've written more on the upstream pieces in the programmatic SEO posts on the blog.
Want a team like ours?
72Technologies builds production software for the kind of teams who actually read this blog.
Start a projectKeep reading

Content Freshness Signals at Scale: When to Actually Re-Publish Programmatic Pages
Bulk-updating dateModified on a million pages is a great way to get ignored — or worse. Here's how we decide which programmatic pages deserve a real refresh, and how to wire the signal cleanly.

Faceted Navigation on Programmatic SEO Sites: Rules That Keep Google Sane
Facets are where programmatic SEO sites quietly bleed crawl budget and rank signals. Here's the rule set we use to decide which combinations earn a URL, which get noindex, and which never see a link.

GSC Bulk Data Export to BigQuery: A Practical Setup for Programmatic SEO Teams
The Search Console UI tops out at 1,000 rows and 16 months. If you run programmatic SEO, that's not enough. Here's how we wire GSC's BigQuery export into a query workflow that actually drives decisions.
