SEO & Growth · May 12, 2026 · 6 min read

Building a Programmatic SEO Data Model That Doesn't Collapse at 50,000 Pages

Most programmatic SEO sites die at the 10k-page mark — not from Google penalties, but from data rot. Here's the data model and content pipeline we use to keep large pSEO sites indexable, fast, and useful.


Programmatic SEO sounds simple on a whiteboard: take a template, multiply it by a dataset, ship 30,000 pages. The problem is what happens six months later, when half those pages have stale facts, your sitemap is 80MB, and Google quietly stops indexing anything new. We've inherited enough of these projects to recognise the pattern — and most of the damage is upstream, in the data model.

This is how we structure pSEO projects so they keep earning traffic past the 10k-URL mark instead of becoming a liability.

Start With the Entity, Not the URL

The first mistake teams make is designing pSEO pages as templates filled from a flat spreadsheet: city, service, price, render. It works for the first 500 URLs. It breaks the moment your dataset has relationships — a city has neighbourhoods, a service has sub-services, a product has variants — and suddenly your "template" is a tangle of conditional Liquid.

Think in entities and relationships instead. A pSEO site is a small knowledge graph that happens to render as HTML.

// Core entities, not page templates
type Location = {
  id: string;          // stable, never changes
  slug: string;        // can change with redirects
  name: string;
  parentId?: string;   // country > region > city
  population?: number;
  geo: { lat: number; lng: number };
};

type Service = {
  id: string;
  slug: string;
  name: string;
  parentId?: string;   // service hierarchy
  appliesTo: string[]; // entity types it pairs with
};

type PageBinding = {
  id: string;          // hash of (template + entity ids)
  template: 'service-in-location' | 'service' | 'location';
  entities: { locationId?: string; serviceId?: string };
  status: 'draft' | 'live' | 'noindex' | 'redirected';
  qualityScore: number; // 0-1, computed
  lastFactCheck: string; // ISO date
};

The PageBinding is the unit that becomes a URL. Notice what's not there: the actual content. Content is generated from the binding plus the underlying entity data, never stored as a frozen blob you'll forget to update.
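
To make the derivation concrete, here's a minimal sketch of the render step, assuming a hypothetical EntityStore and renderTemplate helper (neither is prescribed by the model above):

// A sketch: content is always derived at build time, never stored.
interface EntityStore {
  getLocation(id: string): Location | undefined;
  getService(id: string): Service | undefined;
}

declare function renderTemplate(
  template: PageBinding['template'],
  data: { location?: Location; service?: Service }
): string;

function renderPage(binding: PageBinding, db: EntityStore): string {
  // Resolve live entity data on every build; no frozen HTML blobs.
  const location = binding.entities.locationId
    ? db.getLocation(binding.entities.locationId)
    : undefined;
  const service = binding.entities.serviceId
    ? db.getService(binding.entities.serviceId)
    : undefined;
  return renderTemplate(binding.template, { location, service });
}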

Stable IDs Save You From Yourself

Give every entity a stable internal ID that never changes, separate from its slug. When the marketing team decides "plumbers-london" should become "emergency-plumbers-london", you change the slug, record a 301 from the old path against the stable ID, and the underlying graph is untouched. We've seen six-figure traffic losses caused by teams that used slugs as primary keys and lost the redirect trail during a re-platform.
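
A sketch of what that rename looks like when IDs are stable (the Redirect shape and pathFor helper are illustrative, not a prescribed schema):

type Redirect = {
  fromPath: string;    // old URL, derived from the old slug
  toEntityId: string;  // resolved to the entity's current slug at serve time
  createdAt: string;
};

// Renaming a slug never touches the graph: one field update, one 301 record.
function renameSlug(
  entity: Location | Service,
  newSlug: string,
  pathFor: (slug: string) => string,
  redirects: Redirect[]
): void {
  redirects.push({
    fromPath: pathFor(entity.slug),
    toEntityId: entity.id, // chained renames still resolve correctly
    createdAt: new Date().toISOString(),
  });
  entity.slug = newSlug; // the id stays put, so every reference still holds
}

Because the redirect targets the ID rather than a hardcoded destination URL, a second rename later doesn't leave a broken chain.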

The Quality Gate Is Non-Negotiable

Google's 2024–2025 spam updates made one thing clear: "scaled content abuse" is a category, and the bar is whether each page provides distinct value. A pSEO page that's 90% template and 10% variable data will get classified as thin, no matter how clever your headings are.

We gate every page on a computed qualityScore before it's allowed into the sitemap. The score combines:

  • Data density: how many non-null, page-specific facts are bound to this entity combo
  • Uniqueness delta: text similarity against sibling pages (we use MinHash on shingles)
  • Evidence: does the page link to at least one primary source or first-party dataset
  • Intent match: does a real query exist for this combination (pulled from GSC or a keyword API)

Pages below the threshold get a noindex tag and a status: 'draft' flag. They still exist for internal navigation if needed, but they don't enter the sitemap and they don't dilute crawl budget.

function computeQualityScore(binding: PageBinding, ctx: BuildContext): number {
  // Data density: bound facts vs. what this template is expected to carry.
  const density = ctx.factsForBinding(binding).length / ctx.expectedFactCount(binding.template);
  // Uniqueness delta: 1 minus the closest sibling's text similarity.
  const uniqueness = 1 - ctx.maxSimilarityToSiblings(binding);
  // Binary signals: at least one primary source, at least one real query.
  const hasEvidence = ctx.evidenceCount(binding) > 0 ? 1 : 0;
  const hasIntent = ctx.queryVolume(binding) > 0 ? 1 : 0;

  return (
    0.35 * Math.min(density, 1) +
    0.30 * uniqueness +
    0.20 * hasEvidence +
    0.15 * hasIntent
  );
}

The weights are arguments, not gospel. Tune them per project. The point is that quality is a number, computed at build time, and pages below the threshold don't ship.
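
On the uniqueness term: MinHash is the at-scale optimisation, but the underlying measure is just Jaccard similarity on word shingles. A minimal exact version, fine up to a few thousand siblings before you'd swap MinHash in:

// Jaccard similarity on k-word shingles. MinHash approximates this
// when pairwise comparison across thousands of siblings gets too slow.
function shingles(text: string, k = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + k <= words.length; i++) {
    out.add(words.slice(i, i + k).join(' '));
  }
  return out;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  for (const s of a) if (b.has(s)) intersection++;
  return intersection / (a.size + b.size - intersection);
}

// maxSimilarityToSiblings: highest overlap against any sibling page's text.
function maxSimilarity(pageText: string, siblingTexts: string[]): number {
  const base = shingles(pageText);
  return siblingTexts.reduce(
    (max, t) => Math.max(max, jaccard(base, shingles(t))),
    0
  );
}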

Content Velocity vs. Freshness Decay

Velocity — how many new pages you ship per week — gets all the attention. Freshness — how stale your existing pages are — quietly kills sites. A page about "average rent in Manchester" that hasn't been touched since 2023 is worse than no page at all.

Build a freshness budget into the pipeline. Every entity gets a staleAfter window based on how fast its underlying facts change:

  • Pricing or availability data: 7–30 days
  • Statistical or demographic data: 90–180 days
  • Evergreen reference data: 365 days

A nightly job flags bindings whose underlying entity hasn't been refreshed within its window. Stale pages get re-rendered if the source data has changed, or flipped to noindex if the source is itself stale. In our experience, sites that enforce this stay in growth mode for years; sites that skip it peak around month nine and then decline.
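
A sketch of that nightly pass, with staleAfter windows keyed by template (the per-template numbers below are placeholders; set them from the ranges above):

// Days each template's facts stay trustworthy; tune per data source.
const STALE_AFTER_DAYS: Record<PageBinding['template'], number> = {
  'service-in-location': 30,
  service: 180,
  location: 365,
};

function sweepForStaleness(
  bindings: PageBinding[],
  sourceFreshAt: (b: PageBinding) => string, // when the source data last changed
  now = new Date()
): void {
  for (const b of bindings) {
    const ageDays =
      (now.getTime() - new Date(b.lastFactCheck).getTime()) / 86_400_000;
    if (ageDays <= STALE_AFTER_DAYS[b.template]) continue;

    if (new Date(sourceFreshAt(b)) > new Date(b.lastFactCheck)) {
      // Source moved: re-render and reset the clock.
      b.lastFactCheck = now.toISOString();
    } else {
      // Source itself is stale: stop serving the page to the index.
      b.status = 'noindex';
    }
  }
}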

Don't Confuse Re-rendering With Updating

Re-running the template doesn't make a page fresh. If your underlying dataset hasn't moved, you've just rewritten the same HTML with a new lastmod and trained Google to ignore your sitemap signals. Only bump lastmod when the facts changed, not when the template did.
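
One way to enforce that, assuming Node's built-in crypto and a build manifest keyed by binding ID: hash the bound facts and only bump lastmod when the hash moves.

import { createHash } from 'node:crypto';

type ManifestEntry = { factHash: string; lastmod: string };

// lastmod tracks the facts, not the template. A template-only rebuild
// produces the same hash, so the sitemap signal stays honest.
// (Assumes the facts object serialises deterministically.)
function updateLastmod(
  bindingId: string,
  facts: unknown,
  manifest: Map<string, ManifestEntry>,
  now = new Date()
): string {
  const factHash = createHash('sha256')
    .update(JSON.stringify(facts))
    .digest('hex');
  const prev = manifest.get(bindingId);
  if (prev && prev.factHash === factHash) return prev.lastmod;
  const entry = { factHash, lastmod: now.toISOString() };
  manifest.set(bindingId, entry);
  return entry.lastmod;
}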

Internal Linking as a Graph Problem

Once you have entities and bindings, internal linking stops being a manual chore. Every page should link to:

  • Its parent in the hierarchy (city → region, sub-service → service)
  • 3–8 sibling entities (other cities in the region, related services)
  • 1–2 cross-axis pages (the city-only page, the service-only page)

Generate these at build time from the graph. Avoid randomising the picks — Google's renderer caches, and stable internal links help crawl prioritisation. Pick siblings by a deterministic relevance function (geographic proximity, semantic similarity, shared parent) and only churn them when the underlying graph changes.
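
A minimal version of that deterministic pick for location siblings, using squared lat/lng distance as a rough stand-in for geographic proximity and the stable ID as a tiebreaker, so the output only changes when the graph does:

// Squared euclidean on lat/lng is a crude but deterministic proxy for
// "nearby" at city scale; swap in haversine if precision matters.
function pickSiblingLinks(
  page: Location,
  siblings: Location[],
  count = 8
): Location[] {
  const dist = (a: Location, b: Location) =>
    (a.geo.lat - b.geo.lat) ** 2 + (a.geo.lng - b.geo.lng) ** 2;
  return siblings
    .filter((s) => s.id !== page.id && s.parentId === page.parentId)
    .sort(
      (a, b) => dist(page, a) - dist(page, b) || a.id.localeCompare(b.id)
    )
    .slice(0, count);
}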

This is the layer where most pSEO sites under-invest. We've taken sites from "indexed but not ranking" to "ranking" purely by rebuilding the internal link graph, no new content.

Schema, Sitemaps, and Crawl Budget

Large pSEO sites need to make Google's job cheap. Three concrete things:

  1. Sharded sitemaps, max 50k URLs or 50MB each, grouped by template type: sitemap-service-location.xml, sitemap-location.xml, etc. When indexing drops on one shard, you know exactly which template is the problem. (A minimal sharding sketch follows this list.)
  2. Conditional schema.org markup. Only emit LocalBusiness schema on pages where you actually have an address. Emitting empty or templated schema is worse than emitting none — it gets flagged as misleading.
  3. If-Modified-Since support on your HTML responses. Cheap to implement, dramatically reduces wasted crawl on unchanged pages. If you're on a static host, set proper Last-Modified headers from your build manifest.
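
The sharding sketch: group live bindings by template, cap each shard at the protocol's URL limit, and name the files so a coverage drop points straight at a template (urlFor is a hypothetical path helper):

const MAX_URLS_PER_SITEMAP = 50_000; // sitemap protocol cap per file

declare function urlFor(binding: PageBinding): string; // hypothetical

function buildSitemapShards(
  bindings: PageBinding[]
): Map<string, string[]> {
  const shards = new Map<string, string[]>();
  for (const b of bindings) {
    if (b.status !== 'live') continue; // drafts and noindex never ship
    const key = `sitemap-${b.template}`;
    const urls = shards.get(key) ?? [];
    urls.push(urlFor(b));
    shards.set(key, urls);
  }
  // Split any shard that exceeds the cap into numbered files.
  const out = new Map<string, string[]>();
  for (const [key, urls] of shards) {
    for (let i = 0; i * MAX_URLS_PER_SITEMAP < urls.length; i++) {
      out.set(
        `${key}${i === 0 ? '' : `-${i + 1}`}.xml`,
        urls.slice(i * MAX_URLS_PER_SITEMAP, (i + 1) * MAX_URLS_PER_SITEMAP)
      );
    }
  }
  return out;
}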

Measuring What Actually Matters

GSC is the source of truth, not your rank tracker. Pipe the Search Console API into your warehouse daily and join it against your binding table. Now you can answer questions like:

  • Which templates have the worst impressions-to-clicks ratio? (Probably a title tag problem.)
  • Which bindings have impressions but zero clicks for 30 days? (Candidates for noindex or rewrite.)
  • Which entities drive 80% of clicks? (Invest more data into those.)
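
Assuming the daily GSC export lands as rows of (page URL, clicks, impressions), the join back to bindings is a URL lookup. A sketch of the second question above, bindings with impressions but zero clicks:

type GscRow = { page: string; clicks: number; impressions: number };

// Bindings with impressions but no clicks over the window: candidates
// for a noindex or a rewrite rather than more pages.
function zeroClickBindings(
  rows: GscRow[],
  bindingIdForUrl: (url: string) => string | undefined // join key: URL -> binding
): string[] {
  const totals = new Map<string, { clicks: number; impressions: number }>();
  for (const row of rows) {
    const id = bindingIdForUrl(row.page);
    if (!id) continue;
    const t = totals.get(id) ?? { clicks: 0, impressions: 0 };
    t.clicks += row.clicks;
    t.impressions += row.impressions;
    totals.set(id, t);
  }
  return [...totals]
    .filter(([, t]) => t.impressions > 0 && t.clicks === 0)
    .map(([id]) => id);
}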

GA4 is useful for engagement signals — scroll depth, time on page, outbound clicks — but don't optimise for it directly. Optimise for the GSC query-to-click funnel and let GA4 confirm the page is actually useful once people land.

A Note on AdSense

If the site monetises through AdSense, the quality gate matters double. Thin pSEO pages with ads stacked above the fold are exactly what the policy team flags. Keep ad density proportional to content depth — pages with qualityScore < 0.6 should probably not carry ads at all, even if they're indexed for navigational reasons.

Where We'd Start

If you're inheriting a pSEO site that's plateaued, don't write more pages. In order:

  1. Audit the data model. If slugs are primary keys, fix that first.
  2. Compute a quality score for every existing page and noindex the bottom 20–30%. Traffic usually goes up within 6–8 weeks.
  3. Add a freshness job before you add a generation job.
  4. Rebuild internal links from the entity graph.
  5. Only then, think about new templates or new entities.

Programmatic SEO at scale is a database problem with an HTML output. Treat it like one and the content engine becomes maintainable. Treat it like a content marketing exercise and you'll be rewriting the same site every 18 months. If you want a second pair of eyes on a pSEO project that's not scaling, our growth engineering team does exactly this kind of audit.

