SEO & Growth · May 22, 2026 · 6 min read

Content Velocity Without Thin Pages: An Engineering Playbook for Programmatic SEO

Publishing 10,000 pages a month is easy. Publishing 10,000 pages that don't get classified as thin content is the actual engineering problem. Here's how we approach it.

Anyone can generate 50,000 location pages over a weekend. The hard part is shipping them without Google's quality systems quietly capping your indexation at 8% and leaving the rest in "Crawled – currently not indexed" purgatory. Content velocity is a throughput problem, but the constraint is almost always quality, not generation speed.

This is how we think about velocity on programmatic SEO builds: as a pipeline with quality gates, not a publish button.

The real bottleneck isn't generation, it's uniqueness

When teams brag about velocity, they usually mean "how fast can our template render rows from a database". That number is meaningless. The number that matters is how many of those rendered pages survive a uniqueness and usefulness check.

In our experience working on directory, marketplace, and comparison sites, the failure modes are predictable:

Pages that differ only by a city name and a swapped noun
Pages where 80%+ of visible tokens are boilerplate (nav, footer, CTAs, FAQ blocks)
Pages with zero unique data — just a templated paragraph wrapping a name
Pages targeting queries that don't actually exist in search demand

A thin-content classifier doesn't need to be clever to catch these. Shingling and simple n-gram overlap will do it. So will a human reviewer deciding whether the site deserves a manual action.

The fix isn't writing better templates. It's gating publication on data coverage.

Data coverage as the primary gate

Before a page is even allowed into the render queue, we score the underlying row on how much unique, structured data it carries. A rough heuristic we use:

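def coverage_score(row, required_fields, optional_fields):
    """Score a row from 0.0 to 1.0 by how many structured fields are populated."""
    # Empty values (None, "", [], 0) count as missing.
    required_hits = sum(1 for f in required_fields if row.get(f))
    optional_hits = sum(1 for f in optional_fields if row.get(f))

    if required_hits < len(required_fields):
        return 0.0  # hard fail: a required field is missing

    if not optional_fields:
        return 1.0  # assumption: nothing optional to miss; also avoids dividing by zero

    # All required fields earn a 0.6 base; optional fields fill the remaining 0.4
    return min(1.0, 0.6 + 0.4 * (optional_hits / len(optional_fields)))

# Example: a "plumber in city" page
required = ["business_name", "address", "phone", "hours"]
optional = ["reviews", "services", "photos", "certifications", "price_range"]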

If coverage_score is below ~0.75, the row doesn't get a page. It gets rolled into a parent listing instead. This is the single most effective lever we've found for keeping the index-to-publish ratio healthy.
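To make the threshold concrete, here is a minimal sketch that applies the heuristic above to two made-up rows. The sample rows, the `PUBLISH_THRESHOLD` constant, and the `route` helper are illustrative assumptions, not part of any real pipeline.

```python
PUBLISH_THRESHOLD = 0.75  # assumed cutoff, mirroring the ~0.75 in the text

REQUIRED = ["business_name", "address", "phone", "hours"]
OPTIONAL = ["reviews", "services", "photos", "certifications", "price_range"]


def coverage_score(row, required_fields, optional_fields):
    """Same heuristic as above, repeated so this sketch runs on its own."""
    if any(not row.get(f) for f in required_fields):
        return 0.0  # hard fail: a required field is missing or empty
    if not optional_fields:
        return 1.0  # assumption: nothing optional to miss
    optional_hits = sum(1 for f in optional_fields if row.get(f))
    return min(1.0, 0.6 + 0.4 * optional_hits / len(optional_fields))


def route(row):
    """Return 'own_page' or 'parent_listing' for a row (hypothetical helper)."""
    score = coverage_score(row, REQUIRED, OPTIONAL)
    return "own_page" if score >= PUBLISH_THRESHOLD else "parent_listing"


# Made-up sample rows: all required fields plus one or two optional ones.
base = {"business_name": "Acme Plumbing", "address": "1 Main St",
        "phone": "555-0100", "hours": "9-5"}
one_optional = {**base, "reviews": ["Great service"]}
two_optional = {**one_optional, "services": ["drain cleaning"]}

for label, row in [("one optional", one_optional), ("two optional", two_optional)]:
    print(label, round(coverage_score(row, REQUIRED, OPTIONAL), 2), route(row))
```

With five optional fields the score moves in steps of 0.08 (0.60, 0.68, 0.76, ...), so a ~0.75 cutoff works out to "every required field plus at least two optional ones".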

Build a publish queue, not a publish button

Most programmatic SEO systems treat publication as a build step: regenerate the site, push, done. That's fine at 500 pages. At 50,000 it's reckless, because you lose the ability to observe what each cohort of pages does.

We model publication as a queue with cohorts:

rows_eligible  →  cohort_builder  →  staged  →  published  →  monitored

Each cohort is a batch of pages sharing a template, a data shape, and a publish date. Cohorts are typically 500 – 2,000 pages. Why batches?

You can measure indexation rate per cohort in GSC after 14 – 28 days
You can roll back a bad cohort without nuking the site
You can A/B template variants across cohorts
You avoid the "100k URLs appeared on Tuesday" signal that tends to age badly

A cohort that hits less than ~40% indexation after 30 days is a signal to pause the next batch and investigate. Sometimes it's a template issue, sometimes it's a query-demand issue, sometimes the data simply isn't dense enough.
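The pause rule can be sketched as a small check. This is only an illustration: the function name and parameters are hypothetical, and the indexed count is assumed to come from whatever GSC export you already have.

```python
from datetime import date


def should_hold_next_batch(indexed_pages, published_pages, published_at,
                           today, min_index_rate=0.4, min_age_days=30):
    """Return True when a mature cohort is indexing below the floor."""
    if published_pages <= 0:
        raise ValueError("published_pages must be positive")
    age_days = (today - published_at).days
    if age_days < min_age_days:
        return False  # too early to judge; indexation is still settling
    return indexed_pages / published_pages < min_index_rate


# Made-up numbers: 400 of 1,240 pages indexed, 35 days after publish.
hold = should_hold_next_batch(400, 1240, date(2026, 3, 11), date(2026, 4, 15))
```

Returning `False` for young cohorts is a design choice in this sketch: it keeps a slow first week from freezing the queue before the signal is meaningful.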

What cohort metadata should look like

{
  "cohort_id": "plumbers-tx-2026-03-batch-04",
  "template_version": "v7",
  "row_count": 1240,
  "avg_coverage_score": 0.82,
  "published_at": "2026-03-11",
  "sitemap_partition": "sitemap-plumbers-tx-04.xml",
  "gsc_property": "sc-domain:example.com",
  "hold_next_batch_if_index_rate_below": 0.4
}

That metadata is the bridge between your content pipeline and your analytics pipeline. It's what lets you ask "which template version is indexing best?" instead of "why is traffic flat?"
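As a rough illustration of that question, the sketch below aggregates index rate per template version. It assumes each cohort record has been joined with an `indexed_count` taken from GSC; that field is hypothetical and not part of the metadata shown above.

```python
from collections import defaultdict


def index_rate_by_template(cohorts):
    """Aggregate indexed / published pages per template version."""
    totals = defaultdict(lambda: [0, 0])  # version -> [indexed, published]
    for cohort in cohorts:
        bucket = totals[cohort["template_version"]]
        bucket[0] += cohort["indexed_count"]  # assumed field, joined from GSC
        bucket[1] += cohort["row_count"]
    return {
        version: indexed / published
        for version, (indexed, published) in totals.items()
        if published  # skip empty versions to avoid dividing by zero
    }


# Made-up cohorts for illustration only.
sample_cohorts = [
    {"template_version": "v6", "row_count": 1000, "indexed_count": 300},
    {"template_version": "v7", "row_count": 1240, "indexed_count": 620},
    {"template_version": "v7", "row_count": 1000, "indexed_count": 500},
]
rates = index_rate_by_template(sample_cohorts)
```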

Uniqueness scoring before publish, not after

The classic mistake: render the page, ship it, then discover three months later that 60% of your pages share a 400-word intro paragraph.

Run a shingled similarity check at build time. MinHash with a Jaccard threshold around 0.7 is cheap and good enough for most catalogs. If a candidate page's body shingles overlap more than the threshold with any already-published page in the same template family, it fails the gate.

A few practical notes from doing this in production:

Strip nav, footer, and repeating UI before shingling. Otherwise everything looks 95% similar.
Hash on 5 – 7 word shingles. Smaller and you get noise, larger and you miss paraphrases.
Store shingle signatures, not the full text. A 128-permutation MinHash signature with 64-bit hash values is ~1KB per page.
Compare within template family, not across the whole site. A plumber page and a dentist page should look different anyway.
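The notes above can be sketched with nothing but the standard library. This is a minimal illustration, not a production implementation: it assumes the text has already had nav, footer, and repeating UI stripped, it uses a simple linear hash family in place of true permutations, and every function name here is hypothetical. A dedicated MinHash library would be the more likely choice at real scale.

```python
import hashlib
import random
import re

NUM_PERM = 128
_PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus
_rng = random.Random(42)  # fixed seed so stored signatures stay comparable
_PARAMS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME))
           for _ in range(NUM_PERM)]


def shingles(text, k=5):
    """Return the set of k-word shingles from already-stripped body text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _base_hash(shingle):
    """Stable 64-bit hash of one shingle (Python's hash() varies per process)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set):
    """Compute a NUM_PERM-slot MinHash signature for a set of shingles."""
    if not shingle_set:
        raise ValueError("cannot sign an empty document")
    hashes = [_base_hash(s) for s in shingle_set]
    return [min((a * h + b) % _PRIME for h in hashes) for a, b in _PARAMS]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def passes_uniqueness_gate(candidate_sig, family_sigs, threshold=0.7):
    """Fail the gate if any published page in the family is too similar."""
    return all(estimated_jaccard(candidate_sig, s) <= threshold
               for s in family_sigs)
```

Only the signatures need to be stored per template family; a candidate page is signed once at build time and compared against that family's list.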

Treat search demand as an input, not a hope

There is no point generating a page for a query nobody searches. This sounds obvious, but most pSEO templates are built on a data shape (rows in a table) without ever checking whether those rows correspond to real queries.

Our rule: every template needs a demand source before it gets approved. That can be:

GSC impressions on existing similar pages
Keyword tool data with a minimum monthly volume floor (we usually pick something modest like 20 – 50 searches/month per page)
Autocomplete and "People also ask" scraping for the head term
Internal site search logs

Rows that don't map to any demand signal still get surfaced, but as an entry on a parent hub page rather than as a standalone URL. This keeps the long tail accessible to crawlers without inflating your URL count with zero-demand pages.
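A minimal sketch of that routing decision, assuming each row carries a `target_query` and that you have a lookup of monthly search volume per query from whichever demand source the template uses. Both names are hypothetical, and the floor of 20 is simply the low end of the range mentioned above.

```python
VOLUME_FLOOR = 20  # assumed floor: low end of the 20 to 50 searches/month range


def placement(row, monthly_volume_by_query, floor=VOLUME_FLOOR):
    """Return 'standalone' if the row's target query clears the floor, else 'hub_entry'."""
    volume = monthly_volume_by_query.get(row["target_query"], 0)
    return "standalone" if volume >= floor else "hub_entry"


# Made-up demand data for illustration only.
demand = {"plumber in austin": 320, "plumber in marfa": 10}
austin = placement({"target_query": "plumber in austin"}, demand)
marfa = placement({"target_query": "plumber in marfa"}, demand)
unknown = placement({"target_query": "plumber in nowhere"}, demand)
```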

If you want to see how we think about the underlying data model that makes this possible, our services pages cover the build side.

The brand-safety layer (yes, even for B2B)

If you're running AdSense or any programmatic ad network on these pages, brand safety becomes a publish gate too. User-generated fields are the usual culprit — business names, review snippets, free-text descriptions.

We run candidate text through a classifier pass before publish. Not anything exotic; a small model checking for:

Adult, gambling, weapons, hate categories
Health claims that look like medical advice
Financial claims that look like investment advice
Obvious profanity in user-submitted fields

Anything flagged either gets the field stripped or the page demoted to noindex. The cost of one policy strike across a 30,000-page property is much higher than the cost of running a classifier over each row.
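One way to wire that decision is sketched below. The classifier is passed in as a plain callable, and the keyword-based `toy_classifier` is a stand-in for demonstration only, not a real model. The field names, the label names, and the rule for which labels trigger noindex versus a field strip are all assumptions made for this example.

```python
from typing import Callable, Dict, Set, Tuple

USER_FIELDS = ("business_name", "review_snippet", "description")  # assumed names
NOINDEX_LABELS = {"adult", "gambling", "weapons", "hate"}  # assumed policy split


def apply_brand_safety(row: Dict[str, str],
                       classify: Callable[[str], Set[str]]) -> Tuple[Dict[str, str], bool]:
    """Return (cleaned_row, noindex) after checking user-generated fields."""
    cleaned = dict(row)  # copy so the caller's row is not mutated
    noindex = False
    for field in USER_FIELDS:
        text = cleaned.get(field)
        if not text:
            continue
        labels = classify(text)
        if labels & NOINDEX_LABELS:
            noindex = True  # severe category: demote the whole page
        elif labels:
            cleaned[field] = ""  # softer flag: strip the field, keep the page
    return cleaned, noindex


def toy_classifier(text: str) -> Set[str]:
    """Keyword stand-in for a real classifier; for demonstration only."""
    lowered = text.lower()
    labels = set()
    if "casino" in lowered:
        labels.add("gambling")
    if "cures" in lowered:
        labels.add("health_claim")
    return labels
```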

A reasonable gate stack

In order, before a row becomes a published URL:

1. Data coverage score ≥ threshold
2. Demand signal present
3. Uniqueness check vs template family
4. Brand-safety classifier pass
5. Internal link plan (parent hub + 2 – 3 sibling links minimum)
6. Schema validation
7. Cohort assignment

Miss any one of these and the row is parked, not published. Parked rows get re-evaluated weekly as the underlying data improves.
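The stack above can be sketched as an ordered list of named checks where the first failure parks the row. This assumes each gate's result has already been computed and stored on the row; every key name below is hypothetical, and the thresholds simply echo the approximate figures used earlier in the article.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[dict], bool]]


def first_failed_gate(row: dict, gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first failure, or None."""
    for name, check in gates:
        if not check(row):
            return name
    return None


# Each check reads a precomputed value from the row; all keys are assumed names.
GATES: List[Gate] = [
    ("data_coverage", lambda r: r.get("coverage_score", 0) >= 0.75),
    ("demand_signal", lambda r: bool(r.get("has_demand"))),
    ("uniqueness", lambda r: r.get("max_family_similarity", 1.0) <= 0.7),
    ("brand_safety", lambda r: not r.get("brand_flags")),
    ("internal_links", lambda r: bool(r.get("parent_hub"))
        and len(r.get("sibling_links", [])) >= 2),
    ("schema_valid", lambda r: bool(r.get("schema_valid"))),
    ("cohort_assigned", lambda r: bool(r.get("cohort_id"))),
]


def triage(row: dict) -> dict:
    """Return the row's status and, if parked, the gate that stopped it."""
    failed = first_failed_gate(row, GATES)
    return {"status": "parked" if failed else "publishable", "failed_gate": failed}
```

Recording the failed gate alongside the parked status makes the weekly re-evaluation cheaper, since only the gate that blocked the row needs a fresh look first.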

Measuring velocity honestly

The metric we care about isn't "pages published per week". It's indexed, ranking pages per week. Those are very different numbers.

A dashboard worth having tracks, per cohort:

Pages submitted in sitemap
Pages discovered (GSC Page indexing report, formerly Coverage)
Pages indexed
Pages with at least one impression in the last 28 days
Pages with at least one click in the last 28 days
Median position for the page's primary query

The drop-off between each stage tells you where your pipeline is leaking. If discovery is fine but indexation is bad, your quality gates aren't strict enough. If indexation is fine but impressions are zero, your demand signal was wrong. If impressions are fine but clicks are zero, your titles and meta descriptions need work.
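The stage-by-stage reading can be sketched as a small funnel calculation. The stage keys and the retention floors below are assumptions chosen for illustration (the article gives no floors for this funnel), so tune them against your own cohorts.

```python
STAGES = ["submitted", "discovered", "indexed", "with_impressions", "with_clicks"]

# Assumed floors for the share of the previous stage that should survive.
ASSUMED_FLOORS = {"discovered": 0.9, "indexed": 0.4,
                  "with_impressions": 0.5, "with_clicks": 0.1}

HINTS = {
    "discovered": "discovery: submitted URLs are not being found (not covered above)",
    "indexed": "indexation: quality gates aren't strict enough",
    "with_impressions": "impressions: the demand signal was wrong",
    "with_clicks": "clicks: titles and meta descriptions need work",
}


def stage_retention(counts):
    """Map each stage to the share of the previous stage that reached it."""
    retention = {}
    for prev, cur in zip(STAGES, STAGES[1:]):
        retention[cur] = counts[cur] / counts[prev] if counts[prev] else 0.0
    return retention


def diagnose(counts, floors=ASSUMED_FLOORS):
    """Return a hint for every stage whose retention falls below its floor."""
    return [HINTS[stage] for stage, rate in stage_retention(counts).items()
            if rate < floors[stage]]


# Made-up cohort counts for illustration only.
cohort_counts = {"submitted": 1000, "discovered": 980, "indexed": 300,
                 "with_impressions": 200, "with_clicks": 60}
leaks = diagnose(cohort_counts)
```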

More on the GA4 and GSC side of this on the blog.

Where we'd start

If you've got an existing pSEO property that's underperforming, don't add more pages. Do this instead, in order:

1. Export your URL inventory and join it against GSC. Find the cohort of pages with zero impressions in 90 days. That's your dead weight.
2. Score each of those pages on data coverage using the heuristic above. The low-coverage ones either get enriched, consolidated into hubs, or noindexed.
3. Pick one template family and run a uniqueness audit on it. If average pairwise similarity is above ~0.5, your template is the problem, not the data.
4. Only after the existing index is healthy, turn the publish queue back on, in cohorts of 500 – 2,000, with the gate stack above enforced.
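The first step, joining the inventory against GSC, can be sketched as follows. It assumes a 90-day performance export already loaded as dicts with `page` and `impressions` keys; those key names and the sample data are hypothetical. URLs absent from the export are treated as having zero impressions.

```python
def zero_impression_urls(inventory, gsc_rows):
    """Return inventory URLs with no impressions in the GSC export."""
    impressions = {}
    for row in gsc_rows:
        # Sum in case the export has several rows per page (e.g. per query).
        impressions[row["page"]] = impressions.get(row["page"], 0) + row["impressions"]
    return [url for url in inventory if impressions.get(url, 0) == 0]


# Made-up data for illustration only.
inventory = [
    "https://example.com/plumbers/austin",
    "https://example.com/plumbers/marfa",
    "https://example.com/plumbers/waco",
]
gsc_export = [
    {"page": "https://example.com/plumbers/austin", "impressions": 140},
    {"page": "https://example.com/plumbers/austin", "impressions": 12},
    {"page": "https://example.com/plumbers/waco", "impressions": 0},
]
dead_weight = zero_impression_urls(inventory, gsc_export)
```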

Velocity is a function of how much junk you're willing to not ship. Most teams find their effective velocity goes up after they tighten the gates, because the survivors actually rank.

#programmatic-seo #content-engineering #quality-gates #seo-ops
