Content Velocity Without Thin Pages: An Engineering Playbook for Programmatic SEO
Publishing 10,000 pages a month is easy. Publishing 10,000 pages that don't get classified as thin content is the actual engineering problem. Here's how we approach it.

Anyone can generate 50,000 location pages over a weekend. The hard part is shipping them without Google's quality systems quietly capping your indexation at 8% and leaving the rest in "Crawled - currently not indexed" purgatory. Content velocity is a throughput problem, but the constraint is almost always quality, not generation speed.
This is how we think about velocity on programmatic SEO builds: as a pipeline with quality gates, not a publish button.
The real bottleneck isn't generation, it's uniqueness
When teams brag about velocity, they usually mean "how fast can our template render rows from a database". That number is meaningless. The number that matters is how many of those rendered pages survive a uniqueness and usefulness check.
In our experience working on directory, marketplace, and comparison sites, the failure modes are predictable:
- Pages that differ only by a city name and a swapped noun
- Pages where 80%+ of visible tokens are boilerplate (nav, footer, CTAs, FAQ blocks)
- Pages with zero unique data — just a templated paragraph wrapping a name
- Pages targeting queries that don't actually exist in search demand
A thin-content classifier doesn't need to be clever to catch these. Shingling and simple n-gram overlap will do it. So will a human reviewer assessing the site for a manual action.
The fix isn't writing better templates. It's gating publication on data coverage.
Data coverage as the primary gate
Before a page is even allowed into the render queue, we score the underlying row on how much unique, structured data it carries. A rough heuristic we use:
def coverage_score(row, required_fields, optional_fields):
    """Score a row from 0.0 to 1.0 by how much structured data it carries."""
    required_hits = sum(1 for f in required_fields if row.get(f))
    optional_hits = sum(1 for f in optional_fields if row.get(f))
    if required_hits < len(required_fields):
        return 0.0  # hard fail: a required field is missing or empty
    if not optional_fields:
        return 0.6  # base score only; also avoids dividing by zero
    # Base 0.6 for full required coverage, plus up to 0.4 from optional fields
    return min(1.0, 0.6 + 0.4 * (optional_hits / len(optional_fields)))


# Example: a "plumber in city" page
required = ["business_name", "address", "phone", "hours"]
optional = ["reviews", "services", "photos", "certifications", "price_range"]
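If coverage_score is below ~0.75, the row doesn't get a page. It gets rolled into a parent listing instead. With the five optional fields in the example, that means at least two must be populated (2/5 gives 0.6 + 0.4 × 0.4 = 0.76; 1/5 gives only 0.68). This is the single most effective lever we've found for keeping the index-to-publish ratio healthy.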
Build a publish queue, not a publish button
Most programmatic SEO systems treat publication as a build step: regenerate the site, push, done. That's fine at 500 pages. At 50,000 it's reckless, because you lose the ability to observe what each cohort of pages does.
We model publication as a queue with cohorts:
rows_eligible → cohort_builder → staged → published → monitored
Each cohort is a batch of pages sharing a template, a data shape, and a publish date. Cohorts are typically 500 – 2,000 pages. Why batches?
- You can measure indexation rate per cohort in GSC after 14 – 28 days
- You can roll back a bad cohort without nuking the site
- You can A/B template variants across cohorts
- You avoid the "100k URLs appeared on Tuesday" signal that tends to age badly
A cohort that hits less than ~40% indexation after 30 days is a signal to pause the next batch and investigate. Sometimes it's a template issue, sometimes it's a query-demand issue, sometimes the data simply isn't dense enough.
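The pause rule can be written as a small check over per-cohort counts. This is a minimal sketch, assuming the indexed-page count per cohort is already fetched from GSC by some other job; `CohortStats`, `should_hold_next_batch`, and the field names are hypothetical, and the 0.4 floor and 30-day window are the rough thresholds mentioned above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CohortStats:
    cohort_id: str
    published_at: date
    row_count: int      # pages published in this cohort
    indexed_count: int  # pages reported as indexed (fetched elsewhere)


def index_rate(stats: CohortStats) -> float:
    """Share of the cohort's pages that are indexed."""
    if stats.row_count == 0:
        return 0.0
    return stats.indexed_count / stats.row_count


def should_hold_next_batch(stats: CohortStats, today: date,
                           min_rate: float = 0.4,
                           min_age_days: int = 30) -> bool:
    """Hold only when the cohort is old enough to judge and still under the floor."""
    age_days = (today - stats.published_at).days
    if age_days < min_age_days:
        return False  # too early to judge indexation
    return index_rate(stats) < min_rate


# Illustrative numbers only, not real results
stats = CohortStats("plumbers-tx-2026-03-batch-04", date(2026, 3, 11), 1240, 400)
hold = should_hold_next_batch(stats, today=date(2026, 4, 15))
```

Returning `False` for young cohorts is a design choice: it keeps a slow first week from pausing the whole queue.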
What cohort metadata should look like
{
  "cohort_id": "plumbers-tx-2026-03-batch-04",
  "template_version": "v7",
  "row_count": 1240,
  "avg_coverage_score": 0.82,
  "published_at": "2026-03-11",
  "sitemap_partition": "sitemap-plumbers-tx-04.xml",
  "gsc_property": "sc-domain:example.com",
  "hold_next_batch_if_index_rate_below": 0.4
}
That metadata is the bridge between your content pipeline and your analytics pipeline. It's what lets you ask "which template version is indexing best?" instead of "why is traffic flat?"
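Once cohort metadata is joined to indexation counts, the template question becomes a group-by. A rough sketch, assuming each cohort record also carries an `indexed_count` added by your analytics join (that field is an assumption, as are the second cohort and all the counts, which are illustrative only):

```python
from collections import defaultdict


def index_rate_by_template(cohorts):
    """Aggregate indexed / published pages per template_version."""
    published = defaultdict(int)
    indexed = defaultdict(int)
    for cohort in cohorts:
        version = cohort["template_version"]
        published[version] += cohort["row_count"]
        indexed[version] += cohort["indexed_count"]
    # Skip versions with no published pages to avoid dividing by zero
    return {v: indexed[v] / published[v] for v in published if published[v] > 0}


cohorts = [
    {"cohort_id": "plumbers-tx-2026-03-batch-04", "template_version": "v7",
     "row_count": 1240, "indexed_count": 744},
    {"cohort_id": "plumbers-tx-2026-02-batch-02", "template_version": "v6",
     "row_count": 1000, "indexed_count": 350},
]
rates = index_rate_by_template(cohorts)
best_version = max(rates, key=rates.get)
```

Summing counts before dividing weights each template by its page volume, so one small cohort can't skew the comparison.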
Uniqueness scoring before publish, not after
The classic mistake: render the page, ship it, then discover three months later that 60% of your pages share a 400-word intro paragraph.
Run a shingled similarity check at build time. MinHash with a Jaccard threshold around 0.7 is cheap and good enough for most catalogs. If a candidate page's body shingles overlap more than the threshold with any already-published page in the same template family, it fails the gate.
A few practical notes from doing this in production:
- Strip nav, footer, and repeating UI before shingling. Otherwise everything looks 95% similar.
- Hash on 5 – 7 word shingles. Smaller and you get noise, larger and you miss paraphrases.
- Store shingle signatures, not the full text. A 128-permutation MinHash signature is ~1KB per page with 64-bit hash values (128 × 8 bytes).
- Compare within template family, not across the whole site. A plumber page and a dentist page should look different anyway.
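The notes above can be sketched in pure Python. This is an illustrative implementation, not the one we run in production: it assumes boilerplate has already been stripped from the text, uses seeded BLAKE2b hashes in place of true permutations, and compares against the family with a linear scan (at catalog scale you would likely add LSH banding, which is not shown). All function names are hypothetical.

```python
import hashlib
import re

NUM_PERM = 128      # signature length
SHINGLE_WORDS = 5   # low end of the 5 to 7 word range


def shingles(text, k=SHINGLE_WORDS):
    """Set of k-word shingles from lowercased body text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed stands in for one permutation."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum seeded hash over all shingles, per seed."""
    shingle_set = shingles(text)
    if not shingle_set:
        raise ValueError("cannot sign empty text")
    return [min(_seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def passes_uniqueness_gate(candidate_sig, family_sigs, threshold=0.7):
    """Fail if the candidate is too similar to any page in its template family."""
    return all(estimated_jaccard(candidate_sig, sig) <= threshold
               for sig in family_sigs)


plumber = minhash_signature(
    "Licensed plumber serving downtown Austin with same day water heater "
    "repair and upfront flat rate pricing")
dentist = minhash_signature(
    "Family dentist in Houston offering pediatric cleanings clear aligners "
    "and weekend emergency appointments for new patients")
```

Only the signatures need to be stored per published page, so the gate can run at build time without reloading page bodies.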
Treat search demand as an input, not a hope
There is no point generating a page for a query nobody searches. This sounds obvious, but most pSEO templates are built on a data shape (rows in a table) without ever checking whether those rows correspond to real queries.
Our rule: every template needs a demand source before it gets approved. That can be:
- GSC impressions on existing similar pages
- Keyword tool data with a minimum monthly volume floor (we usually pick something modest like 20 – 50 searches/month per page)
- Autocomplete and "People also ask" scraping for the head term
- Internal site search logs
Rows that don't map to any demand signal still get a page — but as a child of a parent hub, not as a standalone URL. This keeps the long tail accessible to crawlers without inflating your URL count with zero-demand pages.
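The routing rule is simple enough to state as code. A minimal sketch, assuming demand signals have already been joined onto each row; the field names (`gsc_impressions`, `keyword_volume`, `autocomplete_hits`, `site_search_count`) are hypothetical, and the default floor of 20 is the low end of the range above.

```python
def route_row(row, min_monthly_volume=20):
    """Return 'standalone' if any demand signal is present, else 'hub_child'."""
    has_demand = (
        row.get("gsc_impressions", 0) > 0
        or row.get("keyword_volume", 0) >= min_monthly_volume
        or row.get("autocomplete_hits", 0) > 0
        or row.get("site_search_count", 0) > 0
    )
    return "standalone" if has_demand else "hub_child"


with_demand = route_row({"keyword_volume": 40})
without_demand = route_row({"keyword_volume": 5})
```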
The brand-safety layer (yes, even for B2B)
If you're running AdSense or any programmatic ad network on these pages, brand safety becomes a publish gate too. User-generated fields are the usual culprit — business names, review snippets, free-text descriptions.
We run candidate text through a classifier pass before publish. Nothing exotic, just a small model checking for:
- Adult, gambling, weapons, hate categories
- Health claims that look like medical advice
- Financial claims that look like investment advice
- Obvious profanity in user-submitted fields
Anything flagged either gets the field stripped or the page demoted to noindex. The cost of one policy strike across a 30,000-page property is much higher than the cost of running a classifier over each row.
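One way to wire the strip-or-noindex decision, as a sketch under stated assumptions: the classifier is passed in as a callable that returns a set of labels, and `toy_classifier` below is a naive keyword stand-in so the example runs, not the small model described above. The label names and the split between strippable and required fields are also hypothetical.

```python
BLOCKED_LABELS = {"adult", "gambling", "weapons", "hate",
                  "medical_advice", "investment_advice", "profanity"}


def toy_classifier(text):
    """Stand-in for the real model: naive keyword lookup, for illustration only."""
    keywords = {"casino": "gambling", "miracle cure": "medical_advice"}
    lowered = text.lower()
    return {label for phrase, label in keywords.items() if phrase in lowered}


def apply_brand_safety(row, classify,
                       strippable=("review_snippet", "description"),
                       required=("business_name",)):
    """Strip flagged optional fields; mark the page noindex if a required field is flagged."""
    cleaned = dict(row)  # leave the caller's row untouched
    cleaned["noindex"] = False
    for field in strippable:
        text = cleaned.get(field)
        if text and classify(text) & BLOCKED_LABELS:
            cleaned[field] = None  # drop the offending optional field
    for field in required:
        text = cleaned.get(field)
        if text and classify(text) & BLOCKED_LABELS:
            cleaned["noindex"] = True  # cannot strip it, so demote the page
    return cleaned


row = {"business_name": "Ace Plumbing",
       "description": "Right next door to the casino",
       "review_snippet": "Great service"}
safe_row = apply_brand_safety(row, toy_classifier)
```

Injecting the classifier keeps the gate logic testable without loading a model.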
A reasonable gate stack
In order, before a row becomes a published URL:
- Data coverage score ≥ threshold
- Demand signal present
- Uniqueness check vs template family
- Brand-safety classifier pass
- Internal link plan (parent hub + 2 – 3 sibling links minimum)
- Schema validation
- Cohort assignment
Miss any one of these and the row is parked, not published. Parked rows get re-evaluated weekly as the underlying data improves.
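The stack maps naturally onto an ordered list of named checks where the first failure parks the row. A minimal sketch, assuming each gate's result has already been computed and stored on the row; every field name here is hypothetical, and the thresholds are the rough values used earlier in this post.

```python
def run_gates(row, gates):
    """Run gates in order; return ('published', None) or ('parked', failed_gate)."""
    for name, check in gates:
        if not check(row):
            return "parked", name
    return "published", None


GATES = [
    ("data_coverage", lambda r: r.get("coverage_score", 0.0) >= 0.75),
    ("demand_signal", lambda r: r.get("has_demand", False)),
    ("uniqueness", lambda r: r.get("max_similarity", 1.0) <= 0.7),
    ("brand_safety", lambda r: not r.get("flagged", True)),
    ("internal_links", lambda r: bool(r.get("parent_hub"))
        and len(r.get("sibling_links", [])) >= 2),
    ("schema_valid", lambda r: r.get("schema_valid", False)),
    ("cohort_assigned", lambda r: bool(r.get("cohort_id"))),
]

row = {
    "coverage_score": 0.82, "has_demand": True, "max_similarity": 0.3,
    "flagged": False, "parent_hub": "/plumbers/tx/",
    "sibling_links": ["/plumbers/tx/austin/", "/plumbers/tx/dallas/"],
    "schema_valid": True, "cohort_id": "plumbers-tx-2026-03-batch-04",
}
status, failed_gate = run_gates(row, GATES)
```

Missing fields default to the failing value, so a row that skipped a check is parked rather than published. Recording the failed gate name also tells the weekly re-evaluation which data needs to improve.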
Measuring velocity honestly
The metric we care about isn't "pages published per week". It's indexed, ranking pages per week. Those are very different numbers.
A dashboard worth having tracks, per cohort:
- Pages submitted in sitemap
- Pages discovered (GSC Page indexing report, formerly Coverage)
- Pages indexed
- Pages with at least one impression in the last 28 days
- Pages with at least one click in the last 28 days
- Median position for the page's primary query
The drop-off between each stage tells you where your pipeline is leaking. If discovery is fine but indexation is bad, your quality gates aren't strict enough. If indexation is fine but impressions are zero, your demand signal was wrong. If impressions are fine but clicks are zero, your titles and meta descriptions need work.
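The leak analysis reduces to stage-to-stage conversion rates. A sketch, assuming per-cohort counts are already collected into a dict; the stage keys are hypothetical and the counts are illustrative, not real results.

```python
STAGES = ["submitted", "discovered", "indexed", "with_impressions", "with_clicks"]

# Diagnoses paraphrased from the paragraph above
HINTS = {
    "discovered->indexed": "quality gates aren't strict enough",
    "indexed->with_impressions": "demand signal was wrong",
    "with_impressions->with_clicks": "titles and meta descriptions need work",
}


def stage_conversion(counts):
    """Conversion rate from each funnel stage to the next."""
    rates = {}
    for prev, nxt in zip(STAGES, STAGES[1:]):
        rates[f"{prev}->{nxt}"] = counts[nxt] / counts[prev] if counts[prev] else 0.0
    return rates


def weakest_stage(counts):
    """Name of the transition with the lowest conversion rate."""
    rates = stage_conversion(counts)
    return min(rates, key=rates.get)


counts = {"submitted": 1240, "discovered": 1200, "indexed": 420,
          "with_impressions": 300, "with_clicks": 150}
leak = weakest_stage(counts)
diagnosis = HINTS.get(leak, "check sitemap submission and crawlability")
```

The fallback message for the submitted-to-discovered transition is our own assumption; the text above does not cover that case.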
Where we'd start
If you've got an existing pSEO property that's underperforming, don't add more pages. Do this instead, in order:
- Export your URL inventory and join it against GSC. Find the cohort of pages with zero impressions in 90 days. That's your dead weight.
- Score each of those pages on data coverage using the heuristic above. The low-coverage ones get enriched, consolidated into hubs, or noindexed.
- Pick one template family and run a uniqueness audit on it. If average pairwise similarity is above ~0.5, your template is the problem, not the data.
- Only after the existing index is healthy, turn the publish queue back on — in cohorts of 500 – 2,000, with the gate stack above enforced.
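The first step, joining the URL inventory against GSC, can be sketched as follows. It assumes the GSC data is an export covering the last 90 days, loaded as dicts with `page` and `impressions` keys (hypothetical names; match them to your export), and that URLs are already normalized the same way on both sides.

```python
def find_dead_weight(inventory_urls, gsc_rows):
    """URLs in the inventory with zero impressions in the GSC export window."""
    impressions = {}
    for row in gsc_rows:
        # Sum in case the export has several rows per page (e.g. one per query)
        impressions[row["page"]] = impressions.get(row["page"], 0) + int(row["impressions"])
    # Pages absent from the export count as zero impressions
    return [url for url in inventory_urls if impressions.get(url, 0) == 0]


inventory = [
    "https://example.com/plumbers/tx/austin/",
    "https://example.com/plumbers/tx/marfa/",
    "https://example.com/plumbers/tx/dallas/",
]
gsc_rows = [
    {"page": "https://example.com/plumbers/tx/austin/", "impressions": "120"},
    {"page": "https://example.com/plumbers/tx/austin/", "impressions": "15"},
    {"page": "https://example.com/plumbers/tx/dallas/", "impressions": "0"},
]
dead = find_dead_weight(inventory, gsc_rows)
```

Trailing slashes, protocol, and casing mismatches are the usual reason this join overcounts dead pages, so normalize before trusting the list.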
Velocity is a function of how much junk you're willing to not ship. Most teams find their effective velocity goes up after they tighten the gates, because the survivors actually rank.