SEO & Growth · June 26, 2026 · 6 min read

Internal Linking for Programmatic SEO: Building a Link Graph That Survives 100k Pages

Most programmatic sites die from flat, random internal linking. Here's how we model the link graph as a data problem so PageRank actually flows where it should.

Programmatic SEO sites usually fail at the same place: internal linking. The templates ship, the pages index, traffic climbs for a quarter, then plateaus — because every page links to the same 12 things in the footer and a random grab-bag of "related" rows that were never engineered. Here's how we treat the internal link graph as a first-class data model, not a template afterthought.

Why internal linking breaks at scale

On a 200-page site you can hand-curate links. On 100,000 pages you cannot, so most teams reach for one of two crutches:

A related block populated by "same category, random 10"
A massive HTML sitemap or alphabetical index dumped into the footer

Both approaches produce a flat graph. Every page is roughly two clicks from every other page, anchor text is repetitive, and crawl signals tell Google that no page is more important than any other. PageRank — or whatever Google's modern equivalent is internally — has nothing to concentrate on, so rankings stay mediocre across the board.

The fix is not more links. It's intentional links, modelled the same way you'd model a recommendation system.

Model the link graph before you template it

Before touching a Next.js component, sketch the graph on paper. A useful programmatic site has three or four tiers:

Tier 0 — homepage, a handful of pillar pages
Tier 1 — category hubs (e.g. /plumbers/london)
Tier 2 — sub-hubs or filters (e.g. /plumbers/london/emergency)
Tier 3 — leaf pages (the long tail: individual listings, comparisons, location+service combos)

Links should flow down the tiers via navigation and breadcrumbs, up the tiers via canonical hub links on every leaf, and sideways only between semantically related siblings — not random ones.

The link budget per template

Give each template a fixed link budget and stick to it. A typical breakdown we use on leaf pages:

1 link up to the parent hub (breadcrumb)
1 link up to the grandparent hub (breadcrumb)
5 – 8 sibling links (semantically similar, not random)
3 – 5 contextual in-body links to related leaves
1 link to a relevant pillar

That's 11 to 16 internal links per leaf, all earned by a rule. Compare that to the typical mega-footer site shipping 200+ identical internal links per page, most of which Google is likely to discount.
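
One way to keep a budget honest is to write it down as data and check rendered pages against it. The sketch below is a minimal illustration, not a finished tool: the slot names, the `LEAF_LINK_BUDGET` table, and the `budget_violations` helper are our own hypothetical names, and it assumes you can already count links per slot from your renderer or crawler.

```python
# Hypothetical link budget for a leaf template: slot -> (min, max) links.
# The ranges mirror the breakdown listed above; the slot names are illustrative.
LEAF_LINK_BUDGET = {
    "parent_hub": (1, 1),
    "grandparent_hub": (1, 1),
    "siblings": (5, 8),
    "contextual": (3, 5),
    "pillar": (1, 1),
}


def budget_violations(link_counts, budget=LEAF_LINK_BUDGET):
    """Return human-readable problems for one rendered page.

    link_counts maps slot name -> number of internal links found in that slot.
    """
    problems = []
    for slot, (low, high) in budget.items():
        found = link_counts.get(slot, 0)
        if not low <= found <= high:
            problems.append(f"{slot}: {found} links, expected {low}-{high}")
    # Any slot the budget does not know about is a violation by definition.
    for slot in sorted(set(link_counts) - set(budget)):
        problems.append(f"{slot}: {link_counts[slot]} links, slot not in budget")
    return problems


if __name__ == "__main__":
    page = {"parent_hub": 1, "grandparent_hub": 1, "siblings": 2,
            "contextual": 4, "pillar": 1, "footer": 200}
    for problem in budget_violations(page):
        print(problem)
```

Running a check like this in CI against a sample of rendered pages is one way to stop a template change from quietly reintroducing a mega-footer.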

Build a sibling selector that isn't random

The single biggest win on most programmatic sites is replacing ORDER BY RANDOM() in the related block with a real similarity function. You don't need ML for this. A weighted match score over a handful of structured features is enough.

-- Postgres: pick 8 semantically close siblings for a given leaf page.
-- $1 is the slug of the page being rendered.
WITH target AS (
  SELECT id, category_id, city_id, price_tier, service_tags
  FROM pages
  WHERE slug = $1
)
SELECT p.id, p.slug, p.title,
  (
    (p.category_id = t.category_id)::int * 3 +
    (p.city_id     = t.city_id)::int     * 2 +
    (p.price_tier  = t.price_tier)::int  * 1 +
    -- Shared tags. The array & operator only exists for int[] with the
    -- intarray extension, so count the overlap explicitly instead.
    (SELECT count(*) FROM unnest(p.service_tags) AS tag
      WHERE tag = ANY (t.service_tags))
  ) AS score
FROM pages p
CROSS JOIN target t
WHERE p.id <> t.id
  AND p.status = 'indexable'
-- NULL feature values make the score NULL, so sort those rows last.
-- Ties go to the more deeply buried page.
ORDER BY score DESC NULLS LAST, p.click_depth DESC
LIMIT 8;

Two details matter here. First, p.status = 'indexable': never link to noindexed or thin pages from a sibling block, because every such link spends crawl budget on a dead end. Second, the click_depth DESC tie-breaker favours pages that are currently buried, which is exactly where you want link equity to go.

Anchor text without the spam smell

Anchor text is where programmatic sites usually self-report as templates. If every link from a London plumber page reads "Plumbers in London", you're waving a flag. Vary it by pulling from a small pool of patterns:

{Service} in {City}
{City} {service} (lowercase, casual)
{Service} near {Neighbourhood}
{Adjective} {service} {City} — only when adjective is structured data, never invented

Rotate deterministically based on the target page's ID, not randomly per request, or the anchors will flap on every crawl and confuse Google.
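
Deterministic rotation can be as simple as hashing the target ID into the pattern pool. The following is a minimal sketch under a few assumptions of ours: `pick_anchor` and the field names are hypothetical, `service` and `city` are always present, and the adjective pattern is left out for brevity.

```python
import hashlib
from string import Formatter

# Pattern pool taken from the list above; placeholder names are illustrative.
ANCHOR_PATTERNS = [
    "{service} in {city}",
    "{city} {service_lower}",
    "{service} near {neighbourhood}",
]


def pick_anchor(target_id, fields):
    """Return the same anchor text for the same target on every render.

    fields must contain "service" and "city"; "neighbourhood" is optional.
    """
    values = dict(fields)
    values["service_lower"] = fields["service"].lower()
    # Keep only the patterns whose placeholders we have non-empty data for.
    usable = [
        pattern
        for pattern in ANCHOR_PATTERNS
        if all(values.get(name)
               for _, name, _, _ in Formatter().parse(pattern) if name)
    ]
    # hashlib is stable across processes; the built-in hash() of a str is
    # salted per process, so it would make anchors flap between deploys.
    digest = hashlib.sha256(str(target_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(usable)
    return usable[index].format(**values)


if __name__ == "__main__":
    page = {"service": "Plumbers", "city": "London", "neighbourhood": "Camden"}
    for target_id in (101, 102, 103):
        print(target_id, pick_anchor(target_id, page))
```

Because the choice depends only on the target ID and the pool, adding a pattern to the pool will reshuffle existing anchors once; it is worth treating pool changes as a deliberate release rather than a casual tweak.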

Hub pages do the heavy lifting

A programmatic site without strong hubs is just a haystack. Hubs are how you tell search engines which clusters of leaves belong together and which ones matter most.

Three rules we apply to every hub:

A hub must rank for its own keyword. If /plumbers/london doesn't rank for "plumbers London", linking from it carries little weight. Treat hubs as editorial content with real copy, not just a grid of leaves.
Hubs link to no more than ~100 leaves directly. Beyond that, paginate or sub-categorise. Linking to 5,000 leaves from one page dilutes each link to near zero.
Every leaf links back to exactly one canonical hub. Multiple parent hubs split the equity and create category confusion.

When to introduce sub-hubs

If a hub has more than ~200 viable leaves, split it. The split should follow a user facet (price, neighbourhood, urgency), not an arbitrary alphabetical bucket. Sub-hubs need their own copy, their own title patterns, and their own sibling logic; otherwise they're just paginated noise.

Measuring whether your link graph actually works

This is where most teams stop, which is why most teams plateau. You need three measurements:

Click depth distribution. Run a crawl (Screaming Frog, Sitebulb, or your own) and plot click depth vs. page count. If 80% of your pages sit at depth 5+, your hubs aren't doing their job. Aim for the long tail of leaves to be reachable within 3 – 4 clicks of the homepage.
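
If you are rolling your own crawl, click depth is just a breadth-first search from the homepage. Here is a minimal standard-library sketch; the function names are ours, and it assumes your crawl can be exported as (source_url, target_url) pairs.

```python
from collections import Counter, deque


def click_depths(edges, home="/"):
    """Breadth-first search from the homepage: url -> minimum click depth.

    edges is an iterable of (source_url, target_url) pairs from a crawl.
    Pages that cannot be reached from home are absent from the result.
    """
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)

    depths = {home: 0}
    queue = deque([home])
    while queue:
        url = queue.popleft()
        for linked in graph.get(url, []):
            if linked not in depths:  # first visit is the shortest path
                depths[linked] = depths[url] + 1
                queue.append(linked)
    return depths


def depth_report(depths, deep_threshold=5):
    """Return (histogram of depth -> page count, share of pages at depth >= threshold)."""
    histogram = Counter(depths.values())
    deep = sum(count for depth, count in histogram.items() if depth >= deep_threshold)
    return dict(sorted(histogram.items())), deep / len(depths)


if __name__ == "__main__":
    sample = [("/", "/plumbers/london"),
              ("/plumbers/london", "/plumbers/london/emergency")]
    print(depth_report(click_depths(sample)))
```

Any indexable URL missing from the result is unreachable by links from the homepage, which makes this a useful cross-check against the orphan problem as well.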

Internal PageRank proxy. You can approximate this with a simple PageRank calculation on your crawl graph. NetworkX does it in five lines:

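import networkx as nx  # recent releases may also need SciPy installed for pagerank

# Placeholder inputs: swap in your own crawl export and GSC data.
edges = [  # (source_url, target_url) tuples from your crawl
    ("/", "/plumbers/london"),
    ("/plumbers/london", "/plumbers/london/emergency"),
]
gsc_clicks = {"/plumbers/london": 0}  # url -> clicks from a GSC export

G = nx.DiGraph()
G.add_edges_from(edges)

pr = nx.pagerank(G, alpha=0.85)  # 0.85 is the conventional damping factor

# Cross-reference the top 50 pages by internal PageRank with GSC clicks
for url, score in sorted(pr.items(), key=lambda x: -x[1])[:50]:
    print(url, score, gsc_clicks.get(url, 0))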

The pages with high internal PageRank but low GSC clicks are either over-linked low-value pages (stop linking to them) or genuinely high-potential pages that need content work. Pages with low PageRank but decent clicks are under-linked — boost them in sibling blocks.
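
That triage can be automated once you have a score and a click count per URL. The sketch below is one possible cut, not a rule: splitting at the median is our assumption, and `triage_pages` is a hypothetical helper that takes any url-to-score mapping (for example the output of a PageRank run) and a url-to-clicks mapping.

```python
from statistics import median


def triage_pages(pagerank, clicks):
    """Split pages into two review lists using median cut-offs.

    pagerank: url -> internal PageRank score
    clicks:   url -> clicks from Search Console (missing urls count as 0)
    Returns (high_pr_low_clicks, low_pr_high_clicks), each sorted by url.
    """
    pr_cut = median(pagerank.values())
    click_cut = median(clicks.get(url, 0) for url in pagerank)

    high_pr_low_clicks = sorted(
        url for url, score in pagerank.items()
        if score > pr_cut and clicks.get(url, 0) <= click_cut
    )
    low_pr_high_clicks = sorted(
        url for url, score in pagerank.items()
        if score <= pr_cut and clicks.get(url, 0) > click_cut
    )
    return high_pr_low_clicks, low_pr_high_clicks


if __name__ == "__main__":
    scores = {"/a": 0.4, "/b": 0.3, "/c": 0.2, "/d": 0.1}
    gsc = {"/a": 100, "/c": 50, "/d": 1}
    review, boost = triage_pages(scores, gsc)
    print("review (over-linked or needs content):", review)
    print("boost in sibling blocks:", boost)
```

The first list still needs a human decision, since the code cannot tell an over-linked low-value page from a high-potential page with weak content. The second list can feed straight back into the sibling selector.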

Anchor text diversity. Group anchors by target URL and look at the distribution. If a single anchor accounts for more than ~60% of incoming links to a page, vary your pattern pool.
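
A rough way to run that check over a crawl export, again as a sketch with our own naming: `dominant_anchors` is hypothetical, and the `min_links` floor is an assumption we added so that pages with only a couple of inbound links do not trip the threshold trivially.

```python
from collections import Counter, defaultdict


def dominant_anchors(links, max_share=0.6, min_links=5):
    """Find targets where one anchor text dominates the inbound links.

    links is an iterable of (target_url, anchor_text) pairs.
    Returns target_url -> (most common anchor, its share of inbound links).
    """
    anchors_by_target = defaultdict(Counter)
    for target, anchor in links:
        # Normalise case and whitespace so trivial variants count as one anchor.
        anchors_by_target[target][anchor.strip().lower()] += 1

    flagged = {}
    for target, counts in anchors_by_target.items():
        total = sum(counts.values())
        if total < min_links:
            continue
        anchor, count = counts.most_common(1)[0]
        share = count / total
        if share > max_share:
            flagged[target] = (anchor, round(share, 2))
    return flagged


if __name__ == "__main__":
    sample = [("/plumbers/london", "Plumbers in London")] * 4
    sample.append(("/plumbers/london", "London plumbers"))
    print(dominant_anchors(sample))
```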

The orphan and near-orphan problem

Programmatic sites accumulate orphans constantly. A new template ships, a sitemap entry disappears, a category gets renamed, and suddenly 4,000 pages have zero internal links. They limp along on sitemap discovery alone, which is a much weaker signal than a real internal link.

Run an orphan check weekly:

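-- Indexable pages with fewer than 3 incoming internal links.
-- Assumes internal_links holds one row per link, keyed by id.
SELECT p.slug, COUNT(l.id) AS inbound_links
FROM pages p
LEFT JOIN internal_links l ON l.target_id = p.id
WHERE p.status = 'indexable'
GROUP BY p.id, p.slug
HAVING COUNT(l.id) < 3
ORDER BY inbound_links, p.slug;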

Any indexable page with fewer than 3 incoming internal links needs intervention — either delete it, noindex it, or surface it in a sibling block somewhere. Three is a rough floor, not a magic number; the point is that a single link is fragile.

Common mistakes we've cleaned up

A short, honest list from sites we've audited:

Footer mega-menus with 300+ links. Cut to 20 high-value destinations. The rest belong in HTML sitemaps that are themselves linked from the footer, not the footer itself.
"Popular searches" widgets that never update. If it's static, it's noise. Either drive it from real GSC query data or remove it.
Breadcrumbs that skip levels. Every level in the URL should be a level in the breadcrumb. Skipping creates orphaned sub-hubs.
Linking to paginated URLs (?page=4) from sibling blocks. Always link to canonical pages, never to deep pagination.
Same anchor text across every template instance. Even a 3-pattern rotation looks dramatically more natural to crawlers than a single hardcoded string.

Where we'd start

If you've got an existing programmatic site and you're staring at a plateau, do this in order: crawl it and plot click depth, build the orphan query above, then replace your random related block with a similarity-scored one. Those three moves take a sprint and usually shift rankings within a crawl cycle or two.

If you're greenfield, model the tiers before you write a single template, and give each template a written link budget that lives in the spec. The link graph is architecture — retrofitting it later is always more painful than building it right. If you want a hand designing one, our SEO and growth engineering team does this for a living.
