SEO & Growth · May 28, 2026 · 6 min read

Internal Linking for Programmatic SEO: A Graph-Based Approach That Survives Page Churn

Most programmatic sites either over-link or under-link. Here's how to model internal links as a graph, score candidates, and keep the structure healthy as pages get added, merged, or killed.

Internal linking is the single most underrated lever on a programmatic SEO site, and the easiest one to break. Once you cross a few thousand pages, hand-curated links become fiction and naive "related items" widgets turn into either a link farm or a desert. The fix is to treat your internal link structure as a graph problem, not a templating problem.

This is the approach we use on programmatic builds: model pages as nodes, links as weighted edges, score candidates, and let the graph decay gracefully as pages get added, merged, or retired.

Why template-based linking falls apart

The usual pattern on a pSEO site looks like this: every city page links to the 10 nearest cities, every category links to its parent and a fixed set of siblings, and a footer dumps the "top 50" pages site-wide. It works at 500 URLs. At 50,000 it produces three failure modes we see constantly:

Orphan clusters. Long-tail pages that match no template rule end up with one or two inbound links, usually from a sitemap index.
Reciprocal loops. A links to B, B links back to A, and Googlebot wastes crawl budget bouncing between near-duplicates.
Stale anchors. When a page's title changes or it gets merged, every template that hardcoded its anchor text now lies.

The root cause is the same: links are computed at render time from local rules, with no view of the global structure. You can't optimise what you can't see.

Model the site as a directed weighted graph

Start by giving every indexable URL a stable node ID — not the slug, because slugs change. We use a content hash of the canonical entity plus a UUID stored alongside the page record.
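One way to read that, sketched in Python: hash the canonical entity key, then derive a deterministic UUID from the hash so a re-import of the same entity always lands on the same node. The namespace value, the function names, and the lowercase normalisation are assumptions for illustration, not a description of any particular production setup.

```python
import hashlib
import uuid

# Hypothetical fixed namespace; any UUID works as long as it never changes.
PAGE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def entity_fingerprint(entity_type: str, entity_id: str) -> str:
    """Content hash of the canonical entity key, independent of the URL slug."""
    # Assumption: entity keys are case-insensitive, so we normalise before hashing.
    canonical = f"{entity_type.strip().lower()}:{entity_id.strip().lower()}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def node_id(entity_type: str, entity_id: str) -> uuid.UUID:
    """Deterministic UUID derived from the fingerprint."""
    return uuid.uuid5(PAGE_NAMESPACE, entity_fingerprint(entity_type, entity_id))
```

Because the ID depends only on the entity, renaming the slug or retitling the page leaves every edge that references the node intact.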

Each edge has at minimum:

source_id, target_id
anchor_text (or a template that resolves to one)
placement (body, sidebar, footer, breadcrumb, related)
weight (computed, see below)
created_at, last_verified_at

Store the graph in whatever you already have. Postgres with a links table and a couple of indexes is fine up to a few million edges. Beyond that, a dedicated graph store or a columnar warehouse for analytics plus Postgres for serving works well.

-- The VECTOR type below assumes the pgvector extension is installed.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE pages (
  id UUID PRIMARY KEY,               -- stable node ID, never derived from the slug
  url TEXT UNIQUE NOT NULL,
  entity_type TEXT NOT NULL,
  entity_id TEXT NOT NULL,
  status TEXT NOT NULL,              -- live | merged | redirected | killed
  cluster_id UUID,
  embedding VECTOR(384)
);

CREATE TABLE links (
  source_id UUID REFERENCES pages(id),
  target_id UUID REFERENCES pages(id),
  placement TEXT NOT NULL,           -- body | sidebar | footer | breadcrumb | related
  anchor_template TEXT NOT NULL,
  weight REAL NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_verified_at TIMESTAMPTZ,
  PRIMARY KEY (source_id, target_id, placement)
);

CREATE INDEX ON links (target_id);   -- fast inbound lookups
CREATE INDEX ON pages (cluster_id) WHERE status = 'live';

With that in place, every link decision becomes a query, not a template hardcode.
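To make "a query, not a template" concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for Postgres. The trimmed-down tables, the sample rows, and the `link_candidates` helper are assumptions for illustration. The query picks live pages in the source's cluster that the source does not link to yet, least-linked first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id TEXT PRIMARY KEY, status TEXT, cluster_id TEXT);
    CREATE TABLE links (source_id TEXT, target_id TEXT, placement TEXT,
                        PRIMARY KEY (source_id, target_id, placement));
""")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("a", "live", "c1"), ("b", "live", "c1"), ("c", "live", "c1"),
    ("d", "killed", "c1"), ("e", "live", "c2"),
])
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    ("a", "b", "related"), ("c", "b", "related"),
])

CANDIDATES_SQL = """
    SELECT p.id, COUNT(l.source_id) AS inbound
    FROM pages AS p
    LEFT JOIN links AS l ON l.target_id = p.id
    WHERE p.status = 'live'
      AND p.cluster_id = (SELECT cluster_id FROM pages WHERE id = :src)
      AND p.id <> :src
      AND p.id NOT IN (SELECT target_id FROM links WHERE source_id = :src)
    GROUP BY p.id
    ORDER BY inbound ASC, p.id ASC
    LIMIT :n
"""


def link_candidates(conn, source_id, n=10):
    """Return up to n candidate target IDs for the given source page."""
    rows = conn.execute(CANDIDATES_SQL, {"src": source_id, "n": n}).fetchall()
    return [row[0] for row in rows]
```

In a real build the ordering would come from the scorer rather than raw inbound counts, but the shape stays the same: the template asks the graph for targets instead of computing them from local rules.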

Clusters, not just categories

Categories are editorial. Clusters are computed. We assign each page to a cluster using embeddings of the page's core entity fields (title, primary attributes, location if relevant) and a clustering pass that runs nightly. Clusters are what "related" links should respect, because they reflect actual semantic proximity, not whatever the taxonomy team decided 18 months ago.
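The clustering algorithm itself is a free choice. As a deliberately simple stand-in, the sketch below uses greedy "leader" clustering on cosine similarity: each page joins the first cluster whose leader is similar enough, otherwise it starts a new one. The 0.8 threshold and the function names are assumptions; a production pass would more likely use a library clusterer over the stored embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(embeddings, threshold=0.8):
    """Map page_id -> cluster index using greedy leader clustering."""
    leaders = []      # list of (cluster_index, leader_vector)
    assignment = {}
    for page_id, vector in embeddings.items():
        for cluster_index, leader in leaders:
            if cosine_similarity(vector, leader) >= threshold:
                assignment[page_id] = cluster_index
                break
        else:
            # No existing leader is close enough, so this page starts a cluster.
            cluster_index = len(leaders)
            leaders.append((cluster_index, vector))
            assignment[page_id] = cluster_index
    return assignment
```

The result depends on iteration order, which is acceptable for a sketch but is one reason to prefer a proper clustering method once the approach has proven itself.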

Scoring link candidates

For any source page, the candidate set for outbound links is, in principle, every other live page. In practice you can prune it to the same cluster plus a handful of bridges. Score each candidate with something like:

score(s, t) =
    w1 * semantic_similarity(s, t)
  + w2 * topical_authority(t)
  + w3 * inbound_deficit(t)
  - w4 * existing_link_density(s, t)
  - w5 * distance_penalty(s, t)

A quick tour of each term:

Semantic similarity: cosine similarity between page embeddings. Cheap and effective.
Topical authority: pages with strong GSC impressions in the cluster's query space get a boost. This is where your GA4 + GSC pipeline starts paying off.
Inbound deficit: how far the target sits below the median inbound link count for its cluster. This is the term that fixes orphans automatically.
Existing link density: penalise candidates already linked from this page or its immediate neighbours. Stops the reciprocal loop problem.
Distance penalty: discourage links across unrelated clusters unless the bridge score is unusually high.
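The formula translates almost directly into code. In this sketch the default weights are placeholders rather than recommendations, each feature is assumed to be pre-scaled to roughly 0..1, and `inbound_deficit` is one possible normalisation of "how far below the cluster median".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Placeholder values; tune per site.
    similarity: float = 1.0
    authority: float = 0.5
    deficit: float = 0.5
    density: float = 0.5
    distance: float = 0.5


def inbound_deficit(inbound_count, cluster_median):
    """Shortfall below the cluster's median inbound count, scaled to 0..1."""
    if cluster_median <= 0:
        return 0.0
    return max(0.0, (cluster_median - inbound_count) / cluster_median)


def score(features, w=Weights()):
    """Weighted sum of the five terms; the last two are penalties."""
    return (
        w.similarity * features["semantic_similarity"]
        + w.authority * features["topical_authority"]
        + w.deficit * features["inbound_deficit"]
        - w.density * features["existing_link_density"]
        - w.distance * features["distance_penalty"]
    )
```

With these placeholder weights, a near-orphan target with moderate similarity can outrank a well-linked target with slightly higher similarity, which is the behaviour the deficit term exists to produce.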

The weights are not universal. Tune them per site using a holdout: pick 500 pages, hand-rate the top-10 candidates as good/bad, and grid-search the weights against that judgement set. In our experience this beats any "industry default" by a wide margin.
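A minimal sketch of that tuning loop, assuming each judged page comes with a dict of candidate feature tuples and the set of candidates a human rated "good". It exhaustively tries every weight combination from a small grid and keeps the one with the best mean precision at k. The data layout and function names are assumptions.

```python
from itertools import product

# Feature order: similarity, authority, deficit, density, distance.
SIGNS = (1, 1, 1, -1, -1)


def linear_score(features, weights):
    """Weighted sum with the last two terms applied as penalties."""
    return sum(s * w * f for s, w, f in zip(SIGNS, weights, features))


def precision_at_k(ranked, good, k):
    """Share of the top-k ranked candidates that were hand-rated good."""
    top = ranked[:k]
    return sum(1 for t in top if t in good) / len(top) if top else 0.0


def grid_search(judgements, grid, k=10):
    """judgements: list of (candidates, good_ids) pairs, one per judged page."""
    best_weights, best_quality = None, -1.0
    for weights in product(grid, repeat=5):
        total = 0.0
        for candidates, good in judgements:
            ranked = sorted(
                candidates,
                key=lambda t: linear_score(candidates[t], weights),
                reverse=True,
            )
            total += precision_at_k(ranked, good, k)
        quality = total / len(judgements)
        if quality > best_quality:
            best_weights, best_quality = weights, quality
    return best_weights, best_quality
```

A grid of four values per weight is 1,024 combinations, which is trivial against 500 judged pages. Keep a slice of the judgements out of the search so the chosen weights can be checked on pages they were not fitted to.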

Handling churn: the part everyone gets wrong

Programmatic sites are not static. Entities get merged when duplicates are found, killed when source data dries up, or split when a category gets too broad. Each of these events has to update the graph atomically, or you end up with links pointing at 301 chains and 404s.

We run a small event bus where any page lifecycle change emits a message, and a worker reconciles edges:

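def on_page_event(event):
    # Assumes `db.transaction()` is a context manager that commits on success and
    # rolls back on error, so each lifecycle event updates the graph atomically.
    with db.transaction():
        if event.type == "merged":
            # Drop the merged page's outbound links, plus any survivor -> merged
            # edge, which would otherwise turn into a self-link.
            db.execute("""
                DELETE FROM links
                WHERE source_id = %s OR (source_id = %s AND target_id = %s)
            """, (event.merged_id, event.survivor_id, event.merged_id))
            # Drop inbound edges the survivor already has, so the re-point
            # below cannot collide with the primary key.
            db.execute("""
                DELETE FROM links AS m
                WHERE m.target_id = %s AND EXISTS (
                    SELECT 1 FROM links AS s
                    WHERE s.source_id = m.source_id
                      AND s.placement = m.placement
                      AND s.target_id = %s)
            """, (event.merged_id, event.survivor_id))
            # Re-point the remaining inbound links to the survivor
            db.execute("""
                UPDATE links
                SET target_id = %s, last_verified_at = NOW()
                WHERE target_id = %s
            """, (event.survivor_id, event.merged_id))

        elif event.type == "killed":
            # Collect neighbours first; after the delete there are no edges left to follow
            neighbours = neighbours_of(event.page_id)
            db.execute("DELETE FROM links WHERE source_id = %s OR target_id = %s",
                       (event.page_id, event.page_id))
            enqueue_recompute(neighbours)

        elif event.type == "created":
            enqueue_recompute([event.page_id])
            enqueue_inbound_search(event.page_id)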

The key idea: link recompute is a queue, not a cron. When a page changes, recompute its outbound set and the inbound candidate set for its cluster. Don't recompute the whole site every night — at 50k+ pages that's wasteful and produces churn in the SERPs.
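One detail that matters in practice is deduplication: a burst of events in one cluster should not recompute the same page ten times. The sketch below is an in-memory stand-in for whatever queue you actually run, and the class and method names are hypothetical.

```python
from collections import deque


class RecomputeQueue:
    """FIFO queue that ignores page IDs already waiting to be recomputed."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def enqueue(self, page_ids):
        for page_id in page_ids:
            if page_id not in self._pending:
                self._pending.add(page_id)
                self._queue.append(page_id)

    def drain(self, handler, limit=100):
        """Process up to `limit` pages and return how many were handled."""
        handled = 0
        while self._queue and handled < limit:
            page_id = self._queue.popleft()
            self._pending.discard(page_id)
            handler(page_id)
            handled += 1
        return handled
```

The `limit` on `drain` is what keeps a large merge from turning into an accidental full-site recompute: the backlog is worked off in bounded batches.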

Link decay and verification

Every edge has a last_verified_at. A background job re-scores a slice of edges daily (say, the oldest 5%). If an edge's score has dropped below a replacement threshold, it gets swapped for a better candidate. This is how you keep the graph fresh without nuking stable, well-performing links every week.

There's a tension here worth being explicit about: Google likes link stability. Don't churn anchors and targets for the sake of it. We only replace an edge when the delta is meaningful — typically 20%+ score improvement — and we cap replacements at a few percent of edges per week per page.
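Both rules fit in a few lines. This sketch assumes each edge is a dict carrying its current score, the score of its best alternative, and any sortable `last_verified_at` value. The field names and the `cap` parameter are illustrative, while the 5% slice and 20% gain mirror the numbers above.

```python
import math


def oldest_slice(edges, fraction=0.05):
    """Return the least recently verified share of edges (at least one if any exist)."""
    if not edges:
        return []
    size = max(1, math.ceil(len(edges) * fraction))
    return sorted(edges, key=lambda e: e["last_verified_at"])[:size]


def relative_gain(current, candidate):
    """Fractional improvement of the candidate score over the current score."""
    if current <= 0:
        return float("inf") if candidate > 0 else 0.0
    return (candidate - current) / current


def plan_replacements(edges, min_gain=0.20, cap=1):
    """Pick up to `cap` edges whose best alternative clears the gain threshold."""
    eligible = [
        e for e in edges
        if relative_gain(e["score"], e["best_alternative"]) >= min_gain
    ]
    # Largest improvement first, so the cap is spent where it matters most.
    eligible.sort(
        key=lambda e: relative_gain(e["score"], e["best_alternative"]),
        reverse=True,
    )
    return eligible[:cap]
```

Edges that are re-scored but not replaced should still get a fresh `last_verified_at`, otherwise the same slice is re-checked every day.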

Observability: make the graph legible

If you can't see the graph, you can't trust it. The dashboards we actually use:

Inbound distribution per cluster. A histogram. You want a roughly log-normal shape; a long flat tail means orphans.
Crawl depth from homepage. Computed via BFS on the live graph. Anything beyond depth 5 on a pSEO site is suspect.
Anchor diversity per target. If 90% of inbound anchors are identical, you've over-templated.
Edge age distribution. Spikes mean a recompute ran wild; flat-lines mean the decay job is broken.

Pair these with GSC's Links report and you'll spot drift before it becomes a traffic story.
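Two of these metrics are small enough to sketch inline. The code assumes the live graph is available as an adjacency dict of page ID to outbound target IDs, and that the inbound anchors for a target are a plain list of strings. Both shapes are assumptions about how you export the data.

```python
from collections import Counter, deque


def crawl_depths(adjacency, home):
    """Breadth-first search from the homepage; returns page -> click depth."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        node = queue.popleft()
        for target in adjacency.get(node, ()):
            if target not in depths:
                depths[target] = depths[node] + 1
                queue.append(target)
    return depths


def top_anchor_share(anchors):
    """Share of inbound anchors that use the single most common text."""
    if not anchors:
        return 0.0
    return Counter(anchors).most_common(1)[0][1] / len(anchors)
```

Pages missing from the `crawl_depths` result are unreachable from the homepage, which is a stricter orphan signal than a low inbound count.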

What about nofollow, sponsored, and brand-safety constraints?

If your site mixes editorial and AdSense-monetised content, some link placements need to respect brand-safety rules — you don't want a sensitive topic page linked from a high-traffic commercial hub in a way that pulls ad inventory into awkward adjacency. Tag pages with safety classes and add a hard constraint to the scorer that forbids edges across incompatible classes. It's one extra WHERE clause and it saves a lot of pain. We wrote more about pre-publish safety gates on the 72Technologies blog if you want the upstream view.
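The same hard constraint can be sketched in Python as a filter applied before scoring. The class names and the incompatible pair below are hypothetical; the real list comes from your own brand-safety policy.

```python
# Hypothetical safety classes; each frozenset is a pair that must never be linked.
INCOMPATIBLE = {frozenset({"commercial", "sensitive"})}


def is_allowed(source_class, target_class, incompatible=INCOMPATIBLE):
    """True unless the two classes form a forbidden pair (in either direction)."""
    return frozenset({source_class, target_class}) not in incompatible


def filter_candidates(source_class, candidates, incompatible=INCOMPATIBLE):
    """candidates: iterable of (target_id, target_class) pairs."""
    return [
        target_id
        for target_id, target_class in candidates
        if is_allowed(source_class, target_class, incompatible)
    ]
```

Filtering before scoring, rather than subtracting a large penalty, means no combination of weights can ever let a forbidden edge through.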

Where we'd start

If you've got a programmatic site groaning under bad internal linking, don't rebuild everything. In order:

1. Export your current link graph from a crawl. Just CSV is fine.
2. Compute inbound counts per cluster and find the orphan tail. That's your first 80% win.
3. Stand up the pages and links tables, backfill from the crawl, and start computing scores for the orphan set only.
4. Ship a single new "related" module driven by the scorer, A/B it against the template version, and measure with GSC impressions per cluster over 4–6 weeks.
5. Only then expand to body links, breadcrumbs, and hub pages.
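The first two steps need nothing more than the standard library. This sketch assumes the crawl export has `source` and `target` columns and that you already have a URL-to-cluster mapping. The column names and the "below half the cluster median" cutoff are assumptions to adjust.

```python
import csv
from collections import Counter, defaultdict
from statistics import median


def inbound_counts(edge_csv):
    """Count inbound links per target from a file-like CSV of crawl edges."""
    counts = Counter()
    for row in csv.DictReader(edge_csv):
        if row["source"] != row["target"]:   # ignore self-links
            counts[row["target"]] += 1
    return counts


def orphan_tail(page_clusters, counts, ratio=0.5):
    """Pages whose inbound count falls below ratio * their cluster's median."""
    by_cluster = defaultdict(list)
    for url, cluster in page_clusters.items():
        by_cluster[cluster].append(url)
    orphans = []
    for urls in by_cluster.values():
        cutoff = ratio * median(counts[u] for u in urls)
        orphans.extend(u for u in urls if counts[u] < cutoff)
    return sorted(orphans)
```

Pass an open file, for example `inbound_counts(open("crawl_edges.csv", newline=""))` with a hypothetical filename. The sorted output is the work list for the first scoring run.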

The trap is treating this as a one-shot migration. It isn't. It's a system that runs forever, gets better with feedback, and pays back every time you add a thousand pages. If you want help designing one for your stack, that's the kind of work we do on our engineering services side.

