# Configuring My Site for AI Discoverability

Published: April 20, 2026
Tags: geo, seo, cloudflare, llms, web-development

How I set up this site for GEO. Raw Markdown, llms.txt, Content-Signal, and the Cloudflare bits that tie it all together.

A growing share of web traffic doesn't come from people anymore. It comes from models reading on their behalf. ChatGPT, Claude, Perplexity, Copilot. They fetch a handful of pages, summarize, and ship the answer back. If your site isn't readable by those agents, you don't exist to them.

People are calling this [GEO](https://wikipedia.org/wiki/Generative_engine_optimization), short for Generative Engine Optimization. It overlaps with SEO but the priorities are different. Agents don't care about your layout. They care about your prose, your metadata, and how many tokens it costs them to read you.

This post covers how I configured this site for GEO. The first half is framework-agnostic. The second half is specific to my setup on Cloudflare, and includes a deliberate choice that fails a popular GEO audit. I'll explain why.

## Part 1: general GEO techniques

### Serve raw Markdown alongside HTML

The single biggest GEO win is giving agents a version of each page without the navigation, styling, and scripts. HTML is designed for browsers. Markdown is designed for readers, human or otherwise. Agents spend their context window on your prose, not your DOM.

Every blog post on this site has a mirror URL with a `.md` suffix:

- `/blog/my-post` is the full HTML page for humans
- `/blog/my-post.md` is the raw Markdown, served as `text/markdown`

In Astro, this is a short route at `src/pages/blog/[slug].md.ts`:

```ts {4}
export const GET = async ({ params }) => {
  const post = await getPostById(params.slug);
  return new Response(formatPostMarkdown(post), {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
};
```

Both variants are pre-generated at build time. Same content, **roughly half the tokens** for an agent to consume.
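`formatPostMarkdown` in that route is a helper of mine; a minimal sketch of what it plausibly does (the field names here are assumptions, not my actual schema):

```ts
// Hypothetical sketch of the formatPostMarkdown helper used above.
// Title and description go up top so the .md variant stands alone,
// then the raw body follows untouched.
type Post = {
  title: string;
  description: string;
  body: string;
};

function formatPostMarkdown(post: Post): string {
  return [`# ${post.title}`, "", `> ${post.description}`, "", post.body].join("\n");
}
```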

### Advertise the Markdown version in `<head>`

Agents landing on the HTML need to know the Markdown exists. A single `<link>` in the head does it:

```html
<link rel="alternate" type="text/markdown" href="/blog/my-post.md" />
```

Browsers ignore this tag. Agents that parse the head follow it.

### Publish an `llms.txt` index

[`llms.txt`](https://llmstxt.org/) is a convention for a Markdown file at the root of your site listing your content with short descriptions and links. Think of it as a sitemap an LLM can actually read.

I ship two variants:

- `/llms.txt` is the index. Title, description, one line per post with a link to its `.md` version.
- `/llms-full.txt` is the full corpus. Every post body concatenated into a single response.

Why both? An agent researching a specific topic can fetch `llms.txt`, pick the relevant links, and pull them. An agent doing deep research on the site as a whole fetches `llms-full.txt` once and has everything it needs in one request. Either way there's no crawling.
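For concreteness, a trimmed sketch of the index variant, following the convention's shape of an H1, a blockquote summary, and link lists (titles, slugs, and descriptions here are placeholders, not my real entries):

```txt
# morello.dev

> Blog posts, one line each, linking to the token-cheap .md variants.

## Posts

- [Configuring My Site for AI Discoverability](https://morello.dev/blog/example-post.md): how this site is set up for GEO
- [Some Other Post](https://morello.dev/blog/another-post.md): one-line description
```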

### Declare your AI stance in `robots.txt`

`robots.txt` can also carry a `Content-Signal` directive, a newer convention for declaring how AI systems may use your content. Mine reads:

```txt {2}
User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
Sitemap: https://morello.dev/sitemap-index.xml
```

Three independent knobs:

- `search=yes` lets search engines index
- `ai-train=no` says my content is not for training data
- `ai-input=yes` says my content _can_ be retrieved and used as input for AI answers

This is the stance I'm comfortable with. I want to show up when someone asks Claude about something I've written; I just don't want my posts absorbed into the next base model.

> Whether any given operator actually honors this is another question. The signal's there regardless, and I'd rather be on record than silent about it.

### Add structured data that actually describes the content

Most blogs ship JSON-LD schema by reflex. Few of them include the fields that help a generative engine decide whether your article is worth fetching.

On each post I emit a `BlogPosting` graph with:

- `wordCount` and `timeRequired` (ISO 8601 duration), so an agent can estimate how much context it'll spend before fetching
- `articleBody`, the full post text in machine-readable form, no HTML parsing required
- `author` linked to a `Person` node with `knowsAbout` so the entity is grounded in real topics
- `BreadcrumbList` for site hierarchy

All of it goes into a single `@graph` per page rather than scattered `<script>` tags, which makes it cheaper for an engine to walk from post to author to site without cross-referencing.
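Trimmed down, the shape of that graph looks something like this (the counts, names, and topics below are illustrative, not my actual values):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BlogPosting",
      "headline": "Configuring My Site for AI Discoverability",
      "wordCount": 1800,
      "timeRequired": "PT8M",
      "articleBody": "…",
      "author": { "@id": "#author" }
    },
    {
      "@type": "Person",
      "@id": "#author",
      "name": "…",
      "knowsAbout": ["web development", "…"]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Blog", "item": "https://morello.dev/blog" }
      ]
    }
  ]
}
</script>
```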

### A sitemap that actually tracks freshness

If you regenerate your sitemap once and never look at it again, you're wasting a signal. Every URL in mine carries a `lastmod` timestamp pulled from the post's `updatedDate` frontmatter, falling back to `pubDate`. When I edit an old post, its `lastmod` moves forward and crawlers reprioritize it.
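In the generated XML that comes out as one `<url>` entry per page (URL and date illustrative):

```xml
<url>
  <loc>https://morello.dev/blog/my-post</loc>
  <lastmod>2026-03-01</lastmod>
</url>
```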

### Validate with real tools

Two tools I found useful while iterating on all of the above:

- [isitagentready.com](https://isitagentready.com/) audits across five categories: discoverability, content accessibility, bot access control, protocol discovery, and commerce. The bot access control checks (`Content-Signal`, Web Bot Auth, AI bot rules) are the part that actually influences how agents treat your content.
- [acceptmarkdown.com](https://acceptmarkdown.com/) has a narrower focus. It checks whether your site responds to `Accept: text/markdown` with a Markdown body, includes `Vary: Accept`, returns `406` for unsupported types, and parses q-values correctly.

I'll come back to the second one at the end of the post, because my site deliberately fails it.

## Part 2: the Cloudflare-specific setup

General GEO gets you most of the way there. The rest is delivery. How fast you respond, whether the edge caches correctly, and how you advertise your agent-facing resources without waiting for someone to parse your HTML.

### Static assets, zero Worker invocations

My `wrangler.jsonc` points [Cloudflare's assets deployment](https://developers.cloudflare.com/workers/static-assets/) at the `./dist` directory, with no `main` entry:

```jsonc
{
  "name": "morellodev",
  "compatibility_date": "2026-04-18",
  "assets": {
    "directory": "./dist",
    "html_handling": "drop-trailing-slash",
    "not_found_handling": "404-page",
  },
}
```

Every request is served straight from the edge asset cache. HTML, Markdown, `llms.txt`, sitemap, RSS. Same path for all of them, and no Worker ever runs. On the Workers Free tier this matters. A crawler sweep that would otherwise eat into 100k daily invocations now costs me nothing. Agents, for better or worse, don't crawl politely.

### Advertise discovery endpoints in a `Link` header

Cloudflare's [`_headers` file](https://developers.cloudflare.com/workers/static-assets/headers/) lets you ship response headers without any server code. I use it to tell every response, not just HTML ones, where the agent-facing files live:

```txt
/*
  Link: </sitemap-index.xml>; rel="sitemap"
  Link: </rss.xml>; rel="alternate"; type="application/rss+xml"; title="RSS"
  Link: </llms.txt>; rel="describedby"; type="text/plain"
  Link: </llms-full.txt>; rel="describedby"; type="text/plain"
```

A crawler doing a `HEAD` against any URL on the site sees all four links before it parses a single byte of HTML. **One round-trip, no body, full discovery.**

### Long-lived cache for hashed assets

Astro emits fingerprinted filenames under `/_astro/`, so those can sit in cache for a year:

```txt
/_astro/*
  Cache-Control: public, max-age=31536000, immutable
```

Faster first paint for humans, cheaper crawls for agents. Same lever.

### Why I skipped `Accept: text/markdown` content negotiation

[acceptmarkdown.com](https://acceptmarkdown.com/) will tell you this site doesn't do content negotiation. No `Vary: Accept`, no `406`, no Markdown from the canonical URL. That's not an oversight. I tried it, shipped it briefly, and rolled it back.

The reason is Cloudflare's free plan. Custom cache keys are Enterprise-only, and [their docs are explicit](https://developers.cloudflare.com/cache/concepts/cache-control/) that `Vary: Accept` is ignored for caching decisions. The edge collapses every variant of `/blog/my-post` into one cache entry, so the first requester's format **poisons the cache for everyone else** until TTL expires.

The workaround is a Worker that bypasses the edge cache. But now every `/blog/*` request burns a Worker invocation, humans included, and the [Workers Free plan](https://developers.cloudflare.com/workers/platform/pricing/) gives you 100k per day and 10ms of CPU each. That's a real budget to share across humans and bots, for no functional gain over a static `.md` URL.

So I deleted the Worker. The only thing I lost is `curl -H "Accept: text/markdown" …/blog/my-post` returning Markdown. Between `llms.txt`, `<link rel="alternate">`, and the `/blog/[slug].md` convention, no mainstream agent I've seen actually needs `Accept` negotiation. Content negotiation is the more elegant protocol; alternate URLs are the more robust one on a free-tier CDN. On a paid plan I'd probably do both.
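For completeness, the negotiation logic itself is small. A sketch of the q-value comparison the rolled-back Worker performed, based on my reading of RFC 9110 (illustrative, not Cloudflare-specific code):

```ts
// Decide whether a request prefers text/markdown over text/html,
// honoring q-values and wildcard specificity (exact > type/* > */*).
function qValue(accept: string, type: string): number {
  let best = -1; // specificity of the best match so far (-1 = none)
  let q = 0;
  for (const part of accept.split(",")) {
    const [media, ...params] = part.trim().split(";");
    const m = media.trim().toLowerCase();
    const spec =
      m === type ? 2 : m === type.split("/")[0] + "/*" ? 1 : m === "*/*" ? 0 : -1;
    if (spec <= best) continue; // an equally or more specific range already matched
    const qp = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
    best = spec;
    q = qp ? parseFloat(qp.slice(2)) : 1; // q defaults to 1 when absent
  }
  return q;
}

function prefersMarkdown(accept: string): boolean {
  return qValue(accept, "text/markdown") > qValue(accept, "text/html");
}
```

A typical browser `Accept` header still resolves to HTML; `text/markdown;q=0.9, text/html;q=0.8` resolves to Markdown. The parsing was never the problem; the cache-key collision was.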

## Where this leaves things

Every page exists in two forms, both served from the edge. Agent-facing resources are advertised in response headers on every request, before any HTML gets parsed. Structured data tells engines what the article is and how much context it takes to read. `robots.txt` says what I'll allow and what I won't.

GEO is still very new. The standards are half-drafted, the tools disagree with each other, and half the signals I described above didn't exist two years ago. I fully expect to be rewriting parts of this post within six months, probably with a different opinion about Accept-based negotiation, once I've either moved off the free plan or found a workaround that doesn't involve a Worker. But for now: serve agents a version they can cheaply consume, be explicit about what you'll allow, and accept that the defaults aren't on your side.

If you're reading this via a summary from some assistant, hi. Thanks for the traffic.
