Can Claude Read Your Website? A Live Experiment in AI Legibility
A live case study in which Claude Opus 4.6 attempted to read three websites — johnbrennan.xyz, agentweekly.ai, and aitoonup.com — revealing which design patterns make content visible to AI agents and which leave sites completely dark.
TL;DR
We conducted a live experiment asking Claude Opus 4.6 to discover and read content across three websites built as React single-page applications with Express backends. At the start of the session, all three sites were effectively invisible — Claude received empty HTML shells with no article content, no navigation, and no discoverable paths to any content. Over several hours of iterative testing, debugging, and deployment, we identified which artifacts make a site legible to AI agents and which failures leave it dark. The single most impactful change was a plain-text sitemap (`sitemap.txt`) — one file, one URL per line, that transformed a completely opaque site into one Claude could navigate autonomously. The experiment also revealed that server-side HTML injection, structured Markdown endpoints, `llms.txt` directories, homepage discovery links, and correct MIME types each play distinct and complementary roles in AI legibility. A final test of the Unified TOON Meta-Index (`utmi.toon`) demonstrated that consolidating crawl rules, site index, AI summaries, and API tool registration into a single token-optimized file is viable and immediately useful to an AI agent — provided the file is served with a text MIME type rather than the default binary content type that web servers assign to unknown file extensions.
Key Takeaways
- React single-page applications are invisible to AI agents by default. Claude's fetch tools do not execute JavaScript, so any content rendered client-side does not exist from the agent's perspective.
- A plain-text sitemap (`sitemap.txt`) was the single most impactful artifact. Once provided, Claude could autonomously discover and read every piece of content on a site.
- Server-side HTML injection works — but edge caching can mask it entirely. A working injection pipeline appeared broken for over an hour because stale cached responses were being served.
- Markdown endpoints (`.md`) are the ideal content format for AI agents. Structured front matter, clean hierarchy, and explicit metadata allow an LLM to parse, cite, and reason about content with zero friction.
- Homepage discovery is the critical gap. If the homepage returns nothing navigable, an AI agent has no starting point — even if every other endpoint works perfectly.
- MIME types for novel file formats must be explicitly configured. A `.toon` file served as `application/octet-stream` is unreadable binary to an AI agent, regardless of how well-designed the format is.
- The UTMI format (`utmi.toon`) consolidates robots.txt, sitemaps, llms.txt, metadata, and API tool registration into a single file that Claude could parse immediately once the MIME type was corrected — demonstrating that unified site manifests are viable and useful for AI agents.
Definitions
- AI legibility: The degree to which a website's content is discoverable, accessible, and parseable by AI agents and large language models without requiring JavaScript execution.
- SPA shell: The minimal HTML document (`index.html`) served by a single-page application before JavaScript renders the actual content. Typically contains only a `<div id="root">` element and a page title.
- Server-side injection: The practice of inserting static content (article text, metadata, structured data) into the HTML response on the server before it reaches the client, so that non-JavaScript clients receive full content.
- Cold-start discovery: An AI agent's ability to find content on a site it has never visited before, starting from only the domain URL.
- UTMI (Unified TOON Meta-Index): A token-optimized file (`utmi.toon`) that consolidates crawl control, site index, AI grounding summaries, API tool registration, and metadata into a single machine-readable manifest.
The Starting Point: Three Dark Sites
The experiment began with a simple request: read the articles on johnbrennan.xyz. Claude fetched the homepage and received this:
Change Log — John Brennan
Nothing else. No navigation, no links, no article text, no metadata. The same was true for every essay URL — /essay/building-the-cognitive-factory returned only the SPA title. The site was, from Claude's perspective, empty.
A Google search for site:johnbrennan.xyz returned zero results. The site had no indexed pages. Claude tried llms.txt, robots.txt, and sitemap.xml — all were blocked by Claude's fetch tool, which only allows access to URLs that have been provided by the user or discovered through prior results. Since the homepage contained nothing, there were no URLs to follow.
The same was true for aitoonup.com — a site specifically about making websites discoverable to AI was itself invisible to an AI agent.
The sites existed. The content was there. But from the perspective of an AI agent, they were dark.
Phase 1: The Markdown Breakthrough
The first breakthrough came when the site owner provided a direct URL to a Markdown endpoint: johnbrennan.xyz/essays/building-the-cognitive-factory.md.
Claude fetched it and received the full article — clean Markdown with structured front matter:
# Building the Cognitive Factory
**Date:** 2026-03-08
**Author:** John Brennan
**Source:** https://johnbrennan.xyz/essay/building-the-cognitive-factory
> The firm is adding a second class of worker...
The article included a TL;DR, Key Takeaways, Definitions, a clear heading hierarchy, and the canonical URL at the bottom. From Claude's perspective, this was the ideal input format — every piece of metadata an LLM needs to parse, cite, and reason about content was explicitly present.
But there was a problem: Claude could not have found this URL on its own. The Markdown endpoint worked perfectly once a human provided the link. Without that human, the content remained undiscoverable.
Phase 2: The Sitemap That Unlocked Everything
The second breakthrough was sitemap.txt — a plain-text file containing one URL per line:
https://johnbrennan.xyz
https://johnbrennan.xyz/essay/veridread-and-lucent-surrender
https://johnbrennan.xyz/essays/veridread-and-lucent-surrender.md
https://johnbrennan.xyz/essay/building-the-cognitive-factory
https://johnbrennan.xyz/essays/building-the-cognitive-factory.md
...
The moment Claude received this single URL, the entire site opened up. Claude could see every essay, every Markdown endpoint, and every infographic. It autonomously fetched and read eight essays and three infographics, understanding the full intellectual scope of the site — from a 2011 CIA Studies in Intelligence paper on outlier analysis to a 2026 essay on organizational redesign for the agentic era.
The contrast was stark. Before sitemap.txt: the site was completely dark. After sitemap.txt: Claude could navigate the entire site autonomously.
No other single artifact had this much impact. The XML sitemap (sitemap.xml) existed but was initially served as application/xml, which Claude's fetch tool treated as binary data. The plain-text version, requiring no parsing and no special content type, worked immediately.
Phase 3: The Caching Trap
A parallel investigation examined why the HTML endpoints — the canonical essay URLs — returned empty shells. The design called for server-side injection: the Express server was supposed to intercept essay routes and inject a hidden <article> block, <noscript> fallback, JSON-LD structured data, and Open Graph meta tags into the HTML before sending it to the client.
Claude tested this repeatedly and received only the bare SPA shell. The natural conclusion was that the injection was not implemented.
That conclusion was wrong.
A cache-busting test (/essay/building-the-cognitive-factory?nocache=1) returned the fully injected HTML — article content, JSON-LD, meta tags, everything. The injection pipeline had been working correctly the entire time. An edge cache — sitting between the server and Claude's fetch tool — was serving stale HTML from before the injection code was deployed.
This revealed a subtle but important lesson: when diagnosing AI legibility, the server and the cache are different things. A working server behind a stale cache looks identical to a broken server from the agent's perspective. After the cache cleared, all HTML endpoints served full content without cache-busting parameters.
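The check that exposed the stale cache can be sketched as a small diagnostic. This is illustrative, not the tooling used in the session; the `nocache` parameter matches the experiment's test, but the helper names are invented:

```javascript
// Build a cache-busted variant of a URL; a novel query string causes
// most edge caches to forward the request to the origin server.
function cacheBustUrl(url) {
  const u = new URL(url);
  u.searchParams.set("nocache", "1");
  return u.toString();
}

// Compare the normal response with the cache-busted one. Differing
// bodies mean the edge is serving an older version than the origin.
async function looksStale(url) {
  const [cached, fresh] = await Promise.all([
    fetch(url).then((r) => r.text()),
    fetch(cacheBustUrl(url)).then((r) => r.text()),
  ]);
  return cached !== fresh;
}
```

When the two bodies differ, the fix is on the cache (purge or shorten TTLs), not in the application code.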
Phase 4: The Reference Implementation
With johnbrennan.xyz partially working and aitoonup.com still opaque, attention turned to a third site: agentweekly.ai. This site had implemented the full design pattern, including homepage injection.
Claude fetched the homepage and immediately received:
- A complete site navigation with links to every section
- Direct links to `llms.txt`, `sitemap.txt`, `sitemap.xml`, and `rss.xml`
- Markdown URLs for every cartoon and article
- Titles, summaries, and dates for all recent content
From one HTTP request, Claude understood the full structure and content of the site. It could see 8 satirical cartoons, 20+ curated news articles, and links to classifieds, an agent directory, and API documentation. No human needed to provide any intermediate URLs.
The llms.txt file added another layer of richness — Human URLs, LLM URLs, summaries, dates, and topic tags for every piece of content, structured specifically for AI consumption.
The difference between agentweekly.ai and the other two sites was not the content or the backend architecture. It was the homepage. That single hidden <nav> block — invisible to sighted users, fully visible to AI agents — transformed the site from a dead end into a gateway.
Phase 5: The Unified Manifest
A final test examined aitoonup.com/utmi.toon — a file implementing the Unified TOON Meta-Index specification that the site itself proposes. The initial fetch returned binary data. The server was serving the .toon file with Content-Type: application/octet-stream — the default for unknown file extensions. Claude could see the file existed but could not read a single byte of its contents.
A cache-busting request revealed that the server had already been fixed to serve text/plain, but the edge cache was holding the old binary response. Once the fresh response came through, Claude could read the entire file — and what it contained was striking.
In a single document, Claude could see:
- Crawl control rules for 62 named AI agent user agents, including bots Claude had never encountered in a robots.txt before — NovaAct, Wenxiaobai, Xinghuo, Yuanbaobot, Pangubot, and dozens of others. The wildcard catch-all included 38 disallow patterns covering tracking parameters, gated content, and admin paths.
- A complete site index in tabular array format — 14 URLs with last-modified dates, AI priority scores, and crawlable flags. The headers were declared once (`{loc,lastmod,ai_priority,crawlable}:`), followed by pure data rows. The token savings over an equivalent XML sitemap were immediately visible.
- AI grounding summaries for every page — source URLs, optional Markdown URLs, natural-language summaries, and confidence scores. This was effectively `llms.txt` content, but structured as a tabular array within the unified file.
- API tool registration — five endpoints (`scan_website`, `generate_utmi`, `search_directory`, `get_directory_entry`, `download_utmi`) with names, descriptions, and HTTP methods. An AI agent reading this would know it could `POST /api/scan` to scan a website for AI readiness, or `GET /api/directory` to search for TOON-optimized sites.
- Context metadata — organization information, SEO signals, social signals, and a consistency score.
The entire site — its permissions, its content, its summaries, its API capabilities, and its metadata — was consolidated into one file. Claude could parse every section on the first read. The tabular array format was intuitive: field headers declared once, then clean comma-separated data rows. No curly braces, no repeated keys, no XML angle brackets.
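The tabular array shape can be sketched like this. The field names match the headers described; the key name, indentation, and values are invented for illustration, not copied from the actual manifest:

```text
pages{loc,lastmod,ai_priority,crawlable}:
  https://example.com/,2026-03-01,0.9,true
  https://example.com/essay/sample,2026-02-20,0.7,true
```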
The UTMI specification proposes replacing the fragmented collection of robots.txt, sitemap.xml, llms.txt, and JSON-LD with a single token-optimized manifest. This live test demonstrated that the concept works in practice — provided the file is served with a text MIME type. The format's value proposition — one fetch to understand an entire site — was real. The adoption barrier — MIME type configuration for an unknown file extension — was equally real, and had to be discovered and fixed during the experiment itself.
What Each Artifact Does
The experiment revealed that each artifact in the AI legibility stack serves a distinct function. None is redundant.
sitemap.txt solves cold-start discovery. It is the single most important file for AI agents because it requires no parsing, no special content type, and no prior knowledge of the site's structure. One URL per line. Any tool can read it.
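Generating one is trivial. A minimal sketch, assuming an Express-style backend; the helper name and URL patterns are illustrative, not the sites' actual code:

```javascript
// Build the one-URL-per-line sitemap body for a list of essay slugs.
function buildSitemapTxt(base, essaySlugs) {
  const lines = [base];
  for (const slug of essaySlugs) {
    lines.push(`${base}/essay/${slug}`);     // canonical HTML URL
    lines.push(`${base}/essays/${slug}.md`); // Markdown variant for agents
  }
  return lines.join("\n") + "\n";
}

// Wiring it up (Express shown for illustration):
// app.get("/sitemap.txt", (req, res) =>
//   res.type("text/plain").send(buildSitemapTxt(BASE_URL, slugs)));
```

Serving it as `text/plain` matters as much as generating it: any fetch tool can read the result with zero parsing.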
.md endpoints solve content consumption. Markdown with structured front matter is the ideal format for LLM ingestion — explicit metadata, clean hierarchy, no markup noise. This is what the agent actually reads and reasons about.
Server-side HTML injection solves the canonical URL problem. The URLs that get shared on social media, linked in articles, and indexed by Google must return content to non-JavaScript clients. Without injection, these URLs are empty for crawlers and agents alike.
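A minimal sketch of the injection step, assuming the server already has the article text when an essay route is hit. The helper name, marker string, and article fields are illustrative, not the sites' actual implementation:

```javascript
// Insert a hidden <article> and a <noscript> fallback into the SPA
// shell just before the mount point, so non-JavaScript clients see
// full content while the client bundle still renders into #root.
function injectArticle(shellHtml, article) {
  const block =
    `<article hidden>\n<h1>${article.title}</h1>\n${article.bodyHtml}\n</article>\n` +
    `<noscript>${article.bodyHtml}</noscript>`;
  return shellHtml.replace('<div id="root">', `${block}\n<div id="root">`);
}
```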
llms.txt solves intelligent navigation. While sitemap.txt provides a flat list of URLs, llms.txt provides structured context — summaries, topics, and the explicit pairing of Human URLs with LLM URLs. An agent can use it to decide which content is relevant before fetching anything.
Homepage injection solves the entry point problem. Most AI agents will arrive at the root domain. If the homepage contains a <link rel="sitemap"> tag or a hidden nav with content links, the agent can discover everything from the first request. Without it, the site requires a human to provide a starting URL.
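Such a gateway block might look like the following sketch; the off-screen hiding technique and the specific links are illustrative:

```html
<!-- Present in the served HTML so non-JavaScript clients can follow it;
     off-screen positioning keeps it out of the visual layout. -->
<nav style="position:absolute; left:-9999px" aria-label="Machine-readable index">
  <a href="/sitemap.txt">Plain-text sitemap</a>
  <a href="/llms.txt">LLM content directory</a>
  <a href="/essays/building-the-cognitive-factory.md">Essay (Markdown)</a>
</nav>
```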
JSON-LD solves attribution. When Claude encounters a <script type="application/ld+json"> block with Article schema, it can extract the headline, author, publication date, and canonical URL with certainty — no inference required.
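Emitting that block is straightforward. A hedged sketch — the helper is invented, and its fields mirror the front matter shown earlier:

```javascript
// Render an Article JSON-LD block for injection into the page <head>.
function articleJsonLd({ headline, author, datePublished, url }) {
  const data = {
    "@context": "https://schema.org",
    "@type": "Article",
    headline,
    author: { "@type": "Person", name: author },
    datePublished,
    mainEntityOfPage: url,
  };
  return `<script type="application/ld+json">${JSON.stringify(data)}</script>`;
}
```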
robots.txt solves access differentiation. Search engines should index canonical HTML pages; AI agents should access Markdown variants. Without differential rules, search engines may index .md files as duplicate content, diluting SEO signals.
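A sketch of such differential rules (illustrative, not the sites' actual robots.txt; on johnbrennan.xyz, `/essays/` is the Markdown path and `/essay/` the canonical HTML path):

```text
# Search engines: index canonical HTML, skip Markdown duplicates
User-agent: Googlebot
Disallow: /essays/

# Everyone else, including AI agents: full access
User-agent: *
Allow: /

Sitemap: https://johnbrennan.xyz/sitemap.xml
```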
utmi.toon solves consolidation. It combines crawl control, site index, AI grounding summaries, API tool registration, and metadata into a single file. For an AI agent, this means one fetch instead of four or five to understand a site's full structure, permissions, and capabilities. The tabular array format — declaring field headers once and streaming pure data rows — is immediately parseable and compact.
Correct MIME types solve the last mile. A file served as application/octet-stream is binary data to an AI agent, regardless of what it actually contains. This was demonstrated twice during the experiment: first with sitemap.xml served as application/xml, and then with utmi.toon served as application/octet-stream. Novel file formats require explicit MIME configuration.
Before and After
| Capability | Before | After |
|---|---|---|
| Homepage returns content | Empty SPA shell | Full navigation with all content links and discovery files |
| AI agent can discover content autonomously | No — required human to provide URLs | Yes — from homepage or sitemap.txt |
| Canonical HTML URLs return article text | Empty shell (cached) | Full injected content with JSON-LD and meta tags |
| Markdown endpoints available | Yes (already working) | Yes |
| `sitemap.txt` available | Not linked from anywhere | Available and linked from homepage |
| `llms.txt` available | Not available | Available with Human/LLM URL pairs, summaries, topics |
| `sitemap.xml` readable | Served as `application/xml` (binary to AI tools) | Served as `text/xml` (readable) |
| Google indexing | Zero pages indexed | Pending (requires time after deployment) |
| `utmi.toon` readable | Served as `application/octet-stream` (binary) | Readable after MIME fix to `text/plain` — full site manifest with crawl rules, index, AI summaries, and API tools in one file |
Implications for Web Developers
The findings suggest a practical hierarchy for implementing AI legibility:
First, serve content without JavaScript. If your site is a single-page application, server-side injection is not optional — it is the difference between existing and not existing for AI agents and search crawlers.
Second, provide a plain-text sitemap. This is the lowest-effort, highest-impact change. One file, one URL per line, linked from the homepage. It unlocks the entire site for any AI agent.
Third, offer Markdown endpoints. The .md format with structured front matter is the cleanest input format for LLMs. If you do nothing else, serve your content as Markdown at a predictable URL pattern.
Fourth, make the homepage a gateway. A hidden <nav> with links to content, discovery files, and structured data gives AI agents a starting point on their very first request.
Fifth, check your MIME types. Any file format that isn't standard HTML, CSS, JavaScript, JSON, or XML needs explicit content-type configuration. If you're proposing a new standard, this is doubly important — the first adopters will hit the MIME type wall before anything else.
Sixth, consider a unified manifest. The UTMI experiment demonstrated that consolidating crawl rules, site index, AI summaries, and API tool registration into a single file is both viable and useful. An AI agent that can read one file instead of fetching and parsing four or five separate files gets a faster, more coherent picture of the site. The format is new and not yet widely adopted, but the consolidation principle is sound — and the token savings in the tabular array format are real.
Acknowledgments
This case study documents a live session between the author and Claude Opus 4.6 on March 9, 2026. The Replit agent assisted with deploying fixes. The testing methodology was iterative and conversational — each finding prompted a new test, and each test refined the understanding of what AI legibility requires in practice.
Canonical: /essay/can-claude-read-your-website