I recently added a RAG-backed chat layer to aleph.fi. The original goal was not to bolt an LLM onto a portfolio just because that is what people are doing now. I wanted something more specific: a site where visitors could ask normal questions about my work and get grounded answers from the actual body of content behind the site.
That means questions like:
- “What did he build at Secret Escapes?”
- “What are his strongest skills?”
- “Which projects show ML plus product work?”
- “Has he written anything relevant to RAG or Rust?”
And importantly, I wanted those answers to come from a real retrieval pipeline rather than a vague system prompt plus vibes.
So the project ended up becoming a fairly compact but real production system:
- an offline content pipeline
- embeddings generated with Workers AI
- hybrid retrieval inside a Cloudflare Worker
- streaming responses via OpenRouter
- tool calling for project and blog post cards
- follow-up suggestion generation
- abuse protection with Turnstile, origin checks, and rate limiting
The shape of the system
At a high level, the system looks like this:
Offline indexing
┌──────────────────────┐      ┌──────────────────────┐
│  knowledge-base.md   │      │  blog posts (main)   │
└──────────┬───────────┘      └──────────┬───────────┘
           └──────────────┬──────────────┘
                          ▼
                     extract.ts
                          ▼
                     chunks.json
                          ▼
                      embed.ts
                          ▼
                     corpus.json
                          ▼
         bundled into the Cloudflare Worker
Runtime
┌────────────────────────────┐
│          Browser           │
│   ChatWidget + Turnstile   │
└─────────────┬──────────────┘
              │ POST /api/chat + token
              ▼
┌───────────────────────────────────────────────┐
│               Cloudflare Worker               │
│  - validate token / origin / rate limits      │
│  - embed query with Workers AI                │
│  - retrieve top-8 chunks from corpus.json     │
│  - build system prompt                        │
│  - stream answer from OpenRouter              │
│  - optionally emit cards + follow-ups         │
└─────────────┬─────────────────────────────────┘
              │ SSE stream
              ▼
┌────────────────────────────┐
│      Browser renders       │
│  text, cards, follow-ups   │
└────────────────────────────┘
The backend is a single Worker with one endpoint, POST /api/chat. There is no separate API server and no external vector database. For this project size that would have been unnecessary complexity.
The main Worker files are roughly split like this:
- worker/src/index.ts: request validation, Turnstile verification, rate limiting, orchestration, SSE streaming
- worker/src/rag.ts: cosine similarity, keyword boosting, prompt construction, content lookup
- worker/src/openrouter.ts: streaming OpenRouter client, tool definitions, follow-up generation
- worker/scripts/extract.ts: content extraction and chunking
- worker/scripts/embed.ts: embedding generation
On the frontend, the chat UI lives in src/components/ChatWidget.tsx and talks to the Worker over SSE.
The offline content pipeline
The RAG corpus comes from two places:
- a curated knowledge base in worker/content/knowledge-base.md
- blog posts pulled from the main branch
That branch split is important. The portfolio site deploys from root, while the blog content lives on main. So the extraction script fetches blog content from main via git archive instead of assuming everything lives in the checked out branch.
The extraction step writes worker/data/chunks.json.
For the knowledge base, the script splits on ## headings, and sections with ### subheadings get split again into sub-chunks. That turned out to matter a lot for work history because a single job entry can contain several different retrieval targets: company overview, a recommender system, infra work, cost reduction, and so on.
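As a rough sketch, the heading-based splitting might look like this (the function and type names here are illustrative, not the actual extract.ts code):

```typescript
interface Chunk {
  section: string;
  text: string;
}

// Split "## " sections into chunks; sections containing "### "
// subheadings are split again into sub-chunks so each retrieval
// target (e.g. one project within a job entry) gets its own chunk.
function chunkMarkdown(md: string): Chunk[] {
  const chunks: Chunk[] = [];
  const sections = md.split(/^## /m).filter((s) => s.trim().length > 0);
  for (const section of sections) {
    const [heading, ...rest] = section.split("\n");
    const body = rest.join("\n");
    const parts = body.split(/^### /m);
    // Text before the first subheading (or the whole body) stays
    // attached to the parent section.
    const intro = parts.shift()!.trim();
    if (intro.length > 0) {
      chunks.push({ section: heading.trim(), text: intro });
    }
    for (const sub of parts) {
      const [subHeading, ...subRest] = sub.split("\n");
      chunks.push({
        section: `${heading.trim()} / ${subHeading.trim()}`,
        text: subRest.join("\n").trim(),
      });
    }
  }
  return chunks;
}
```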
For blog posts, the extractor does a few cleanup steps:
- skips drafts
- skips posts not authored by me
- strips code blocks and replaces them with [code example]
- keeps short posts as one chunk
- splits long posts by section headings
At the moment the corpus is small, roughly 117 chunks. That is small enough that brute-force in-memory cosine similarity is completely fine inside a Worker.
Then embed.ts calls Cloudflare Workers AI to generate embeddings using @cf/baai/bge-base-en-v1.5, a 768-dimensional embedding model. The same model is used both offline and at query time. That consistency matters; mixing embedding models across indexing and retrieval is a great way to get quietly bad search.
The output is worker/data/corpus.json, which contains both the text chunks and the precomputed embeddings. That file is committed and bundled into the Worker at deploy time.
This is also why I did not bother with a vector database. At this scale, a vector database would mainly add latency, cost, and operational surface area. In-memory cosine similarity over ~117 vectors is trivial.
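The brute-force path is short enough to show in full. A minimal sketch, assuming each corpus entry carries its precomputed embedding (names are illustrative):

```typescript
interface ScoredChunk {
  id: string;
  score: number;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every chunk against the query embedding and keep the top k.
// At ~117 vectors this full scan is microseconds of work.
function topK(
  query: number[],
  corpus: { id: string; embedding: number[] }[],
  k = 8,
): ScoredChunk[] {
  return corpus
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```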
What happens at query time
When a user sends a message, the Worker does the following:
- validates the Turnstile token
- checks the request origin
- rate limits by IP
- validates message count and size
- embeds the latest user message with Workers AI
- retrieves the top 8 chunks from the in-memory corpus
- builds a system prompt containing those chunks plus metadata
- streams the answer from OpenRouter
- optionally handles tool calls
- generates follow-up suggestions with a second cheaper model
The system prompt is intentionally strict. It tells the model to:
- answer only from retrieved context
- refer to me in the third person
- stay concise and professional
- refuse off-topic or prompt-injection attempts
- call a content recommendation tool when posts or projects are relevant
Each chunk is formatted with its chunk ID, source, type, section name, and relevance score. That chunk ID becomes useful when the model decides to call a tool and point the UI to the exact project or post it is talking about.
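The context formatting can be sketched roughly like this (field names are my illustration of the metadata described above, not the actual rag.ts code):

```typescript
interface RetrievedChunk {
  id: string;
  source: string;
  type: string;
  section: string;
  score: number;
  text: string;
}

// Render retrieved chunks into the system prompt, keeping the chunk ID
// visible so the model can reference it in a later tool call.
function formatContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map(
      (c) =>
        `[${c.id}] (source: ${c.source}, type: ${c.type}, ` +
        `section: ${c.section}, relevance: ${c.score.toFixed(3)})\n${c.text}`,
    )
    .join("\n\n");
}
```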
For inference I used OpenRouter. The main answer path uses anthropic/claude-sonnet-4, streamed, with tools enabled. Follow-up questions are generated with a second cheaper model, anthropic/claude-haiku-4.5, after the main answer finishes.
Retrieval quality was the real problem
The easy part of RAG is getting text in and out. The annoying part is getting the right text in the middle.
The retrieval path is hybrid:
- base score: cosine similarity
- extra boost: keyword-based section nudges
The keyword boost rules are small, intentionally so. They add a boost of about +0.03 to +0.05 when the query strongly suggests a category like employment, skills, or education.
For example, if a user asks about work history, resume, CV, roles, or companies, chunks whose section looks like employment get a small bump. I added this because semantically rich chunks can sometimes outrank more obviously relevant ones, especially in a small corpus where some chunks are dense and broad.
One concrete case was work experience queries not reliably surfacing the right Secret Escapes chunks high enough. The fix was not to add a whole reranking service. It was to keep the architecture simple and add a light keyword boost layer in rag.ts.
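That boost layer fits in a handful of lines. A sketch of the shape, with an illustrative rule table rather than the actual rag.ts rules:

```typescript
// Hybrid scoring: base cosine score plus a small nudge when the query
// clearly signals a section category. Boost sizes mirror the +0.03 to
// +0.05 range described above; the rules themselves are illustrative.
const BOOST_RULES: { keywords: RegExp; section: RegExp; boost: number }[] = [
  { keywords: /\b(work|job|role|company|resume|cv|career)\b/i, section: /employment|experience/i, boost: 0.05 },
  { keywords: /\b(skill|stack|technolog|language)\b/i, section: /skills/i, boost: 0.04 },
  { keywords: /\b(degree|university|educat)\b/i, section: /education/i, boost: 0.03 },
];

function boostedScore(query: string, section: string, cosineScore: number): number {
  let score = cosineScore;
  for (const rule of BOOST_RULES) {
    if (rule.keywords.test(query) && rule.section.test(section)) {
      score += rule.boost;
    }
  }
  return score;
}
```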
I also added a debug endpoint, GET /api/debug, behind a secret. It returns chunk IDs, scores, section names, types, and text previews for a given query. That was extremely useful for diagnosing why some answers felt “kind of right” but not actually useful.
Tool calling made the UI much better
Plain text answers are fine, but they are not the best UX when the real goal is “show me the relevant thing”.
So the model can call a recommend_content tool. Instead of making the model invent URLs or awkwardly paste them into prose, the Worker resolves chunk IDs into structured cards.
Those cards use metadata attached during extraction:
- contentId
- type
- title
- url
- description
- tags
- date
That lets the frontend render actual clickable project and blog post cards under the response.
There is also a show_contact_card tool for obvious collaboration or contact intent.
The Worker streams these events back over SSE using named event types:
- default data: events for text deltas
- event: references for content cards
- event: contact for the contact card
- event: followups for suggested next questions
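On the receiving end, a simplified version of the parsing could look like this (it assumes each frame arrives whole, terminated by a blank line; a real parser like the one in ChatWidget.tsx has to buffer partial chunks):

```typescript
interface SseEvent {
  type: string; // "message" (default), "references", "contact", "followups"
  data: string;
}

// Parse raw SSE text into named events: an optional "event:" line plus
// a "data:" line per frame, frames separated by a blank line.
function parseSseFrames(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of raw.split("\n\n")) {
    let type = "message";
    let data = "";
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) type = line.slice(6).trim();
      else if (line.startsWith("data:")) data += line.slice(5).trim();
    }
    if (data) events.push({ type, data });
  }
  return events;
}
```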
That split keeps the frontend simple. The parser in ChatWidget.tsx just attaches cards and follow-ups to the current assistant message as the stream arrives.
One subtle bug here was that the model could stream text and also issue a tool call in the same response. My original orchestration always did a second LLM call after tools, which created duplicate-feeling responses. The fix was to track whether the first streamed response already had text and skip the second call when it did.
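The fix reduces to one predicate (a sketch; the real orchestration tracks this state across the stream):

```typescript
interface StreamResult {
  text: string;
  toolCalls: { name: string; args: unknown }[];
}

// Only make the post-tool LLM call when the first streamed response
// produced no prose alongside its tool call; otherwise the second
// call tends to produce a near-duplicate answer.
function needsSecondCall(first: StreamResult): boolean {
  return first.toolCalls.length > 0 && first.text.trim().length === 0;
}
```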
Follow-up suggestions are a separate model call
After each answer, the Worker makes a second non-streaming call to Haiku and asks for exactly three short follow-up questions as JSON.
I like this for two reasons:
First, it improves discoverability. A visitor often does not know what the corpus contains, so a few good follow-ups help them explore.
Second, it keeps the main answer path clean. The main model focuses on answering the question. The follow-up model focuses on navigation.
The nice thing is that this is cheap. The follow-up generation is roughly a couple hundred input tokens plus a tiny output, which works out to a negligible extra cost per message. If it fails, the Worker just returns no follow-ups and the main chat still works.
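The fail-open handling might be sketched like this (the function name and fence-stripping heuristic are my assumptions, not the actual worker code):

```typescript
// Ask the cheaper model for follow-ups as a JSON array and fail open:
// any parse problem returns no follow-ups instead of breaking the turn.
function parseFollowUps(modelOutput: string): string[] {
  try {
    // Models sometimes wrap JSON in prose or code fences, so grab the
    // first [...] span before parsing.
    const start = modelOutput.indexOf("[");
    const end = modelOutput.lastIndexOf("]");
    if (start === -1 || end <= start) return [];
    const parsed = JSON.parse(modelOutput.slice(start, end + 1));
    if (Array.isArray(parsed) && parsed.every((q) => typeof q === "string")) {
      return parsed.slice(0, 3);
    }
  } catch {
    // fall through to the empty fallback
  }
  return [];
}
```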
Abuse prevention mattered more than I expected
Public chat endpoints get weird traffic almost immediately.
So the Worker has a few layers of protection:
- Turnstile verification
- server-side origin checks
- rate limiting: 20 requests per minute per IP
- max 20 messages per conversation
- max 1000 characters per user message
- max 32 KB request body
- prompt-level guardrails
- an OpenRouter API key with a spending cap
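For the rate limit, a fixed-window sketch conveys the shape. Note the in-memory Map is a simplification: a single Worker isolate's memory is not shared globally, so a production version needs KV, a Durable Object, or a rate limiting binding.

```typescript
// Fixed-window limiter: 20 requests per minute per IP.
// Illustrative only; per-isolate memory does not give a global limit.
const WINDOW_MS = 60_000;
const LIMIT = 20;
const windows = new Map<string, { start: number; count: number }>();

function allowRequest(ip: string, now: number = Date.now()): boolean {
  const w = windows.get(ip);
  if (!w || now - w.start >= WINDOW_MS) {
    // First request in a fresh window.
    windows.set(ip, { start: now, count: 1 });
    return true;
  }
  w.count += 1;
  return w.count <= LIMIT;
}
```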
Turnstile is especially useful because it means someone cannot just curl the API from anywhere and burn tokens. They need to render the site in a real browser session and obtain a valid domain-bound token.
I also recently added retries around token acquisition because some requests were failing on the frontend before they even reached the model. In practice this was mostly a token freshness issue, not an LLM issue.
The frontend is more than a bubble now
The original UI was a floating bubble in the corner. That is still there as a mode, but I ended up making the homepage itself a chat-first interactive CV.
Technically the frontend is still quite simple:
- Astro renders the page shell
- React handles the chat state in ChatWidget.tsx
- SSE streams tokens directly into the current assistant message
- markdown is rendered client-side
- starter prompts and first-visit hints help users get going
The main architectural point here is that the UI is not pulling from a polling API or waiting for a full response blob. It is event-driven and incremental. That keeps the chat feeling responsive even though under the hood it may involve retrieval, streaming generation, tool resolution, and then a second follow-up generation pass.
A few production bugs were more interesting than the happy path
There were a handful of issues that were much more educational than writing the initial feature.
Turnstile did not like display: none
The invisible Turnstile iframe would not work properly when hidden with display: none. The fix was to hide it using zero dimensions and absolute positioning instead.
WebDAV deploys were serving stale HTML
The portfolio deploy goes to Fastmail WebDAV via rclone. Because WebDAV does not support checksums in the way I wanted, rclone sync --checksum quietly degraded into something close to size-based behavior, which meant HTML files with identical sizes could remain stale across deploys.
The fix was a two-phase deploy and --ignore-times for the non-asset files.
Retrieval needed real instrumentation
Without the debug endpoint, retrieval issues were annoyingly fuzzy. You can feel that an answer is wrong, but until you see the ranked chunks and their scores, it is hard to know whether the problem is chunking, embeddings, prompt design, or source wording.
Source wording matters a lot
At one point the system kept overemphasizing LangGraph because the knowledge base literally described it as one of the strongest skills. That was not the model being weird; that was the corpus doing exactly what I told it to do. RAG systems are surprisingly sensitive to small wording choices in curated source content.
Why I like this architecture
The whole thing is deliberately small.
- no vector DB
- no separate backend service
- no orchestration framework
- no giant indexing pipeline
Just content extraction, embeddings, an in-memory corpus, a Worker, and a chat UI.
That simplicity is part of why I like it. At this scale, it is easy to inspect, easy to redeploy, and easy to reason about when something breaks.
I think personal sites are a particularly good use case for this shape. A portfolio is already a structured corpus. It has a bounded domain, relatively stable content, and clear user questions. That makes it a much better target for RAG than the average “AI assistant for everything” product.
If you try it and notice weird retrieval, odd emphasis, or better ways the homepage should guide first-time visitors, let me know. This is one of those features where the first version is mostly an excuse to build a better second version.