I recently added a RAG-backed chat layer to aleph.fi. The original goal was not to bolt an LLM onto a portfolio just because that is what people are doing now. I wanted something more specific: a site where visitors could ask normal questions about my work and get grounded answers from the actual body of content behind the site.
That means questions like:
- “What did he build at Secret Escapes?”
- “What are his strongest skills?”
- “Which projects show ML plus product work?”
- “Has he written anything relevant to RAG or Rust?”
And importantly, I wanted those answers to come from a real retrieval pipeline rather than a vague system prompt plus vibes.
So the project ended up becoming a fairly compact but real production system:
- an offline content pipeline
- embeddings generated with Workers AI
- hybrid retrieval inside a Cloudflare Worker
- streaming responses via OpenRouter
- tool calling for project and blog post cards
- follow-up suggestion generation
- abuse protection with Turnstile, origin checks, and rate limiting
The shape of the system
At a high level, the system looks like this:
Offline indexing
┌──────────────────────┐      ┌──────────────────────┐
│  knowledge-base.md   │      │  blog posts (main)   │
└──────────┬───────────┘      └──────────┬───────────┘
           └──────────────┬──────────────┘
                          ▼
                     extract.ts
                          ▼
                     chunks.json
                          ▼
                      embed.ts
                          ▼
                     corpus.json
                          ▼
         bundled into the Cloudflare Worker
Runtime
┌────────────────────────────┐
│          Browser           │
│   ChatWidget + Turnstile   │
└─────────────┬──────────────┘
              │ POST /api/chat + token
              ▼
┌───────────────────────────────────────────────┐
│               Cloudflare Worker               │
│  - validate token / origin / rate limits      │
│  - embed query with Workers AI                │
│  - retrieve top-8 chunks from corpus.json     │
│  - build system prompt                        │
│  - stream answer from OpenRouter              │
│  - optionally emit cards + follow-ups         │
└─────────────┬─────────────────────────────────┘
              │ SSE stream
              ▼
┌────────────────────────────┐
│      Browser renders       │
│  text, cards, follow-ups   │
└────────────────────────────┘
The backend is a single Worker with one endpoint, POST /api/chat. There is no separate API server and no external vector database. For this project size that would have been unnecessary complexity.
The main Worker files are roughly split like this:
- worker/src/index.ts: request validation, Turnstile verification, rate limiting, orchestration, SSE streaming
- worker/src/rag.ts: cosine similarity, keyword boosting, prompt construction, content lookup
- worker/src/openrouter.ts: streaming OpenRouter client, tool definitions, follow-up generation
- worker/scripts/extract.ts: content extraction and chunking
- worker/scripts/embed.ts: embedding generation
On the frontend, the chat UI lives in src/components/ChatWidget.tsx and talks to the Worker over SSE.
The offline content pipeline
The RAG corpus comes from two places:
- a curated knowledge base in worker/content/knowledge-base.md
- blog posts pulled from the main branch
That branch split is important. The portfolio site deploys from root, while the blog content lives on main. So the extraction script fetches blog content from main via git archive instead of assuming everything lives in the checked out branch.
The extraction step writes worker/data/chunks.json.
For the knowledge base, the script splits on ## headings, and sections with ### subheadings get split again into sub-chunks. That turned out to matter a lot for work history because a single job entry can contain several different retrieval targets: company overview, a recommender system, infra work, cost reduction, and so on.
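As a rough sketch, the heading-based splitting might look like this (the function and type names here are illustrative, not the actual extract.ts code):

```typescript
interface Chunk {
  section: string;
  text: string;
}

// Split "## " sections into chunks; sections containing "### "
// subheadings are split again into sub-chunks so each retrieval
// target (e.g. one project within a job entry) gets its own chunk.
function chunkMarkdown(md: string): Chunk[] {
  const chunks: Chunk[] = [];
  const sections = md.split(/^## /m).filter((s) => s.trim().length > 0);
  for (const section of sections) {
    const [heading, ...rest] = section.split("\n");
    const body = rest.join("\n");
    const parts = body.split(/^### /m);
    // Text before the first subheading (or the whole body) stays
    // attached to the parent section.
    const intro = parts.shift()!.trim();
    if (intro.length > 0) {
      chunks.push({ section: heading.trim(), text: intro });
    }
    for (const sub of parts) {
      const [subHeading, ...subRest] = sub.split("\n");
      chunks.push({
        section: `${heading.trim()} / ${subHeading.trim()}`,
        text: subRest.join("\n").trim(),
      });
    }
  }
  return chunks;
}
```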
For blog posts, the extractor does a few cleanup steps:
- skips drafts
- skips posts not authored by me
- strips code blocks and replaces them with [code example]
- keeps short posts as one chunk
- splits long posts by section headings
At the moment the corpus is small, roughly 117 chunks. That is small enough that brute-force in-memory cosine similarity is completely fine inside a Worker.
Then embed.ts calls Cloudflare Workers AI to generate embeddings using @cf/baai/bge-base-en-v1.5, a 768-dimensional embedding model. The same model is used both offline and at query time. That consistency matters; mixing embedding models across indexing and retrieval is a great way to get quietly bad search.
The output is worker/data/corpus.json, which contains both the text chunks and the precomputed embeddings. That file is committed and bundled into the Worker at deploy time.
This is also why I did not bother with a vector database. At this scale, a vector database would mainly add latency, cost, and operational surface area. In-memory cosine similarity over ~117 vectors is trivial.
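The brute-force path is short enough to show in full. A minimal sketch, assuming each corpus entry carries its precomputed embedding (names are illustrative):

```typescript
interface ScoredChunk {
  id: string;
  score: number;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every chunk against the query embedding and keep the top k.
// At ~117 vectors this full scan is microseconds of work.
function topK(
  query: number[],
  corpus: { id: string; embedding: number[] }[],
  k = 8,
): ScoredChunk[] {
  return corpus
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```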
What happens at query time
When a user sends a message, the Worker does the following:
- validates the Turnstile token
- checks the request origin
- rate limits by IP
- validates message count and size
- embeds the latest user message with Workers AI
- retrieves the top 8 chunks from the in-memory corpus
- builds a system prompt containing those chunks plus metadata
- streams the answer from OpenRouter
- optionally handles tool calls
- generates follow-up suggestions with a second cheaper model
The system prompt is intentionally strict. It tells the model to:
- answer only from retrieved context
- refer to me in the third person
- stay concise and professional
- refuse off-topic or prompt-injection attempts
- call a content recommendation tool when posts or projects are relevant
Each chunk is formatted with its chunk ID, source, type, section name, and relevance score. That chunk ID becomes useful when the model decides to call a tool and point the UI to the exact project or post it is talking about.
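The context formatting can be sketched roughly like this (field names are my illustration of the metadata described above, not the actual rag.ts code):

```typescript
interface RetrievedChunk {
  id: string;
  source: string;
  type: string;
  section: string;
  score: number;
  text: string;
}

// Render retrieved chunks into the system prompt, keeping the chunk ID
// visible so the model can reference it in a later tool call.
function formatContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map(
      (c) =>
        `[${c.id}] (source: ${c.source}, type: ${c.type}, ` +
        `section: ${c.section}, relevance: ${c.score.toFixed(3)})\n${c.text}`,
    )
    .join("\n\n");
}
```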
For inference I used OpenRouter. The main answer path uses anthropic/claude-sonnet-4, streamed, with tools enabled. Follow-up questions are generated with a second cheaper model, anthropic/claude-haiku-4.5, after the main answer finishes.
Retrieval quality was the real problem
The easy part of RAG is getting text in and out. The annoying part is getting the right text in the middle.
The retrieval path is hybrid:
- base score: cosine similarity
- extra boost: keyword-based section nudges
The keyword boost rules are small, intentionally so. They add a boost of about +0.03 to +0.05 when the query strongly suggests a category like employment, skills, or education.
For example, if a user asks about work history, resume, CV, roles, or companies, chunks whose section looks like employment get a small bump. I added this because semantically rich chunks can sometimes outrank more obviously relevant ones, especially in a small corpus where some chunks are dense and broad.
One concrete case was work experience queries not reliably surfacing the right Secret Escapes chunks high enough. The fix was not to add a whole reranking service. It was to keep the architecture simple and add a light keyword boost layer in rag.ts.
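That boost layer fits in a handful of lines. A sketch of the shape, with an illustrative rule table rather than the actual rag.ts rules:

```typescript
// Hybrid scoring: base cosine score plus a small nudge when the query
// clearly signals a section category. Boost sizes mirror the +0.03 to
// +0.05 range described above; the rules themselves are illustrative.
const BOOST_RULES: { keywords: RegExp; section: RegExp; boost: number }[] = [
  { keywords: /\b(work|job|role|company|resume|cv|career)\b/i, section: /employment|experience/i, boost: 0.05 },
  { keywords: /\b(skill|stack|technolog|language)\b/i, section: /skills/i, boost: 0.04 },
  { keywords: /\b(degree|university|educat)\b/i, section: /education/i, boost: 0.03 },
];

function boostedScore(query: string, section: string, cosineScore: number): number {
  let score = cosineScore;
  for (const rule of BOOST_RULES) {
    if (rule.keywords.test(query) && rule.section.test(section)) {
      score += rule.boost;
    }
  }
  return score;
}
```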
I also added a debug endpoint, GET /api/debug, behind a secret. It returns chunk IDs, scores, section names, types, and text previews for a given query. That was extremely useful for diagnosing why some answers felt “kind of right” but not actually useful.
Tool calling made the UI much better
Plain text answers are fine, but they are not the best UX when the real goal is “show me the relevant thing”.
So the model can call a recommend_content tool. Instead of making the model invent URLs or awkwardly paste them into prose, the Worker resolves chunk IDs into structured cards.
Those cards use metadata attached during extraction:
- contentId
- type
- title
- url
- description
- tags
- date
That lets the frontend render actual clickable project and blog post cards under the response.
There is also a show_contact_card tool for obvious collaboration or contact intent.
The Worker streams these events back over SSE using named event types:
- default data: events for text deltas
- event: references for content cards
- event: contact for the contact card
- event: followups for suggested next questions
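On the receiving end, a simplified version of the parsing could look like this (it assumes each frame arrives whole, terminated by a blank line; a real parser like the one in ChatWidget.tsx has to buffer partial chunks):

```typescript
interface SseEvent {
  type: string; // "message" (default), "references", "contact", "followups"
  data: string;
}

// Parse raw SSE text into named events: an optional "event:" line plus
// a "data:" line per frame, frames separated by a blank line.
function parseSseFrames(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of raw.split("\n\n")) {
    let type = "message";
    let data = "";
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) type = line.slice(6).trim();
      else if (line.startsWith("data:")) data += line.slice(5).trim();
    }
    if (data) events.push({ type, data });
  }
  return events;
}
```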
That split keeps the frontend simple. The parser in ChatWidget.tsx just attaches cards and follow-ups to the current assistant message as the stream arrives.
One subtle bug here was that the model could stream text and also issue a tool call in the same response. My original orchestration always did a second LLM call after tools, which created duplicate-feeling responses. The fix was to track whether the first streamed response already had text and skip the second call when it did.
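The fix reduces to one predicate (a sketch; the real orchestration tracks this state across the stream):

```typescript
interface StreamResult {
  text: string;
  toolCalls: { name: string; args: unknown }[];
}

// Only make the post-tool LLM call when the first streamed response
// produced no prose alongside its tool call; otherwise the second
// call tends to produce a near-duplicate answer.
function needsSecondCall(first: StreamResult): boolean {
  return first.toolCalls.length > 0 && first.text.trim().length === 0;
}
```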
Follow-up suggestions are a separate model call
After each answer, the Worker makes a second non-streaming call to Haiku and asks for exactly three short follow-up questions as JSON.
I like this for two reasons:
First, it improves discoverability. A visitor often does not know what the corpus contains, so a few good follow-ups help them explore.
Second, it keeps the main answer path clean. The main model focuses on answering the question. The follow-up model focuses on navigation.
The nice thing is that this is cheap. The follow-up generation is roughly a couple hundred input tokens plus a tiny output, which works out to a negligible extra cost per message. If it fails, the Worker just returns no follow-ups and the main chat still works.
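The fail-open handling might be sketched like this (the function name and fence-stripping heuristic are my assumptions, not the actual worker code):

```typescript
// Ask the cheaper model for follow-ups as a JSON array and fail open:
// any parse problem returns no follow-ups instead of breaking the turn.
function parseFollowUps(modelOutput: string): string[] {
  try {
    // Models sometimes wrap JSON in prose or code fences, so grab the
    // first [...] span before parsing.
    const start = modelOutput.indexOf("[");
    const end = modelOutput.lastIndexOf("]");
    if (start === -1 || end <= start) return [];
    const parsed = JSON.parse(modelOutput.slice(start, end + 1));
    if (Array.isArray(parsed) && parsed.every((q) => typeof q === "string")) {
      return parsed.slice(0, 3);
    }
  } catch {
    // fall through to the empty fallback
  }
  return [];
}
```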
Abuse prevention mattered more than I expected
Public chat endpoints get weird traffic almost immediately.
So the Worker has a few layers of protection:
- Turnstile verification
- server-side origin checks
- rate limiting: 20 requests per minute per IP
- max 20 messages per conversation
- max 1000 characters per user message
- max 32 KB request body
- prompt-level guardrails
- an OpenRouter API key with a spending cap
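For the rate limit, a fixed-window sketch conveys the shape. Note the in-memory Map is a simplification: a single Worker isolate's memory is not shared globally, so a production version needs KV, a Durable Object, or a rate limiting binding.

```typescript
// Fixed-window limiter: 20 requests per minute per IP.
// Illustrative only; per-isolate memory does not give a global limit.
const WINDOW_MS = 60_000;
const LIMIT = 20;
const windows = new Map<string, { start: number; count: number }>();

function allowRequest(ip: string, now: number = Date.now()): boolean {
  const w = windows.get(ip);
  if (!w || now - w.start >= WINDOW_MS) {
    // First request in a fresh window.
    windows.set(ip, { start: now, count: 1 });
    return true;
  }
  w.count += 1;
  return w.count <= LIMIT;
}
```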
Turnstile is especially useful because it means someone cannot just curl the API from anywhere and burn tokens. They need to render the site in a real browser session and obtain a valid domain-bound token.
I also recently added retries around token acquisition because some requests were failing on the frontend before they even reached the model. In practice this was mostly a token freshness issue, not an LLM issue.
The frontend is more than a bubble now
The original UI was a floating bubble in the corner. That is still there as a mode, but I ended up making the homepage itself a chat-first interactive CV.
Technically the frontend is still quite simple:
- Astro renders the page shell
- React handles the chat state in ChatWidget.tsx
- SSE streams tokens directly into the current assistant message
- markdown is rendered client-side
- starter prompts and first-visit hints help users get going
The main architectural point here is that the UI is not pulling from a polling API or waiting for a full response blob. It is event-driven and incremental. That keeps the chat feeling responsive even though under the hood it may involve retrieval, streaming generation, tool resolution, and then a second follow-up generation pass.
A few production bugs were more interesting than the happy path
There were a handful of issues that were much more educational than writing the initial feature.
Turnstile did not like display: none
The invisible Turnstile iframe would not work properly when hidden with display: none. The fix was to hide it using zero dimensions and absolute positioning instead.
WebDAV deploys were serving stale HTML
The portfolio deploy goes to Fastmail WebDAV via rclone. Because WebDAV does not support checksums in the way I wanted, rclone sync --checksum quietly degraded into something close to size-based behavior, which meant HTML files with identical sizes could remain stale across deploys.
The fix was a two-phase deploy and --ignore-times for the non-asset files.
Retrieval needed real instrumentation
Without the debug endpoint, retrieval issues were annoyingly fuzzy. You can feel that an answer is wrong, but until you see the ranked chunks and their scores, it is hard to know whether the problem is chunking, embeddings, prompt design, or source wording.
Source wording matters a lot
At one point the system kept overemphasizing LangGraph because the knowledge base literally described it as one of the strongest skills. That was not the model being weird; that was the corpus doing exactly what I told it to do. RAG systems are surprisingly sensitive to small wording choices in curated source content.
Why I like this architecture
The whole thing is deliberately small.
- no vector DB
- no separate backend service
- no orchestration framework
- no giant indexing pipeline
Just content extraction, embeddings, an in-memory corpus, a Worker, and a chat UI.
That simplicity is part of why I like it. At this scale, it is easy to inspect, easy to redeploy, and easy to reason about when something breaks.
I think personal sites are a particularly good use case for this shape. A portfolio is already a structured corpus. It has a bounded domain, relatively stable content, and clear user questions. That makes it a much better target for RAG than the average “AI assistant for everything” product.
If you try it and notice weird retrieval, odd emphasis, or better ways the homepage should guide first-time visitors, let me know. This is one of those features where the first version is mostly an excuse to build a better second version.