
Building a RAG Chat for My Portfolio

Posted on: April 13, 2026

I recently added a RAG-backed chat layer to aleph.fi. The original goal was not to bolt an LLM onto a portfolio just because that is what people are doing now. I wanted something more specific: a site where visitors could ask normal questions about my work and get grounded answers from the actual body of content behind the site.

That means the kinds of questions a visitor would naturally ask about my work, projects, and background.

And importantly, I wanted those answers to come from a real retrieval pipeline rather than a vague system prompt plus vibes.

So the project ended up becoming a fairly compact but real production system: content extraction, embeddings, an in-memory corpus, a Cloudflare Worker, and a chat UI.


The shape of the system

At a high level, the system looks like this:

Offline indexing

┌──────────────────────┐     ┌──────────────────────┐
│ knowledge-base.md    │     │ blog posts (main)    │
└──────────┬───────────┘     └──────────┬───────────┘
           └──────────────┬─────────────┘
                          │
                          ▼
                     extract.ts
                          │
                          ▼
                     chunks.json
                          │
                          ▼
                      embed.ts
                          │
                          ▼
                     corpus.json
                          │
                          ▼
          bundled into the Cloudflare Worker

Runtime

┌────────────────────────────┐
│ Browser                    │
│ ChatWidget + Turnstile     │
└──────────────┬─────────────┘
               │ POST /api/chat + token
               ▼
┌──────────────────────────────────────────────┐
│ Cloudflare Worker                            │
│ - validate token / origin / rate limits      │
│ - embed query with Workers AI                │
│ - retrieve top-8 chunks from corpus.json     │
│ - build system prompt                        │
│ - stream answer from OpenRouter              │
│ - optionally emit cards + follow-ups         │
└──────────────┬───────────────────────────────┘
               │ SSE stream
               ▼

┌────────────────────────────┐
│ Browser renders            │
│ text, cards, follow-ups    │
└────────────────────────────┘

The backend is a single Worker with one endpoint, POST /api/chat. There is no separate API server and no external vector database. For this project size that would have been unnecessary complexity.

The main Worker files are split roughly by responsibility.

On the frontend, the chat UI lives in src/components/ChatWidget.tsx and talks to the Worker over SSE.

The offline content pipeline

The RAG corpus comes from two places: a curated knowledge-base.md in the site repo, and the blog posts, which live on the main branch.

That branch split is important. The portfolio site deploys from root, while the blog content lives on main. So the extraction script fetches blog content from main via git archive instead of assuming everything lives in the checked-out branch.

The extraction step writes worker/data/chunks.json.

For the knowledge base, the script splits on ## headings, and sections with ### subheadings get split again into sub-chunks. That turned out to matter a lot for work history because a single job entry can contain several different retrieval targets: company overview, a recommender system, infra work, cost reduction, and so on.
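
As a sketch of that splitting logic (a hypothetical helper, not the actual extract.ts code):

```typescript
// Hypothetical sketch of heading-based chunking: split on "## " sections,
// then split any section containing "### " subheadings into sub-chunks so
// each one becomes its own retrieval target. Assumes heading lines end
// with a newline.
interface Chunk {
  section: string;
  text: string;
}

function chunkMarkdown(md: string): Chunk[] {
  const chunks: Chunk[] = [];
  // Top-level "## " sections (anything before the first one is dropped).
  const sections = md.split(/^## /m).slice(1);
  for (const section of sections) {
    const nl = section.indexOf("\n");
    const heading = section.slice(0, nl).trim();
    const body = section.slice(nl + 1);
    if (/^### /m.test(body)) {
      // Sub-split: one chunk per "### " subheading.
      for (const sub of body.split(/^### /m).slice(1)) {
        const subNl = sub.indexOf("\n");
        chunks.push({
          section: `${heading} / ${sub.slice(0, subNl).trim()}`,
          text: sub.slice(subNl + 1).trim(),
        });
      }
    } else {
      chunks.push({ section: heading, text: body.trim() });
    }
  }
  return chunks;
}
```
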

For blog posts, the extractor does a few cleanup steps before chunking.

At the moment the corpus is small, roughly 117 chunks. That is small enough that brute-force in-memory cosine similarity is completely fine inside a Worker.

Then embed.ts calls Cloudflare Workers AI to generate embeddings using @cf/baai/bge-base-en-v1.5, a 768-dimensional embedding model. The same model is used both offline and at query time. That consistency matters; mixing embedding models across indexing and retrieval is a great way to get quietly bad search.

The output is worker/data/corpus.json, which contains both the text chunks and the precomputed embeddings. That file is committed and bundled into the Worker at deploy time.
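
A sketch of what that step can look like, with the Workers AI binding passed in (the batch size and helper name are my own choices here, not necessarily what embed.ts does):

```typescript
// Sketch of the offline embedding step. @cf/baai/bge-base-en-v1.5 returns
// its vectors under `data`; in the real script `ai` would be the Workers
// AI binding from the environment.
interface EmbeddingAI {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

async function embedChunks(ai: EmbeddingAI, texts: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  const batchSize = 20; // keep batches small to stay under input limits
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const { data } = await ai.run("@cf/baai/bge-base-en-v1.5", { text: batch });
    vectors.push(...data);
  }
  return vectors;
}
```
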

This is also why I did not bother with a vector database. At this scale, a vector database would mainly add latency, cost, and operational surface area. In-memory cosine similarity over ~117 vectors is trivial.
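
The whole retrieval core fits in a couple of functions (names are illustrative, not the actual rag.ts):

```typescript
// Brute-force retrieval over the bundled corpus. With ~117 vectors a
// full scan is microseconds of work, so no index structure is needed.
interface Embedded {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], corpus: Embedded[], k = 8): { id: string; score: number }[] {
  return corpus
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```
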

What happens at query time

When a user sends a message, the Worker does the following:

  1. validates the Turnstile token
  2. checks the request origin
  3. rate limits by IP
  4. validates message count and size
  5. embeds the latest user message with Workers AI
  6. retrieves the top 8 chunks from the in-memory corpus
  7. builds a system prompt containing those chunks plus metadata
  8. streams the answer from OpenRouter
  9. optionally handles tool calls
  10. generates follow-up suggestions with a second cheaper model
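
The cheap checks (steps 1 through 4) all run before any model call. A minimal sketch of the request validation, with illustrative limits rather than the Worker's actual values:

```typescript
// Sketch of the pre-model request checks. Limits are illustrative.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function validateRequest(
  messages: ChatMessage[],
  maxMessages = 20,
  maxChars = 2000,
): string | null {
  if (messages.length === 0 || messages.length > maxMessages) {
    return "bad message count";
  }
  for (const m of messages) {
    if (m.content.length > maxChars) return "message too long";
  }
  return null; // null means the request passes on to retrieval
}
```

Rejecting early like this means abusive or malformed requests never cost an embedding call, let alone an LLM call.
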

The system prompt is intentionally strict. It tells the model to ground its answers in the retrieved chunks and stay within what the corpus actually says.

Each chunk is formatted with its chunk ID, source, type, section name, and relevance score. That chunk ID becomes useful when the model decides to call a tool and point the UI to the exact project or post it is talking about.

For inference I used OpenRouter. The main answer path uses anthropic/claude-sonnet-4, streamed, with tools enabled. Follow-up questions are generated with a second cheaper model, anthropic/claude-haiku-4.5, after the main answer finishes.

Retrieval quality was the real problem

The easy part of RAG is getting text in and out. The annoying part is getting the right text in the middle.

The retrieval path is hybrid: cosine similarity over the embeddings, plus a small keyword boost layer on top.

The keyword boost rules are small, intentionally so. They add a boost of about +0.03 to +0.05 when the query strongly suggests a category like employment, skills, or education.

For example, if a user asks about work history, resume, CV, roles, or companies, chunks whose section looks like employment get a small bump. I added this because semantically rich chunks can sometimes outrank more obviously relevant ones, especially in a small corpus where some chunks are dense and broad.

One concrete case was work experience queries not reliably surfacing the right Secret Escapes chunks high enough. The fix was not to add a whole reranking service. It was to keep the architecture simple and add a light keyword boost layer in rag.ts.
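
That layer is small enough to sketch in full. The boost values and patterns below are illustrative of the idea in rag.ts, not copied from it:

```typescript
// Sketch of the keyword-boost layer: when the query strongly suggests a
// category, chunks whose section matches that category get a small bump.
interface Scored {
  section: string;
  score: number;
}

const BOOSTS: { query: RegExp; section: RegExp; boost: number }[] = [
  { query: /\b(work history|resume|cv|roles?|compan(y|ies))\b/i, section: /employment|experience/i, boost: 0.05 },
  { query: /\b(skills?|stack|technolog)/i, section: /skills/i, boost: 0.03 },
];

function applyBoosts(userQuery: string, results: Scored[]): Scored[] {
  return results
    .map((r) => {
      let score = r.score;
      for (const rule of BOOSTS) {
        if (rule.query.test(userQuery) && rule.section.test(r.section)) {
          score += rule.boost;
        }
      }
      return { ...r, score };
    })
    .sort((a, b) => b.score - a.score);
}
```

Because the boosts are tiny, they only reorder near-ties; a chunk that is semantically far off still loses.
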

I also added a debug endpoint, GET /api/debug, behind a secret. It returns chunk IDs, scores, section names, types, and text previews for a given query. That was extremely useful for diagnosing why some answers felt “kind of right” but not actually useful.

Tool calling made the UI much better

Plain text answers are fine, but they are not the best UX when the real goal is “show me the relevant thing”.

So the model can call a recommend_content tool. Instead of making the model invent URLs or awkwardly paste them into prose, the Worker resolves chunk IDs into structured cards.

Those cards use metadata attached during extraction, such as titles, URLs, and content types, which lets the frontend render actual clickable project and blog post cards under the response.

There is also a show_contact_card tool for obvious collaboration or contact intent.

The Worker streams these events back over SSE using named event types, so text deltas, cards, and follow-ups arrive as distinct events.

That split keeps the frontend simple. The parser in ChatWidget.tsx just attaches cards and follow-ups to the current assistant message as the stream arrives.
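
Framing a named SSE event is almost trivially simple; the event names below are illustrative, not necessarily the ones the Worker uses:

```typescript
// An SSE frame is an "event:" line plus a "data:" line, terminated by a
// blank line. The browser's EventSource-style parsers dispatch on the
// event name, which is what keeps the frontend parser simple.
function sseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// In a Worker, frames like these are encoded and enqueued on a
// ReadableStream, e.g.:
//   controller.enqueue(new TextEncoder().encode(sseEvent("delta", { text })));
```
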

One subtle bug here was that the model could stream text and also issue a tool call in the same response. My original orchestration always did a second LLM call after tools, which created duplicate-feeling responses. The fix was to track whether the first streamed response already had text and skip the second call when it did.

Follow-up suggestions are a separate model call

After each answer, the Worker makes a second non-streaming call to Haiku and asks for exactly three short follow-up questions as JSON.

I like this for two reasons:

First, it improves discoverability. A visitor often does not know what the corpus contains, so a few good follow-ups help them explore.

Second, it keeps the main answer path clean. The main model focuses on answering the question. The follow-up model focuses on navigation.

The nice thing is that this is cheap. The follow-up generation is roughly a couple hundred input tokens plus a tiny output, which works out to a negligible extra cost per message. If it fails, the Worker just returns no follow-ups and the main chat still works.
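
The failure-tolerant part is mostly in the parsing. A sketch of that defensive step (the helper name is hypothetical):

```typescript
// If the follow-up model returns anything other than a JSON array of
// strings, the chat just proceeds without suggestions.
function parseFollowUps(raw: string): string[] {
  try {
    // Models sometimes wrap JSON in a code fence; strip it first.
    const cleaned = raw.replace(/^```(json)?\s*|\s*```$/g, "").trim();
    const parsed = JSON.parse(cleaned);
    if (Array.isArray(parsed) && parsed.every((q) => typeof q === "string")) {
      return parsed.slice(0, 3); // cap at exactly three suggestions
    }
  } catch {
    // Fall through: a failed parse must never break the main answer.
  }
  return [];
}
```
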

Abuse prevention mattered more than I expected

Public chat endpoints get weird traffic almost immediately.

So the Worker has a few layers of protection: a Turnstile token check, an origin check, per-IP rate limiting, and message count and size limits.

Turnstile is especially useful because it means someone cannot just curl the API from anywhere and burn tokens. They need to render the site in a real browser session and obtain a valid domain-bound token.
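
Server-side, the check is a single call to Cloudflare's siteverify endpoint. The URL and form fields below are Cloudflare's real API; the injectable fetcher is just my way of making the sketch testable (in the Worker you would pass the global fetch):

```typescript
// Sketch of server-side Turnstile validation against the siteverify API.
type Fetcher = (
  url: string,
  init: { method: string; body: URLSearchParams },
) => Promise<{ json(): Promise<unknown> }>;

async function verifyTurnstile(
  secret: string,
  token: string,
  fetcher: Fetcher,
  remoteip?: string,
): Promise<boolean> {
  const body = new URLSearchParams({ secret, response: token });
  if (remoteip) body.set("remoteip", remoteip);
  const res = await fetcher(
    "https://challenges.cloudflare.com/turnstile/v0/siteverify",
    { method: "POST", body },
  );
  const data = (await res.json()) as { success?: boolean };
  return data.success === true;
}
```
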

I also recently added retries around token acquisition because some requests were failing on the frontend before they even reached the model. In practice this was mostly a token freshness issue, not an LLM issue.

The frontend is more than a bubble now

The original UI was a floating bubble in the corner. That is still there as a mode, but I ended up making the homepage itself a chat-first interactive CV.

Technically the frontend is still quite simple: ChatWidget.tsx talks to the single /api/chat endpoint and incrementally parses the SSE stream as it arrives.

The main architectural point here is that the UI is not pulling from a polling API or waiting for a full response blob. It is event-driven and incremental. That keeps the chat feeling responsive even though under the hood it may involve retrieval, streaming generation, tool resolution, and then a second follow-up generation pass.

A few production bugs were more interesting than the happy path

There were a handful of issues that were much more educational than writing the initial feature.

Turnstile did not like display: none

The invisible Turnstile iframe would not work properly when hidden with display: none. The fix was to hide it using zero dimensions and absolute positioning instead.
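
The shape of the fix, reduced to a minimal style mutation (a simplified style shape for illustration; in the app this targets the real container element):

```typescript
// Keep the element renderable but visually absent. display: none is
// exactly the property to avoid, because it stops the Turnstile iframe
// from running its challenge.
function hiddenButRenderable(style: Record<string, string>): void {
  style.position = "absolute";
  style.width = "0";
  style.height = "0";
  style.overflow = "hidden";
  // Crucially NOT: style.display = "none";
}
```
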

WebDAV deploys were serving stale HTML

The portfolio deploy goes to Fastmail WebDAV via rclone. Because WebDAV does not support checksums in the way I wanted, rclone sync --checksum quietly degraded into something close to size-based behavior, which meant HTML files with identical sizes could remain stale across deploys.

The fix was a two-phase deploy and --ignore-times for the non-asset files.

Retrieval needed real instrumentation

Without the debug endpoint, retrieval issues were annoyingly fuzzy. You can feel that an answer is wrong, but until you see the ranked chunks and their scores, it is hard to know whether the problem is chunking, embeddings, prompt design, or source wording.

Source wording matters a lot

At one point the system kept overemphasizing LangGraph because the knowledge base literally described it as one of the strongest skills. That was not the model being weird; that was the corpus doing exactly what I told it to do. RAG systems are surprisingly sensitive to small wording choices in curated source content.

Why I like this architecture

The whole thing is deliberately small.

Just content extraction, embeddings, an in-memory corpus, a Worker, and a chat UI.

That simplicity is part of why I like it. At this scale, it is easy to inspect, easy to redeploy, and easy to reason about when something breaks.

I think personal sites are a particularly good use case for this shape. A portfolio is already a structured corpus. It has a bounded domain, relatively stable content, and clear user questions. That makes it a much better target for RAG than the average “AI assistant for everything” product.

If you try it and notice weird retrieval, odd emphasis, or better ways the homepage should guide first-time visitors, let me know. This is one of those features where the first version is mostly an excuse to build a better second version.