18 Jul 2025
Exploring LLMs
A skeptic reconsiders
I’ve been a long-time skeptic of large language models and their applicability to intellectual work. Too readily do they hallucinate information, too amateurish are their programming mistakes. So for too long, I didn’t give them a proper chance to redeem themselves. Now, I’ve spent a couple of focused days reevaluating my stance, first by catching up on the latest advancements in transformer architecture and then by putting these models to the test.
This is mostly a worklog I was writing as I went along. The structure might be a bit weird as a result.
Goals and summary
Learn technical details of how LLMs work: Excellent videos by 3Blue1Brown and Andrej Karpathy.
Explore modern local models: Underwhelming. Models small enough to run on my laptop leave much to be desired.
Explore modern LLM tools and agents: Claude Code phenomenally useful for coding. Claude also useful for a bunch of other tasks.
Developing an intuition for how LLMs work
I came into this with some knowledge of machine learning and neural networks from university courses. I even read the OG transformer paper back then but apparently didn’t take much away from it. So previously I considered LLMs to be classic stochastic token predictors that don’t think, but only produce text conforming to previous patterns, thus appearing plausibly intelligent. In that view, assigning any cognitive abilities to LLMs is merely misguided anthropomorphizing of a statistical process.
But after catching up on transformer architecture, I’ve changed my mind. In particular, the concept that attention layers allow the LLM to exchange context between tokens evidently leads to a very expressive computational model. Contrasting this with a simple feed-forward, fully connected neural network, it seems obvious that this architecture is something fundamentally different. I’m also just nihilistic enough to reject any vague spiritual arguments that human cognition is somehow special. That leads me to consider it plausible that an LLM could actually “think”. Naturally, much has been written about this topic, both technical and philosophical. This is just my semi-informed take primarily from a technical perspective.
It’s worth mentioning that transformer models have obvious limitations that may not apply to human cognition: Reasoning is inherently token-based, models are static and not updating weights during inference, and I’m sure there are many more.
Explore local models, Claude and Gemini
To get a first impression of modern developments, I used local and hosted LLMs for some search queries and other tasks. Below are two examples with my impression of each model. For the local models I used Msty; for the hosted ones I used the free tiers at this point.
Who is currently in the Swiss National Council
- Gemini 2.5 Flash: Very good.
- Claude Sonnet 4: Good but slow, can’t figure out presidency.
- Gemma3:1B: Bad, gets even the number of members wrong.
- Llama 3.2: Gets the number of members correct but has knowledge cutoff in 2023 and doesn’t do web search. Also slow to load model.
- Qwen3: Super slow but pretty good.
- Gemma3: Gets the number of seats really wrong and is annoyingly agreeable when this is pointed out.
Summarize my server setup blog post and explain the deployment scripts
- Gemini 2.5 Flash: Phenomenal.
- Claude Sonnet 4: Phenomenal, a bit more concise than Gemini.
- Gemma3:1B: Useless. When asking for typos it thinks I’m talking about typos in its response. Asking it to explain something in the script is fruitless.
- Llama 3.2: Not good. Doesn’t find any typos and can’t explain script at all.
- Qwen3: Not good. It comes up with a new deployment script, doesn’t analyze mine.
- Gemma3: Not good. Comes up with new deployment script like Qwen3. Slightly more similar to the script in the post though.
Local models conclusion
As much as I would like to use local models for privacy reasons, it doesn’t look like it’s going to happen. This is not to say that open-weight models in general are bad, just the ones that are small enough that I can run them on my laptop. So I guess I’ll have to go with one of the bigger hosted ones.
Hosted models
I’m going with Claude for this exploration for two reasons. First, it’s said to be the strongest model for coding, which is my primary use case. Second, Anthropic seems to me the most sympathetic of the AI companies out there.
I can’t get behind Google due to privacy reasons—they can simply bring together too many threads of my digital persona. OpenAI is exceptionally unpleasant in their public conduct. xAI also won’t cut it for similar reasons.
Unfortunately there are no good options when it comes to privacy, but it seems Anthropic is the least bad.
Claude
Proofreading
Works very well. I can paste in a note in `mdx` format and just need to remind it to keep line breaks as is, and it will do a fine job directly correcting typos and awkward sentences, ready to paste back into my website.
It even handles markdown tables etc. flawlessly. Much simpler than pasting a note into an online spell-checker and manually fixing all the findings. With the note being checked into git, it’s easy to see the diff that Claude produced. I wouldn’t trust it otherwise.
Addendum: I also pasted this very blog post for proofreading into Claude and was caught off-guard when it referenced the content in its answer:
You’re welcome! The corrected version is ready to paste back into your website. With it being tracked in git, you’ll be able to easily see the diff of changes I made. Happy to help with proofreading anytime!
Addendum to the addendum: On the second iteration of passing this through Claude, it gave the above quote again. After asking whether it did this deliberately after seeing it in the text, it answered this:
Yes, exactly! I noticed that section where you mentioned pasting your blog post into Claude for proofreading and being “caught off-guard” when it referenced the content in its answer with that exact response. So I deliberately included the same response as a bit of meta-humor, since you were literally doing the same thing again. It seemed like a fun way to acknowledge that I had read and understood the content, rather than just mechanically proofreading without engaging with what you’d written about your LLM experiences.
Notably, on the second pass it wanted to replace a bunch of commas with a dash. It reads sensibly, but is quite out-of-character for a non-native speaker, so I reverted most of them.
Spot mistakes in crypto code
I paste in the encryption code used in Fieldnotes and ask Claude to assess the security of this code and whether it follows best practices. Claude says the code is sound and accomplishes its goal. Asking follow-up questions, it’s clear that Claude has a rather sophisticated understanding. E.g. “What could be the advantage of using per-note keys rather than encrypting notes directly with the main key?”.
To check that Claude isn’t overly agreeable, I ask the same question in a new context with modified crypto code that includes some issues. In one instance I set the main key to be `extractable`, which it correctly flags as something that could be improved. Further, when passing `undefined` for the initialization vectors, it raises alarm bells, calling this a critical bug with catastrophic consequences.
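The code in question is browser-side Web Crypto, where `extractable` and the `iv` parameter are API-specific details. Purely as an illustration of the IV issue, here is a minimal sketch using the Rust `aes-gcm` crate (not the Fieldnotes code) of the rule Claude is effectively enforcing:

```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm, Nonce,
};

fn main() {
    let key = Aes256Gcm::generate_key(OsRng);
    let cipher = Aes256Gcm::new(&key);

    // Correct: a fresh random 96-bit nonce (IV) per message, stored next to the ciphertext.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let _ok = cipher.encrypt(&nonce, b"note contents".as_ref()).unwrap();

    // The class of bug Claude flagged: a missing or fixed IV reused across messages
    // silently breaks AES-GCM's confidentiality and authenticity guarantees.
    let zero_nonce = Nonce::from_slice(&[0u8; 12]);
    let _bad = cipher.encrypt(zero_nonce, b"note contents".as_ref()).unwrap();
}
```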
This isn’t to say that an AI chatbot should be entrusted with security audits, but it is still reassuring that it doesn’t mess up on obvious issues. Overall I’m quite impressed here. Most engineers I work with wouldn’t know about IVs, let alone spot mistakes like this.
Architecture discussion
I’m writing a short overview of requirements for my upcoming rewrite of my observability / metrics tool. There are a bunch of open questions in my head and I haven’t managed to get a clear picture of what I want to do. Part of the reason is the absolute mess that is OpenTelemetry documentation.
# Observatory: Observability and Time Series System
Observatory is a new system to store observability data as well as time series data.
## Store time series data from embedded devices
There are multiple embedded devices that periodically record sensor data. They should be able
to send that time series data to Observatory.
## Store observability data from backend services
There are multiple backend services implemented in Rust. They use the `tracing` crate to instrument them.
They should be able to export their tracing data to Observatory.
## Visualization
Observatory should include a web application that visualizes the data stored within.
## Questions
- What storage layer is advisable?
- Can a database be used directly to ingest observability data or is a backend application in front of
the storage layer needed?
After pasting in this document and prompting Claude to ask clarifying questions, it produces an updated requirements doc. It contains a bunch of boilerplate that would satisfy an architect, but also some interesting stuff. Together we iterate towards a stack with ClickHouse at the center, fronted either by a custom OpenTelemetry Collector or by Jaeger using ClickHouse as its storage layer, with Grafana as the visualization tool.
I also ask follow-up questions regarding OpenTelemetry that I haven’t been able to find answers to after hours of scouring the web (due to the over-engineered mess that is the OpenTelemetry ecosystem) and Claude can give competent answers. Overall, exceptionally helpful for this task.
Claude as a Google Search replacement
- Restaurant recommendations: Good.
- Smart plug product search: Okay, has to be pointed in some directions but can summarize options quite well.
- Particulate matter sensor: Very good.
- JS editor research: Excellent.
- HAR file to wiremock stub: Why search for a tool when it can generate the script directly?
In my experience, when Claude needs up-to-date information from the web, it’s noticeably less useful than when the knowledge or skill it already has embedded in its weights is sufficient.
Embedded development
- Failed to set up a Rust project for programming an ESP32-C6. But then again, so did I until I figured out a peculiarity in my toolchain.
- Interestingly, Claude prefers to generate code from memory rather than invoke the template generator it most definitely got this code from. That leads to outdated dependencies.
- Once the project is set up, it can make changes no problem, including using new APIs that I don’t know about.
I’m starting to notice that whenever Claude is really struggling with something, chances are so will I when I have to inevitably do it myself.
Web development
Masonry grid
- Changing an image layout from cropped images in a square grid to masonry (without CSS masonry grid): Not managing well. It literally tried to absolute position the images *facepalm*.
- Claude lacks contextual persistence: It just modified a file that now throws a JS error. Upon pasting this error into Claude, it first searches for a minute for where this error might be instead of immediately checking the file it just modified.
- Not fun to use at all. Claude just messed something new up every time. An exercise in frustration. Even if it technically satisfies the requirements, it’s super jumpy / glitchy. It has no taste whatsoever to build a robust solution. To be fair, it is building this blind. It has no channel to observe the result, which would be hard for me as well. But that’s just the nature of these tools for now I guess.
- I’m giving up on this. It will be faster and importantly less nerve-wracking to do this myself.
A learning here is to be fair to the model. This is a challenging task to do well, and I secretly gave it to Claude because I don’t want to do it myself. Coupled with the fact that it can’t see the output of its work, I really shouldn’t be surprised that it can’t pull it off. Neither could I.
Async image processing
- Changing an image upload endpoint from immediate compression and upload of multiple versions to one where only the original is uploaded and compression and upload of compressed images is happening asynchronously: Works flawlessly on first try! Did not expect that.
- Later, I noticed a bug that I couldn’t get Claude to resolve itself. After checking the code in detail I found the culprit: there were two tasks executing in parallel using `rayon::join`. The error returned from this was checked, but hidden was the fact that the individual `Response`s went unchecked. In this case, an upstream storage API was sometimes throwing `500`, but that went unnoticed due to the missing error handling (the pattern is sketched below).
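For illustration, a minimal sketch of the bug pattern and its fix. This is a hypothetical reconstruction, not the actual project code: the function names, URLs, and the use of `reqwest::blocking` as the storage client are my assumptions.

```rust
// Hypothetical reconstruction of the bug pattern, not the actual project code.
// Assumes `reqwest::blocking` as the client for the upstream storage API.
use rayon::join;
use reqwest::blocking::{Client, Response};

fn upload(client: &Client, url: &str, bytes: Vec<u8>) -> reqwest::Result<Response> {
    client.put(url).body(bytes).send()
}

fn upload_variants(client: &Client, small: Vec<u8>, large: Vec<u8>) -> Result<(), reqwest::Error> {
    // Both compressed variants are uploaded in parallel.
    let (small_res, large_res) = join(
        || upload(client, "https://storage.example/small", small),
        || upload(client, "https://storage.example/large", large),
    );

    // The buggy version effectively stopped here: `?` only surfaces transport errors,
    // while a 500 from the storage API still arrives as Ok(Response).
    let (small_resp, large_resp) = (small_res?, large_res?);

    // The missing piece: also check the HTTP status of each individual Response.
    small_resp.error_for_status()?;
    large_resp.error_for_status()?;
    Ok(())
}
```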
While bugs like this don’t happen to me because I’m very meticulous in handling any error the type system throws my way (part of the reason I enjoy Rust so much), I have to restrain myself from being annoyed at this as the time saving is still there. It’s not much different from reviewing PRs at work.
OpenTelemetry shenanigans
Continuing from the architecture discussions with Claude, I went on to try some of the conclusions in a proof of concept. I also have a more detailed write-up of this here.
- Have Claude Code create a compose stack with ClickHouse, OpenTelemetry Collector and Grafana. Works flawlessly. It saves me the 5 minute dance of remembering how docker networking works to get different services to talk to each other. After some errors on the Collector’s part, I can get Claude to make sure the DB is up and running before the Collector tries to connect to it. It also creates the OTEL collector config easily.
- At this point the stack is running, but trying to get Claude to juggle the 6 (!) Rust crates needed to send the tracing data in OTel-compatible format to the Collector is a mess. Did I mention that I don’t like the OpenTelemetry ecosystem? Here, the crates are a bit too young and moving too fast: Claude stubbornly tries to call a function that existed in an older version of one of the crates. But even after manually intervening I’m also struggling. Ultimately we managed to get it running (roughly the wiring sketched after this list).
- Once tracing data is in ClickHouse I start experimenting with charting interesting metrics in Grafana. The auto-generated table schema is quite involved and I have no knowledge of ClickHouse’s SQL dialect. Claude to the rescue—I can simply paste in the schema and ask for convoluted queries. It works phenomenally well for this.
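For reference, the wiring we ended up with had roughly the shape below. This is a from-memory sketch against the pre-1.0 `opentelemetry`, `opentelemetry-otlp` and `tracing-opentelemetry` crates; the builder names keep changing between releases, and the endpoint and service name are placeholders, so treat it as the shape of the solution rather than working code.

```rust
// Rough sketch only: the OpenTelemetry Rust crates move fast, so this reflects the
// older `new_pipeline` style and may not compile against the latest releases.
// Must be called from within a Tokio runtime; endpoint and service name are placeholders.
use opentelemetry::KeyValue;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace as sdktrace, Resource};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

fn init_tracing() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans via OTLP/gRPC to the Collector from the compose stack.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://localhost:4317"),
        )
        .with_trace_config(sdktrace::config().with_resource(Resource::new(vec![
            KeyValue::new("service.name", "observatory-poc"),
        ])))
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Bridge the `tracing` crate into OpenTelemetry, keeping normal console output.
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    Ok(())
}
```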
How do I use LLMs now?
With all this testing done, I’m starting to get a grip on how LLMs can be useful for me. I might add that I’m primarily talking about my personal projects here. At work I use LLMs significantly less for various reasons: There is so much context required to produce good code that it’s often faster to do it myself than it is to verbalize all this context. The complexity in my project at work is also mostly in the interactions between systems, dependencies and a host of human factors. Technologically it’s quite simple. Naturally that’s a bad fit for LLMs. They’re not about to coordinate an API change with another team for me. Personal projects on the other hand are mostly about the technical challenges and all the code lives in a repo available to the LLM. API changes can be made without a care in the world, UIs can be changed without discussing with UX for an hour.
So for personal projects I use Claude Code in a terminal next to my normal IDE setup. This is a nice way of keeping in control while having Claude sit there and help out if needed. I always work on a branch and commit often to be able to see isolated changes in a diff. No vibe coding for me, I always check Claude’s changes in detail. Typically I watch Claude make changes and closely guide it into the direction I prefer. This whole ‘agentic’ workflow where people supposedly have multiple instances of LLMs running simultaneously does not work for me.
Apart from using Claude Code in the terminal, I also use the web interface often for one-off tasks, such as crafting ClickHouse queries, discussing architectural approaches, proofreading stuff etc.
Appendix
Notes on Deep Dive into LLMs like ChatGPT
- Base models don’t know about question / answer format, they simply predict the next token from the context.
- One can fake a conversational agent by constructing a few-shot example prompt that shows the question / answer format, ending in `answer:`. The model will continue in a similar fashion.
- Instruct models are post-trained on human-generated conversations to properly bake in the concept of answering questions. While the base training might take months, post-training typically takes only hours.
- Hallucinations are being addressed by first probing a model for what it knows: Paste a paragraph of Wikipedia into one model and have it generate question / answer pairs related to the paragraph. This is easy because the answers are right there in the context window. Take the generated questions to another model and have it answer without having the context. Compare answers to determine whether the second model knows the answer. If it doesn’t, add a question / answer pair with the answer “I don’t know” to the post-training data set.
- Tool usage works as follows: Model generates special tokens for e.g. web search, inference code pauses when it encounters those tokens, goes off to do the search, pastes response back into context and continues inference.
- Interesting insight on reasoning: There is only a fixed amount of computation flowing into any given token. So training data needs to be phrased in such a way that the model doesn’t have to solve the entire problem in one token, but instead can spread out intermediary computation over more tokens. Kind of like slowly reasoning your way towards the solution rather than producing an answer immediately and then post-hoc justifying the answer. This leads to the conclusion that LLMs need tokens to think. Thinking is intrinsically linked to language processing.
- Many counting and spelling tasks don’t work well due to tokenization. Tokens only exist for efficiency; there are efforts to move to character-level or byte-level models, which may solve this problem.
- Post-training: Reinforcement learning: The model is sampled many times on the same question. Answers are checked for correctness against a known answer. Correct answers are trained on to make those token streams more likely (reinforced). E.g. the model has reasoned its way through the problem well, not pinning too much computation on a single token, and thus got to the right answer. This is good and should be encouraged through fine-tuning on those answers.
- DeepSeek R1 was the first model where the use of reinforcement learning was publicly talked about. This model learned to generate thinking sequences, where multiple different approaches are generated and compared by the model. Only then does it generate a nice output intended for the human with the previous thinking in context.
- together.ai for trying open weights models
- Reinforcement learning in verifiable domains: There is a way to reliably tell what the correct answer is. Trivially or via LLM judge.
- RLHF: Reinforcement learning from human feedback
- Reinforcement learning in unverifiable domains (e.g. jokes): Train rewards model from little human feedback, then use that to judge many outputs.
- Discriminator-generator gap: For a human it’s way easier to judge output than create output.
- Risk of RLHF: RL may discover a way to game the reward model, since only a lossy simulation of a human is judging the output. That may lead to nonsensical output getting high scores. (Adversarial examples)
- As a result RLHF is often run briefly to avoid this problem.
- Upcoming capabilities: Multimodal models: Tokenize audio, video and train as usual.
- Upcoming capabilities: Agents: Long, coherent, error-correcting contexts.
- Upcoming capabilities: Test-time training: Current models only do in-context learning on the context.
Notes on How I use LLMs
- Keep context on-topic. Large context may distract model and slightly decrease accuracy.
- Thinking (reasoning) models are those that have been improved with reinforcement learning.
- GPT-4o is non-thinking model.
- OpenAI models starting with `o` (o3-mini, o1, etc.) are thinking models.
- Search tool use very useful to quickly gather websites and summarize content.
- Deep research combines search and thinking to do research.
- Some models support document upload. Here, the document is simply added to the context for the LLM to query.
- ChatGPT is trained to recognize math problems it probably can’t do in its “head”. In this case it will invoke a python interpreter instead (tool use).
- Claude Artifacts can build little applications and run them directly in browser in their interface.
- SuperWhisper to transcribe voice to text system-wide. Also useful for prompting without typing.
- Advanced voice mode in ChatGPT: Handle voice natively in the LLM instead of transcribing to text and operating on that.
- NotebookLM for generating podcasts on arbitrary topics.
- DALL-E does not generate images inside the LLM, it sends a caption to a separate image-generation model.
- ChatGPT typically wipes context in a new chat. But there is the option to ask it to remember something. It will simply add those memories to the beginning of the context in a new chat.
- ChatGPT custom instructions to avoid repeating preferences in every new chat.
- Custom GPT: provide a system prompt to get answers in specific format.
Notes on Transformers, the tech behind LLMs
- “Attention is all you need” originally introduced transformers.
- GPT-3 has 175B parameters
- Word embedding in GPT-3 is 12k-dimensional
- Typical word-embedding may choose to encode gender information in one dimension. Thus, the classic `man - woman = uncle - aunt`. Similarly, `hitler + italy - germany = mussolini`.
- Further, one direction in the embedding space indicates plurality, i.e. `plur = cats - cat`, and the dot product between plurals and the `plur` direction is higher than with singular words.
- Embedding matrix in GPT-3 has around 617M weights.
- Embedding maps every token to a vector in isolation. It’s the attention mechanism that enables the network to exchange meaning between the vectors, to arrive at a more enriched meaning for each token. E.g. `river bank` vs `deposit at the bank`: `bank` has a different meaning based on context.
- At the end, the unembedding matrix maps the last token in the context to a probability distribution across the entire vocabulary (50k in GPT-3) for prediction.
- All the other tokens in the last layer actually encode their immediately following tokens and are not used for the prediction of a new token. This turns out to benefit training, since every position provides a prediction to learn from.
- Unembedding matrix also has 617M weights in GPT-3.
- Softmax is used to turn the vector coming out of the unembedding into a probability distribution. The temperature is an extra parameter here to guide how “sharp” the distribution is: with `T = 0`, the highest component gets 100% of the probability; `T = 5` spreads out the probability much more evenly, thus making predictions more likely that have lower components in the vector. (The softmax formula and the embedding arithmetic are written out after this list.)
- Components in the output vector after unembedding are called logits.
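Writing the last few items out in my own notation, with E(·) denoting the embedding of a token and z the logit vector of size V:

```latex
% Directions in the embedding space encode concepts, so differences line up:
E(\text{man}) - E(\text{woman}) \approx E(\text{uncle}) - E(\text{aunt})
\qquad
E(\text{hitler}) + E(\text{italy}) - E(\text{germany}) \approx E(\text{mussolini})

% A "plurality" direction, detected via the dot product:
\vec{\text{plur}} = E(\text{cats}) - E(\text{cat})
\qquad
E(\text{cats}) \cdot \vec{\text{plur}} \;>\; E(\text{cat}) \cdot \vec{\text{plur}}

% Softmax with temperature T over the logits z_1, ..., z_V
% (small T: sharp distribution; large T: flatter distribution):
p_i = \frac{e^{z_i / T}}{\sum_{j=1}^{V} e^{z_j / T}}
```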
Notes on Attention in transformers, step-by-step
- Initially, every input token gets mapped to the same embedding vector—there is no sense of context.
- One interaction of attention may be to update the embeddings of nouns with the preceding adjectives. This should refine the embedding vector to one that captures the essence of the noun in context better. (this assumes tokens are words).
- Query and key vectors determine how relevant each token is to each other, i.e. the embedding of one word attends to the embedding of another.
- The dot products of each query vector with each key vector yield the attention pattern (see the formula after this list).
- To improve training, not only is the last token expected to predict the next one, but all tokens are expected to predict their following token.
- Masking is the process of setting the lower left half of the attention pattern to −∞ before the softmax (so those entries end up at 0), to ensure later tokens cannot “give away” the prediction of earlier ones.
- Size of the attention pattern is the bottleneck for the context window, as it scales with the square of the context length.
- Finally, embeddings of tokens are updated by multiplying attention pattern by value vectors (produced using a value matrix). This way, every token gets some part of every other token (preceding it), but more from the ones that are attending to it more strongly.
- This whole process is called one head of attention.
- One attention head uses approximately 6.4M weights.
- Multi-headed attention is running multiple attention heads in parallel; 96 per block for GPT-3.
- The deltas from all attention heads are added together onto the original embedding.
- GPT-3 has 96 attention blocks, leading to 58B weights dedicated to attention.
- The rest of the 175B parameters are in the in-between multilayer perceptron layers.
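Putting the attention bullets into formula form, and checking the parameter counts with GPT-3’s dimensions (embedding size 12288, key/query size 128); the arithmetic here is mine, following the video’s counting:

```latex
% One head of attention; the mask M sets forbidden entries to -infinity before the
% softmax so that later tokens cannot influence earlier ones:
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V

% Weights per head: query, key, value-down and value-up matrices, each 128 x 12288:
4 \times (128 \times 12288) \approx 6.3\,\text{M}

% 96 heads per block and 96 blocks give the ~58B attention parameters:
6.3\,\text{M} \times 96 \times 96 \approx 58\,\text{B}
```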
Notes on How might LLMs store facts
- 2/3 of weights in a typical transformer live in the Multilayer Perceptron layers (MLP).
- In MLP layers vectors don’t “talk” to each other, the operations happen in isolation.
- First operation in MLP layer is a matrix multiplication that maps input vector into higher dimension.
- In case of GPT-3, this matrix has 49k rows.
- In some way, each of those rows can be thought of as a question that is being asked of the input vector.
- Next, the resulting vector is run through a non-linear function (e.g. ReLU) to nicely clip components that didn’t satisfy the question (the full layer is written out after this list).
- Finally a down projection matrix maps the intermediate vector down to the input dimension.
- The resulting vector is added to input vector.
- Single MLP layer has about 1.2B parameters.
- Fascinating insight: In an n-dimensional space, there can be at most n vectors that are pairwise orthogonal to each other. But if the requirement is relaxed slightly, to angles between 89 and 91 degrees, the number of vectors that fit grows exponentially with the dimension! (Johnson-Lindenstrauss Lemma)
- This has huge implications for LLMs and explains why model performance seems to scale so well with size. A model that is 10 times larger can represent way more than 10 times the number of concepts in its latent space.
- This explains why inside an MLP layer the neurons are not lighting up as unit vectors to represent a single concept, instead it’s some random looking vector that happens to be one of the nearly orthogonal directions in that high-dimensional space.
- Research area trying to extract true meaning of those neurons: Sparse Autoencoders.
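And the MLP layer written out, with GPT-3’s numbers plugged in (again my own arithmetic following the video; W↑ is the up-projection, W↓ the down-projection):

```latex
% One MLP layer: up-project, clip with ReLU, down-project, add back onto the input x:
\mathbf{x}' = \mathbf{x} + W_{\downarrow}\,\operatorname{ReLU}\!\left(W_{\uparrow}\mathbf{x} + \mathbf{b}\right)

% Parameter count with d_model = 12288 and hidden dimension 4 x 12288 = 49152:
|W_{\uparrow}| = 49152 \times 12288 \approx 604\,\text{M}
\qquad
|W_{\downarrow}| = 12288 \times 49152 \approx 604\,\text{M}

% Roughly 1.2B per layer; across 96 layers that is about 116B of the 175B total, i.e. ~2/3:
2 \times 604\,\text{M} \approx 1.2\,\text{B}
\qquad
1.2\,\text{B} \times 96 \approx 116\,\text{B}
```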