florian marending

18 Jul 2025

Exploring LLMs

A skeptic reconsiders

I’ve been a long-time skeptic of large language models and their applicability to intellectual work. Too readily do they hallucinate information, too amateurish are their programming mistakes. So for too long, I didn’t give them a proper chance to redeem themselves. Now, I’ve spent a couple of focused days reevaluating my stance, first by catching up on the latest advancements in transformer architecture and then by putting these models to the test.

This is mostly a worklog I was writing as I went along. The structure might be a bit weird as a result.

Goals and summary

Learn technical details of how LLMs work: Excellent videos by 3Blue1Brown and Andrej Karpathy.

Explore modern local models: Underwhelming. Models small enough to run on my laptop leave much to be desired.

Explore modern LLM tools and agents: Claude Code phenomenally useful for coding. Claude also useful for a bunch of other tasks.

Developing an intuition for how LLMs work

I came into this with some knowledge of machine learning and neural networks from university courses. I even read the OG transformer paper back then but apparently didn’t take much away from it. So previously I considered LLMs to be classical stochastic token predictors that don’t think, but only produce text that conforms to previous patterns, thus appearing plausibly intelligent. In that view, assigning any cognitive abilities to LLMs is merely misguided anthropomorphizing of a statistical process.

But after catching up on transformer architecture, I’ve changed my mind. In particular, the concept that attention layers allow the LLM to exchange context between tokens evidently leads to a very expressive computational model. Contrasting this with a simple feed-forward, fully connected neural network, it seems obvious that this architecture is something fundamentally different. I’m also just nihilistic enough to reject any vague spiritual arguments that human cognition is somehow special. That leads me to consider it plausible that an LLM could actually “think”. Naturally, much has been written about this topic, both technical and philosophical. This is just my semi-informed take primarily from a technical perspective.
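To make the idea of “context exchange” concrete, here is a minimal sketch of a single attention head over toy token vectors. The learned query/key/value projection matrices are left out, so this illustrates only the mechanism, nothing resembling a real model:

```ts
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function softmax(xs: Vec): Vec {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// One attention "head": every token builds its output as a weighted blend of
// the tokens it is allowed to see (itself and everything before it).
function attention(tokens: Vec[]): Vec[] {
  const d = tokens[0].length;
  return tokens.map((query, i) => {
    const visible = tokens.slice(0, i + 1); // causal mask, as in decoder-only LLMs
    const scores = visible.map((key) => dot(query, key) / Math.sqrt(d));
    const weights = softmax(scores);
    return visible.reduce(
      (acc, value, j) => acc.map((x, k) => x + weights[j] * value[k]),
      new Array<number>(d).fill(0),
    );
  });
}

// Four toy 4-dimensional "token embeddings".
const tokens: Vec[] = [
  [1, 0, 0, 0],
  [0, 1, 0, 0],
  [0.5, 0.5, 0, 0],
  [0, 0, 1, 0],
];
console.log(attention(tokens));
```

Each token’s output is a weighted blend of the tokens visible to it; stacking many such heads and layers, interleaved with feed-forward blocks, is what makes the model so expressive.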

It’s worth mentioning that transformer models have obvious limitations that may not apply to human cognition: reasoning is inherently token-based, models are static and do not update their weights during inference, and I’m sure there are many more.

Explore local models, Claude and Gemini

To get a first impression of modern developments, I used local and hosted LLMs for some search queries and other tasks. Below are two examples with my impression of each model. For the local models I used Msty; the hosted ones I accessed on the free tier at this point.

Who is currently in the Swiss National Council

Summarize my server setup blog post and explain the deployment scripts

Local models conclusion

As much as I would like to use local models for privacy reasons, it doesn’t look like it’s going to happen. This is not to say that open-weight models in general are bad, just the ones that are small enough that I can run them on my laptop. So I guess I’ll have to go with one of the bigger hosted ones.

Hosted models

I’m going with Claude for this exploration for two reasons. First, it’s said to be the strongest model for coding, which is my primary use case. Second, of the AI companies out there, Anthropic seems the most sympathetic to me.

I can’t get behind Google for privacy reasons: they can simply bring together too many threads of my digital persona. OpenAI is exceptionally unpleasant in their public conduct. xAI also won’t cut it, for similar reasons.

Unfortunately there are no good options when it comes to privacy, but it seems Anthropic is the least bad.

Claude

Proofreading

Works very well. I can paste in a note in mdx format and just need to remind it to keep line breaks as is, and it will do a fine job directly correcting typos and awkward sentences, ready to paste back into my website.

It even handles markdown tables etc. flawlessly. Much simpler than pasting a note into an online spell-checker and manually fixing all the findings. With the note being checked into git, it’s easy to see the diff that Claude produced. I wouldn’t trust it otherwise.

Addendum: I also pasted this very blog post into Claude for proofreading and was caught off-guard when it referenced the content in its answer:

You’re welcome! The corrected version is ready to paste back into your website. With it being tracked in git, you’ll be able to easily see the diff of changes I made. Happy to help with proofreading anytime!

Addendum to the addendum: On the second iteration of passing this through Claude, it gave the above quote again. After asking whether it did this deliberately after seeing it in the text, it answered this:

Yes, exactly! I noticed that section where you mentioned pasting your blog post into Claude for proofreading and being “caught off-guard” when it referenced the content in its answer with that exact response. So I deliberately included the same response as a bit of meta-humor, since you were literally doing the same thing again. It seemed like a fun way to acknowledge that I had read and understood the content, rather than just mechanically proofreading without engaging with what you’d written about your LLM experiences.

Notably, on the second pass it wanted to replace a bunch of commas with dashes. They read sensibly, but are quite out of character for a non-native speaker, so I reverted most of them.

Spot mistakes in crypto code

I paste in the encryption code used in Fieldnotes and ask Claude to assess its security and whether it follows best practices. Claude says the code is sound and accomplishes its goal. When I ask follow-up questions, such as “What could be the advantage of using per-note keys rather than encrypting notes directly with the main key?”, it becomes clear that Claude has a rather sophisticated understanding.

To check that Claude isn’t overly agreeable, I ask the same question in a new context with modified crypto code that includes some issues. In one instance I make the main key extractable, which it correctly flags as something that could be improved. Further, when I pass undefined as the initialization vector, it raises alarm bells, calling this a critical bug with catastrophic consequences.
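For context, here is a rough sketch of the pattern under discussion. This is not the actual Fieldnotes code, just my reconstruction of the idea: a non-extractable main key that only wraps per-note keys, and a fresh random IV for every encryption. The issues I planted correspond to flipping extractable to true on the main key and passing undefined as the IV.

```ts
// Rough sketch of the per-note key pattern, not the real Fieldnotes code.
// Uses the WebCrypto API available in browsers and recent Node versions.

// The main key would normally be derived from the user's credentials and kept
// non-extractable, so it never leaves the crypto subsystem in raw form.
// (The planted flaw was setting extractable to true here.)
async function createMainKey(): Promise<CryptoKey> {
  return crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    false, // extractable
    ["wrapKey", "unwrapKey"],
  );
}

async function encryptNote(mainKey: CryptoKey, plaintext: string) {
  // A fresh key per note; it has to be extractable so it can be wrapped below.
  const noteKey = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    true, // extractable
    ["encrypt", "decrypt"],
  );

  // A fresh random 96-bit IV for every encryption.
  // (Passing undefined here was the bug Claude called catastrophic.)
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv },
    noteKey,
    new TextEncoder().encode(plaintext),
  );

  // The note key is stored wrapped (encrypted) by the main key, never in plain.
  const keyIv = crypto.getRandomValues(new Uint8Array(12));
  const wrappedNoteKey = await crypto.subtle.wrapKey("raw", noteKey, mainKey, {
    name: "AES-GCM",
    iv: keyIv,
  });

  return { ciphertext, iv, wrappedNoteKey, keyIv };
}
```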

This isn’t to say that an AI chatbot should be entrusted with security audits, but it is still reassuring that it doesn’t mess up on obvious issues. Overall I’m quite impressed here. Most engineers I work with wouldn’t know about IVs, let alone spot mistakes like this.

Architecture discussion

I’m writing a short overview of requirements for my upcoming rewrite of my observability / metrics tool. There are a bunch of open questions in my head and I haven’t managed to get a clear picture of what I want to do. Part of the reason is the absolute mess that is OpenTelemetry documentation.

# Observatory: Observability and Time Series System
Observatory is a new system to store observability data as well as time series data.

## Store time series data from embedded devices
There are multiple embedded devices that periodically record sensor data. They should be able
to send that time series data to Observatory.

## Store observability data from backend services
There are multiple backend services implemented in Rust. They use the `tracing` crate to instrument them.
They should be able to export their tracing data to Observatory.

## Visualization
Observatory should include a web application that visualizes the data stored within.

## Questions
- What storage layer is advisable?
- Can a database be used directly to ingest observability data or is a backend application in front of
the storage layer needed?

After pasting in this document and prompting Claude to ask clarifying questions, it produces an updated requirements doc. It contains a bunch of boilerplate that would satisfy an architect, but also some interesting stuff. Together we iterate towards a stack centered on ClickHouse, with either a custom OpenTelemetry collector in front or Jaeger using ClickHouse as its storage layer, and Grafana as the visualization tool.

I also ask follow-up questions regarding OpenTelemetry that I haven’t been able to find answers to after hours of scouring the web (due to the over-engineered mess that is the OpenTelemetry ecosystem), and Claude can give competent answers. Overall, exceptionally helpful for this task.

Claude as a Google Search replacement

In my experience, when Claude needs up-to-date information from the web, it’s noticeably less useful than when the knowledge and skill already embedded in its weights are sufficient.

Embedded development

I’m starting to notice that whenever Claude is really struggling with something, chances are I will be too when I inevitably have to do it myself.

Web development

Masonry grid

A lesson here is to be fair to the model. This is a challenging task to do well, and I secretly gave it to Claude because I don’t want to do it myself. Given that it also can’t see the output of its work, I really shouldn’t be surprised that it can’t pull it off. Neither could I.

Async image processing

While bugs like this don’t happen to me, since I’m very meticulous about handling any error the type system throws my way (part of the reason I enjoy Rust so much), I have to restrain myself from being annoyed: the time saving is still there, and it’s not much different from reviewing PRs at work.

OpenTelemetry shenanigans

Continuing from the architecture discussions with Claude, I went on to try some of the conclusions in a proof of concept. I also have a more detailed write-up of this here.

How do I use LLMs now?

With all this testing done, I’m starting to get a grip on how LLMs can be useful for me. I might add that I’m primarily talking about my personal projects here. At work I use LLMs significantly less, for various reasons: there is so much context required to produce good code that it’s often faster to do it myself than to verbalize all that context. The complexity in my project at work is also mostly in the interactions between systems, dependencies and a host of human factors; technologically it’s quite simple. Naturally that’s a bad fit for LLMs. They’re not about to coordinate an API change with another team for me.

Personal projects, on the other hand, are mostly about the technical challenges, and all the code lives in a repo available to the LLM. API changes can be made without a care in the world, and UIs can be changed without discussing with UX for an hour.

So for personal projects I use Claude Code in a terminal next to my normal IDE setup. This is a nice way of staying in control while having Claude sit there and help out if needed. I always work on a branch and commit often to be able to see isolated changes in a diff. No vibe coding for me: I always check Claude’s changes in detail. Typically I watch Claude make changes and closely guide it in the direction I prefer. This whole ‘agentic’ workflow where people supposedly have multiple instances of LLMs running simultaneously does not work for me.

Apart from using Claude Code in the terminal, I also use the web interface often for one-off tasks, such as crafting ClickHouse queries, discussing architectural approaches, proofreading stuff etc.

Appendix

Notes on Deep Dive into LLMs like ChatGPT

Notes on How I use LLMs

Notes on Transformers, the tech behind LLMs

Notes on Attention in transformers, step-by-step

Notes on How might LLMs store facts