In February I wrote a post on how I stopped paying for ChatGPT in favor of a self-hosted solution. Since then, I haven’t completely stopped using ChatGPT, and I did resume my subscription briefly, but I’ve cut the cord again, this time I think for good.

I mentioned previously that I’m using Ollama as a backend to run large language models on a local machine, and while it has gotten some significant upgrades in the past few months and even has its own UI now, I still prefer the solution I found earlier this year for day-to-day use: Open WebUI. It’s a powerful self-hosted interface for using both local and cloud AI models, and it has rapidly gained a massive community and countless new features thanks to dozens of contributors. The team behind the app has kept pace very well with the major commercial offerings, and they show no signs of slowing down, having gained both Tailscale and Warp, the AI-powered terminal, as sponsors.

Not Just a ChatGPT Clone

When you use Open WebUI, there’s no mistaking that it takes inspiration from ChatGPT, but it’s also reminiscent of other LLM interfaces, since that design is where the industry has largely converged. Open WebUI brings not only plenty of eye candy but also many power-user options that commercial products don’t provide, convincing me more every day that paid models just don’t raise the bar enough to justify their prices. Of course, it’s a different story entirely if you aren’t inclined to self-host or don’t have the hardware to run LLMs. But for those who are, the combination of Ollama, Open WebUI, and the right models can’t be beat.

Under the Hood

With Open WebUI, there are numerous options for architecting a self-hosted AI solution, particularly thanks to its broad API compatibility. If you have the hardware to run LLMs in the machine you serve the app from, Open WebUI can even host models for you, eliminating the need for Ollama or another model runner.

Because my Mac Studio has way more compute power and is much more efficient than any of my servers, I chose to split my setup:

  • Open WebUI runs on a small public server
  • Ollama runs on my Mac Studio
  • Authentication is handled via OIDC

And of course, everything is behind my firewall, with a reverse proxy handling external traffic.
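For anyone replicating a similar split, the key piece is pointing Open WebUI at the remote Ollama instance through its OLLAMA_BASE_URL setting. A minimal sketch using Docker (the hostname here is just a placeholder for wherever your Ollama machine lives) looks something like this:

docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://mac-studio.local:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

The same URL can also be changed later from the connection settings in Open WebUI’s admin panel.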

The first problem I ran into when splitting my setup like this is that Ollama would only serve itself on 127.0.0.1, so it was only accessible from the Mac. I changed this to 0.0.0.0:

launchctl setenv OLLAMA_HOST '0.0.0.0:11434'

but this required re-running the command every time Ollama restarted. Thankfully, a recent Ollama update added a built-in setting to expose the server on the network, so this workaround is no longer needed.
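With Ollama listening on the network, it’s easy to confirm from the Open WebUI host that the API is reachable. Assuming the same placeholder hostname as above, listing the installed models is a quick sanity check:

curl http://mac-studio.local:11434/api/tags

If that returns JSON describing the models on the Mac, Open WebUI should have no trouble connecting either.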

Another issue I had early on was that when my Mac went to sleep, Open WebUI would lose its connection to Ollama a few seconds later. I didn’t want to keep my Mac awake with the display constantly on, as that defeated the purpose of running on a power-efficient device, so I added a button to my Stream Deck that triggers a shortcut containing the action “Put the display to sleep.” This locks the screen and sleeps the display without putting the Mac itself to sleep, keeping networking alive.
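If you’d rather skip the Shortcuts step, the same effect should be achievable from a terminal (or from a Stream Deck action that runs a command) with macOS’s built-in pmset utility, which sleeps only the display:

pmset displaysleepnow

The Mac itself stays awake, so Ollama remains reachable over the network.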

Something interesting to note is that the Mac Studio produces a faint, chattering coil whine while a model is generating content. The whine tracks the model’s output word for word, to the point that it sounds like the computer is physically speaking or typing. I doubt it’s harmful to the system, as people who run models on dedicated GPUs have reported similar experiences. I just think it’s a neat little quirk.

Comparison with Commercial Models

I’ve spent a significant amount of time using this system both alongside ChatGPT and on its own, especially as more powerful models from Asia have been released this year, and overall I’ve found the capabilities of this setup to be very compelling. My Mac Studio has 128 GB of RAM, so the models I can run max out around 90 billion parameters, whereas models like those behind ChatGPT can range in size from 400 to 600 billion parameters. Parameter count has a significant influence on both a model’s “mental” abilities as a whole and how knowledgeable it is, but despite being much smaller, distilled models in the 35-90 billion range are still very capable.

DeepSeek’s R1, for example, is a reasoning model that, much like OpenAI’s o1 and o3, uses chain-of-thought to “mull over” a conversation for anywhere from a few seconds to a couple of minutes in order to refine its answer into something more targeted and (hopefully) more accurate. While I’ve seen R1 “overthink” from time to time, it’s a great model for parsing a large volume of information and summarizing it or extracting key details that would otherwise take far longer to obtain.
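As a rough illustration of the summarization workflow I’m describing, a distilled R1 variant can be driven straight from Ollama’s CLI; the model tag and file name below are just placeholders for whatever you have pulled locally:

ollama run deepseek-r1:70b "Summarize the key findings in this report: $(cat report.txt)"

The same thing works through Open WebUI by attaching the document to a chat, but the one-liner is handy for scripting.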

Llama 3.2 Vision is what I’ve been using for vision processing, and it’s capable of not only image recognition but also visual analysis and reasoning. I find that its accuracy tends to drop off considerably with very complex images such as detailed maps, but it’s great at identifying the broad strokes, such as “there appears to be a person in this photo,” as well as spatial relationships. I’ve even noticed it picking out details I’ve overlooked in images. While I still prefer to go to ChatGPT for more advanced image processing, Llama 3.2 Vision is becoming my go-to when privacy is a heightened concern.
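Under the hood, Ollama’s generate endpoint accepts base64-encoded images alongside the prompt for multimodal models, which is presumably how Open WebUI passes pictures through as well. A minimal sketch against a local instance, with placeholder image data, looks like:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision",
  "prompt": "Describe what is happening in this photo.",
  "images": ["<base64-encoded image data>"],
  "stream": false
}'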

Qwen 3 is my new favorite all-around model. I’ve pitted it against Llama 3.3 for some time, and Qwen 3 just keeps pulling ahead. To me it feels like it matches the capabilities of GPT-4o, possibly even beating it in some ways. Its responses are comprehensive and well-structured, it handles complex prompting well, and it works through involved math and science discussions with ease, areas in which other open-source models seem to struggle much more. There are several fine-tuned variations such as Qwen 3 Coder, but so far in my testing I haven’t found them to be notably better in their areas of focus than the base model.
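If you want to try it yourself, Qwen 3 is in the Ollama model library; assuming the 32-billion-parameter dense variant fits your hardware, getting a chat session going is just:

ollama pull qwen3:32b
ollama run qwen3:32b

Smaller variants exist for machines with less memory, and once a model is pulled it shows up in Open WebUI’s model picker automatically.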

A Word of Caution

I have to point out that most of the models you can realistically run locally are very unlikely to reach the capability of the flagship models produced by OpenAI and Anthropic. I don’t have hardware capable of hosting an open-source flagship such as Qwen3 235B A22B or DeepSeek R1 671B, so it’s possible that such models are indistinguishable in practice from something like GPT-5. But there is a noticeable gap in quality between GPT-5 and Qwen 3 32B, and between OpenAI’s o1 and DeepSeek R1.

However, those gaps seem to be narrowing consistently over time. When I first booted up Llama 2 on my MacBook Pro 18 months ago, I was utterly amazed to have an LLM running fully offline, but it certainly didn’t feel like it was going to replace anything commercially available. Now, the latest open models are capable enough that it feels like they’re nipping at the heels of the AI giants.

What’s Next

I’ll follow this post up with more details on how I’m using each model, and perhaps by then I’ll have addressed my biggest want: a router that automatically loads the best model for a given prompt without me having to select it manually. That convenience factor is what keeps me going back to ChatGPT when I’m in a hurry. Going forward, I’d like to stop using ChatGPT entirely, but even now it’s becoming more of an edge-case handler than my primary AI tool.

To that end, much is happening in the open-source AI space. As I’ve been writing this post over the past few days, several Asian companies have announced or released significant upgrades to popular models. The GGUF file format, introduced by the llama.cpp project, has become the standard way to distribute quantized models, shrinking their memory footprint and compute requirements with only a modest cost in accuracy. And more tech companies are getting into the GPU market, which will inevitably drive down prices and make locally hosted AI more accessible to the average person.

For myself, the next major goals I have are:

  • Text-to-image generation
  • (Self-hosted) cloud storage connection
  • Agentic AI

Of course, the landscape of AI changes rapidly from one month to the next, so these priorities could easily shift by the time I post again.