Ollama, LM Studio vs llama.cpp
When getting started with local LLMs, one of the very first decisions you have to make is what kind of inference engine [1] to use. There are quite a few out there, so which one is best? I'll take a look at the three main options: LM Studio, Ollama and llama.cpp.
[1] Inference engine: the program that gives life to the huge model files on HuggingFace - i.e., loads the weights and runs them to generate text.
LM Studio
LM Studio is primarily a GUI application. It has a user-friendly interface for downloading the models and getting them running. It includes a chat interface where you can quickly converse with the loaded models.
The models that LM Studio uses come from HuggingFace, where the LM Studio Community organization provides a curated set of models prepared specifically for LM Studio.
Behind the scenes, it uses llama.cpp to do the actual model inference - CPU, CUDA, ROCm, Vulkan & Metal are all supported. In addition to the llama.cpp-based engines, there is also an Apple MLX-based engine for Apple Silicon.
LM Studio provides its own APIs as well as an OpenAI-compatible API, so you can interact with the models programmatically or hook them up to agents and IDEs that support OpenAI-compatible providers.
It also has an lms CLI utility for people who prefer to interact with it from the terminal.
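Roughly, a headless workflow looks like this - the model identifier is just an example, and port 1234 is the default for LM Studio's local server as far as I know:

```bash
lms get qwen2.5-7b-instruct       # download a model
lms load qwen2.5-7b-instruct      # load it into memory
lms server start                  # start the local OpenAI-compatible API server

# Talk to it like any OpenAI-compatible provider:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-7b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```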
Ollama
Ollama was trying to be the Docker of LLMs (before Docker started supporting running models itself). It is a CLI-based application that gives users a very convenient way to download and run models. It focused on developers first, and successfully got itself integrated into various applications that serve as more user-friendly frontends for it.
The main innovation that Ollama introduced, in my opinion, is automatic model unloading based on VRAM usage. Ollama can keep as many models loaded as memory permits and will unload some of them only when needed to make room for the newly requested model.
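The day-to-day workflow is pleasantly short (the model name below is just an example):

```bash
ollama pull llama3.2     # download the model
ollama run llama3.2      # chat with it in the terminal (starts the server if needed)
ollama ps                # see which models are currently loaded and for how long
```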
There are a few other ideas they've adopted that differ from the status quo. For example, Ollama introduced a Docker-like registry for models, where files are stored and downloaded based on their SHA-256 hashes. HuggingFace has only recently implemented support for this "registry" style of access, so it is now possible to pull GGUFs directly from HuggingFace.
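Pulling straight from HuggingFace looks like this - the repository and quantization tag are examples:

```bash
# Run any GGUF repo from HuggingFace, optionally picking a quantization tag:
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```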
Then there is the Modelfile concept, which is meant to be a Dockerfile-inspired solution to defining models. To me, this seems like an unnecessary - or even wrong - abstraction. If you want to tweak parameters (e.g., num_ctx, num_gpu, use_mmap) to optimise a model for your hardware, you must create a new version of each model you use.
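For illustration, here is what that looks like if all you want is a bigger context window - the names and values are examples:

```bash
# Derive a whole new model just to change one runtime parameter:
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 16384
EOF

ollama create llama3.2-16k -f Modelfile
ollama run llama3.2-16k
```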
What puzzles me the most is Ollama's decision to replace the de-facto standard Jinja templates (for chat messages) with Go templates. This means that if you want to run a newer model on Ollama, you also have to port its Jinja chat template to Go template syntax.
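You can see the Go template Ollama ships for a model like so (model name is an example); the original Jinja template lives in the GGUF's tokenizer.chat_template metadata, so the two have to be kept in sync by hand:

```bash
ollama show --template llama3.2
```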
Choosing the quantization level for a model is also somewhat involved. Ollama defaults to the Q4_K_M quantization level for all its published models. If you want something different, you must first obtain the Modelfile of Ollama's model (for the chat template and parameters), then find a GGUF with the desired quantization on HuggingFace, and finally combine the two.
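A rough sketch of that dance, with example names:

```bash
# Grab the template and parameters Ollama uses for the model:
ollama show --modelfile llama3.2 > Modelfile

# Edit the FROM line to point at a GGUF you downloaded at the quantization you
# actually want, e.g.:
#   FROM ./Llama-3.2-3B-Instruct-Q8_0.gguf

# Then register it as a new model:
ollama create llama3.2-q8 -f Modelfile
```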
Ollama has been using llama.cpp behind the scenes as its inference engine, but it is slowly moving towards its own version. In the short term, I think, it won't be great for users as the new engine has to play catch-up with llama.cpp; in the long term, however, it could be interesting.
Although Ollama relies on llama.cpp, it lags behind it in some areas. For example, Vulkan support is still experimental, and Ollama does not support partitioned (split) GGUFs, which are becoming the norm on HuggingFace.
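For context, split GGUFs are large models shipped as numbered shards; llama.cpp loads the remaining parts automatically when you point it at the first one (file names below are examples):

```bash
# A split GGUF as it typically appears on HuggingFace:
#   Qwen2.5-72B-Instruct-Q4_K_M-00001-of-00003.gguf
#   Qwen2.5-72B-Instruct-Q4_K_M-00002-of-00003.gguf
#   Qwen2.5-72B-Instruct-Q4_K_M-00003-of-00003.gguf

llama-server -m Qwen2.5-72B-Instruct-Q4_K_M-00001-of-00003.gguf
```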
Llama.cpp
Llama.cpp is a C++ library that powers the two offerings above. In addition to being a library that others can build on, llama.cpp ships with llama-server - a tool that lets you run models via an OpenAI-compatible API, just like Ollama and LM Studio. There is also LlamaBarn, a lightweight GUI for macOS.
Llama.cpp is very actively developed; support for new models and optimizations for existing ones land almost daily. The project does not have scheduled releases - whatever gets implemented (merged) is shipped automatically. This means you always get the latest changes, but you may also run into bugs and/or regressions.
Llama-server exposes a lot of knobs to tweak and supports various backends (the main ones being CUDA, ROCm, Vulkan, and CPU), which lets you experiment and find the best configuration for your hardware. Most of these knobs are also exposed in Ollama & LM Studio, but not all of them.
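A typical invocation looks something like this - the model path, context size and layer count are examples to tune for your own machine:

```bash
# -c    context size in tokens
# -ngl  number of layers to offload to the GPU (99 ~ "everything that fits")
llama-server -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf -c 16384 -ngl 99 --port 8080

# Recent builds can also fetch a model straight from HuggingFace:
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```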
The easiest way to run llama-server is via the official Docker images. It is also available through the winget, Homebrew, and Nix package managers, and pre-built binaries can be downloaded from the GitHub releases.
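With Docker it looks roughly like this; the image tag shown is the CPU server build (there are CUDA/ROCm/Vulkan variants), and the paths are examples:

```bash
docker run --rm -p 8080:8080 -v "$HOME/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080
```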
In general, llama-server is best if you want to run the latest models & get the most out of your hardware.
Side note: Reusing models
It is perfectly reasonable to use more than one of the tools above. One question that may arise is whether you can download a model once and run it in different tools, instead of downloading the same data multiple times.
Between LM Studio and llama-server, reusing the downloaded models is relatively simple. Some LM Studio-specific models may use Jinja template functions that don't exist in llama.cpp, but in most cases the models are interchangeable.
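LM Studio keeps plain GGUF files on disk, so llama-server can load them directly. The path below assumes the default models directory of recent LM Studio versions (check the app's settings for yours); publisher and file names are examples:

```bash
llama-server -m ~/.lmstudio/models/lmstudio-community/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q4_K_M.gguf
```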
Ollama makes this more difficult. Even though it uses GGUFs behind the scenes, it stores everything by hash, so an extra step is required to translate to/from the hash when trying to reuse data.
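The blobs themselves are still regular GGUFs, though, so with a bit of digging you can point llama-server at them. This sketch assumes the default store at ~/.ollama/models; the hash is a placeholder you have to look up via the manifest (or by size):

```bash
# The model weights are usually the largest blob:
ls -lhS ~/.ollama/models/blobs | head -n 5

# Load that blob directly in llama-server:
llama-server -m ~/.ollama/models/blobs/sha256-<hash-of-the-weights>
```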