Running LLMs Locally
Many people are interested in running LLMs locally. Their primary motivations are privacy/security and, in some cases, cost control: if you are running on your own hardware you don't have to share data with a third party, and you can make as many API calls as you like for a (relatively) fixed price.
Running locally is certainly possible, though some things to consider are:
Nail down what you mean by 'locally'. It could be on your computer, on a computer in your home or office network, or in your virtual private network provided by your cloud vendor. This article is geared towards the first two scenarios. If you are interested in running in the cloud, review why you want to run your own model, and then see what services your cloud vendor offers.
Think through the workflow you need. The main ones I'm asked about are:
- chat - where conversations are managed and stored
- RAG - to query your internal/private documents
- API access - to write scripts for automating workflows and to experiment with agents
- Coding assistance - a private version of GitHub Copilot
Different workflow requirements will be best served by different solution configurations. What works well in one situation may be overkill or unsuitable in another.
Generative AI is resource intensive. All things being equal (though they never are), larger LLMs will give higher-quality answers but require more memory and compute than smaller ones. However, smaller models may be good enough for the task you have in mind. You'll need to experiment with different models.
You don't technically need a GPU. But without one, generating responses will be noticeably slower. This might be perfectly fine if you are running a batch process rather than an interactive chat session.
Even if you have a GPU you may be limited by memory. GPU memory is (usually) separate from and different from system memory. Consumer-grade GPUs come with roughly 6-24 GB of memory. A quantized 7B model takes about 8 GB of memory to run, and that requirement scales roughly linearly with the size of the model. So memory will limit the largest models you can run. There are techniques to manage memory limitations, but they add to the time required to generate an answer.
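For a rough back-of-the-envelope estimate, you can work from bits per weight plus some overhead. This is a sketch only; the constants below are assumptions, not measurements, and real usage depends on the quantization format, context length, and runtime overhead.

```python
# A rough, illustrative estimate of the memory needed to run a quantized model.
# Assumptions (not measurements): ~8 bits per weight for an 8-bit quantization
# and ~15% overhead for the KV cache and runtime buffers.

def estimate_memory_gb(params_billion: float,
                       bits_per_weight: float = 8.0,
                       overhead: float = 1.15) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for size in (7, 13, 34, 70):
    print(f"{size}B parameters: ~{estimate_memory_gb(size):.0f} GB")
```

With a more aggressive 4-bit quantization you would halve the bits per weight, and the totals shrink accordingly.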
The M series Macs are nowhere near as fast as an NVIDIA GPU. But they have unified system memory, so you'll be able to use most or all of your computer's memory and available GPU cores to run larger LLMs as well as other models such as Whisper.
Approaches
There are many ways to run LLMs locally, including llamafile, oobabooga, LM Studio, Ollama, and coding libraries like Hugging Face Transformers, LangChain, and LlamaIndex. I'm sure I'm missing some, and more will be created in the future, so do a quick survey of the landscape before you commit, or if my suggestions don't work for you for some reason.
Today, I mostly see people using LM Studio or Ollama. LM Studio runs many LLMs, provides a web chat interface, and can serve an API. Ollama also runs many LLMs, has a basic command-line chat interface, and provides a powerful, easier-to-use, OpenAI-compatible API.
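Because the API is OpenAI-compatible, you can point a standard OpenAI client library at a local Ollama instance. Here is a minimal sketch, assuming Ollama is running on its default port and that the model named below has already been pulled; the api_key value is required by the client library but is not checked by Ollama.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local Ollama instance.
# The api_key is required by the client library but is not checked by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="llama3",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Give me one reason to run an LLM locally."}],
)
print(reply.choices[0].message.content)
```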
I primarily use Ollama for the API, which is straightforward to use. You can access the API from any language with a simple HTTP POST, from their JS and Python libraries, from LangChain and LlamaIndex, and from other tools that expect an OpenAI API, such as AutoGen and CrewAI.
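As a minimal sketch of the plain HTTP route (assuming Ollama is running locally on its default port, 11434, and that you have already pulled the model with `ollama pull`):

```python
import requests

# Call Ollama's native generate endpoint with a plain HTTP POST.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",          # any model you have pulled locally
        "prompt": "Summarize why someone might run an LLM locally.",
        "stream": False,            # return one JSON object instead of a stream
    },
    timeout=120,
)
print(response.json()["response"])
```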
One powerful feature of the Ollama API is that you can specify which model you want to use in each request, and Ollama will load it on demand. So, in your workflow you can easily choose a specific model for particular steps, as sketched below. For example, one request can specify LLaVA, a multimodal model, to analyze an image; another Mixtral to generate additional text; and another DeepSeek Coder for a code generation step.
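Here is what that multi-model workflow might look like. This is a sketch, assuming a local Ollama instance where llava, mixtral, and deepseek-coder have been pulled; the image path and prompts are placeholders.

```python
import base64
import requests

OLLAMA = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str, images: list[str] | None = None) -> str:
    """Send one non-streaming request to Ollama; the model is chosen per call."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images  # base64-encoded images for multimodal models
    return requests.post(OLLAMA, json=payload, timeout=300).json()["response"]

# Step 1: describe an image with a multimodal model (path is a placeholder).
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
description = generate("llava", "Describe this diagram.", images=[image_b64])

# Step 2: expand the description with a general text model.
summary = generate("mixtral", f"Write a short report based on: {description}")

# Step 3: generate code with a coding model.
code = generate("deepseek-coder", f"Write a Python script that implements: {summary}")
print(code)
```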
There is a lot more to talk about, but it is all situation-specific, and hopefully this will get you started. If you have questions about running LLMs locally to meet your needs, don't hesitate to get in touch.