Selecting an Open Source LLM

People often ask me about running an LLM on "their own data" or on "their own hardware" because, for a variety of reasons, they'd rather not use ChatGPT, Bard, Bing, Claude, etc. These reasons include:

Privacy - They really don't want to or are legally not allowed to send data to a third party. This could include situations involving sensitive personal data or state or corporate secrets.

Specialized and up-to-date models - They want to run a model trained on up-to-date data (though carefully consider Retrieval Augmented Generation (RAG) first). Or they want a model specifically trained on a particular task (such as information extraction), on particular data (such as private or copyright-safe data), or on a completely different kind of 'language' (DNA, music, etc.).

Change control management - They want to be in complete control of the model and of any guard rails or restrictions that may be put in place, since even subtle changes in what is allowed in the prompt or in the response can wreak havoc on a workflow.

Technical requirements - They have some other technical or operational requirement, such as limited Internet connectivity, latency, throughput, or budgetary constraints.

Why Start with an Open Source (OS) LLM?

Rather than start from scratch most organizations will be better served by starting with one of the open source LLMs that have already been pretrained. If you really need to start from scratch you'll need to:

Write code to implement the architecture of the model with its transformers
Collect and process data to do the initial pre-training
Spend a large amount of money on GPU time for the initial training
Collect and process data for the fine tuning
Spend more resources on the fine tuning and evaluation

Pretraining is the process of getting an LLM to understand the basic structure of data. In the case of natural language you can think of it as getting the model to understand basic language structure by getting it to auto-complete the next word.
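To make the "auto-complete the next word" idea concrete, here is a toy sketch that counts which word follows which in a tiny made-up corpus and then predicts the most frequent follower. It is only an illustration of the training objective, not how an LLM is built: real models use transformer networks over tokens rather than word counts, and the corpus and function names here are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Tiny illustrative corpus (hypothetical data).
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
```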

Fine tuning is the process of getting the LLM to generate text in the particular way that you want for the task you have in mind. For example, for a model that has only been pretrained, a reasonable completion for "What is the capital of England?" might be "What is the capital of France?" since it does not know whether you want it to answer questions or create a geography quiz. Fine tuning lets you specify what the task is by showing it the kinds of completions you'd like to see.
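As a rough sketch of what "showing it the kinds of completions you'd like to see" can look like in practice, fine tuning data is often stored as pairs of prompts and desired completions, one record per line. The field names "prompt" and "completion" and the examples themselves are assumptions for illustration; the exact format depends on the training tool you use.

```python
import json

# Hypothetical examples that teach the model to answer a question
# rather than continue it with another question.
examples = [
    {"prompt": "What is the capital of England?", "completion": "London"},
    {"prompt": "What is the capital of France?", "completion": "Paris"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


jsonl = to_jsonl(examples)
print(jsonl)
```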

Many open source models come already fine tuned (to answer questions, follow instructions, etc.), while others are only pretrained and can then be fine tuned on your data.

When looking at using an "open source" model some issues to consider include which models:

have adequate performance "as is"
can be further fine tuned
meet deployment latency, resource and budgetary requirements
have acceptable business use licenses

Model Performance

Adequate performance can be hard to gauge and really comes down to your specific use case. There are standard benchmarks, and leaderboards such as the Hugging Face Open LLM Leaderboard keep track of how models score on them. However, although the benchmarks are useful and may give you some general insight, performance on your particular task is what matters most and must be measured explicitly.
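One simple way to measure performance on your own task is to collect a set of prompts with known good answers and score each candidate model against them. The sketch below uses exact-match accuracy, which is only one possible metric and suits short factual answers best. The `fake_model` function is a stand-in assumption; in practice you would pass in a function that calls the model you are evaluating.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(generate, examples):
    """Fraction of (prompt, expected) pairs where the model output matches."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        normalize(generate(prompt)) == normalize(expected)
        for prompt, expected in examples
    )
    return hits / len(examples)


def fake_model(prompt):
    """Hypothetical stand-in for a real model call; gets one answer wrong."""
    canned = {
        "What is the capital of England?": "London",
        "What is the capital of France?": "Lyon",
    }
    return canned.get(prompt, "")


# Hypothetical task-specific test set.
task_examples = [
    ("What is the capital of England?", "london"),
    ("What is the capital of France?", "Paris"),
]

score = exact_match_accuracy(fake_model, task_examples)
print(score)  # 0.5: one of the two answers matches
```

Running the same examples through each candidate model gives a like-for-like comparison on the task you actually care about.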

The factors that affect performance are:

Model size and architecture
Data size and quality
Training process and duration

All else being equal, larger models will generate better answers but will be harder to fine tune, take more resources to deploy, and take longer to process requests. At the time of this writing, various versions of the Llama-2 70 billion parameter model are near the top of the leaderboard. These versions take the base model from Meta and fine tune it with different data and techniques. The Llama-2 family is a good place to start, as its models can be further fine tuned on your data if you find it necessary. Which exact version is best for you will depend on your specific use case.

A 70 billion parameter model is fairly large and will take some planning to deploy into production. Quantizing the model (storing its weights at lower numeric precision) will reduce the resource requirements but can degrade the quality of its answers. Again, the tradeoffs are use case specific.
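A back-of-the-envelope calculation shows why size and quantization matter for deployment. The sketch below estimates the memory needed just to hold the weights at different precisions. It deliberately ignores everything else a running model needs (activations, caches, framework overhead), so treat the numbers as a lower bound rather than a sizing guide.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9  # 70 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(params_70b, bits):.0f} GB")
# 16-bit: ~140 GB
# 8-bit: ~70 GB
# 4-bit: ~35 GB
```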

Additionally, the amount of data used in pretraining and the quality of the fine-tuning data are both very important to performance. Several companies are working on using smaller models with larger and higher quality data. Of particular interest is Microsoft's Phi 1.5 project and the accompanying publication, Textbooks Are All You Need.

Licensing

The state of licensing is undergoing a lot of discussion and change and is something you need to be aware of. Consult with your legal counsel and explain your exact use. This is not legal advice.

Licensing can be broken down into three main areas: 1) the model architecture, 2) the model weights, and 3) the data used in training. Make sure you are covered in all three areas.

Usually the architecture and the initial weights are released and licensed together. However, someone could take a published architecture and use it to create their own weights, in which case the original architecture and the new base weights could have different licenses.

A more common scenario is that someone takes an initial architecture and weights and then does fine tuning by adding or adjusting the weights. That new weight diff/snapshot could then fall under a different license.

And finally, the data used during both the pre-training and fine tuning could have their own licensing considerations. Many models were trained on copyrighted data and it is not yet clear if this falls under "fair use". Other models are specifically trained on data without copyright issues but this data may be limited or out of date.

For example, and confusingly, Llama 1 was not released with an open license, but various organizations and individuals were able to duplicate the architecture (Dolly, Alpaca, etc.), train it on different data sets, some copyrighted and some not, and then release the code and weights. Some of these models were released for academic use only and others were released for permissive use. However, even in the latter case, the models may have been trained on copyrighted data, which means there may still be potential legal issues.

The Llama-2 models (architecture and weights) were released with a permissive license. However, that license does put restrictions on commercial use for some very large users and for some purposes, leading some to claim that it is not truly open source.

Conclusion

In summary, when you go to choose an open source model, make sure to consider:

Your specific use case
Strategy to measure performance
Data you need to fine tune if desired
Required resources and latency
Licensing for each part of the whole system

Currently, the Llama-2 family is a good place to start.
