The primary focus of this endeavour was to demonstrate the feasibility of running Llama 2 models on low-powered devices using pure C code.

Someone who could build a GPT from scratch in a weekend has spent a surprising amount of time tinkering with the open-source Llama 2. The quest to run LLMs on a single computer prompted Andrej Karpathy of OpenAI, known for his contributions to the field of deep learning, to undertake a weekend project to create a simplified version of the Llama 2 model, and here it is: llama2.c, a "baby" Llama 2.

 To do this, "I took nanoGPT, configured it to implement the Llama 2 architecture instead of GPT-2, and the content was to write a C inference engine in run.c," explained Karpathy in the Llama2.c GitHub repository. His goal was to implement nanoGPT on the Llama 2 architecture, not GPT in the C programming language. The archive has already received 2.2 thousand stars.  

The appeal of Karpathy's approach lies in its ability to achieve highly interactive speeds even with moderately sized models of a few million parameters, trained on the TinyStories dataset. He reports that on his M1 MacBook Air, a ~15 million parameter Llama 2 model infers at around 100 tokens per second in fp32, all with the C code he wrote. The result is striking because it shows that such models are feasible on resource-limited devices with a very simple implementation.
 
 Sample output 
 
Karpathy also explained in a Hacker News thread how surprised he was that the compiled model ran much faster than expected on the M1 MacBook Air, at around 100 tokens per second. Encouraged by this result, he has been actively updating the repository and has started testing a roughly 44 million parameter model, about three times larger. Surprisingly, he was able to train it for 200,000 iterations with a batch size of 32 on four A100 GPUs in about eight hours. "With this success, it appears that achieving the 7B Llama model may be within reach," Karpathy said. He is known for several courses on how to build a GPT from scratch, and many congratulated OpenAI for hiring him back from Tesla.

What is the Baby Llama approach?

Karpathy said that this approach was largely inspired by Georgi Gerganov's project llama.cpp, which did much the same thing: it ran the first version of LLaMA on a MacBook using C/C++.
Karpathy's approach involves training the Llama 2 LLM architecture from scratch using PyTorch. After training, he saves the model weights to a raw binary file. Here's the interesting part: he then writes a roughly 500-line C file called run.c that loads the saved model and runs inference in single-precision floating point (fp32). This minimalist approach keeps the memory footprint low and requires no external libraries, which is what lets a single M1 laptop deliver usable performance without any graphics card.
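To make the idea concrete, here is a minimal sketch of the same pattern: a flat fp32 weight blob read from disk and a hand-rolled matrix-vector product, using nothing beyond the C standard library. The file layout, struct fields, and dimensions are illustrative assumptions, not the actual format used by run.c.

```c
/* Sketch: load a flat fp32 checkpoint and run one matmul in plain C.
   The header layout (dim, vocab_size) is an assumption for illustration. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int dim;                 /* model width (hypothetical)           */
    int vocab_size;          /* vocabulary size (hypothetical)       */
    float *token_embedding;  /* (vocab_size, dim) weights from disk  */
} Model;

/* Read the whole checkpoint into one contiguous fp32 buffer. */
static int load_model(Model *m, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fread(&m->dim, sizeof(int), 1, f) != 1 ||
        fread(&m->vocab_size, sizeof(int), 1, f) != 1) { fclose(f); return -1; }
    size_t n = (size_t)m->vocab_size * m->dim;
    m->token_embedding = malloc(n * sizeof(float));
    if (!m->token_embedding || fread(m->token_embedding, sizeof(float), n, f) != n) {
        free(m->token_embedding);
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}

/* The workhorse of fp32 inference: out(d) = W(d,n) @ x(n). */
static void matmul(float *out, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) val += w[i * (size_t)n + j] * x[j];
        out[i] = val;
    }
}
```

The actual run.c follows the same recipe at full scale: one contiguous buffer of fp32 weights read from the checkpoint, plus a handful of loops implementing the Llama 2 forward pass, with nothing else to link against.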

Karpathy also explores several ways to improve the performance of the C code, including compilation flags such as -O3, -Ofast and -march=native. These flags optimize the code, enabling vectorization, loop unrolling, and other hardware-specific tuning. By experimenting with them, users can make inference on their own systems even faster.
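In practice this is just a different compiler invocation, along the following lines, with run.c coming from the repository and the output name chosen arbitrarily:

```
# Portable, conservative optimization
gcc -O3 -o run run.c -lm

# More aggressive: -Ofast relaxes strict IEEE floating-point rules and
# -march=native lets the compiler use every instruction the local CPU offers
gcc -Ofast -march=native -o run run.c -lm
```

Note that -Ofast trades strict floating-point semantics for speed, which is usually acceptable for inference but worth keeping in mind.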

 
You can try the baby Llama 2 model on your own device by downloading the pre-trained model checkpoint from Karpathy's repository. The included code lets you compile and run the C program on your system, giving a glimpse into the magic of running a deep learning model in such a minimalist environment.
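Assuming the checkpoint has been saved locally as model.bin (the current download link is in the repository's README, and the file name here is just a placeholder), the whole workflow is a compile step followed by a run:

```
# Compile the inference engine, then point it at the downloaded checkpoint
gcc -O3 -o run run.c -lm
./run model.bin
```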
 
It's important to note that Karpathy's project is a weekend experiment and not intended for production-level deployment, as he himself admits. The main goal of the effort was to demonstrate that Llama 2 models can run on low-power devices using pure C code, something long considered impractical for machine learning because plain C offers no built-in GPU acceleration.
 
The Rise of Small LLMs
 
The main reason models have been shrinking lately is so they can be trained on and integrated into smaller, local devices. Beyond not requiring a GPU, Karpathy's approach sets a precedent for what can be achieved on modest consumer hardware. It is possible that, through its partnership with Meta, Microsoft will release a series of small LLMs based on Llama 2.

Similarly, the release of Meta's Llama 2 was accompanied by a collaboration with chipmaker Qualcomm, with the goal of making Llama 2 run on local hardware. Apple also has a huge developer ecosystem, for which the company recently released an implementation of the Transformer architecture optimized for Apple Silicon. Karpathy has already shown just how much is possible.