Small Language Models: Evaluating Phi-3 and Llama-3 on Low-Power Devices
Chapter 1: Introduction to Small Language Models
The landscape of AI model development has recently taken an intriguing turn. Traditionally, larger models were deemed more intelligent and capable of handling intricate tasks, but that size comes at a high computational cost. Major tech companies like Microsoft, Google, and Samsung are now rolling out AI features to their users, and serving millions of those users from the cloud could become staggeringly expensive. The answer? Running models directly on devices. This approach offers several benefits: reduced latency (no network round-trip), improved privacy (no data leaves the device), and lower computational costs. Local AI models matter not only for laptops and smartphones but also for autonomous robots, smart home devices, and other edge computing applications.
Currently, two notable models are specifically optimized for on-device operations:
- Google's Gemini Nano: Announced in December 2023, this model comes in two variants with 1.8B and 3.25B parameters. It will be integrated into the Android operating system via the AI Edge SDK, but it is closed-source and unlikely to appear on platforms like HuggingFace.
- Microsoft's Phi-3: Released in April 2024, this 3.8B-parameter model comes in two context-length variants: 4K and 128K tokens. It is optimized for the ONNX Runtime, targeting NVIDIA GPUs but also able to run on a CPU. Importantly, Phi-3 is open-source, and its weights are available for download.
As of this writing, Google's Gemini Nano is in an "early access preview" phase, while Microsoft's Phi-3 is accessible on HuggingFace. For testing purposes, I will utilize an 8B Llama-3 model, Meta's latest offering from 2024.
The video titled "Phi-3: Microsoft's TINIEST Model Beats Llama 3 and Mixtral! Super POWERFUL!" provides an overview of Phi-3's capabilities and performance comparisons.
Chapter 2: Methodology
In this section, I will evaluate both the 3.8B Phi-3 and the 8B Llama-3 models using a series of prompts with increasing difficulty levels, ranging from simple inquiries to more complex tasks. The prompts include:
- Basic question answering
- Text summarization and responding to messages
- Utilizing external tools for queries
To conduct these tests, I will leverage the open-source LlamaCpp library alongside Microsoft's ONNX GenAI library. Both models will be run on my desktop PC and a Raspberry Pi to compare their performance and system requirements.
Section 2.1: Setting Up the Raspberry Pi
The focus of this article is to evaluate model performance on edge devices, particularly the Raspberry Pi:
- Raspberry Pi 5: An affordable, compact single-board computer running a 64-bit Linux operating system. It is ideal for robotics and smart home applications. However, the question remains: how well can it handle small language models?
The Raspberry Pi operates on a Debian-based OS, which is suitable for basic applications, but installing the latest libraries can be cumbersome. I encountered challenges when attempting to install the ONNX GenAI runtime on the Raspberry Pi OS, leading me to opt for the more software-friendly Ubuntu OS instead.
Section 2.2: Using LlamaCpp
Both Phi-3 and Llama-3 can be run with the LlamaCpp-Python library, which is lightweight, works across architectures, and installs easily on the Raspberry Pi.
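As a minimal sketch, loading a quantized Phi-3 and asking it a question takes only a few lines. The GGUF file name below is a placeholder for whichever quantized build you download from HuggingFace, and the thread count assumes the Raspberry Pi 5's four cores:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# The model path is a placeholder for a quantized GGUF build of Phi-3.
llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",
    n_ctx=4096,    # context window size
    n_threads=4,   # the Raspberry Pi 5 has 4 cores
    verbose=False,
)

# Phi-3 uses the <|user|> ... <|end|> <|assistant|> chat template.
output = llm(
    "<|user|>\nWhat is the distance to the Moon?<|end|>\n<|assistant|>\n",
    max_tokens=256,
    stop=["<|end|>"],
)
print(output["choices"][0]["text"])
```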
Section 2.3: Implementing ONNX Generative AI
Another approach to utilizing the Phi-3 model is through Microsoft's open-source GenAI ONNX library. However, I faced difficulties with installation on the Raspberry Pi due to package compatibility issues, necessitating a build from source.
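Once installed, the generation loop is more verbose than LlamaCpp's. The sketch below follows the token-by-token pattern from Microsoft's 2024 examples; the exact API has shifted between library versions, and the model directory name is a placeholder for the ONNX export of Phi-3 from HuggingFace:

```python
# pip install onnxruntime-genai  (on the Raspberry Pi I had to build from source)
import onnxruntime_genai as og

model = og.Model("phi-3-mini-4k-instruct-onnx")  # placeholder directory
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(
    "<|user|>\nWhat is the distance to the Moon?<|end|>\n<|assistant|>\n"
)

# Generate one token at a time until the model signals completion.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```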
Chapter 3: Running Inference
Now that the setup is complete, let's look at how to run inference with both models. The Python helpers below handle model loading and text generation for both the LlamaCpp and ONNX backends.
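Here is a minimal sketch of such helpers, using the LlamaCpp backend from Section 2.2 (the ONNX variant wraps the generation loop from Section 2.3 in the same way); the function names and the tokens-per-second reporting are my own additions:

```python
import time
from llama_cpp import Llama

def load_model(model_path: str, n_ctx: int = 4096) -> Llama:
    """Load a quantized GGUF model (the path is supplied by the caller)."""
    return Llama(model_path=model_path, n_ctx=n_ctx, verbose=False)

def generate(llm: Llama, prompt: str, max_tokens: int = 256) -> str:
    """Run inference and report generation speed alongside the answer."""
    start = time.monotonic()
    out = llm(prompt, max_tokens=max_tokens, stop=["<|end|>", "<|eot_id|>"])
    elapsed = time.monotonic() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.1f} tok/s)")
    return out["choices"][0]["text"]
```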
Section 3.1: Basic Prompt Testing
Initially, I will pose simple questions to both models to gauge their response capabilities.
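Each model expects its own chat template, so the same question has to be wrapped differently. A sketch, using the templates from the models' HuggingFace model cards and the generate helper above:

```python
question = "What is the distance between the Earth and the Moon?"

# Phi-3 instruct template
phi3_prompt = f"<|user|>\n{question}<|end|>\n<|assistant|>\n"

# Llama-3 instruct template
llama3_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

print(generate(phi3_model, phi3_prompt))     # models loaded via load_model()
print(generate(llama3_model, llama3_prompt))
```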
Section 3.2: Message Response Evaluation
Next, I will examine how each model handles a realistic scenario by responding to a spam message. This test will help assess their ability to generate polite responses based on context.
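For instance, the message can be pasted into the prompt together with an instruction describing the desired reply. The spam text below is my own placeholder, shown with the Phi-3 template:

```python
message = (
    "Congratulations! You have been selected for a free cruise. "
    "Reply YES to claim your prize."
)  # placeholder spam text

prompt = (
    "<|user|>\nI received the following message:\n"
    f'"{message}"\n'
    "Write a short, polite reply declining the offer.<|end|>\n<|assistant|>\n"
)
print(generate(phi3_model, prompt))
```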
Section 3.3: Using Tools for Complex Queries
Lastly, I will explore the models' abilities to utilize tools for more intricate requests. This requires strict syntax adherence, which may pose challenges for smaller models.
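One lightweight way to test this is to describe a single tool in the prompt, ask the model to answer only with JSON, and check whether the output parses. The tool schema here is my own invention for illustration, not a standard API:

```python
import json

tool_prompt = (
    "<|user|>\nAnswer ONLY with JSON of the form "
    '{"tool": "<name>", "argument": "<value>"}. '
    'Available tool: "get_weather" (argument: a city name).\n'
    "Question: What is the weather in Berlin right now?<|end|>\n<|assistant|>\n"
)

raw = generate(phi3_model, tool_prompt)
try:
    call = json.loads(raw.strip())
    print("Tool call:", call["tool"], "->", call["argument"])
except (json.JSONDecodeError, KeyError):
    # Smaller models often break the required syntax; count this as a failure.
    print("Model did not return valid JSON:", raw)
```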
Chapter 4: Performance Analysis
Finally, I will compare the performance of both models on low-power edge devices, such as the Raspberry Pi, and a mid-range desktop setup. The results will provide insight into their operational efficiency under varying conditions.
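A simple way to quantify this is to average the generation speed over the test prompts. The snippet below is a sketch of that measurement (the model path and prompts are placeholders), reusing the tokens-per-second idea from the generate helper:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", verbose=False)  # placeholder

prompts = [  # the test prompts from Chapter 3
    "<|user|>\nWhat is the distance to the Moon?<|end|>\n<|assistant|>\n",
    "<|user|>\nWrite a polite reply declining a spam offer.<|end|>\n<|assistant|>\n",
]

speeds = []
for p in prompts:
    start = time.monotonic()
    out = llm(p, max_tokens=128, stop=["<|end|>"])
    speeds.append(out["usage"]["completion_tokens"] / (time.monotonic() - start))

print(f"Average generation speed: {sum(speeds) / len(speeds):.1f} tokens/s")
```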
Conclusion
This article has evaluated the Phi-3 and Llama-3 models, revealing their strengths and weaknesses in processing language tasks on edge devices. The findings underscore the potential for localized AI applications while highlighting the need for improvements in handling complex queries.