
In my previous article, I introduced the purpose of my studies: to understand how to leverage LLMs for creating cost-effective, production-ready solutions. Although the initial solution was functional, it was not cost-effective, since it relied on expensive REST requests to OpenAI APIs. To address this, I started to explore open-source models that are small enough to be easily deployed and adopted in a real-world application.
In the following paragraphs, I will describe my journey: which models I evaluated, how quantization made them affordable to run, and how I obtained structured outputs from them.
When choosing a model, one of the most important factors to consider is the number of parameters it has. Models trained with a higher number of parameters, often in the billions, are generally more powerful and can perform a wider range of tasks, but they also require more computational resources, which can lead to higher costs and slower execution times.
For this project, the main goal was to create a cost-effective solution, so I decided to focus on smaller models with a lower number of parameters. These models are easier to deploy and can be run on less expensive hardware, making them a more practical choice for real-world applications where budget is a concern.
The following table shows that, as the number of parameters grows, the execution time and the required VRAM and disk space grow as well.
| LLM Model | Parameters (Billions) | VRAM Space (Inference, Est.) | Disk Space (Est.) | Execution Time (Latency) |
|---|---|---|---|---|
| Llama 3.2 Instruct 1B | 1 B | ~ 2.3 GB | ~ 2.3 GB | Very Low. Ideal for lightweight inference on basic GPUs or CPUs. |
| Qwen 2.5 Instruct 7B | 7 B | ~ 5 GB (4-bit) / ~ 15.2 GB (BF16) | ~ 7-13 GB | Low/Medium. Similar efficiency to Llama 8B, runnable on a single >= 16 GB GPU. |
| Llama 3.1 Instruct 8B | 8 B | ~ 4.9 GB (Q4_K_M) / ~ 16 GB (FP16) | ~ 5-16 GB | Low/Medium. Highly efficient, runnable on a single mid-range GPU. |
| Llama 3.1 Instruct 70B | 70 B | ~ 40 GB (Q4_K_M) / ~ 141 GB (FP16) | ~ 40-141 GB | Medium/High. Requires high-end GPUs or multiple GPUs. |
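The VRAM estimates above follow a simple rule of thumb: parameters × bytes per parameter. A minimal sketch (real footprints are somewhat higher because of activation buffers, the KV cache, and framework overhead):

```python
# Rough VRAM estimate for inference: parameters x bytes per parameter.
# Activations, KV cache, and framework overhead are ignored here.
def estimated_vram_gb(params_billions: float, bits_per_param: int) -> float:
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return round(bytes_total / 1024**3, 1)

# 7B model in BF16 (16 bits) vs. 4-bit quantization
print(estimated_vram_gb(7, 16))  # 13.0 (raw weights only)
print(estimated_vram_gb(7, 4))   # 3.3
```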
For research purposes, and to push the application's limits, I decided to use Llama 3.2 Instruct 1B and Qwen 2.5 Instruct 7B.
The table mentions terms like Q4_K_M, BF16, FP16, and 4-bit, which all refer to quantization formats. Quantization is a method for shrinking the memory and processing power required by LLMs. It works by converting the model's parameters from their standard 32-bit (or 16-bit) floating-point format into lower-precision types, like 8-bit or even 4-bit integers. These parameters, also known as weights, are the numerical values that the model learns during training and that define its behavior.
A standard, unquantized model delivers maximum accuracy but demands substantial memory and computational power, as shown in the table. By applying 4-bit quantization to a model like Qwen 7B or Llama 8B, high-precision values are mapped to a more compact, lower-precision range. Although this can slightly reduce accuracy, the trade-off is a significantly smaller memory footprint and faster performance, which is great when deploying models on resource-constrained devices or trying to contain costs. In line with the goal of this research, I applied 4-bit quantization to the Qwen 7B model, which reduced its memory footprint from roughly 15 GB to around 5 GB.
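The mapping can be illustrated with a toy absmax scheme; this is a deliberate simplification (real 4-bit formats such as NF4 in bitsandbytes are more elaborate), but the core idea of rescaling to a lower-precision range is the same:

```python
# Toy absmax quantization: map float weights to 8-bit integers and back.
def quantize_absmax(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

weights = [0.12, -0.5, 0.33, 0.91]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
# restored is close to the original weights, with a small rounding error
```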
Before getting our hands dirty, it's important to have a full context of what you will read in the next lines of code. For this reason, I need to introduce to you two new core concepts, two pillars behind LLMs: transformers and tokenizers.
At the heart of LLMs there are transformers, a type of neural network architecture that excels at handling sequential data, such as text. In their original form, transformers are composed of two main parts: an encoder and a decoder. The encoder processes the input text and creates a numerical representation of it, while the decoder uses this representation to generate the output text. Most modern LLMs, including Llama and Qwen, use a decoder-only variant of this architecture.
Before a transformer can process text, the text must be converted into a format that the model can understand: this is where tokenization comes in. Tokenization is the process of breaking down a piece of text into smaller units, called tokens, that can be words, subwords, or even individual characters.
A tokenizer is a tool that is responsible for performing this tokenization. The tokenizer has a vocabulary of all the tokens that the model knows, and it maps each token to a unique numerical ID. This sequence of IDs is then what is fed into the transformer model. You can find more information about tokenization on HuggingFace.
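As a toy illustration (the vocabulary and IDs below are made up, not the real Qwen or Llama vocabularies), a greedy longest-match tokenizer looks like this:

```python
# Toy tokenizer: a fixed subword vocabulary mapped to IDs, split with greedy
# longest-match. Real tokenizers (BPE, SentencePiece) learn the vocabulary
# from data, but the token -> ID mapping works the same way.
VOCAB = {"sal": 0, "mon": 1, "fish": 2, " ": 3, "s": 4, "a": 5}  # made-up IDs

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # try the longest vocabulary entry that matches at position i
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("salmon fish"))  # [0, 1, 3, 2]
```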
Now that we have all the pieces, we can finally compose our puzzle. I've created a Google Colab notebook: you can easily run the following steps from there and better understand how everything works.
First, we need to create a quantization configuration. For this, I used the BitsAndBytesConfig class from the transformers library (backed by bitsandbytes), but there are plenty of alternatives with the same purpose, depending on the quantization method adopted and the supported processing unit (CPU, GPU).
With this configuration, we can then initialize a quantized model and the tokenizer obtained from the base model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

BASE_MODEL = "qwen/Qwen2.5-7B-Instruct"

# Create the quantization configuration
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    device_map="auto",  # this allows Transformers to use CUDA if available
    quantization_config=quantization_config,
)

# Load the model's tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
```
To force the LLM to return a structured output, I used Outlines, a library that guides the model's output to conform to a specific structure, such as a JSON schema or a Pydantic model.
To achieve this goal, first define a Pydantic model:
```python
from pydantic import BaseModel

class Food(BaseModel):
    protein: float
    carbohydrates: float
    fats: float
    calories: float
    sugar: float
    fiber: float
```
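Pydantic does more here than declare field names: Outlines derives a JSON schema from the model and constrains generation to it. A quick look at that schema on its own (assuming Pydantic v2):

```python
from pydantic import BaseModel

class Food(BaseModel):
    protein: float
    carbohydrates: float
    fats: float
    calories: float
    sugar: float
    fiber: float

# The JSON schema lists every field as required, with a numeric type
schema = Food.model_json_schema()
print(sorted(schema["required"]))
# ['calories', 'carbohydrates', 'fats', 'fiber', 'protein', 'sugar']
print(schema["properties"]["protein"]["type"])  # number
```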
Then, create a generator that will produce structured output once the request is fulfilled by the LLM:
```python
from outlines import from_transformers, Generator

generator = Generator(
    from_transformers(quantized_model, tokenizer),
    Food,
)
```
Finally, we can use the generator to get a structured output from a prompt:
```python
prompt = """
Get the nutritional data of the following food ingredient: **salmon fish**.
Use the following context: ...
"""

result = generator(
    prompt,
    max_new_tokens=200,
)
```
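Depending on the Outlines version, `result` may be returned as a `Food` instance or as a raw JSON string. In the latter case it can be validated back into the Pydantic model; a sketch with a hard-coded string standing in for a real generator call:

```python
from pydantic import BaseModel

class Food(BaseModel):
    protein: float
    carbohydrates: float
    fats: float
    calories: float
    sugar: float
    fiber: float

# Stand-in for the generator output; a real run would produce this string
result = (
    '{"protein": 22.56, "carbohydrates": 0.0, "fats": 5.57,'
    ' "calories": 140, "sugar": 0.0, "fiber": 0.0}'
)
food = Food.model_validate_json(result)
print(food.protein, food.calories)  # 22.56 140.0
```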
I've deployed a Qwen service on Modal for testing, and a proper response can be obtained with the following cURL request:
```bash
curl --location 'https://ibbus93--nutritional-rag-service-qwen-nutritionalragserv-7ae00e.modal.run/' \
--header 'Content-Type: application/json' \
--data '{
    "description": "Salmon fish"
}'
```
Which will lead to this result:
```json
{
    "protein": 22.56,
    "carbohydrates": 0.0,
    "fats": 5.57,
    "calories": 140,
    "sugar": 0.0,
    "fiber": 0.0
}
```
I've deployed a Llama service as well and it can be tested like below:
```bash
curl --location 'https://ibbus93--nutritional-rag-service-llama-nutritionalragser-fc918b.modal.run/' \
--header 'Content-Type: application/json' \
--data '{
    "description": "Salmon fish"
}'
```
Even though the two services received the same input and used the same database, the Llama service returned a different response:
```json
{
    "protein": 23.19,
    "carbohydrates": 0.0,
    "fats": 12.95,
    "calories": 209,
    "sugars": 0.0,
    "fibre": 0.0
}
```
Let's now review the two models, using the following table as a comparison.
| Model | Execution time (5 runs) | VRAM Memory footprint | Accuracy |
|---|---|---|---|
| Llama 3.2 Instruct 1B | ~ 2.85 seconds | ~ 2.4 GB | Questionable |
| Qwen 2.5 Instruct 7B | ~ 1.72 seconds | ~ 5 GB | Pretty much accurate |
As expected, Qwen has a larger memory footprint; perhaps surprisingly, it also has a faster response time, which is a key factor for production applications.
Regarding the accuracy, there is a noticeable discrepancy between the two models, both in the JSON schema and in the data they return. As for the schema, models with fewer parameters are generally less reliable at following format instructions; in this case, the Llama model sometimes returned a different format between runs.
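One pragmatic mitigation is to normalise alternate key names before validating the response; a minimal sketch, where the alias list is an assumption based on the two responses above:

```python
# Map alternate key spellings returned by smaller models to the canonical schema
ALIASES = {"sugars": "sugar", "fibre": "fiber", "carbs": "carbohydrates"}

def normalize_keys(payload: dict) -> dict:
    return {ALIASES.get(key, key): value for key, value in payload.items()}

llama_response = {"protein": 23.19, "carbohydrates": 0.0, "fats": 12.95,
                  "calories": 209, "sugars": 0.0, "fibre": 0.0}
print(normalize_keys(llama_response))
# {'protein': 23.19, 'carbohydrates': 0.0, 'fats': 12.95,
#  'calories': 209, 'sugar': 0.0, 'fiber': 0.0}
```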
Regarding the data returned, the prompt used by both models is the following:

```
Please use only the following context to answer the question.

**Precedence Rule: Always choose the nutritional data for RAW foods if available.**

Get the nutritional data of the following food ingredient: **Salmon fish**.

CONTEXT OPTIONS:
product name: FISH,SALMON,COHO (SILVER),RAW (ALASKA NATIVE), fat: 5.57, carbohydrates: 0.0, proteins: 22.56, calories: 140, sugars: 0.0, fiber: 0.0
product name: FISH,SALMON,RED,(SOCKEYE),KIPPERED (ALASKA NATIVE), fat: 4.75, carbohydrates: 0.0, proteins: 24.5, calories: 141, sugars: 0.0, fiber: 0.0
product name: FISH,SALMON,KING,W/ SKN,KIPPERED,(ALASKA NATIVE), fat: 12.95, carbohydrates: 0.0, proteins: 23.19, calories: 209, sugars: 0.0, fiber: 0.0
```
The data extracted from the database (hence from the adopted dataset) provides three different samples for salmon. Notice that the first one is RAW: Qwen uses it, while Llama usually ignores it.
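Rather than relying on the model to honour the precedence rule, it can also be enforced deterministically when building the context; a minimal sketch over the rows above:

```python
# Prefer RAW entries when selecting among retrieved context rows
rows = [
    {"name": "FISH,SALMON,COHO (SILVER),RAW (ALASKA NATIVE)", "calories": 140},
    {"name": "FISH,SALMON,RED,(SOCKEYE),KIPPERED (ALASKA NATIVE)", "calories": 141},
    {"name": "FISH,SALMON,KING,W/ SKN,KIPPERED,(ALASKA NATIVE)", "calories": 209},
]

def pick_preferred(rows: list[dict]) -> dict:
    raw_rows = [r for r in rows if "RAW" in r["name"]]
    return raw_rows[0] if raw_rows else rows[0]

print(pick_preferred(rows)["calories"])  # 140
```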
In conclusion, this research has demonstrated that it is possible to build cost-effective, production-ready LLM solutions by using open-source models and quantization.
While smaller models might not always match the accuracy of their larger counterparts, they offer a significant advantage in terms of resource consumption and deployment flexibility.
The choice of the right model will always depend on the specific needs of the application, but with the right approach, it is possible to find a balance between performance and cost.
As a future challenge, it would be interesting to explore other quantization techniques and to fine-tune a smaller model on a specific domain to see if it is possible to improve its accuracy while keeping the resource consumption low.