[{"data":1,"prerenderedAt":1356},["ShallowReactive",2],{"blog-post-/en/articles/open-source-rag-quantization":3},{"id":4,"title":5,"alt":6,"body":7,"date":1341,"description":1342,"extension":1217,"image":1343,"locale":1344,"meta":1345,"navigation":372,"path":1346,"seo":1347,"stem":1348,"tags":1349,"__hash__":1355},"articles/en/articles/3.open-source-rag-quantization.md","Adopting open-source models for cost-effective LLM solutions","Nutritional RAG LLM using open source models",{"type":8,"value":9,"toc":1329},"minimark",[10,15,25,28,44,48,51,54,57,164,174,178,224,244,248,257,261,268,272,294,298,306,327,537,541,549,552,647,650,717,720,786,799,863,866,977,981,984,1034,1037,1137,1141,1144,1204,1207,1210,1213,1277,1284,1288,1297,1300,1304,1325],[11,12,14],"h3",{"id":13},"context","Context",[16,17,18,19,24],"p",{},"In my ",[20,21,23],"a",{"href":22},"./a-nutritional-llm-assistant-using-rag","previous article",", I introduced the purpose of my studies: to understand how to leverage LLMs for creating cost-effective, production-ready solutions. Although the initial solution was functional, it was not cost-effective, since it relied on expensive REST requests to OpenAI APIs. To address this, I started to explore open-source models that are small enough to be easily deployed and adopted in a real-world application.",[16,26,27],{},"In the following paragraphs, I will describe my journey, including:",[29,30,31,35,38,41],"ul",{},[32,33,34],"li",{},"choosing the right model that fits my needs",[32,36,37],{},"applying quantization to optimize the model",[32,39,40],{},"forcing the output to be in a specific JSON format",[32,42,43],{},"a comparison between a quantized 7B-parameter model and a non-quantized 1B-parameter model",[11,45,47],{"id":46},"model-choice","Model choice",[16,49,50],{},"When choosing a model, one of the most important factors to consider is the number of parameters it has. 
Models trained with a higher number of parameters, often in the billions, are generally more powerful and can perform a wider range of tasks, but they also require more computational resources, which can lead to higher costs and slower execution times.",[16,52,53],{},"For this project, the main goal was to create a cost-effective solution, so I decided to focus on smaller models with a lower number of parameters. These models are easier to deploy and can be run on less expensive hardware, making them a more practical choice for real-world applications where budget is a concern.",[16,55,56],{},"The following table shows that, as the number of parameters grows, the execution time and the required RAM and disk space grow as well.",[58,59,60,93],"table",{},[61,62,63],"thead",{},[64,65,66,73,78,83,88],"tr",{},[67,68,69],"th",{},[70,71,72],"strong",{},"LLM Model",[67,74,75],{},[70,76,77],{},"Parameters (Billions)",[67,79,80],{},[70,81,82],{},"VRAM Space (Inference, Est.)",[67,84,85],{},[70,86,87],{},"Disk Space (Est.)",[67,89,90],{},[70,91,92],{},"Execution Time (Latency)",[94,95,96,113,130,147],"tbody",{},[64,97,98,102,105,108,110],{},[99,100,101],"td",{},"Llama 3.1 Instruct 1B",[99,103,104],{},"1 B",[99,106,107],{},"~ 2.3 GB",[99,109,107],{},[99,111,112],{},"Very Low. Ideal for lightweight inference on basic GPUs or CPUs.",[64,114,115,118,121,124,127],{},[99,116,117],{},"Qwen 2.5 Instruct 7B",[99,119,120],{},"7 B",[99,122,123],{},"~ 5 GB (4-bit) / ~ 15.2 GB (BF16)",[99,125,126],{},"~ 7-13 GB",[99,128,129],{},"Low/Medium. Similar efficiency to Llama 8B, runnable on a single >= 16 GB GPU.",[64,131,132,135,138,141,144],{},[99,133,134],{},"Llama 3.1 Instruct 8B",[99,136,137],{},"8 B",[99,139,140],{},"4.9 GB (Q4_K_M) / 16 GB (FP16)",[99,142,143],{},"~ 5-16 GB",[99,145,146],{},"Low/Medium. 
Highly efficient, runnable on a single mid-range GPU.",[64,148,149,152,155,158,161],{},[99,150,151],{},"Llama 3.1 Instruct 70B",[99,153,154],{},"70 B",[99,156,157],{},"~ 40 GB (Q4_K_M) / 141 GB (FP16)",[99,159,160],{},"~ 40-141 GB",[99,162,163],{},"Medium/High. Requires high-end GPUs or multiple GPUs.",[16,165,166,167,170,171,173],{},"For research purposes, and to push the application's limits, I decided to use ",[168,169,101],"code",{}," and ",[168,172,117],{},".",[11,175,177],{"id":176},"model-quantization","Model quantization",[16,179,180,181,184,185,184,188,184,191,184,194,197,198,201,202,207,208,211,212,215,216,219,220,223],{},"The table mentions terms like ",[168,182,183],{},"Q3_K_L",", ",[168,186,187],{},"Q4_K_M",[168,189,190],{},"BF16",[168,192,193],{},"FP16",[168,195,196],{},"4 bits",", and ",[168,199,200],{},"8 bits",", which all refer to quantization techniques. ",[20,203,206],{"href":204,"target":205},"https://huggingface.co/docs/transformers/main/quantization/overview","_blank","Quantization"," is a method for reducing the memory and processing power required by LLMs. It works by converting the model's parameters from their standard ",[168,209,210],{},"32-bit"," floating-point format into lower-precision types, like ",[168,213,214],{},"8-bit"," or even ",[168,217,218],{},"4-bit"," integers. These parameters, also known as ",[70,221,222],{},"weights",", are the numerical values that the model learns during training and that define its behavior.",[16,225,226,227,229,230,233,234,237,238,240,241,243],{},"A standard, unquantized model delivers maximum accuracy but demands substantial memory and computational power, as shown in the table. By applying a ",[168,228,218],{}," quantization to a model like ",[168,231,232],{},"Qwen 7B"," or ",[168,235,236],{},"Llama 8B",", high-precision values are mapped to a more compact, lower-precision range. 
Although this can slightly reduce accuracy, the trade-off is a significantly smaller memory footprint and faster performance, which is valuable when deploying models on resource-constrained devices or trying to contain costs. In line with the research goal, I applied ",[168,239,218],{}," quantization to the ",[168,242,232],{}," model, which led to a memory footprint of around 5 GB instead of the original 15 GB.",[11,245,247],{"id":246},"transformers-and-tokenization","Transformers and tokenization",[16,249,250,251,170,254,173],{},"Before getting our hands dirty, it's important to have full context for what you will read in the next lines of code. For this reason, I need to introduce two core concepts, the pillars behind LLMs: ",[70,252,253],{},"transformers",[70,255,256],{},"tokenizers",[258,259,260],"h4",{"id":253},"Transformers",[16,262,263,264,267],{},"At the heart of LLMs are ",[20,265,253],{"href":266,"target":205},"https://developers.google.com/machine-learning/crash-course/llm/transformers",", a type of neural network architecture that excels at handling sequential data, such as text. Transformers are composed of two main parts: an encoder and a decoder. The encoder processes the input text and creates a numerical representation of it, while the decoder uses this representation to generate the output text.",[258,269,271],{"id":270},"tokenizer","Tokenizer",[16,273,274,275,278,279,282,283,286,287,289,290,173],{},"Before a transformer can process text, the text must be converted into a format that the model can understand: this is where ",[70,276,277],{},"tokenization"," comes in. Tokenization is the process of breaking down a piece of text into smaller units, called ",[70,280,281],{},"tokens",", that can be words, subwords, or even individual characters.",[284,285],"br",{},"\nA ",[70,288,270],{}," is the tool responsible for performing this tokenization. 
The tokenizer has a vocabulary of all the tokens that the model knows, and it maps each token to a unique numerical ID. This sequence of IDs is then what is fed into the transformer model. You can find more information about tokenization ",[20,291,293],{"href":292,"target":205},"https://huggingface.co/learn/llm-course/chapter2/4","on HuggingFace",[11,295,297],{"id":296},"applying-quantization-to-qwen","Applying quantization to Qwen",[16,299,300,301,305],{},"Now that we have all the pieces, we can finally compose our puzzle. I've created a dedicated ",[20,302,304],{"href":303,"target":205},"https://colab.research.google.com/drive/1SPlXn66dBq8jSx3RdqEPwQQw9NttRzN3?usp=sharing","Google Colab Notebook","; you can run the following steps from there and better understand how everything works.",[16,307,308,309,312,313,316,317,319,320,323,324,326],{},"First, we need to create a quantization configuration. To do this, I used a configuration class called ",[70,310,311],{},"BitsAndBytesConfig",", but there are ",[20,314,315],{"href":204,"target":205},"tons"," available for the same purpose, depending on the method adopted and the supported processing unit (CPU, GPU).",[284,318],{},"\nWith this configuration, we can then initialize ",[70,321,322],{},"a quantized model"," and the ",[70,325,270],{}," obtained from the base model:",[328,329,334],"pre",{"className":330,"code":331,"language":332,"meta":333,"style":333},"language-python shiki shiki-themes material-theme-lighter material-theme material-theme-palenight","from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nBASE_MODEL = \"qwen/Qwen2.5-7B-Instruct\"\n\n# Create the configuration\nquantization_config = BitsAndBytesConfig(load_in_4bit=True)\n\n# Load the model and apply quantization \nquantized_model = AutoModelForCausalLM.from_pretrained(\n  BASE_MODEL, \n  device_map=\"auto\", # This allows using CUDA if available\n  quantization_config=quantization_config\n)\n\n# Load the model 
tokenizer\ntokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)\n","python","",[168,335,336,367,374,393,398,405,427,432,438,456,467,488,499,505,510,516],{"__ignoreMap":333},[337,338,341,345,349,352,355,359,362,364],"span",{"class":339,"line":340},"line",1,[337,342,344],{"class":343},"s7zQu","from",[337,346,348],{"class":347},"sTEyZ"," transformers ",[337,350,351],{"class":343},"import",[337,353,354],{"class":347}," AutoTokenizer",[337,356,358],{"class":357},"sMK4o",",",[337,360,361],{"class":347}," AutoModelForCausalLM",[337,363,358],{"class":357},[337,365,366],{"class":347}," BitsAndBytesConfig\n",[337,368,370],{"class":339,"line":369},2,[337,371,373],{"emptyLinePlaceholder":372},true,"\n",[337,375,377,380,383,386,390],{"class":339,"line":376},3,[337,378,379],{"class":347},"BASE_MODEL ",[337,381,382],{"class":357},"=",[337,384,385],{"class":357}," \"",[337,387,389],{"class":388},"sfazB","qwen/Qwen2.5-7B-Instruct",[337,391,392],{"class":357},"\"\n",[337,394,396],{"class":339,"line":395},4,[337,397,373],{"emptyLinePlaceholder":372},[337,399,401],{"class":339,"line":400},5,[337,402,404],{"class":403},"sHwdD","# Create the configuration\n",[337,406,408,411,413,417,420,424],{"class":339,"line":407},6,[337,409,410],{"class":347},"quantization_config ",[337,412,382],{"class":357},[337,414,416],{"class":415},"s2Zo4"," BitsAndBytesConfig",[337,418,419],{"class":357},"(",[337,421,423],{"class":422},"sHdIc","load_in_4bit",[337,425,426],{"class":357},"=True)\n",[337,428,430],{"class":339,"line":429},7,[337,431,373],{"emptyLinePlaceholder":372},[337,433,435],{"class":339,"line":434},8,[337,436,437],{"class":403},"# Load the model and apply quantization \n",[337,439,441,444,446,448,450,453],{"class":339,"line":440},9,[337,442,443],{"class":347},"quantized_model 
",[337,445,382],{"class":357},[337,447,361],{"class":347},[337,449,173],{"class":357},[337,451,452],{"class":415},"from_pretrained",[337,454,455],{"class":357},"(\n",[337,457,459,462,464],{"class":339,"line":458},10,[337,460,461],{"class":415},"  BASE_MODEL",[337,463,358],{"class":357},[337,465,466],{"class":415}," \n",[337,468,470,473,475,478,481,483,485],{"class":339,"line":469},11,[337,471,472],{"class":422},"  device_map",[337,474,382],{"class":357},[337,476,477],{"class":357},"\"",[337,479,480],{"class":388},"auto",[337,482,477],{"class":357},[337,484,358],{"class":357},[337,486,487],{"class":403}," # This allows using CUDA if available\n",[337,489,491,494,496],{"class":339,"line":490},12,[337,492,493],{"class":422},"  quantization_config",[337,495,382],{"class":357},[337,497,498],{"class":415},"quantization_config\n",[337,500,502],{"class":339,"line":501},13,[337,503,504],{"class":357},")\n",[337,506,508],{"class":339,"line":507},14,[337,509,373],{"emptyLinePlaceholder":372},[337,511,513],{"class":339,"line":512},15,[337,514,515],{"class":403},"# Load the model tokenizer\n",[337,517,519,522,524,526,528,530,532,535],{"class":339,"line":518},16,[337,520,521],{"class":347},"tokenizer ",[337,523,382],{"class":357},[337,525,354],{"class":347},[337,527,173],{"class":357},[337,529,452],{"class":415},[337,531,419],{"class":357},[337,533,534],{"class":415},"BASE_MODEL",[337,536,504],{"class":357},[11,538,540],{"id":539},"outlines-for-structured-outputs","Outlines for structured outputs",[16,542,543,544,548],{},"To force the LLM to return a structured output, I used ",[20,545,547],{"href":546,"target":205},"https://github.com/dottxt-ai/outlines","Outlines",", a library that guides the model's output to conform to a specific structure, such as a JSON schema or a Pydantic model.",[16,550,551],{},"To achieve this goal, first define a Pydantic model:",[328,553,555],{"className":330,"code":554,"language":332,"meta":333,"style":333},"from pydantic import BaseModel\n\nclass 
Food(BaseModel):\n  protein: float\n  carbohydrates: float\n  fats: float\n  calories: float\n  sugar: float\n  fiber: float\n",[168,556,557,569,573,591,602,611,620,629,638],{"__ignoreMap":333},[337,558,559,561,564,566],{"class":339,"line":340},[337,560,344],{"class":343},[337,562,563],{"class":347}," pydantic ",[337,565,351],{"class":343},[337,567,568],{"class":347}," BaseModel\n",[337,570,571],{"class":339,"line":369},[337,572,373],{"emptyLinePlaceholder":372},[337,574,575,579,583,585,588],{"class":339,"line":376},[337,576,578],{"class":577},"spNyl","class",[337,580,582],{"class":581},"sBMFI"," Food",[337,584,419],{"class":357},[337,586,587],{"class":581},"BaseModel",[337,589,590],{"class":357},"):\n",[337,592,593,596,599],{"class":339,"line":395},[337,594,595],{"class":347},"  protein",[337,597,598],{"class":357},":",[337,600,601],{"class":581}," float\n",[337,603,604,607,609],{"class":339,"line":400},[337,605,606],{"class":347},"  carbohydrates",[337,608,598],{"class":357},[337,610,601],{"class":581},[337,612,613,616,618],{"class":339,"line":407},[337,614,615],{"class":347},"  fats",[337,617,598],{"class":357},[337,619,601],{"class":581},[337,621,622,625,627],{"class":339,"line":429},[337,623,624],{"class":347},"  calories",[337,626,598],{"class":357},[337,628,601],{"class":581},[337,630,631,634,636],{"class":339,"line":434},[337,632,633],{"class":347},"  sugar",[337,635,598],{"class":357},[337,637,601],{"class":581},[337,639,640,643,645],{"class":339,"line":440},[337,641,642],{"class":347},"  fiber",[337,644,598],{"class":357},[337,646,601],{"class":581},[16,648,649],{},"Then, create a generator that will output a structured output when the request is fulfilled by the LLM:",[328,651,653],{"className":330,"code":652,"language":332,"meta":333,"style":333},"from outlines import from_transformers, Generator\n\ngenerator = Generator(\n  from_transformers(quantized_model, tokenizer), \n  
Food\n)\n",[168,654,655,672,676,688,708,713],{"__ignoreMap":333},[337,656,657,659,662,664,667,669],{"class":339,"line":340},[337,658,344],{"class":343},[337,660,661],{"class":347}," outlines ",[337,663,351],{"class":343},[337,665,666],{"class":347}," from_transformers",[337,668,358],{"class":357},[337,670,671],{"class":347}," Generator\n",[337,673,674],{"class":339,"line":369},[337,675,373],{"emptyLinePlaceholder":372},[337,677,678,681,683,686],{"class":339,"line":376},[337,679,680],{"class":347},"generator ",[337,682,382],{"class":357},[337,684,685],{"class":415}," Generator",[337,687,455],{"class":357},[337,689,690,693,695,698,700,703,706],{"class":339,"line":395},[337,691,692],{"class":415},"  from_transformers",[337,694,419],{"class":357},[337,696,697],{"class":415},"quantized_model",[337,699,358],{"class":357},[337,701,702],{"class":415}," tokenizer",[337,704,705],{"class":357},"),",[337,707,466],{"class":415},[337,709,710],{"class":339,"line":400},[337,711,712],{"class":415},"  Food\n",[337,714,715],{"class":339,"line":407},[337,716,504],{"class":357},[16,718,719],{},"Finally, we can use the generator to get a structured output from a prompt:",[328,721,723],{"className":330,"code":722,"language":332,"meta":333,"style":333},"prompt = \"\"\"\n  Get the nutritional data of the following food ingredient: **salmon fish**.\n  Use the following context: ...\n\"\"\"\nresult = generator(\n  prompt, \n  max_new_tokens=200\n)\n",[168,724,725,735,740,745,750,762,771,782],{"__ignoreMap":333},[337,726,727,730,732],{"class":339,"line":340},[337,728,729],{"class":347},"prompt ",[337,731,382],{"class":357},[337,733,734],{"class":357}," \"\"\"\n",[337,736,737],{"class":339,"line":369},[337,738,739],{"class":388},"  Get the nutritional data of the following food ingredient: **salmon fish**.\n",[337,741,742],{"class":339,"line":376},[337,743,744],{"class":388},"  Use the following context: 
...\n",[337,746,747],{"class":339,"line":395},[337,748,749],{"class":357},"\"\"\"\n",[337,751,752,755,757,760],{"class":339,"line":400},[337,753,754],{"class":347},"result ",[337,756,382],{"class":357},[337,758,759],{"class":415}," generator",[337,761,455],{"class":357},[337,763,764,767,769],{"class":339,"line":407},[337,765,766],{"class":415},"  prompt",[337,768,358],{"class":357},[337,770,466],{"class":415},[337,772,773,776,778],{"class":339,"line":429},[337,774,775],{"class":422},"  max_new_tokens",[337,777,382],{"class":357},[337,779,781],{"class":780},"sbssI","200\n",[337,783,784],{"class":339,"line":434},[337,785,504],{"class":357},[16,787,788,789,793,794,798],{},"I've deployed a ",[20,790,792],{"href":791,"target":205},"https://github.com/federicoibba/nutritional-information-rag/blob/main/services/qwen.py","Qwen service"," on ",[20,795,797],{"href":796,"target":205},"https://modal.com","modal"," for testing; a proper response can be obtained with the following cURL:",[328,800,804],{"className":801,"code":802,"language":803,"meta":333,"style":333},"language-bash shiki shiki-themes material-theme-lighter material-theme material-theme-palenight","curl --location 'https://ibbus93--nutritional-rag-service-qwen-nutritionalragserv-7ae00e.modal.run/' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"description\": \"Salmon fish\"\n}'\n","bash",[168,805,806,826,840,850,855],{"__ignoreMap":333},[337,807,808,811,814,817,820,823],{"class":339,"line":340},[337,809,810],{"class":581},"curl",[337,812,813],{"class":388}," --location",[337,815,816],{"class":357}," '",[337,818,819],{"class":388},"https://ibbus93--nutritional-rag-service-qwen-nutritionalragserv-7ae00e.modal.run/",[337,821,822],{"class":357},"'",[337,824,825],{"class":347}," \\\n",[337,827,828,831,833,836,838],{"class":339,"line":369},[337,829,830],{"class":347},"--header ",[337,832,822],{"class":357},[337,834,835],{"class":388},"Content-Type: 
application/json",[337,837,822],{"class":357},[337,839,825],{"class":347},[337,841,842,845,847],{"class":339,"line":376},[337,843,844],{"class":347},"--data ",[337,846,822],{"class":357},[337,848,849],{"class":388},"{\n",[337,851,852],{"class":339,"line":395},[337,853,854],{"class":388},"    \"description\": \"Salmon fish\"\n",[337,856,857,860],{"class":339,"line":400},[337,858,859],{"class":388},"}",[337,861,862],{"class":357},"'\n",[16,864,865],{},"Which will lead to this result:",[328,867,871],{"className":868,"code":869,"language":870,"meta":333,"style":333},"language-json shiki shiki-themes material-theme-lighter material-theme material-theme-palenight","{\n  \"protein\": 22.56,\n  \"carbohydrates\": 0.0,\n  \"fats\": 5.57,\n  \"calories\": 140,\n  \"sugar\": 0.0,\n  \"fiber\": 0.0\n}\n","json",[168,872,873,877,895,911,927,943,958,972],{"__ignoreMap":333},[337,874,875],{"class":339,"line":340},[337,876,849],{"class":357},[337,878,879,882,885,887,889,892],{"class":339,"line":369},[337,880,881],{"class":357},"  \"",[337,883,884],{"class":577},"protein",[337,886,477],{"class":357},[337,888,598],{"class":357},[337,890,891],{"class":780}," 22.56",[337,893,894],{"class":357},",\n",[337,896,897,899,902,904,906,909],{"class":339,"line":376},[337,898,881],{"class":357},[337,900,901],{"class":577},"carbohydrates",[337,903,477],{"class":357},[337,905,598],{"class":357},[337,907,908],{"class":780}," 0.0",[337,910,894],{"class":357},[337,912,913,915,918,920,922,925],{"class":339,"line":395},[337,914,881],{"class":357},[337,916,917],{"class":577},"fats",[337,919,477],{"class":357},[337,921,598],{"class":357},[337,923,924],{"class":780}," 5.57",[337,926,894],{"class":357},[337,928,929,931,934,936,938,941],{"class":339,"line":400},[337,930,881],{"class":357},[337,932,933],{"class":577},"calories",[337,935,477],{"class":357},[337,937,598],{"class":357},[337,939,940],{"class":780}," 
140",[337,942,894],{"class":357},[337,944,945,947,950,952,954,956],{"class":339,"line":407},[337,946,881],{"class":357},[337,948,949],{"class":577},"sugar",[337,951,477],{"class":357},[337,953,598],{"class":357},[337,955,908],{"class":780},[337,957,894],{"class":357},[337,959,960,962,965,967,969],{"class":339,"line":429},[337,961,881],{"class":357},[337,963,964],{"class":577},"fiber",[337,966,477],{"class":357},[337,968,598],{"class":357},[337,970,971],{"class":780}," 0.0\n",[337,973,974],{"class":339,"line":434},[337,975,976],{"class":357},"}\n",[11,978,980],{"id":979},"llama-service","Llama service",[16,982,983],{},"I've deployed a Llama service as well; it can be tested as follows:",[328,985,987],{"className":801,"code":986,"language":803,"meta":333,"style":333},"curl --location 'https://ibbus93--nutritional-rag-service-llama-nutritionalragser-fc918b.modal.run/' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"description\": \"Salmon fish\"\n}'\n",[168,988,989,1004,1016,1024,1028],{"__ignoreMap":333},[337,990,991,993,995,997,1000,1002],{"class":339,"line":340},[337,992,810],{"class":581},[337,994,813],{"class":388},[337,996,816],{"class":357},[337,998,999],{"class":388},"https://ibbus93--nutritional-rag-service-llama-nutritionalragser-fc918b.modal.run/",[337,1001,822],{"class":357},[337,1003,825],{"class":347},[337,1005,1006,1008,1010,1012,1014],{"class":339,"line":369},[337,1007,830],{"class":347},[337,1009,822],{"class":357},[337,1011,835],{"class":388},[337,1013,822],{"class":357},[337,1015,825],{"class":347},[337,1017,1018,1020,1022],{"class":339,"line":376},[337,1019,844],{"class":347},[337,1021,822],{"class":357},[337,1023,849],{"class":388},[337,1025,1026],{"class":339,"line":395},[337,1027,854],{"class":388},[337,1029,1030,1032],{"class":339,"line":400},[337,1031,859],{"class":388},[337,1033,862],{"class":357},[16,1035,1036],{},"Even though the two services received the same input and used the same database, the Llama service 
returned a different response:",[328,1038,1040],{"className":868,"code":1039,"language":870,"meta":333,"style":333},"{\n  \"protein\": 23.19,\n  \"carbohydrates\": 0.0,\n  \"fats\": 12.95,\n  \"calories\": 209,\n  \"sugars\": 0.0,\n  \"fibre\": 0.0\n}\n",[168,1041,1042,1046,1061,1075,1090,1105,1120,1133],{"__ignoreMap":333},[337,1043,1044],{"class":339,"line":340},[337,1045,849],{"class":357},[337,1047,1048,1050,1052,1054,1056,1059],{"class":339,"line":369},[337,1049,881],{"class":357},[337,1051,884],{"class":577},[337,1053,477],{"class":357},[337,1055,598],{"class":357},[337,1057,1058],{"class":780}," 23.19",[337,1060,894],{"class":357},[337,1062,1063,1065,1067,1069,1071,1073],{"class":339,"line":376},[337,1064,881],{"class":357},[337,1066,901],{"class":577},[337,1068,477],{"class":357},[337,1070,598],{"class":357},[337,1072,908],{"class":780},[337,1074,894],{"class":357},[337,1076,1077,1079,1081,1083,1085,1088],{"class":339,"line":395},[337,1078,881],{"class":357},[337,1080,917],{"class":577},[337,1082,477],{"class":357},[337,1084,598],{"class":357},[337,1086,1087],{"class":780}," 12.95",[337,1089,894],{"class":357},[337,1091,1092,1094,1096,1098,1100,1103],{"class":339,"line":400},[337,1093,881],{"class":357},[337,1095,933],{"class":577},[337,1097,477],{"class":357},[337,1099,598],{"class":357},[337,1101,1102],{"class":780}," 209",[337,1104,894],{"class":357},[337,1106,1107,1109,1112,1114,1116,1118],{"class":339,"line":407},[337,1108,881],{"class":357},[337,1110,1111],{"class":577},"sugars",[337,1113,477],{"class":357},[337,1115,598],{"class":357},[337,1117,908],{"class":780},[337,1119,894],{"class":357},[337,1121,1122,1124,1127,1129,1131],{"class":339,"line":429},[337,1123,881],{"class":357},[337,1125,1126],{"class":577},"fibre",[337,1128,477],{"class":357},[337,1130,598],{"class":357},[337,1132,971],{"class":780},[337,1134,1135],{"class":339,"line":434},[337,1136,976],{"class":357},[11,1138,1140],{"id":1139},"model-comparison","Model 
comparison",[16,1142,1143],{},"Let's now review the two models, using the following table as a comparison.",[58,1145,1146,1170],{},[61,1147,1148],{},[64,1149,1150,1155,1160,1165],{},[67,1151,1152],{},[70,1153,1154],{},"Model",[67,1156,1157],{},[70,1158,1159],{},"Execution time (5 runs)",[67,1161,1162],{},[70,1163,1164],{},"VRAM Memory footprint",[67,1166,1167],{},[70,1168,1169],{},"Accuracy",[94,1171,1172,1187],{},[64,1173,1174,1176,1179,1184],{},[99,1175,101],{},[99,1177,1178],{},"~ 2.85 seconds",[99,1180,1181],{},[70,1182,1183],{},"~ 2.4 GB",[99,1185,1186],{},"Questionable",[64,1188,1189,1191,1196,1199],{},[99,1190,117],{},[99,1192,1193],{},[70,1194,1195],{},"~ 1.72 seconds",[99,1197,1198],{},"~ 5 GB",[99,1200,1201],{},[70,1202,1203],{},"Pretty much accurate",[16,1205,1206],{},"As expected, Qwen has a larger memory footprint, but it also has a faster response time, which is a key factor for production applications.",[16,1208,1209],{},"Regarding accuracy, there is a noticeable discrepancy between the two models, both in the JSON schema and in the data they return.\nAs for the schema, models with fewer parameters are generally less reliable. 
In this case, the Llama model sometimes returned a different format between runs.",[16,1211,1212],{},"Regarding the data returned, the prompt used by both models is the following:",[328,1214,1218],{"className":1215,"code":1216,"language":1217,"meta":333,"style":333},"language-md shiki shiki-themes material-theme-lighter material-theme material-theme-palenight","Please use only the following context to answer the question.\n**Precedence Rule: Always choose the nutritional data for RAW foods if available.**\n\nGet the nutritional data of the following food ingredient: **Salmon fish**.\nCONTEXT OPTIONS:\nproduct name: FISH,SALMON,COHO (SILVER),RAW (ALASKA NATIVE), fat: 5.57, carbohydrates: 0.0, proteins: 22.56, calories: 140, sugars: 0.0, fiber: 0.0 \nproduct name: FISH,SALMON,RED,(SOCKEYE),KIPPERED (ALASKA NATIVE), fat: 4.75, carbohydrates: 0.0, proteins: 24.5, calories: 141, sugars: 0.0, fiber: 0.0 \nproduct name: FISH,SALMON,KING,W/ SKN,KIPPERED,(ALASKA NATIVE), fat: 12.95, carbohydrates: 0.0, proteins: 23.19, calories: 209, sugars: 0.0, fiber: 0.0\n","md",[168,1219,1220,1225,1238,1242,1257,1262,1267,1272],{"__ignoreMap":333},[337,1221,1222],{"class":339,"line":340},[337,1223,1224],{"class":347},"Please use only the following context to answer the question.\n",[337,1226,1227,1231,1235],{"class":339,"line":369},[337,1228,1230],{"class":1229},"sHepR","**",[337,1232,1234],{"class":1233},"so75L","Precedence Rule: Always choose the nutritional data for RAW foods if available.",[337,1236,1237],{"class":1229},"**\n",[337,1239,1240],{"class":339,"line":376},[337,1241,373],{"emptyLinePlaceholder":372},[337,1243,1244,1247,1249,1252,1254],{"class":339,"line":395},[337,1245,1246],{"class":347},"Get the nutritional data of the following food ingredient: ",[337,1248,1230],{"class":1229},[337,1250,1251],{"class":1233},"Salmon fish",[337,1253,1230],{"class":1229},[337,1255,1256],{"class":347},".\n",[337,1258,1259],{"class":339,"line":400},[337,1260,1261],{"class":347},"CONTEXT 
OPTIONS:\n",[337,1263,1264],{"class":339,"line":407},[337,1265,1266],{"class":347},"product name: FISH,SALMON,COHO (SILVER),RAW (ALASKA NATIVE), fat: 5.57, carbohydrates: 0.0, proteins: 22.56, calories: 140, sugars: 0.0, fiber: 0.0 \n",[337,1268,1269],{"class":339,"line":429},[337,1270,1271],{"class":347},"product name: FISH,SALMON,RED,(SOCKEYE),KIPPERED (ALASKA NATIVE), fat: 4.75, carbohydrates: 0.0, proteins: 24.5, calories: 141, sugars: 0.0, fiber: 0.0 \n",[337,1273,1274],{"class":339,"line":434},[337,1275,1276],{"class":347},"product name: FISH,SALMON,KING,W/ SKN,KIPPERED,(ALASKA NATIVE), fat: 12.95, carbohydrates: 0.0, proteins: 23.19, calories: 209, sugars: 0.0, fiber: 0.0\n",[16,1278,1279,1280,1283],{},"The data extracted from the database (hence from the adopted dataset) provide three different samples for the salmon, but you may notice that the first one is ",[70,1281,1282],{},"RAW"," and that, while Qwen uses it, Llama usually ignores it.",[11,1285,1287],{"id":1286},"conclusions-and-future-challenges","Conclusions and future challenges",[16,1289,1290,1291,1293,1294,1296],{},"In conclusion, this research has demonstrated that it is possible to build cost-effective, production-ready LLM solutions by using open-source models and quantization.",[284,1292],{},"\nWhile smaller models might not always match the accuracy of their larger counterparts, they offer a significant advantage in terms of resource consumption and deployment flexibility.",[284,1295],{},"\nThe choice of the right model will always depend on the specific needs of the application, but with the right approach, it is possible to find a balance between performance and cost.",[16,1298,1299],{},"As a future challenge, it would be interesting to explore other quantization techniques and to fine-tune a smaller model on a specific domain to see if it is possible to improve its accuracy while keeping the resource consumption 
low.",[11,1301,1303],{"id":1302},"bibliography","Bibliography",[29,1305,1306,1314],{},[32,1307,1308],{},[20,1309,1313],{"href":1310,"rel":1311},"https://github.com/federicoibba/nutritional-information-rag/",[1312],"nofollow","Repository project",[32,1315,1316,1317,793,1321],{},"Article photo by ",[20,1318,1320],{"href":1319,"target":205},"https://unsplash.com/@kat_katerina?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText","Katerina",[20,1322,1324],{"href":1323,"target":205},"https://unsplash.com/photos/opened-brown-wooden-window-FQYCJSqER_0?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText","Unsplash",[1326,1327,1328],"style",{},"html pre.shiki code .s7zQu, html code.shiki .s7zQu{--shiki-light:#39ADB5;--shiki-light-font-style:italic;--shiki-default:#89DDFF;--shiki-default-font-style:italic;--shiki-dark:#89DDFF;--shiki-dark-font-style:italic}html pre.shiki code .sTEyZ, html code.shiki .sTEyZ{--shiki-light:#90A4AE;--shiki-default:#EEFFFF;--shiki-dark:#BABED8}html pre.shiki code .sMK4o, html code.shiki .sMK4o{--shiki-light:#39ADB5;--shiki-default:#89DDFF;--shiki-dark:#89DDFF}html pre.shiki code .sfazB, html code.shiki .sfazB{--shiki-light:#91B859;--shiki-default:#C3E88D;--shiki-dark:#C3E88D}html pre.shiki code .sHwdD, html code.shiki .sHwdD{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#546E7A;--shiki-default-font-style:italic;--shiki-dark:#676E95;--shiki-dark-font-style:italic}html pre.shiki code .s2Zo4, html code.shiki .s2Zo4{--shiki-light:#6182B8;--shiki-default:#82AAFF;--shiki-dark:#82AAFF}html pre.shiki code .sHdIc, html code.shiki .sHdIc{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#EEFFFF;--shiki-default-font-style:italic;--shiki-dark:#BABED8;--shiki-dark-font-style:italic}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: 
var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .spNyl, html code.shiki .spNyl{--shiki-light:#9C3EDA;--shiki-default:#C792EA;--shiki-dark:#C792EA}html pre.shiki code .sBMFI, html code.shiki .sBMFI{--shiki-light:#E2931D;--shiki-default:#FFCB6B;--shiki-dark:#FFCB6B}html pre.shiki code .sbssI, html code.shiki .sbssI{--shiki-light:#F76D47;--shiki-default:#F78C6C;--shiki-dark:#F78C6C}html pre.shiki code .sHepR, html code.shiki .sHepR{--shiki-light:#39ADB5;--shiki-light-font-weight:bold;--shiki-default:#89DDFF;--shiki-default-font-weight:bold;--shiki-dark:#89DDFF;--shiki-dark-font-weight:bold}html pre.shiki code .so75L, html code.shiki 
.so75L{--shiki-light:#E53935;--shiki-light-font-weight:bold;--shiki-default:#F07178;--shiki-default-font-weight:bold;--shiki-dark:#F07178;--shiki-dark-font-weight:bold}",{"title":333,"searchDepth":369,"depth":369,"links":1330},[1331,1332,1333,1334,1335,1336,1337,1338,1339,1340],{"id":13,"depth":376,"text":14},{"id":46,"depth":376,"text":47},{"id":176,"depth":376,"text":177},{"id":246,"depth":376,"text":247},{"id":296,"depth":376,"text":297},{"id":539,"depth":376,"text":540},{"id":979,"depth":376,"text":980},{"id":1139,"depth":376,"text":1140},{"id":1286,"depth":376,"text":1287},{"id":1302,"depth":376,"text":1303},"2025-11-02T00:00:00.000Z","A Nutritional RAG LLM using open-source models and quantization","/images/articles/open-source-rag.jpg","en",{},"/en/articles/open-source-rag-quantization",{"title":5,"description":1342},"en/articles/3.open-source-rag-quantization",[1350,1351,1352,1353,1354],"LLM","RAG","Llama","Qwen","quantization","Lo2MXPzVpj-GzrH9kyYVbOxrI4Bf1UkEV8CmHuNpZ6M",1763894162934]