I'm really in awe of how well you explain everything. I wish I had professors with your patience and teaching ability when I was at university. Anyway, thank you for the lesson. I already love Ollama, but your content is making me see LLMs and, in this case, quantizations, with different eyes. To be honest, I used to think that anything lower than q8 for, let's say, a 7 or 8 billion parameter model would be pretty much useless, but after experimenting with Llama 3.1, Mistral, and a few others, I think q4 is definitely the sweet spot for my needs. Llama 3.1 q4 retains a decent amount of reasoning capability, and I can increase the context length to have it work better with whatever information I want to feed it on the spot. Thanks again for the content. It's awesome!
Excellent video, Matt. Your take on the usefulness of benchmarks is smart.
You changed my thinking on which quant to use! I'll experiment more with finding the lowest quant that still gives acceptable answers. Thanks!
Great tutorial as usual. By the way, here is a Windows PowerShell command to update all your models if you have more than one installed:

ollama ls | ForEach-Object {"{0}" -f ($_ -split '\s+')} | Where-Object { $_ -notmatch 'failed' -and $_ -notmatch 'NAME' } | ForEach-Object { $model = $_; "Updating model $model"; ollama pull $model }

This one works on both Mac and Linux:

ollama ls | awk '{print $1}' | grep -v NAME | while read model; do echo "### updating $model ###"; ollama pull $model; done

I wrote these myself, but you can ask your favorite GPT for an explanation.
Looking at quantized models is something I haven't even considered yet for my home server. I have an HPE MicroServer that I can't put a graphics card into because it's physically too small, so I'm running CPU only. Now you've got me curious whether I can actually get faster speeds just by using a smaller quantized model. Thank you so much for making this content. You're absolutely amazing.
I found the evaluation interesting and the conclusion wise: try with your own prompts and see what happens. I would suggest extending the evaluation to many more conversation turns, because some models get lost later on despite doing well on the first reply. Your evaluation made me curious to try different quants at temperature 0 with the same seed, to see if some of the quants end up with identical output!
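For anyone who wants to try that last suggestion, here is a minimal sketch using the Ollama Python client (pip install ollama). The model tags and the prompt are placeholders; substitute whichever quants you have pulled locally.

```python
# A sketch of the fixed-seed experiment suggested above, using the Ollama
# Python client. The model tags and prompt are placeholders for whichever
# quants you have pulled locally.
import ollama

PROMPT = "List three pros and cons of quantizing an LLM to 4 bits."

for tag in ("llama3.1:8b-instruct-q4_K_M", "llama3.1:8b-instruct-q8_0"):
    result = ollama.generate(
        model=tag,
        prompt=PROMPT,
        options={"temperature": 0, "seed": 42},  # same seed, greedy-ish sampling
    )
    print(f"--- {tag} ---")
    print(result["response"])
```

If two quants produce byte-identical output under these settings, that is a strong hint the quantization made no practical difference for that prompt.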
I agree with all the points made in the video and would just like to add my own experience for viewers looking for more details in the comments. One important factor to consider when choosing quantization levels is the impact of hardware constraints. For example, I've been running the LLaMA 3.1 70B model, which fits in 48GB of VRAM at Q4 but with a limited context window. I found that running the 70B model at Q2 (which frees up memory for extending the context window) gave me better results than the 8B model with an extended context window. This balance between model size, quantization, and context window size can be crucial depending on your specific use case and hardware capabilities.
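To make that balance concrete, here is a rough back-of-envelope memory estimator (quantized weights plus an fp16 KV cache). The architecture numbers approximate Llama 3.1 70B; actual usage varies with the quant format and runtime overhead, so treat the output as a ballpark only.

```python
# Rough VRAM estimate: quantized weights plus an fp16 KV cache.
# Architecture numbers approximate Llama 3.1 70B (80 layers, 8 KV heads
# under GQA, head dim 128); treat the result as a ballpark only.

def estimate_gb(params_billion, bits_per_weight, ctx_len,
                n_layers=80, n_kv_heads=8, head_dim=128, kv_bytes=2):
    weights = params_billion * 1e9 * bits_per_weight / 8                 # bytes for weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes  # K + V cache
    return (weights + kv_cache) / 1e9

# 70B at ~4.5 bits with an 8K context vs ~2.5 bits with a 32K context
print(f"Q4-ish, 8K ctx:  ~{estimate_gb(70, 4.5, 8192):.0f} GB")
print(f"Q2-ish, 32K ctx: ~{estimate_gb(70, 2.5, 32768):.0f} GB")
```

Under these assumptions the Q4-ish configuration lands in the low 40s of GB and the Q2-ish one in the low 30s, which is consistent with the trade-off described above on a 48GB card.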
This is great. I'd love to see the difference between the quant levels as the length of the prompt increases. I find that the lower quants don't handle longer inputs very well, and I'm not sure why that is.
Thanks Matt. Very well explained and informative. I use tree-of-thought prompting, obviously with multi-shot examples, for what are normally very complex tasks where the model needs to pick up the hints in a clinical case description to help diagnose or optimize the treatment of the patient. I see that models with more parameters perform better, because they pick up more details and correlate the facts better. I noticed that Claude and Gemini are the kings there. What about quantization in this case? Any recommendations?
Interesting video! Thanks!
Great stuff!
Would love to see how to get it running in Proxmox with multiple GPUs. Lots of old articles out there.
Another great video, thanks. Would you mind adding a link to the related YouTube video in the README file of each folder in your videoprojects repo? It would make it a lot easier to find the video when browsing the repo. Ta, keep up the good work.
I think q2 was the best in the scenario where you were basically searching for the "json" string, because it wasn't trying to "understand" what JSON is; it was just a word/string, and thus always caught.
@technovangelist Thanks, Matt. The experiment was surprising. All in all, it seems that more heavily quantized models can still achieve good results (at least for function calling), which is counterintuitive to me. However, if this holds up statistically, it's good news for local applications driven by a small LLM that calls external (but still local) services. In other words, it's promising for real-time on-prem automation! As someone suggested in a comment, the balance between model size, quantization, and context window size seems to be crucial. I'd suggest dedicating a session to context window size and its usage. I'm personally confused by the default context length in Ollama and how to set the desired window size. Thanks for this course. Giorgio
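On that context-window question: Ollama's default context length has historically been quite small (2048 tokens), and you can override it per request with the num_ctx option. Below is a minimal sketch using the Ollama Python client; the model name, file name, and the 8192 value are placeholders, and num_ctx still has to fit in your available (V)RAM.

```python
# Sketch: overriding Ollama's context window per request with num_ctx,
# via the Ollama Python client. Model name, file name, and the 8192 value
# are placeholders.
import ollama

document = open("notes.txt").read()  # hypothetical input file

reply = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": f"Summarize this:\n\n{document}"}],
    options={"num_ctx": 8192},  # raise the context window above the default
)
print(reply["message"]["content"])
```

If you want the setting baked in, you can also create a variant with a Modelfile that sets PARAMETER num_ctx 8192 and then run ollama create on it.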
Great video
thanks! :)
I usually benchmark models with a simple programming question:
```
You are a software engineer experienced in C++.
Write a trivial C++ program that follows this code style:
Use modern C++20.
Use the auto func(...) -> ret syntax, even for auto main() -> int.
Always open curly braces on a new line: DON'T write auto main()->int{\n... (with no new line between int and '{'); DO write auto main() -> int\n{\n... instead (with a new line between int and '{').
Comment your code.
No explanation, no introduction, keep verbosity to the minimum, only code.
```
Even this dummy question fails most of the time on any Q4 model I've tried in Ollama. I hope to get better results with better quantization, but I need to upgrade my computer for that.
Rule of thumb: take the K-quants that fit into your GPU memory. Usually down to q3 the loss is really negligible. If the model plus context fits into memory, use I-quants.