llama.cpp

30 articles tagged with this topic

Consumer GPU Hits 100K Context: Local LLM Hardware Thresholds Drop Fast

An RTX 3090 runs a 27B model at 100K context and 50 tokens/s using quantization, MTP, and KV cache compression. Consumer inference now rivals last year's enterprise setup

5d ago · 2 min read

llama.cpp, MTP

llama.cpp MTP Hits Beta: Local LLM Inference Speed Gap Narrowing

llama.cpp MTP beta supports Qwen3.5. With tensor parallelism maturing, the local-cloud inference speed gap is narrowing, making local LLM deployment m

May 4 · 2 min read

Reddit, Meta

Reddit's AI Hall of Fame: Giants Set the Tone, Community Does the Dirty Work

Reddit's open-source AI Hall of Fame covers Meta, DeepSeek, and llama.cpp. LLM prosperity depends on a strict community division of labor, not just bi

May 3 · 2 min read

AMD R9700, local deployment

3 GPUs Run Agent Clusters: Local AI Bottleneck Shifts to Orchestration

A dev used 3 AMD GPUs for a local multi-agent setup: small models work solo, cloud model supervises. New local AI bottleneck: orchestration, not just

May 3 · 2 min read

Mistral, Unsloth

Mistral 3.5 Inference Bug Fixed by Open-Source Team — LLM Delivery QA Flashing Red

Unsloth fixed a Mistral Medium 3.5 inference bug from a core config error, exposing absent QA in commercial LLMs. Beware the "community beta" business

May 2 · 2 min read

Gemma, Google

Gemma 4 Hits HuggingFace — Open Source Outpaces Official Toolchain

gemma-4-31B-it-DFlash on HuggingFace lacks llama.cpp support. We see models outpacing toolchains—having models you can't run is the new paradox.

May 2 · 2 min read

PFlash, llama.cpp

10x Speedup on Consumer GPUs for Long-Context LLMs — PFlash Ends the Wait

PFlash cuts RTX 3090 128K long-text wait from 4 min to 24 sec. First-token latency on consumer GPUs solved—local LLM deployment now commercially viabl

May 1 · 2 min read

Unsloth, Qwen3.6

Qwen3.6 GGUF Benchmarks

Unsloth claims top KLD-vs-disk-space performance for Qwen3.6-35B-A3B quants in 21 of 22 Pareto frontier comparisons.

Apr 17 · 3 min read

llama.cpp, Qwen3

GPoUr with ~12gb vram and a 3080 getting 40tg/s on qwen3.6 35BA3B w/ 260k ctx

A llama.cpp fork with turbo3 KV cache quantization achieves ~40 tok/s on Qwen3.6-35B-A3B with only 12GB VRAM.

Apr 16 · 3 min read

Gemma-4, Qwen3.5

Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga

Oobabooga published 5 benchmark reports covering 70-90 GGUF quants each for Gemma 4 and Qwen 3.5 models using KL Divergence methodology.

Apr 15 · 3 min read

Gemma-4, Google-DeepMind

Gemma 4 Jailbreak System Prompt

A system prompt designed to bypass Gemma 4's safety filters is circulating on Reddit with 112 upvotes.

Apr 15 · 3 min read

LocalLLaMA, llama.cpp

Local AI is the best

A Reddit post praising local AI tools contains no verifiable news, data, or technical developments.

Apr 15 · 1 min read

Qwen3.5, GGUF

Qwen3.5-9B GGUF Quant Rankings: Q8_0 Dominates KLD Scores

KLD benchmarks across community GGUF quants show Q8_0 variants cluster near 0.001 KLD, with quality degrading sharply below Q5.

Apr 14 · 3 min read

llama.cpp, Android

On-Device AI Model Deployment in Practice, Part 5 (Loading Large Models on Android)

Step-by-step JNI bridge implementation for running quantized LLMs on Android using llama.cpp.

Apr 14 · 3 min read

Unsloth, MiniMax-M2.7

Unsloth Releases Full GGUF Quant Suite for MiniMax M2.7

Unsloth uploads 22 GGUF quantizations of MiniMax M2.7, ranging from 1-bit (60.7 GB) to BF16 (457 GB).

Apr 12 · 3 min read

MiniMax-M2.7, llama.cpp

MiniMax-M2.7 229B MoE Gets First GGUF Quants for Apple Silicon

MiniMax-M2.7 (229B MoE) quantized to Q3_K_L (110GB) and Q8_0 (243GB) GGUF formats, now on HuggingFace.

Apr 12 · 3 min read

local-deployment, vram-optimization

KV Cache Compression Breakthrough: Structural Rewrite of Local LLM Deployment Costs

llama.cpp achieves 6.8x KV cache compression, cutting 131K context VRAM from 8.2GB to 1.2GB, rewriting local AI hardware procurement logic.

Apr 11 · 2 min read

OCR, Local Deployment

The Rise of Local OCR Models: The Countdown to the End of Bill Recognition Outsourcing

llama.cpp now enables local OCR deployment, letting enterprises bypass cloud APIs and forcing repricing in the annual bill recognition outsourcing mar

Apr 10 · 2 min read

Qwen-32B, llama.cpp

Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls

Qwen 32B, Gemma 9B, and Command R 32B all fail in similar ways after 8+ tool calls; the cause is attention dilution, not context limits.

Apr 8 · 4 min read

Qwen3.5, LocalAI

Qwen 3.5 35B Benchmarks: Vulkan vs ROCm on AMD Strix Halo

Vulkan wins token generation (~57.5 t/s) while ROCm leads prompt processing (~1052 t/s) on AMD Ryzen AI MAX+ 395.

Apr 8 · 3 min read

Gemma 4, llama.cpp

Fixing Gemma 4 Tool Calls in llama.cpp: Root Causes Explained

Four bugs in llama.cpp's Gemma 4 chat template handling caused tool call results to crash or loop.

Apr 8 · 3 min read

Gemma 4, Qwen3

Controlling Gemma 4 Thinking Tokens via System Prompts

Users struggle to reliably toggle Gemma 4's reasoning mode via system prompts, unlike Qwen-30B-A3B.

Apr 8 · 3 min read

Ollama, llama.cpp

Local LLM Setup Guide for RTX 5070 12GB VRAM

Choosing local AI models for chat, writing, and music on a 12GB VRAM RTX 5070 build.

Apr 8 · 3 min read

Google Edge Gallery, on-device LLM

Google Edge Gallery App: First Impressions from LocalLLaMA Community

A LocalLLaMA user shares early impressions of Google's Edge Gallery on-device AI app for Android.

Apr 7 · 1 min read

Gemma 4, llama.cpp

Gemma 4 Local CUDA Setup: Precision Traps and Real Benchmarks

Running Gemma 4 locally on CUDA requires strict dtype matching at KV cache boundaries or output degenerates silently.

Apr 7 · 2 min read

Gemma-4, Qwen3.5

Gemma-4 E4B Vision Benchmarked: Scores 0.27 vs Qwen3.5-4B's 0.5

Community testing shows Gemma-4 E4B scores 0.27 on 100 vision tasks vs Qwen3.5-4B's baseline 0.5, raising red flags for multimodal use.

Apr 7 · 2 min read

llama.cpp, llama-bench

llama.cpp llama-bench Adds -fitc and -fitt Benchmark Flags

llama-bench gains -fitc and -fitt flags from build b4679, enabling finer control over benchmark timing output.

Apr 6 · 1 min read

ggml, llama.cpp

GGML Adds Q1_0 1-Bit Quantization: Run 8B Models at 1.15GB

GGML now supports Q1_0 1-bit quantization, shrinking Bonsai 8B models to 1.15GB for CPU-only inference.

Apr 6 · 2 min read

llama.cpp, Intel Arc

llama.cpp Q8_0 Gets 3.1x Speedup on Intel Arc GPUs via SYCL Fix

A 200-line SYCL patch fixes missing reorder optimization for Q8_0, boosting Arc B70 from 4.88 to 15.24 t/s.

Apr 6 · 2 min read

llama.cpp, Qwen

37 LLMs Benchmarked on MacBook Air M5 32GB: Full Speed Results

Community benchmark of 37 local LLMs on M5 Air 32GB using llama-bench reveals MoE models as clear winners for speed-to-quality ratio.

Apr 6 · 2 min read