What Happened

The llama.cpp project has merged support for two new command-line arguments in its llama-bench benchmarking tool: -fitc (format input token count) and -fitt (format input token time). The flags are available starting from build b8679, and the change had been awaited by community members running local LLM performance tests.

Why It Matters

llama-bench is the standard tool for measuring inference throughput on local hardware. Before this change, benchmark output formatting options were limited, making it harder to script automated performance comparisons or pipe results into monitoring dashboards. The new flags give developers explicit control over how token count and timing columns are presented, which matters when:

  • Comparing performance across multiple model quantizations (Q4_K_M, Q8_0, etc.)
  • Feeding benchmark data into CI pipelines or spreadsheets
  • Running batch tests across GPU and CPU backends on the same machine

For indie developers and SMEs running self-hosted inference, cleaner benchmark output reduces the manual work needed to evaluate whether a hardware upgrade or quantization change actually improves throughput per dollar.
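To illustrate the scripting use case, here is a minimal sketch of ranking quantizations by throughput from llama-bench's machine-readable output. It assumes CSV output (llama-bench has long supported an -o output-format option) with a model column and a tokens-per-second column; the column names here (model, avg_ts) and the sample data are illustrative, so adjust them to whatever header your build actually emits.

```python
import csv
import io

# Illustrative sample of CSV benchmark output; real column names
# depend on your llama-bench build -- adjust to match its header.
SAMPLE = """model,avg_ts
llama-7b-Q4_K_M,52.3
llama-7b-Q8_0,38.1
"""

def rank_by_throughput(csv_text: str,
                       model_col: str = "model",
                       ts_col: str = "avg_ts") -> list[tuple[str, float]]:
    """Return (model, tokens/sec) pairs sorted fastest-first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sorted(((r[model_col], float(r[ts_col])) for r in rows),
                  key=lambda pair: pair[1], reverse=True)

print(rank_by_throughput(SAMPLE)[0])  # fastest quantization in the sample
```

The same function works unchanged whether the rows come from a GPU run or a CPU run, which is what makes consistent output formatting valuable for side-by-side comparisons.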

Asia-Pacific Angle

Developers in China and Southeast Asia frequently run llama.cpp on cost-optimized hardware — consumer GPUs like the RTX 4060, or Apple Silicon MacBooks — to avoid cloud API costs and data residency concerns. Automated benchmarking with consistent output formatting is especially valuable when evaluating Chinese-language models such as Qwen2.5 or DeepSeek-R1 across different quantization levels. The -fitc and -fitt flags make it easier to build local scripts that track tokens-per-second regressions as new model versions are released, without manually parsing inconsistent console output.
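A regression tracker of the kind described above reduces to a single comparison once throughput numbers are parsed. The sketch below is a hypothetical helper (the function name and 5% default tolerance are choices of this article, not part of llama.cpp): it flags a regression when current tokens-per-second falls more than a given fraction below a stored baseline.

```python
def regressed(baseline_ts: float, current_ts: float,
              tolerance: float = 0.05) -> bool:
    """True when throughput drops more than `tolerance` (fractional)
    below the baseline, e.g. a new model version running slower."""
    return current_ts < baseline_ts * (1.0 - tolerance)

# Baseline 50 t/s: a drop to 46 t/s (8%) trips the check,
# a drop to 49 t/s (2%) stays within tolerance.
print(regressed(50.0, 46.0), regressed(50.0, 49.0))
```

Running this after each model release against a version-controlled baseline turns "did the update slow things down?" into a yes/no answer instead of a manual eyeball of console output.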

Action Item This Week

Update your llama.cpp build to b8679 or later, then run llama-bench -fitc -fitt against your current model to establish a baseline in a CSV-friendly format that you can version-control alongside your hardware configuration.
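For a repeatable baseline, it helps to build the benchmark invocation in a script rather than retyping it. This sketch assumes llama-bench's standard -m (model path) and -o (output format) options alongside the new flags; the model filename is a placeholder, and the helper name is this article's own.

```python
import shlex

def bench_cmd(model_path: str, output_format: str = "csv") -> list[str]:
    """Assemble a llama-bench invocation as an argv list.

    -fitc / -fitt are the new formatting flags (build b8679+);
    -m and -o are long-standing llama-bench options.
    """
    return ["llama-bench", "-m", model_path,
            "-o", output_format, "-fitc", "-fitt"]

# Print the command so it can be copied into a shell or CI job.
print(shlex.join(bench_cmd("models/qwen2.5-7b-q4_k_m.gguf")))
```

Feeding the argv list to subprocess.run and committing the resulting CSV next to your hardware notes gives you the version-controlled baseline described above.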