---
description: Benchmark LLM inference across local GPUs, Apple Silicon, cloud (RunPod), and embedding models. Covers llama-bench, mlx-lm, vLLM, and automated pipeline to SiYuan.
category: personal
---
flowchart TD
    _HEADER_["
ML Benchmarking
Benchmark LLM inference across local GPUs, Apple Silicon, cloud (RunPod), and embedding models. Covers llama-bench, mlx-lm, vLLM, and automated pipeline to SiYuan.
"]:::headerStyle
    classDef headerStyle fill:none,stroke:none

    subgraph _MAIN_[" "]

    %% Phase 0: Choose Target
    subgraph Choose["Phase 0: Choose Benchmark Target"]
        START([Start]) --> TARGET{What to
Benchmark?}
        TARGET -->|LLM on
local GPU| LOCAL_GPU
        TARGET -->|LLM on
Apple Silicon| MAC
        TARGET -->|LLM on
cloud GPU| CLOUD
        TARGET -->|Embedding
model| EMBED
        TARGET -->|Batch
throughput| THROUGHPUT
    end

    %% Phase 1a: Local GPU Benchmarking
    subgraph LocalGPU["Phase 1a: Local GPU (NVIDIA)"]
        LOCAL_GPU[Select GPU
+ Model GGUF] --> GPU_SELECT[GPU Selection
CUDA_VISIBLE_DEVICES]
        GPU_SELECT --> SINGLE_GPU{Single or
Multi-GPU?}
        SINGLE_GPU -->|Single| LLAMA_SINGLE[llama-bench
single GPU]
        SINGLE_GPU -->|Multi| LLAMA_MULTI[llama-bench
split-mode layer]
        LLAMA_SINGLE --> LOCAL_RESULTS[Collect Results
pp + tg metrics]
        LLAMA_MULTI --> LOCAL_RESULTS
    end

    %% Phase 1b: Mac Benchmarking
    subgraph MacBench["Phase 1b: Apple Silicon"]
        MAC[Choose Framework] --> MAC_CHOICE{mlx-lm or
llama-bench?}
        MAC_CHOICE -->|mlx-lm
recommended| MLX_BENCH[mlx_lm.benchmark
prefill + gen]
        MAC_CHOICE -->|llama-bench
cross-platform| LLAMA_METAL[llama-bench
Metal backend]
        MAC_CHOICE -->|Quick check| MLX_GEN[mlx_lm.generate
--verbose T]
        MLX_BENCH --> MAC_RESULTS[Collect Results
tok/s + peak memory]
        LLAMA_METAL --> MAC_RESULTS
        MLX_GEN --> MAC_RESULTS
    end

    %% Phase 1c: Cloud GPU Benchmarking
    subgraph CloudBench["Phase 1c: RunPod Cloud"]
        CLOUD[Select Cloud GPU] --> LIST_GPUS[deploy-and-bench.py
--list-gpus]
        LIST_GPUS --> DEPLOY[deploy-and-bench.py
GPU_NAME]
        DEPLOY --> POD_LIFECYCLE[Pod Created
SSH Ready
Benchmark Runs]
        POD_LIFECYCLE --> CLOUD_RESULTS[Results Collected
Pod Terminated]
    end

    %% Phase 1d: Embedding Benchmarking
    subgraph EmbedBench["Phase 1d: Embedding Models"]
        EMBED[Choose Embedding
Backend] --> EMBED_BACKEND{Backend?}
        EMBED_BACKEND -->|llama.cpp
fastest| BENCH_LLAMACPP_EMB[bench-llamacpp-
embeddings.py]
        EMBED_BACKEND -->|mlx
Apple only| BENCH_MLX_EMB[bench-mlx-
embeddings.py]
        EMBED_BACKEND -->|TEI
cross-platform| BENCH_TEI_EMB[bench-tei-
embeddings.py]
        BENCH_LLAMACPP_EMB --> EMBED_RESULTS[Throughput Results
tok/s + emb/s]
        BENCH_MLX_EMB --> EMBED_RESULTS
        BENCH_TEI_EMB --> EMBED_RESULTS
        EMBED_RESULTS --> REAL_WORLD[bench-realworld-
embeddings.py
validate with docs]
    end

    %% Phase 1e: Batch Throughput
    subgraph BatchBench["Phase 1e: Batch Throughput"]
        THROUGHPUT[Choose Server
Framework] --> BATCH_CHOICE{Framework?}
        BATCH_CHOICE -->|llama.cpp| LLAMA_SERVER[llama-server
-np 8 --cont-batching]
        BATCH_CHOICE -->|vLLM| VLLM_SERVER[vllm api_server
benchmark_serving.py]
        LLAMA_SERVER --> LOAD_TEST[Load Test
hey or concurrent curl]
        VLLM_SERVER --> LOAD_TEST
        LOAD_TEST --> THROUGHPUT_RESULTS[Throughput Metrics
req/s + p50/p95/p99]
    end

    %% Phase 2: Analyze + Record
    subgraph Record["Phase 2: Analyze and Record"]
        LOCAL_RESULTS --> COMPARE{Compare
Results?}
        MAC_RESULTS --> COMPARE
        CLOUD_RESULTS --> COMPARE
        EMBED_RESULTS --> COMPARE
        REAL_WORLD --> COMPARE
        THROUGHPUT_RESULTS --> COMPARE
        COMPARE -->|Yes| CROSS_COMPARE[Cross-reference
GPU vs GPU or
framework vs framework]
        COMPARE -->|No| FORMAT
        CROSS_COMPARE --> FORMAT[Format JSON
bench-mlx.py or
bench-llamacpp.py]
        FORMAT --> POST_SIYUAN[post-to-siyuan.py
ML Benchmarks notebook]
        POST_SIYUAN --> DONE([Done])
    end

    click START "#" "**Start**\nBegin an ML benchmarking session.\nDecide what hardware, model, and workload to test."
    click TARGET "#" "**What to Benchmark?**\nChoose your benchmark track:\n- Local NVIDIA GPU (llama-bench on Chungus)\n- Apple Silicon (mlx-lm or llama-bench Metal)\n- Cloud GPU via RunPod (GPUs you don't own)\n- Embedding model throughput (5 backends)\n- Batch/concurrent throughput (server load testing)"
    click LOCAL_GPU "#" "**Select GPU + Model**\nPick the NVIDIA GPU and GGUF model to test.\n\nOn Chungus:\n- 2x RTX 3090 (24GB, fast) -- use these\n- 2x Tesla P40 (24GB, 4x slower) -- skip\n- 6x Tesla M40 (12GB, 14x slower) -- skip\n\nCheck GPU layout:\n`nvidia-smi --query-gpu=index,name --format=csv`"
    click GPU_SELECT "#" "**GPU Selection**\nGPU numbering differs between tools:\n- nvidia-smi uses physical PCIe order\n- CUDA uses CUDA_VISIBLE_DEVICES remapped order\n- llama.cpp uses its own enumeration\n\nSafest approach:\n`CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf`"
    click SINGLE_GPU "#" "**Single or Multi-GPU?**\nSingle GPU: simpler, consistent baseline.\nMulti-GPU: test layer-split scaling.\n\nDual 3090 gives ~1.5x speedup at 2k+ tokens.\nvLLM tensor parallelism has no speedup for MoE models."
    click LLAMA_SINGLE "#" "**llama-bench Single GPU**\nPrefill:\n`llama-bench -m model.gguf -ngl 99 -fa 1 -p 512,1024,2048 -n 0 -r 3`\n\nGeneration:\n`llama-bench -m model.gguf -ngl 99 -fa 1 -p 0 -n 128,256 -r 3`\n\nExpected on RTX 3090: ~3200 tok/s pp512, ~120 tok/s tg128"
    click LLAMA_MULTI "#" "**llama-bench Multi-GPU**\n`llama-bench -m model.gguf --device CUDA0,CUDA1 --split-mode layer -ngl 99 -fa 1 -p 512,1024,2048,4096,8192 -n 0 -r 3`\n\nExpected dual 3090: ~4600 tok/s pp4096.\nScaling improves at longer contexts."
    click LOCAL_RESULTS "#" "**Collect Local Results**\nKey metrics:\n- **pp** (prefill): tok/s processing input prompt\n- **tg** (generation): tok/s during autoregressive decode\n- VRAM usage\n- Context length limits\n\nUse `-o json` for machine-readable output."
    click MAC "#" "**Choose Framework**\nTwo options for Apple Silicon:\n- **mlx-lm** (recommended): 1.8-2.2x faster, native MLX format\n- **llama-bench**: Metal backend, GGUF format, cross-platform comparable\n\nmlx-lm: `pip install mlx-lm`\nllama-bench: `brew install llama.cpp`"
    click MAC_CHOICE "#" "**mlx-lm or llama-bench?**\n- mlx-lm for primary Mac benchmarking (fastest)\n- llama-bench for cross-platform comparison\n- mlx_lm.generate --verbose T for quick spot checks\n\nOn M3 Max 128GB, Qwen3-Coder-Next:\nmlx-lm pp1024: 1073 tok/s vs llama-bench: 605 tok/s"
    click MLX_BENCH "#" "**mlx_lm.benchmark**\nStructured benchmark with multiple trials:\n`python3 -m mlx_lm.benchmark --model mlx-community/MODEL-4bit -p 1024 -g 128 -n 3`\n\nBatch throughput:\n`python3 -m mlx_lm.benchmark --model MODEL -p 1024 -g 128 -b 4 -n 3`\n\nReports prompt_tps, generation_tps, peak_memory per trial."
    click LLAMA_METAL "#" "**llama-bench on Metal**\nMetal backend auto-detected on macOS.\n\nPrefill: `llama-bench -m model.gguf -p 512,1024,2048,4096 -n 0 -r 3`\nGeneration: `llama-bench -m model.gguf -p 0 -n 128,256 -r 3`\n\nResults are cross-platform comparable with CUDA benchmarks."
    click MLX_GEN "#" "**Quick Spot Check**\nSingle-run timing with verbose output:\n`mlx_lm.generate --model mlx-community/MODEL-4bit --prompt 'test' --max-tokens 128 --verbose T`\n\nShows prompt tok/s, generation tok/s, and peak memory.\nGood for quick checks, not rigorous benchmarking."
    click MAC_RESULTS "#" "**Collect Mac Results**\nKey metrics:\n- prompt_tps (prefill speed)\n- generation_tps (decode speed)\n- peak_memory (GB)\n- batch scaling efficiency\n\nMemory rule: model footprint should be < 70% of total RAM.\n~130MB per 1k context tokens for large MoE models."
    click CLOUD "#" "**Select Cloud GPU**\nBenchmark GPUs you don't own via RunPod.\nRequires: RunPod API key, SSH key, Python with `runpod` package.\n\nModels stored on RunPod network volumes.\nPod auto-terminates after benchmark completes."
    click LIST_GPUS "#" "**List Available GPUs**\n`./deploy-and-bench.py --list-gpus`\n\nShows all available GPU types with pricing.\nKey results (GLM-4.7-Flash Q4_K_M):\n- RTX 5090: 6240 pp1024, 196 tg128\n- RTX 4090: 6299 pp1024, $0.44/hr\n- RTX 3090: 3100 pp1024, $0.22/hr\n- H100 SXM: ~5000 pp1024, $2.49/hr"
    click DEPLOY "#" "**Deploy and Benchmark**\n`./deploy-and-bench.py 'NVIDIA GeForce RTX 4090'`\n\nAutomates the full lifecycle:\n1. Create RunPod pod with selected GPU\n2. Wait for SSH connectivity\n3. Run llama-bench suite\n4. Collect results\n5. Terminate pod\n\nUse `--keep` flag to keep pod for debugging."
    click POD_LIFECYCLE "#" "**Pod Lifecycle**\nAutomated by deploy-and-bench.py:\n1. Pod created with Docker image containing llama.cpp\n2. Network volume mounted with model files\n3. SSH tunnel established\n4. Prefill benchmarks run (pp512, pp1024, etc.)\n5. Generation benchmarks run (tg128, tg256)\n6. Results collected via SSH\n7. Pod terminated (unless --keep)"
    click CLOUD_RESULTS "#" "**Cloud Results**\nResults include:\n- Prefill tok/s at various context lengths\n- Generation tok/s\n- Cost per hour\n- Value metric: tok/$ (tokens per dollar)\n\nBest value is often mid-range GPUs (4090 at $0.44/hr)\nnot top-tier (H100 at $2.49/hr)."
    click EMBED "#" "**Choose Embedding Backend**\nBackends tested on Apple Silicon (fastest first):\n- llama.cpp server Q8_0: **78,352 tok/s** (fastest)\n- qwen3-embeddings-mlx 4-bit: 10,567 tok/s\n- mlx-embeddings 4-bit: 9,173 tok/s\n- TEI float16: 1,720 tok/s\n\nllama.cpp is 7-32x faster than alternatives on Metal."
    click EMBED_BACKEND "#" "**Backend Selection**\nChoose based on platform and speed needs:\n- **llama.cpp**: fastest, cross-platform, GGUF format\n- **mlx**: Apple Silicon only, native MLX format\n- **TEI**: cross-platform, HuggingFace ecosystem\n\nCritical: batch 4 is optimal on Metal.\nBatch 8+ drops throughput 5-15x (bandwidth saturation)."
    click BENCH_LLAMACPP_EMB "#" "**llama.cpp Embedding Benchmark**\n`scripts/bench-llamacpp-embeddings.py model.gguf`\n\nManages server lifecycle automatically.\nTests single-text and batch throughput.\nReports tok/s, emb/s, and latency.\n\nPeak on M3 Max: 78,352 tok/s at batch 4."
    click BENCH_MLX_EMB "#" "**MLX Embedding Benchmark**\n`scripts/bench-mlx-embeddings.py --model Qwen/...`\n\nApple Silicon only.\nTests qwen3-embeddings-mlx and mlx-embeddings backends.\nPeak: ~10,567 tok/s at batch 32."
    click BENCH_TEI_EMB "#" "**TEI Embedding Benchmark**\n`scripts/bench-tei-embeddings.py Qwen/...`\n\nText Embeddings Inference (HuggingFace).\nCross-platform but slowest on Metal: ~1,720 tok/s.\nMay be faster on CUDA (not yet tested)."
    click EMBED_RESULTS "#" "**Embedding Throughput Results**\nKey metrics:\n- tok/s (tokens per second)\n- emb/s (embeddings per second)\n- Latency per embedding\n- Optimal batch size\n\nCritical pitfalls:\n- Never run GPU benchmarks concurrently\n- Pooling must match model (Qwen3 uses 'last')\n- Real-world docs are 0.3-0.9x synthetic performance"
    click REAL_WORLD "#" "**Real-World Validation**\n`scripts/bench-realworld-embeddings.py --backend llamacpp --model model.gguf`\n\nSynthetic benchmarks overestimate real performance.\nAlways validate with actual documents.\nExpect 0.3-0.9x of synthetic throughput."
    click THROUGHPUT "#" "**Choose Server Framework**\nSingle-request benchmarks measure latency.\nProduction needs throughput (total tok/s with concurrent users).\n\nA server doing 50 tok/s per request might achieve 200 tok/s total with 8 concurrent requests."
    click BATCH_CHOICE "#" "**Framework Selection**\n- **llama.cpp server**: lightweight, `-np 8 --cont-batching`\n- **vLLM**: production-grade, built-in benchmark tools\n\nKey tradeoffs:\n- More parallel slots = higher throughput, higher latency\n- Continuous batching = much higher throughput\n- Speculative decoding = lower latency, slightly lower throughput"
    click LLAMA_SERVER "#" "**llama.cpp Server**\n`llama-server -m model.gguf -ngl 99 --host 0.0.0.0 --port 8080 -np 8 --cont-batching --metrics`\n\nCheck metrics:\n`curl -s localhost:8080/metrics`\n\nKey: llama_requests_processing, llama_tokens_predicted_total"
    click VLLM_SERVER "#" "**vLLM Server**\n`python -m vllm.entrypoints.openai.api_server --model MODEL --port 8000`\n\nBenchmark:\n`python benchmarks/benchmark_serving.py --backend vllm --num-prompts 1000 --request-rate 10`\n\nKey metrics: vllm:avg_generation_throughput, gpu_cache_usage_perc"
    click LOAD_TEST "#" "**Load Testing**\nUse `hey` for proper HTTP load testing:\n`hey -n 100 -c 10 -m POST -H 'Content-Type: application/json' -d '{...}' http://localhost:8080/v1/chat/completions`\n\nWatch p95/p99 latency -- matters more than average for user experience."
    click THROUGHPUT_RESULTS "#" "**Throughput Metrics**\nKey numbers:\n- req/s (requests per second)\n- p50, p95, p99 latency\n- Total tok/s across all concurrent requests\n- Queue depth and cache utilization\n\np95/p99 latency matters most for real users."
    click COMPARE "#" "**Compare Results?**\nOptionally cross-reference benchmarks:\n- Same model across different GPUs\n- Same GPU across different quantizations\n- mlx-lm vs llama-bench on Apple Silicon\n- Cost efficiency (tok/$) across cloud GPUs"
    click CROSS_COMPARE "#" "**Cross-Reference**\nBuild comparison tables:\n- GPU vs GPU (same model, same quant)\n- Framework vs framework (mlx vs llama.cpp)\n- Cost efficiency for cloud GPUs\n\nUse `--compare` flag in post-to-siyuan.py to pull existing results."
    click FORMAT "#" "**Format Results**\nPipeline scripts output structured JSON to stdout:\n`scripts/bench-mlx.py MODEL > results.json`\n`scripts/bench-llamacpp.py model.gguf > results.json`\n\nJSON includes hardware info, model details, all metrics, timestamps."
    click POST_SIYUAN "#" "**Post to SiYuan**\n`scripts/post-to-siyuan.py --file results.json`\n\nCreates formatted document in ML Benchmarks notebook.\nWith comparison: `--compare 'Qwen3-Coder-Next'`\nPreview only: `--dry-run`\n\nPipeline: `bench-mlx.py MODEL | post-to-siyuan.py`"
    click DONE "#" "**Done**\nBenchmark results recorded in SiYuan ML Benchmarks notebook.\nResults include tables for prefill, generation, batch throughput, memory, and key takeaways."
    classDef start fill:#d1ecf1,stroke:#7ec8d8
    classDef decision fill:#fff3cd,stroke:#f0c040
    classDef local fill:#dfe6ff,stroke:#5b7bce
    classDef mac fill:#e8daef,stroke:#b07cc6
    classDef cloud fill:#ffeaa7,stroke:#e0c040
    classDef embed fill:#d4edda,stroke:#5cb85c
    classDef throughput fill:#f8d7da,stroke:#e06070
    classDef record fill:#d1ecf1,stroke:#7ec8d8
    class START,DONE start
    class TARGET,SINGLE_GPU,MAC_CHOICE,EMBED_BACKEND,BATCH_CHOICE,COMPARE decision
    class LOCAL_GPU,GPU_SELECT,LLAMA_SINGLE,LLAMA_MULTI,LOCAL_RESULTS local
    class MAC,MLX_BENCH,LLAMA_METAL,MLX_GEN,MAC_RESULTS mac
    class CLOUD,LIST_GPUS,DEPLOY,POD_LIFECYCLE,CLOUD_RESULTS cloud
    class EMBED,BENCH_LLAMACPP_EMB,BENCH_MLX_EMB,BENCH_TEI_EMB,EMBED_RESULTS,REAL_WORLD embed
    class THROUGHPUT,LLAMA_SERVER,VLLM_SERVER,LOAD_TEST,THROUGHPUT_RESULTS throughput
    class CROSS_COMPARE,FORMAT,POST_SIYUAN record
    end
    style _MAIN_ fill:none,stroke:none,padding:0
    _HEADER_ ~~~ _MAIN_