Self-Hosted LLM Inference: Hardware Sizing Guide for Bürgerdienste
infrastructure · self-hosted · llm · hardware

Vorname Nachname, Engineering Lead · 8 min read

Why Self-Hosted?

For German public authorities, running AI inference in a public cloud creates two problems: DSGVO compliance (citizen voice data leaving the administrative network) and vendor lock-in at a moment when AI infrastructure is evolving rapidly. Self-hosted inference solves both — but only if the hardware requirements are realistic.

The good news: a modern 7B parameter model running on a single consumer-grade GPU handles 20 simultaneous calls with sub-200 ms latency. Here is how we size it.

The VoiceA Stack at Runtime

A single call involves three inference workloads running in sequence per turn:

  1. Whisper large-v3-turbo (ASR) — ~800 MB VRAM, ~80 ms per 5-second audio chunk on RTX 4090
  2. Mistral 7B or Llama 3.1 8B (LLM intent + response, via Ollama) — ~6 GB VRAM (4-bit quantised), ~120 ms per response at 512 token context
  3. Piper TTS (response synthesis) — CPU-only, ~40 ms per sentence

Total GPU VRAM: ~7 GB for one set of resident models. An RTX 4090 (24 GB) can run approximately 3 fully independent model instances; with request batching across them, a single card handles 15–20 simultaneous calls.
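The per-session footprint and per-turn latency figures above can be sketched as a small sizing calculation. All constants are this section's own estimates; the helper names and the 2 GB headroom allowance are illustrative assumptions, not measurements:

```python
# Rough per-GPU sizing from the per-session figures above.
# All numbers are this section's estimates; the 2 GB headroom is an assumption.

WHISPER_VRAM_GB = 0.8     # Whisper large-v3-turbo (ASR)
LLM_VRAM_GB = 6.0         # 7B/8B model, 4-bit quantised
SESSION_VRAM_GB = WHISPER_VRAM_GB + LLM_VRAM_GB  # Piper TTS runs on CPU

ASR_MS, LLM_MS, TTS_MS = 80, 120, 40  # per-turn latencies from the list above

def instances_per_gpu(gpu_vram_gb: float, headroom_gb: float = 2.0) -> int:
    """Independent model instances that fit, leaving headroom for the
    CUDA context and KV-cache growth."""
    return int((gpu_vram_gb - headroom_gb) // SESSION_VRAM_GB)

def perceived_latency_ms() -> int:
    """ASR streams while the caller is still speaking, so the felt
    response time is dominated by the LLM and TTS stages."""
    return LLM_MS + TTS_MS

print(instances_per_gpu(24.0))   # RTX 4090 -> 3
print(perceived_latency_ms())    # -> 160, under the 200 ms target
```

Note that summing all three stages gives 240 ms per turn, but because transcription overlaps with speech, the latency the caller perceives after finishing a sentence is closer to the LLM + TTS total.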

Recommended Configurations

Small deployment (≤ 15 concurrent calls)

  • 1 × NVIDIA RTX 4090 (24 GB VRAM)
  • 64 GB DDR5 RAM
  • 2 TB NVMe SSD (RAID 1 for audit log resilience)
  • 2.5 GbE NIC minimum, 10 GbE recommended
  • Estimated cost: ~€5,000 server build

Medium deployment (15–50 concurrent calls)

  • 2 × NVIDIA RTX 4090 or 1 × NVIDIA A100 40 GB
  • 128 GB RAM
  • 4 TB NVMe SSD
  • 10 GbE NIC
  • Estimated cost: €12,000–€18,000

Large deployment (50+ concurrent calls)

  • 2 × NVIDIA H100 80 GB (NVLink recommended for model sharding)
  • 256 GB RAM
  • 10 GbE + dedicated storage network
  • Estimated cost: €50,000+ (typically justified for Landesbehörden or large Kreise)
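The three tiers above can be encoded as a simple selection helper. Tier boundaries and costs are copied from the headings and bullet lists; the function and table structure are a sketch:

```python
# Map expected peak concurrent calls to the hardware tiers described above.
# Boundaries and cost figures are this section's; the helper is illustrative.

TIERS = [
    (15, "1× RTX 4090, 64 GB RAM, 2 TB NVMe (RAID 1)", "~€5,000"),
    (50, "2× RTX 4090 or 1× A100 40 GB, 128 GB RAM, 4 TB NVMe", "€12,000-18,000"),
    (float("inf"), "2× H100 80 GB (NVLink), 256 GB RAM", "€50,000+"),
]

def recommend(peak_concurrent_calls: int) -> tuple[str, str]:
    """Return (hardware spec, estimated cost) for a given peak load."""
    for limit, spec, cost in TIERS:
        if peak_concurrent_calls <= limit:
            return spec, cost
    raise ValueError("unreachable: last tier is unbounded")

print(recommend(12))   # small tier
print(recommend(40))   # medium tier
```

Sizing should use *peak* concurrency, not the daily average: a Bürgeramt's call volume clusters around opening hours.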

Energy Consumption

A key selling point of self-hosted inference is predictable, measurable energy consumption. An RTX 4090 draws ~350 W under full load; shared across 20 concurrent calls, the marginal GPU energy for a 4-minute call is only about 1 Wh. The figure that matters for budgeting, however, is the whole server's draw (idle periods, CPU, cooling, PSU losses) amortised over actual call volume: our Freiburg pilot measured 0.12 kWh per call on a comparable configuration, roughly the energy of charging a laptop twice.
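As a cross-check, the marginal GPU energy per call (GPU power shared across the concurrent sessions for the call's duration) follows directly from the figures above; whole-server amortised figures per call come out much higher because they include idle draw and host overhead:

```python
# Marginal GPU energy per call: GPU power shared across concurrent
# sessions for the call's duration. Excludes idle draw, CPU, cooling,
# and PSU losses, which dominate the amortised per-call figure.

def marginal_wh_per_call(gpu_watts: float, concurrent: int,
                         call_minutes: float) -> float:
    """Watt-hours of GPU energy attributable to one call."""
    return gpu_watts / concurrent * (call_minutes / 60.0)

# Figures from this section: 350 W, 20 concurrent calls, 4-minute calls.
print(round(marginal_wh_per_call(350, 20, 4), 2))  # -> 1.17 Wh
```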

For comparison, a cloud inference call (API-based) has an estimated energy footprint of 0.08–0.15 kWh/call depending on provider data centre efficiency — comparable, but with the added overhead of data transfer and the loss of local control.

Storage and Backup

Audit logs grow at roughly 2 MB per 100 calls. For a medium-sized Bürgeramt handling 500 calls per day, that is ~3.6 GB per year — well within standard IT budgets. We recommend dedicated encrypted volumes for audit logs with daily off-site backups to a second sovereign location (not cloud).
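The growth estimate above is a one-line calculation; a sketch, with the 2 MB per 100 calls figure taken from this section:

```python
# Audit-log growth estimate from the figure above: ~2 MB per 100 calls.
# The function name and the default rate are illustrative.

def audit_log_gb_per_year(calls_per_day: int,
                          mb_per_100_calls: float = 2.0) -> float:
    """Yearly audit-log volume in GB for a given daily call count."""
    daily_mb = calls_per_day / 100 * mb_per_100_calls
    return daily_mb * 365 / 1000

# A medium-sized Bürgeramt at 500 calls/day:
print(audit_log_gb_per_year(500))  # -> 3.65 GB/year
```

Even a tenfold overestimate of log size stays well under the 2 TB small-deployment SSD, so audit retention is not a sizing driver.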

Conclusion

Self-hosted LLM inference for government voice AI is not an exotic research project. With a single well-specced server, a Bürgeramt can run a fully capable, DSGVO-compliant multilingual voice assistant with no recurring cloud costs and full data sovereignty. The hardware investment pays back within 18–24 months compared to equivalent cloud-based solutions at scale.