5 Things to Check Before Trusting Any FHIR Benchmark

Every few months, someone publishes a FHIR server benchmark and the screenshots start flying around. Some of those numbers hold up. Some of them fall apart the moment you ask how the test was run. A benchmark is only as useful as the harness behind it, and you can usually tell which bucket one belongs in by checking a handful of things up front. Health Samurai released an open benchmark on 2026-06-29 that scores well on most of these checks. Marat Surmashev, the engineer who authored it, leans on reproducibility rather than a leaderboard. This post walks through the list using that benchmark as a worked example, and the broader FHIR knowledge base on the site has the surrounding context.

Check 1: Same Hardware for Every Server

You cannot meaningfully compare a server running on a large bare-metal box with one running on a workstation. The new benchmark pins everything to a single bare-metal machine with 64 cores and 500 GB of RAM, and each FHIR server gets 8 vCPU and 24 GB. Medplum runs as eight replicas of 1 vCPU and 3 GB each to match its supported deployment shape, which adds up to the same 8 vCPU and 24 GB. That is the kind of pinning that lets you trust a comparison at all.
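The budget arithmetic is easy to check by hand, but it is also easy to encode. Here is a minimal sketch, not taken from the benchmark repo, that compares each deployment's total allocation against the stated 8 vCPU and 24 GB. The `Deployment` class and the single-replica shape assumed for the non-Medplum servers are illustrative assumptions.

```python
from dataclasses import dataclass

# Per-server budget stated in the benchmark write-up.
BUDGET_VCPU = 8
BUDGET_RAM_GB = 24


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of how one server is deployed."""
    name: str
    replicas: int
    vcpu_per_replica: int
    ram_gb_per_replica: int

    @property
    def total_vcpu(self) -> int:
        return self.replicas * self.vcpu_per_replica

    @property
    def total_ram_gb(self) -> int:
        return self.replicas * self.ram_gb_per_replica


# Assumption: every server except Medplum runs as one 8-vCPU, 24-GB instance.
DEPLOYMENTS = [
    Deployment("Aidbox", 1, 8, 24),
    Deployment("HAPI FHIR", 1, 8, 24),
    Deployment("Medplum", 8, 1, 3),
    Deployment("Microsoft FHIR Server", 1, 8, 24),
]


def budget_mismatches(deployments: list[Deployment]) -> list[str]:
    """Return the names of deployments whose totals differ from the budget."""
    return [
        d.name
        for d in deployments
        if d.total_vcpu != BUDGET_VCPU or d.total_ram_gb != BUDGET_RAM_GB
    ]


if __name__ == "__main__":
    print(budget_mismatches(DEPLOYMENTS))
```

An empty list means every server, including the replicated one, sits on the same total budget.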

Check 2: Same Data Going In

Different datasets exercise different code paths. The benchmark uses a Synthea-generated dataset of 1,000 patient records, around 2 million resources, loaded the same way into every server. That makes the storage and import numbers comparable. The honest caveat the project ships with is that a 1,000-patient dataset fits in memory, so this is a baseline rather than a scale test. The next post in the series promises a larger corpus.
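One way to confirm that two runs really used the same input is to fingerprint the dataset directory before loading it. The sketch below is an illustration of that idea, not the benchmark's own tooling; the function name and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Hash every file under root (relative path, size, content) in a stable order."""
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative path so the result does not depend on filesystem order.
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        data = path.read_bytes()
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(len(data)).encode("ascii"))
        digest.update(b"\0")
        digest.update(data)
    return digest.hexdigest()
```

If the fingerprint recorded alongside each server's results matches, the storage and import numbers were produced from identical bytes.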

Check 3: Same Workload Driving the Load

A benchmark with one client hammering one endpoint tells you a very narrow story. This one drives CRUD, bundle import, and search through Grafana k6 with the same concurrency settings against every server. The scripts live in the repo, which means you can read them, argue with them, and re-run them. Aidbox, HAPI FHIR, Medplum, and the Microsoft FHIR Server all face the same k6 traffic rather than a different test profile each.
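"Same concurrency settings against every server" boils down to one load profile and a per-server target. The sketch below builds k6 command lines that differ only in the base URL. It is a hypothetical illustration: the `--vus`, `--duration`, and `-e` flags are real k6 CLI options, but the profile values, the URLs, the `BASE_URL` variable name, and the script name `crud.js` are assumptions, not values from the benchmark repo.

```python
# Hypothetical shared load profile; the real values live in the benchmark repo.
PROFILE = {"vus": 16, "duration": "5m"}

# Hypothetical endpoints, one per server under test.
SERVERS = {
    "aidbox": "http://aidbox:8080/fhir",
    "hapi": "http://hapi:8080/fhir",
    "medplum": "http://medplum:8103/fhir/R4",
    "microsoft": "http://msfhir:8080",
}


def k6_command(base_url: str, script: str = "crud.js") -> list[str]:
    """Build a k6 invocation using the shared profile and one target URL."""
    return [
        "k6", "run",
        "--vus", str(PROFILE["vus"]),
        "--duration", PROFILE["duration"],
        # The k6 script would read this value as __ENV.BASE_URL.
        "-e", f"BASE_URL={base_url}",
        script,
    ]


if __name__ == "__main__":
    for name, url in SERVERS.items():
        print(name, " ".join(k6_command(url)))
```

Keeping the profile in one place means a change to concurrency or duration applies to every server at once, which is the property this check is looking for.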

Check 4: Open Source

If you cannot read the harness, you cannot trust the numbers. The repo and the configuration files are public, which is the part that turns a vendor claim into a hypothesis someone else can test. It is fair to note that Health Samurai is the company behind Aidbox, so the benchmark is vendor-run; the open-source layout is what makes that bias inspectable rather than hidden. The same principle of operator inspection is why hosted vs self-hosted terminology servers for behavioral health practices matters for procurement on the terminology side.

Check 5: Daily Rerun

A benchmark that runs once and then bit-rots in a slide deck is not a benchmark. CI here re-executes the suite every day against the current server images, and the dashboard updates. A regression after a release shows up the next morning. That is the part most public benchmarks skip and the part that makes this one worth checking back on.
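Catching a regression the next morning only requires comparing today's numbers with yesterday's. Here is a minimal sketch of that comparison, assuming throughput in requests per second per scenario and a 10% tolerance; the function name, the scenario names, and the threshold are all illustrative assumptions rather than details of the benchmark's CI.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return scenarios whose throughput dropped by more than `tolerance`.

    Values are the fractional drop, e.g. 0.15 for a 15% decrease.
    """
    regressions: dict[str, float] = {}
    for scenario, before in previous.items():
        after = current.get(scenario)
        # Skip scenarios that are missing today or had no usable baseline.
        if after is None or before <= 0:
            continue
        drop = (before - after) / before
        if drop > tolerance:
            regressions[scenario] = drop
    return regressions


if __name__ == "__main__":
    # Made-up numbers for illustration only, not benchmark results.
    yesterday = {"crud": 1000.0, "search": 500.0}
    today = {"crud": 850.0, "search": 495.0}
    print(find_regressions(yesterday, today))
```

A real pipeline would likely compare against a rolling window rather than a single prior day to avoid flagging noise, but the day-over-day version shows the basic shape.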

For the terminology side of FHIR services for clinics, the complete guide to FHIR terminology services for behavioral health in 2026 sets the framing for the procurement conversation. The five checks above are the same questions to bring to any benchmark, FHIR or otherwise.
