# Workflows Benchmarks
This page compares direct model inference latency versus Workflow-wrapped inference latency for popular detection models. The goal is to quantify the overhead introduced by the Workflows execution engine.
## Self-Hosted Results
All times are in milliseconds. Each model was warmed up before timing, then measured over 10 iterations.
| Model | Avg Direct (ms) | Stddev Direct | Avg Workflow (ms) | Stddev Workflow |
|---|---|---|---|---|
| rfdetr-nano | 23.86 | 0.35 | 29.99 | 2.04 |
| rfdetr-small | 29.50 | 0.20 | 31.36 | 0.36 |
| rfdetr-medium | 33.97 | 0.54 | 34.78 | 0.23 |
| rfdetr-large | 40.96 | 0.38 | 46.00 | 1.75 |
| rfdetr-xlarge | 73.37 | 0.26 | 74.41 | 0.13 |
| yolo26n-640 | 7.68 | 0.04 | 9.61 | 0.43 |
| yolo26s-640 | 10.10 | 0.08 | 12.48 | 1.04 |
| yolo26m-640 | 16.43 | 0.05 | 18.10 | 0.58 |
| yolo26l-640 | 18.50 | 0.09 | 20.01 | 0.38 |
| yolo26x-640 | 30.46 | 0.14 | 33.51 | 0.17 |
## Key Takeaways
- Workflow overhead is minimal -- typically 1-5 ms on top of direct inference, depending on the model.
- For larger models (e.g. `rfdetr-xlarge`), the workflow overhead becomes negligible relative to model inference time (~1.4%).
- For smaller, faster models (e.g. `yolo26n-640`), the overhead is proportionally larger (~25%) but still under 2 ms in absolute terms.
- Workflow overhead is CPU-bound (graph scheduling, input preparation, output routing), while model inference itself typically runs on the GPU (when available). As a result, the overhead stays relatively constant regardless of GPU speed.
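The relative-overhead figures quoted above can be reproduced directly from the table. A minimal sketch (the two models chosen here are just the extremes from the table):

```python
# Benchmark means from the table above, in milliseconds.
timings = {
    "rfdetr-xlarge": {"direct": 73.37, "workflow": 74.41},
    "yolo26n-640": {"direct": 7.68, "workflow": 9.61},
}

# Map each model to (absolute overhead in ms, relative overhead in %).
overheads = {
    model: (
        t["workflow"] - t["direct"],
        100.0 * (t["workflow"] - t["direct"]) / t["direct"],
    )
    for model, t in timings.items()
}

for model, (ms, pct) in overheads.items():
    print(f"{model}: +{ms:.2f} ms ({pct:.1f}%)")
# rfdetr-xlarge: +1.04 ms (1.4%)
# yolo26n-640: +1.93 ms (25.1%)
```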
## Methodology
- Hardware: GPU-accelerated inference (results will vary by hardware).
- Warmup: Each model and workflow engine is warmed up with one inference call before timing begins.
- Iterations: 10 timed iterations per method, per model.
- Direct inference: Uses `get_model()` and calls `model.infer()` directly.
- Workflow inference: Wraps the same model in a minimal single-step workflow and runs it through the Execution Engine.
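The timing procedure described above can be sketched as a small harness. This is an illustrative sketch, not the benchmark code used here; the commented-out callables at the bottom (`model`, `image`, `engine`, `workflow`) are assumed names standing in for the actual direct-inference and Execution Engine calls:

```python
import statistics
import time


def benchmark(fn, warmup=1, iterations=10):
    """Time `fn` over `iterations` runs after `warmup` untimed calls.

    Returns (mean_ms, stddev_ms), matching the columns in the table above.
    """
    for _ in range(warmup):
        fn()  # warmup: load weights, compile kernels, fill caches
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # seconds -> ms
    return statistics.mean(samples), statistics.stdev(samples)


# Usage (hypothetical callables standing in for the real inference calls):
# direct_mean, direct_std = benchmark(lambda: model.infer(image))
# wf_mean, wf_std = benchmark(lambda: engine.run(workflow, {"image": image}))
```

Using `time.perf_counter()` rather than `time.time()` matters here: it is a monotonic, high-resolution clock, which keeps millisecond-scale measurements like these meaningful.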
## Cloud-Hosted Results
The following results were benchmarked on the hosted Serverless API. Some notes:
- Compared to the self-hosted setup, the Serverless server must also fetch the Workflow schema in addition to the model weights (direct Serverless model inference fetches only the weights).
- This adds a fixed 10-50 ms latency overhead for Workflows.