Benchmarking Inference
The inference benchmark command offers an easy way to check the performance of inference in your setup. It can benchmark both the inference server and the inference Python package.
To see details of the command, run inference benchmark --help. Help is also available for each sub-command, for example inference benchmark api-speed --help.
Benchmarking the inference Python package
Make sure the inference package is installed: pip install inference
A basic benchmark can be run using the following command:
inference benchmark python-package-speed \
-m {your_model_id} \
-d {pre-configured dataset name or path to directory with images} \
-o {output_directory}
The command runs a specified number of inferences using the selected model and saves statistics (including benchmark parameters, throughput, latency, errors, and platform details) in the specified output directory.
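For illustration, the kind of summary statistics the benchmark reports (throughput and latency percentiles) can be derived from a list of per-request latencies. The helper below is a hypothetical sketch of that computation, not part of the inference package:

```python
import statistics


def summarize_latencies(latencies_s):
    """Compute a throughput/latency summary from per-request latencies (in seconds)."""
    n = len(latencies_s)
    total_time = sum(latencies_s)
    sorted_lat = sorted(latencies_s)
    # Index of the ~95th percentile in the sorted latencies (simple nearest-rank style)
    p95_index = max(0, int(0.95 * n) - 1)
    return {
        "requests": n,
        "throughput_rps": n / total_time,
        "avg_latency_ms": 1000 * statistics.mean(latencies_s),
        "p95_latency_ms": 1000 * sorted_lat[p95_index],
    }


# Example: four sequential requests taking 50 ms each
print(summarize_latencies([0.05, 0.05, 0.05, 0.05]))
```

This is a simplified single-client view; the real benchmark also tracks errors and platform details, as noted above.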
Benchmarking the inference server
Before benchmarking the API of your local inference server, make sure the server is up and running: inference server start
A basic benchmark can be run using the following command:
inference benchmark api-speed \
-m {your_model_id} \
-d {pre-configured dataset name or path to directory with images} \
-o {output_directory}
The command runs a specified number of inferences using the selected model and saves statistics (including benchmark parameters, throughput, latency, errors, and platform details) in the specified output directory.
This benchmark has more configuration options to support different ways of HTTP API profiling. In the default mode, a single client is spawned and sends requests sequentially, one after another. This may be suboptimal in specific cases, so you can specify the number of concurrent clients using the -c {number_of_clients} option. Each client sends its next request once the previous one has been handled.
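The closed-loop behaviour behind -c can be sketched as follows: each simulated client loops, issuing its next request only after the previous response returns. The send_request callable below is a stand-in for an actual HTTP call, not real inference API code:

```python
import threading
import time


def run_closed_loop(send_request, num_clients, requests_per_client):
    """Spawn num_clients threads; each sends its requests sequentially (closed loop)."""
    latencies = []
    lock = threading.Lock()

    def client():
        for _ in range(requests_per_client):
            start = time.perf_counter()
            send_request()  # the next request is sent only after this one completes
            elapsed = time.perf_counter() - start
            with lock:
                latencies.append(elapsed)

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies


# Example with a stand-in request that just sleeps for 10 ms:
lat = run_closed_loop(lambda: time.sleep(0.01), num_clients=4, requests_per_client=5)
print(len(lat))  # 4 clients * 5 requests = 20 measurements
```

With this pattern, the request rate is bounded by server latency: slow responses automatically slow the clients down.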
For scenarios closer to production environments, where multiple clients send requests concurrently, the --rps {value} option can be used (and -c will be ignored). The value provided in --rps specifies how many requests are to be spawned each second, without waiting for previous requests to be handled. In I/O-intensive benchmark scenarios, we suggest running the command from multiple separate processes, and possibly from multiple hosts.
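The open-loop pattern behind --rps can be sketched with asyncio: requests are launched on a fixed schedule regardless of whether earlier ones have finished. Again, the request coroutine here is a hypothetical stand-in for an HTTP call:

```python
import asyncio
import time


async def run_open_loop(send_request, rps, duration_s):
    """Launch send_request() rps times per second, without awaiting completion first."""
    tasks = []
    interval = 1.0 / rps
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        # Fire the request and immediately move on to scheduling the next one
        tasks.append(asyncio.create_task(send_request()))
        await asyncio.sleep(interval)
    return await asyncio.gather(*tasks)


async def fake_request():
    await asyncio.sleep(0.05)  # stand-in for an HTTP call to the server
    return "ok"


results = asyncio.run(run_open_loop(fake_request, rps=20, duration_s=0.5))
print(len(results))  # roughly rps * duration_s requests issued
```

Unlike the closed-loop mode, a slow server does not throttle this load generator, which is why the load itself (the event loop and sockets) can become the bottleneck; spreading the benchmark across processes and hosts avoids that.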