Inference SDK
The InferenceHTTPClient lets you interact with an Inference Server over HTTP, hosted either by Roboflow or on your own hardware. Install the inference-sdk package with pip:
pip install inference-sdk
Quickstart
You can run inference on images from URLs, file paths, PIL images, and NumPy arrays.
from inference_sdk import InferenceHTTPClient
import os
image_url = "https://media.roboflow.com/inference/soccer.jpg"
client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",
    api_key=os.environ["API_KEY"],
)
results = client.infer(image_url, model_id="soccer-players-5fuqs/1")
print(results)
On the first request, the model weights are downloaded and set up on the Inference Server. This request may take a while, depending on your network connection and the size of the model. Once the model has been downloaded, subsequent requests are much faster. You can also pre-load models and manage loaded weights to control this process.
Self-Hosted server
You can also self-host the Inference server (see Inference CLI), and then change api_url in the InferenceHTTPClient:
client = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key=os.environ["API_KEY"],
)
AsyncIO client
import asyncio
from inference_sdk import InferenceHTTPClient
# Replace ROBOFLOW_API_KEY with your Roboflow API Key
CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY"
)
image_url = "https://source.roboflow.com/pwYAXv9BTpqLyFfgQoPZ/u48G0UpWfk8giSw7wrU8/original.jpg"
# asyncio.run creates an event loop, awaits the coroutine, and closes the loop
result = asyncio.run(
    CLIENT.infer_async(image_url, model_id="soccer-players-5fuqs/1")
)
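The async client pays off when several requests are in flight at once. The following is a minimal sketch of that pattern using only the standard library. fake_infer_async is a hypothetical stand-in for CLIENT.infer_async so that the example runs without a server, and the limit parameter and example.com URLs are assumptions made for illustration.

```python
import asyncio


async def fake_infer_async(image_url: str, model_id: str) -> dict:
    """Hypothetical stand-in for CLIENT.infer_async; returns a dummy result."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"image": image_url, "model_id": model_id, "predictions": []}


async def infer_many(image_urls, model_id, limit=2):
    """Send one request per image, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def infer_one(url):
        async with semaphore:
            return await fake_infer_async(url, model_id=model_id)

    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(*(infer_one(url) for url in image_urls))


urls = [f"https://example.com/image-{i}.jpg" for i in range(5)]
results = asyncio.run(infer_many(urls, model_id="soccer-players-5fuqs/1"))
print(len(results))
```

With a real client, the body of infer_one would await CLIENT.infer_async(url, model_id=model_id) instead of the stand-in.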
Parallel / Batch inference
You may want to predict against multiple images in a single call. There are two parameters of InferenceConfiguration that specify batching and parallelism options:
- max_concurrent_requests -- maximum number of concurrent requests that can be started
- max_batch_size -- maximum number of elements that can be injected into a single request
These options enable the following improvements:
- If you run the Inference container with the API on-premises on a powerful GPU machine, setting max_batch_size appropriately may improve performance and throughput.
- If you run inference against the hosted Roboflow API, setting max_concurrent_requests lets multiple images be served at once, improving performance and throughput.
- Combining both options can benefit clients that run the Inference container with the API on a cluster of machines: the load on a single node can be optimised while parallel requests are sent to different nodes.
from inference_sdk import InferenceHTTPClient
image_url = "https://source.roboflow.com/pwYAXv9BTpqLyFfgQoPZ/u48G0UpWfk8giSw7wrU8/original.jpg"
# Replace ROBOFLOW_API_KEY with your Roboflow API Key
CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY"
)
predictions = CLIENT.infer([image_url] * 5, model_id="soccer-players-5fuqs/1")
print(predictions)
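To build intuition for how max_batch_size and max_concurrent_requests interact, the sketch below splits a list of inputs into batches and then groups those batches into rounds of concurrent requests. It is an illustration only, not the SDK's implementation: split_into_batches and plan_requests are hypothetical helpers, and the real client may schedule requests differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], max_batch_size: int) -> List[List[T]]:
    """Split items into consecutive batches of at most max_batch_size elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(items[i:i + max_batch_size])
        for i in range(0, len(items), max_batch_size)
    ]


def plan_requests(
    items: Sequence[T], max_batch_size: int, max_concurrent_requests: int
) -> List[List[List[T]]]:
    """Group batches into rounds of at most max_concurrent_requests requests."""
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    batches = split_into_batches(items, max_batch_size)
    return [
        batches[i:i + max_concurrent_requests]
        for i in range(0, len(batches), max_concurrent_requests)
    ]


images = [f"image-{i}.jpg" for i in range(5)]
plan = plan_requests(images, max_batch_size=2, max_concurrent_requests=2)
for round_number, round_batches in enumerate(plan):
    print(round_number, round_batches)
```

In this sketch, five images with max_batch_size=2 become three requests, and with max_concurrent_requests=2 the first two requests run together while the third waits for a free slot.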
Methods that support batching / parallelism:
- infer(...) and infer_async(...)
- ocr_image(...) and ocr_image_async(...) (enforcing max_batch_size=1)
- detect_gazes(...) and detect_gazes_async(...)
- get_clip_image_embeddings(...) and get_clip_image_embeddings_async(...)
The client also supports core foundation models (CLIP, DocTR, Gaze) and running Workflows for multi-step pipelines.
Inference against stream
You can infer against video or a directory of images:
from inference_sdk import InferenceHTTPClient
# Replace ROBOFLOW_API_KEY with your Roboflow API Key
CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY"
)
for frame_id, frame, prediction in CLIENT.infer_on_stream("video.mp4", model_id="soccer-players-5fuqs/1"):
    # frame_id - the number of the frame
    # frame - np.ndarray with the video frame
    # prediction - prediction from the model
    pass
for file_path, image, prediction in CLIENT.infer_on_stream("local/dir/", model_id="soccer-players-5fuqs/1"):
    # file_path - path to the image
    # image - np.ndarray with the image
    # prediction - prediction from the model
    pass
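In practice the loop body usually aggregates something from each prediction. The sketch below tallies detected classes across a stream. fake_stream is a hypothetical stand-in for CLIENT.infer_on_stream(...) so the example runs offline, and it assumes an object-detection style response in which each prediction dictionary has a "predictions" list whose entries carry a "class" key.

```python
from collections import Counter


def fake_stream():
    """Hypothetical stand-in for CLIENT.infer_on_stream(...); frames are omitted."""
    yield 0, None, {"predictions": [{"class": "player"}, {"class": "ball"}]}
    yield 1, None, {"predictions": [{"class": "player"}, {"class": "player"}]}
    yield 2, None, {"predictions": []}


def count_classes(stream) -> Counter:
    """Count how often each class appears across all items of the stream."""
    counts = Counter()
    for _frame_id, _frame, prediction in stream:
        # Assumed response shape: a "predictions" list of detection dicts
        for detection in prediction.get("predictions", []):
            counts[detection["class"]] += 1
    return counts


totals = count_classes(fake_stream())
print(dict(totals))
```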
What is actually returned as prediction?
The client returns plain Python dictionaries, which are the responses from the model serving API. They are modified in only two respects: the visualization key, which holds the server-generated prediction visualisation (it can be transcoded to the format of your choice), and client-side re-scaling.
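Because the result is a plain dictionary, ordinary Python is enough to post-process it. The following sketch filters detections by confidence and converts boxes to corner coordinates. The sample_response shape is an assumption modelled on an object-detection response (center-based x and y with width and height, plus confidence and class); other model types return different keys, and to_corner_boxes is a hypothetical helper rather than part of the SDK.

```python
# Assumed response shape for an object-detection model; values are made up.
sample_response = {
    "image": {"width": 640, "height": 480},
    "predictions": [
        {"x": 100.0, "y": 50.0, "width": 40.0, "height": 20.0,
         "confidence": 0.9, "class": "player"},
        {"x": 300.0, "y": 200.0, "width": 10.0, "height": 10.0,
         "confidence": 0.3, "class": "ball"},
    ],
}


def to_corner_boxes(response: dict, min_confidence: float = 0.5) -> list:
    """Return (class, x_min, y_min, x_max, y_max) for confident detections."""
    boxes = []
    for detection in response.get("predictions", []):
        if detection["confidence"] < min_confidence:
            continue  # skip low-confidence detections
        half_w = detection["width"] / 2
        half_h = detection["height"] / 2
        # x and y are assumed to be the box center
        boxes.append((
            detection["class"],
            detection["x"] - half_w,
            detection["y"] - half_h,
            detection["x"] + half_w,
            detection["y"] + half_h,
        ))
    return boxes


boxes = to_corner_boxes(sample_response)
print(boxes)
```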