Python Inference SDK

The inference-sdk Python package provides client and utility implementations for interfacing with Roboflow Inference. You can use it to develop with Roboflow Inference whether you are targeting the Serverless / Hosted API, a Dedicated Deployment, a self-hosted server, or an edge device.

The InferenceHTTPClient enables you to interact with an Inference Server over HTTP -- hosted either by Roboflow or on your own hardware.

pip install inference-sdk

Quickstart

You can run inference on images from URLs, file paths, PIL images, and NumPy arrays.

from inference_sdk import InferenceHTTPClient
import os

image_url = "https://media.roboflow.com/inference/soccer.jpg"

client = InferenceHTTPClient(
    api_url="https://serverless.roboflow.com",
    api_key=os.environ["API_KEY"],
)

results = client.infer(image_url, model_id="soccer-players-5fuqs/1")
print(results)

On the first request, the model weights are downloaded and loaded by the inference server, so the request may take some time depending on your network connection and the size of the model. Once the model is cached, subsequent requests are much faster.

Self-Hosted Server

You can also self-host the Inference server and then change api_url in the InferenceHTTPClient:

client = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key=os.environ["API_KEY"],
)

AsyncIO Client

The client also exposes async variants of its methods; for example, infer_async:

import asyncio
from inference_sdk import InferenceHTTPClient

CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY",  # replace with your Roboflow API key
)

image_url = "https://source.roboflow.com/pwYAXv9BTpqLyFfgQoPZ/u48G0UpWfk8giSw7wrU8/original.jpg"

result = asyncio.run(
    CLIENT.infer_async(image_url, model_id="soccer-players-5fuqs/1")
)

Parallel / Batch Inference

You may want to predict against multiple images in a single call. Two parameters of InferenceConfiguration control batching and parallelism:

  • max_concurrent_requests — the maximum number of HTTP requests issued concurrently
  • max_batch_size — the maximum number of images packed into a single request

from inference_sdk import InferenceHTTPClient

image_url = "https://source.roboflow.com/pwYAXv9BTpqLyFfgQoPZ/u48G0UpWfk8giSw7wrU8/original.jpg"

CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY"
)
predictions = CLIENT.infer([image_url] * 5, model_id="soccer-players-5fuqs/1")

print(predictions)
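To make the batching behavior concrete, here is an illustrative sketch of how a max_batch_size setting splits a list of inputs into request-sized batches. This is not the SDK's internal code, just the chunking idea it describes.

```python
from itertools import islice

def chunk(items, max_batch_size):
    """Split inputs into batches of at most max_batch_size (illustrative only)."""
    it = iter(items)
    while batch := list(islice(it, max_batch_size)):
        yield batch

images = ["img1", "img2", "img3", "img4", "img5"]
batches = list(chunk(images, max_batch_size=2))
print(batches)  # [['img1', 'img2'], ['img3', 'img4'], ['img5']]
```

With max_concurrent_requests > 1, the resulting batches can then be dispatched in parallel rather than one after another.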

Methods that support batching / parallelism:

  • infer(...) and infer_async(...)
  • ocr_image(...) and ocr_image_async(...) (enforcing max_batch_size=1)
  • detect_gazes(...) and detect_gazes_async(...)
  • get_clip_image_embeddings(...) and get_clip_image_embeddings_async(...)

Inference Against Stream

You can infer against video or a directory of images:

from inference_sdk import InferenceHTTPClient

CLIENT = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="ROBOFLOW_API_KEY"
)
for frame_id, frame, prediction in CLIENT.infer_on_stream("video.mp4", model_id="soccer-players-5fuqs/1"):
    # frame_id - sequential number of the frame
    # frame - np.ndarray with the video frame
    # prediction - prediction from the model
    pass

for file_path, image, prediction in CLIENT.infer_on_stream("local/dir/", model_id="soccer-players-5fuqs/1"):
    # file_path - path to the image
    # image - np.ndarray with the image
    # prediction - prediction from the model
    pass

What is Returned as a Prediction?

inference-sdk returns plain Python dictionaries that mirror the responses from the model-serving API. The client modifies them in only two cases: the visualization key, which holds the server-generated prediction visualization (and can be transcoded to a format of your choice), and client-side re-scaling of coordinates.
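As a sketch of working with such a dictionary, the snippet below filters detections by confidence. The exact schema depends on the model type; the field names here (predictions, x, y, width, height, confidence, class) are assumed from typical Roboflow object-detection output, and the values are made up for illustration.

```python
# Example response shaped like a typical object-detection result
# (field names and values are illustrative, not a real server response).
response = {
    "image": {"width": 1280, "height": 720},
    "predictions": [
        {"x": 320.0, "y": 180.0, "width": 64.0, "height": 128.0,
         "confidence": 0.91, "class": "player"},
        {"x": 640.0, "y": 360.0, "width": 32.0, "height": 32.0,
         "confidence": 0.58, "class": "ball"},
    ],
}

# Keep only confident detections and extract their class labels.
labels = [p["class"] for p in response["predictions"] if p["confidence"] > 0.8]
print(labels)  # ['player']
```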