In this guide, we’ll walk you through the process of building and deploying an inference server. By the end, you’ll understand how to set up a server for inference tasks and how to deploy your models for real-time predictions.
First, we’ll create a simple Flask API that will serve as our inference server. We will load the YOLOv11 model we trained earlier and use it to perform object detection on images uploaded to the server. The full code is available on GitHub.
The meat of the application is in the /predict endpoint, where we read the image file from the request, perform inference using the YOLOv11 model, and return the results as JSON. We also measure the inference time to give us an idea of how long it takes to process an image.
For high-traffic inference it would be better to integrate a queue and callback mechanism so that requests are handled asynchronously; this would allow the server to accept multiple requests concurrently without blocking the main thread (see the sketch after the full listing below). You should also implement authentication and rate limiting to prevent abuse of the API.
from flask import Flask, request, jsonify
from PIL import Image
import io
from ultralytics import YOLO
import logging
from logging.handlers import RotatingFileHandler
import sys
import time

app = Flask(__name__)

# Set up logging
logging.basicConfig(level=logging.DEBUG)
handler = RotatingFileHandler("app.log", maxBytes=10000, backupCount=1)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
app.logger.addHandler(handler)
app.logger.addHandler(logging.StreamHandler(sys.stdout))

# Load the YOLOv11 model
try:
    model = YOLO("./bikeweights.pt")
    app.logger.info("YOLOv11 model loaded successfully")
except Exception as e:
    app.logger.error(f"Failed to load YOLOv11 model: {str(e)}")


@app.route("/up", methods=["GET"])
def up():
    return "Service is up and running"


@app.route("/predict", methods=["POST"])
def predict():
    if "file" not in request.files:
        return jsonify({"error": "No file part in the request"}), 400

    file = request.files["file"]
    if file.filename == "":
        return jsonify({"error": "No file selected for uploading"}), 400

    if file:
        # Read the image file
        img_bytes = file.read()
        img = Image.open(io.BytesIO(img_bytes))

        # Perform inference and measure time
        start_time = time.time()
        results = model(img)
        inference_time = time.time() - start_time

        # Process results
        detections = []
        for result in results:
            boxes = result.boxes.xyxy.tolist()  # box coordinates in (x1, y1, x2, y2) format
            classes = result.boxes.cls.tolist()  # class labels
            confidences = result.boxes.conf.tolist()  # confidence scores
            for box, cls, conf in zip(boxes, classes, confidences):
                detections.append(
                    {"box": box, "class": model.names[int(cls)], "confidence": conf}
                )

        return jsonify({"detections": detections, "inference_time": inference_time})


if __name__ == "__main__":
    app.run(debug=True)
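As mentioned above, a queue-based flow keeps the web process responsive under load. Below is a minimal sketch of that idea using an in-process queue.Queue and a single worker thread; the endpoint names, the job_store dict, and the placeholder run_model call are illustrative assumptions, not part of the project. A production setup would more likely use a dedicated task queue such as Celery or RQ backed by Redis, with results delivered via a callback or webhook.

import queue
import threading
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

jobs = queue.Queue()  # pending inference jobs
job_store = {}        # job_id -> {"status": ..., "result": ...}


def worker():
    # Pull jobs off the queue and run inference one at a time,
    # so the web workers never block on the model.
    while True:
        job_id, img_bytes = jobs.get()
        try:
            # result = run_model(img_bytes)  # hypothetical helper that calls the YOLO model
            result = {"detections": []}      # placeholder result for this sketch
            job_store[job_id] = {"status": "done", "result": result}
        except Exception as e:
            job_store[job_id] = {"status": "error", "error": str(e)}
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


@app.route("/predict_async", methods=["POST"])
def predict_async():
    # Accept the upload, enqueue it, and return a job id immediately.
    file = request.files["file"]
    job_id = str(uuid.uuid4())
    job_store[job_id] = {"status": "queued"}
    jobs.put((job_id, file.read()))
    return jsonify({"job_id": job_id}), 202


@app.route("/results/<job_id>", methods=["GET"])
def results(job_id):
    # Clients poll this endpoint (or you could POST results to a callback URL instead).
    return jsonify(job_store.get(job_id, {"status": "unknown"}))


if __name__ == "__main__":
    app.run()

With this pattern, a client POSTs to /predict_async, receives a job_id with a 202 status, and then polls /results/<job_id> (or registers a callback) to fetch the detections once the worker has processed the image.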
Next, we’ll create a Dockerfile to package our Flask application into a Docker container. This will make it easier to deploy the application on DigitalOcean and other cloud platforms.
# Use an official Python runtime as a parent image
FROM python:3.12-slim
# Set the working directory in the container
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
libgl1-mesa-glx \
libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the current directory contents into the container at /app
COPY . .
# Make port 5050 available to the world outside this container
EXPOSE 5050
# Update the CMD to point to the correct location of main.py
CMD ["gunicorn", "--bind", "0.0.0.0:5050", "--chdir", "/app/src/inference-http", "main:app"]
We can test the Docker container locally by building the image and running it:
docker build -t inference-http .
docker run -p 5050:5050 inference-http
Then in another terminal, we can test the API using curl (Linux / macOS):
curl --location --request POST 'http://localhost:5050/predict' \
--form 'file=@"~/Downloads/motorcycles-bikes/train/images/114.jpg"'
An example JSON response would look like this:
{
  "detections": [
    {
      "box": [
        96.507568359375,
        0.0,
        625.12353515625,
        620.3452758789062
      ],
      "class": "bike",
      "confidence": 0.8724299073219299
    },
    {
      "box": [
        592.53369140625,
        175.58175659179688,
        640.0,
        322.6899719238281
      ],
      "class": "bike",
      "confidence": 0.3001745045185089
    }
  ],
  "inference_time": 1.596113920211792
}
Initially I had planned to experiment with DigitalOcean’s new GPU droplets, but they would not give us access to them even though we have been DigitalOcean customers for over ten years. So I decided to deploy the application to Google Cloud Platform, which has a mature GPU offering.
I first deployed to a standard-2 droplet on DigitalOcean without a GPU, and as expected the CPU-only performance was not great: the inference time was around 1.5 seconds per request.
On GCP, the GPU instance I selected is an n1-standard-1 with a Tesla T4 GPU.
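For reference, an instance like that can be created with gcloud along the following lines; the instance name, zone, image family, and disk size are illustrative assumptions, so check the GCP documentation for the details (GPU instances also require a TERMINATE maintenance policy):

gcloud compute instances create inference-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-1 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=50GB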
We will deploy the application with Kamal, a simple tool for deploying web applications to servers. Please refer to the Kamal documentation for more information on how to install and use it.
bundle init
bundle add kamal
kamal init
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/(yourkey)
Alternatively, you can run Kamal in a Docker container if you don’t want to install it locally; refer to the docs.
Then we need to configure the config/deploy.yml file with content similar to this:
# Name of your application. Used to uniquely configure containers.
service: inference

# Name of the container image.
image: youruser/inference-http

# Deploy to these servers.
servers:
  web:
    hosts:
      - ipaddressorhostname
    options:
      gpus: all

# Enable SSL auto certification via Let's Encrypt (and allow for multiple apps on one server).
# If using something like Cloudflare, it is recommended to set encryption mode
# in Cloudflare's SSL/TLS setting to "Full" to enable end-to-end encryption.
proxy:
  ssl: true
  host: [yourendpoint]
  # kamal-proxy connects to your container over port 80, use `app_port` to specify a different port.
  app_port: 5050

# Credentials for your image host.
registry:
  # Specify the registry server, if you're not using Docker Hub
  # server: registry.digitalocean.com / ghcr.io / ...
  server: ghcr.io
  username: [registryusername]

  # Always use an access token rather than real password (pulled from .kamal/secrets).
  password:
    - KAMAL_REGISTRY_PASSWORD

# Configure builder setup.
builder:
  arch: amd64
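The KAMAL_REGISTRY_PASSWORD referenced above is read from the .kamal/secrets file. One common pattern, assuming you have exported a registry access token in your shell, is to pass the environment variable through:

KAMAL_REGISTRY_PASSWORD=$KAMAL_REGISTRY_PASSWORD

With the configuration and secrets in place, you can bootstrap the server and deploy: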
kamal setup
kamal deploy
However, if you do this, you will just deploy in CPU mode, which takes a long time to process each request.
You will need to do some preparation on the server to enable GPU support in Docker, which is not enabled by default.
First, run nvidia-smi and make sure you get output like this:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 54C P0 29W / 70W | 399MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 90115 C /usr/local/bin/python3.12 396MiB |
+-----------------------------------------------------------------------------------------+
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
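To confirm that Docker can now see the GPU, you can run nvidia-smi inside a throwaway CUDA container (the image tag here is just an example; any CUDA base image will do):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the same table as the host-level nvidia-smi, the container runtime is wired up correctly and the Kamal-deployed container will be able to use the GPU.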
For subsequent deployments, you can simply run kamal deploy to update the application on the server.
Let’s hit the endpoint with a test image:
curl --location --request POST 'https://inference.oslo.vision/predict' \
--form 'file=@"~/Downloads/motorcycles-bikes/train/images/114.jpg"'
{
  "detections": [
    {
      "box": [
        96.507568359375,
        0.0,
        625.1235961914062,
        620.34521484375
      ],
      "class": "bike",
      "confidence": 0.8724297881126404
    },
    {
      "box": [
        592.53369140625,
        175.58172607421875,
        640.0,
        322.6899719238281
      ],
      "class": "bike",
      "confidence": 0.3001745939254761
    }
  ],
  "inference_time": 0.049813032150268555
}
The inference time is around 50ms per request, which is a significant improvement over the CPU-only deployment!
Thanks for reading!
In this guide, we walked you through the process of building and deploying an inference server. We started by creating a simple Flask API that performs object detection using the YOLOv11 model. We then Dockerized the application and deployed it to Google Cloud Platform with GPU support. The final application is capable of processing real-time predictions with an inference time of around 50ms per request.
In the future, we will look at how to build your own server for inference and training tasks, and we will explore colocation options and related topics.