Build and Deploy an Inference Server

Introduction

In this guide, we’ll walk through building and deploying an inference server. By the end, you’ll understand how to set up a server for inference tasks and how to deploy your models for real-time predictions.

Step 1: Simple Flask API

First, we’ll create a simple Flask API that will serve as our inference server. We will load the YOLOv11 model we trained earlier and use it to perform object detection on images uploaded to the server. The full code is available on GitHub.

The meat of the application is in the /predict endpoint, where we read the image file from the request, perform inference using the YOLOv11 model, and return the results as JSON. We also measure the inference time to give us an idea of how long it takes to process an image.

For high-traffic inference, it would be better to integrate a queue and callback mechanism to handle requests asynchronously. This would let the server accept multiple requests concurrently without blocking the request thread.
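
As a rough illustration of that idea, here is a minimal sketch (not part of the server we deploy in this guide) that wraps a blocking inference call with a single background worker and an in-memory job store; the names here are hypothetical:

import queue
import threading
import uuid


def make_async_predictor(infer):
    """Wrap a blocking `infer` callable with a background worker and a job queue."""
    jobs = {}                  # job_id -> result (None while still pending)
    job_queue = queue.Queue()  # pending payloads waiting for the worker

    def worker():
        while True:
            job_id, payload = job_queue.get()
            jobs[job_id] = infer(payload)  # run blocking inference off the request thread
            job_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def enqueue(payload):
        # Return a job id immediately; the client polls (or receives a callback) later
        job_id = str(uuid.uuid4())
        jobs[job_id] = None
        job_queue.put((job_id, payload))
        return job_id

    return enqueue, jobs

In production you would more likely reach for a proper task queue (Celery, RQ) or a dedicated serving framework, but the shape of the solution is the same.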

You should also implement authentication and rate limiting to prevent abuse of the API.
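
Neither is shown in the code below, but a minimal sketch of an API-key check with a naive in-memory rate limit might look like this; the header name, key set, and limits are made up for illustration:

import time
from functools import wraps

from flask import jsonify, request

API_KEYS = {"change-me"}  # hypothetical set of valid keys
RATE_LIMIT = 30           # max requests per key per minute
_request_log = {}         # key -> timestamps of recent requests


def require_api_key(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        key = request.headers.get("X-API-Key")
        if key not in API_KEYS:
            return jsonify({"error": "Invalid or missing API key"}), 401
        now = time.time()
        recent = [t for t in _request_log.get(key, []) if now - t < 60]
        if len(recent) >= RATE_LIMIT:
            return jsonify({"error": "Rate limit exceeded"}), 429
        _request_log[key] = recent + [now]
        return f(*args, **kwargs)
    return wrapper

You would then decorate the /predict route with @require_api_key (an extension such as Flask-Limiter is a more robust choice for real traffic). With those caveats noted, here is the full application: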

from flask import Flask, request, jsonify
from PIL import Image
import io
from ultralytics import YOLO
import logging
from logging.handlers import RotatingFileHandler
import sys
import time

app = Flask(__name__)

# Set up logging
logging.basicConfig(level=logging.DEBUG)
handler = RotatingFileHandler("app.log", maxBytes=10000, backupCount=1)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
app.logger.addHandler(handler)
app.logger.addHandler(logging.StreamHandler(sys.stdout))

# Load the YOLOv11 model
try:
    model = YOLO("./bikeweights.pt")
    app.logger.info("YOLOv11 model loaded successfully")
except Exception as e:
    app.logger.error(f"Failed to load YOLOv11 model: {str(e)}")
    model = None  # allow the app to start; /predict will report the failure


@app.route("/up", methods=["GET"])
def up():
    return "Service is up and running"


@app.route("/predict", methods=["POST"])
def predict():
    if model is None:
        return jsonify({"error": "Model is not available"}), 503

    if "file" not in request.files:
        return jsonify({"error": "No file part in the request"}), 400

    file = request.files["file"]
    if file.filename == "":
        return jsonify({"error": "No file selected for uploading"}), 400

    if file:
        # Read the image file
        img_bytes = file.read()
        img = Image.open(io.BytesIO(img_bytes))

        # Perform inference and measure time
        start_time = time.time()
        results = model(img)
        inference_time = time.time() - start_time

        # Process results
        detections = []
        for result in results:
            boxes = (
                result.boxes.xyxy.tolist()
            )  # box coordinates in (x1, y1, x2, y2) format
            classes = result.boxes.cls.tolist()  # get class labels
            confidences = result.boxes.conf.tolist()  # get confidence scores

            for box, cls, conf in zip(boxes, classes, confidences):
                detections.append(
                    {"box": box, "class": model.names[int(cls)], "confidence": conf}
                )

        return jsonify({"detections": detections, "inference_time": inference_time})

    # Fallback for any other invalid upload
    return jsonify({"error": "Invalid file uploaded"}), 400


if __name__ == "__main__":
    app.run(debug=True)

Step 2: Dockerize the Application

Next, we’ll create a Dockerfile to package our Flask application into a Docker container. This will make it easier to deploy the application to DigitalOcean, Google Cloud, and other platforms.

# Use an official Python runtime as a parent image
FROM python:3.12-slim

# Set the working directory in the container
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the current directory contents into the container at /app
COPY . .

# Make port 5050 available to the world outside this container
EXPOSE 5050

# Run the app with gunicorn, pointing at main.py inside src/inference-http
CMD ["gunicorn", "--bind", "0.0.0.0:5050", "--chdir", "/app/src/inference-http", "main:app"]
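
The Dockerfile expects a requirements.txt next to it. A minimal one for this app might look like the following (versions left unpinned for brevity; pin them for a real deployment):

flask
gunicorn
ultralytics
pillow

ultralytics pulls in PyTorch and OpenCV, which is why libgl1-mesa-glx and libglib2.0-0 are installed as system dependencies above.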

We can test the Docker container locally by building the image and running it:

docker build -t inference-http .
docker run -p 5050:5050 inference-http

Then, in another terminal, we can test the API using curl (Linux/macOS):

curl --location --request POST 'http://localhost:5050/predict' \
--form "file=@$HOME/Downloads/motorcycles-bikes/train/images/114.jpg"
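
The same request from Python, using the requests library (the image path is just an example):

import requests

url = "http://localhost:5050/predict"
with open("114.jpg", "rb") as f:  # any test image
    response = requests.post(url, files={"file": f})

print(response.json())  # detections and inference_time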

An example JSON response looks like this:

{
  "detections": [
    {
      "box": [
        96.507568359375,
        0.0,
        625.12353515625,
        620.3452758789062
      ],
      "class": "bike",
      "confidence": 0.8724299073219299
    },
    {
      "box": [
        592.53369140625,
        175.58175659179688,
        640.0,
        322.6899719238281
      ],
      "class": "bike",
      "confidence": 0.3001745045185089
    }
  ],
  "inference_time": 1.596113920211792
}

Step 3: Deploy to Google Cloud Platform

Initially I had planned to experiment with DigitalOcean’s new GPU droplets, but they would not give us access to them, despite our having been DigitalOcean customers for over ten years. So I decided to deploy the application to Google Cloud Platform, which has a mature GPU offering.

I first deployed to a standard-2 droplet on DigitalOcean without a GPU and, as expected, CPU-only performance is not great: the inference time is around 1.5 seconds per request.

On GCP, the GPU instance I selected is an n1-standard-1 with a Tesla T4 GPU.
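
For reference, an instance along those lines can be created with the gcloud CLI; the instance name, zone, image, and disk size below are placeholders to adjust for your project:

gcloud compute instances create inference-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-1 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=50GB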

We will deploy the application with Kamal, a simple tool for deploying web applications to servers.

Step 4: Install Kamal

Please refer to the Kamal documentation for more information on how to install and use Kamal.

bundle init
bundle add kamal
kamal init
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/(yourkey)

Alternatively, you can run Kamal in a Docker container if you don’t want to install it locally; refer to the docs.

Then we need to configure the config/deploy.yml file with similar content:

# Name of your application. Used to uniquely configure containers.
service: inference

# Name of the container image.
image: youruser/inference-http

# Deploy to these servers.
servers:
  web:
    hosts:
      - ipaddressorhostname
    options:
      gpus: all

# Enable SSL auto certification via Let's Encrypt (and allow for multiple apps on one server).
# If using something like Cloudflare, it is recommended to set encryption mode
# in Cloudflare's SSL/TLS setting to "Full" to enable end-to-end encryption.
proxy:
  ssl: true
  host: [yourendpoint]
  # kamal-proxy connects to your container over port 80, use `app_port` to specify a different port.
  app_port: 5050

# Credentials for your image host.
registry:
  # Specify the registry server, if you're not using Docker Hub
  # server: registry.digitalocean.com / ghcr.io / ...
  server: ghcr.io
  username: [registryusername]

  # Always use an access token rather than real password (pulled from .kamal/secrets).
  password:
    - KAMAL_REGISTRY_PASSWORD

# Configure builder setup.
builder:
  arch: amd64
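
The KAMAL_REGISTRY_PASSWORD referenced above is read from .kamal/secrets. A minimal entry that forwards a registry token (for ghcr.io, a GitHub personal access token) from an environment variable in your shell would look like this, assuming you have exported it locally:

KAMAL_REGISTRY_PASSWORD=$KAMAL_REGISTRY_PASSWORD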

Step 5: Deploy the Application

kamal setup
kamal deploy

However, if you deploy now, the application will run in CPU-only mode, which takes much longer to process each request.

You will need to do some preparation on the server to enable GPU support in Docker, which is not enabled by default.

  1. Make sure NVIDIA drivers are installed on your server (test with nvidia-smi; you should see output like this):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   54C    P0             29W /   70W |     399MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     90115      C   /usr/local/bin/python3.12                     396MiB |
+-----------------------------------------------------------------------------------------+
  2. Make sure Docker is installed on your server. If not, you can install it with the following commands:
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
  3. Install the NVIDIA container toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt -y install nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
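
Before redeploying, you can sanity-check that containers can now see the GPU, for example with:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

You should see the same nvidia-smi table as on the host.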

For subsequent deployments, you can simply run kamal deploy to update the application on the server.

Let’s hit the endpoint with a test image:

curl --location --request POST 'https://inference.oslo.vision/predict' \
--form "file=@$HOME/Downloads/motorcycles-bikes/train/images/114.jpg"
{
  "detections": [
    {
      "box": [
        96.507568359375,
        0.0,
        625.1235961914062,
        620.34521484375
      ],
      "class": "bike",
      "confidence": 0.8724297881126404
    },
    {
      "box": [
        592.53369140625,
        175.58172607421875,
        640.0,
        322.6899719238281
      ],
      "class": "bike",
      "confidence": 0.3001745939254761
    }
  ],
  "inference_time": 0.049813032150268555
}

The inference time is now around 50 ms per request, which is a significant improvement over the CPU-only deployment!

Conclusion

In this guide, we walked through the process of building and deploying an inference server. We started by creating a simple Flask API that performs object detection using the YOLOv11 model, then Dockerized the application and deployed it to Google Cloud Platform with GPU support. The final application serves real-time predictions with an inference time of around 50 ms per request.

In a future post, we will look at how to build your own server for inference and training tasks, and explore colocation options and related topics.

Thanks for reading!