Why Not Just Use a Flask App?

A common first step for deploying a model is to wrap it in a simple web framework like Flask or FastAPI. While this works for small projects, it quickly falls short in a serious production environment. Dedicated serving platforms offer critical advantages:

  • High Performance: They are implemented in high-performance compiled languages (C++ for TensorFlow Serving, Java for TorchServe's frontend) and are optimized for low-latency inference, GPU utilization, and high throughput.
  • Batching: They can automatically batch incoming requests together to better utilize hardware (especially GPUs), dramatically increasing throughput.
  • Model Versioning: You can deploy multiple versions of a model simultaneously and control which version serves traffic, enabling safe rollouts (e.g., canary deployments) and A/B testing.
  • Concurrency: They are built to handle thousands of concurrent requests without the need for complex manual configuration.
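The batching point above can be illustrated with a toy sketch: requests that arrive within a short window are grouped and run through the model in one call. Everything here (the queue, the window size, `model_fn`) is invented for illustration; real serving platforms implement this in their native runtimes with far more sophistication.

```python
import queue
import threading
import time

def model_fn(batch):
    # Stand-in for a real model: processes every input in one vectorized call.
    return [x * 2 for x in batch]

def batching_loop(requests_q, results, max_batch=4, wait_s=0.01):
    """Collect requests for up to wait_s (or until max_batch), then run them as one batch."""
    while True:
        try:
            req_id, payload = requests_q.get(timeout=1)
        except queue.Empty:
            return  # no more traffic: shut down (toy behavior)
        batch, ids = [payload], [req_id]
        deadline = time.monotonic() + wait_s
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                req_id, payload = requests_q.get(timeout=max(deadline - time.monotonic(), 0))
                batch.append(payload)
                ids.append(req_id)
            except queue.Empty:
                break
        for req_id, out in zip(ids, model_fn(batch)):
            results[req_id] = out

requests_q = queue.Queue()
results = {}
for i, x in enumerate([1, 2, 3, 4, 5]):
    requests_q.put((i, x))
worker = threading.Thread(target=batching_loop, args=(requests_q, results))
worker.start()
worker.join()
print(results)  # every request answered: {0: 2, 1: 4, 2: 6, 3: 8, 4: 10}
```

The payoff is that `model_fn` runs twice for five requests instead of five times, which is exactly where GPUs gain throughput.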

1. TensorFlow Serving (TFS)

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.

How it Works: You first export your trained TensorFlow model into a specific file structure called the SavedModel format. Then, you point TensorFlow Serving at this model directory. It automatically loads the model and exposes RESTful and gRPC API endpoints for you to call.

Step 1: Export Your Model

Your model must be saved in the SavedModel format, the standard format for serializing TensorFlow models. The model directory must contain a numeric version subdirectory (e.g., 1/); by default, TFS loads and serves the highest version it finds.

/my_models/
└── my_classifier/
    └── 1/
        ├── saved_model.pb
        └── variables/
            ├── variables.data-00000-of-00001
            └── variables.index
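As a sanity check before pointing TFS at a directory, the layout above can be verified programmatically. This sketch builds a dummy tree in a temp directory (the paths are placeholders) and checks for the pieces TFS expects: a numeric version subdirectory containing saved_model.pb and a variables/ folder.

```python
import tempfile
from pathlib import Path

def looks_like_servable(model_dir: Path) -> bool:
    """True if model_dir has at least one numeric version subdirectory
    containing saved_model.pb and a variables/ folder, as TFS expects."""
    for version in model_dir.iterdir():
        if version.is_dir() and version.name.isdigit():
            if (version / "saved_model.pb").is_file() and (version / "variables").is_dir():
                return True
    return False

# Build a dummy tree mirroring the layout shown above.
root = Path(tempfile.mkdtemp()) / "my_classifier"
(root / "1" / "variables").mkdir(parents=True)
(root / "1" / "saved_model.pb").touch()

print(looks_like_servable(root))  # True
```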

Step 2: Start TFS with Docker (The Easiest Way)

Docker is the simplest way to run TFS. This command starts a TFS container, maps the REST API port (8501), and mounts your local model directory into the container.

Bash


# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving

# Run the server
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_models/my_classifier,target=/models/my_classifier \
  -e MODEL_NAME=my_classifier -t tensorflow/serving

Step 3: Make a Prediction Request

TFS now exposes a REST API endpoint. You can send it a POST request with your input data.

Python


import requests
import json

# Example input data for a model expecting a 1x4 feature vector
data = json.dumps({"instances": [[1.0, 2.0, 5.0, 3.0]]})

# The URL format is: http://host:port/v1/models/your_model_name:predict
url = 'http://localhost:8501/v1/models/my_classifier:predict'

response = requests.post(url, data=data)
predictions = response.json()['predictions']

print(predictions)
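The "instances" row format shown above accepts multiple inputs per request, and TFS returns one prediction per instance, which pairs naturally with server-side batching. A minimal sketch of building such a payload (the feature values are made up):

```python
import json

# Three inputs in one request; TFS would return one prediction per instance.
feature_vectors = [
    [1.0, 2.0, 5.0, 3.0],
    [0.5, 1.5, 2.5, 3.5],
    [4.0, 3.0, 2.0, 1.0],
]
payload = json.dumps({"instances": feature_vectors})
print(payload)
```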

2. TorchServe

TorchServe is the equivalent serving solution for the PyTorch ecosystem, originally developed by AWS and Meta and maintained under the PyTorch organization.

How it Works: TorchServe uses a "model archiver" tool to package everything it needs into a single .mar (Model Archive) file. This file contains the serialized model (.pt file), code for handling pre-processing and post-processing (a "handler" script), and other metadata.
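Under the hood, a .mar file is an ordinary zip archive whose manifest (MAR-INF/MANIFEST.json) records metadata such as the model name and version. The sketch below fabricates a tiny .mar-like archive purely for illustration (the file names and manifest fields are simplified), then shows how you might peek inside one:

```python
import json
import os
import tempfile
import zipfile

# Build a toy .mar-like archive for illustration only.
mar_path = os.path.join(tempfile.mkdtemp(), "my_image_classifier.mar")
manifest = {"model": {"modelName": "my_image_classifier", "modelVersion": "1.0"}}
with zipfile.ZipFile(mar_path, "w") as z:
    z.writestr("MAR-INF/MANIFEST.json", json.dumps(manifest))
    z.writestr("model_state_dict.pt", b"")  # placeholder for real weights

# Inspecting an archive: list its contents and read the manifest.
with zipfile.ZipFile(mar_path) as z:
    names = z.namelist()
    meta = json.loads(z.read("MAR-INF/MANIFEST.json"))

print(names)
print(meta["model"]["modelName"])  # my_image_classifier
```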

Step 1: Create a Handler File

A handler is a Python script that defines how requests are processed: preprocessing, inference, and postprocessing. For common tasks, TorchServe ships built-in handlers (image_classifier, text_classifier, object_detector, etc.), and Step 2 below uses the built-in image_classifier, so no custom file is needed in that case. For anything custom, subclass BaseHandler and override its methods:

Python


# handler.py -- only needed when no built-in handler fits your model
from ts.torch_handler.base_handler import BaseHandler

class MyModelHandler(BaseHandler):
    def preprocess(self, data):
        # Turn the raw request payload into a model-ready tensor.
        return super().preprocess(data)

    def postprocess(self, inference_output):
        # Turn raw model output into a JSON-serializable response.
        return super().postprocess(inference_output)

Step 2: Archive the Model

Use the torch-model-archiver command-line tool to create the .mar file.

Bash


torch-model-archiver \
  --model-name "my_image_classifier" \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model_state_dict.pt \
  --handler image_classifier \
  --extra-files index_to_name.json

Step 3: Start TorchServe and Register the Model

First, start the TorchServe server. Then, use its Management API (port 8081 by default) to load (register) your model.

Bash


# Start the server
torchserve --start --model-store /path/to/model_store

# Register the model via the Management API
curl -X POST "http://localhost:8081/models?url=my_image_classifier.mar"

Step 4: Make a Prediction

You can now send data to the inference endpoint (port 8080 by default).

Bash


curl http://localhost:8080/predictions/my_image_classifier -T my_cat_image.jpg