Image Classification vs. Object Detection
In computer vision, it's important to distinguish between these two tasks:
- Image Classification: Answers "What is in this image?" It assigns a single label to the entire image (e.g., "cat").
- Object Detection: Answers "What objects are in this image, and where are they?" It identifies multiple objects and provides their locations with bounding boxes.
Object detection models are significantly more complex than classifiers. They need to solve two problems simultaneously: locating objects and classifying them.
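To make the difference concrete, here is a minimal sketch contrasting the two output shapes. The values are purely illustrative, not the output of any real model:

```python
# Illustrative only: hypothetical outputs from the two task types.
classification_output = "cat"  # a single label for the whole image

detection_output = [
    # (class label, confidence, bounding box as x_min, y_min, x_max, y_max)
    ("cat", 0.92, (34, 50, 210, 260)),
    ("dog", 0.81, (240, 40, 400, 300)),
]

for label, score, box in detection_output:
    print(f"{label} ({score:.2f}) at {box}")
```

A classifier returns one answer per image; a detector returns a variable-length list, one entry per object it finds.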
Core Concepts of Object Detection
- Bounding Box: The rectangle drawn around a detected object. It's typically defined by four values: the coordinates of the top-left corner (x_min, y_min) and the bottom-right corner (x_max, y_max).
- Class Label: The predicted class of the object within the box (e.g., "person," "car," "dog").
- Confidence Score: A value between 0 and 1 that represents the model's confidence in its prediction. We typically ignore any detections with a score below a certain threshold (e.g., 0.5).
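The confidence-threshold filtering described above can be sketched in a few lines. The detections here are hypothetical stand-ins for model output:

```python
# A minimal sketch of confidence-score filtering; detections are hypothetical.
CONF_THRESHOLD = 0.5

detections = [
    ("person", 0.91),
    ("car", 0.48),   # below the threshold: discarded
    ("dog", 0.73),
]

# Keep only detections the model is sufficiently confident about.
kept = [(label, score) for label, score in detections if score > CONF_THRESHOLD]
print(kept)  # the low-confidence "car" detection is dropped
```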
The Practical Path: Using a Pre-trained Detector
Training an object detection model like YOLO (You Only Look Once) or SSD (Single Shot MultiBox Detector) from scratch is a monumental task that requires massive, meticulously labeled datasets and weeks of GPU time.
For most practical applications, the workflow is to use a model that has already been pre-trained on a large dataset like COCO (Common Objects in Context). We can then use this model directly for inference.
Let's walk through an end-to-end example using Python and OpenCV's DNN (Deep Neural Network) module, which makes it surprisingly easy to load and run pre-trained models.
End-to-End Example
Our goal is to take an input image and produce an output image with all detected objects outlined and labeled.
Step 1: Get a Pre-trained Model
We'll use a MobileNet-SSD model, which is lightweight and fast. You'll need two files:
- A .pb file containing the frozen model weights.
- A .pbtxt file describing the model's architecture. (These files are widely available online for download).
Step 2: The Code
The code involves four main steps:
- Load the model and class names.
- Prepare the image by converting it into a blob that the model expects. A blob is a 4D tensor with a specific size, scale, and color ordering.
- Perform a forward pass through the network to get the raw detection data.
- Post-process the results by looping through the detections, filtering out weak ones, and drawing the final boxes.
Code Snippet: Object Detection with Python and OpenCV
```python
import cv2

# --- CONFIGURATION ---
MODEL_PATH = "frozen_inference_graph.pb"
CONFIG_PATH = "ssd_mobilenet_v3_large_coco_2020_01_14.pbtxt"
CLASSES_PATH = "coco_class_labels.txt"
CONF_THRESHOLD = 0.5  # Ignore detections below this confidence

# --- 1. LOAD THE MODEL AND CLASS LABELS ---
print("[INFO] Loading model...")
net = cv2.dnn.readNetFromTensorflow(MODEL_PATH, CONFIG_PATH)
with open(CLASSES_PATH, "rt") as f:
    class_names = f.read().rstrip("\n").split("\n")

# --- LOAD THE INPUT IMAGE ---
image = cv2.imread("example_image.jpg")
if image is None:
    raise FileNotFoundError("Could not read example_image.jpg")
(h, w) = image.shape[:2]

# --- 2. PREPARE IMAGE BLOB ---
# The blob size must match the model's expected input; this
# MobileNet-SSD v3 config was exported for 320x320 inputs.
blob = cv2.dnn.blobFromImage(image, size=(320, 320), swapRB=True, crop=False)

# --- 3. PERFORM INFERENCE ---
print("[INFO] Performing detection...")
net.setInput(blob)
detections = net.forward()

# --- 4. POST-PROCESS AND DRAW RESULTS ---
# 'detections' is an array of shape (1, 1, N, 7), where N is the number
# of candidate detections. Each detection holds 7 values:
# [batch_id, class_id, confidence, x_min, y_min, x_max, y_max],
# with the box coordinates normalized to [0, 1].
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    # Filter out weak detections
    if confidence > CONF_THRESHOLD:
        # Get the class label and scale the box back to pixel coordinates
        class_id = int(detections[0, 0, i, 1])
        x_min = int(detections[0, 0, i, 3] * w)
        y_min = int(detections[0, 0, i, 4] * h)
        x_max = int(detections[0, 0, i, 5] * w)
        y_max = int(detections[0, 0, i, 6] * h)

        # Prepare the label text
        label = f"{class_names[class_id]}: {confidence:.2f}"

        # Draw the bounding box and label on the image
        cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
        cv2.putText(image, label, (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# --- DISPLAY THE OUTPUT ---
cv2.imshow("Output", image)
cv2.waitKey(0)
cv2.destroyAllWindows()
```
A key post-processing step that is often needed (but sometimes handled internally by the library) is Non-Max Suppression (NMS). This algorithm takes multiple, overlapping boxes for the same object and suppresses all but the one with the highest confidence, cleaning up the final output.
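OpenCV exposes this via cv2.dnn.NMSBoxes, but the idea is simple enough to sketch in pure Python. The following is a simplified illustration of the algorithm (greedy suppression by IoU overlap), not a production implementation; the boxes and scores are invented for the example:

```python
# A simplified, pure-Python sketch of Non-Max Suppression.
# Boxes are (x_min, y_min, x_max, y_max); scores align by index.

def iou(a, b):
    # Intersection-over-Union: overlap area / combined area of two boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedily keep the highest-scoring box, then drop any remaining
    # box that overlaps it beyond the threshold; repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two overlapping boxes for the same object, plus one distinct box.
boxes = [(10, 10, 110, 110), (15, 12, 115, 108), (200, 200, 300, 300)]
scores = [0.9, 0.75, 0.8]
print(nms(boxes, scores))  # the weaker overlapping box is suppressed
```

The two heavily overlapping boxes collapse to the single highest-confidence one, while the distinct box survives untouched.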