Advanced Object Tracking using YOLOv11 and SAM2

Deep Learning · Computer Vision · AI

Wednesday, November 13, 2024

Introduction

My team and I developed an innovative solution for Challenge 2: Object Detection at TechFest 2024 by Qualcomm VisionX. The project integrates YOLOv11 for real-time object detection, BoT-SORT for object tracking, and SAM2 (Segment Anything Model 2) for precise object segmentation, showcasing cutting-edge advances in computer vision.

The objective of this solution is to deliver a robust and versatile object detection and tracking system capable of operating in diverse environments. From surveillance to autonomous driving, the solution is tailored for applications requiring high precision, real-time performance, and adaptability to challenging conditions.

Objectives

The project is designed with the following goals:

  1. High Precision: Detect objects with minimal false positives and negatives.
  2. Real-Time Performance: Achieve speeds suitable for dynamic applications.
  3. Versatility: Operate effectively under varying lighting, angles, and occlusions.
  4. Real-World Applications: Demonstrate use cases in surveillance, autonomous vehicles, and retail automation.
  5. Object Tracking: Maintain object identities across video frames.

Why YOLOv11?

YOLOv11 is the latest iteration in the YOLO series, renowned for its balance between speed and accuracy.

  1. Real-Time Detection: Capable of processing images and videos swiftly without sacrificing accuracy.
  2. High Precision: Excels in detecting small and overlapping objects, reducing false positives and negatives.
  3. End-to-End Design: Utilizes a unified architecture to predict class probabilities, bounding box coordinates, and confidence scores simultaneously.
  4. Scalability: Supports diverse object classes and environments, making it adaptable for various applications.
  5. Efficiency: Outperforms traditional methods like Faster R-CNN, particularly in dynamic scenarios such as surveillance and autonomous driving.

Why SAM2?

SAM2 (Segment Anything Model 2) complements YOLOv11 by enhancing object segmentation capabilities:

  1. Superior Segmentation: Achieves pixel-level precision, crucial for cluttered or occluded scenes.
  2. Zero-Shot Learning: Segments novel object categories without additional training.
  3. Enhanced Localization: Provides precise object boundaries, aiding in accurate localization.
  4. Integration Benefits: When combined with YOLOv11, SAM2 significantly boosts accuracy in multi-object scenarios.

Methodology

The project employs a multi-step approach, integrating detection, tracking, and segmentation.

Data Preparation

  • Input images or video frames are preprocessed to align with YOLOv11 and SAM2 requirements.
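
As a minimal sketch of what this step might involve (frame size and color handling are assumptions; the Ultralytics API also accepts raw frames and handles resizing internally):

```python
import cv2

def preprocess_frame(frame, size=1088):
    """Resize a BGR video frame to the square input size used in training (assumed 1088x1088)."""
    resized = cv2.resize(frame, (size, size))
    return cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)  # most models expect RGB input

# Example: read one frame from a video file (path is hypothetical)
cap = cv2.VideoCapture("basketball.mp4")
ok, frame = cap.read()
if ok:
    model_input = preprocess_frame(frame)
cap.release()
```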

Custom Training of YOLOv11

  • The YOLOv11 model is fine-tuned on a custom basketball dataset for 75 epochs with high-resolution images (1088x1088); a training sketch follows this list.
  • Final training losses include box loss (0.5878), classification loss (0.313), and Distribution Focal Loss (DFL, 0.9233).
  • Results achieved:
      • Precision: 0.991
      • Recall: 0.985
      • mAP@50: 0.993
      • mAP@50-95: 0.756
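
A comparable fine-tuning run with the Ultralytics API might look like the following; the checkpoint name and dataset YAML path are assumptions, not the repository's exact setup:

```python
from ultralytics import YOLO

# Start from a pretrained YOLO11 checkpoint (size variant assumed)
model = YOLO("yolo11n.pt")

# Fine-tune on the custom basketball dataset (data.yaml path is hypothetical)
model.train(
    data="basketball/data.yaml",  # dataset config listing train/val splits and class names
    epochs=75,                    # matches the 75 epochs reported above
    imgsz=1088,                   # high-resolution 1088x1088 inputs
)
```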

Object Detection and Tracking

  • YOLOv11 detects objects, outputting bounding boxes, class labels, and confidence scores.
  • BoT-SORT handles tracking, maintaining object identities across frames and managing occlusions effectively (see the sketch below).
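
As a sketch of this stage using the Ultralytics tracking API (video path and weights are assumptions; BoT-SORT is selected via its config file):

```python
from ultralytics import YOLO

# Load the fine-tuned detector (weights path is hypothetical)
model = YOLO("runs/detect/train/weights/best.pt")

# Detect and track across a video; stream=True yields results frame by frame
for result in model.track(source="basketball.mp4", tracker="botsort.yaml", stream=True):
    boxes = result.boxes
    if boxes.id is not None:  # track IDs appear once objects are confirmed
        print(boxes.id.tolist(), boxes.cls.tolist(), boxes.conf.tolist())
```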

Object Segmentation

  • SAM2 segments detected objects, refining boundaries for precise localization.
  • Pixel-wise masks isolate objects from the background, enhancing clarity (see the sketch below).
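
A minimal sketch of prompting SAM2 with a detected bounding box, via the Ultralytics SAM interface (checkpoint name and box coordinates are assumptions):

```python
from ultralytics import SAM

# Load a SAM2 checkpoint (variant name assumed)
sam = SAM("sam2_b.pt")

# Prompt with a bounding box [x1, y1, x2, y2] to get a pixel-wise mask
results = sam("frame.jpg", bboxes=[100, 150, 380, 420])
mask = results[0].masks.data[0]  # binary mask isolating the object from the background
```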

Integration of YOLOv11 and SAM2

  • Detection results from YOLOv11 are merged with segmentation outputs from SAM2.
  • BoT-SORT ensures continuity in object tracking, even during temporary occlusions.
  • Annotation formats are aligned for a seamless final output (see the combined sketch below).
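
Putting the stages together, one plausible per-frame loop (paths and checkpoint names are assumptions, not the repository's exact code):

```python
from ultralytics import YOLO, SAM

detector = YOLO("runs/detect/train/weights/best.pt")  # fine-tuned detector (path hypothetical)
segmenter = SAM("sam2_b.pt")                          # SAM2 checkpoint (name assumed)

# Detect and track with BoT-SORT, then refine each tracked box with a SAM2 mask
for result in detector.track(source="basketball.mp4", tracker="botsort.yaml", stream=True):
    if result.boxes.id is None:
        continue
    frame = result.orig_img  # original BGR frame for this result
    for box, track_id in zip(result.boxes.xyxy.tolist(), result.boxes.id.tolist()):
        masks = segmenter(frame, bboxes=box)[0].masks
        # Each tracked object now carries a stable ID and a pixel-wise mask
        print(f"track {int(track_id)}: box={box}, mask_px={int(masks.data[0].sum())}")
```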

Architecture Overview

YOLOv11 Architecture

  • Input Layer: Accepts high-resolution images.
  • Backbone: Employs an advanced convolutional neural network for feature extraction.
  • Neck: Integrates feature pyramid networks for multi-scale detection.
  • Head: Outputs bounding boxes, class probabilities, and confidence scores.

SAM2 Architecture

  • Prompt Encoder: Encodes user prompts such as bounding boxes or key points.
  • Image Encoder: Encodes the image using a Vision Transformer (ViT).
  • Mask Decoder: Generates segmentation masks based on encoded features and user prompts.

Features

  • YOLOv11: Real-time object detection with high precision.
  • SAM2: Pixel-perfect segmentation for complex scenes.
  • BoT-SORT: Efficient object tracking across video frames.
  • Robustness: Handles diverse environments and occlusions.
  • Scalability: Supports multiple object classes and real-world applications.

Results and Applications

The integrated solution excels in various domains:

  1. Surveillance: Tracks individuals and objects in crowded or dimly lit environments.
  2. Autonomous Driving: Detects and localizes vehicles, pedestrians, and obstacles.
  3. Retail Automation: Identifies products and tracks customer movements efficiently.

Performance Metrics

  • Inference Speed: 32.34 ms per frame.
  • Precision: 0.991
  • Recall: 0.985
  • mAP@50-95: 0.756
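
As a rough sketch of how the per-frame latency could be measured (video path and weights are assumptions), Ultralytics reports per-stage timings on each result:

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # path hypothetical

times = []
for result in model("basketball.mp4", stream=True):
    # result.speed holds preprocess/inference/postprocess times in milliseconds
    times.append(sum(result.speed.values()))

print(f"mean per-frame time: {sum(times) / len(times):.2f} ms")
```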

Installation Guide

  1. Clone the repository and navigate to the project folder:
```bash
git clone https://github.com/Pin4sf/TechFest_YOLO11_SAM2.git
cd TechFest_YOLO11_SAM2
```

Conclusion

This project demonstrates a significant leap in object detection and tracking by integrating YOLOv11, BoT-SORT, and SAM2. With applications ranging from security to automation, it highlights the potential of combining cutting-edge technologies for real-world challenges.

For more details, visit the repository.