Real-Time Semantic and Occupancy Prediction: A Deep Dive into SOccDPT

Aditya NG
3 min read · Jun 9, 2024

SOccDPT Project Page: https://adityang.github.io/SOccDPT

SOccDPT in Action

For autonomous vehicles and robots, understanding and navigating complex environments in real time is essential. One crucial part of this capability is semantic occupancy prediction: determining the presence, location, and type of objects in 3D space from 2D images. The task is especially challenging in unstructured environments such as the chaotic urban streets typical of developing regions. Existing methods often struggle with the high variability and lack of structure in such settings, which leads to inaccurate and unreliable predictions.

Semantic occupancy prediction improves both the situational awareness of autonomous systems and their decision-making. It enables better-informed path planning, obstacle avoidance, and interaction with the environment. Because this capability is central to the safety and efficiency of autonomous systems, there is a pressing need for models that make these predictions reliably and in real time.

The Challenges with Current SOTA Approaches

Current state-of-the-art methods in semantic occupancy prediction face several limitations:

  • Manual Labeling Requirement: Many models depend on extensive manually labeled, well-structured datasets for training. This limits their ability to generalize to unstructured environments and makes them difficult to scale to new domains.
  • Poor Inference Speed: Accuracy often comes from larger, more complex models, and that extra complexity costs real-time performance, which limits practical applicability.

Introducing SOccDPT: Overcoming the Challenges

SOccDPT, or Semi-Supervised 3D Semantic Occupancy from Dense Prediction Transformers, offers a new approach to these challenges. Developed by a team of researchers and published in the journal “Advances in Artificial Intelligence and Machine Learning” in 2024, SOccDPT uses a memory-efficient approach to predict 3D semantic occupancy from monocular image inputs.

Architecture

SOccDPT uses a relatively simple architecture: a ViT backbone with two heads, one for semantics and one for depth. Camera calibration is then used to combine the two outputs into the semantic occupancy grid.

SOccDPT Model Architecture
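
The calibration step can be pictured with a small NumPy sketch. This is not the SOccDPT code: it assumes a standard pinhole camera model, and the function name `semantic_occupancy_grid` and the parameters `grid_min`, `voxel_size`, `grid_shape`, and `free_label` are made up for illustration. It back-projects each pixel with its predicted depth and writes the pixel's class into the voxel it lands in.

```python
import numpy as np


def semantic_occupancy_grid(depth, semantics, K, grid_min, voxel_size,
                            grid_shape, free_label=0):
    """Back-project per-pixel depth and class ids into a voxel grid.

    depth:     (H, W) metric depth along the camera z-axis
    semantics: (H, W) integer class ids
    K:         (3, 3) pinhole intrinsics (assumed camera model)
    grid_min:  (3,) metric coordinates of the grid corner, camera frame
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole back-projection into the camera frame.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    labels = semantics.reshape(-1)

    # Convert metric coordinates to integer voxel indices.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)

    # Keep pixels with positive depth whose voxel lies inside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    valid = inside & (z.reshape(-1) > 0)

    grid = np.full(grid_shape, free_label, dtype=np.int64)
    i, j, k = idx[valid].T
    # If several pixels fall into one voxel, the last write wins.
    grid[i, j, k] = labels[valid]
    return grid
```

A real pipeline would also need to decide how to resolve several pixels landing in the same voxel and how to treat unobserved space; the sketch simply leaves untouched voxels at `free_label`.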

Semi-Supervised Learning with Pseudo-Ground Truth

One of the standout features of SOccDPT is its semi-supervised training pipeline, which addresses the challenge of limited labeled data. The model is trained on unstructured datasets such as the Indian Driving Dataset and the Bengaluru Driving Dataset. These datasets are characterized by complex and unpredictable traffic scenarios, providing a more realistic training environment for the model.

Boosting applied to semantics and depth auto-labeling to generate labels from the teacher model

To overcome the scarcity of labeled data, SOccDPT trains on pseudo-ground-truth labels. A trained teacher model generates labels automatically from the data, and boosting is applied to improve the accuracy of those labels. The overall scheme resembles knowledge distillation, with boosting strengthening the teacher. This significantly reduces the need for manual labeling and lets the model learn from a broader range of scenarios, so SOccDPT is better equipped to handle the variability and complexity of unstructured environments.
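
As a rough illustration of auto-labeling for the semantics branch, here is a minimal NumPy sketch. The article does not spell out the boosting scheme, so this sketch swaps in a plain weighted ensemble of teacher outputs with a confidence cut-off; it should not be read as SOccDPT's method. The function name `pseudo_labels` and the parameters `weights`, `min_confidence`, and `ignore_index` are assumptions made for the example.

```python
import numpy as np


def pseudo_labels(teacher_probs, weights, min_confidence=0.6, ignore_index=255):
    """Fuse several teacher predictions into one pseudo-label map.

    teacher_probs: (T, C, H, W) class probabilities from T teacher passes
    weights:       (T,) non-negative weight per teacher pass (assumed given)
    Returns (labels, confidence), both of shape (H, W).
    """
    probs = np.asarray(teacher_probs, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalise so the fused output stays a distribution

    # Weighted average over the teacher axis -> (C, H, W).
    fused = np.tensordot(w, probs, axes=(0, 0))

    labels = fused.argmax(axis=0)
    confidence = fused.max(axis=0)

    # Pixels the ensemble is unsure about are excluded from the student loss.
    labels = np.where(confidence >= min_confidence, labels, ignore_index)
    return labels, confidence
```

The student would then be trained on `labels`, skipping pixels marked `ignore_index`, which is the usual way to keep low-confidence pseudo-labels from polluting training.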

Achieving High Frame Rates

Frame rate is a critical metric for any model used in real-time applications. SOccDPT runs at 69.47 Hz, or roughly 14.4 ms per frame, which makes it well suited to real-time use in autonomous driving and robotics. At this rate, predictions arrive quickly enough to support dynamic decision-making.
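
To relate a frame rate to a per-frame time budget, and to measure the rate of your own inference function, a small timing helper is enough. The sketch below uses only the Python standard library; `frame_budget_ms` and `measure_rate_hz` are hypothetical helper names, and `infer` stands in for whatever callable runs the model on one frame.

```python
import time


def frame_budget_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / rate_hz


def measure_rate_hz(infer, frame, warmup=5, iters=50):
    """Average frames per second of `infer(frame)` over `iters` timed calls."""
    # Warm-up calls are not timed, so one-off setup costs do not skew the result.
    for _ in range(warmup):
        infer(frame)

    start = time.perf_counter()
    for _ in range(iters):
        infer(frame)
    elapsed = time.perf_counter() - start
    return iters / elapsed


if __name__ == "__main__":
    print(f"{frame_budget_ms(69.47):.2f} ms per frame at 69.47 Hz")
```

If the model runs on a GPU, the device must be synchronized before each clock reading, since GPU work is typically launched asynchronously and the timer would otherwise stop too early.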

Conclusion

SOccDPT presents a jointly trained semantic and occupancy prediction model that operates at real-time speeds. Its auto-labeling pipeline uses synthetic data to distill knowledge from larger models, aided by boosting, into a smaller and faster model, bridging the gap for domain transfer.

For more information on SOccDPT, and to access the code and dataset, visit the project page: https://adityang.github.io/SOccDPT

