D³NavZero: AV Navigation with AI-Guided Graph Search

Aditya NG
5 min read · Jul 21, 2024


D³NavZero

D³NavZero proposes a navigation system that integrates MuZero-style intelligent graph search with D³Nav [1] and TrajNet [2]. Combining these technologies has the potential to change how autonomous vehicles navigate and make decisions in complex environments.

  • D³Nav [1] is a generative video model that takes the past history of video frames (driving footage) as input and predicts the future frames as output
  • TrajNet [2] is a simple image-based end-to-end path planner for AVs
  • MuZero [3] combines a learned world model with Monte Carlo Tree Search to decide on an optimal action
D³Nav generating random scenes

The Inspiration Behind D³NavZero

The concept of D³NavZero draws inspiration from three significant developments in the field of artificial intelligence and autonomous driving:

  1. Tesla’s application of AI-guided tree search for path planning in their Full Self-Driving (FSD) v12 software [4].
  2. The success of MuZero [3], an AI system that uses guided Monte Carlo tree search to achieve superhuman performance in games like Atari, Go, Chess, and Shogi.
  3. Comma AI’s Learning a Driving Simulator [5]: this and related approaches have shown it is possible to build a neural simulator for the AV task.

While Tesla’s approach likely relies on a classical hard-coded simulator (at the time of writing) and vast amounts of user data for training, D³NavZero aims to take this concept further by incorporating a learned world engine, D³Nav, as a simulator.

TrajNet produces the desired vehicle trajectory

The Building Blocks of D³NavZero

At its core, D³NavZero comprises three main components, each represented by a neural network:

  1. Environment Encoder (h): This network takes an image as input and produces a latent space representation of the environment.
  2. Policy-Value Function (f): This network takes the environment state as input and outputs a policy (a distribution over candidate actions) along with an estimate of the future reward (value).
  3. Dynamics Model (g): This network takes a state-action pair for a given frame as input and predicts the next state and the reward for the action.

These networks work in tandem with a Monte Carlo Tree Search framework to make informed decisions by exploring potential future scenarios and selecting the path with the highest reward.
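To make the three interfaces concrete, here is a minimal pure-Python sketch. The internals are toy stand-ins (the real components are neural networks), and all names, latent sizes, and the reward formula are hypothetical:

```python
# Toy stand-ins for the three D³NavZero networks (h, f, g).
# These only illustrate the input/output contracts, not real models.

def environment_encoder(image):
    """h: raw image -> latent state representation (here, 4 toy floats)."""
    return [hash((image, i)) % 100 / 100.0 for i in range(4)]

def policy_value(state):
    """f: latent state -> (prior over candidate actions, estimated value)."""
    priors = {a: 1.0 / 3 for a in ("left", "straight", "right")}
    value = sum(state) / len(state)  # toy value estimate
    return priors, value

def dynamics(state, action):
    """g: (latent state, action) -> (next latent state, immediate reward)."""
    shift = {"left": -0.1, "straight": 0.0, "right": 0.1}[action]
    next_state = [min(max(s + shift, 0.0), 1.0) for s in state]
    # Toy reward: prefer states whose mean stays near 0.5 ("lane center").
    reward = 1.0 - abs(sum(next_state) / len(next_state) - 0.5)
    return next_state, reward
```

The search layer only ever calls these three functions, which is what lets the same MCTS machinery work whether the networks are toy stubs or full video models.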

Translating MuZero to Autonomous Driving

In the context of autonomous vehicles, D³NavZero adapts the MuZero framework as follows:

  • The Environment Encoder is implemented using a VQ-VAE image encoder.
  • The Policy-Value Function is based on TrajNet [2], an end-to-end image-to-trajectory planner for AVs.
  • The Dynamics Model is D³Nav [1], which takes an image and desired trajectory as input and produces the next video frame as output. This model will be adapted to also output a reward.

The Monte Carlo Tree Search, guided by these three networks, explores potential trajectories the vehicle can take in the near future (1 to 5 seconds), ultimately selecting the trajectory with the highest reward.

Training: Dynamics Model and Environment Encoder

These components can be trained on large-scale unlabeled video datasets, including:

  • Berkeley DeepDrive Video (10,000 hours)
  • CommaVQ (1,666 hours)
  • OpenDV (1,700 hours)

This massive dataset, totaling approximately 13,366 hours or 200 billion tokens at 30 FPS, provides a diverse range of driving scenarios for training.
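A quick back-of-the-envelope check of those numbers, assuming roughly 128 tokens per frame (the figure CommaVQ uses; the per-frame token count is an assumption here):

```python
# Sanity-check the dataset size quoted above.
hours = 10_000 + 1_666 + 1_700   # Berkeley DeepDrive + CommaVQ + OpenDV
frames = hours * 3600 * 30       # at 30 FPS
tokens = frames * 128            # assumed tokens per frame

print(hours)                 # 13366 hours
print(round(tokens / 1e9))   # ~185B tokens, on the order of the quoted 200B
```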

D³Nav is trained to predict the future video tokens given the past context

Training: Policy-Value Function

The Policy-Value Function (TrajNet [2]) is pre-trained on high-quality ego-motion datasets like NuScenes and Berkeley DeepDrive. After pre-training, it’s fine-tuned using the Dynamics Model as a simulator on frames from the larger video datasets.

Training: Cost

Estimating the training cost takes only some back-of-the-napkin math. Andrej Karpathy [6] showed that it is possible to replicate GPT-2 results today by training on 10B tokens for 90 minutes for ~$20. Extrapolating, training on 200B tokens would take about 30 hours and cost ~$400 per experiment. Assuming we run around 250 experiments (which is a LOT), the total comes to ~$100,000!
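The extrapolation above, spelled out:

```python
# Baseline from llm.c [6]: ~10B tokens in ~90 minutes for ~$20.
base_tokens, base_hours, base_cost = 10e9, 1.5, 20.0

scale = 200e9 / base_tokens          # 20x more tokens than the baseline
hours_per_run = base_hours * scale   # ~30 hours per experiment
cost_per_run = base_cost * scale     # ~$400 per experiment
total = cost_per_run * 250           # ~250 experiments

print(hours_per_run, cost_per_run, total)  # 30.0 400.0 100000.0
```

This assumes cost scales linearly with token count, which is only a first-order approximation (the baseline is a 124M-parameter model).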

Inference

To run the model at inference time, we use Monte Carlo Tree Search: the Policy-Value Function (TrajNet) samples a set of candidate future trajectories, and the Dynamics Model (D³Nav) predicts the next frame for each of these frontiers. We repeat this for the next N steps along each frontier, broadening the search at each level to expand the tree. Finally, we pick the trajectory that shows the most promise.
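As a sketch of that search loop, here is a brute-force stand-in for the guided tree search, with one-dimensional toy stubs in place of TrajNet and D³Nav. All names, priors, and rewards are hypothetical; a real implementation would use proper MCTS with visit counts and UCB-style selection rather than exhaustive enumeration:

```python
import itertools

# Toy stubs: state is a single float in [0, 1] ("lateral position").
def policy_value(state):
    priors = {"left": 0.2, "straight": 0.6, "right": 0.2}
    return priors, state  # toy value estimate

def dynamics(state, action):
    shift = {"left": -0.2, "straight": 0.0, "right": 0.2}[action]
    next_state = max(0.0, min(1.0, state + shift))
    reward = 1.0 - abs(next_state - 0.5)  # reward staying near lane center
    return next_state, reward

def plan(root_state, depth=3, width=2):
    """Expand the top-`width` policy actions for `depth` steps and
    return the first action of the highest-return path."""
    priors, _ = policy_value(root_state)
    top_actions = sorted(priors, key=priors.get, reverse=True)[:width]
    best_return, best_first = float("-inf"), None
    for path in itertools.product(top_actions, repeat=depth):
        state, total = root_state, 0.0
        for action in path:
            state, reward = dynamics(state, action)  # D³Nav as simulator
            total += reward
        _, leaf_value = policy_value(state)          # bootstrap the leaf
        total += leaf_value
        if total > best_return:
            best_return, best_first = total, path[0]
    return best_first

print(plan(0.9))  # "left": steer back toward lane center
print(plan(0.5))  # "straight": already centered
```

Note how search cost grows as width^depth, which is exactly why the search depth and width mentioned below dominate inference time.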

The Promise of D³NavZero

While the computational demands of D³NavZero are currently high for online inference, with an estimated inference time of about 10 seconds depending on the search depth and width, its potential is immense. The system can be used offline to supervise smaller, more efficient models that run on vehicles. We may even alter the search depth and width to optimize for inference speed. As hardware performance improves and optimization techniques advance, we can expect D³NavZero to become increasingly viable for real-time applications.

The future of autonomous vehicles is bright, with D³NavZero representing just one of the many exciting technologies waiting to be explored and refined. As we continue to push the boundaries of what’s possible in AI and self-driving cars, innovations like D³NavZero will play a crucial role in creating safer, more efficient, and more capable autonomous vehicles.

References

  1. D³Nav: Data-Driven Driving Agents for Autonomous Vehicles in Unstructured Traffic https://adityang.github.io/D3Nav
  2. Thermal Voyager: A Comparative Study of RGB and Thermal Cameras for Night-Time Autonomous Navigation https://adityang.github.io/TrajNet
  3. MuZero: Mastering Go, chess, shogi and Atari without rules https://deepmind.google/discover/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules/
  4. Tesla’s application of AI-guided tree search for path planning in their Full Self-Driving (FSD) v12 software https://www.youtube.com/watch?v=JhraWiBuBbs&ab_channel=Dr.Know-it-allKnowsitall
  5. comma ai | Learning a Driving Simulator https://www.youtube.com/watch?v=-KMdo9AWJaQ&ab_channel=georgehotzarchive
  6. Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481 https://github.com/karpathy/llm.c/discussions/481
