DriveLLaVA: Using Large Vision Models as Driving Agents for AVs
Autonomous vehicles (AVs) have the potential to revolutionize transportation, but developing safe and reliable AVs remains a challenge. One key hurdle is designing an effective driving agent, the software component that controls the vehicle’s behavior. This article explores DriveLLaVA, an approach that leverages Large Vision-Language Models (VLMs) for autonomous driving.
The DriveLLaVA Concept
DriveLLaVA capitalizes on the world model that a VLM possesses to navigate the driving environment. Here’s a breakdown of the concept:
- The Core: DriveLLaVA uses a Large Vision-Language Model (LLaVA) as its foundation. This model is pre-trained on a vast amount of visual and language data, allowing it to understand the visual world.
- Trajectory Tokenization: The concept hinges on converting the desired driving path (trajectory) into a sequence of tokens. These tokens represent ego-motion, which translates to specific driving maneuvers such as lane changes or turns.
- Training Dataset Generation: To train DriveLLaVA, a dataset of paired elements is created: each element pairs a driving frame (an image captured by the vehicle’s camera) with the tokenized trajectory that represents the desired driving path for that frame.
- Fine-tuning DriveLLaVA: DriveLLaVA is then fine-tuned using the generated dataset. During this process, DriveLLaVA learns to predict the correct trajectory tokens based on the input driving frames.
The power of this approach lies in leveraging the VLM’s inherent understanding of the world. This pre-existing knowledge allows DriveLLaVA to interpret visual information and make informed driving decisions without the need for extensive hand-crafted rules.
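At inference time, the loop is conceptually a single visual question-answering call: feed the camera frame and a prompt, then read back the selected trajectory token. The sketch below uses the Hugging Face port of LLaVA only as an illustration; the model ID, frame path, and prompt wording are assumptions, and a fine-tuned DriveLLaVA checkpoint would take the place of the base model.

# Minimal inference sketch using the Hugging Face LLaVA port (illustrative only).
# Assumptions: "llava-hf/llava-1.5-7b-hf" stands in for a fine-tuned DriveLLaVA
# checkpoint, and "frame.png" stands in for a camera frame from the dataset.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "USER: <image>\nAssess the situation depicted in the image and select "
    "the trajectory token that best aligns with safe and efficient driving.\n"
    "ASSISTANT:"
)
image = Image.open("frame.png")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(processor.decode(output[0], skip_special_tokens=True))  # e.g. "Selected Trajectory: 宇"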
Prompting the Model
The trajectory is quantized by fitting a K-Means clustering model on the dataset, a technique inspired by TrajNet. Each quantized index is then mapped to a Unicode character via a lookup into the model’s dictionary. To ensure some diversity in the prompts, we use an LLM to generate several variants of the same prompt with different wording. Following is an example of a training sample from the dataset.
{
  "id": "10009100841",
  "image": "img_val/b57ccb9f2a966667dfe6c831ae88dca7_410/841.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image> You, as DriveLLaVA, are at the forefront of autonomous navigation. Assess the situation depicted in the image and select the trajectory that best aligns with safe and efficient driving principles from the options provided. The trajectory tokens are sorted from left to center to right\nLeft: ̂,ペ,何,宇,張\nCenter: 语,老,例,Ṭ,鉄\nRight: 克,☉,™,ɹ,ἱ,ⴰ\n"
    },
    {
      "from": "gpt",
      "value": "Selected Trajectory: 宇"
    }
  ]
}
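For illustration, here is how such a sample could be assembled in Python. The TOKEN_TABLE below simply reuses the sixteen characters from the example above, and make_sample is a hypothetical helper, not code from the repository; in DriveLLaVA the index-to-character lookup comes from the model’s dictionary.

import json

# Hypothetical cluster-index -> token table; DriveLLaVA derives this lookup
# from the model's dictionary, one rarely-used character per cluster.
TOKEN_TABLE = ["̂", "ペ", "何", "宇", "張", "语", "老", "例",
               "Ṭ", "鉄", "克", "☉", "™", "ɹ", "ἱ", "ⴰ"]

def make_sample(sample_id, image_path, cluster_idx, prompt_text):
    # Build one conversation-style fine-tuning sample in the format above.
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image> " + prompt_text},
            {"from": "gpt",
             "value": "Selected Trajectory: " + TOKEN_TABLE[cluster_idx]},
        ],
    }

sample = make_sample(
    "10009100841",
    "img_val/b57ccb9f2a966667dfe6c831ae88dca7_410/841.png",
    3,  # cluster index 3 maps to "宇" in this toy table
    "Assess the situation and select the best trajectory token.",
)
print(json.dumps(sample, ensure_ascii=False, indent=2))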
Quantized Trajectory
We quantize the trajectory by fitting a K-Means clustering model on the dataset, producing a set of trajectory templates. We run the experiment with K set to 64, 128, and 256. This quantization technique was inspired by the TrajNet paper.
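As a concrete sketch, fitting such a quantizer with scikit-learn might look like the following. The synthetic trajectory array is a stand-in for the real waypoint data, whose exact shape and extraction are assumptions here.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: in the real pipeline each row would be a flattened
# sequence of future (x, y) ego waypoints extracted from the driving logs.
rng = np.random.default_rng(0)
trajectories = rng.normal(size=(10_000, 2 * 20))  # 10k trajectories, 20 waypoints each

K = 128  # the experiments sweep K over 64, 128, and 256
quantizer = KMeans(n_clusters=K, n_init=10, random_state=0).fit(trajectories)

templates = quantizer.cluster_centers_       # the K trajectory templates
token_ids = quantizer.predict(trajectories)  # cluster index per trajectory
print(templates.shape, token_ids[:5])

Each cluster center then serves as a trajectory template, and every trajectory in the dataset is represented by the index of its nearest template before being mapped to a Unicode character.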
Results
The DriveLLaVA project reports promising results. The model achieved a training loss of 0.722 and predicted reasonable driving trajectories, as shown in the visual at the beginning of this article.
Training proceeded smoothly with an initial learning rate of 2e-8 and a batch size of 16, consuming around 38 GB of VRAM and running for about 6 hours and 30 minutes.
Limitations
The quantization method, while functional, is limited. Performance might improve with a dedicated trajectory token vocabulary and exploration of sequential token-based construction. Additionally, incorporating a temporal sequence as context during training is a promising avenue for further improvement.
Code
The project is open source and has been made available at https://github.com/AdityaNG/DriveLLaVA
Contributions are welcome!
References
- Credits to the LLaVA repo: https://github.com/haotian-liu/LLaVA
- Fine-tuning LLaVA: https://ubiai.tools/how-to-fine-tune-llava-on-your-custom-dataset/
- TrajNet for quantized Trajectories: https://adityang.github.io/TrajNet
- Wayve’s LINGO-1 as an inspiration: https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/