Large vision models like LLaVA, autonomous-vehicle models like D³Nav and GAIA-1, and image generation models like DALL-E are built on vast amounts of data and complex neural architectures. But how do they actually “see” and understand our world? The answer is VQ-VAEs! Vector Quantized Variational Autoencoders are powerful generative models used to condense a domain’s signal down to a quantized embedding space. This compression retains all the important information in a compact token set, which can then be used to train a foundation model!
While working on D³Nav, a generative driving model, I worked extensively with VQ-VAEs and learned how to train them. I was also struggling to find a generic PyTorch implementation of a VQ-VAE that works on multidimensional data (1D, 2D, 3D, 4D), so I made my own library for it!
In this article, we’ll explore how to fine-tune a VQ-VAE using the nd_vq_vae library, with a focus on both image and video data.
Setting Up Your Environment
First, let’s set up our development environment. You can install the nd_vq_vae library directly from PyPI:
pip install nd_vq_vae
For those who prefer to work with the latest development version, you can clone the repository and set up the environment as follows:
git clone https://github.com/AdityaNG/nD_VQ_VAE
cd nD_VQ_VAE/
make virtualenv
source .venv/bin/activate
make install
Preparing Your Dataset
The nd_vq_vae library supports multidimensional data. We will be working with image and video datasets. Let’s look at how to structure your data for each case:
For Image Training:
Create a directory structure like this:
data/image_dataset/
├── test
│ ├── image1.png
│ ├── image2.png
│ └── ...
└── train
├── image1.png
├── image2.png
└── ...
For Video Training: Set up your video dataset as follows:
data/video_dataset/
├── test
│ ├── video1.mkv
│ ├── video2.mp4
│ └── ...
└── train
├── video1.mp4
├── video2.mkv
└── ...
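If your images or videos currently sit in a single flat folder, a small script can produce this layout for you. The sketch below is one way to do it, assuming a flat source directory of media files and a 90/10 train/test split; the paths and ratio are placeholders you should adjust.

import random
import shutil
from pathlib import Path

def split_dataset(src_dir: str, dst_dir: str, train_fraction: float = 0.9, seed: int = 42):
    # Collect all files from the flat source directory and shuffle them deterministically
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)
    split = int(len(files) * train_fraction)
    for subset, subset_files in [("train", files[:split]), ("test", files[split:])]:
        out = Path(dst_dir) / subset
        out.mkdir(parents=True, exist_ok=True)
        for f in subset_files:
            shutil.copy(f, out / f.name)

# Example: split_dataset("raw_videos/", "data/video_dataset/")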
Fine-Tuning Your VQ-VAE
Now that we have our environment set up and our data prepared, let’s dive into the fine-tuning process.
1. Define Your Model
The NDimVQVAE class is the core of our model. Here’s how to initialize it:
from nd_vq_vae import NDimVQVAE

# For image data (2D): (channels, height, width)
input_shape = (3, 128, 256)

# For video data (3D), use instead: (channels, time, height, width)
# sequence_length = 3
# input_shape = (3, sequence_length, 128, 256)

model = NDimVQVAE(
    embedding_dim=64,
    n_codes=64,
    n_dims=2,  # Use 3 for video data
    downsample=[4, 4],  # Adjust based on your needs
    n_hiddens=64,
    n_res_layers=2,
    codebook_beta=0.10,
    input_shape=input_shape,
)
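Before kicking off a long training run, it’s worth sanity-checking the configuration. The snippet below only assumes NDimVQVAE behaves like a standard PyTorch nn.Module: it counts parameters and builds a dummy batch matching input_shape (batch dimension first).

import torch

# Rough model size check (assumes NDimVQVAE is a regular nn.Module)
n_params = sum(p.numel() for p in model.parameters())
print(f"NDimVQVAE parameters: {n_params / 1e6:.2f}M")

# A dummy batch shaped like your data, with the batch dimension first
x = torch.randn(4, *input_shape)
print("Batch shape:", x.shape)  # e.g. (4, 3, 128, 256) for the image config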
2. Train Your Model
For image training:
python scripts/train_image.py --data_path data/image_dataset/
For video training:
python scripts/train_video.py --data_path data/video_dataset/
If you want to feed in, say, point cloud data or audio signals, you can do that too! Feel free to modify the above scripts for your use case.
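As a sketch of what that might look like, here is a minimal PyTorch Dataset for 1D audio segments. Everything here (the class name, the segment length, how it plugs into the training scripts) is hypothetical rather than part of the nd_vq_vae API; you would pair something like this with n_dims=1 and an input_shape such as (1, segment_length).

import torch
from torch.utils.data import Dataset

class WaveformDataset(Dataset):
    # Hypothetical 1D dataset: each item is a (channels, time) audio tensor
    def __init__(self, waveforms, segment_length=16000):
        self.waveforms = waveforms  # list of (channels, time) tensors
        self.segment_length = segment_length

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        # Crop each waveform to a fixed length so batches stack cleanly
        return self.waveforms[idx][:, : self.segment_length].float()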
3. Monitor and Adjust Hyperparameters
During training, keep a close eye on these key metrics: Reconstruction Loss, Commitment Loss, and Perplexity.
Here are some tips for fine-tuning based on these metrics (a sketch of how perplexity is typically computed follows this list):
- High Reconstruction Loss: Increase model capacity (n_hiddens, n_res_layers) or adjust the learning rate.
- High Commitment Loss: Decrease codebook_beta.
- Low Perplexity: Increase n_codes or decrease embedding_dim.
- High Perplexity: Decrease n_codes or increase embedding_dim.
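For reference, codebook perplexity is typically computed from the encoding indices of a batch as below; the library may already log this for you, so treat this as an illustration rather than its exact implementation. A value near n_codes means the codes are used uniformly, while a value near 1 means the codebook has collapsed.

import torch
import torch.nn.functional as F

def codebook_perplexity(encoding_indices: torch.Tensor, n_codes: int) -> torch.Tensor:
    # How often each code was selected in this batch
    one_hot = F.one_hot(encoding_indices.flatten(), n_codes).float()
    avg_probs = one_hot.mean(dim=0)
    # exp(entropy): n_codes = perfectly uniform usage, 1 = total collapse
    return torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))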
4. Balancing Losses: Adjust codebook_beta and recon_loss_factor to achieve a good balance between reconstruction and commitment losses.
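Conceptually, these two knobs weight the terms of the standard VQ-VAE objective. The sketch below shows that combination in its simplest form; the exact loss terms inside nd_vq_vae may be structured differently.

import torch.nn.functional as F

def vqvae_objective(x, x_recon, z_e, z_q, recon_loss_factor=1.0, codebook_beta=0.10):
    # Reconstruction term: how faithfully the decoder reproduces the input
    recon_loss = F.mse_loss(x_recon, x)
    # Commitment term: keeps encoder outputs close to their assigned codebook vectors
    commitment_loss = F.mse_loss(z_e, z_q.detach())
    return recon_loss_factor * recon_loss + codebook_beta * commitment_loss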
5. Addressing Overfitting: If you notice the validation loss plateauing while the training loss continues to decrease:
- Introduce dropout in the encoder/decoder
- Reduce model capacity
- Increase batch size or use data augmentation
6. Fine-tuning the Attention Mechanism: Adjust n_head and attn_dropout to improve long-range dependencies.
7. Codebook Update Strategy: Fine-tune ema_decay for optimal codebook stability and adaptation speed.
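As background, the exponential-moving-average codebook update popularized by the original VQ-VAE work looks roughly like the sketch below; a higher ema_decay makes the codebook change more slowly and therefore more stably. Tensor names and shapes here are illustrative, not the library’s internals.

import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, z_e_flat, encodings_onehot,
                        ema_decay=0.99, eps=1e-5):
    # codebook: (n_codes, embedding_dim), z_e_flat: (N, embedding_dim), encodings_onehot: (N, n_codes)
    # Moving averages of how often each code is used and of the encoder outputs assigned to it
    cluster_size.mul_(ema_decay).add_(encodings_onehot.sum(dim=0), alpha=1 - ema_decay)
    embed_sum.mul_(ema_decay).add_(encodings_onehot.t() @ z_e_flat, alpha=1 - ema_decay)
    # Laplace smoothing so rarely used codes don't divide by zero
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + codebook.shape[0] * eps) * n
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))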
Best Practices
- Make incremental changes to hyperparameters.
- Perform ablation studies, changing one parameter at a time.
- Consider using learning rate scheduling or cyclical learning rates.
- Regularly save checkpoints and log experiments for comparison.
Conclusion
Fine-tuning a VQ-VAE is an iterative process that requires patience and careful monitoring of key metrics. By following these guidelines and using the nd_vq_vae library, you can effectively optimize your VQ-VAE for both image and video data. Remember, the best hyperparameters often depend on your specific dataset and task, so don’t be afraid to experiment and iterate!
Happy fine-tuning!
References
[1] N-Dimensional VQ-VAE: PyTorch implementation
[2] D³Nav: Data-Driven Driving Agents for Autonomous Vehicles in Unstructured Traffic