Boosting has remained a popular approach to squeeze out a few more fractions of performance. Chasing the long tail of nines in an accuracy metric brings most practitioners to some form of boosting, since it is a quick way to improve results without redesigning the underlying model.
The authors of “Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging”[1] propose one such boosting method. Their approach improves the depth map produced by a given model without ever retraining that model. Instead, the input image is resized to N resolutions, and these N images are fed into the model to obtain N depth maps. These N depth maps are then cleverly merged into a structurally consistent high-resolution depth map.
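The inference side of this pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's method: `depth_model` is a hypothetical callable standing in for any monocular depth network, the resizing is simple nearest-neighbor, and the maps are merged with a naive average rather than the content-adaptive merge the authors propose.

```python
import numpy as np

def resize_nearest(arr, new_h, new_w):
    """Nearest-neighbor resize along the first two axes (toy stand-in
    for a proper image resampler)."""
    h, w = arr.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return arr[rows][:, cols]

def multi_resolution_depth(image, depth_model, scales=(0.5, 1.0, 2.0)):
    """Run `depth_model` (hypothetical: HxWx3 image -> HxW depth map)
    at several input resolutions and merge the results.

    The merge here is a plain average purely for illustration; the
    paper uses a learned, content-adaptive merging network instead.
    """
    h, w = image.shape[:2]
    maps = []
    for s in scales:
        resized = resize_nearest(image, int(h * s), int(w * s))
        depth = depth_model(resized)
        # Bring each estimate back to the base resolution before merging.
        maps.append(resize_nearest(depth, h, w))
    return np.mean(maps, axis=0)
```

The key point the sketch captures is that the model itself is untouched: only the input resolution varies, and all the improvement must come from how the N estimates are combined.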
It is interesting to note that at lower resolutions, many fine details in the scene are missing from the depth map. At higher resolutions, the model can generate high-frequency details, but inconsistencies in the overall structure begin to arise. In other words, the input resolution controls a tradeoff between structural consistency and high-frequency detail.
The authors explain this behavior by pointing out two limitations of convolutional neural networks:
1. Limited network capacity: this bounds the amount of detail that can be captured at low resolutions.
2. Limited receptive field size: at high resolution, some points do not receive sufficient contextual information, resulting in structural inconsistencies.
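The second limitation is easy to make concrete: a CNN's receptive field is fixed in pixels by its architecture, so as the input resolution grows, each output point sees a shrinking fraction of the scene. The standard recurrence for the receptive field of a stack of convolutions (layer specs here are illustrative, not taken from any particular depth model) is:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked conv layers.

    `layers` is a list of (kernel_size, stride) pairs. Uses the
    standard recurrence: each layer widens the field by (k - 1)
    times the accumulated stride ("jump") of the layers before it.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two 3x3 convs with stride 1 see a 5x5 input patch:
print(receptive_field([(3, 1), (3, 1)]))  # -> 5
```

Whatever this number is for a given network, it does not change when the input image does, which is why feeding in a higher-resolution image starves each point of context.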
The authors have trained an image-to-image network to merge the low-resolution depth maps with the higher-resolution ones, so that the result retains the high-frequency details of the high-resolution map without inheriting its structural inconsistencies.
Below we have compared Manydepth[3], MiDaS[2], and Boosted MiDaS.