27 February 2025
13:30 Doctoral defense By video conference
Topic
An Effective Approach for Self-Supervised Stereo Depth Estimation
Student
Alexandre Ribeiro Lopes
Advisor / Teacher
Hélio Pedrini - Co-advisor: Roberto Medeiros de Souza
Brief summary
Depth estimation is a crucial component of understanding the three-dimensional (3D) geometry of a scene. In recent years, convolutional neural networks have opened up new possibilities in this field. Depth estimation systems are typically categorized as stereo or monocular, take images or video frames as input, and can be trained using supervised or unsupervised techniques. Unsupervised approaches emerged because depth sensors are expensive and because refining the reference depth maps they generate into training data is laborious. Unsupervised or self-supervised methods offer significant advantages for several commercial applications, such as self-driving car systems, drones, and other autonomous vehicles, mainly because they eliminate the need for expensive sensors to build dense depth maps of the environment. Additionally, relying solely on images allows companies to use pre-existing image databases to train their models. For example, car manufacturers focused on autonomous vehicles can readily benefit from these techniques, as they already have thousands of camera-equipped vehicles in production, providing an abundant source of data for fine-tuning self-supervised algorithms without human intervention. In this context, self-supervised stereo depth estimation stands out as an effective solution, as it eliminates the need for depth sensors and instead uses a known camera system to estimate the disparity or depth of each element captured by the camera pair. This approach simplifies the problem compared to systems that rely on monocular sequences and generally achieves superior performance. Recent research presents Transformer-based architectures, which achieve state-of-the-art metric results but have long execution times. Consequently, most of these models are impractical for real-world applications.
In this thesis, we propose a novel self-supervised convolutional approach that outperforms existing convolutional neural networks and Transformer models while balancing computational cost. The proposed architecture, named CCNeXt, integrates a state-of-the-art feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a redesign of the depth estimation decoder. Our experiments show that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18x faster than the current best model. We also achieve superior results across all metrics on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques.
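As a brief illustration of the "disparity or depth" relationship mentioned in the summary: for a rectified stereo pair with known focal length and baseline, depth follows from disparity through the standard pinhole relation depth = focal length x baseline / disparity. The sketch below is not taken from the thesis; the function name and the numeric values are hypothetical and chosen only for illustration.

```python
def disparity_to_depth(disparity_px: float,
                       focal_length_px: float,
                       baseline_m: float) -> float:
    """Convert a disparity (in pixels) to metric depth (in meters).

    Assumes a rectified stereo pair, so that corresponding points lie on
    the same image row and depth = f * B / d holds.
    """
    if disparity_px <= 0:
        # Zero (or invalid) disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 0.5 m baseline.
depth_m = disparity_to_depth(disparity_px=50.0,
                             focal_length_px=700.0,
                             baseline_m=0.5)
print(depth_m)
```

Because depth is inversely proportional to disparity, small disparity errors on distant objects translate into large depth errors, which is one reason these methods are usually evaluated on depth metrics rather than on raw disparity.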
Examination Board
Full members:
Hélio Pedrini IC / UNICAMP
David Menotti Gomes INF/UFPR
Ronaldo Cristiano Prati CMCC / UFABC
Marcelo da Silva Reis IC / UNICAMP
Rafael de Oliveira Werneck IC / UNICAMP
Substitutes:
Alexandre Mello Ferreira EEP
William Robson Schwartz DCC / UFMG
Moacir Antonelli Ponti ICMC / USP