IMAGE AND VISION COMPUTING, vol.163, 2025 (SCI-Expanded, Scopus)
Deep learning-based monocular visual odometry (MVO) has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications on resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines a three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method's stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information, specifically depth and semantic maps, to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance (8 ms per frame on GPU and 12 ms on CPU). Compared to the existing method with the lowest translational drift (t_rel), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). In contrast, compared to the lightest model in terms of parameter count, DeepDCT-VO reduces t_rel from 8.57% to 1.69%, an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly for embedded and resource-limited applications, while the proposed transformation serves an auxiliary role in reducing translational complexity.
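To illustrate the general idea of estimating translation in a local, rotation-aligned frame rather than in global coordinates, the sketch below shows one common way such a change of frame can be computed. It is not the paper's DCT formulation: the function name local_directional_translation, the choice of the previous pose's rotation as the reference frame, and the toy poses are assumptions introduced here for illustration only.

```python
# Minimal sketch (not the authors' code): express the frame-to-frame
# translation in the previous camera's local frame instead of global XYZ.
# Names and the specific choice of reference frame are hypothetical.
import numpy as np

def local_directional_translation(R_prev: np.ndarray,
                                  t_prev: np.ndarray,
                                  t_curr: np.ndarray) -> np.ndarray:
    """Rotate the global displacement between two consecutive camera
    positions into the local frame of the previous pose (one possible
    reading of 'local directional motion')."""
    delta_global = t_curr - t_prev          # displacement in world coordinates
    delta_local = R_prev.T @ delta_global   # re-express it in the previous camera frame
    return delta_local

# Toy usage: previous pose yawed 90 degrees, then a 1 m step along world x.
theta = np.pi / 2
R_prev = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                   [ 0.0,           1.0, 0.0          ],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
t_prev = np.array([0.0, 0.0, 0.0])
t_curr = np.array([1.0, 0.0, 0.0])
print(local_directional_translation(R_prev, t_prev, t_curr))
```

In such a formulation, the network regresses motion relative to the camera's own heading, so errors do not need to be mapped back through an accumulated global trajectory at training time, which is consistent with the drift-reduction motivation stated in the abstract.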