Abstract: Speckle noise, low signal-to-noise ratio, and the multi-scale scattering characteristics of targets in Synthetic Aperture Radar (SAR) images degrade detection accuracy and cause small targets to be missed. To address these problems, this paper proposes XMNet, a lightweight detection model that balances feature representation capability and real-time performance. XMNet incorporates an improved single-head vision Transformer into the backbone network to strengthen contextual semantic correlations through global attention mechanisms. A cross-layer multi-path aggregation network serves as the neck structure, integrating dynamic upsampling and a parallel multi-scale convolution module to optimize multi-scale feature representation. An additional high-resolution detection layer exploits shallow high-resolution features to improve the capture of small-target detail. Experiments on the MSAR-1.0 dataset show that XMNet achieves a mean average precision of 90.4% across all categories, an increase of 8.7% over the baseline model. Detection accuracy for small aircraft targets improves by 20.1%, at the cost of only 2 million additional parameters, while inference runs at 185 FPS. Compared with nine advanced methods, including FCOS and CenterNet, XMNet ranks first on comprehensive metrics that weigh detection accuracy against computational efficiency. Through its cross-layer attention mechanisms and multi-scale feature fusion, XMNet addresses the challenge of preserving features of multi-scale targets while maintaining real-time processing in SAR imagery. Its lightweight design and high detection accuracy make it a viable, engineering-ready solution for real-time remote sensing monitoring on various SAR platforms, with particular advantages in complex scenes containing dense small targets.