Academic Journal of Computing & Information Science, 2026, 9(3); doi: 10.25236/AJCIS.2026.090307.

Cross-Attention Dual-Stream Network for Dense Blob Detection: A Multi-Scale and Edge-Aware Approach

Author(s)

Lujian Song1, Jin Lu2

Corresponding Author:
Jin Lu
Affiliation(s)

1College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, China

2College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, China

Abstract

Dense blob detection in high-resolution images presents significant challenges due to overlapping receptive fields, scale variations, and ambiguous boundaries between adjacent structures. While existing deep learning methods have achieved remarkable progress in general feature detection, they struggle to maintain discrimination capability in dense blob configurations where spatial proximity creates feature interference. This paper introduces a novel Cross-Attention Dual-Stream Network (CADSN) that addresses these challenges through complementary processing pathways: a Multi-Scale Feature Stream (MSFS) that captures blob appearance across hierarchical resolutions, and an Edge-Aware Stream (EAS) that explicitly encodes boundary information for precise localization. Unlike conventional fusion strategies, we propose a Cross-Stream Attention Mechanism (CSAM) that enables bidirectional information exchange between streams, allowing edge cues to guide multi-scale feature selection while appearance features refine boundary predictions. The architecture incorporates a Scale-Adaptive Pyramid Pooling module for handling extreme scale variations and a Contrastive Blob Discrimination loss that explicitly maximizes inter-blob separability while minimizing intra-blob variance. Extensive experiments demonstrate superior performance: 75.8% repeatability on HPatches, 82.7% adjacent blob discrimination accuracy, and 0.88-pixel localization error. Cross-domain evaluations on medical cell imaging and industrial defect detection validate practical applicability. Our architecture establishes a new paradigm for dense blob detection by synergistically combining multi-scale appearance modeling with explicit boundary awareness through learnable cross-stream interactions.
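The bidirectional exchange performed by the Cross-Stream Attention Mechanism (CSAM) can be illustrated with a minimal NumPy sketch. This is an assumed, simplified formulation, not the paper's implementation: learned query/key/value projections, multi-head structure, and normalization layers are omitted, and the function names are hypothetical. Each stream queries the other, so edge cues re-weight multi-scale appearance features while appearance features refine the edge stream, with residual connections preserving each stream's own signal.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_stream_attention(f_ms, f_edge):
    """Simplified bidirectional cross-attention between two streams.

    f_ms:   (N, d) multi-scale appearance features, one row per location
    f_edge: (N, d) edge-aware features at the same N locations
    Returns the two refined streams (same shapes as the inputs).
    """
    scale = 1.0 / np.sqrt(f_ms.shape[1])
    # Edge stream attends to the multi-scale stream (queries = edge).
    attn_e2m = softmax(f_edge @ f_ms.T * scale, axis=-1)
    edge_out = f_edge + attn_e2m @ f_ms      # residual connection
    # Multi-scale stream attends to the edge stream (queries = appearance).
    attn_m2e = softmax(f_ms @ f_edge.T * scale, axis=-1)
    ms_out = f_ms + attn_m2e @ f_edge        # residual connection
    return ms_out, edge_out

# Toy usage: 5 blob locations with 8-dimensional features per stream.
rng = np.random.default_rng(0)
ms, edge = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
ms_out, edge_out = cross_stream_attention(ms, edge)
print(ms_out.shape, edge_out.shape)
```

The residual form keeps each stream's original features intact while mixing in cross-stream context, which is one common way such fusion modules avoid one stream overwriting the other.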

Keywords

Dense Blob Detection, Cross-Attention Mechanism, Multi-Scale Feature Learning, Edge-Aware Processing, Dual-Stream Architecture, Contrastive Learning

Cite This Paper

Lujian Song, Jin Lu. Cross-Attention Dual-Stream Network for Dense Blob Detection: A Multi-Scale and Edge-Aware Approach. Academic Journal of Computing & Information Science (2026), Vol. 9, Issue 3: 54-62. https://doi.org/10.25236/AJCIS.2026.090307.
