Academic Journal of Computing & Information Science, 2025, 8(7); doi: 10.25236/AJCIS.2025.080711.

Cross-Modal Adaptive Fusion and Enhancement Model for Noise-Robust Scenarios

Author(s)

Wenhui Zhang¹, Qianxi Li¹

Corresponding Author:
Wenhui Zhang
Affiliation(s)

¹ School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400074, China

Abstract

With the rapid advancement of computer technology, multi-modality has emerged as a critical area of research. The fusion and alignment of multi-modal data not only raise the intelligence level of Internet of Things (IoT) devices but also provide users with richer and more precise service experiences. However, most existing studies focus on handling only two to three modalities, which often proves inadequate in complex and dynamic real-world scenarios. To address this limitation, this paper conducts an in-depth investigation into multi-modal learning, aiming to move beyond the current limits on the number of modalities that can be handled jointly. In practical applications, coordinating multiple modalities remains a significant challenge, particularly in dynamic environments where noise can cause modality dominance to fluctuate. Consequently, achieving effective multi-modal fusion and alignment has become a key research challenge. This paper proposes a novel multi-modal fusion framework that emphasizes both inter-modal complementarity and collaboration, and introduces a modality enhancement mechanism designed to mitigate noise interference across modalities. Experimental results on four benchmark datasets validate the effectiveness of the proposed method.
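
The abstract motivates adaptive fusion in which a modality's contribution shrinks when it becomes noisy, since modality dominance fluctuates in dynamic environments. As a purely illustrative sketch (the class name, gating formulation, and dimensions below are assumptions for exposition, not the authors' actual architecture), one way to realize noise-aware, gated fusion over an arbitrary number of modality embeddings in PyTorch is:

```python
# Illustrative sketch only: gated fusion over N modality embeddings, where a
# learned per-modality reliability score down-weights noisy modalities before
# they are combined. Names and formulation are assumptions, not the paper's model.
import torch
import torch.nn as nn


class GatedMultiModalFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # One reliability gate per modality, producing a scalar score per sample.
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid()) for _ in range(num_modalities)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: list of [batch, dim] embeddings, one tensor per modality.
        scores = torch.cat([g(f) for g, f in zip(self.gates, features)], dim=-1)  # [batch, M]
        weights = torch.softmax(scores, dim=-1)        # normalize reliability across modalities
        stacked = torch.stack(features, dim=1)         # [batch, M, dim]
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # reliability-weighted sum
        return self.proj(fused)


# Usage: fuse three 256-dim modality embeddings for a batch of 8 samples.
model = GatedMultiModalFusion(dim=256, num_modalities=3)
fused = model([torch.randn(8, 256) for _ in range(3)])
print(fused.shape)  # torch.Size([8, 256])
```

The softmax over per-modality reliability scores lets the fused representation shift weight away from a degraded modality at inference time, which is the intuition behind handling fluctuating modality dominance under noise.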

Keywords

Multi-Modal Learning, Cross-Modal Alignment, Modality Fusion, Contrastive Learning, Noise Robustness

Cite This Paper

Wenhui Zhang, Qianxi Li. Cross-Modal Adaptive Fusion and Enhancement Model for Noise-Robust Scenarios. Academic Journal of Computing & Information Science (2025), Vol. 8, Issue 7: 87-94. https://doi.org/10.25236/AJCIS.2025.080711.
