Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): A fine-tuned multimodal large model for power defect image-text question-answering

A fine-tuned multimodal large model for power defect image-text question-answering

Document type: Journal article


Authors	Wang, Qiqi 1; Zhang, Jie 2; Du, Jianming 2; Zhang, Ke 3; Li, Rui 2; Zhao, Feng 4; Zou, Le 1; Xie, Chengjun 2
Journal	SIGNAL IMAGE AND VIDEO PROCESSING
Publication date	2024-09-28
Keywords	Power system; Defect detection; Multimodal large model; LoRA; Q-former
ISSN	1863-1703
DOI	10.1007/s11760-024-03539-w
Corresponding authors	Zhang, Jie (zhangjie@iim.ac.cn); Xie, Chengjun (cjxie@iim.ac.cn)
Abstract	In power defect detection, the complexity of scenes and the diversity of defects pose challenges for manual defect identification. Considering these issues, this paper proposes utilizing a multimodal large model to assist power professionals in identifying power scenes and defects through image-text interactions, thereby enhancing work efficiency. This paper presents a fine-tuned multimodal large model for power defect image-text question-answering, addressing challenges such as training difficulties and the lack of image-text knowledge specific to power defects. This paper utilizes the YOLOv8 to create a dataset for multimodal power defect detection, enriching the image-text information in the power defect domain. By integrating the LoRA and Q-Former methods for model fine-tuning, the algorithm enhances the extraction of visual and semantic features and aligns visual and semantic information. The experimental results demonstrate that the proposed multimodal large model significantly outperforms other popular multimodal models in the domain of power defect question-answering.
Funding projects	Major Science and Technology Project of Anhui Province [202203a05020023]; Anhui Provincial Natural Science Foundation [2108085UD12]
WOS research areas	Engineering; Imaging Science & Photographic Technology
Language	English
WOS accession number	WOS:001320983900001
Publisher	SPRINGER LONDON LTD
Funding organizations	Major Science and Technology Project of Anhui Province; Anhui Provincial Natural Science Foundation
Source URL	[http://ir.hfcas.ac.cn:8080/handle/334002/135622]
Collection	Hefei Institutes of Physical Science, Chinese Academy of Sciences
Corresponding authors	Zhang, Jie; Xie, Chengjun
Author affiliations	1. Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei 230601, Peoples R China; 2. Chinese Acad Sci, Inst Intelligent Machines, Hefei Inst Phys Sci, Hefei 230031, Peoples R China; 3. Anhui Nari Jiyuan Power Grid Technol Co Ltd, Hefei 230088, Peoples R China; 4. Univ Sci & Technol China, Hefei 230026, Peoples R China
Recommended citation (GB/T 7714)	Wang, Qiqi,Zhang, Jie,Du, Jianming,et al. A fine-tuned multimodal large model for power defect image-text question-answering[J]. SIGNAL IMAGE AND VIDEO PROCESSING,2024.
APA	Wang, Qiqi.,Zhang, Jie.,Du, Jianming.,Zhang, Ke.,Li, Rui.,...&Xie, Chengjun.(2024).A fine-tuned multimodal large model for power defect image-text question-answering.SIGNAL IMAGE AND VIDEO PROCESSING.
MLA	Wang, Qiqi,et al."A fine-tuned multimodal large model for power defect image-text question-answering".SIGNAL IMAGE AND VIDEO PROCESSING (2024).

Ingestion method: OAI harvesting

Source: Hefei Institutes of Physical Science
