AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
Document type: Journal article
| Authors | Ji, Huawei (1); Deng, Cheng (1); Xue, Bo (1); Jin, Zhouyang (1); Ding, Jiaxin (1); Gan, Xiaoying (1); Fu, Luoyi (1); Wang, Xinbing (1); Zhou, Chenghu (2) |
| Publication | 2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) |
| Publication date | 4565 |
| Volume | N/A |
| Page | 6007 |
| Keywords | Academic literature parsing ; Benchmark dataset ; Vision-language model ; Data-centric |
| ISSN | 1520-6149 |
| DOI | 10.1109/ICASSP49660.2025.10889977 |
| Institutional ownership rank | 2 |
| Document subtype | Proceedings Paper |
| Abstract (English) | With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse. |
| WOS research areas | Acoustics ; Computer Science ; Engineering |
| Language | English |
| WOS accession number | WOS:001611517600751 |
| Publisher | IEEE |
| Source URL | http://ir.igsnrr.ac.cn/handle/311030/219421 |
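| Collection | State Key Laboratory of Resources and Environmental Information System_Foreign-language papers |
| Corresponding author | Ding, Jiaxin |
| Author affiliations | 1. Shanghai Jiao Tong Univ, Shanghai, Peoples R China; 2. Chinese Acad Sci, Inst Geog Sci & Nat Resources Res, Beijing, Peoples R China |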
| Recommended citation (GB/T 7714) | Ji, Huawei,Deng, Cheng,Xue, Bo,et al. AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing[J]. 2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP),4565,N/A:6007. |
| APA | Ji, Huawei.,Deng, Cheng.,Xue, Bo.,Jin, Zhouyang.,Ding, Jiaxin.,...&Zhou, Chenghu.(4565).AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing.2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP),N/A,6007. |
| MLA | Ji, Huawei,et al."AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing".2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) N/A(4565):6007. |
Ingest method: OAI harvesting
Source: Institute of Geographic Sciences and Natural Resources Research

