Poster #DD03




ComProScanner: Multi-agent based composition property data extraction framework

A. Roy, E. Grisan, J. Buckeridge, C. Gattinoni



Modern data-driven materials discovery relies heavily on large, structured databases of material compositions and properties; however, most information on experimentally synthesised materials lies buried within millions of scientific articles. Large language models and agents have now made it possible to extract structured knowledge from scientific text. Despite several approaches designed for this aim, none achieves highly accurate extraction of composition and property data (the bare minimum for data-driven methods) to create machine learning-ready databases without human assistance. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties for comprehensive database creation. ComProScanner is a publisher-to-database framework that incorporates publisher APIs, removing the need to upload papers manually, and it can scan thousands of papers without human intervention. We evaluated our framework on 100 journal articles using 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions of ceramic piezoelectric materials and their corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all other models, with an overall accuracy of 0.82. Even with this small sample of articles, the vast majority of the piezoelectric materials we extracted are not included in commonly available databases, and we identified one system with a notably high piezoelectric coefficient. The framework provides a simple, user-friendly, ready-to-use package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.






Aritra Roy

South Bank University of London · Department of Chemical Process and Energy Engineering · London (UK)