The DECIMER.ai Project

  • Over the past few decades, the number of publications describing chemical structures and their metadata has increased significantly. Chemists have published the majority of this information as bitmap images along with other important information as human-readable text in printed literature and have never been retained and preserved in publicly available databases as machine-readable formats. Manually extracting such data from printed literature is error-prone, time-consuming, and tedious. The recognition and translation of images of chemical structures from printed literature into machine-readable format is known as Optical Chemical Structure Recognition (OCSR). In recent years, deep-learning-based OCSR tools have become increasingly popular. While many of these tools claim to be highly accurate, they are either unavailable to the public or proprietary. Meanwhile, the available open-source tools are significantly time-consuming to set up. Furthermore, none of these offers an end-to-end workflow capable of detecting chemical structures, segmenting them, classifying them, and translating them into machine-readable formats. To address this issue, we present the DECIMER.ai project, an open-source platform that provides an integrated solution for identifying, segmenting, and recognizing chemical structure depictions within the scientific literature. DECIMER.ai comprises three main components: DECIMER-Segmentation, which utilizes a Mask-RCNN model to detect and segment images of chemical structure depictions; DECIMER-Image Classifier EfficientNet-based classification model identifies which images contain chemical structures and DECIMER-Image Transformer which acts as an OCSR engine which combines an encoder-decoder model to convert the segmented chemical structure images into machine-readable formats, like the SMILES string. A key strength of DECIMER.ai is that its algorithms are data-driven, relying solely on the training data to make accurate predictions without any hand-coded rules or assumptions. By offering this comprehensive, open-source, and transparent pipeline, DECIMER.ai enables automated extraction and representation of chemical data from unstructured publications, facilitating applications in chemoinformatics and drug discovery.

Export metadata

Metadaten
Author:Kohulan Rajan, Achim Zielesny, Christoph Steinbeck
Parent Title (English):Workshop Computer Vision for Science (CV4Science) at the IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), June 17, 2024, Seattle, USA
Document Type:Conference Proceeding
Language:English
Date of Publication (online):2024/07/15
Year of first Publication:2024
Publishing Institution:Westfälische Hochschule Gelsenkirchen Bocholt Recklinghausen
Release Date:2024/07/23
Departments / faculties:Institute / Institut für biologische und chemische Informatik
Licence (German):License LogoEs gilt das Urheberrechtsgesetz

$Rev: 13159 $