Also in Chemistry, Deep Learning Models Love Really Big Data

Rajan, Kohulan; Zielesny, Achim; Steinbeck, Christoph

Inspired by the super-human performance of deep learning models in playing the game of Go after being presented with virtually unlimited training data, we looked into areas of chemistry where a similar situation could be achieved. Large amounts of training data are still rare in chemistry, so we turned to two areas where realistic training data can be generated in large quantities, namely a) the recognition of chemical structures in images of chemical diagrams and their conversion into machine-readable form, and b) the conversion of IUPAC(-like) names into structures and vice versa. In this talk, we outline the challenges, technical implementation and results of this study.
Optical Chemical Structure Recognition (OCSR): Vast amounts of chemical information remain hidden in the primary literature and have yet to be curated into open-access databases. To automate the extraction of chemical structures from scientific papers, we developed the DECIMER.ai project. This open-source platform provides an integrated solution for identifying, segmenting, and recognising chemical structure depictions in the scientific literature. DECIMER.ai comprises three main components: DECIMER-Segmentation, which utilises a Mask-RCNN model to detect and segment images of chemical structure depictions; DECIMER-Image Classifier, an EfficientNet-based classification model that identifies which images contain chemical structures; and DECIMER-Image Transformer, the OCSR engine, which uses an encoder-decoder model to convert the segmented chemical structure images into machine-readable formats such as SMILES strings. DECIMER.ai is data-driven: it relies solely on the training data to make accurate predictions, without hand-coded rules or assumptions.
The latest DECIMER model was trained with 127 million structures and 483 million depictions (4 different per structure) on Google TPU-V4 VMs.
Name to Structure Conversion: The conversion of structures to IUPAC(-like) or systematic names has been solved satisfactorily by algorithmic, rule-based approaches. This, in turn, gave us the opportunity to generate name-structure training pairs at a very large scale, train a proof-of-concept transformer network, and evaluate its performance. In this work, the largest model was trained using almost one billion SMILES strings. The Lexichem software utility from OpenEye was employed to generate the IUPAC names used in the training process. STOUT V2 was trained on Google TPU-V4 VMs. The model's accuracy was validated through one-to-one string matching, BLEU scores, and Tanimoto similarity calculations. To further verify the model's reliability, every IUPAC name generated by STOUT V2 was analysed for accuracy and retranslated using OPSIN, a widely used open-source software for converting IUPAC names to SMILES. This additional validation step confirmed the high fidelity of STOUT V2's translations.
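
Two of the validation measures named above, one-to-one string matching and Tanimoto similarity, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that fingerprints are available as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and the function names `exact_match_rate` and `tanimoto` as well as the toy data are hypothetical.

```python
from typing import Sequence, Set


def exact_match_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predicted strings that are identical to their reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Defined as |A intersect B| / |A union B|.
    """
    union = bits_a | bits_b
    if not union:
        # Convention assumed for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical toy data: reference names versus model-predicted names.
reference_names = ["ethanol", "propan-2-ol", "benzene"]
predicted_names = ["ethanol", "propan-1-ol", "benzene"]
print(exact_match_rate(predicted_names, reference_names))  # 2 of 3 names match

# Hypothetical fingerprints of an input structure and of the structure
# recovered after a name -> structure round trip.
fp_original = {1, 4, 7, 9}
fp_roundtrip = {1, 4, 7, 12}
print(tanimoto(fp_original, fp_roundtrip))  # 3 shared bits / 5 bits in total
```

Exact string matching only rewards predictions that are character-for-character identical, whereas Tanimoto similarity on the retranslated structures gives partial credit to names that describe a closely related molecule, which is presumably why both kinds of measure are reported.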

Author(s):	Kohulan Rajan, Achim Zielesny, Christoph Steinbeck
Title of parent work (English):	Beilstein Bozen Symposium 2024 - AI in Chemistry and Biology: Evolution or Revolution?, Rüdesheim, Germany
Document type:	Conference publication
Language:	English
Date of publication (online):	15.07.2024
Year of first publication:	2024
Publishing institution:	Westfälische Hochschule Gelsenkirchen Bocholt Recklinghausen
Release date:	23.07.2024
License (German):	German copyright law (Urheberrechtsgesetz) applies
