CLOUD: A Scientific Foundation Model for Crystal Property Prediction

Spotlight Paper at MoML 2024

Abstract

Property prediction of crystals is crucial for material design. However, developing machine learning models for these tasks is hampered by the need for labeled data from costly experiments or Density Functional Theory (DFT) calculations, which limits dataset size and leads to poor generalization to new crystals. Foundation models (FMs) present a potential solution: self-supervised pretraining on unlabeled datasets yields better representations and improved transferability. Yet applying FMs to crystals is challenging because of the limited number of valid structures available for pretraining and the inability of existing representations to capture critical structural information such as symmetry. Herein, we propose the CrystaL fOUnDation model (CLOUD), a Transformer-based foundation model for crystal property prediction. CLOUD uses a novel symmetry-aware string representation that efficiently encodes symmetry, equivalent sites, and constituent atoms, eliminating the need for coordinate information or equivariant models. Pretrained on million-scale crystal data from various databases via Masked Language Modeling (MLM), CLOUD is then fine-tuned and assessed on eight MatBench datasets. The model not only significantly outperforms structure-agnostic models and achieves near state-of-the-art results on two datasets, but also demonstrates robust scaling with data and model size. This suggests CLOUD’s potential as a scalable solution for crystal foundation models, capable of learning from billions of unlabeled crystal structures.
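
As a rough illustration of the MLM pretraining objective mentioned above, the following minimal sketch corrupts a tokenized crystal string and records the targets a model would be trained to recover. The token layout (space group number followed by element and Wyckoff-position tokens, here for rock-salt NaCl), the helper name `mask_tokens`, and the BERT-style 80/10/10 replacement rule are assumptions made for illustration only; the abstract does not specify CLOUD's actual string format or masking scheme.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    Assumption: BERT-style rule (80% [MASK], 10% random token, 10% kept).
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical tokenization of NaCl: space group 225, Na on 4a, Cl on 4b.
tokens = ["225", "Na", "4a", "Cl", "4b"]
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, rng=random.Random(0))
```

In an actual training loop the loss would be computed only at positions where `labels` is not `None`, so the model learns to infer masked symmetry, site, or element tokens from their context.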