CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning

Published in Nature Communications, 2026

Paper available here


Abstract

Predicting crystal properties is essential for understanding structure-property relationships and accelerating materials discovery. However, conventional approaches such as experimental measurements or density functional theory calculations are resource-intensive, limiting their scalability. While machine learning offers a promising alternative by learning complex structure-property relationships from data, existing models often rely on labeled data, adopt representations that insufficiently capture essential structural characteristics, and lack integration of physics, limiting their generalizability and interpretability. Here, we introduce CLOUD (Crystal Language mOdel for Unified and Differentiable materials modeling), a transformer-based framework trained on a Symmetry-Consistent Ordered Parameter Encoding (SCOPE) that encodes crystal symmetry, Wyckoff positions, and composition in a compact, coordinate-free string representation. Pre-trained on over six million crystals, CLOUD is fine-tuned on downstream tasks and achieves competitive performance across diverse material properties, demonstrating strong scaling with respect to both data and model size. Furthermore, as a proof of concept of differentiable materials modeling, CLOUD is applied to predict phonon-related properties by integrating with the Debye model. This approach enforces thermodynamic consistency and enables temperature-dependent property prediction without requiring additional data. These results demonstrate CLOUD’s potential as a scalable and physics-informed foundation model for crystalline materials, unifying symmetry-consistent representations with physics-grounded learning for property prediction and materials discovery.
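To make the idea of a coordinate-free string representation concrete, here is a hypothetical illustration of the kind of information SCOPE encodes (space group, Wyckoff sites, composition). The token format below is invented for this sketch; the actual SCOPE tokenization is defined in the paper.

```python
def encode_scope_like(spacegroup: int, sites: list[tuple[str, str]]) -> str:
    """Build a coordinate-free string from symmetry + Wyckoff + composition.

    Hypothetical token format for illustration only; it is NOT the paper's
    SCOPE encoding. Each site is an (element, Wyckoff letter) pair, so no
    fractional atomic coordinates appear in the representation.
    """
    tokens = [f"spg_{spacegroup}"]
    for element, wyckoff in sites:
        tokens.append(f"{element}_{wyckoff}")
    return " ".join(tokens)

# Rock-salt NaCl: space group Fm-3m (No. 225), Na on Wyckoff 4a, Cl on 4b
print(encode_scope_like(225, [("Na", "4a"), ("Cl", "4b")]))
# → spg_225 Na_4a Cl_4b
```

Because symmetry fixes the free parameters of special Wyckoff positions, such a string can describe a high-symmetry structure compactly without any explicit coordinates.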
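As background for the Debye-model integration mentioned above, the sketch below evaluates the standard Debye heat capacity, C_V = 9 N k_B (T/θ_D)^3 ∫₀^{θ_D/T} x⁴eˣ/(eˣ−1)² dx. This is textbook physics, not the paper's implementation; in CLOUD the Debye temperature θ_D would be a model-predicted quantity, whereas here it is passed in directly.

```python
import math

def debye_heat_capacity(T: float, theta_D: float, n_atoms: int = 1,
                        n_steps: int = 10000) -> float:
    """Debye-model heat capacity C_V in units of k_B.

    Standard formula: C_V = 9 N k_B (T/theta_D)^3 * I, with
    I = integral_0^{theta_D/T} x^4 e^x / (e^x - 1)^2 dx,
    evaluated here with a simple midpoint rule. theta_D is the Debye
    temperature (in CLOUD this would come from the model's prediction).
    """
    if T <= 0.0:
        return 0.0
    x_max = theta_D / T
    dx = x_max / n_steps
    total = 0.0
    for i in range(n_steps):
        x = (i + 0.5) * dx  # midpoint rule avoids the x = 0 singularity
        ex = math.exp(x)
        total += x**4 * ex / (ex - 1.0) ** 2 * dx
    return 9.0 * n_atoms * (T / theta_D) ** 3 * total

# High-temperature limit approaches Dulong-Petit (3 k_B per atom);
# low temperatures follow the characteristic T^3 behavior.
print(debye_heat_capacity(T=1000.0, theta_D=300.0))
```

Because the whole expression is smooth in θ_D and T, it is differentiable end-to-end, which is what allows a neural predictor of θ_D to be trained through such a physics layer while keeping the resulting thermodynamic quantities mutually consistent.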