Diffusion Language Models Are Versatile Protein Learners

Published in ICML, 2024

DPLM is a protein language model that unifies generation and understanding. It uses discrete diffusion, which gives the model a global receptive field and makes it well suited to modeling 3D spatial dependencies among amino acids. After generative pre-training on ~45M protein sequences, DPLM achieves state-of-the-art protein sequence generation performance, outperforms ESM2 on protein understanding benchmarks, supports conditional and classifier-guided generation, and scales effectively across model sizes of 150M, 650M, and 3B parameters.
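As a rough illustration of what discrete diffusion over a protein sequence can look like, here is a minimal sketch of a forward corruption step. It assumes an absorbing-state (masking) formulation, which the summary above does not specify, and it is not taken from the DPLM codebase. The `MASK` symbol, the `corrupt` and `masked_positions` helpers, and the example sequence are all hypothetical.

```python
import random

MASK = "#"  # hypothetical mask symbol, not DPLM's actual token


def corrupt(sequence: str, t: float, rng: random.Random) -> str:
    """Forward corruption (assumed absorbing-state form).

    Each residue is replaced by MASK independently with probability t,
    so t = 0 leaves the sequence intact and t = 1 masks every position.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return "".join(MASK if rng.random() < t else residue for residue in sequence)


def masked_positions(noisy: str) -> list[int]:
    """Indices that a denoiser would have to fill in."""
    return [i for i, residue in enumerate(noisy) if residue == MASK]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    for t in (0.0, 0.5, 1.0):
        noisy = corrupt(seq, t, rng)
        print(f"t={t}: {noisy} ({len(masked_positions(noisy))} masked)")
```

In a setup like this, a denoising network would be trained to predict the original residues at the masked positions while attending to every unmasked position at once, which is one way to read the "global receptive field" described above.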

Code: https://github.com/bytedance/dplm

Keywords: diffusion protein language models, protein sequence generation, protein representation learning, controllable protein design

Recommended citation: Xinyou Wang*, Zaixiang Zheng*, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. (2024). "Diffusion Language Models Are Versatile Protein Learners." Proceedings of the 41st International Conference on Machine Learning, 52309-52333.