ESM Protein Language Models

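The ESM family treats protein sequences as a "language" and uses large Transformers to learn its "grammar". The core idea: given a large corpus of unlabeled protein sequences, the model learns on its own what makes a protein sequence "plausible".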
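
To make "learning the grammar" concrete: ESM-style models are trained with a masked-language-modeling objective, where some residues are hidden and the model has to recover them from the surrounding context. The sketch below only builds one such training example and does not train anything. The 15% masking rate and the helper name `make_masked_example` are illustrative assumptions rather than something stated in these notes, and real training pipelines typically also leave some selected residues unchanged or swap them for random ones.

```python
import random

MASK = "<mask>"  # mask token string used by the ESM tokenizers


def make_masked_example(sequence, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for one masked-LM training example.

    targets maps each masked position to the residue the model must recover.
    mask_rate=0.15 is an assumed, BERT-style default.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens, targets = make_masked_example(sequence)
# Show masked positions as "_" so the example stays readable.
print("".join("_" if t == MASK else t for t in tokens))
print(targets)
```

The training loss is then the cross-entropy between the model's predicted residue distribution at each masked position and the true residue stored in `targets`.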

Core models

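Read the PLM section of Greener 2022
  → Work through the Hugging Face ESM tutorial
  → Run ESM-2 embedding extraction once (demo)
  → Learn mutation-effect prediction (zero-shot)
  → Learn fine-tuning (if needed)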
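
The embedding-extraction step in the path above can be sketched with the Hugging Face `transformers` ESM classes. This is a minimal sketch, not the tutorial's exact code: the checkpoint `facebook/esm2_t6_8M_UR50D` is chosen only because it is small, the function name `embed_sequences` is hypothetical, and mean-pooling over residues is an assumed (though common) way to get one vector per protein.

```python
MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # a small ESM-2 checkpoint; swap in a larger one if needed


def embed_sequences(sequences, model_name=MODEL_NAME):
    """Return one mean-pooled embedding (a 1-D tensor) per protein sequence."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmModel.from_pretrained(model_name)
    model.eval()

    embeddings = []
    with torch.no_grad():
        for seq in sequences:
            inputs = tokenizer(seq, return_tensors="pt")
            hidden = model(**inputs).last_hidden_state  # (1, len(seq) + 2, hidden_size)
            # Drop the <cls> and <eos> tokens, then average over residues.
            embeddings.append(hidden[0, 1:-1].mean(dim=0))
    return embeddings


# Usage (downloads the checkpoint on the first call):
#   vecs = embed_sequences(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"])
#   print(vecs[0].shape)
```

Processing one sequence at a time avoids having to mask out padding tokens; for large datasets, batching with `padding=True` and an attention-mask-weighted mean would be faster.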

📘 Lin et al. (2023) — Evolutionary-scale prediction of atomic-level protein structure — Science

📘 Rives et al. (2021) — Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences — PNAS

📘 Hayes et al. (2024) — Simulating 500 million years of evolution with a language model

Hugging Face ESM tutorial: https://huggingface.co/docs/transformers/model_doc/esm
facebookresearch/esm: https://github.com/facebookresearch/esm (official repository)
AI Technology | ESM3: When a Multimodal Protein Language Model Meets Scaling Laws (Great Bay Bio, in Chinese): https://www.greatbay-bio.com.cn/ndetail/130.html

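Task	Tool	How
Extract protein features	ESM-2 embeddings	A few lines of code with the GitHub esm library
Mutation-effect prediction	ESM-2 likelihoods	Compare the log-likelihoods of the wild type and the mutant
Structure prediction (no MSA)	ESMFold	Use the API or run it locally
Protein function annotation	Fine-tuned ESM	Fine-tune on a small labeled dataset
Protein generation	ProGen2	Generate new sequences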
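
The mutation-effect row of the table can be sketched as follows. The notes only say to compare wild-type and mutant log-likelihoods; the code below uses one common variant, the masked-marginal score, which masks the mutated position and compares the model's log-probabilities for the two residues there. The function names `parse_mutation` and `score_mutation`, the `A42G`-style label format, and the checkpoint choice are assumptions made for illustration.

```python
import re

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small checkpoint, chosen only to keep the demo light


def parse_mutation(mutation):
    """Split a label such as 'A42G' into (wild_type, zero_based_index, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation!r}")
    wt, pos, mt = match.groups()
    return wt, int(pos) - 1, mt


def score_mutation(sequence, mutation, model_name=MODEL_NAME):
    """Masked-marginal score: log p(mutant) - log p(wild type) at the masked site."""
    # Imported lazily so parse_mutation works without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, EsmForMaskedLM

    wt, idx, mt = parse_mutation(mutation)
    if not 0 <= idx < len(sequence):
        raise ValueError(f"position {idx + 1} is outside the sequence")
    if sequence[idx] != wt:
        raise ValueError(f"expected {wt} at position {idx + 1}, found {sequence[idx]}")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmForMaskedLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sequence, return_tensors="pt")
    token_idx = idx + 1  # shift by one for the leading <cls> token
    inputs["input_ids"][0, token_idx] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(**inputs).logits  # (1, len(sequence) + 2, vocab_size)
    log_probs = torch.log_softmax(logits[0, token_idx], dim=-1)

    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mt_id = tokenizer.convert_tokens_to_ids(mt)
    return (log_probs[mt_id] - log_probs[wt_id]).item()


# Usage (downloads the checkpoint on the first call):
#   score = score_mutation("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "K2R")
```

A negative score means the model finds the mutant residue less likely than the wild type in that context, which is usually read as a hint that the mutation is more likely to be deleterious; the score is zero-shot in the sense that no mutation labels are used.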