Data bias makes generalization difficult, and collecting a large enough data set is infeasible.
Under these conditions, we need ways to improve model performance.
Model capacity (# parameters)
- Optimal capacity
: Control hidden layers
: Gives some constraints, e.g., use a kernel
- A choice of model specifies which family of functions the learning algorithm can choose
-> representational capacity of the model.
Inductive bias (e.g., weight sharing) acts as regularization
- Weight sharing -> much smaller number of parameters -> low model capacity
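The effect of weight sharing on model capacity can be made concrete by counting parameters. A minimal sketch with hypothetical layer sizes (32x32 input, 3x3 kernel — illustrative numbers, not from the source):

```python
# Parameter counts illustrating weight sharing as an inductive bias.
# Layer sizes are hypothetical, chosen only for illustration.

def dense_params(in_dim, out_dim):
    # Fully connected layer: every input-output pair gets its own weight (+ biases).
    return in_dim * out_dim + out_dim

def conv_params(kernel_size, in_ch, out_ch):
    # Convolutional layer: one small kernel is shared across all positions (+ biases).
    return kernel_size * kernel_size * in_ch * out_ch + out_ch

# Mapping a 32x32 single-channel input to a 32x32 output:
n_dense = dense_params(32 * 32, 32 * 32)  # a weight per pixel pair
n_conv = conv_params(3, 1, 1)             # one shared 3x3 kernel
print(n_dense, n_conv)  # weight sharing cuts parameters by orders of magnitude
```

The shared kernel constrains the function family (translation equivariance), which is exactly the lower representational capacity the notes describe.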
Physics-informed GCN
- Limited data & Intrinsic data bias
- Practically impossible to acquire a sufficient amount of high-quality data
- But physics is universal (generalized over all systems)
- Let us use physics as an inductive bias (or constraints in learning process)
Empirical force field
- Total interaction energy between protein and ligand
$$ E_{tot} = \sum_{i,j} E_{ij} $$
where i and j are atom indices.
$$ E_{ij} = VDW + Hydrophobic + EDA, $$
- Physical formula for each energy component
$$ VDW = A\left( (B/r)^{12} - (B/r)^6 \right), $$
where r is the distance between the atoms, and A and B are predicted from the NN.
where R = r - D, C is a learnable scalar, and D is predicted from the NN.
- Deep learning based parameters
: Each parameter in the equations is obtained from NN trained with experimental data.
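The Lennard-Jones-style vdW term and the pairwise sum above can be sketched in plain Python. Here A and B are fixed placeholder numbers standing in for the NN-predicted parameters, and the standard form with both powers of B/r is assumed:

```python
def vdw_energy(r, A, B):
    # Lennard-Jones-style vdW term: A sets the energy scale, B the length
    # scale. In the model both are predicted by the NN; here they are
    # placeholder constants.
    return A * ((B / r) ** 12 - (B / r) ** 6)

def total_energy(distances, params):
    # E_tot = sum over atom pairs (i, j) of E_ij; each pair contributes
    # its own (A, B) parameters.
    return sum(vdw_energy(r, A, B) for r, (A, B) in zip(distances, params))
```

In this form the term vanishes at r = B and reaches its minimum of -A/4 at r = 2^(1/6) B, which is why the NN-predicted B behaves like a pairwise contact distance.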
Energy-wise constraint
Structural constraint
Data augmentation
: When data are scarce, a technique that increases the amount of data by transforming, processing, or generating data.
- Docking augmentation
$$ L_{docking} = \sum_i \max(y_{exp,i} - y_{decoy,i}, 0) $$
: The energy of the true binding pose should be lower than the predicted energy of decoy structures with wrong poses.
- Cross-screening augmentation: ligands from the PDBbind training set
$$ L_{cross-screening} = \sum_i \max(-y_{cross,i} - 6.8, 0) $$
: Actives of one target are expected to have low binding affinity to other targets.
- Random-screening augmentation: ligands from another dataset (STOCKS)
$$ L_{random-screening} = \sum_i \max(-y_{random,i} - 6.8, 0) $$
: Most randomly sampled molecules would have low binding affinity for a specific target.
Datasets
- PDBbind v.2019 refined set : 4,212 samples for the training set and 259 samples for the test set.
- Docking augmentation : generating 202,035 decoy structures using the PDBbind v.2016 dataset.
- Random screening augmentation : generating 773,623 complexes using the IBS molecules
- Cross screening augmentation : generating 386,876 complexes based on the random cross binding
- Benchmarking : CASF2016 & CSAR
Bayesian inference
- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H
- Using the posterior distribution, predictions from many sampled models are averaged to obtain the mean prediction, and their variance serves as the uncertainty.
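The mean-and-variance idea above can be sketched generically: given several stochastic forward passes (e.g. MC dropout or posterior weight samples — the sampling mechanism is assumed, not shown), average them for the prediction and use the spread as the uncertainty:

```python
def predictive_stats(model_samples, x):
    # Bayesian-style prediction: model_samples is a list of callables,
    # each representing one stochastic forward pass (one sample from
    # the approximate posterior). The mean is the prediction; the
    # variance quantifies the uncertainty.
    preds = [f(x) for f in model_samples]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var
```

A prediction with low variance across posterior samples can be trusted more, which is how the cited paper flags unreliable molecular property predictions.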
Self-supervised learning
- Strategies for Pre-training Graph Neural Networks, https://arxiv.org/abs/1905.12265
https://github.com/snap-stanford/pretrain-gnns
- Turn the data itself into labels and use them for training.
- Pretrained with 2 million molecules from ZINC15 and 456K from ChEMBL
- Model < 10k parameters
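One way the data itself becomes the label is attribute masking, one of the pretraining strategies in the cited paper: hide some node attributes and train the GNN to recover them. A toy sketch (feature encoding, mask token, and mask rate are all illustrative assumptions):

```python
import random

def make_masked_node_task(node_features, mask_rate=0.15, seed=0):
    # Self-supervised labels from the data itself: randomly mask node
    # attributes (e.g. atom types) and use the originals as targets.
    # "MASK" is a hypothetical placeholder token.
    rng = random.Random(seed)
    masked = list(node_features)
    labels = {}
    for i, feat in enumerate(node_features):
        if rng.random() < mask_rate:
            labels[i] = feat    # target = original attribute
            masked[i] = "MASK"  # input = masked placeholder
    return masked, labels
```

The model pretrained to fill in these masks learns chemistry-aware node representations before any experimental labels are used.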
Representation of 3D structure
- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, https://arxiv.org/abs/2104.13478
- https://geometricdeeplearning.com/
Reference
- Physics-informed GCN for virtual screening | 김우연, https://youtu.be/v9E1QYoXsKw
- https://github.com/jaechanglim/DTI_PDBbind
- 3D graph neural network, https://e3nn.org/, https://github.com/e3nn/e3nn