
Virtual screening 8 - Physics-informed GCN

by wycho 2021. 12. 13.

Data bias makes generalization difficult, and collecting a large data set is practically impossible.

In this situation, we need ways to improve model performance.

 

Model capacity (# parameters)

- Optimal capacity

  : Control hidden layers

  : Give some constraints, e.g., use a kernel

- The choice of model specifies which family of functions the learning algorithm can choose from

  -> representational capacity of the model.
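As a rough illustration of how hidden layers control capacity, the sketch below counts the parameters of a fully connected network. The layer sizes are hypothetical values chosen for illustration, not taken from the talk.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Hypothetical architectures: one vs. two hidden layers of width 32.
small = mlp_param_count([16, 32, 1])
large = mlp_param_count([16, 32, 32, 1])
print(small, large)
```

Adding one hidden layer roughly triples the parameter count here, which is the knob that "control hidden layers" refers to.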

 

Inductive bias = weight sharing = regularization

- Weight sharing -> much smaller number of parameters -> low model capacity

https://arxiv.org/abs/1806.01261
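To make the effect of weight sharing concrete, here is a minimal sketch comparing a dense layer with a 1D convolution that produces an output of the same shape. All sizes are assumptions made for illustration only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv1d_params(kernel_size, c_in, c_out):
    """Weights plus biases of a 1D convolution.

    The same kernel is reused at every position, so the count
    does not depend on the sequence length.
    """
    return kernel_size * c_in * c_out + c_out


# Hypothetical sizes chosen only for illustration.
seq_len, c_in, c_out, kernel_size = 100, 8, 8, 3

# Dense layer mapping the flattened input (100 * 8) to an output of the same size.
n_dense = dense_params(seq_len * c_in, seq_len * c_out)
# Convolution giving the same output shape (assuming padding) from shared weights.
n_conv = conv1d_params(kernel_size, c_in, c_out)
print(n_dense, n_conv)
```

The shared kernel needs a few hundred parameters where the dense layer needs several hundred thousand, which is the reduction in capacity described above.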

 

Physics-informed GCN

- Limited data & Intrinsic data bias

- Practically impossible to acquire a sufficient amount of high-quality data

- But physics is universal (generalized over all systems)

- Let us use physics as an inductive bias (or constraints in learning process)

 

Empirical force field

- Total interaction energy between protein and ligand

$$ E_{tot} = \sum_{i,j} E_{ij} $$

where i and j are atom indices.

$$ E_{ij} = VDW + Hydrophobic + EDA, $$

- Physical formula for each energy component

$$ VDW = A\left( (A/r)^{12} - (B/r)^6 \right), $$

where r is the distance between atoms, and A and B are predicted by the NN.

where R = r - D, C is a learnable scalar, and D is predicted by the NN.

- Deep learning based parameters

  : Each parameter in the equations is obtained from an NN trained with experimental data.
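The pairwise sum can be sketched as follows. This is only an illustration: it transcribes the VDW expression exactly as written in this post, omits the Hydrophobic and EDA terms because their formulas are not given here, and uses plain nested lists `a` and `b` as hypothetical stand-ins for the per-pair values that the NN would predict.

```python
import math


def vdw(r, a, b):
    """VDW term transcribed from the post: A * ((A / r)^12 - (B / r)^6)."""
    return a * ((a / r) ** 12 - (b / r) ** 6)


def total_energy(protein_xyz, ligand_xyz, a, b):
    """Sum the pairwise term over every (protein atom i, ligand atom j) pair.

    a[i][j] and b[i][j] are placeholders for NN-predicted parameters.
    """
    e_tot = 0.0
    for i, p in enumerate(protein_xyz):
        for j, q in enumerate(ligand_xyz):
            r = math.dist(p, q)  # Euclidean distance between the two atoms
            e_tot += vdw(r, a[i][j], b[i][j])
    return e_tot


# Toy coordinates and parameters, invented for illustration.
protein = [(0.0, 0.0, 0.0)]
ligand = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
a = [[1.0, 1.0]]
b = [[1.0, 1.0]]
e = total_energy(protein, ligand, a, b)
print(e)
```

In the actual model the parameters come from a trained network, so the whole sum stays differentiable with respect to the network weights; the loops here only show the structure of the calculation.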

 

Energy-wise constraint

https://youtu.be/v9E1QYoXsKw

Structural constraint

https://youtu.be/v9E1QYoXsKw

 

Data augmentation

: When data is scarce, a technique that increases the amount of data by transforming, processing, or generating data.

 

- Docking augmentation

$$ L_{\text{docking}} = \sum_i \max(y_{exp,i} - y_{decoy,i}, 0) $$

  : The energy of the true binding pose should be lower than the predicted energy of decoy structures with wrong poses.

 

- Cross-screening augmentation, ligands from PDBbind training set

$$ L_{\text{cross-screening}} = \sum_i \max(-y_{cross,i} - 6.8, 0) $$

  : Actives of one target would have low binding affinity to other targets.

 

- Random-screening augmentation, ligands from another dataset (STOCKS)

$$ L_{\text{random-screening}} = \sum_i \max(-y_{random,i} - 6.8, 0) $$

  : Most randomly sampled molecules would have low binding affinity for a specific target.
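The three augmentation losses above are hinge-style penalties and can be sketched in a few lines. The function names and sample values are made up for illustration; the sign convention (lower y means stronger binding) is inferred from the statement that the true pose should have lower energy, and the threshold 6.8 is taken from the formulas as-is.

```python
def docking_loss(y_exp, y_decoy):
    """Sum of max(y_exp_i - y_decoy_i, 0).

    Penalizes pairs where the decoy energy is lower than the true-pose energy.
    """
    return sum(max(e - d, 0.0) for e, d in zip(y_exp, y_decoy))


def screening_loss(y_pred, threshold=6.8):
    """Sum of max(-y_i - threshold, 0).

    Used for both cross-screening and random-screening: penalizes
    predicted energies lower than -threshold for presumed non-binders.
    """
    return sum(max(-y - threshold, 0.0) for y in y_pred)


# Invented energies for illustration.
l_dock = docking_loss([-8.0, -6.0], [-5.0, -7.5])  # second decoy is too favorable
l_screen = screening_loss([-9.0, -6.8, -3.0])      # only -9.0 is below -6.8
print(l_dock, l_screen)
```

Both losses are zero whenever the constraint is already satisfied, so they only push the model when a decoy or a presumed non-binder is predicted to bind too strongly.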

 

Datasets

- PDBbind v.2019 refined set : 4,212 samples for the training set and 259 samples for the test set.

- Docking augmentation : 202,035 decoy structures generated using the PDBbind v.2016 dataset.

- Random-screening augmentation : 773,623 complexes generated using the IBS molecules.

- Cross-screening augmentation : 386,876 complexes generated from random cross-binding.

- Benchmarking : CASF2016 & CSAR

https://arxiv.org/abs/2008.12249

 

Bayesian inference

- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H

- Uses the posterior distribution: predictions from many sampled models are averaged to obtain the mean prediction, and their variance is taken as the uncertainty.
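A minimal sketch of that averaging step, assuming predictions from several sampled models are already available as a nested list. How the models are sampled is outside this sketch, and the numbers are invented.

```python
from statistics import fmean, pvariance


def predictive_mean_and_variance(samples):
    """samples[t][n] is the prediction of the t-th sampled model for molecule n.

    Returns the per-molecule mean (the prediction) and the per-molecule
    variance (used as the uncertainty).
    """
    per_molecule = list(zip(*samples))  # regroup predictions by molecule
    means = [fmean(p) for p in per_molecule]
    variances = [pvariance(p) for p in per_molecule]
    return means, variances


# Two hypothetical sampled models, two molecules.
samples = [[1.0, 2.0],
           [3.0, 2.0]]
means, variances = predictive_mean_and_variance(samples)
print(means, variances)
```

The models disagree on the first molecule and agree on the second, so only the first gets a non-zero uncertainty even though both have the same mean.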

 

Self-supervised learning

- Strategies for Pre-training Graph Neural Networks, https://arxiv.org/abs/1905.12265

https://github.com/snap-stanford/pretrain-gnns

- Labels are created from the data itself and used for training.

- Pretrained with 2 million ZINC15 and 456K ChEMBL molecules

- Model < 10k parameters
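One way to turn the data itself into labels is attribute masking: hide some atom types and ask the model to recover them. The sketch below only builds the (input, label) pair on a toy atom list; the function name, the mask token, and the masking rate are assumptions for illustration and do not reproduce the pretrain-gnns code.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def mask_atoms(atom_types, mask_rate=0.15, seed=0):
    """Hide a fraction of atom types and return them as training labels.

    Returns the masked input list and a dict {atom index: original type}.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(atom_types) * mask_rate))
    masked_idx = sorted(rng.sample(range(len(atom_types)), n_mask))
    inputs = list(atom_types)
    labels = {}
    for i in masked_idx:
        labels[i] = inputs[i]  # the label comes from the data itself
        inputs[i] = MASK
    return inputs, labels


# Toy molecule given as a flat list of atom types.
atoms = ["C", "C", "O", "N", "C", "C", "O", "H"]
inputs, labels = mask_atoms(atoms)
print(inputs, labels)
```

No experimental measurement is needed to build these labels, which is why large unlabeled collections can be used for pretraining.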

 

Representation of 3D structure

- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, https://arxiv.org/abs/2104.13478

- https://geometricdeeplearning.com/

 

 

 

Reference

- Physics-informed GCN for virtual screening l 김우연, https://youtu.be/v9E1QYoXsKw

- https://github.com/jaechanglim/DTI_PDBbind

- 3D graph neural network, https://e3nn.org/, https://github.com/e3nn/e3nn

 
