Virtual screening 8 - Physics-informed GCN

Data bias makes generalization difficult, and collecting a large data set is not feasible.

We need a way to improve model performance under these conditions.

Model capacity (# parameters)

- Optimal capacity

: Control hidden layers

: Give some constraints, e.g., use a kernel

- The choice of model specifies the family of functions the learning algorithm can choose from

-> representational capacity of the model.

Inductive bias = weight sharing = regularization

- Weight sharing -> much smaller number of parameters -> low model capacity
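To make the effect of weight sharing on capacity concrete, here is a minimal sketch that counts parameters for a fully connected layer versus a layer that reuses one weight matrix for every atom, as a graph convolution does. The sizes (50 atoms, 32 features, 64 hidden units) and the function names are illustrative assumptions, not values from the lecture.

```python
# Parameter counts for two ways of mapping (n_atoms x n_feat) inputs
# to (n_atoms x n_hidden) outputs. All sizes are illustrative assumptions.

def dense_params(n_atoms: int, n_feat: int, n_hidden: int) -> int:
    """Fully connected layer on the flattened molecule: no weight sharing."""
    n_in = n_atoms * n_feat
    n_out = n_atoms * n_hidden
    return n_in * n_out + n_out  # weights + biases


def shared_params(n_feat: int, n_hidden: int) -> int:
    """Graph-convolution-style layer: one (n_feat x n_hidden) matrix reused for every atom."""
    return n_feat * n_hidden + n_hidden  # weights + biases, independent of n_atoms


n_atoms, n_feat, n_hidden = 50, 32, 64
print(dense_params(n_atoms, n_feat, n_hidden))  # 5123200
print(shared_params(n_feat, n_hidden))          # 2112
```

The shared layer's count does not grow with the number of atoms, which is the sense in which weight sharing lowers model capacity.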

Physics-informed GCN

- Limited data & Intrinsic data bias

- Practically impossible to acquire a sufficient amount of high-quality data

- But physics is universal (generalized over all systems)

- Let us use physics as an inductive bias (or constraints in learning process)

Empirical force field

- Total interaction energy between protein and ligand

$$ E_{tot} = \sum_{i,j} E_{ij} $$

where i, j are atom indices.

$$ E_{ij} = VDW + Hydrophobic + EDA, $$

- Physical formula for each energy component

$$ VDW = A\left( (A/r)^{12} - (B/r)^6 \right), $$

where r is the distance between atoms, and A and B are predicted by a neural network (NN).

In the remaining terms, R = r - D, where C is a learnable scalar and D is predicted by the NN.

- Deep learning based parameters

: Each parameter in the equations is obtained from an NN trained on experimental data.
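The structure "sum of pairwise physical terms, with parameters supplied by a network" can be sketched as follows. This is a toy under stated assumptions: the pair term is the textbook Lennard-Jones 12-6 form with a well depth `eps` and a minimum-energy distance `r_min`, not the exact parametrization from the lecture; `constant_params` is a hypothetical stand-in for the NN; and letting i run over ligand atoms and j over protein atoms is also an assumption.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]


def lj_12_6(r: float, eps: float, r_min: float) -> float:
    """Textbook 12-6 term: minimum value -eps at r == r_min."""
    x = (r_min / r) ** 6
    return eps * (x * x - 2.0 * x)


def total_energy(
    ligand: Sequence[Point],
    protein: Sequence[Point],
    pair_params: Callable[[int, int], Tuple[float, float]],
) -> float:
    """E_tot = sum over pairs (i, j) of E_ij; here E_ij has only the 12-6 term."""
    e_tot = 0.0
    for i, p in enumerate(ligand):
        for j, q in enumerate(protein):
            r = math.dist(p, q)            # interatomic distance
            eps, r_min = pair_params(i, j)  # in the real model: predicted by the NN
            e_tot += lj_12_6(r, eps, r_min)
    return e_tot


def constant_params(i: int, j: int) -> Tuple[float, float]:
    """Hypothetical stand-in for the NN: same parameters for every pair."""
    return 0.2, 3.5


ligand = [(0.0, 0.0, 0.0)]
protein = [(3.5, 0.0, 0.0), (0.0, 7.0, 0.0)]
print(total_energy(ligand, protein, constant_params))
```

Because the functional form is fixed by physics, the network only has to learn a few per-pair parameters, which is where the inductive bias enters.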

Energy-wise constraint

Structural constraint

Data augmentation

: A technique for increasing the amount of data by transforming, processing, or generating data when data is scarce.

- Docking augmentation

$$ L_{\text{docking}} = \sum_i \max(y_{exp,i} - y_{decoy,i}, 0) $$

: The energy of the true binding pose should be lower than the predicted energy of decoy structures with wrong poses.
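The hinge form of the docking loss can be checked with a few numbers. This is a minimal sketch in plain Python; the energy values are made up for illustration, and each true-pose energy is paired with one decoy energy, following the index i in the formula.

```python
def docking_loss(y_exp, y_decoy):
    """Sum of max(y_exp_i - y_decoy_i, 0): nonzero only when a decoy scores lower than the true pose."""
    return sum(max(e - d, 0.0) for e, d in zip(y_exp, y_decoy))


# Made-up energies (more negative = stronger binding).
y_exp = [-9.0, -7.5, -6.0]
y_decoy = [-5.0, -8.0, -6.0]
print(docking_loss(y_exp, y_decoy))  # 0.5
```

Only the second pair contributes: the decoy (-8.0) is scored lower than the true pose (-7.5), so the model is penalized by the 0.5 gap.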

- Cross-screening augmentation, ligands from PDBbind training set

$$ L_{\text{cross-screening}} = \sum_i \max(-y_{cross,i} - 6.8, 0) $$

: Actives of one target would have low binding affinity to other targets.

- Random-screening augmentation, ligands from another dataset (STOCKS)

$$ L_{\text{random-screening}} = \sum_i \max(-y_{random,i} - 6.8, 0) $$

: Most randomly sampled molecules would have low binding affinity with a specific target.
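The cross-screening and random-screening losses share one form, so a single function covers both. The sketch below uses made-up predicted energies; the function name and the `threshold` argument are illustrative, with 6.8 taken from the formulas.

```python
def screening_loss(y_pred, threshold=6.8):
    """Sum of max(-y_i - threshold, 0): penalizes predictions lower than -threshold."""
    return sum(max(-y - threshold, 0.0) for y in y_pred)


# Made-up predicted energies for presumed non-binders.
y_pred = [-9.0, -6.8, -3.0]
print(screening_loss(y_pred))
```

Only the first prediction (-9.0) is penalized, by roughly 2.2; energies at or above -6.8 contribute nothing, so the loss pushes presumed non-binders toward weak predicted affinity without forcing a specific value.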

Datasets

- PDBbind v.2019 refined set : 4,212 samples for the training set and 259 samples for the test set.

- Docking augmentation : generating 202,035 decoy structures using the PDBbind v.2016 dataset.

- Random screening augmentation : generating 773,623 complexes using the IBS molecules

- Cross screening augmentation : generating 386,876 complexes based on the random cross binding

- Benchmarking : CASF2016 & CSAR

Bayesian inference

- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H

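- Treat the model weights as a posterior distribution: average the predictions of many sampled models to get the predictive mean, and use their variance as the uncertainty.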
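A minimal sketch of "mean over sampled models, variance as uncertainty" follows. It assumes Monte Carlo dropout as the way to sample models, which is one common approximation; the toy linear model, its weights, and the function names are all hypothetical and not taken from the cited paper's code.

```python
import random
import statistics


def stochastic_predict(x, weights, rng, p_drop=0.2):
    """One forward pass of a toy linear model with dropout left on at inference."""
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() >= p_drop:             # this unit is kept
            total += xi * wi / (1.0 - p_drop)  # inverted-dropout scaling
    return total


def predict_with_uncertainty(x, weights, n_samples=200, seed=0):
    """Sample many stochastic passes; return (predictive mean, sample variance)."""
    rng = random.Random(seed)
    samples = [stochastic_predict(x, weights, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.variance(samples)


mean, var = predict_with_uncertainty([1.0, 2.0, 3.0], [0.5, -0.3, 0.8])
print(f"mean={mean:.3f}, variance={var:.3f}")
```

Each dropout mask acts as one sampled model; a large variance across samples flags a prediction that should not be trusted.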

Self-supervised learning

- Strategies for Pre-training Graph Neural Networks, https://arxiv.org/abs/1905.12265

https://github.com/snap-stanford/pretrain-gnns

- Derive labels from the data itself and use them for training.

- Pretrained with 2 million ZINC15, 456K ChEMBL

- Model < 10k parameters
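As an illustration of deriving labels from the data itself, the sketch below mimics attribute masking, one of the node-level pre-training tasks in the cited paper: hide some atom types and ask the model to recover them. It works on a plain list of atom symbols; the function name, the mask token, and the 15% rate are assumptions for this toy, not the repository's code.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_atoms(atom_types, mask_rate=0.15, seed=0):
    """Return (masked inputs, {index: original atom type}) for a self-supervised task."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(atom_types)))
    masked_idx = sorted(rng.sample(range(len(atom_types)), n_mask))
    inputs = list(atom_types)
    labels = {}
    for i in masked_idx:
        labels[i] = inputs[i]  # the hidden atom type becomes the training label
        inputs[i] = MASK
    return inputs, labels


atoms = ["C", "C", "O", "N", "C", "C", "Cl", "C"]
inputs, labels = mask_atoms(atoms)
print(inputs)
print(labels)
```

No experimental measurement is needed: the molecule supplies both the input and the target, which is why millions of unlabeled structures can be used for pre-training.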

Representation of 3D structure

- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, https://arxiv.org/abs/2104.13478

- https://geometricdeeplearning.com/

Reference

- Physics-informed GCN for virtual screening l 김우연, https://youtu.be/v9E1QYoXsKw

- https://github.com/jaechanglim/DTI_PDBbind

- 3D graph neural network, https://e3nn.org/, https://github.com/e3nn/e3nn

