Virtual screening 7 - Generalization

Generalization

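- Make the training error small

$$ \frac{1}{m^{(\mathrm{train})}} \left\| X^{(\mathrm{train})} \omega - y^{(\mathrm{train})} \right\|_2^2 $$

- Make the gap between training and test error small

$$ \frac{1}{m^{(\mathrm{test})}} \left\| X^{(\mathrm{test})} \omega - y^{(\mathrm{test})} \right\|_2^2 $$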

- Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

- Overfitting occurs when the gap between the training error and test error is too large.

- The main challenge is to find the right model complexity for a given task.
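The training and test errors above can be evaluated directly. The following is a minimal sketch, assuming synthetic 1-D data and polynomial features as a stand-in for model complexity. The data generator, the degrees (1, 3, 9), and the helper names `make_data`, `design_matrix`, and `mse` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    """Synthetic data (assumption for illustration): a noisy sine curve."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=m)
    return x, y

def design_matrix(x, degree):
    """Polynomial features [1, x, ..., x^degree]; degree sets the complexity."""
    return np.vander(x, degree + 1, increasing=True)

def mse(X, w, y):
    """(1/m) * ||X w - y||^2, the error expression used in the notes."""
    r = X @ w - y
    return float(r @ r) / len(y)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    X_train = design_matrix(x_train, degree)
    X_test = design_matrix(x_test, degree)
    # The least-squares fit minimizes the training error for this feature set.
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    results[degree] = (mse(X_train, w, y_train), mse(X_test, w, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree}: train={train_err:.4f}, test={test_err:.4f}, "
          f"gap={test_err - train_err:.4f}")
```

Because each feature set contains the previous one, the training error can only go down as the degree grows. A high training error at low degree points to underfitting, and a low training error combined with a large gap at high degree points to overfitting. The exact numbers depend on the random seed.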

Diagnosis of overfitting

- Split the data into training (70%) and test (30%) sets.

- Insufficient data.
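A 70/30 split can be sketched as below. This is a minimal version that assumes a plain random shuffle; `train_test_split_indices` is a hypothetical helper name, and the seed is arbitrary.

```python
import numpy as np

def train_test_split_indices(m, test_fraction=0.3, seed=0):
    """Return shuffled (train_idx, test_idx) index arrays for m samples."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_test = int(round(m * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the training set.
    return perm[n_test:], perm[:n_test]

train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))  # 70 30
```

Fitting on `train_idx` only and then comparing the error on both index sets gives the gap used to diagnose overfitting.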

Model capacity

- capacity = number of unknown parameters
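For a fully connected network, this count follows from the layer widths. The sketch below assumes dense layers with a bias term; `count_parameters` is a hypothetical helper, not a library function.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a dense network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

print(count_parameters([10, 5, 1]))  # 10*5 + 5 + 5*1 + 1 = 61
```

With this definition, a linear regression on d features is `count_parameters([d, 1])`, that is d weights plus one bias.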

Limited data

Image datasets

ImageNet 14,197,122 / COCO 330,000

Language datasets

Enron 500,000 / Amazon Reviews 35,000,000

Protein-ligand datasets

PDBbind 11,987 / BACE 208

Toxicity datasets

Tox21 ~10,000 / ToxCast ~1,800 / SIDER 1,430

Problems caused by limited data

- Overfitting in regression

- Overconfidence in classification

Overconfidence

- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H

- MAP (Maximum A Posteriori) : Uses only a single set of model parameters (only one opinion)

- Bayesian inference : Uses multiple sets of model parameters (integrates various opinions)
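In formula form, the Bayesian prediction averages over parameter sets drawn from the posterior, while MAP evaluates a single one:

$$ p(y \mid x, D) = \int p(y \mid x, \omega)\, p(\omega \mid D)\, d\omega \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \omega_t) $$

The sketch below illustrates the averaging step only. It assumes a two-weight logistic model, and the "posterior samples" are Gaussian noise around a point estimate, which is a hypothetical stand-in and not the method of the cited paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence(p):
    """Confidence of a binary prediction: probability of the predicted class."""
    return np.maximum(p, 1.0 - p)

rng = np.random.default_rng(0)

# Hypothetical stand-in for posterior samples: T = 50 parameter sets scattered
# around one point estimate, which plays the role of the MAP solution here.
w_map = np.array([2.0, -1.0])
w_samples = w_map + 1.5 * rng.normal(size=(50, 2))

x = np.array([1.5, 0.5])           # a single input

p_single = sigmoid(w_map @ x)      # one opinion
p_each = sigmoid(w_samples @ x)    # one probability per parameter set
p_bayes = p_each.mean()            # Monte Carlo average of the opinions

print(f"single: {p_single:.3f}, averaged: {p_bayes:.3f}")
print(f"mean individual confidence: {confidence(p_each).mean():.3f}, "
      f"confidence of average: {confidence(p_bayes):.3f}")
```

The averaged prediction is never more confident than the mean confidence of the individual parameter sets. Whether it is also less confident than the single point estimate depends on how much the sampled parameter sets disagree.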

Data bias

DUD-E

- Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening, https://dx.doi.org/10.1371%2Fjournal.pone.0220113

: Receptor(=protein)-ligand CNN model vs. Ligand-only CNN model -> R^2 = 0.9819

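: The two models perform essentially the same, for two reasons:

1) Analogue bias : active compounds share similar scaffolds, and therefore similar topological features

2) Decoy bias : decoys are not experimentally verified. They are chosen by topological fingerprint-based Tanimoto similarity: from a set drawn at random from ZINC, the 75% with the highest similarity are removed and the remaining 25% dissimilar decoys are used. In other words, the dataset separates actives from inactives by similarity alone, not by binding affinity.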
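The similarity-based decoy selection described above can be sketched as follows. This is a simplified illustration, not the actual DUD-E protocol: fingerprints are represented as Python sets of on-bits, candidates are ranked by their maximum similarity to any active (an assumption), and `keep_dissimilar_decoys` is a hypothetical helper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def keep_dissimilar_decoys(actives, candidates, keep_fraction=0.25):
    """Rank candidates by max similarity to any active; keep the least similar."""
    ranked = sorted(candidates,
                    key=lambda fp: max(tanimoto(fp, a) for a in actives))
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]

# Toy fingerprints, invented for illustration.
actives = [{1, 2, 3, 4}, {2, 3, 4, 5}]
candidates = [{1, 2, 3, 9}, {2, 3, 8, 9}, {3, 7, 8, 9}, {6, 7, 8, 9}]

print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared bits / 4 distinct bits = 0.5
decoys = keep_dissimilar_decoys(actives, candidates)
print(decoys)
```

Only the candidate that shares no bits with any active survives the filter. A model can then tell actives from decoys from ligand topology alone, which is the bias the paper reports.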

PDBbind

- Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets, https://doi.org/10.3389/fphar.2020.00069 (=GCN)

: use DeepChem package, https://deepchem.io/, https://github.com/deepchem/deepchem

: Predicting binding affinity from the protein alone or the ligand alone (physically impossible) still gives a high correlation

: Test R^2 by split: random > ligand scaffold-based > protein sequence-based
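The difference between a random split and a scaffold- or sequence-based split is that the latter keeps each group entirely in either the training or the test set. A minimal sketch of such a group-wise split follows. The `scaffolds` labels are toy values, and `group_split` is a hypothetical helper written for illustration, not the DeepChem API.

```python
import random
from collections import defaultdict

def group_split(groups, test_fraction=0.3, seed=0):
    """Split sample indices so that no group appears in both train and test."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)
    n_test_target = test_fraction * len(groups)
    train_idx, test_idx = [], []
    for g in group_ids:
        # Fill the test set group by group until it reaches the target size.
        if len(test_idx) < n_test_target:
            test_idx.extend(by_group[g])
        else:
            train_idx.extend(by_group[g])
    return train_idx, test_idx

# Toy scaffold label per compound (illustrative values).
scaffolds = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "E"]
train_idx, test_idx = group_split(scaffolds)
print("train scaffolds:", sorted({scaffolds[i] for i in train_idx}))
print("test scaffolds:", sorted({scaffolds[i] for i in test_idx}))
```

Because test compounds never share a scaffold (or, for a sequence-based split, a protein family) with training compounds, the model cannot score well by memorizing near-duplicates, which is consistent with the lower test R^2 reported for these splits.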

Unbiased dataset

- LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening, https://doi.org/10.1021/acs.jcim.0c00155

1) Data retrieval from the PubChem BioAssay database = count of tested substances ≥ 10,000, count of active substances ≥ 50.

2) Data cleaning = removal of inorganic compounds, false positives, frequent hitters, assay artifacts, and compounds with extreme molecular properties.

3) Virtual screening = removing structural bias

4) Performance assessment = ROC, BEDROC, Enrichment Factor 1%.

5) Dataset with a natural hit rate between active and inactive compounds
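Of the metrics listed in step 4, the enrichment factor at 1% is the simplest to compute by hand: it is the hit rate among the top-scored 1% of compounds divided by the hit rate of the whole set. A minimal sketch follows; the scores and labels are toy numbers made up for illustration, and `enrichment_factor` is a hypothetical helper.

```python
import numpy as np

def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_total = len(labels)
    n_top = max(1, int(round(fraction * n_total)))
    # Indices of the n_top highest-scoring compounds.
    top = np.argsort(-scores, kind="stable")[:n_top]
    hit_rate_top = labels[top].sum() / n_top
    hit_rate_all = labels.sum() / n_total
    return hit_rate_top / hit_rate_all

# Toy example: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
labels = np.zeros(1000, dtype=int)
labels[[0, 2, 4, 6, 8, 500, 600, 700, 800, 900]] = 1
scores = -np.arange(1000, dtype=float)  # compound 0 has the highest score

ef1 = enrichment_factor(scores, labels)
print(ef1)  # hit rate 0.5 in the top 1% vs. 0.01 overall, so about 50
```

With a natural hit rate, as in LIT-PCBA, actives are rare, so the maximum achievable enrichment factor is large and a random ranking gives a value near 1.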

Reference

- Generalization and data bias in virtual screening l 김우연, https://youtu.be/5PEpTOm2ZSY

- [DL] 1. Learning Algorithms and basic terms of DL, https://medium.com/temp08050309-devpblog/dl-1-learning-algorithms-and-basic-terms-of-dl-65d46ceb1b0a

