Virtual screening 7 - Generalization

by wycho 2021. 12. 12.

Generalization

- Make the training error small

$$ \frac{1}{m^{(train)}} \lVert X^{(train)} \omega - y^{(train)} \rVert^2 $$

- Make the gap between the training error and the test error small, where the test error is

$$ \frac{1}{m^{(test)}} \lVert X^{(test)} \omega - y^{(test)} \rVert^2 $$
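A minimal sketch of the two errors above, in plain Python. The toy data (y = 1 + 2x plus Gaussian noise), the 70/30 sample sizes, and the helper names `mse`, `fit_line`, and `make_data` are assumptions for illustration, not part of the lecture.

```python
import random

def mse(X, y, w):
    """Mean squared error (1/m) * ||X w - y||^2 with X given as a list of rows."""
    residuals = (sum(x_j * w_j for x_j, w_j in zip(row, w)) - target
                 for row, target in zip(X, y))
    return sum(r * r for r in residuals) / len(y)

def fit_line(X, y):
    """Closed-form least squares for rows of the form [1, x]; returns [intercept, slope]."""
    xs = [row[1] for row in X]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(y) / len(y)
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (t - y_mean) for x, t in zip(xs, y))
    slope = s_xy / s_xx
    return [y_mean - slope * x_mean, slope]

def make_data(m, rng):
    """Toy data: y = 1 + 2x + Gaussian noise (assumed, for illustration only)."""
    X = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(m)]
    y = [1.0 + 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(70, rng)
X_test, y_test = make_data(30, rng)

w = fit_line(X_train, y_train)            # omega is fitted on the training set only
train_error = mse(X_train, y_train, w)
test_error = mse(X_test, y_test, w)
gap = test_error - train_error            # the quantity generalization tries to keep small
print(f"train={train_error:.4f} test={test_error:.4f} gap={gap:.4f}")
```

The weights are fitted on the training set only, so the test error estimates how well the same ω carries over to unseen data.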

 

- Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

- Overfitting occurs when the gap between the training error and test error is too large.

- The main challenge is to find a right model complexity for a given task.

 

Diagnosis of overfitting

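- Split the data into a training set (70%) and a test set (30%), then compare the errors on the two sets.

http://mlwiki.org/index.php/Overfitting

- Insufficient data (a typical cause of overfitting).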
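The 70/30 diagnosis can be sketched as follows. A k-nearest-neighbour regressor stands in for "the model" here; that choice, the toy target y = x^2 plus noise, and the names `knn_predict` and `results` are assumptions for illustration only. With k = 1 the model memorizes the training set, so its training error is exactly zero while the test error is not.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(target for _, target in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictor (fitted on `train`) on `data`."""
    return sum((knn_predict(train, x, k) - t) ** 2 for x, t in data) / len(data)

rng = random.Random(1)
points = []
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    points.append((x, x * x + rng.gauss(0.0, 0.3)))  # toy target: y = x^2 + noise

rng.shuffle(points)
n_train = len(points) * 7 // 10          # 70% training, 30% test
train, test = points[:n_train], points[n_train:]

results = {}
for k in (1, 15):
    train_error = mse(train, train, k)
    test_error = mse(train, test, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d} train={train_error:.3f} test={test_error:.3f} "
          f"gap={test_error - train_error:.3f}")
```

A large gap at k = 1 is the overfitting signature; a larger k typically narrows it at the cost of a nonzero training error.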

 

Model capacity

- capacity = number of unknown parameters
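As a rough illustration of counting unknown parameters, the sketch below assumes a fully connected network with one bias per output unit. The layer sizes and the helper name `count_parameters` are hypothetical.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

linear_model = count_parameters([10, 1])        # 10 weights + 1 bias
small_mlp = count_parameters([10, 64, 64, 1])   # two hidden layers of 64 units
print(linear_model, small_mlp)                  # 11 4929
```

Even a small two-layer network has several thousand unknown parameters, which is worth comparing with the number of labelled samples available.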

 

Limited data

Image datasets

ImageNet 14,197,122 / COCO 330,000

Language datasets

Enron 500,000 / Amazon Reviews 35,000,000

Protein-ligand datasets

PDBbind 11,987 / BACE 208

Toxicity datasets

Tox21 ~10,000 / ToxCast ~1,800 / SIDER 1,430

 

Problem by limited data

- Overfitting in regression

- Overconfidence in classification

 

Overconfidence

- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H

- MAP (maximum a posteriori) : uses only a single set of model parameters (only one opinion)

- Bayesian inference : uses multiple sets of model parameters (integrates various opinions)
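The "one opinion vs. many opinions" contrast can be sketched with a toy one-parameter logistic model. This is not the Bayesian GCN of the cited paper: the Gaussian weight samples standing in for a posterior, the values of `w_map` and `x`, and the function names are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_map(w, x):
    """Prediction from a single parameter value (one opinion)."""
    return sigmoid(w * x)

def predict_bayes(w_samples, x):
    """Average the predictions of many parameter values (many opinions)."""
    return sum(sigmoid(w * x) for w in w_samples) / len(w_samples)

rng = random.Random(0)
w_map = 2.0
# Hypothetical posterior samples: Gaussian around the MAP weight (an assumption).
w_samples = [rng.gauss(w_map, 1.5) for _ in range(1000)]

x = 3.0                                   # an input far from the decision boundary
p_map = predict_map(w_map, x)
p_bayes = predict_bayes(w_samples, x)
print(f"MAP: {p_map:.4f}  Bayesian average: {p_bayes:.4f}")
```

The averaged probability is pulled away from 1 because some plausible parameter values disagree, which is the mechanism that tempers overconfidence.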

 

 

Data bias

DUD-E

- Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening, https://dx.doi.org/10.1371%2Fjournal.pone.0220113

  : Receptor (= protein)-ligand CNN model vs. ligand-only CNN model -> R^2 = 0.9819

  : The two models show no real difference. The reasons are:

    1) Analogue bias : active compounds share similar scaffolds = similar topological features

    2) Decoy bias : decoys come from non-experimental scaffolds and are separated from actives by a topological fingerprint-based Tanimoto analysis. From a set drawn at random from ZINC, the 75% most similar compounds are removed and the remaining 25% dissimilar decoys are used. The dataset therefore separates active from inactive by similarity alone, not by binding affinity.
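A simplified reading of the decoy selection described above, with fingerprints represented as sets of on-bit indices. The toy fingerprints and the helper names `tanimoto` and `keep_most_dissimilar` are hypothetical; the actual DUD-E pipeline is not reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def keep_most_dissimilar(actives, candidates, keep_fraction=0.25):
    """Rank candidates by their highest similarity to any active; keep the least similar."""
    ranked = sorted(candidates,
                    key=lambda fp: max(tanimoto(fp, active) for active in actives))
    n_keep = round(len(ranked) * keep_fraction)
    return ranked[:n_keep]

# Toy fingerprints (assumed): small integers stand in for topological features.
actives = [{1, 2, 3, 4}, {1, 2, 3, 5}]
candidates = [{1, 2, 3, 6}, {1, 2, 7, 8}, {9, 10, 11, 12}, {1, 9, 10, 11}]

decoys = keep_most_dissimilar(actives, candidates)
print(tanimoto(actives[0], actives[1]))   # actives resemble each other: 0.6
print(decoys)                             # only the least similar candidate survives
```

Because the kept decoys are by construction unlike the actives, a ligand-only model can tell the two classes apart without any information about the receptor.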

 

PDBbind

- Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets, https://doi.org/10.3389/fphar.2020.00069 (=GCN)

  : Uses the DeepChem package, https://deepchem.io/, https://github.com/deepchem/deepchem

  : Test: predict binding affinity from the protein only or the ligand only (physically impossible) -> still gives high correlation

  : Test R^2 by data split : random > ligand scaffold-based > protein sequence-based
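The idea behind a scaffold-based split is that molecules sharing a scaffold never end up on both sides of the split. The sketch below is a generic group split in plain Python, not DeepChem's implementation; the scaffold labels and the name `group_split` are hypothetical.

```python
from collections import defaultdict

def group_split(group_ids, train_fraction=0.7):
    """Split sample indices so that each group lands entirely in train or in test."""
    groups = defaultdict(list)
    for index, group_id in enumerate(group_ids):
        groups[group_id].append(index)

    n_train_target = round(train_fraction * len(group_ids))
    train, test = [], []
    # Largest groups first, so that big scaffold families fill the training set.
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Hypothetical scaffold label per molecule.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = group_split(scaffolds)
print(train_idx, test_idx)   # [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```

A random split would scatter close analogues across both sets, which is consistent with the ordering of test R^2 noted above.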

 

Unbiased dataset

- LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening, https://doi.org/10.1021/acs.jcim.0c00155

  1) Data retrieval from the PubChem BioAssay database = count of tested substances ≥ 10,000, count of active substances ≥ 50.

  2) Data cleaning = removal of inorganic compounds, false positives, frequent hitters, assay artifacts, and compounds with extreme molecular properties.

  3) Virtual screening = removing structural bias

  4) Performance assessment = ROC, BEDROC, Enrichment Factor 1%.

  5) Dataset keeps the natural hit rate (ratio of active to inactive compounds)
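Of the metrics in step 4, the enrichment factor at 1% is the simplest to write down: the hit rate among the top-scored 1% divided by the hit rate of the whole set. A minimal sketch with made-up scores and labels (1,000 compounds, 10 actives, all values hypothetical):

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (sum(labels) / len(labels))

# Toy screen: compound i has rank i; 5 of the 10 actives sit in the top 10 ranks.
n_compounds = 1000
active_ranks = {0, 1, 2, 3, 4, 500, 600, 700, 800, 900}
scores = [float(n_compounds - i) for i in range(n_compounds)]
labels = [1 if i in active_ranks else 0 for i in range(n_compounds)]

ef_1 = enrichment_factor(scores, labels)
print(f"EF1% = {ef_1:.1f}")   # EF1% = 50.0
```

With a natural hit rate of 1%, the top 1% can hold every active, so the EF1% ceiling here is 100; a random ranking gives about 1.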

 

 

 

Reference

- Generalization and data bias in virtual screening l 김우연, https://youtu.be/5PEpTOm2ZSY 

- [DL] 1. Learning Algorithms and basic terms of DL, https://medium.com/temp08050309-devpblog/dl-1-learning-algorithms-and-basic-terms-of-dl-65d46ceb1b0a

 

 
