Generalization
- Make the training error small:
$$ \frac{1}{m^{train}} \| X^{train}\omega - y^{train} \|_2^2 $$
- Make the gap between the training error and the test error small, where the test error is:
$$ \frac{1}{m^{test}} \| X^{test}\omega - y^{test} \|_2^2 $$
- Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
- Overfitting occurs when the gap between the training error and test error is too large.
- The main challenge is to find the right model complexity for a given task.
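The two error terms above can be computed directly. The following is a minimal sketch on made-up toy data; the helper name `mse` and the fixed weight vector `w` are assumptions for illustration, not part of the lecture.

```python
def mse(X, w, y):
    """Mean squared error (1/m) * ||X w - y||^2, with X given as a list of rows."""
    m = len(y)
    residuals = [
        sum(x_ij * w_j for x_ij, w_j in zip(row, w)) - y_i
        for row, y_i in zip(X, y)
    ]
    return sum(r * r for r in residuals) / m


# Toy data (invented): the first column is a constant 1 for the intercept.
X_train = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_train = [1.0, 3.0, 5.0]
X_test = [[1.0, 3.0], [1.0, 4.0]]
y_test = [7.5, 8.5]

w = [1.0, 2.0]  # intercept 1, slope 2: fits the training points exactly

train_error = mse(X_train, w, y_train)
test_error = mse(X_test, w, y_test)
gap = test_error - train_error  # generalization gap

print(train_error, test_error, gap)
```

A low training error together with a large `gap` is the signature of overfitting; a high training error indicates underfitting.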
Diagnosis of overfitting
- Split the data into training (70%) and test (30%) sets.
- Insufficient data.
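A 70/30 split can be sketched as follows. The function name `train_test_split` and the fixed seed are assumptions chosen for this example; this is a plain shuffle-and-cut split, not a specific library's implementation.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


samples = list(range(10))  # stand-in for 10 data points
train, test = train_test_split(samples)
print(len(train), len(test))
```

With very little data, the 30% test set contains only a handful of points, so the measured test error is itself a noisy estimate.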
Model capacity
- capacity = number of unknown parameters
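Taking the definition above literally, capacity can be estimated by counting trainable parameters. The sketch below assumes a fully connected network described only by its layer widths; `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of each layer, input first,
    e.g. [3, 4, 1] means 3 inputs, 4 hidden units, 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total


n_params = count_parameters([3, 4, 1])
print(n_params)  # (3*4 + 4) + (4*1 + 1) = 21
```

Comparing this count with the number of training samples gives a rough first impression of whether a model is likely to overfit a small dataset.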
Limited data
Image datasets
ImageNet 14,197,122 / COCO 330,000
Language datasets
Enron 500,000 / Amazon Reviews 35,000,000
Protein-ligand datasets
PDBbind 11,987 / BACE 208
Toxicity datasets
Tox21 ~10,000 / ToxCast ~1,800 / SIDER 1,430
Problems caused by limited data
- Overfitting in regression
- Overconfidence in classification
Overconfidence
- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H
- MAP (maximum a posteriori): uses only a single set of model parameters (only one opinion)
- Bayesian inference: uses multiple sets of model parameters (integrates various opinions)
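The contrast between one opinion and many can be illustrated with a one-parameter logistic classifier. This is a toy sketch: the parameter values in `w_samples` are invented, whereas real Bayesian inference would draw them from an (approximate) posterior.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_map(x, w_map):
    """Prediction from a single parameter value (one opinion)."""
    return sigmoid(w_map * x)


def predict_bayesian(x, w_samples):
    """Average the predictions of several parameter values (many opinions)."""
    return sum(sigmoid(w * x) for w in w_samples) / len(w_samples)


x = 3.0
w_map = 2.0
w_samples = [-1.0, 2.0, 5.0]  # hypothetical posterior samples

p_map = predict_map(x, w_map)
p_bayes = predict_bayesian(x, w_samples)
print(p_map, p_bayes)
```

The single-parameter prediction is close to 1, while the averaged prediction is noticeably less extreme because the parameter sets disagree. That disagreement is what tempers overconfidence.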
Data bias
DUD-E
- Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening, https://dx.doi.org/10.1371%2Fjournal.pone.0220113
: Receptor (= protein)-ligand CNN model vs. ligand-only CNN model -> R^2 = 0.9819
: The two models show no real difference, for two reasons:
1) Analogue bias: active compounds share similar scaffolds, and therefore similar topological features.
2) Decoy bias: decoys are not experimentally verified; they are chosen by Tanimoto similarity of topological fingerprints. From a set drawn at random from ZINC, the 75% most similar compounds were removed and the remaining 25% of dissimilar compounds were used as decoys. In other words, the dataset separates actives from inactives by similarity alone, not by binding affinity.
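The decoy selection described above can be sketched as follows. This is a simplified illustration, not the actual DUD-E pipeline: fingerprints are represented as plain Python sets of on-bit indices, and `select_dissimilar_decoys` is a hypothetical helper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A & B| / |A | B| of two on-bit sets."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def select_dissimilar_decoys(actives, candidates, keep_fraction=0.25):
    """Keep the candidates that are least similar to any active."""
    def max_similarity(fp):
        return max(tanimoto(fp, active) for active in actives)

    ranked = sorted(candidates, key=max_similarity)  # most dissimilar first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


actives = [{1, 2, 3, 4}]
candidates = [{1, 2, 3, 5}, {1, 2, 8, 9}, {1, 7, 8, 9}, {6, 7, 8, 9}]
decoys = select_dissimilar_decoys(actives, candidates)
print(decoys)
```

Because the surviving decoys are by construction unlike the actives, a model can tell the two classes apart from ligand topology alone, without learning anything about binding.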
PDBbind
- Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets, https://doi.org/10.3389/fphar.2020.00069 (=GCN)
: uses the DeepChem package, https://deepchem.io/, https://github.com/deepchem/deepchem
: Predicting binding affinity from the protein alone or from the ligand alone (physically impossible) still gives a high correlation
: Test R^2 by splitting method: random > ligand scaffold-based > protein sequence-based
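The difference between a random split and a scaffold-based split is that the latter keeps every scaffold entirely on one side. A minimal sketch of such a group-wise split follows; the scaffold labels ("A", "B", ...) are hypothetical placeholders, where in practice they would come from a cheminformatics toolkit, and `group_split` is an illustrative helper rather than DeepChem's own splitter.

```python
from collections import defaultdict


def group_split(records, group_of, train_fraction=0.7):
    """Split records so that no group appears in both train and test."""
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)

    # Assign the largest groups first; whole groups go to one side only.
    ordered = sorted(groups.values(), key=len, reverse=True)
    target = train_fraction * len(records)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# (molecule id, scaffold label) pairs; labels are made up.
records = [
    ("mol1", "A"), ("mol2", "A"), ("mol3", "A"),
    ("mol4", "B"), ("mol5", "B"),
    ("mol6", "C"), ("mol7", "C"),
    ("mol8", "D"), ("mol9", "E"), ("mol10", "F"),
]
train, test = group_split(records, group_of=lambda r: r[1])
print(len(train), len(test))
```

Under this split the test molecules have scaffolds the model never saw, which is why test R^2 drops compared with a random split, where near-duplicates of training molecules leak into the test set.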
Unbiased dataset
- LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening, https://doi.org/10.1021/acs.jcim.0c00155
1) Data retrieval from the PubChem BioAssay database = count of tested substances ≥ 10,000, count of active substances ≥ 50.
2) Data cleaning = removal of inorganic compounds, false positives, frequent hitters, assay artifacts, and compounds with extreme molecular properties.
3) Virtual screening = removing structural bias
4) Performance assessment = ROC, BEDROC, enrichment factor at 1% (EF1%).
5) Dataset with natural hit rates between active and inactive compounds
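Of the metrics in step 4, the enrichment factor is the simplest to compute by hand: it is the hit rate among the top-scored fraction divided by the hit rate of the whole library. The sketch below uses invented scores and labels, and `enrichment_factor` is a hypothetical helper, not the LIT-PCBA authors' code.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF at the given fraction: hit rate in the top-ranked subset
    divided by the hit rate in the full set (labels: 1 active, 0 inactive)."""
    n = len(scores)
    n_top = max(1, round(fraction * n))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_top = sum(labels[i] for i in order[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("no actives in the dataset")
    return (hits_top / n_top) / (total_hits / n)


# 200 compounds with distinct, decreasing scores; 4 of them are active.
scores = [200 - i for i in range(200)]
labels = [0] * 200
for idx in (0, 50, 100, 150):
    labels[idx] = 1

ef1 = enrichment_factor(scores, labels, fraction=0.01)
print(ef1)  # top 2 compounds contain 1 active: (1/2) / (4/200) = 25
```

With natural hit rates (few actives among many inactives), EF1% rewards a model only if it ranks true actives at the very top of the list.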
Reference
- Generalization and data bias in virtual screening | 김우연, https://youtu.be/5PEpTOm2ZSY
- [DL] 1. Learning Algorithms and basic terms of DL, https://medium.com/temp08050309-devpblog/dl-1-learning-algorithms-and-basic-terms-of-dl-65d46ceb1b0a