Virtual screening 7 - Generalization

Generalization

- Make the training error small

$\frac{1}{m^{\text{train}}} \lVert X^{\text{train}} \omega - y^{\text{train}} \rVert^{2}$

- Make the gap between training and test error small

$\frac{1}{m^{\text{test}}} \lVert X^{\text{test}} \omega - y^{\text{test}} \rVert^{2}$

- Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

- Overfitting occurs when the gap between the training error and test error is too large.

- The main challenge is to find the right model complexity for a given task.
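As a minimal sketch of the two error terms above, the following NumPy snippet fits a linear model by least squares and evaluates the mean squared error on a training and a test set. The synthetic data (`y = 2x + 1` plus noise) and the helper name `mse` are assumptions made for illustration only.

```python
import numpy as np

def mse(X, w, y):
    """Mean squared error (1/m) * ||X w - y||^2."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): y = 2x + 1 + noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x, np.ones_like(x)])  # design matrix with a bias column

X_train, y_train = X[:70], y[:70]
X_test, y_test = X[70:], y[70:]

# Least-squares fit on the training set only.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_error = mse(X_train, w, y_train)
test_error = mse(X_test, w, y_test)
gap = test_error - train_error  # the quantity that should stay small
print(f"train={train_error:.4f} test={test_error:.4f} gap={gap:.4f}")
```

Here the model family matches the data-generating process, so both errors should sit near the noise variance and the gap should be small.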

Diagnosis of overfitting

- training (70%), test (30%)

- Insufficient data.
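The 70/30 split can be sketched as a shuffled index split. The function names `split_indices` and `generalization_gap` are hypothetical helpers, not part of any library.

```python
import numpy as np

def split_indices(m, train_fraction=0.7, seed=0):
    """Shuffle indices 0..m-1 and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)
    n_train = int(round(m * train_fraction))
    return order[:n_train], order[n_train:]

def generalization_gap(train_error, test_error):
    """Test error minus training error; a large positive value suggests overfitting."""
    return test_error - train_error

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

With little data, the 30% test portion is small as well, so the measured gap is itself a noisy estimate.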

Model capacity

- capacity = number of unknown parameters
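To illustrate capacity as a parameter count, the sketch below fits polynomials of increasing degree (a degree-d polynomial has d + 1 unknown parameters) to a small synthetic dataset. The target function `sin(3x)`, the noise level, and the sample sizes are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic task (assumption): noisy samples of sin(3x) on [-1, 1].
x_train = rng.uniform(-1.0, 1.0, size=15)
y_train = np.sin(3.0 * x_train) + rng.normal(scale=0.1, size=15)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = np.sin(3.0 * x_test) + rng.normal(scale=0.1, size=200)

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (n_params, train MSE, test MSE)."""
    X_train = np.vander(x_train, degree + 1)  # columns x^degree ... x^0
    X_test = np.vander(x_test, degree + 1)
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_mse = float(np.mean((X_train @ w - y_train) ** 2))
    test_mse = float(np.mean((X_test @ w - y_test) ** 2))
    return len(w), train_mse, test_mse

results = {d: fit_and_score(d) for d in (1, 3, 12)}
for d, (n_params, train_mse, test_mse) in results.items():
    print(f"degree={d:2d} params={n_params:2d} train={train_mse:.4f} test={test_mse:.4f}")
```

Because the lower-degree models are nested inside the higher-degree ones, the training error cannot increase with capacity. The test error gives no such guarantee, which is why the training error alone cannot be used to choose the capacity.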

Limited data

Image datasets

ImageNet 14,197,122 / COCO 330,000

Language datasets

Enron 500,000 / Amazon Reviews 35,000,000

Protein-ligand datasets

PDBbind 11,987 / BACE 208

Toxicity datasets

Tox21 ~10,000 / ToxCast ~1,800 / SIDER 1,430

Problems caused by limited data

- Overfitting in regression

- Overconfidence in classification

Overconfidence

- A Bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification, https://doi.org/10.1039/C9SC01992H

- MAP (maximum a posteriori): uses only a single set of model parameters (only one opinion)

- Bayesian inference: uses multiple sets of model parameters (integrates various opinions)
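The difference between one opinion and many can be sketched with a toy one-dimensional logistic model. This is not the Bayesian GCN of the cited paper. The parameter values and the symmetric grid standing in for posterior samples are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D logistic model: p(y=1 | x) = sigmoid(w * x).
w_map = 2.0  # single point estimate (MAP)
# Stand-in for posterior samples: a symmetric grid around w_map (assumption).
w_samples = w_map + 1.5 * np.linspace(-1.0, 1.0, 41)

x = 3.0  # an input far from the decision boundary at x = 0

p_map = float(sigmoid(w_map * x))                 # one opinion
p_bayes = float(np.mean(sigmoid(w_samples * x)))  # average of many opinions
print(f"MAP: {p_map:.4f}  Bayesian average: {p_bayes:.4f}")
```

Averaging over parameter sets pulls the predicted probability away from the extreme value of the single point estimate, which is the sense in which the Bayesian prediction is less overconfident.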

Data bias

DUD-E

- Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening, https://dx.doi.org/10.1371%2Fjournal.pone.0220113

: Receptor(=protein)-ligand CNN model vs. Ligand-only CNN model -> R^2 = 0.9819

: The two models show no real difference, for two reasons:

1) Analogue bias: active compounds share similar scaffolds, hence similar topological features.

2) Decoy bias: decoy scaffolds are not experimentally tested; they are selected by topological fingerprint-based Tanimoto similarity. From a set drawn at random from ZINC, the 75% most similar compounds are removed and the remaining 25% dissimilar decoys are used. The dataset therefore separates actives from inactives by similarity alone, not by binding affinity.
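The decoy selection described above can be sketched in plain Python, with fingerprints represented as sets of on-bits. This is an illustration of the idea only, not the actual DUD-E pipeline. The bit sets are made up, and `select_dissimilar_decoys` is a hypothetical helper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def select_dissimilar_decoys(actives, candidates, keep_fraction=0.25):
    """Keep the candidates least similar to any active (drop the other 75%)."""
    def max_sim(fp):
        return max(tanimoto(fp, a) for a in actives)
    ranked = sorted(candidates, key=max_sim)  # most dissimilar first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Toy fingerprints (made-up bit sets, not real molecules).
actives = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5})]
candidates = [
    frozenset({1, 2, 3, 6}),      # close analogue of the actives
    frozenset({1, 2, 7, 8}),
    frozenset({2, 9, 10, 11}),
    frozenset({20, 21, 22, 23}),  # shares no bits with any active
]
decoys = select_dissimilar_decoys(actives, candidates)
print(decoys)
```

A model trained on such a set can separate actives from decoys using ligand topology alone, which is consistent with the ligand-only model matching the receptor-ligand model.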

PDBbind

- Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets, https://doi.org/10.3389/fphar.2020.00069 (=GCN)

: use DeepChem package, https://deepchem.io/, https://github.com/deepchem/deepchem

: Predicting binding affinity from the protein alone or the ligand alone (which should be physically impossible) still gives a high correlation

: Test R^2 by data split: random > ligand scaffold-based > protein sequence-based
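The scaffold-based and sequence-based splits differ from a random split in that whole groups are kept on one side. A minimal sketch, assuming each sample already carries a group key (a scaffold or a sequence cluster label), follows. It is not DeepChem's implementation, and `group_split` is a hypothetical helper.

```python
from collections import defaultdict

def group_split(groups, test_fraction=0.3):
    """Split sample indices so that no group appears in both train and test.

    `groups[i]` is the group key of sample i (e.g. a ligand scaffold or a
    protein sequence cluster). Largest groups are assigned first, which is a
    design choice of this sketch.
    """
    by_group = defaultdict(list)
    for idx, key in enumerate(groups):
        by_group[key].append(idx)

    n_train_target = round(len(groups) * (1.0 - test_fraction))
    train, test = [], []
    for key in sorted(by_group, key=lambda k: (-len(by_group[k]), str(k))):
        members = by_group[key]
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Made-up scaffold labels for ten compounds.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = group_split(scaffolds)
print(train_idx, test_idx)
```

Under such a split the test set contains only scaffolds (or sequences) the model has never seen, so the test R^2 is expected to be lower than under a random split, as in the ordering above.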

Unbiased dataset

- LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening, https://doi.org/10.1021/acs.jcim.0c00155

1) Data retrieval from the PubChem BioAssay database = count of tested substances ≥ 10,000, count of active substances ≥ 50.

2) Data cleaning = removal of inorganic compounds, false positives, frequent hitters, assay artifacts, and compounds with extreme molecular properties.

3) Virtual screening = removing structural bias

4) Performance assessment = ROC, BEDROC, Enrichment Factor 1%.

5) Dataset with a natural hit rate (ratio of active to inactive compounds)
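Of the metrics in step 4, the enrichment factor at 1% is the simplest to write down: the hit rate among the top-ranked 1% of compounds divided by the hit rate of the whole library. The sketch below assumes higher scores mean "more likely active". The toy library is made up.

```python
import numpy as np

def enrichment_factor(scores, labels, fraction=0.01):
    """EF at the top `fraction`: hit rate in the top-ranked subset divided by
    the hit rate of the whole library."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_top = max(1, int(round(len(scores) * fraction)))
    top = np.argsort(-scores, kind="stable")[:n_top]  # highest scores first
    hit_rate_top = labels[top].mean()
    hit_rate_all = labels.mean()
    return float(hit_rate_top / hit_rate_all)

# Toy library: 1000 compounds, 10 actives (1% hit rate, made-up numbers).
labels = np.zeros(1000, dtype=int)
labels[:10] = 1
scores = np.linspace(1.0, 0.0, 1000)  # the actives receive the top scores

ef_perfect = enrichment_factor(scores, labels)   # every top-1% compound is active
ef_worst = enrichment_factor(-scores, labels)    # actives ranked last
print(ef_perfect, ef_worst)
```

With a hit rate of 1%, a perfect ranking reaches the maximum possible EF 1% of 100, while a random ranking would give about 1 on average.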

Reference

- Generalization and data bias in virtual screening l 김우연, https://youtu.be/5PEpTOm2ZSY

- [DL] 1. Learning Algorithms and basic terms of DL, https://medium.com/temp08050309-devpblog/dl-1-learning-algorithms-and-basic-terms-of-dl-65d46ceb1b0a

License: Attribution, NonCommercial, NoDerivatives
