
Virtual screening 3 - Data sets

by wycho 2021. 12. 11.

Data formats

- FASTA : plain-text sequences (a header line followed by residues); used for sequence similarity calculation.

- Structure Data File (SDF) : 3D atomic coordinates and atom connectivity, plus property fields such as molecular weight and logP.

- mol2 : comment, molecule info, elements, coordinates, bonds.

- Protein Data Bank (PDB) : element, amino acid, chain name, sequence number, coordinates; used for docking.
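As a rough illustration of two of the formats above, here is a minimal sketch that reads a FASTA record and slices one fixed-width PDB ATOM line using only the Python standard library. The function names, the toy sequence, and the sample ATOM line are made up for this example; in practice a library such as RDKit or Open Babel (listed under Useful tools) would normally do the parsing.

```python
def parse_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


def parse_pdb_atom(line):
    """Slice one fixed-width ATOM/HETATM record into the fields listed above."""
    return {
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20 (amino acid)
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26 (sequence number)
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),     # columns 77-78
    }


# Hypothetical toy inputs, not taken from any real entry.
FASTA_TEXT = """>toy_protein hypothetical example
MKTAYIAKQR
QISFVKSHFS
"""
ATOM_LINE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

seqs = parse_fasta(FASTA_TEXT.splitlines())
atom = parse_pdb_atom(ATOM_LINE)
print(seqs)
print(atom)
```

The PDB slices rely on the fixed column layout of the ATOM record, so splitting on whitespace is avoided here: adjacent fields may touch when values are wide.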

 

Database

DUD-E : http://dude.docking.org/, Virtual screening, Classification

- Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking, https://pubs.acs.org/doi/abs/10.1021/jm300687e, https://pubs.acs.org/doi/pdf/10.1021/jm300687e

- 22,886 active compounds and their affinities against 102 targets, an average of 224 ligands per target
- 50 decoys for each active having similar physico-chemical properties but dissimilar 2-D topology.
- mol2 and SDF formats are now available in all packages for actives and decoys.

- Actives (ligands) are annotated from ChEMBL; inactives (decoys) are taken from the ZINC database.
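The numbers above imply a fixed class balance, which matters when DUD-E is used for classification. A small sketch of the arithmetic, assuming exactly 50 decoys for every active (a nominal figure derived from the stated ratio; the real download may differ), with variable names chosen for this example:

```python
n_actives = 22886          # active compounds, from the DUD-E description
n_targets = 102            # targets
decoys_per_active = 50     # stated decoy ratio

# Average number of actives per target.
avg_actives_per_target = n_actives / n_targets

# Nominal decoy count if every active had exactly 50 decoys (assumption).
n_decoys_nominal = n_actives * decoys_per_active

# Fraction of the whole set that is active: 1 / (1 + 50).
active_fraction = n_actives / (n_actives + n_decoys_nominal)

print(round(avg_actives_per_target, 1))   # 224.4
print(n_decoys_nominal)                   # 1144300
print(round(100 * active_fraction, 2))    # 1.96 (percent)
```

So a classifier trained on DUD-E sees roughly 2% positives by construction, regardless of the target.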

 

LIT-PCBA : https://drugdesign.unistra.fr/LIT-PCBA/, Virtual screening, Classification

- LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening, https://pubs.acs.org/doi/10.1021/acs.jcim.0c00155

- Both actives and inactives are experimentally determined. The hit ratio is more realistic than in DUD-E.

- Provide training and validation sets.

 

ChEMBL : https://www.ebi.ac.uk/chembl/, Virtual screening, Binding affinity prediction

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

- ChEMBL : a large-scale bioactivity database for drug discovery, https://doi.org/10.1093/nar/gkr777

- Well suited to drug discovery.

 

PDBbind : http://www.pdbbind.org.cn/, Binding affinity prediction, Binding pose prediction

- The PDBbind Database:  Methodologies and Updates, https://doi.org/10.1021/jm048957q

- Forging the Basis for Developing Protein–Ligand Interaction Scoring Functions, https://doi.org/10.1021/acs.accounts.6b00491

- Provides bound structures together with binding affinities, so it can be used for regression (affinity prediction) and binding pose prediction.

- Provides the binding pocket that contains the ligand.

- Original literature information, 3D structure view, description of the data, classification, organism, expression system.

- Macromolecules (target), Small molecules (ligands).

- Experimental data, validation, history.

 

CASF Benchmark : http://www.pdbbind.org.cn/casf.php

- Comparative Assessment of Scoring Functions: The CASF-2016 Update, https://doi.org/10.1021/acs.jcim.8b00545

- PDBbind Core set for virtual screening

- To be good at virtual screening, protein-ligand interaction (PLI) models should:

  1. Accurately predict the affinity -> Scoring power

  2. Correctly rank the known ligands by affinity -> Rank power

  3. Identify the native ligand binding pose among computer-generated decoys -> Docking power

  4. Identify the true binders to a given protein among a pool of random molecules -> Screening power
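The first two criteria can be made concrete with correlation coefficients. As far as I understand the CASF-2016 paper, scoring power is reported as a Pearson correlation between predicted and experimental affinities, and ranking power as a rank correlation such as Spearman; treat that mapping as my reading rather than a definition. Below is a minimal standard-library sketch with hypothetical affinity values:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five ligands of one target (e.g. pKd units).
experimental = [4.2, 5.1, 6.3, 7.8, 9.0]
predicted = [4.9, 4.8, 6.0, 7.1, 8.5]

scoring_power = pearson(experimental, predicted)    # how close to a linear fit
ranking_power = spearman(experimental, predicted)   # how well the order matches
print(round(scoring_power, 3), round(ranking_power, 3))
```

In this toy case only the two weakest ligands are swapped, so the rank correlation stays high even though the ordering is not perfect. Docking power and screening power need pose RMSDs and decoy pools, so they are not covered by this sketch.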

 

CSAR : http://www.csardock.org/

This effort aimed to improve docking and scoring through participation of the entire scientific community. CSAR disseminated experimental datasets of crystal structures and binding affinities for diverse protein-ligand complexes. Some datasets were generated in house at the University of Michigan while others were collected from the literature or deposited by academic labs, national centers, and the pharmaceutical industry.

Computational drug design techniques are very successful at enriching hit rates when identifying sets of compounds for experimental testing. However, it is not possible to reliably rank nanomolar-level compounds over those with micromolar affinities. To improve our approaches, we need better datasets to train scoring functions and develop new docking algorithms.

- CSAR-HiQ set

  : Protein-ligand complex, unminimized structure (*_complex.mol2)

  : Protein-ligand complex after minimization (*_complex_min.mol2)

  : Water molecules that were removed for docking (*_water.pdb)

  : Binding affinity (kd.dat)

 

MUV : https://github.com/skearnes/muv

- Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data, https://doi.org/10.1021/ci8002649

 

AlphaFold : https://alphafold.ebi.ac.uk/, Protein structure prediction

- https://github.com/deepmind/alphafold

 

 

Useful tools

RDKit : https://www.rdkit.org/

- https://github.com/rdkit/rdkit

 

Open Babel : http://openbabel.org/wiki/Main_Page

- https://github.com/openbabel/openbabel

 

PyMOL : https://pymol.org/2/

- https://github.com/schrodinger/pymol-open-source

 

 

 

Reference

- Data sets for virtual screening? l 김우연, https://youtu.be/EHaK4TNSMSQ

 

 
