
All posts (225)

Methylprep A useful tool for data preprocessing in methylation analysis. - https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/ - https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/_modules/index.html - https://github.com/FoxoTech/methylprep - https://github.com/FoxoTech/methylprep/blob/master/docs/general_walkthrough.md - https://github.com/FoxoTech/methylprep/blob/master/docs/special.. 2022. 1. 19.
ZINC database ZINC database - ZINC20 : https://zinc20.docking.org/ - ZINC15 : https://zinc.docking.org/ - Papers : ZINC: A Free Tool to Discover Chemistry for Biology (2012), https://doi.org/10.1021/ci3001277 : ZINC 15 – Ligand Discovery for Everyone (2015), https://doi.org/10.1021/acs.jcim.5b00559 : ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery (2020), https://doi.org/10.1021/acs.jcim.. 2022. 1. 10.
[R] table information
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)
# class of an object (numeric, matrix, data frame, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of myda..
2022. 1. 6.
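Molecule design using Deep generative models Reinforcement learning - Scenario 1 : molecule generation (agent) -> property improves -> reward (environment) -> the agent becomes more likely to generate the same molecule in the same situation. - Scenario 2 : molecule generation (agent) -> property worsens -> penalty (environment) -> the agent becomes less likely to generate the same molecule in the same situation. - Deep reinforcement learning for de novo drug design, https://doi.org/10.1126/sciadv.aap7885 > Demonstrates interpretability through the Tanh activation values. (chemically sensible groups, syntactic groups) Chemical-reaction-based molecular generative models - Molecule.. 2022. 1. 4.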
Molecule design using Graph model 2 Scaffold-based molecular graph generative model - To produce molecules that satisfy several property targets, the base scaffold is kept while parts are added to the molecule and the result is analyzed. 3D linker design model - DeLinker : a model that designs the optimal linker by taking into account the 3D orientation of two fragments - Deep Generative Models for 3D Linker Design, https://doi.org/10.1021/acs.jcim.9b01120 - https://github.com/oxpig/DeLinker - Generates more realistic molecules than existing database-based methods. - Yields more stable structures in docking calculations. - Experiment.. 2022. 1. 4.
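Molecule design using Graph model 1 Graph vs. SMILES - SMILES : similar molecules can be written as very different SMILES strings. (This makes learning harder.) - A graph is a more natural representation of a molecule than SMILES. Molecular graph - atom -> node, covalent bond -> edge - Atom and covalent-bond information is encoded as vectors on the nodes and edges. - Can represent the various kinds of information needed for model training. Sequential molecular graph generative model - Molecules are generated in a fixed order or a random order. Empirically there is no difference. Fragment based molecule generation using Language Model - From a human point of view, a molecule is a set of substructures. - Rather than atoms, functional groups (frag.. 2022. 1. 4.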
Molecule design using SMILES Language model using SMILES - Validity by using RDKit - Uniqueness - Novelty, not included in training set Pros - Easy to implement. (Libraries are well established.) - Easy to train. Cons - Latent space analysis is not possible. (No latent vector modification.) (Conditional) Variational autoencoder Pros - Relatively straightforward to implement. - Gives relatively good results for its difficulty. - Latent space analysis (or optimization) is possible. Cons - The prior assumption acts as a strong restriction... 2022. 1. 4.
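The three metrics listed for SMILES language models (validity, uniqueness, novelty) can be sketched as simple ratios. This is a minimal sketch under assumptions not in the post: validity is counted over all generated strings, uniqueness over the valid ones, and novelty over the unique valid ones, and the validity check is passed in as a callable. The `toy_is_valid` function below is a hypothetical stand-in that only checks parentheses; the post uses RDKit for the real check.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)  # not seen in the training set
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def toy_is_valid(smiles):
    """Hypothetical validity check: non-empty string with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


# Made-up example strings, for illustration only.
generated = ["CCO", "CCO", "CC(C)O", "CC(C", "c1ccccc1"]
training = ["CCO"]
validity, uniqueness, novelty = generation_metrics(generated, training, toy_is_valid)
print(validity, uniqueness, novelty)
```

In practice the strings would probably be canonicalized before counting uniqueness and novelty, since the same molecule can be written as several different SMILES.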
Cell ratio 2022. 1. 3.
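[Company] Standigm Standigm : https://www.standigm.com/main 2021 Korea Bio Investment Conference - Standigm, https://youtu.be/VmF7a7ROBOE Standigm pipeline ASK process - Extent of existing knowledge about the disease and target : papers are collected with NLP. - Whether genes of interest are included. - Biological pathway analysis : the biological pathway - gene relationship is assigned as a weight - Patient-specific expression - Tissue-specific expression - Competitive landscape : targets that have already entered clinical stages are excluded. BEST process - DB - Hit ID : binding affinity prediction - Hit to lead : novel scaffold - Lead optimization : moiety-based partial structure modification.. 2021. 12. 30.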
Signature matrix RNA-expression signature matrix reference CD4 - GSE107011 (2019) : RNA-Seq profiling of 29 immune cell types and peripheral blood mononuclear cells - GSE113891 (2018) : Transcriptomic profile of circulating CD4+ T cells from TCM and TEM memory compartments from donors vaccinated at birth either with whole or acellular Pertussis vaccine - GSE114407 (2018) : Cell type specific gene expression patt.. 2021. 12. 29.
ngrok - accessing a local PC When work done on a local PC has to be checked or demoed from outside but the machine's IP is not reachable externally, this post introduces a way to work around that. - Homepage : https://ngrok.com/ - Download : https://ngrok.com/download - Tokenkey : https://dashboard.ngrok.com/get-started/your-authtoken - Session status : http://127.0.0.1:4040 $ curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null && echo "deb https://n.. 2021. 12. 27.
numpy - ravel_multi_index
import numpy as np
arr = np.array([[3,6,6],[4,5,1]])
np.ravel_multi_index(arr, (7,6))
## array([22, 41, 37])
r = 7
c = 6
print(np.arange(r*c))
print(np.arange(r*c).reshape(r,c))
print(np.arange(r*c).reshape(r,c)[[3,6,6],[4,5,1]])
## [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41]
## [[ 0 1 2 3 4 5]
## [ 6 7 8 9 10 11]
## [12 13 14 15 16 17]
## ..
2021. 12. 21.
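The result above follows from the row-major (C-order) rule `flat = row * n_cols + col`. A minimal pure-Python sketch of that rule for the 2D case is given below; the helper name `ravel_multi_index_2d` is made up for illustration and is not part of numpy.

```python
def ravel_multi_index_2d(rows, cols, shape):
    """Convert (row, col) pairs to flat indices in row-major order."""
    n_rows, n_cols = shape
    flat = []
    for r, c in zip(rows, cols):
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            raise ValueError(f"index ({r}, {c}) is out of bounds for shape {shape}")
        flat.append(r * n_cols + c)
    return flat


# Same index pairs and shape as in the numpy example: (3,4), (6,5), (6,1) in a 7x6 grid.
flat = ravel_multi_index_2d([3, 6, 6], [4, 5, 1], (7, 6))
print(flat)  # [22, 41, 37]
```

For example, (3, 4) maps to 3 * 6 + 4 = 22, which is also the value found at row 3, column 4 of `np.arange(42).reshape(7, 6)`.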
Post Hoc tests Reference - Comparison of post hoc tests for unequal variance, https://www.ijntse.com/upload/1447070311130.pdf 2021. 12. 16.
[paper] Identification of SARS-CoV-2–induced pathways reveals drug repurposing strategies Identification of SARS-CoV-2–induced pathways reveals drug repurposing strategies (2021) - Science Advances, https://doi.org/10.1126/sciadv.abh3032 - bioRxiv, https://doi.org/10.1101/2020.08.24.265496 - Presentation : https://youtu.be/dqyzbC5ZSZA (Korean) - Presentation : https://youtu.be/SE3dGRKp5s0 (English) - Github : https://github.com/wchwang/Method_Pancorona Method Directly SARS-CoV-2-related .. 2021. 12. 14.
[DRUG] Reference - From machine learning to deep learning: Advances in scoring functions for protein–ligand docking, 2019, https://doi.org/10.1002/wcms.1429 - Deep Learning for Drug Design: an Artificial Intelligence Paradigm for Drug Discovery in the Big Data Era, 2018, https://doi.org/10.1208/s12248-018-0210-0 2021. 12. 13.
Open target platform The Open Targets Platform The Open Targets Platform integrates over 20 different public data sources, and uses this data to systematically build and score associations between drug targets and diseases. Users investigating particular associations can rapidly sift through all the available evidence from genetic associations, somatic mutations, pathways and systems biology, RNA expression, animal .. 2021. 12. 13.
Virtual screening 8 - Physics-informed GCN Data bias makes generalization difficult, and collecting a large data set is not feasible. A way to improve model performance under these conditions is needed. Model capacity (# parameters) - Optimal capacity : Control hidden layers : Gives some constraints, e.g., use kernel - A choice of model specifies which family of functions the learning algorithm can choose -> representational capacity of the model. Inductive bias = weight sharing = regularization - Weight s.. 2021. 12. 13.
Virtual screening 7 - Generalization Generalization - Make the training error small $$ \frac{1}{m^{\text{train}}} \lVert X^{\text{train}} \omega - y^{\text{train}} \rVert^2 $$ - Make the gap between training and test error small $$ \frac{1}{m^{\text{test}}} \lVert X^{\text{test}} \omega - y^{\text{test}} \rVert^2 $$ - Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. - Overfitting occurs when the gap between the training error and test error is too large. - Th.. 2021. 12. 12.
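The two error terms above are the same mean squared error evaluated on different data. A minimal sketch for a linear model follows; the toy data and weights are made up for illustration (the training points lie exactly on y = 1 + 2x, the test points do not), so the numbers carry no meaning beyond showing the train/test gap.

```python
def mean_squared_error(X, w, y):
    """(1/m) * ||X w - y||^2 for a list-of-rows X, weights w and targets y."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(x * wi for x, wi in zip(row, w))
        total += (prediction - target) ** 2
    return total / len(X)


# First column is a constant 1 (bias term); made-up toy data.
w = [1.0, 2.0]
X_train = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_train = [1.0, 3.0, 5.0]
X_test = [[1.0, 3.0], [1.0, 4.0]]
y_test = [7.5, 8.5]

train_error = mean_squared_error(X_train, w, y_train)
test_error = mean_squared_error(X_test, w, y_test)
gap = test_error - train_error
print(train_error, test_error, gap)
```

A small training error with a large gap would point toward overfitting, while a large training error would point toward underfitting, as described in the bullets above.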
Virtual screening 6 - Hybrid Autoencoder - Dimensionality reduction : used to reduce the dimension of the input data. - This process is called encoding. - To check that the key features were learned well through dimensionality reduction, it must be possible to reconstruct the original data from the encoding. Classification with separate graphs GCN with autoencoder for virtual screening - Graph Convolutional Neural Networks for Predicting Drug-Target Interactions, https://doi.org/10.1021/acs.jcim.9b00628 - Li.. 2021. 12. 12.
Virtual screening 5 - GCN Graph Convolutional Networks - Applicable to irregularly structured data. System - Structure : Representation, Computation. - Entity : element, size, mass, ... - Relation : property between entities. - Rule : relational inductive biases; the DL model is designed around the structure. Graph representation $$ Graph = G(X,A) $$ X : Node, Vertex - Atoms in a molecule A : Adjacency matrix - Edges of a graph - Connectivity, relationship Molecular graphs X.. 2021. 12. 12.
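As an illustration of G(X, A), the sketch below builds a node feature matrix X and an adjacency matrix A for the heavy atoms of ethanol (C-C-O). The one-hot element encoding and the helper name `build_graph` are assumptions made for this example; real models typically use richer atom and bond features.

```python
def build_graph(atoms, bonds, elements):
    """Return (X, A): one-hot node features and a symmetric adjacency matrix."""
    n = len(atoms)
    # X: one row per atom, one column per element type.
    X = [[1 if atom == element else 0 for element in elements] for atom in atoms]
    # A: A[i][j] = 1 when atoms i and j share a covalent bond.
    A = [[0] * n for _ in range(n)]
    for i, j in bonds:
        A[i][j] = 1
        A[j][i] = 1  # undirected graph, so the matrix is symmetric
    return X, A


atoms = ["C", "C", "O"]    # heavy atoms of ethanol, hydrogens omitted
bonds = [(0, 1), (1, 2)]   # C-C and C-O
X, A = build_graph(atoms, bonds, elements=["C", "O"])
degrees = [sum(row) for row in A]  # number of bonded neighbours per atom
print(X)
print(A)
print(degrees)
```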
Virtual screening 4 - 3D CNN Deep Neural Network (fully connected) - Large number of parameters -> prone to overfitting when data is scarce, and high memory consumption (GPU) - Does not enforce any structure, e.g., local information (it is hard to pick out local features.) Convolutional Neural Network (weight sharing and convolving) - Reduce the number of parameters (less overfitting) 3D CNN for virtual screening - Grid representation - B.. 2021. 12. 12.
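The parameter saving from weight sharing can be shown with a quick count. The sizes below (a 24x24x24 grid with 8 channels, a dense layer of 256 units, a 3x3x3 convolution with 32 output channels) are made-up values for illustration, not figures from the post.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv3d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 3D convolution with a cubic kernel."""
    return kernel_size ** 3 * in_channels * out_channels + out_channels


grid, channels = 24, 8                 # assumed voxel grid of the binding site
n_inputs = grid ** 3 * channels        # flattened input size for the dense layer

fc = dense_params(n_inputs, 256)
conv = conv3d_params(3, channels, 32)
print(fc, conv)
```

The convolution reuses the same small kernel at every grid position, so its parameter count does not grow with the grid size, whereas the dense layer's count grows linearly with it.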
Virtual screening 3 - Data sets Data formats - FASTA : sequence similarity calculation. - Structure Data File (SDF) includes 3D atomic coordinates, atom connectivity, molecular weight, logP, etc. - mol2 : comment, info, elements, coordinate, bond - Protein Data Bank (PDB) : element, amino acid, chain name, sequence number, coordinates - docking. Database DUD-E : http://dude.docking.org/, Virtual screening, Classification - Dir.. 2021. 12. 11.
Virtual screening 2 - AI Molecular structure-property relationship by supervised learning - Input : Structure or protein (X) - Method : Conventional, $$ Y = f(X) $$ where f = Schrödinger equation or Hamiltonian : Modern, feature extraction (L), $$ Y = f_\theta(X) $$ where f = AI or machine learning (DNN, CNN, RNN, GNN, etc), θ = a set of learnable parameters - Output data : Property (Y), binding affinity. Modeling (θ) = Maximum .. 2021. 12. 11.
Virtual screening 1 - Intro Predicting protein-ligand interactions is the key task. - Assumptions : Rigid protein structure, no explicit solvation, no explicit pH dependence, etc. : calculations are carried out with the limits of these assumptions in mind. - Step1. Structure preparation (protein, ligand) : prepare the 3D structure of the target protein. PDB, X-ray analysis, homology modeling, folding prediction. - Step2. Ligand preparation. : conformer search, charge state, protonation of acids, etc - Step3. Bindi.. 2021. 12. 10.
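Combine files A way to combine several files into one.
import pandas as pd
import pickle as pk
import gzip
import glob

# collect the input tables in a stable order
flist = sorted(glob.glob('*.tsv'))

# write every DataFrame into one gzip-compressed pickle stream
with gzip.open('data.db','wb') as f:
    for fname in flist:
        df = pd.read_table(fname)
        pk.dump(df,f)

# read the DataFrames back, one pk.load call per dumped object
with gzip.open('data.db','rb') as f:
    for _ in range(len(flist)):
        df = pk.load(f)
        print(df)
2021. 12. 3.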
Heroku - App publication This post shows how to publish a service built with Streamlit to the web. Files to prepare (4)
$ cat requirements.txt
lifelines==0.26.4
matplotlib==3.4.3
numpy==1.20.0
pandas==1.3.4
scikit_learn==1.0.1
seaborn==0.11.2
streamlit==1.2.0
$ cat setup.sh
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
$ cat Procfile
web: sh setup.sh && streamlit run .. 2021. 12. 2.
WEB-based analysis TIMER : A Web Server for Comprehensive Analysis of Tumor-Infiltrating Immune Cells > TIMER, https://doi.org/10.1158/0008-5472.CAN-17-0307 > TIMER2.0, https://doi.org/10.1093/nar/gkaa407 http://timer.cistrome.org/ http://timer.comp-genomics.org/ GEPIA : Gene Expression Profiling Interactive Analysis > GEPIA, https://doi.org/10.1093/nar/gkx247 > GEPIA2, https://doi.org/10.1093/nar/gkz430 > GEPIA20.. 2021. 11. 30.
[paper] Bioinformatics screening of biomarkers related to liver cancer https://doi.org/10.1186/s12859-021-04411-1 0. TCGA DB : Ready data 1. DESeq2 : Upregulated and Downregulated genes. - |logFC| > 2 , FDR < 0.05 2. GSEA : Enrichment (GO) and Pathway (KEGG, Reactome) analysis. - FDR(GO) < 0.01 , P-value(KEGG) < 0.05 3. STRING DB and Cytoscape : PPI network hub for key genes. - top 15 hub genes 4. Oncomine (gene chip) DB : Differential expression analysis and meta-.. 2021. 11. 11.
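The differential-expression cutoffs in step 1 (|logFC| > 2, FDR < 0.05) amount to a simple filter on the DESeq2 result table. The sketch below applies them to a list of records; the gene names and values are invented for illustration, and the field names `logFC` and `FDR` are assumptions about how the table has been exported.

```python
LOGFC_CUTOFF = 2.0   # |logFC| > 2, as in the paper summary
FDR_CUTOFF = 0.05    # FDR < 0.05


def split_degs(records):
    """Split records into upregulated and downregulated gene lists."""
    up, down = [], []
    for rec in records:
        if rec["FDR"] >= FDR_CUTOFF:
            continue  # not significant
        if rec["logFC"] > LOGFC_CUTOFF:
            up.append(rec["gene"])
        elif rec["logFC"] < -LOGFC_CUTOFF:
            down.append(rec["gene"])
    return up, down


# Invented example rows, not real results.
results = [
    {"gene": "GENE_A", "logFC": 3.1, "FDR": 0.001},    # upregulated
    {"gene": "GENE_B", "logFC": -2.5, "FDR": 0.01},    # downregulated
    {"gene": "GENE_C", "logFC": 1.2, "FDR": 0.0001},   # effect too small
    {"gene": "GENE_D", "logFC": 4.0, "FDR": 0.20},     # not significant
    {"gene": "GENE_E", "logFC": -2.0, "FDR": 0.01},    # exactly at the cutoff, excluded
]
up, down = split_degs(results)
print(up, down)
```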
RNA velocity The RNA data we usually work with comes from blood or tissue sampled at a single time point. In other words, it is static data that does not show how cells change over time. From such static data, cell fate, cell lineage, dynamic pathways, or cellular differentiation can be estimated from the ratio of unspliced mRNA to spliced mRNA. (RNA velocity of single cells, 2018) A more robust approach is the likelihood-based dynamic model. (Generalizing RNA velocity to transient cell states through dynamical mod.. 2021. 11. 2.
Webpage with Streamlit Streamlit is a handy tool for viewing the data you work with in Python on the web. Because it works directly with Python, you do not need to know HTML, and it is easy to build a page for the results you want to show. The drawback is that a multi-page app made up of several tabs is difficult to build.
$ pip install streamlit
$ pip install streamlit-aggrid
import pandas as pd
import numpy as np
import streamlit as st
import matplotlib.pyplot as plt
import altair as alt
# remove github icon
st.markdown( """ """, unsafe_a.. 2021. 10. 28.