methylprep is a useful tool for preprocessing data for methylation analysis.
- https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/
- https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/_modules/index.html
- https://github.com/FoxoTech/methylprep
- https://github.com/FoxoTech/methylprep/blob/master/docs/general_walkthrough.md
- https://github.com/FoxoTech/methylprep/blob/master/docs/special_cases.md
- https://giters.com/adamritter/methylprep
$ pip install methylsuite
usage: methylprep [-h] [-v] [-d]
{process,beta_bake,download,meta_data,composite,sample_sheet,alert} ...
Utility to process methylation data from Illumina IDAT files. There are two types of processing:
"process" IDAT files or read a "sample_sheet" contents. Example of usage: `python -m methylprep -v
process -d <path to your samplesheet.csv and idat files>` Try our demo dataset: `python -m methylprep
-v process -d docs/example_data/GSE69852`
positional arguments:
{process,beta_bake,download,meta_data,composite,sample_sheet,alert}
process Finds idat files and calculates raw, beta, m_values for a batch of samples.
beta_bake All encompasing pipeline that will find GEO datasets in any form, download,
and convert into a pickled dataframe of beta-values. Just specify the GEO_ID.
download Downloads the specified series from GEO or ArrayExpress.
meta_data Creates a meta_data dataframe from GEO MINiML XML file. Specify the GEO id.
composite Create a single dataset from a group of public GEO or ArrayExpress datasets,
and apply filters to sample meta data at same time.
sample_sheet Finds and validates a SampleSheet for a given directory of idat files.
alert Command line or Cron function to search GEO for datasets, updating only if new
data found.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display more detailed messages during processing.
-d, --debug Display VERY detailed messages during processing.
Download IDAT files from GEO.
$ python -m methylprep -v download -i GSE147391 -d <download_path>
Process IDAT files into beta values and M-values.
$ python -m methylprep process -d <idat_filepath> --all --minfi -s user_samplesheet.csv -bs 300 -np 4
The --all option tells methylprep to save the output of all associated processing steps:
- beta_values.pkl
- poobah_values.pkl
- control_probes.pkl
- m_values.pkl
- noob_meth_values.pkl
- noob_unmeth_values.pkl
- meth_values.pkl
- unmeth_values.pkl
- sample_sheet_meta_data.pkl
By default, the output is usually:
- beta_values.pkl
- noob_meth_values.pkl
- noob_unmeth_values.pkl
- control_probes.pkl
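The relationship between the intensity outputs (meth/unmeth) and the beta and M-value outputs can be sketched as follows. This is a minimal illustration using the commonly cited definitions; the `offset` of 100 and `alpha` of 1 are conventional defaults assumed here, not values confirmed from methylprep's source, and the function names are hypothetical.

```python
import math


def beta_value(meth: float, unmeth: float, offset: float = 100.0) -> float:
    """Share of the total signal that is methylated, in the range [0, 1).

    The offset (assumed to be 100 here) stabilizes probes with low intensity.
    """
    return meth / (meth + unmeth + offset)


def m_value(meth: float, unmeth: float, alpha: float = 1.0) -> float:
    """Log2 ratio of methylated to unmethylated intensity.

    alpha (assumed to be 1 here) avoids taking the log of zero.
    """
    return math.log2((meth + alpha) / (unmeth + alpha))


print(beta_value(900.0, 0.0))  # 0.9
print(m_value(7.0, 1.0))       # 2.0
```

Beta values are easier to read as a methylation fraction, while M-values are unbounded and often preferred for statistical testing.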
By default, processing follows SeSAMe's parameters; '--minfi' switches to minfi's parameters.
'-s' specifies the user's sample sheet.
'-bs' is the batch size. Processing too many samples at once can cause memory allocation errors,
so split the run into batches. For example, '-bs 300' processes 300 samples at a time.
'-np' sets the number of processes for multiprocessing. The default is the maximum number of threads available on your CPUs.
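To make the effect of the batch size concrete, here is a minimal sketch of how a sample list could be split into fixed-size batches. It only illustrates the idea; `split_into_batches` and the sample names are hypothetical and are not taken from methylprep.

```python
from typing import Iterator, List, Sequence


def split_into_batches(samples: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `samples`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Hypothetical sample names; 826 matches the size of the EPIC cancer cohort in this post.
samples = [f"sample_{i}" for i in range(826)]
batches = list(split_into_batches(samples, 300))
print([len(batch) for batch in batches])  # [300, 300, 226]
```

Only one batch needs to be held in memory at a time, which is why a smaller batch size lowers peak memory use at the cost of more passes.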
Find GEO datasets whose titles contain "glioblastoma".
$ python -m methylprep alert -k "glioblastoma"
glioblastoma.csv
Combine datasets.
$ python -m methylprep composite -l glioblastoma.txt -d <filepath>
$ cat glioblastoma.txt
GSE161175
GSE122808
GSE143842
GSE143843
Create a sample sheet from GEO data information.
$ python -m methylprep meta_data -i GSE147391 -d <filepath>
2.0K Feb 4 14:19 GSE147391_GPL21145_meta_data.pkl
1.1K Feb 4 14:19 GSE147391_GPL21145_samplesheet.csv
45K Jan 19 15:07 GSE147391_family.xml
184M Feb 4 14:19 GSE147391_family.xml.tgz
Filter samples with '-k'.
$ python -m methylprep meta_data -i GSE147391 -d <filepath> -k "grade II"
| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | histological diagnosis | gender | description | Sample_ID |
|---|---|---|---|---|---|---|---|---|
| GSM4429896 | Grade II rep1 | 203175700025 | R01C01 | Resected glioma | Diffuse astrocytoma (II) | Female | Glioma | 203175700025_R01C01 |
| GSM4429897 | Grade II rep2 | 203175700025 | R02C01 | Resected glioma | Diffuse astrocytoma (II) | Male | Glioma | 203175700025_R02C01 |
| GSM4429898 | Grade II rep3 | 203163220027 | R01C01 | Resected glioma | Diffuse astrocytoma (II) | Female | Glioma | 203163220027_R01C01 |
| GSM4429900 | Grade II rep4 | 203175700025 | R05C01 | Resected glioma | Oligodendroglioma (II) | Female | Glioma | 203175700025_R05C01 |
| GSM4429901 | Grade II rep5 | 203175700025 | R06C01 | Resected glioma | Oligodendroglioma (II) | Male | Glioma | 203175700025_R06C01 |
| GSM4429902 | Grade II rep6 | 203163220027 | R05C01 | Resected glioma | Oligodendroglioma (II) | Male | Glioma | 203163220027_R05C01 |
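The same kind of keyword filter can be applied afterwards to a sample sheet CSV. The sketch below assumes a case-insensitive substring match across all fields; that is an assumption for illustration and not a description of how methylprep implements '-k'. The function name `filter_rows` and the second data row are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def filter_rows(rows: Iterable[Dict[str, str]], keyword: str) -> List[Dict[str, str]]:
    """Keep rows in which any field contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value or "").lower() for value in row.values())
    ]


# In-memory stand-in for a samplesheet.csv; the second row is made up for contrast.
sheet = io.StringIO(
    "GSM_ID,Sample_Name,Sentrix_ID,Sentrix_Position\n"
    "GSM4429896,Grade II rep1,203175700025,R01C01\n"
    "GSM0000001,Grade IV rep1,000000000000,R01C01\n"
)
kept = filter_rows(csv.DictReader(sheet), "grade II")
print([row["GSM_ID"] for row in kept])  # ['GSM4429896']
```

Note that a plain substring match on "grade II" would also keep any row containing "grade III", so the result is worth checking by eye.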
Select only control samples.
$ python -m methylprep meta_data --control -i GSE163970
Performance test
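The original code runs entirely serially. Because results took a long time to arrive, I changed it to run in parallel and measured the runtime on a machine with 28 threads and 256 GB of memory. Processing about 800 samples, which used to take about 2 days, now takes about 2 hours. With serial processing I hit memory allocation errors and tried increasing swap, but handling the large data was still slow and inefficient. I therefore split the work into batches, and after some code changes memory use became stable at a few tens of gigabytes. With a few more tweaks it could be faster and use even less memory, but I stopped here. I forked the source code and did this work in my personal GitHub repository.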
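The general pattern of combining batching with multiprocessing can be sketched with the standard library. This is not the code from the fork; `run_batches` and `count_samples` are hypothetical names, and `count_samples` merely stands in for the real per-batch IDAT processing.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence


def run_batches(
    batches: Sequence[List[str]],
    worker: Callable[[List[str]], int],
    max_workers: int = 4,
    executor_cls=ProcessPoolExecutor,
) -> List[int]:
    """Run `worker` on every batch in a pool and return results in input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))


def count_samples(batch: List[str]) -> int:
    """Hypothetical placeholder for the real processing of one batch."""
    return len(batch)
```

A call such as `run_batches(batches, count_samples, max_workers=4)` would process up to four batches at once. With a process pool, the worker must be a top-level function so it can be pickled, and on platforms that spawn new processes the call belongs under an `if __name__ == "__main__":` guard. Peak memory then scales roughly with the batch size times the number of workers, so the two settings have to be tuned together.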
TCGA (450k)
- KICH (66) ~ 7 min (serial 35 min)
- COAD (353) ~ 30 min
EPIC
- Normal (351) ~ 58 min
- Cancer (826) ~ 2 hr 15 min (serial 43 hr)
Output format change
- Pickle to Parquet (beta_values: 2.7G -> 809M)
After preprocessing
2022.01.20 - [Tools] - Methylcheck
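Factors in DNA that affect phenotype include sequence mutations and methylation.
Methylation refers to a methyl group (H3C-) attached to a cytosine that is directly followed by a guanine in the DNA sequence (a CpG site). An attached methyl group affects gene expression and, in turn, the phenotype. It mainly occurs at the promoter, the region where transcription of DNA's genetic information into mRNA begins. Promoter methylation blocks the activity of the enzymes that facilitate transcription into mRNA, so the gene is either not expressed or expressed at a low level.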
Reference
- https://support.illumina.com/downloads/infinium_humanmethylation450_product_files.html