Methylprep

by wycho 2022. 1. 19.

A useful tool for data preprocessing in methylation analysis.

 

- https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/

- https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/_modules/index.html

- https://github.com/FoxoTech/methylprep

- https://github.com/FoxoTech/methylprep/blob/master/docs/general_walkthrough.md

- https://github.com/FoxoTech/methylprep/blob/master/docs/special_cases.md

- https://giters.com/adamritter/methylprep

 

$ pip install methylsuite
The command-line help lists the available subcommands:
usage: methylprep [-h] [-v] [-d]
                  {process,beta_bake,download,meta_data,composite,sample_sheet,alert} ...

Utility to process methylation data from Illumina IDAT files. There are two types of processing:
"process" IDAT files or read a "sample_sheet" contents. Example of usage: `python -m methylprep -v
process -d <path to your samplesheet.csv and idat files>` Try our demo dataset: `python -m methylprep
-v process -d docs/example_data/GSE69852`

positional arguments:
  {process,beta_bake,download,meta_data,composite,sample_sheet,alert}
    process             Finds idat files and calculates raw, beta, m_values for a batch of samples.
    beta_bake           All encompasing pipeline that will find GEO datasets in any form, download,
                        and convert into a pickled dataframe of beta-values. Just specify the GEO_ID.
    download            Downloads the specified series from GEO or ArrayExpress.
    meta_data           Creates a meta_data dataframe from GEO MINiML XML file. Specify the GEO id.
    composite           Create a single dataset from a group of public GEO or ArrayExpress datasets,
                        and apply filters to sample meta data at same time.
    sample_sheet        Finds and validates a SampleSheet for a given directory of idat files.
    alert               Command line or Cron function to search GEO for datasets, updating only if new
                        data found.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display more detailed messages during processing.
  -d, --debug           Display VERY detailed messages during processing.

 

 

Download IDAT files from GEO.

$ python -m methylprep -v download -i GSE147391 -d <download_path>

 

Process IDAT files into methylation values.

$ python -m methylprep process -d <idat_filepath> --all --minfi -s user_samplesheet.csv -bs 300 -np 4
The --all option at the end tells methylprep to save output for ALL of the associated processing steps.
- beta_values.pkl
- poobah_values.pkl
- control_probes.pkl
- m_values.pkl
- noob_meth_values.pkl
- noob_unmeth_values.pkl
- meth_values.pkl
- unmeth_values.pkl
- sample_sheet_meta_data.pkl
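
The beta_values and m_values outputs are related by the usual base-2 logit transform. Below is a minimal sketch of that relationship on toy data. The probe and sample names are made up, and methylprep itself may compute M-values differently (for example directly from intensities), so treat this as an illustration rather than the tool's implementation.

```python
import numpy as np
import pandas as pd


def beta_to_m(beta: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Convert beta-values (between 0 and 1) to M-values: log2(beta / (1 - beta))."""
    # Clip so that betas of exactly 0 or 1 do not produce infinite values.
    clipped = beta.clip(lower=eps, upper=1 - eps)
    return np.log2(clipped / (1 - clipped))


# Hypothetical toy data: rows are probes, columns are samples.
beta = pd.DataFrame(
    {"sample_1": [0.5, 0.8], "sample_2": [0.2, 0.5]},
    index=["probe_a", "probe_b"],
)
m_values = beta_to_m(beta)
```

With real output, the same function could be applied to a DataFrame loaded with `pd.read_pickle("beta_values.pkl")`, assuming the pickle holds a probes-by-samples DataFrame.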

By default, the output is usually:
- beta_values.pkl
- noob_meth_values.pkl
- noob_unmeth_values.pkl
- control_probes.pkl

By default, processing follows SeSAMe-style parameters. The '--minfi' option switches to minfi-style parameters.

'-s' specifies a user-supplied sample sheet.

'-bs' is the batch size. Running too many samples at once can cause memory allocation errors, so split the run into batches. For example, '-bs 300' processes 300 samples per batch.

'-np' sets the number of parallel processes. The default is the maximum number of threads available on your CPU.
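
To make the batch-size idea concrete, here is a small sketch of how a sample list gets split into fixed-size chunks. The function name and sample IDs are hypothetical and this is not methylprep's own code; it only shows what '-bs 300' means for a run of 826 samples.

```python
from typing import Iterator, List, Sequence


def split_into_batches(samples: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Hypothetical sample IDs, standing in for the rows of a sample sheet.
sample_ids = [f"sample_{i:04d}" for i in range(826)]
batches = list(split_into_batches(sample_ids, 300))
batch_sizes = [len(batch) for batch in batches]  # [300, 300, 226]
```

Only one batch has to fit in memory at a time, which is why a smaller batch size trades some speed for a lower peak memory footprint.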

 

Find GEO datasets whose titles contain "glioblastoma".

$ python -m methylprep alert -k "glioblastoma"
glioblastoma.csv

 

Combine datasets.

$ python -m methylprep composite -l glioblastoma.txt -d <filepath>
$ cat glioblastoma.txt
GSE161175
GSE122808
GSE143842
GSE143843

 

Create a sample sheet from GEO data information.

$ python -m methylprep meta_data -i GSE147391 -d <filepath>
2.0K Feb  4 14:19 GSE147391_GPL21145_meta_data.pkl
1.1K Feb  4 14:19 GSE147391_GPL21145_samplesheet.csv
 45K Jan 19 15:07 GSE147391_family.xml
184M Feb  4 14:19 GSE147391_family.xml.tgz

 

Filter samples with '-k'.

$ python -m methylprep meta_data -i GSE147391 -d <filepath> -k "grade II"
GSM_ID      Sample_Name     Sentrix_ID    Sentrix_Position  source           histological diagnosis              gender  description  Sample_ID
GSM4429896  Grade II rep1   203175700025  R01C01            Resected glioma  Diffuse astrocytoma (II)            Female  Glioma       203175700025_R01C01
GSM4429897  Grade II rep2   203175700025  R02C01            Resected glioma  Diffuse astrocytoma (II)            Male    Glioma       203175700025_R02C01
GSM4429898  Grade II rep3   203163220027  R01C01            Resected glioma  Diffuse astrocytoma (II)            Female  Glioma       203163220027_R01C01
GSM4429900  Grade II rep4   203175700025  R05C01            Resected glioma  Oligodendroglioma (II)              Female  Glioma       203175700025_R05C01
GSM4429901  Grade II rep5   203175700025  R06C01            Resected glioma  Oligodendroglioma (II)              Male    Glioma       203175700025_R06C01
GSM4429902  Grade II rep6   203163220027  R05C01            Resected glioma  Oligodendroglioma (II)              Male    Glioma       203163220027_R05C01
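
A similar keyword filter can be applied afterwards to a sample sheet loaded in pandas. The sketch below is my own approximation, not how methylprep implements '-k', and the rows are invented toy data.

```python
import pandas as pd


def filter_samples(sheet: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Keep rows where any column contains the keyword (case-insensitive)."""
    # Join every column of a row into one string, then do a plain substring match.
    row_text = sheet.astype(str).agg(" ".join, axis=1)
    mask = row_text.str.contains(keyword, case=False, regex=False)
    return sheet[mask]


# Hypothetical sample sheet with made-up IDs.
sheet = pd.DataFrame(
    {
        "GSM_ID": ["GSM_A", "GSM_B", "GSM_C"],
        "Sample_Name": ["Grade II rep1", "Grade IV rep1", "Grade II rep2"],
        "gender": ["Female", "Male", "Male"],
    }
)
grade_two = filter_samples(sheet, "grade II")
```

Note that a plain substring match for "grade II" would also match "Grade III", so a stricter pattern may be needed depending on the metadata.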

 

Select only control samples.

$ python -m methylprep meta_data --control -i GSE163970

 

Performance test

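The original code runs everything serially. Because results took so long, I changed it to run in parallel and timed it on a machine with 28 threads and 256 GB of memory. Processing about 800 samples went from roughly 2 days to roughly 2 hours. The serial run hit memory allocation errors; I tried enlarging swap, but processing large data that way was slow and inefficient. So I split the run into batches, and after some code changes memory use became stable at a few tens of gigabytes. A few more tweaks could make it faster and cut memory use further, but I stopped here. I forked the source code and did the work in my personal GitHub repository.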

TCGA (450k)
- KICH (66) ~ 7 min (serial 35 min)
- COAD (353) ~ 30 min

EPIC
- Normal (351) ~ 58 min
- Cancer (826) ~ 2 hr 15 min (serial 43 hr)
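
The speedup implied by these timings is just serial time divided by parallel time. A quick check using the approximate numbers above:

```python
def speedup(serial_minutes: float, parallel_minutes: float) -> float:
    """Return how many times faster the parallel run is than the serial run."""
    return serial_minutes / parallel_minutes


# Timings taken from the list above (all approximate).
kich_speedup = speedup(35, 7)                      # 5.0
epic_cancer_speedup = speedup(43 * 60, 2 * 60 + 15)  # about 19.1
```

The larger dataset gains more from parallel processing, though these are single runs and not a careful benchmark.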

Output format change
- Pickle to Parquet
  ( beta_values : 2.7G -> 809M )
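
A pickled DataFrame can be rewritten as Parquet with pandas alone. The following is a sketch under a few assumptions: the pickle holds a pandas DataFrame with string column names, a Parquet engine such as pyarrow is installed, and the function name is my own. It is not necessarily how the fork does it.

```python
from pathlib import Path
from typing import Optional, Union

import pandas as pd


def pickle_to_parquet(
    pickle_path: Union[str, Path],
    parquet_path: Optional[Union[str, Path]] = None,
    compression: str = "snappy",
) -> Path:
    """Read a pickled DataFrame and write it back out as a Parquet file."""
    pickle_path = Path(pickle_path)
    if parquet_path is None:
        # Default: same location and name, with a .parquet extension.
        parquet_path = pickle_path.with_suffix(".parquet")
    frame = pd.read_pickle(pickle_path)
    frame.to_parquet(parquet_path, compression=compression)
    return Path(parquet_path)


# Example call (path is hypothetical):
# pickle_to_parquet("beta_values.pkl")
```

The size reduction depends on the data and the compression codec, so the 2.7G to 809M figure above should not be expected for every dataset.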

 

 

After preprocessing

2022.01.20 - [Tools] - Methylcheck

 

 

 

https://en.wikipedia.org/wiki/DNA_methylation

 

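DNA-level factors that affect phenotype include sequence mutations and methylation.

Methylation here means that a methyl group (CH3-) is attached to a cytosine that is immediately followed by a guanine in the DNA sequence (a CpG site). An attached methyl group affects gene expression and, through it, the phenotype. This typically matters at promoters, the region where transcription (copying DNA's genetic information into mRNA) begins. Promoter methylation blocks the activity of the enzymes that drive transcription into mRNA, so the gene ends up not expressed or expressed at a low level.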
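
To illustrate what "a cytosine followed by a guanine" means in a sequence, here is a tiny sketch that locates CpG sites in a DNA string. The sequence is made up and the function is for illustration only; methylation arrays target predefined probe positions and do not scan sequences like this.

```python
from typing import List


def find_cpg_sites(sequence: str) -> List[int]:
    """Return 0-based positions of every cytosine directly followed by a guanine."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


# Hypothetical short sequence with three CpG sites.
sites = find_cpg_sites("ATCGGCGTACG")  # [2, 5, 9]
```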

 

 


 
