DESeq2 - Library normalization

$ conda install gxx_linux-64

DESeq2

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("tximport")
BiocManager::install("DESeq2")

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

https://sites.tufts.edu/biotools/files/2019/04/bioinformatics_for_rnaseq_day2.pdf

library(DESeq2)
library(vsn)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggrepel)
library(DEGreport)
library(pheatmap)
library(org.Sc.sgd.db)
library(clusterProfiler)

# count.tsv is composed of 'gene_id' and samples' IID columns
# sample_info.tsv has 'IID' and 'pheno' columns
setwd("~/rna")
data <- read.table("count.tsv", header=TRUE, check.names=FALSE, row.names="gene_id")
meta <- read.table("sample_info.tsv", header=TRUE, check.names=FALSE, row.name="IID")

dds <- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ pheno)
# Normalize count
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
# ---------------

dds <- DESeq(dds)

Behind the scenes these steps were run:
* estimating size factors
* gene-wise dispersion estimates
* mean-dispersion relationship
* final dispersion estimates
* fitting model and testing

res <- results(dds)
summary(res)
res <- res[order(res$padj),]

# Volcano plot --------
par(mfrow=c(1,1))
# Make a basic volcano plot
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-3,3)))

# Add colored points: blue if padj<0.01, red if log2FC>1 and padj<0.05)
with(subset(res, padj<.01 ), points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))
with(subset(res, padj<.01 & abs(log2FoldChange)>2), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))


# PCA ------------------
vsdata <- vst(dds, blind=FALSE)
plotPCA(vsdata, intgroup="dex")

Library normalization

Step 1. Take log_e(depth). log_e is default.

Step 2. Average each row.

: Averages calculated with logs are called "Geometric averages".

Step 3. Filter out genes with infinity.

Step 4. Subtract the average log value from the log(count).

Step 5. Calculate the median of the ratios for each sample.

Step 6. Convert the medians to "normali numbers" to get the final scaling factors for each sample.

: scaling factor = e^median

Step 7. Divide the original read counts by the scaling factors.

Examples

https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html

- Based on this relationship, the dispersion is higher for small mean counts and lower for large mean counts.

- the relationship between mean and variance is linear on the log scale, and for higher means, we could predict the variance relatively accurately given the mean.

- DESeq2 assumes that genes with similar expression levels have similar dispersion.

- given the count values of the replicates, the most likely estimate of dispersion is calculated.

- This shrinkage method is particularly important to reduce false positives in the differential expression analysis.