2DE_Heatmap

Author

Jaakko

Published

November 20, 2025

Quarto

This is a Quarto document. Quarto is a multi-language, next-generation version of R Markdown. Check the tutorial.
N.B. - Learn the IT foundations to master bioinformatics tools and softwares, and prevent getting tangled on the infamous “dependency hell”. Ask for the “Foundations of Bioinformatics Infrastructure”.

Create and activate the environment “bioinfo”

mamba create -n bioinfo python=3.11
mamba activate bioinfo

Download the RNA-seq dataset

Let’s use a real example - RNA-seq of ITPR3 and RELB knockout in SW480 under normoxia and hypoxia (Homo sapiens) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE197576.

N.B. - Structure properly your folder (‘mkdir folder’):

.
├── data/
├── scripts/
└── results/

How to download the files from FTP with UNIX terminal: https://www.ncbi.nlm.nih.gov/geo/info/download.html
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE197nnn/GSE197576/suppl/

# Move to /data folder !!!
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE197nnn/GSE197576/suppl/GSE197576_raw_gene_counts_matrix.tsv.gz

Alternative use GEOquery https://bioconductor.org/packages/release/bioc/html/GEOquery.html.

Check the data ad extract the samples of interest

Check the data in the compressed RNA-seq file.

zless -S GSE197576_raw_gene_counts_matrix.tsv.gz
# z because is compressed
# less is the faster command to view the conte of a file
# -S maintain one raw column
# press q to quit

Install csvtk annd check the data in a pretty aligned manner.

mamba install -c bioconda csvtk
csvtk pretty -t ../data/GSE197576_raw_gene_counts_matrix.tsv.gz | less -S
# -t tab the limit
# press q to quit
csvtk headers -t ../data/GSE197576_raw_gene_counts_matrix.tsv.gz 
gene
01_SW_sgCTRL_Norm
02_SW_sgCTRL_Norm
03_SW_sgITPR3_1_Norm
04_SW_sgITPR3_1_Norm
07_SW_sgRELB_3_Norm
08_SW_sgRELB_3_Norm
11_SW_sgCTRL_Hyp
12_SW_sgCTRL_Hyp
13_SW_sgITPR3_1_Hyp
14_SW_sgITPR3_1_Hyp
17_SW_sgRELB_3_Hyp
18_SW_sgRELB_3_Hyp

N.B - Get csvtk at https://github.com/shenwei356/csvtk

Extract the gene expression values from:

01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 17_SW_sgRELB_3_Hyp 18_SW_sgRELB_3_Hyp
csvtk cut -t -f1,2,3,12,13 ../data/GSE197576_raw_gene_counts_matrix.tsv.gz | column -t | head
gene                       01_SW_sgCTRL_Norm  02_SW_sgCTRL_Norm  17_SW_sgRELB_3_Hyp  18_SW_sgRELB_3_Hyp
DDX11L1                    0                  0                  0                   0
WASH7P                     18                 11                 30                  25
MIR6859-1                  5                  1                  8                   5
MIR1302-2HG                0                  0                  0                   0
MIR1302-2                  0                  0                  0                   0
FAM138A                    0                  0                  0                   0
OR4F5                      0                  0                  0                   0
LOC100996442               9                  3                  15                  20
LOC729737                  3                  3                  33                  45
csvtk cut -t -f1,1,2,12,13 ../data/GSE197576_raw_gene_counts_matrix.tsv.gz > raw.counts.tsv

Your /data folder now contains:

data/
├── GSE197576_raw_gene_counts_matrix.tsv.gz
└── raw.counts.tsv

Read the data into R and make a DESeq2 object

Follow the tutorial http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html.

Setup

Install the R-packages for this project within the ‘bioinfo’ virtual environment.

mamba install -c conda-forge r-tidyverse
mamba install -c conda-forge r-dplyr
mamba install -c conda-forge r-here
mamba install -c bioconda bioconductor-deseq2
mamba intall -c conda-forge r-ggplot2
mamba install -c conda-forge r-ggrepel
mamba install -c conda-forge r-BiocManager   

Save the environment for reproducibility.

mamba env export --no-builds > bioinfo_environment.yml

Open RStudio from the terminal: you are now working with the R software within ‘bioinfo’ and you can find the packages previously installed.

open -na RStudio

Let’s start

library(dplyr) #https://dplyr.tidyverse.org/
library(readr) #https://readr.tidyverse.org/
library(here) #https://cran.r-project.org/web/packages/here/vignettes/here.html
library(DESeq2) #https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8
library(ggplot2)
library(ggrepel)
BiocManager::install("ComplexHeatmap")
library(ComplexHeatmap)

raw_counts <- read_tsv(here("01_Heatmap_Genomics/2_DE_Heatmap/data/raw.counts.tsv")) #import the dataset

raw_counts_mat <- raw_counts[, -1] %>% as.matrix #create the values matrix removing the first column for DESeq2
head(raw_counts_mat)
     01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 17_SW_sgRELB_3_Hyp 18_SW_sgRELB_3_Hyp
[1,]                 0                 0                  0                  0
[2,]                18                11                 30                 25
[3,]                 5                 1                  8                  5
[4,]                 0                 0                  0                  0
[5,]                 0                 0                  0                  0
[6,]                 0                 0                  0                  0
rownames(raw_counts_mat) <- raw_counts$gene #add row names to the matrix
head(raw_counts_mat)
            01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 17_SW_sgRELB_3_Hyp
DDX11L1                     0                 0                  0
WASH7P                     18                11                 30
MIR6859-1                   5                 1                  8
MIR1302-2HG                 0                 0                  0
MIR1302-2                   0                 0                  0
FAM138A                     0                 0                  0
            18_SW_sgRELB_3_Hyp
DDX11L1                      0
WASH7P                      25
MIR6859-1                    5
MIR1302-2HG                  0
MIR1302-2                    0
FAM138A                      0

Make a dataframe for the metadata.

metaframe <- tibble(condition = c ("normoxia", "normoxia", "hypoxia", "hypoxia"))
rownames(metaframe) <- colnames(raw_counts_mat)
metaframe
# A tibble: 4 × 1
  condition
* <chr>    
1 normoxia 
2 normoxia 
3 hypoxia  
4 hypoxia  

Make a DESeq2 object (tutorial).

all(rownames(metaframe) == colnames(raw_counts_mat)) #check
[1] TRUE
dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat, #input object with design
                              colData = metaframe,
                              design = ~ condition)
dds <- DESeq(dds) #DE analysis

?results
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia")) #extract results from the analysis 

res %>% 
  as.data.frame() %>%
  arrange(padj <= 0.1, abs(log2FoldChange) >= 2) %>% #filter significant and substantial expression changes 
  head(n = 10)
               baseMean log2FoldChange     lfcSE       stat    pvalue      padj
WASH7P        19.868995     0.41727500 0.5998607  0.6956198 0.4866669 0.6105208
MIR6859-1      4.356427     0.65513668 1.3035290  0.5025870 0.6152547 0.7232638
LOC100996442  10.646734     1.06292492 0.8520192  1.2475363 0.2122009 0.3229198
WASH9P        50.042666     0.04451088 0.3788408  0.1174923 0.9064699 0.9408732
LINC01409     12.417663     0.80294905 0.7708893  1.0415880 0.2976028 0.4204900
FAM87B         2.852901    -0.33928196 1.5932534 -0.2129492 0.8313666 0.8900030
LINC00115     24.063112    -0.30645340 0.5400642 -0.5674389 0.5704160 0.6854688
LOC100288175 162.124769     0.20237040 0.2175051  0.9304167 0.3521554 0.4798503
LOC105378948  39.581280     0.61997436 0.4277432  1.4494080 0.1472237 0.2390383
RNF223         9.592991    -0.79713059 0.8827150 -0.9030442 0.3665025 0.4942234
significant_genes <- res %>% #identify the respective genes
  as.data.frame() %>%
  filter(padj <= 0.1, abs(log2FoldChange) >= 2) %>%
  rownames()

significant_genes[1:10]
 [1] "LOC729737"    "LINC02593"    "LOC107985728" "LOC107985376" "LINC02781"   
 [6] "LINC01714"    "C1orf167"     "NPPB"         "PADI2"        "PLA2G5"      

PCA Analysis to verify your results

vsd <- vst(dds, blind = F) #normalization applying a variance stabilizing transformation (VST)

p <- plotPCA(vsd, intgroup = c("condition"))
p + coord_cartesian(ylim = c(-0.7, 0.7))

You can make a PCA by yourself.

# vsd <- vst(dds, blind = F)
head(assay(vsd), 3)
          01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 17_SW_sgRELB_3_Hyp
DDX11L1            10.31070          10.31070           10.31070
WASH7P             10.48764          10.46633           10.51233
MIR6859-1          10.40400          10.35764           10.41488
          18_SW_sgRELB_3_Hyp
DDX11L1             10.31070
WASH7P              10.49477
MIR6859-1           10.39306
?assay

normalized_vsd <- assay(vsd) %>% #from DeSeq object to matrix
 as.matrix()

pca_prcomp <- prcomp(t(normalized_vsd), center = T, scale. = F) #PCA analysis

names(pca_prcomp)
[1] "sdev"     "rotation" "center"   "scale"    "x"       
pca_prcomp$x
                         PC1       PC2       PC3           PC4
01_SW_sgCTRL_Norm  -15.07327  2.234704 -1.602628  9.567716e-14
02_SW_sgCTRL_Norm  -14.90781 -2.238765  1.619684 -4.871971e-15
17_SW_sgRELB_3_Hyp  14.92560 -1.909982 -1.893329  1.475209e-14
18_SW_sgRELB_3_Hyp  15.05548  1.914043  1.876273  1.168549e-13
PC1_PC2 <- data.frame(
  PC1 = pca_prcomp$x[,1],
  PC2 = pca_prcomp$x[,2],
  type = rownames(pca_prcomp$x)
)

var_explained <- (pca_prcomp$sdev^2) / sum(pca_prcomp$sdev^2)
pc1_percent <- round(var_explained[1] * 100, 1)
pc2_percent <- round(var_explained[2] * 100, 1)

ggplot(PC1_PC2, aes(x = PC1, y = PC2, col = type)) +
  geom_point(size = 4) +
  geom_text_repel(
    aes(label = type),
    box.padding = 0.5,
    point.padding = 0.3,
    segment.size = 0.3,
    min.segment.length = 0
  ) +
  
  labs(
    x = paste0("PC1 (", pc1_percent, "% variance)"),
    y = paste0("PC2 (", pc2_percent, "% variance)")
  ) +
  
  coord_cartesian(clip = "off") + 
  theme_bw(base_size = 14) +
  theme(
    plot.margin = margin(10, 5, 10, 5),  
    legend.position = "right"
  )

It is not exactly the same, what’s going on?

?plotPCA #using the top 500 most variable genes

Make a perfect heatmap

significant_mat <- normalized_vsd[significant_genes, ]
dim(significant_mat)
[1] 967   4
Heatmap(t(scale(t(significant_mat))))

You get this perfect looking heatmap because you select the genes that are different. So, no surprise at all!

metaframe
# A tibble: 4 × 1
  condition
* <chr>    
1 normoxia 
2 normoxia 
3 hypoxia  
4 hypoxia  
meta_annot <- HeatmapAnnotation(df = metaframe,
                                col = list(condition = c("hypoxia" = "red", "normoxia" = "blue")))

Heatmap(t(scale(t(significant_mat))),
        top_annotation = meta_annot,
        show_row_names = F,
        name = "scaled normalized\nexpression"
        )

why scaling is important?

Heatmap(significant_mat, 
        top_annotation = meta_annot,
        show_row_names = FALSE,
        name = "scaled normalized\nexpression"
        )