Corey
RNA-Seq is an approach to transcriptome profiling that uses deep-sequencing technologies. Raw RNA-Seq data should be normalized before analysis. Normalization aims to remove bias from sequencing depth and gene length, and three gene expression units are commonly used for it: RPKM (Reads Per Kilobase of transcript per Million mapped reads), FPKM (Fragments Per Kilobase of transcript per Million mapped fragments), and TPM (Transcripts Per Million) [1]. This article explains these units.

In short: when comparing different samples, use TPM. Otherwise, use RPKM for single-end RNA-Seq and FPKM for paired-end RNA-Seq.

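RPKM of a gene:

$$\mathrm{RPKM} = \frac{\text{reads mapped to the gene} \times 10^{9}}{\text{total mapped reads} \times \text{gene length (bp)}}$$

FPKM of a gene:

$$\mathrm{FPKM} = \frac{\text{fragments mapped to the gene} \times 10^{9}}{\text{total mapped fragments} \times \text{gene length (bp)}}$$

TPM of a gene:

$$\mathrm{TPM} = \frac{\text{reads mapped to the gene} \,/\, \text{gene length (kb)}}{\sum_{\text{all genes}} \left( \text{reads mapped to the gene} \,/\, \text{gene length (kb)} \right)} \times 10^{6}$$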
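The standard definitions of these units can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an established quantification tool. The function names, the read counts, and the gene lengths are all hypothetical and chosen only to keep the arithmetic easy to follow. The only difference between the two functions is the order of the two normalization steps.

```python
def rpkm(counts, lengths_bp):
    """Normalize for sequencing depth first, then for gene length."""
    per_million = sum(counts) / 1e6              # depth scaling factor
    rpm = [c / per_million for c in counts]      # reads per million
    return [r / (length / 1e3) for r, length in zip(rpm, lengths_bp)]


def tpm(counts, lengths_bp):
    """Normalize for gene length first, then for sequencing depth."""
    rpk = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    per_million = sum(rpk) / 1e6                 # scaling factor on length-normalized counts
    return [r / per_million for r in rpk]


# FPKM uses the same arithmetic as RPKM, applied to fragment counts
# (one count per read pair) instead of read counts.
fpkm = rpkm

# Hypothetical toy sample with three genes.
counts = [10, 20, 5]
lengths_bp = [2000, 4000, 1000]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print("RPKM:", rpkm_values)
print("TPM: ", tpm_values)
print("Sum of TPM:", sum(tpm_values))
```

Because TPM divides by the sum of the length-normalized counts, the TPM values of a sample always add up to one million, whatever the input counts are. The RPKM values have no such fixed total.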

[Figure: schematic example with three genes, showing RPKM and TPM values in each sample]

The figure above gives an example. If the RPKM for gene X is 2.0 in sample 1 and also 2.0 in sample 2, I still would not know whether the same proportion of reads mapped to gene X in both samples. The reason is that the denominator required to calculate the proportion (the sum of the RPKM values across all genes) can differ between the two samples.
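A small numeric sketch makes this concrete. The two samples below are hypothetical (they are not the values from the figure) and are constructed so that gene X gets the same RPKM in both, even though its share of the expression differs.

```python
import math


def rpkm(counts, lengths_bp):
    # Reads per kilobase of gene per million mapped reads.
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    # Length-normalized counts rescaled so that they sum to one million.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Hypothetical genes X, Y, Z and two hypothetical samples.
lengths_bp = [2000, 4000, 1000]
sample_1 = [10, 20, 5]
sample_2 = [20, 10, 40]

rpkm_1, rpkm_2 = rpkm(sample_1, lengths_bp), rpkm(sample_2, lengths_bp)
tpm_1, tpm_2 = tpm(sample_1, lengths_bp), tpm(sample_2, lengths_bp)

# Gene X has the same RPKM in both samples...
print("RPKM of X:", rpkm_1[0], rpkm_2[0])
# ...but the RPKM totals differ, so equal RPKM does not mean equal share.
print("Sum of RPKM:", sum(rpkm_1), sum(rpkm_2))
# TPM totals are fixed at one million, so the values are directly comparable.
print("TPM of X:", tpm_1[0], tpm_2[0])
```

In this toy data, gene X accounts for about one third of the length-normalized reads in the first sample and about 19% in the second. The RPKM values hide that difference, while the TPM values show it.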

TPM is suitable for this situation. If the TPM for gene X is 268141 in sample 2 and 268907 in sample 3, then I know that almost the same proportion of total reads mapped to gene X in both samples, because TPM values sum to one million in every sample. In a real-world situation, the TPM values will be much smaller than those shown, because a sample generally has over 20,000 genes, not just the three in the schematic figure. In addition, one comparison found TPM to be the best-performing normalization method, judged by how well it preserved biological signal relative to the other methods tested [2].
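The effect of the gene count on the size of TPM values follows from the fixed total of one million. As a rough sketch (the helper name `mean_tpm` is hypothetical), the average TPM per gene is simply one million divided by the number of genes:

```python
def mean_tpm(n_genes):
    # TPM values in a sample sum to one million, so the mean is 1e6 / n_genes.
    return 1_000_000 / n_genes


for n_genes in (3, 20_000):
    print(n_genes, "genes -> mean TPM", mean_tpm(n_genes))
```

With three genes the average is about 333,333, which is why the figure shows values in the hundreds of thousands. With 20,000 genes the average drops to 50.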

In summary, if you want to choose a normalization method, I would recommend TPM.