Bioinformatics: Gene Expression Analysis
Chen Xiaowei, chenxiaowei@, Institute of Biophysics, Chinese Academy of Sciences, 2014.10.08
Gene Expression Analysis
• Background
• Experimental techniques used to measure gene expression
– SAGE
– DNA microarray
– RNA-Seq
• Long non-coding RNA microarray
• Gene expression data analysis
– Experiment design
– Microarray data analysis procedure
lncRNA microarray
Remove redundant lncRNA sequences
• Sources: GENCODE, RefSeq, UCSC, H-InvDB, ……
• Redundancy criteria: Xref, sequence similarity, genome loci
• Result: 37,491 lncRNAs (V3)
• One specific probe for each lncRNA or its isoform
Detected
Gene expression data analysis
• Microarray data analysis procedure
Intensity → Expression profile → Quality control → Normalization → Differential gene expression analysis
• Normalization goal: make multiple arrays comparable
• Sources of variation between multiple high-density oligonucleotide arrays:
– Biological
  • Disease vs. Control
– Non-biological
  • Total RNA preparation, amplification
  • Sample labeling differences
  • Hybridization
  • Scanner differences
  • Image analysis
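The slides do not name a normalization method. As one illustration of making multiple arrays comparable, the following is a minimal sketch of quantile normalization, a commonly used approach for oligonucleotide arrays. The function name, the toy intensities, and the simple handling of ties are assumptions made for this example.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of probe intensities per array).

    Assumption: all arrays contain the same probes in the same order.
    Ties are broken by probe order rather than averaged.
    """
    n_probes = len(arrays[0])
    n_arrays = len(arrays)

    # Reference distribution: mean of the k-th smallest intensity across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / n_arrays for k in range(n_probes)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Replace each intensity by the reference value at the same rank.
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three arrays, four probes each (hypothetical intensities).
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(raw)
```

After this step every array has the same intensity distribution, while the rank order of probes within each array is unchanged.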
• Blocking
• The process of identifying or building groups of experimental units (EU) that are expected to have similar responses in the absence of any treatment effects
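As a hedged illustration of blocking, the sketch below builds a randomized complete block design: each block (here an RNA extraction day) receives every treatment exactly once, in a random order. The function name and the fixed seed are assumptions for this example; the labels Day1 to Day3, Control, T1 and T2 follow the slide's figure.

```python
import random


def randomized_block_design(blocks, treatments, seed=0):
    """Return {block: treatments in random processing order}.

    Every treatment appears once per block, so block effects
    (e.g. extraction day) are not confounded with treatment effects.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    design = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)
        design[block] = order
    return design


design = randomized_block_design(
    blocks=["Day1", "Day2", "Day3"],
    treatments=["Control", "T1", "T2"],
)
for block, order in design.items():
    print(block, order)
```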
Gene expression data analysis
Experiment design
• “To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.” (R. A. Fisher)
• RNA-seq (Illumina)
Long non-coding RNA microarray
lncRNA microarray
Systematic identification of lncRNAs
• High-throughput sequencing
– ChIP-Seq
– CAGE-seq
– 3P-seq
– RNA-Seq
Transcription
Experimental techniques used to measure gene expression
Experimental techniques
• SAGE (Serial Analysis of Gene Expression)
– Victor Velculescu
– 1995, Johns Hopkins University
• Normalization
• Hypothesis testing
• Multiple hypothesis testing
• False positive control
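The slides list multiple hypothesis testing and false positive control without naming a procedure. A minimal sketch of one widely used option, the Benjamini-Hochberg false discovery rate adjustment, is shown here. The function name, the example p-values, and the 0.05 cutoff are assumptions for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Probe indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
```

With these example values only the first two genes pass the 0.05 cutoff after adjustment, although four of the five raw p-values are below 0.05.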
Background
• Human Genome
– Publication of Initial Working Draft Sequence [February 12, 2001]
• ENCODE (Encyclopedia of DNA Elements)
– 74.7% of human genome covered by primary transcripts
– 62.1% of human genome covered by processed transcripts
– 2.94% of human genome covered by exons of protein-coding genes
Rinn and Chang, 2012
lncRNA microarray
LncRNA datasets
Dataset                                  lncRNA count
NONCODE                                  95,135
GENCODE                                  26,414
Human lincRNA Catalog                    14,353
lncRNAdb                                 118
RefSeq                                   4,814
UCSC Genes                               5,596
H-InvDB                                  1,038
lncRNAs from HOX loci                    962
lncRNAs from ultraconserved regions      407
ProbeName    Control      Tumor
RNA53314     3610.6355    7735.4663
RNA53313     330.27353    230.98158
RNA53312     2991.578     3540.922
RNA53311     46.673733    19.396254
RNA53310     58.98197     16.519632
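To show how such an expression profile feeds into differential expression analysis, here is a minimal sketch that computes the log2 fold change (Tumor over Control) for the five probes. It assumes the probe names and values pair up in the order listed. The two-fold threshold (|log2 FC| ≥ 1) is an assumption for this example, and a single pair of samples supports no statistical test.

```python
import math

# (Control, Tumor) intensities, paired with probe names in listed order (assumption).
profile = {
    "RNA53314": (3610.6355, 7735.4663),
    "RNA53313": (330.27353, 230.98158),
    "RNA53312": (2991.578, 3540.922),
    "RNA53311": (46.673733, 19.396254),
    "RNA53310": (58.98197, 16.519632),
}

LOG2_FC_THRESHOLD = 1.0  # hypothetical cutoff: at least two-fold change


def classify(control, tumor, threshold=LOG2_FC_THRESHOLD):
    """Return (log2 fold change, call) for one probe."""
    log2_fc = math.log2(tumor / control)
    if log2_fc >= threshold:
        call = "up"
    elif log2_fc <= -threshold:
        call = "down"
    else:
        call = "unchanged"
    return log2_fc, call


calls = {}
for probe, (control, tumor) in profile.items():
    log2_fc, call = classify(control, tumor)
    calls[probe] = call
    print(f"{probe}\t{log2_fc:+.3f}\t{call}")
```

In practice fold changes are combined with replicate-based hypothesis tests rather than used alone.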
lncRNA microarray
Data sources of lncRNA microarray
Sources                     V3
GENCODE/ENSEMBL             22444
Human LincRNA Catalog       14353
RefSeq                      4814
UCSC                        5596
NRED                        13701
H-InvDB                     1038
Enhancer-like lncRNA        3019
RNAdb                       1599
Antisense ncRNA pipeline    1053
UCRs                        962
CombinedLit                 529
Hox ncRNAs                  407
snoRNA                      389
lncRNAdb                    104
ncRNAs from Chen lab        848
Total                       70856
Unique lncRNAs              37,491
V1 source counts (in slide order): 4765 13521 1289 17203 2975 1053 481 529 389 78 848; Total 42283; Unique lncRNAs 30,622
V2 source counts (in slide order): 12754 8195 4765 13521 1289 17203 2975 1053 481 529 407 389 78 848; Total 63639; Unique lncRNAs 35,024
• Experimental design principle
• Replication
  • Biological replicates: Sample1, Sample2, Sample3 → Microarray1, Microarray2, Microarray3
  • Technical replicates: Sample1 → Microarray1, Microarray2, Microarray3
[Figure panels: Not randomized / Randomized]
Gene expression data analysis
• Experimental design principle
• Blocking
[Figure: two layouts of Control, T1 and T2 across Exp.1, Exp.2 and Exp.3, with RNA extracts prepared on Day1, Day2 and Day3]
• Experimental design principle
• Replication
• The process of applying each treatment to more than one experimental unit (EU)
• Randomization
• Randomly allocating treatments to EU, to ensure fair assessment of the treatments
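Replication and randomization can be combined in a few lines. The sketch below randomly allocates treatments to experimental units so that each treatment gets the same number of replicates. The function name, the six sample names, the Control/Tumor labels, and the fixed seed are assumptions for this example.

```python
import random


def randomize_treatments(units, treatments, seed=0):
    """Randomly allocate treatments to experimental units with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    replicates = len(units) // len(treatments)
    # Each treatment appears `replicates` times, then labels are shuffled.
    labels = [t for t in treatments for _ in range(replicates)]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(labels)
    return dict(zip(units, labels))


units = [f"Sample{i}" for i in range(1, 7)]
allocation = randomize_treatments(units, ["Control", "Tumor"])
for unit, treatment in allocation.items():
    print(unit, treatment)
```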
Gene expression data analysis
• Microarray data analysis procedure
Intensity → Expression profile → Quality control → Normalization → Differential gene expression analysis
Gene expression data analysis
• Experimental design principle
• Randomization
• Each gene is spotted in quadruplicate
lncRNA microarray
LncRNA classification
[Pie chart: lncRNA classes including Intergenic, Divergent and Intronic; slice values 19%, 4%, 11%, 8%, 58%]