Update on Gene Expression Analysis, Proteomics, and Network Discovery

Gene Expression Analysis, Proteomics, and Network Discovery1

Sacha Baginsky, Lars Hennig, Philip Zimmermann, and Wilhelm Gruissem*

Department of Biology and Zurich-Basel Plant Science Center, ETH Zurich, Universitätstrasse 2, 8092 Zurich, Switzerland

Technological advances in biological experimentation are now enabling researchers to investigate living systems on an unprecedented scale by studying genomes, proteomes, or molecular networks in their entirety. Genomics technologies have led to a paradigm shift in biological experimentation because they measure (profile) most or even all components of one class (e.g. transcripts, proteins, etc.) in a highly parallel way. Whether gene expression analysis using microarrays, proteome and metabolome analysis using mass spectrometry, or large-scale screens for genetic interactions, high-throughput profiling technologies provide a rich source of quantitative biological information that allows researchers to move beyond a reductionist approach by both integrating and understanding interactions between multiple components in cells and organisms (Fig. 1; for a recent update of bioinformatics tools, see Pitzschke and Hirt, 2010). Currently, most genomics experiments involve profiling transcripts, proteins, or metabolites. Increasing efforts to complement molecular data with phenotypic information will further advance our understanding of the quantitative relationships between molecules in directing systems behavior and function.
In the following Update we will briefly review recent advances in the field and highlight advantages and limitations of current approaches to develop models of genetic and molecular networks that aim to describe emergent properties of plant systems.

GENOMICS TECHNOLOGIES: THE POWER OF GENOME-SCALE QUANTITATIVE DATA RESOLUTION

PROFILING TRANSCRIPTOMES

Transcript profiling offers the largest coverage and a wide dynamic range of gene expression information and can often be performed genome wide. Microarrays are currently the most popular platform for transcript profiling and are affordable for many laboratories. Various commercial and academic microarray platforms exist that vary in genome coverage, availability, specificity, and sensitivity (Table I). Microarrays manufactured by Affymetrix are probably the most commonly used in plant biology (Redman et al., 2004; Rehrauer et al., 2010), but commercial arrays from Agilent or arrays from the academic Complete Arabidopsis Transcriptome MicroArray (CATMA) consortium are often used as well (for review, see Busch and Lohmann, 2007). Serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS) are well-established alternatives to microarrays. Both techniques can be superior to microarrays because they do not depend on prior probe selection. More recently, direct sequencing of transcripts by high-throughput sequencing technologies (RNA-Seq) has become an additional alternative to microarrays and is superseding SAGE and MPSS (Busch and Lohmann, 2007). Like SAGE and MPSS, RNA-Seq does not depend on genome annotation for prior probe selection and avoids biases introduced during hybridization of microarrays. On the other hand, RNA-Seq poses novel algorithmic and logistic challenges, and current wet-lab RNA-Seq strategies require lengthy library preparation procedures.
Therefore, RNA-Seq is the method of choice in projects using nonmodel organisms and for transcript discovery and genome annotation. Because of their robust sample processing and analysis pipelines, microarrays are often still the preferable choice for projects that profile transcripts in large numbers of samples from model organisms with well-annotated genomes. Tools such as Genevestigator (Hruz et al., 2008) and MapMan (Usadel et al., 2009) allow researchers to organize large gene expression datasets and analyze them for relational networks within a single experiment or across many experiments (contextual meta-analysis).

1 This work was supported by the European Union (EU Framework Program 6, AGRON-OMICS; grant no. LSHG–CT–2006–037704), the Swiss National Science Foundation, CTI (Swiss Innovation Promotion Agency), ETH Zurich, and the Functional Genomics Center Zurich for our profiling experiments.
* Corresponding author; e-mail wgruissem@ethz.ch.
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors is: Wilhelm Gruissem (wgruissem@ethz.ch).
/cgi/doi/10.1104/pp.109.150433

PROFILING EPIGENOMES AND TRANSCRIPTION FACTOR BINDING

Much control of gene expression occurs at the level of transcription, and information on genome-wide chromatin profiles (epigenomes) and transcription factor binding to promoters is needed to decipher the inherent logic of transcriptional regulation. Chromatin immunoprecipitation (ChIP) coupled to microarray analysis (ChIP-chip) or high-throughput sequencing (ChIP-Seq) can generate such data. In plants, DNA methylation, repressive and activating chromatin marks, as well as histone variants have been mapped onto the genome (for review, see Zhang, 2008), but because such marks are expected to differ between cell types and developmental stages, more targeted epigenome profiling is needed in the future. Targeted analysis of DNA methylation during seed
development, for instance, revealed unexpected genome-wide demethylation (Gehring et al., 2009; Hsieh et al., 2009). ChIP-chip was also used for global mapping of binding sites of transcription factors such as TGA2 and SEPALLATA3 and to refine definitions of binding motifs that were previously determined by in vitro experiments (Thibaud-Nissen et al., 2006; Kaufmann et al., 2009). It was found that SEPALLATA3 is a key component in the regulatory transcriptional network underlying the formation of floral organs. In a comparative experiment, ChIP-chip and ChIP-Seq gave very similar results (Kaufmann et al., 2009). This is encouraging because bias introduced by the profiling technology does not seem to severely confound studies on global protein-binding profiles. Currently, several laboratories are working to establish a compendium of transcription factor binding sites in Arabidopsis (Arabidopsis thaliana). Thus, more genome-wide datasets are within reach that could provide causal explanations for transcriptional profiles.

PROFILING PROTEOMES

Gene expression is a highly regulated, multistep process, and it is impossible to predict the exact protein concentration or activity from the measurement of mRNA levels. Proteomics has therefore become a key tool in systems biology because it provides quantitative and structural information about proteins, which are the major functional determinants of cells. Phenotypic alterations associated with genetic perturbations often result from changes in protein accumulation or stability, or changes in protein posttranslational modifications, which can disrupt protein-protein interactions and network connectivity (Gstaiger and Aebersold, 2009). Quantitative protein information complements data from transcriptional profiling and metabolomics. It represents a key link between different levels of gene expression regulation and provides insights into their causal relationships. Unlike transcriptional profiling, however, comprehensive proteome analysis remains
challenging, and information about proteome complexity and dynamics is far from complete (Cox and Mann, 2007). Moreover, the rate of metabolite synthesis is often controlled by regulatory posttranslational modifications of enzymes and not only by their abundance. Information about quantitative relationships between RNA and protein accumulation, posttranslational protein modifications, and metabolite levels is therefore required to fully understand regulatory circuits that control systems behavior and function.

Protein quantification can be absolute or relative (Table I). While relative protein quantification mostly depends on stable isotopes, absolute quantification of comprehensive protein sets is much more difficult. Recent improvements in statistical data evaluation and increasing accuracy of mass spectrometry instruments allow quantifying large numbers of proteins in shotgun-type experiments on the basis of spectral counting (Lu et al., 2007). This method is reliable and comparable to most other quantification methods, including two-dimensional PAGE-based protein staining; however, the protein dataset must be very large. More accurate information about the exact in vivo concentration of individual proteins requires specialized targeted approaches.

Figure 1. Relationships between supracellular components (biological systems), intracellular components, and the function and behavior of these components are revealed by the interaction of individual components. Systems biological approaches aim at modeling these interactions to find primary relationships and to distinguish cause and effect. The understanding of how these interactions are regulated allows making predictions on function, behavior, and survival.

Current methods for absolute protein quantification include isotope dilution strategies using isotopically labeled peptides as internal standards (for a comprehensive review, see Brun et al., 2009). Signature peptides for internal
standardization are characteristic for a protein of interest and are often referred to as proteotypic peptides (PTPs). In AQUA, PTPs are added to analytical protein samples in known concentrations. The protein samples are subsequently scanned for PTPs of interest. Using the extracted ion chromatograms, the native peptide can then be quantified relative to the added PTP (Kuster et al., 2005). A modification of this strategy accounts for quantification errors derived from incomplete tryptic digest of the analytical sample. In QconCAT (for quantification concatamer), a synthetic protein with concatenated, isotopically labeled PTPs is expressed as a recombinant protein in a biological system, added to the sample prior to trypsin treatment, and carried through the digestion procedure, such that losses from incomplete tryptic digestion will also affect the quantity of the PTPs. Both the AQUA and the QconCAT strategies are incompatible with upstream fractionation techniques, which is a potential problem in biomarker quantification. A way around this constraint is offered by the protein standard absolute quantification strategy, which uses isotopically labeled protein standards that are added to the sample prior to fractionation.
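
The isotope dilution principle behind AQUA and QconCAT reduces to a ratio calculation, which the following Python sketch illustrates. It is a minimal illustration and not a published workflow: the function name, the peak areas, the spiked amount, and the digestion yield are hypothetical, and the sketch assumes that the native and the labeled peptide respond equally in the mass spectrometer, so that their extracted ion chromatogram (XIC) peak areas are directly comparable.

```python
def quantify_native_peptide(native_area, standard_area, standard_fmol):
    """Estimate the native peptide amount (fmol) from XIC peak areas.

    native_area   -- XIC peak area of the unlabeled (native) peptide
    standard_area -- XIC peak area of the isotopically labeled PTP
    standard_fmol -- known amount of labeled PTP spiked into the sample
    """
    if standard_area <= 0:
        raise ValueError("standard peak area must be positive")
    return native_area / standard_area * standard_fmol


# Hypothetical XIC peak areas (arbitrary units) and spiked amount (fmol).
native_fmol = quantify_native_peptide(3.0e6, 1.5e6, 50.0)

# QconCAT rationale: if incomplete digestion lowers both signals by the same
# (hypothetical) factor, the ratio and therefore the estimate stay unchanged.
digestion_yield = 0.5
native_fmol_lossy = quantify_native_peptide(
    3.0e6 * digestion_yield, 1.5e6 * digestion_yield, 50.0
)

print(native_fmol, native_fmol_lossy)
```

In an AQUA experiment the standard is a peptide added after digestion, so a digestion loss would affect only the native signal and bias the estimate; carrying the standard through the digest, as in QconCAT, is what makes the loss cancel in the ratio.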
Several prediction tools exist that help to define the most suitable PTPs for the detection and quantification of specific proteins. However, only experimental data provide the necessary reliability for PTP selection because in practice PTP prediction often deviates from experimental observations. Therefore, efforts are under way to catalogue PTPs for model organism proteomes. Proteome maps for Arabidopsis generated PTPs for 4,105 proteins, many of which may be optimal for the detection of proteins in different organs (Baerenfaller et al., 2008).

Similar quantitative approaches are also used for metabolites, because in addition to RNA and protein levels, understanding the function and behavior of metabolic networks requires global information about metabolite concentrations and fluxes as well. In recent years, much progress has been made in metabolic profiling, and the interested reader is referred to recent reviews (e.g. Issaq et al., 2009, and refs. therein).

Table I. Advantages and disadvantages of various technologies for the measurement of transcript and protein abundance
A systematic performance assessment for the different protein quantification techniques was recently conducted (Turck et al., 2007), and a detailed description of the different quantification techniques along with examples for application in the plant field is available (Baginsky, 2009).

Technologies | Advantages | Disadvantages
Transcripts
MPSS | Sequences do not need to be known in advance | Relatively expensive, laborious
Microarrays | Genome wide, relatively cheap, streamlined handling, oligos | Sequences must be known in advance; limited sensitivity due to hybridization
Quantitative reverse transcription-PCR | High precision and high sensitivity; increasingly multiplexed | Not genome wide; data normalization sensitive to method/choice of reference genes
High-throughput sequencing | Sequences do not need to be known in advance; possibility to sequence very short sequences | Expensive at the moment, few solutions for downstream analysis; direct read out
Proteins
Relative quantification via iTRAQ | Established labeling protocol with stable isotopes, good reproducibility, relevant regulation factor can be determined from the data, multiplexing to up to eight samples, produces good quality tandem mass spectrometry spectra | Cost and effort, the analysis software is still not optimal, fluctuations between different software packages possible
Relative quantification via stable isotope labeling with amino acids in cell culture | Established protocol for the labeling of cell culture proteins, reliable quantification possible | Restricted to cell culture
Relative quantification via extracted ion chromatograms | Comes at no additional costs, software tools for alignment and normalization are available (e.g. SuperHirn; Mueller et al., 2007) | Only applicable to very similar samples and very similar liquid chromatography-mass spectrometry runs, done within a small time window, baseline normalization is sometimes a problem
Absolute quantification via AQUA peptides | Highly sensitive absolute quantification on the basis of isotopically labeled PTPs, targeted analyses possible via specific scan methods (e.g. SRM) | Finding suitable PTPs and characteristic parent to daughter ion transitions not straightforward, selectivity of the PTP transitions not always unambiguous
Absolute quantification via QconCAT | Excellent for the quantification of protein complex stoichiometry, lower cost compared to AQUA, PTPs are synthesized in a biological system | Unsuitable for the quantification of posttranslational modifications, optimization necessary, exact quantification of the standard is vital, incompatible with sample fractionation
Absolute quantification via protein standard absolute quantification | Excellent for the quantification of individual, low abundance proteins, compatible with fractionation | Restricted to few proteins, up scaling difficult, quantification of posttranslational modifications not possible
Absolute quantification via normalized spectral counting (APEX; Lu et al., 2007) | No additional costs, produces reliable results with large-scale datasets | Quantification of individual proteins must be validated by additional tools, unreliable for small datasets

TRANSCRIPTS AND MORE TRANSCRIPTS: WHAT CAN WE LEARN FROM GENE EXPRESSION ANALYSIS?

During the analysis of large gene expression datasets the researcher is often confronted with several questions. How do we interpret a mathematical relationship between genes or between genes and conditions? For example, does a high correlation between two genes mean that they are coregulated, or could one of them be the positive regulator of the other? Or can we assume that they are involved in the same pathway or biological process? Although it is not possible to answer these questions conclusively from gene expression data alone, a number of parallel approaches can be useful to distinguish between different scenarios. For example, Gene Ontology enrichment analysis can provide confidence that a given gene cluster is enriched in genes that are known to have a common function, cellular location, or biological process. Similarly, conserved cis-regulatory elements in the promoters of genes from the same cluster indicate that they are likely coregulated. Although these methods do not establish proof of the nature of the relationship between genes, they allow formulating hypotheses that can be tested in the laboratory. In summary, although gene expression analysis by itself is rather descriptive (i.e. describing how genes respond to various test conditions or tissues), it is a valuable validation tool and an excellent starting point to study novel cellular processes and to formulate novel hypotheses.

A major challenge of genome-scale transcription analysis is the very large number of predictors (genes) compared to a generally small number of measurements (microarrays). Without appropriate statistical measures to correct for multiple testing, including control of false discovery rates, almost any approach will yield significant genes, including many false positives. The
creation of large databases in recent years has added a further layer of complexity and calls for additional precautions (see Table II). For example, large databases such as Genevestigator (Hruz et al., 2008) not only profile a large number of genes, but also allow contextual meta-analysis of several hundred conditions, each of which is covered by only a small number of replicates (usually 3–5). While some genes will respond to a small number of conditions and therefore their expression is easier to contextualize and interpret, other genes will respond to dozens or hundreds of conditions. It is often very difficult to distinguish primary effects from secondary effects, because the intensity of the effect does not necessarily relate to the direct involvement of the corresponding condition in regulating a specific target gene. Breaking down these effects into local patterns (e.g. by using a biclustering algorithm; Prelic et al., 2006) helps to identify conditions that are more directly linked to the gene of interest.

Table II. Overview of some of the most popular plant gene expression microarray platforms and the number of available experiments in ArrayExpress
The Arabidopsis ATH1 array is the most frequently used microarray, followed by the CATMA 25k and 23k arrays. In all, approximately 750 Arabidopsis microarray experiments have been published so far. Rice (Oryza sativa) and barley (Hordeum vulgare) are the second and third plant species in terms of microarray experiments published. Soybean (Glycine max) also has a high number of arrays, but this is due to a single very large experiment containing 2,521 arrays. IPK, Leibniz Institute of Plant Genetics and Crop Plant Research; TIGR, The Institute for Genomic Research.

Species | Provider | Array Format | Array Name | Experiments | Arrays
Arabidopsis | Affymetrix | 8K | AG | 41 | 352
Arabidopsis | Affymetrix | 22K | ATH1 | 554 | 8,895
Arabidopsis | Agilent | 22K | Arabidopsis 2 | 34 | 253
Arabidopsis | Agilent | 44K | Arabidopsis 3 | 7 | 60
Arabidopsis | CATMA | 25K | CATMA2_URGV to CATMA2.3_URGV | 83 | 851
Arabidopsis | CATMA | 23K | CATMA Arabidopsis 23K array | 50 | 1,290
Arabidopsis | TIGR | 26K | TIGR Arabidopsis whole genome | 6 | 264
Rice | Affymetrix | 57K | GeneChip Rice Genome Array | 29 | 418
Rice | Agilent | 21K | Agilent Rice Oligo Microarray | 22 | 164
Barley | Affymetrix | 22K | GeneChip Barley Genome Array | 35 | 1,165
Barley | IPK | 6K+4K | IPK barley PGRC1_A and B | 7 | 324
Medicago | Affymetrix | 61K | GeneChip Medicago Genome Array | 19 | 218
Maize | Affymetrix | 17K | GeneChip Maize Genome Array | 22 | 370
Soybean | Affymetrix | 61K | GeneChip Soybean Genome Array | 22 | 3,236
Tomato (Solanum lycopersicum) | Affymetrix | 10K | GeneChip Tomato Genome Array | 6 | 127
Grape (Vitis vinifera) | Affymetrix | 16K | GeneChip Vitis vinifera Genome Array | 6 | 239
Wheat (Triticum aestivum) | Affymetrix | 61K | GeneChip Wheat Genome Array | 25 | 811
Total | | | | 968 | 19,037

APPROACHING THE TARGET: FROM ORGANS TO TISSUES AND CELLS

Most transcript and protein profiling experiments analyze mixtures of tissues containing different cell types and organelles. This approach reveals certain global patterns, but quantitative analysis and modeling are limited with such complex data. Therefore, methods for organ-specific (or better, cell-type-specific) transcript and protein profiling as well as for organelle-specific proteomics are needed. Four types of approaches are now commonly used to sample RNA and/or proteins from selected cell types: (1) micropipetting, (2) laser capture microdissection (LCM), (3) protoplasting and sorting, and (4) polysome immunopurification (for review, see Zanetti et al., 2005; Hennig, 2007; Nelson et al., 2008). Micropipetting using microcapillaries directly extracts the contents from selected cells. It has been successfully applied to various leaf cell types and to phloem, but extraction is more difficult from internal cells. LCM involves sectioning of frozen or embedded tissue and subsequent dissection of the region of interest using laser excision. Applications of LCM include studies of vascular tissue, epidermis, and pericycle in maize (Zea mays) and seed development in Arabidopsis. Micropipetting and LCM are usually very labor intensive and difficult to apply to small cells such as those in meristems. Because of the limited amount of
material that can be captured, they work well for transcript profiling, which can use amplification steps, but provide only a very small coverage of the proteome. As an alternative, protoplasting and cell sorting offers rapid and accurate isolation of RNA from small cells. Specific tissues or cell types that are labeled by expression of GFP are isolated by protoplasting and sorted through a fluorescence-activated cell sorter. Millions of cells can be processed within 1 to 2 h, but care has to be taken to exclude changes in gene expression profiles caused by sample processing. This technique was successfully applied to measure genome-wide expression profiles in more than 15 root regions, establishing a compendium of digital in situ data (Birnbaum et al., 2003; Cartwright et al., 2009). It will be interesting to test whether this approach can also be used for protein profiling. Polysome immunopurification is based on the tissue-specific expression of the FLAG-tagged ribosomal protein L18 in transgenic plants (Zanetti et al., 2005). In contrast to micropipetting, LCM, and sorting of protoplasts, which all can be used to isolate total cellular RNA, polysome immunopurification can be used to isolate transcripts that are associated with ribosomes (translatome). Discrepancies between total RNA levels and representation in the translatome can reveal regulation at the level of translation (Mustroph et al., 2009). In the future, translatome datasets, which bridge transcriptomics and proteomics, can help to interpret unusual transcript-to-protein ratios (see below).

Alternatively, it is possible to identify cell-type-specific transcripts and proteins by comparing wild-type plants with mutants that lack specific cells or tissue types. In Arabidopsis, for instance, a series of homeotic mutants that lack various floral organs was used to identify several hundred floral organ-specific genes (Wellmer et al., 2004). If no appropriate mutants exist, specific cell types can be genetically ablated by expression of a cell-autonomous toxin, such as
diphtheria toxin subunit A or RNase, under the control of cell-type-specific promoters. Again, these approaches have been proven to work for transcript profiling (Tung et al., 2005), but it remains to be tested whether they could be useful for protein profiling.

DECREASING COMPLEXITY BY ORGANIZING ORGAN AND SUBCELLULAR PROTEOMES

Systematic analysis of accurate protein localization is essential to understand cellular networks in the context of compartmentalization, which is a fundamental design principle of eukaryotic cells. Organelle proteomics has therefore become a very active research field. Until recently, the protein inventory of cell organelles was based on proteins from isolated organelles, such as mitochondria, chloroplasts, and peroxisomes (Lilley and Dupree, 2007; Baginsky, 2009). This approach has limitations because true low-abundance organelle proteins often cannot be distinguished from contaminating proteins. Two approaches have been used to deal with this problem. First, a recently reported isolation procedure for mitochondria used the electrostatic characteristics of the mitochondrial surface to separate mitochondria from other organelles in an electric field. This procedure results in mitochondria preparations with higher purity, but the yield is low (Eubel et al., 2007). Second, information about the quantitative distribution of proteins along density gradients has been used to determine if a protein was enriched by the organellar isolation procedure. In practice, the abundance distribution profile of unknown proteins is compared to that of known organelle marker proteins. This strategy is referred to as protein correlation profiling (Foster et al., 2006) or LOPIT (Dunkley et al., 2006).

Both procedures, however, are of limited use for the analysis of proteome dynamics in response to a stimulus because the long time that is needed to isolate and purify organelles affects their proteome properties.
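
The comparison of gradient profiles that underlies protein correlation profiling can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used in the cited studies: the marker profiles, the two test profiles, the function names, and the correlation cutoff of 0.9 are all hypothetical, and Pearson correlation is assumed here as the similarity measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def assign_organelle(profile, marker_profiles, min_r=0.9):
    """Return (organelle, r) for the best-matching marker profile.

    The organelle is None when even the best correlation is below min_r,
    which is how a likely contaminant would show up in this sketch.
    """
    best_name, best_r = None, -1.0
    for name, marker in marker_profiles.items():
        r = pearson(profile, marker)
        if r > best_r:
            best_name, best_r = name, r
    if best_r < min_r:
        return None, best_r
    return best_name, best_r


# Hypothetical relative abundances across five density gradient fractions.
markers = {
    "mitochondrion": [0.05, 0.10, 0.50, 0.30, 0.05],
    "chloroplast": [0.02, 0.05, 0.13, 0.30, 0.50],
}
unknown_protein = [0.04, 0.12, 0.48, 0.31, 0.05]
flat_contaminant = [0.20, 0.20, 0.20, 0.21, 0.19]

print(assign_organelle(unknown_protein, markers))
print(assign_organelle(flat_contaminant, markers))
```

A protein that follows a marker profile is assigned to that organelle, whereas a protein spread evenly across the gradient matches no marker and is left unassigned.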
The time needed for organelle isolation is especially critical for transient posttranslational protein modifications. Thus, proteome dynamics is best analyzed at the cell or tissue level, followed by sorting of proteins into their respective organelles a posteriori. This strategy is now possible because substantial information about the protein complement of different cell organelles has accumulated (a comprehensive collection of proteome databases is, for example, available in Lu and Last, 2009). The SUBA database is most suitable for this purpose because it is frequently updated and well maintained. SUBA generates lists of organelle proteins using reliability criteria, for example evidence from several different proteomics studies, targeting prediction, GFP-localization assays, or a combination of this information (Heazlewood et al., 2007). For the chloroplast, two proteome reference tables have been established (Yu et al., 2008; Reiland et al., 2009). The overlap between these two proteome reference tables has generated a list of 1,156 proteins that can be considered high-confidence chloroplast proteins.

Although the number of organelle proteins is constantly increasing, it is not clear when an organelle proteome can be considered complete. Organelle proteomes are dynamic, and functional organelle proteomes differ significantly during development, in different cell types or tissues, and in different conditions. This problem can be addressed by considering organelles as cellular subnetworks and applying flux-balance modeling to assess network consistency. Initial modeling approaches with mitochondria and chloroplasts focused on a limited number of reactions, such as those of the Calvin cycle, amino acid biosynthesis, or the tricarboxylic acid cycle. Also, mitochondrial network reconstructions based on proteomics data are available, and the existing models allow prediction of metabolite accumulation for a limited number of metabolites (Vo and Palsson, 2007).
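
One simple consistency check that can precede flux-balance modeling is sketched below in Python. It is not a full flux-balance analysis, which would additionally require linear optimization; it only tests the steady-state mass balance for a given flux vector and lists dead-end metabolites, which point to gaps in a reconstruction. The toy network, the reaction and function names, and the flux values are hypothetical, and all reactions are assumed to be irreversible.

```python
# Toy network: reaction name -> {metabolite: stoichiometric coefficient}.
# Negative coefficients mark consumed, positive ones produced metabolites.
reactions = {
    "uptake_A": {"A": 1},           # boundary reaction supplying A
    "A_to_B": {"A": -1, "B": 1},
    "B_to_C": {"B": -1, "C": 1},
    "B_to_D": {"B": -1, "D": 1},    # D is produced but never consumed
    "export_C": {"C": -1},          # boundary reaction removing C
}


def dead_end_metabolites(reactions):
    """Metabolites that are only produced or only consumed in the network."""
    produced, consumed = set(), set()
    for stoichiometry in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0:
                produced.add(metabolite)
            elif coefficient < 0:
                consumed.add(metabolite)
    return sorted(produced ^ consumed)


def net_production(reactions, fluxes):
    """Net production rate of each metabolite for a given flux vector."""
    balance = {}
    for name, stoichiometry in reactions.items():
        for metabolite, coefficient in stoichiometry.items():
            balance[metabolite] = (
                balance.get(metabolite, 0.0) + coefficient * fluxes[name]
            )
    return balance


# A hypothetical flux distribution that routes all carbon from A to C.
fluxes = {"uptake_A": 1.0, "A_to_B": 1.0, "B_to_C": 1.0,
          "B_to_D": 0.0, "export_C": 1.0}

gaps = dead_end_metabolites(reactions)
balance = net_production(reactions, fluxes)
at_steady_state = all(abs(rate) < 1e-9 for rate in balance.values())

print(gaps, at_steady_state)
```

In this toy case the flux vector satisfies the steady-state condition only because the reaction leading to the dead-end metabolite carries no flux, which is the kind of inconsistency that signals a missing reaction or transporter.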
A recent flux-balance model of the primary metabolism in Chlamydomonas reinhardtii localized reactions into chloroplasts, mitochondria, and the cytosol and systematically assessed the contribution of different organelles to biomass production (Boyle and Morgan, 2009). The above examples illustrate the excellent suitability of metabolic network reconstruction to identify gaps in existing knowledge.

THE CHALLENGE OF DATA INTEGRATION: GENOME-SCALE ANALYSIS OF RNA-PROTEIN CORRELATIONS

Quantitative information about protein accumulation at genome scale offers entirely new insights into network function and the behavior of organs, tissues, and cells. Because gene expression is regulated at different levels, a comparison between transcript and protein accumulation can provide information about the rate of protein translation and the degree of posttranscriptional regulation. We have recently analyzed the correlation between protein and transcript abundance in representative samples from different plant organs and found mostly positive correlations in the range from 0.5 to 0.68 (Baerenfaller et al., 2008). The lowest correlation was observed for seeds, which accumulate stable storage proteins whose abundance is largely uncoupled from transcription. The highest correlation was obtained in leaves, suggesting that the most abundant photosynthetic proteins are predominantly regulated at the transcriptional level. It is clear that such a genome-scale analysis only offers a global view of regulatory events and does not allow a systematic assessment of individual enzyme regulation. A more refined comparison of protein and transcript levels showed that the correlation between transcript and protein abundance can vary significantly between different pathways (Kleffmann et al., 2004) and most likely also between different enzymes in the same pathway. Figure 2 shows an example of a correlation analysis of a representative leaf transcriptome and proteome for a selection of 345 genes/proteins from primary and secondary metabolism pathways. Although the data were collected from various sources and summarized (see also Baerenfaller et al., 2008), the protein-to-transcript ratio was similar for most proteins, indicating that this analysis is robust. The ma-

Figure 2. Correlation analysis of transcript and protein abundance in Arabidopsis leaves based on 345 genes from various primary and secondary metabolism pathways. Transcript abundance was calculated as a representative expression vector derived from multiple Affymetrix ATH1 array measurements from leaf samples (data from Genevestigator; Hruz et al., 2008). The proteome data were obtained from distinct leaf samples. Approximately 20% of these genes/proteins had ratios of protein to transcript abundance deviating strongly from 1.

Plant Physiol. Vol. 152, 2010
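
The kind of transcript-to-protein comparison summarized in Figure 2 can be sketched in Python as follows. The sketch is an assumption-laden illustration and not the analysis behind the figure: the gene identifiers and abundance values are invented, both data types are assumed to be scaled to comparable units, the correlation is computed on log2 values, and the twofold cutoff for a strongly deviating ratio is an arbitrary choice.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical (transcript, protein) abundances in comparable arbitrary units.
abundance = {
    "gene_1": (100.0, 110.0),
    "gene_2": (400.0, 380.0),
    "gene_3": (50.0, 45.0),
    "gene_4": (800.0, 150.0),  # protein much lower than the transcript suggests
    "gene_5": (200.0, 210.0),
}

log_transcript = [log2(t) for t, _ in abundance.values()]
log_protein = [log2(p) for _, p in abundance.values()]
r = pearson(log_transcript, log_protein)

# Flag genes whose protein-to-transcript ratio deviates more than twofold
# from 1 (assumed cutoff: absolute log2 ratio above 1).
MAX_ABS_LOG2_RATIO = 1.0
flagged = sorted(
    gene
    for gene, (transcript, protein) in abundance.items()
    if abs(log2(protein / transcript)) > MAX_ABS_LOG2_RATIO
)

print(round(r, 2), flagged)
```

Genes flagged in this way are candidates for posttranscriptional regulation, although a deviating ratio can also reflect measurement error in either dataset.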