
Gene Family Analysis Workflow

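Gene Family Analysis Workflow (Part 1)

Falling sequencing costs in recent years have led to more and more sequenced genomes, and the public databases now hold a large amount of usable data.

How can these resources be put to use? This part introduces an approach that can lead to a publication without any new sequencing: genome-wide identification and analysis of gene family members, which is currently a very active area.

I. Basic analyses
⏹ Database search and member identification
⏹ Phylogenetic tree construction
⏹ Conserved domain and motif analysis
⏹ Gene structure analysis
⏹ Expression analysis from transcriptome or real-time quantitative PCR data

II. Database search and member identification

1. Database search

1) First learn how the databases work and how to download the genome data for the species you want to analyze.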

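The databases most commonly used are:
⏹ Brachypodiumdb: /
⏹ TAIR: /
⏹ Rice Genome Annotation Project: /
⏹ Phytozome: /
⏹ Ensembl: /genome_browser/index.html
⏹ NCBI genome database: /assembly/?term=

2) Obtaining family members that have already been identified.

To collect all published members of a gene family in another species, the simplest way is to download that species' protein sequence file (available from the databases above) and pull out the members by the IDs given in the paper.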
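Pulling published members out of a downloaded protein file by ID can be sketched as below. This is a minimal illustration, not a fixed recipe: the sequence IDs and sequences are made-up toy data, the function names are hypothetical, and it assumes the ID in the paper matches the FASTA header up to the first whitespace.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {ID: sequence}; the ID is the header up to the first whitespace."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def select_members(proteome: Dict[str, str], member_ids: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split the requested IDs into those found in the proteome and those missing."""
    found = {i: proteome[i] for i in member_ids if i in proteome}
    missing = [i for i in member_ids if i not in proteome]
    return found, missing


# Toy proteome (hypothetical IDs and sequences); in practice, pass an open file handle.
example_fasta = """>toy_gene_1.1 example description
MKLVAAGHT
RPNDEELVG
>toy_gene_2.1
MAASEHRCV
""".splitlines()

proteome = read_fasta(example_fasta)
found, missing = select_members(proteome, ["toy_gene_1.1", "toy_gene_9.1"])
for seq_id, seq in found.items():
    print(f">{seq_id}\n{seq}")
print("not found:", missing)
```

IDs reported as missing usually point to a different annotation version or a transcript suffix mismatch, so they are worth checking by hand.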

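For species without a genome-wide identification, search the following databases:
a. NCBI: nucleotide and protein db.
b. EBI: http://www.ebi.a/.
c. UniProtKB: /uniprot/

2. Alignment tools

BLAST and HMMER are the usual choices. The commands are as follows.

⏹ Local BLAST
    formatdb -i db.fas -p T
    blastall -p blastp -i known.fas -d db.fas -m 8 -b 2 -e 1e-5 -o alignresult.txt
Use -p T in formatdb for a protein database and -p F for a nucleotide one. The blastall program (blastp here) and the -b and -e values can be changed as needed. -m 8 writes tabular output, and -b 2 reports at most two different subject (database) sequences per query.

⏹ HMMER (hidden Markov model) search. It serves the same purpose as PSI-BLAST, with higher sensitivity but lower speed.
    hmmbuild --informat afa known.hmm alignknown.fa
    hmmsearch known.hmm db.fas > align.out

3. Filtering
⏹ Identity: at least 50%.
⏹ Coverage: the aligned region should also exceed 50%, or span the length of the protein domain.
⏹ Domain: the complete domain of the protein family must be present.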
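The identity and coverage thresholds can be applied to the tabular BLAST output (-m 8) with a short script. The sketch below makes several assumptions that are not in the original: coverage is measured over the query sequence, query lengths are supplied in a dictionary, and the HMMER hits are already available as a set of IDs. The rows and IDs are toy data.

```python
from typing import Dict, Iterable, Set


def filter_blast_hits(rows: Iterable[str], query_lengths: Dict[str, int],
                      min_identity: float = 50.0, min_coverage: float = 0.5) -> Set[str]:
    """Return subject IDs having at least one hit that passes both thresholds.

    Expects the 12 tab-separated columns of BLAST tabular output:
    query, subject, % identity, alignment length, mismatches, gap openings,
    q.start, q.end, s.start, s.end, e-value, bit score.
    """
    kept: Set[str] = set()
    for row in rows:
        if row.startswith("#"):
            continue
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        # Assumption: coverage = aligned part of the query / full query length.
        coverage = (abs(q_end - q_start) + 1) / query_lengths[query]
        if identity >= min_identity and coverage >= min_coverage:
            kept.add(subject)
    return kept


# Toy input: one query of length 200 and three hits.
rows = [
    "q1\ts1\t62.5\t180\t60\t2\t11\t190\t5\t184\t1e-60\t210",  # passes both thresholds
    "q1\ts2\t45.0\t150\t80\t3\t20\t169\t1\t150\t1e-20\t95",   # identity too low
    "q1\ts3\t80.0\t40\t8\t0\t1\t40\t1\t40\t1e-10\t70",        # coverage too low
]
blast_pass = filter_blast_hits(rows, {"q1": 200})

# Keep only candidates that HMMER also reported (hypothetical ID set).
hmmer_hits = {"s1", "s4"}
candidates = blast_pass & hmmer_hits
print(sorted(candidates))
```

Candidates that survive this step still need the domain check, which the script does not attempt.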

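The domain can be checked with pfamdb (/) and NCBI Batch CD-Search (/Structure/bwrpsb/bwrpsb.cgi). Two further criteria are EST support and detection by both BLAST and HMMER.

4. The steps above yield all members of the family.

Gene Family Analysis Workflow (Part 2)

This part covers the evolutionary analyses found in gene family papers, chiefly the construction and analysis of phylogenetic trees.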

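I. Basic steps for building a phylogenetic tree

1. Multiple sequence alignment: the MUSCLE program.

2. Model selection, with separate programs for protein and nucleotide sequences: ProtTest for protein, and ModelTest or jModelTest for DNA (http:///58001704/blog).

3. Algorithm choice. There are three: NJ (neighbor joining), ML (maximum likelihood) and BI (Bayesian inference).

4. Software choice: MEGA (at least 1000 bootstrap replicates), PhyML and MrBayes (http:///58001704/main).

5. Tree decoration. In MEGA: view -> options and subtree -> draw options. Trees can also be decorated in Word (/58001704/main).

II. Detailed steps

2.1 Multiple sequence alignment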

MUSCLE is the usual choice, because it is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW.

2.2 Model selection

For a tree built from protein sequences, the following command can be used:
    java -Xmx250m -classpath path/ProtTest.jar prottest.ProtTest -i alignmfile.phy

[Figure: ProtTest output]

Note that the input must be in ".phy" format, which allows only ten characters per sequence name, and that no two names may be identical.
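Long gene IDs often collide once they are cut to ten characters, so it helps to generate the short names programmatically and keep a lookup table for relabeling the tree afterwards. The following is a minimal sketch; the function name and the numeric-suffix scheme are illustrative choices rather than part of any tool, and the input names are assumed to be distinct.

```python
from typing import Dict, List


def make_phylip_names(names: List[str], width: int = 10) -> Dict[str, str]:
    """Map each original name to a unique label of at most `width` characters."""
    mapping: Dict[str, str] = {}
    used = set()
    for name in names:
        label = name[:width]
        counter = 1
        # On a collision, shorten the name further and append a running number.
        while label in used:
            suffix = str(counter)
            label = name[: width - len(suffix)] + suffix
            counter += 1
        used.add(label)
        mapping[name] = label
    return mapping


# Toy names (hypothetical); the first two share the same first ten characters.
mapping = make_phylip_names(["longgenename_A", "longgenename_B", "short"])
for original, label in mapping.items():
    print(f"{label:<10}  {original}")
```

Saving the printed table alongside the alignment makes it straightforward to restore the full names on the finished tree.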

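Abbreviations in the ProtTest output:
⏹ AIC: Akaike Information Criterion framework.
⏹ G: gamma distribution parameter (gamma shape).
⏹ I: proportion of invariable sites.

2.3 Building the phylogenetic tree

2.3.1 What the tree is used for

a. Cluster analysis, such as subfamily classification. For example, a tree of the MAPKKK gene family separates clearly into three subfamilies: MEKK, Raf and ZIK.

b. Identifying relatedness. Genes that fall on the same branch of the tree are usually closely related.

c. Gene family duplication analysis. Two types of duplication event are studied, with the following commonly used criteria:
⏹ Tandem duplication: identity and coverage both above 70%, and the genes tightly linked (Holub, 2001).
⏹ Chromosomal segment duplication: Plant Genome Duplication Database (PGDD: /duplication/).

2.3.2 The tree itself

ML trees are generally the more accurate, but they should be cross-checked against another method, such as an NJ tree.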
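The tandem-duplication criteria listed under 2.3.1 c can be screened with a few lines of code. In the sketch below, "tightly linked" is given a concrete meaning that is an assumption of this example and not taken from Holub (2001): both genes lie on the same chromosome with at most one gene between them. Gene names, positions and alignment values are toy data.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pair(NamedTuple):
    gene_a: str
    gene_b: str
    identity: float  # percent identity of the pairwise alignment
    coverage: float  # percent of the sequence covered by the alignment


def find_tandem_pairs(pairs: List[Pair], gene_order: Dict[str, Tuple[str, int]],
                      min_identity: float = 70.0, min_coverage: float = 70.0,
                      max_intervening: int = 1) -> List[Tuple[str, str]]:
    """Return pairs above both thresholds that are neighbors on one chromosome.

    gene_order maps gene -> (chromosome, rank of the gene along that chromosome).
    max_intervening is an assumed definition of "tightly linked".
    """
    tandem = []
    for p in pairs:
        chrom_a, rank_a = gene_order[p.gene_a]
        chrom_b, rank_b = gene_order[p.gene_b]
        same_chrom = chrom_a == chrom_b
        close = abs(rank_a - rank_b) - 1 <= max_intervening
        similar = p.identity > min_identity and p.coverage > min_coverage
        if same_chrom and close and similar:
            tandem.append((p.gene_a, p.gene_b))
    return tandem


gene_order = {
    "g1": ("chr1", 10), "g2": ("chr1", 11), "g3": ("chr1", 50),
    "g4": ("chr2", 11), "g5": ("chr1", 12),
}
pairs = [
    Pair("g1", "g2", 85.0, 90.0),  # adjacent and similar: tandem
    Pair("g1", "g3", 90.0, 95.0),  # similar but far apart
    Pair("g2", "g4", 88.0, 92.0),  # similar but on different chromosomes
    Pair("g1", "g5", 75.0, 80.0),  # one gene in between: still counted
    Pair("g2", "g5", 65.0, 90.0),  # adjacent but identity too low
]
tandem_pairs = find_tandem_pairs(pairs, gene_order)
print(tandem_pairs)
```

Changing max_intervening, or replacing gene rank with a physical distance in kb, adapts the same screen to whichever linkage definition a given paper uses.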

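2.3.3 Evolutionary analysis: Ka/Ks calculation

2.3.3.1 The simple way: use the PAL2NAL web server (http://www.bork.embl.de/pal2nal/).

2.3.3.2 The standard way:

a. ParaAT:
    ParaAT.pl -h test.homologs -n test.cds -a test.pep -p proc -f axt -k -o output

b. KaKs_Calculator (with NG or another method):
    KaKs_Calculator -m NG -i test.axt -o test.axt.kaks

c. Divergence time (T): T = Ks/2λ, where λ is the rate of synonymous substitution per site per year, on average 5.1-7.1 × 10^-9.

d. Meaning of Ka/Ks:
⏹ Ka/Ks = 1: neutral evolution.
⏹ Ka/Ks < 1: purifying (negative) selection.
⏹ Ka/Ks > 1: positive selection.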
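The divergence-time formula and the Ka/Ks interpretation can be applied to KaKs_Calculator results as sketched below. The Ka and Ks values are invented for illustration, and the rate 6.5 × 10^-9 is simply one value inside the 5.1-7.1 × 10^-9 range quoted above; use the rate appropriate to your species.

```python
def divergence_time(ks: float, rate: float) -> float:
    """T = Ks / (2 * lambda); in years when rate is substitutions per site per year."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate)


def selection_mode(ka: float, ks: float) -> str:
    """Classify a gene pair by its Ka/Ks ratio."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral evolution"


# Illustrative values only (not from any real gene pair).
ka, ks = 0.13, 0.65
rate = 6.5e-9  # assumed substitutions per synonymous site per year

t_years = divergence_time(ks, rate)
print(f"Ka/Ks = {ka / ks:.2f} ({selection_mode(ka, ks)})")
print(f"T = {t_years / 1e6:.1f} million years")
```

Because T scales inversely with λ, reporting the time for both ends of the rate range gives a more honest picture than a single number.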

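Positively selected genes accumulate mutations that confer a fitness advantage and can evolve new functions.

Gene Family Analysis Workflow (Part 3)

This part covers gene structure analysis.

1. Motif analysis

Use the MEME program with the following command:
    meme sample.fa -dna -revcomp -nmotifs 10 -mod zoops -minw 6 -maxw 50 > meme_htmlFormat.html

2. Gene structure diagrams

The online tool GSDS 2.0 can be used (website: /).

[Figure: GSDS 2.0 usage]

[Figure: example result]

3. Common gene structure statistics, compiled in Excel or with your own script:
a. The number of introns and exons.
b. The intron splicing phase pattern (phase 0, 1 or 2).
c. Marked regions, for example the kinase domain.
d. Sequence length.
e. UTRs.

4. Promoter analysis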
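The input for promoter analysis is the stretch of DNA upstream of a gene's ATG, typically 1000 or 1500 bp as noted in this section. Extracting it can be sketched as follows. The coordinate convention is an assumption of this example: atg_pos is the 1-based chromosome position of the A of the start codon, which for a minus-strand gene is the highest coordinate of the codon. The sequences are toy data.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A, C, G, T in either case)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def upstream_region(chrom_seq: str, atg_pos: int, strand: str, length: int = 1000) -> str:
    """Return up to `length` bp immediately upstream of the start codon, 5'->3' on the gene's strand."""
    if strand == "+":
        end = atg_pos - 1                # 0-based index of the A of ATG
        start = max(0, end - length)     # clip at the chromosome start
        return chrom_seq[start:end]
    if strand == "-":
        start = atg_pos                  # 0-based index just past the codon
        end = min(len(chrom_seq), start + length)
        return reverse_complement(chrom_seq[start:end])
    raise ValueError("strand must be '+' or '-'")


# Plus-strand toy gene: ATG starts at position 9 of this sequence.
plus_chrom = "AAAACCCCATGGGG"
print(upstream_region(plus_chrom, 9, "+", length=4))

# Minus-strand toy gene: its ATG appears as CAT on the forward strand, A at position 7.
minus_chrom = "TTTTCATGGGGA"
print(upstream_region(minus_chrom, 7, "-", length=5))
```

The returned sequence is shorter than requested when the gene sits near a chromosome end, so it is worth checking the length before submitting it.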

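Website, mainly for plants: http://bioinformatics.psb.ugent.be/webtools/plantcare/html/

Notes:
a. Use the IE browser.
b. Only one sequence can be submitted per search, and its length is limited to 1000 bp.
c. Source of the DNA sequence: the 1000 or 1500 bp upstream of a gene's ATG.

[Figure: analysis results]

Gene Family Analysis Workflow (Part 4)

I. Sites for downloading raw transcriptome and microarray data

1. GEO datasets/profiles (/gds).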

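[Figure: GEO usage]

GEO IDs follow the hierarchy GPL -> GSE -> GSM:
⏹ GPL: platform.
⏹ GSE: series; one platform can have multiple series.
⏹ GSM: sample; one series can have multiple samples.

GDS ≈ GSE. The difference is that data labeled GDS can be analyzed online for a single gene, which is simple and easy. Data from the same GPL can be compared across experiments.

[Figure: analyzing transcriptome data online]

2. EBI ArrayExpress (/arrayexpress/).

[Figure: downloading data from ArrayExpress]

3. PLEXdb (/). Note that downloading requires a user name and password.

[Figure: downloading data from PLEXdb]

4. SRA db (/sra/)

5. DRA db (http://trace.ddbj.nig.ac.jp/DRASearch/)

II. Data processing

Raw data must be processed before any downstream analysis.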

1. Microarray data

The raw data are in ".cel" format.

Taking Affymetrix microarray data as the example, the main R commands are:
    library(affy)
    library(makecdfenv)
    library(limma)  # provides lmFit, makeContrasts, contrasts.fit, eBayes, topTable, decideTests, vennDiagram
    # library(...)  further packages, elided in the original
    barleyGenome <- make.cdf.env("barleyGenome.cdf")
    mydata <- ReadAffy()  # Reads the ".cel" files to be analyzed from the working directory.
    eset <- rma(mydata)
    write.exprs(eset, file="mydata.txt")
    design <- model.matrix(~ -1 + factor(c(1,1,2,2,3,3)))  # Creates appropriate design matrix.
    colnames(design) <- c("group1", "group2", "group3")  # Assigns column names.
    fit <- lmFit(eset, design)  # Fits a linear model for each gene based on the given series of arrays.
    contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, levels=design)  # Creates appropriate contrast matrix to perform all pairwise comparisons.
    fit2 <- contrasts.fit(fit, contrast.matrix)  # Computes estimated coefficients and standard errors for a given set of contrasts.
    fit2 <- eBayes(fit2)  # Computes moderated t-statistics and log-odds of differential expression by empirical Bayes.
    topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=10)  # Generates list of top 10 ('number=10') differentially expressed genes sorted by B-values ('sort.by=B') for first comparison group.
    write.table(topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=500), file="limma_complete.xls", row.names=FALSE, sep="\t")  # Exports the limma statistics table (top 500 genes) for first comparison group.
    results <- decideTests(fit2, p.value=0.05)
    vennDiagram(results)

2. Transcriptome data processing
