
An Introduction to Several Typical Algorithms in Biological Sequence Analysis

Outline: background and research directions of bioinformatics; profile hidden Markov models for sequence families (Profile HMMs); motif discovery. Liu Lifang, School of Computer Science, Xidian University.

Background: DNA (deoxyribonucleic acid)

1. Molecular composition of DNA: nucleotides, each consisting of
   • a phosphate group
   • a sugar
   • one of four bases: adenine (A), guanine (G), cytosine (C), thymine (T)
2. Base-pairing rules:
   • A (adenine) pairs with T (thymine)
   • C (cytosine) pairs with G (guanine)
3. A purine and a pyrimidine are joined by hydrogen bonds to form one base pair.

4. A DNA strand has a direction: 5′ → 3′.
5. DNA forms a double-helix structure.

RNA, transcription, and translation
1. RNA (ribonucleic acid): single-stranded; uracil (U) replaces thymine (T); found in both the nucleus and the cytoplasm.

2. Transcription: copying a DNA strand into an RNA strand; messenger RNA (mRNA); promoters.

3. Translation: the process in which the genetic information carried on mRNA directs the synthesis of a protein in the ribosome.
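The two steps above can be sketched in code. This is a minimal illustration, not a production tool: the codon table below covers only the four codons used in the example (a real table has all 64 entries), and transcription is simplified to replacing T with U on the coding strand.

```python
# Toy sketch of transcription (DNA -> mRNA) and translation
# (mRNA -> protein). CODON_TABLE is a tiny illustrative subset.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """Transcribe the coding strand of DNA to mRNA (T -> U)."""
    return dna.replace("T", "U")

def translate(mrna: str) -> list:
    """Read the mRNA codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "STOP":
            break
        protein.append(aa)
    return protein

mrna = transcribe("ATGTTTGGCTAA")
print(mrna)             # AUGUUUGGCUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly']
```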

Mutation
1. In the course of evolution, imperfect copying causes local changes in the content of DNA.

2. The main kinds of mutation are:
   • substitution
   • insertion or deletion
   • rearrangement

Genes and genomes
A gene contains exons and introns. Any chromosome carries many genes; a chromosome of a higher organism may carry thousands of them. The complete set of gene sequences in a cell, together with their intervening sequences, is collectively called the genome.

The Human Genome Project

The genetic code
1. The genetic code is a logical mapping that specifies which protein sequence is determined by the genetic information stored in DNA and mRNA.

2. Each base triplet is called a codon.
3. There are 4³ = 64 possible base triplets but only 20 types of amino acid, so different codons may encode the same amino acid.
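The counting argument can be checked by direct enumeration; the degeneracy of the code then follows from the pigeonhole principle:

```python
from itertools import product

# All possible codons: ordered triples over the four RNA bases.
codons = ["".join(t) for t in product("UCAG", repeat=3)]
print(len(codons))  # 64 = 4 ** 3

# 64 codons but only 20 amino acids: by pigeonhole, at least one
# amino acid must be encoded by several codons, so the genetic
# code is necessarily degenerate (leucine, e.g., has six codons).
```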

Questions raised by the central dogma of molecular biology:
1. sequence alignment;
2. genome rearrangement;
3. prediction of protein structure and function;
4. gene finding (exons and introns);
5. sequence assembly;
...

Research directions in bioinformatics: genome sequence assembly, gene identification, gene function prediction, gene polymorphism analysis, gene evolution, mRNA structure prediction, gene-chip (microarray) design, microarray data analysis, analysis of disease-related genes, protein sequence analysis, protein family classification, protein structure prediction, protein folding, metabolic pathway analysis, transcriptional regulation mechanisms, protein-chip design, protein-chip data analysis, and drug design. (These correspond to the subject areas of the journal Bioinformatics: genome analysis, sequence analysis, structural bioinformatics, gene expression, genetics and population analysis, systems biology, data and text mining, databases and ontologies.)

Multiple sequence alignment
Sequence analysis is the primary task of bioinformatics; multiple sequence alignment and motif discovery are its two main methods.

Optimization models for the multiple sequence alignment problem
1) The SP score (weighted sums-of-pairs with affine gap penalties). For a multiple alignment A of n sequences S, the SP score is defined as

COST(A) = ∑_{i=2}^{n} ∑_{j=1}^{i-1} w_ij · COST(s_i, s_j)

where COST(s_i, s_j) is the score of the pairwise alignment of sequences s_i and s_j induced by A, and w_ij is the weight of the two sequences.

If an alignment A′ of S satisfies COST(A′) = max_A COST(A), then A′ is called an optimal alignment.
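A minimal sketch of the sum-of-pairs score, assuming a toy column scoring scheme (match = 1, mismatch or gap column = 0) and unit weights in place of the affine gap penalties and sequence weights of the full model:

```python
# Toy sum-of-pairs score: COST(A) = sum over pairs (i, j), i > j,
# of w_ij * COST(s_i, s_j). Scoring scheme is illustrative only.

def pair_cost(a: str, b: str) -> int:
    """Score one aligned column of a sequence pair."""
    if a == "-" or b == "-":
        return 0
    return 1 if a == b else 0

def sp_score(alignment, weights=None):
    """Weighted sum-of-pairs score of a multiple alignment."""
    total = 0.0
    for i in range(1, len(alignment)):
        for j in range(i):
            w = weights[i][j] if weights else 1.0
            total += w * sum(pair_cost(a, b)
                             for a, b in zip(alignment[i], alignment[j]))
    return total

A = ["ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
print(sp_score(A))
```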

2) The COFFEE score (consistency-based objective function for alignment evaluation). The COFFEE score of a multiple alignment A is defined as

COST(A) = [ ∑_{i=2}^{n} ∑_{j=1}^{i-1} w_ij · SCORE(A_ij) ] / [ ∑_{i=2}^{n} ∑_{j=1}^{i-1} w_ij · LEN(A_ij) ]   (2.6)

where A_ij is the pairwise alignment of s_i and s_j induced by A, LEN(A_ij) is its alignment length, and SCORE(A_ij) is the consistency of A_ij with the library's optimal alignment of s_i and s_j, i.e. the number of aligned residue pairs shared by A_ij and the library's pairwise alignments.

Here w_ij is the similarity of the two sequences: the number of positions in the optimal pairwise alignment of s_i and s_j where identical residues are aligned, divided by the length of the shorter sequence, min{l_i, l_j}.

As before, if an alignment A′ of S satisfies COST(A′) = max_A COST(A), then A′ is an optimal alignment.
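The sequence weight w_ij just defined can be sketched directly. This assumes the optimal pairwise alignment is already given (here a hand-made toy example); computing it would require a separate alignment step.

```python
# COFFEE-style sequence weight w_ij: identical aligned residue
# pairs in the optimal pairwise alignment, divided by the length
# of the shorter unaligned sequence, min(l_i, l_j).

def coffee_weight(aln_i: str, aln_j: str, len_i: int, len_j: int) -> float:
    identical = sum(1 for a, b in zip(aln_i, aln_j)
                    if a == b and a != "-")
    return identical / min(len_i, len_j)

# Toy pairwise alignment of s_i = "ACAAGC" (length 6)
# and s_j = "ACGC" (length 4):
w = coffee_weight("ACAAGC", "AC--GC", 6, 4)
print(w)  # 1.0: all 4 residues of the shorter sequence align identically
```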

3) Profile hidden Markov models (Profile HMMs)

Workflow: aligned sequences are used to build a profile HMM (training); the trained model then supports database search, construction of multiple alignments (Viterbi algorithm), and querying a sequence against a profile-HMM database (forward algorithm).

Including BEGIN and END states: two new states can be added to the Markov model description; they are treated as silent states. The addition of the END state essentially means that the model now defines a probability distribution over all possible sequences of any length.

(Worked example: predicting fair vs. loaded rolls of a die.)

HMM formal definition. The objective is to distinguish the sequence of states from the sequence of symbols; call the state sequence the path π, whose i-th state is π_i. The chain is defined by the transition probabilities

a_kl = P(π_i = l | π_{i-1} = k)

and the emission probabilities

e_k(b) = P(x_i = b | π_i = k),

so the joint probability of an observed sequence x and a state sequence π is

P(x, π) = a_{0π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}.

However, to use this formula we must already know the path.

The evaluation problem. Suppose we have an HMM whose transition probabilities a_kl and emission probabilities e_k(b) are all known.

Compute the probability P(x) that this model produces a particular observed sequence x = x_1 x_2 ⋯ x_L.

The decoding problem. Given an HMM and an observed sequence x = x_1 x_2 ⋯ x_L that it produced, determine the state sequence most likely to have generated the observation, i.e. the path π = π_1 ⋯ π_L.

The learning problem. Suppose only the rough structure of the HMM is known (for example, the number of states and the number of visible symbols per state), while the transition probabilities a_kl and emission probabilities e_k(b) are unknown.

How can these parameters be determined from a set of training sequences of visible symbols?
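When the path is known, the joint probability P(x, π) from the formal definition above can be evaluated directly by multiplying the start transition, then one emission and one transition per position. A minimal sketch with illustrative two-state numbers (not from the slides); the final transition to the end state is omitted for brevity:

```python
# Direct computation of P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_i+1}
# for a toy two-state HMM over coin-flip symbols H/T.
START = {"F": 0.5, "L": 0.5}               # a_{0k}
TRANS = {"F": {"F": 0.9, "L": 0.1},
         "L": {"F": 0.2, "L": 0.8}}        # a_{kl}
EMIT  = {"F": {"H": 0.5, "T": 0.5},
         "L": {"H": 0.9, "T": 0.1}}        # e_k(b)

def joint_prob(x: str, path: str) -> float:
    p = START[path[0]]                     # start transition a_{0,pi_1}
    for i, (sym, state) in enumerate(zip(x, path)):
        p *= EMIT[state][sym]              # emission e_{pi_i}(x_i)
        if i + 1 < len(path):
            p *= TRANS[state][path[i + 1]] # transition a_{pi_i, pi_i+1}
    return p

# 0.5 * 0.5 * 0.1 * 0.9 * 0.8 * 0.1 = 0.0018
print(joint_prob("HHT", "FLL"))
```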

These are the three core problems of hidden Markov models.

A profile HMM for a DNA motif alignment: transitions are drawn as arrows whose thickness indicates their probability, and in each state a histogram shows the emission probabilities of the four bases. Example alignment:

ACAC--AGC
AGA---ATC
ACCG--ATC

From such an alignment one reads off the transition probabilities, the output (emission) probabilities, and the insertion states.

Building the final topology: the model contains matching states, insertion states, and deletion states, with the number of matching states set to the average sequence length in the family. The PFAM database of protein families is built from profile HMMs of this kind.

The Viterbi algorithm. Given an observation sequence x = x_1 x_2 ⋯ x_L and an HMM, we would like to know the most probable state path that led to the generation of the observed sequence, i.e. the state sequence that is optimal in some meaningful sense. The Viterbi algorithm is a dynamic programming algorithm that finds the best path (state sequence), the one with the highest probability:

π* = argmax_π P(x, π).

The most probable path can be found recursively: suppose v_k(i), the probability of the most probable path ending in state k with observation x_i, is known for all states k; then

v_l(i+1) = e_l(x_{i+1}) · max_k (v_k(i) a_kl).

Evaluating every pair of states at each of the L positions gives a running time of O(K²L) for a model with K states. All sequences have the same start state, so v_0(0) = 1; by keeping pointers backwards, the actual path can be found by backtracking.
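Estimating match-state emission probabilities from the three-sequence motif alignment above can be sketched as follows. This assumes add-one (Laplace) pseudocounts, and it skips the step of classifying columns into match vs. insert states:

```python
# Estimate the emission distribution of one match state (one
# alignment column) from counts, with a pseudocount of 1 per base.
ALIGNMENT = ["ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
BASES = "ACGT"

def match_emissions(alignment, col):
    """Emission probabilities of the match state for one column."""
    counts = {b: 1 for b in BASES}          # Laplace pseudocounts
    for seq in alignment:
        if seq[col] != "-":                 # gaps don't emit
            counts[seq[col]] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

# Column 0 contains three A's, so A dominates: (3 + 1) / 7.
print(match_emissions(ALIGNMENT, 0))
```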
The full algorithm:

Initialisation (i = 0): v_0(0) = 1; v_k(0) = 0 for k > 0.
Recursion (i = 1..L): v_l(i) = e_l(x_i) · max_k (v_k(i-1) a_kl); ptr_i(l) = argmax_k (v_k(i-1) a_kl).
Termination: P(x, π*) = max_k (v_k(L) a_k0); π*_L = argmax_k (v_k(L) a_k0).
Traceback (i = L..1): π*_{i-1} = ptr_i(π*_i).

The forward algorithm. We are also interested in calculating the probability P(x) of a sequence given a model of the system behavior. The number of possible paths increases exponentially with the sequence length, so it is not possible to enumerate all of them; instead P(x) is computed by a dynamic programming algorithm like the Viterbi algorithm, replacing the maximization with a sum:

P(x) = ∑_π P(x, π),
f_k(i) = P(x_1 ⋯ x_i, π_i = k),
f_l(i+1) = e_l(x_{i+1}) ∑_k f_k(i) a_kl.

Backward algorithm and posterior state probabilities. First calculate the probability of producing the entire observed sequence with the i-th symbol produced by state k:

P(x, π_i = k) = P(x_1 ⋯ x_i, π_i = k) · P(x_{i+1} ⋯ x_L | π_i = k) = f_k(i) b_k(i),

where f_k(i) is the forward term above and

b_k(i) = P(x_{i+1} ⋯ x_L | π_i = k).

Backward algorithm:
Initialisation: b_k(L) = a_k0 for all k.
Recursion (i = L-1, …, 1): b_k(i) = ∑_l a_kl e_l(x_{i+1}) b_l(i+1).
Termination: P(x) = ∑_l a_0l e_l(x_1) b_l(1).

The calculation of the backward term is similar to that of the forward term, except that it starts from the end of the sequence. (Example: the posterior probability of the fair die.)

Parameter estimation for HMMs. The most difficult problem is specifying the model:
• what are the states?
• how are they connected?
• what are the transition and emission probabilities?
Start with estimating the probabilities. From training sequences x^1 to x^n, the log-likelihood of the model is

l(x^1, …, x^n | θ) = log P(x^1, …, x^n | θ) = ∑_{j=1}^{n} log P(x^j | θ),

where θ represents all of the transition and emission probabilities.

(Figure: the occasionally dishonest casino; the real model versus models estimated from 300 rolls and from 30,000 rolls.)
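The Viterbi and forward recursions above can be sketched together on a toy two-state "occasionally dishonest casino" HMM. The transition and emission numbers below are illustrative, not from the slides; the end-state transition a_k0 is omitted for brevity, and Viterbi works in log space to avoid underflow:

```python
import math

# Toy casino HMM: fair die "F" vs. loaded die "L" (loaded favors 6).
STATES = ["F", "L"]
START  = {"F": 0.5, "L": 0.5}                            # a_{0k}
TRANS  = {"F": {"F": 0.95, "L": 0.05},
          "L": {"F": 0.10, "L": 0.90}}                   # a_{kl}
EMIT   = {"F": {s: 1 / 6 for s in "123456"},
          "L": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2}}  # e_k(b)

def viterbi(x: str) -> str:
    """Most probable path pi* = argmax_pi P(x, pi), with traceback."""
    v = [{k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}]
    ptr = []
    for sym in x[1:]:
        row, back = {}, {}
        for l in STATES:
            # v_l(i) = e_l(x_i) * max_k(v_k(i-1) * a_kl), in log space
            best = max(STATES, key=lambda k: v[-1][k] + math.log(TRANS[k][l]))
            back[l] = best
            row[l] = (v[-1][best] + math.log(TRANS[best][l])
                      + math.log(EMIT[l][sym]))
        v.append(row)
        ptr.append(back)
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for back in reversed(ptr):               # traceback via pointers
        path.append(back[path[-1]])
    return "".join(reversed(path))

def forward(x: str) -> float:
    """P(x): same recursion with max replaced by a sum."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for sym in x[1:]:
        f = {l: EMIT[l][sym] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())

rolls = "315116246446644245"
print(viterbi(rolls))   # one state letter per roll
print(forward(rolls))   # total probability of the roll sequence
```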
