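Bioinformatics Homework

1. Align the leghemoglobin protein from soybean and myoglobin from human with global and local alignment software (e.g., needle and water), respectively, and interpret the results.

ANSWER:

(1) Use Needle to align the two sequences:

# Aligned_sequences: 2
# 1: CAA38024.1
# 2: NP_001157488.1
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
# Length: 203
# Identity: 43/203 (21.2%)
# Similarity: 58/203 (28.6%)
# Gaps: 90/203 (44.3%)
# Score: 30.0

(2) Use Water to align the two sequences:

# Aligned_sequences: 2
# 1: CAA38024.1
# 2: NP_001157488.1
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
# Length: 32
# Identity: 11/32 (34.4%)
# Similarity: 15/32 (46.9%)
# Gaps: 0/32 (0.0%)
# Score: 35

Both programs used the same substitution matrix (EBLOSUM62), although the gap penalties differed (10.0/0.5 for Needle, 14/4 for Water), and the two scores differ.
The main reason is that Needle performs a standard pairwise global alignment, whereas Water performs a local alignment.
A global alignment must span the full length of both sequences, so it accumulates many gap penalties (90 of 203 columns are gaps) and scores lower than the local alignment, which reports only the best-matching 32-residue region and contains no gaps.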
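The difference between the two alignment modes can be illustrated with a minimal sketch. It assumes a toy scoring scheme (match +1, mismatch -1, linear gap penalty -2) rather than the EBLOSUM62 matrix and affine gap penalties that needle and water actually use, and the function names and test sequences are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    # First row: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: only the best-scoring subregion is reported."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart instead of carrying a negative score.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Toy sequences that share only a short common region (GGGG).
    a, b = "AAAAGGGG", "GGGGCCCC"
    print("global:", global_score(a, b))
    print("local: ", local_score(a, b))
```

With these toy sequences the global score is negative because the unrelated flanking residues must be aligned or gapped, while the local score counts only the shared GGGG region.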
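2. Evaluate the significance of the local protein alignment score of question 1 with PRSS and interpret the result.

The parameters were as follows:

Statistics: (shuffled [200]) MLE statistics: Lambda= 0.1886; K=0.0575
statistics sampled from 1 (1) to 200 sequences
Parameters: VT160 matrix (16:-7), open/ext: -12/-2

No E-value was obtained on either of two different websites, even with different matrices; the likely reason is that the similarity between the two sequences is too low to support homology.
If the two sequences are clearly homologous, the E-value is small and the preceding alignment is highly credible; otherwise, the preceding alignment is less credible and the evidence for homology is weak.
As a rule of thumb, an E-value below 0.001 is taken as evidence that the two sequences are homologous.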
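To show how a score relates to an E-value, the sketch below applies the standard Karlin-Altschul relation E = K·m·n·exp(-Lambda·S), using the Lambda and K values reported by PRSS. The raw scores and the sequence lengths m and n are hypothetical placeholders, since they are not given in the output, and PRSS itself may use a different search-space convention when it reports an expectation.

```python
import math

LAMBDA = 0.1886  # from the PRSS statistics line
K = 0.0575       # from the PRSS statistics line


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw alignment score into a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m, n = 150, 150  # hypothetical sequence lengths, not taken from the output
    for s in (30, 60, 90):  # hypothetical raw scores
        e = e_value(s, m, n)
        verdict = "significant" if e < 1e-3 else "not significant"
        print(f"S={s}: {bit_score(s):.1f} bits, E={e:.3g} ({verdict} at E < 0.001)")
```

The E-value falls exponentially as the score rises, which is why a fixed cutoff such as 0.001 separates convincing alignments from ones that could easily arise by chance.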
3. Obtain two sequences from GenBank with the accession numbers P0A7G6 and P25454. Align them with LALIGN (EBI or University of Virginia server). First try gap penalties of -12 and -2. Note the length of the alignment, the E-value, the percent identity, and the score of the alignment, then repeat the alignment with gap penalties of -5 and -1 and note the features of the alignment. Describe what happened when the gap penalties were reduced, and why.

ANSWER:

(1) First try gap penalties of -12 and -2:

Visual output:

Alignment:

Waterman-Eggert score: 214; 58.4 bits; E(1) < 3.7e-13
28.7% identity (57.4% similar) in 230 aa overlap (34-241:153-375)

Waterman-Eggert score: 62; 20.7 bits; E(1) < 0.082
30.3% identity (55.1% similar) in 89 aa overlap (25-111:178-256)

Waterman-Eggert score: 46; 16.7 bits; E(1) < 0.74
27.5% identity (56.9% similar) in 51 aa overlap (15-64:9-59)

Waterman-Eggert score: 45; 16.4 bits; E(1) < 0.8
23.0% identity (53.3% similar) in 135 aa overlap (15-148:1-125)

Waterman-Eggert score: 41; 15.4 bits; E(1) < 0.96
36.4% identity (63.6% similar) in 22 aa overlap (148-169:55-76)

Waterman-Eggert score: 39; 14.9 bits; E(1) < 0.99
30.0% identity (62.5% similar) in 40 aa overlap (16-55:178-213)

Waterman-Eggert score: 36; 14.2 bits; E(1) < 1
24.3% identity (59.5% similar) in 37 aa overlap (76-112:313-349)

Waterman-Eggert score: 35; 14.0 bits; E(1) < 1
50.0% identity (80.0% similar) in 10 aa overlap (259-268:10-19)

353 residues in 1 query sequences
400 residues in 1 library sequences

(2) Repeat the alignment with gap penalties of -5 and -1:

Visual output:

Alignment:

Waterman-Eggert score: 402; 30.1 bits; E(1) < 0.00012
31.5% identity (56.6% similar) in 311 aa overlap (2-274:123-394)

Waterman-Eggert score: 270; 18.9 bits; E(1) < 0.25
24.7% identity (50.0% similar) in 446 aa overlap (15-352:1-399)

Waterman-Eggert score: 225; 15.0 bits; E(1) < 0.98
26.3% identity (50.8% similar) in 388 aa overlap (17-351:5-326)

Waterman-Eggert score: 214; 14.1 bits; E(1) < 1
26.3% identity (44.9% similar) in 323 aa overlap (8-303:164-396)

Waterman-Eggert score: 211; 13.9 bits; E(1) < 1
23.4% identity (46.7% similar) in 418 aa overlap (2-332:33-395)

353 residues in 1 query sequences
400 residues in 1 library sequences

With the higher gap penalties (-12/-2), the plot shows many short lines; with the lower gap penalties (-5/-1), it shows fewer, longer lines. The best alignment grew from 230 aa (score 214, 28.7% identity) to 311 aa (score 402, 31.5% identity), but its bit score fell from 58.4 to 30.1 and its E-value rose from < 3.7e-13 to < 0.00012.
The reason is that when the penalties are high, the program maximizes the score by selecting short local segments and avoiding gap penalties wherever possible.
When the penalties are low, gaps are cheap, so the program no longer avoids them and extends each alignment until it approaches a global alignment. Because cheap gaps also raise the scores of unrelated sequences, the larger raw score is statistically less significant.
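The effect of the two penalty settings can be made concrete with a small sketch of the affine gap cost. It assumes that a gap of k residues costs open + k × extend, which is a labeled assumption about the convention; some programs instead charge open + (k - 1) × extend. The function name is hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost, assuming cost = open + length * extend."""
    return gap_open + gap_extend * length


if __name__ == "__main__":
    settings = {"high (-12/-2)": (-12, -2), "low (-5/-1)": (-5, -1)}
    for k in (1, 5, 10, 20):
        costs = ", ".join(
            f"{name}: {gap_cost(k, o, e)}" for name, (o, e) in settings.items()
        )
        print(f"gap of {k:2d} residues -> {costs}")
```

Under this assumption a 10-residue gap costs 32 score units with the high penalties but only 15 with the low ones, so far fewer matching residues are needed to make bridging two similar segments worthwhile.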
4. A complex sample contains DNA from many species of bacteria. The species can be divided into two broad categories: (a) high GC content and (b) low GC content. In (a), the probability that a GC-rich sequence is obtained by randomly sequencing part of the genome is 0.8. In (b), it is 0.1. Assume that the sample contains both bacterial types in the proportion 1:3 (prior knowledge). Suppose that a sequence obtained randomly from the sample is observed to be GC-rich. What is the probability that it came from (a), and from (b)?

ANSWER:

Define the events:
X: the sequence comes from category (a), with P(X) = 1/4 = 0.25
Y: the sequence comes from category (b), with P(Y) = 3/4 = 0.75
Z: the sequence is observed to be GC-rich, with P(Z|X) = 0.8 and P(Z|Y) = 0.1

The total probability of observing a GC-rich sequence is
P(Z) = P(X)P(Z|X) + P(Y)P(Z|Y) = 0.25×0.8 + 0.75×0.1 = 0.275

By Bayes' theorem:
P(X|Z) = P(X)P(Z|X)/P(Z) = 0.25×0.8/(0.25×0.8 + 0.75×0.1) = 72.73%
P(Y|Z) = P(Y)P(Z|Y)/P(Z) = 0.75×0.1/(0.25×0.8 + 0.75×0.1) = 27.27%
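The Bayes calculation can be checked numerically with a short sketch. The function name and the dictionary keys "a" and "b" are hypothetical labels for the two categories; the priors and likelihoods come from the problem statement.

```python
def posteriors(priors, likelihoods):
    """Return P(category | GC-rich) for each category via Bayes' theorem."""
    # Joint probability of each category together with the GC-rich observation.
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Total probability of observing a GC-rich sequence.
    evidence = sum(joint.values())
    return {c: p / evidence for c, p in joint.items()}


if __name__ == "__main__":
    priors = {"a": 0.25, "b": 0.75}      # 1:3 proportion of the two types
    likelihoods = {"a": 0.8, "b": 0.1}   # P(GC-rich | category)
    for category, p in posteriors(priors, likelihoods).items():
        print(f"P({category} | GC-rich) = {p:.2%}")
```

Although category (b) is three times as abundant, the GC-rich observation is eight times as likely under (a), so the posterior favors (a).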