数据处理光谱分析与数据挖掘
• Global alignable
– pairwise – multiple
• Prove
– optimization
Global alignable
Extreme center star
MapReduce for Center Star Frame
input fasta file
local file system
sum up
update
How to set k for k-band?
Detecting the matching region with Trie
S=AGACGTAGCCTAGCAGCCCGTACT
S1=AGACGT S2=AGCCTA S3=GCAGCC S4=CGTACT
T=AGACCTAGCTAGCAGCCCGTACACT
2. Center star strategy
S1
S3
S2
S4
S5
tree alignment
S1
S3
S5
S2
S4
Center star strategy
Center Star for Multiple Sequence Alignment
input sequences
… search
final result
HDFS
Software
/software/halign/ /soft/halign/
Quan Zou, Qinghua Hu, Maozu Guo, Guohua Wang. HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy. Bioinformatics. 2015,31(15): 2475-2481
Suffix Tree
S1=AGACGTAGCCTAGCAGCCCGTACT
S2= GACGTAGCCTAGCAGCCCGTACT
S3= ACGTAGCCTAGCAGCCCGTACT
S4= CGTAGCCTAGCAGCCCGTACT
S5=
GTAGCCTAGCAGCCCGTACT
S6=
TAGCCTAGCAGCCCGTACT
Suffix tree Trie
center center
star
star
24.8s
15.6s
K-band center star
10.9s
Extreme Extreme
Trie
suffix tree
19.7s
5.4s
• Our output 1558KB • ClustalΩ 1627KB
Discuss: How to measure the similarity?
… Application
Techniques for similar DNA MSA
j
0
1
2
i
c
a
K-band
0
0
-1
1a
-1 -1
1
2c
-2
1
0
3g
0
0
4c
-1
5t
6g
1. k-band Dynamic Programming
3
4
5
t
Байду номын сангаас
g
t
-4
-5
0
-1
0
-1
-1
2
-1
1
1
1
0
3
3
2
Techniques for similar DNA MSA
Center Star for Multiple Sequence Alignment
input sequences
trie trees
… search
final result
sum up
update
From Trie to Suffix Tree
Trie
S1=AGACGT S2=AGCCTA S3=GCAGCC S4=CGTACT
Multiple Sequence Alignment(MSA): What & Where
• Different from Mapping, Assembly, BLAST
Multiple Sequence Alignment(MSA): What & Where
• Different from Mapping, Assembly, BLAST
input sequences
… search
final result
sum up
update
Experiments
• 100 human mitochondria genome sequences • 16k length (1555KB)
Running time
Center Star
12933.2s
Multiple Sequence Alignment
Phylogenetic tree
Multiple DNA Sequence Alignment
Multiple Similar DNA Sequence Alignment
Our Focus
Virus sequences
Population SNV calling
– BLAST: Basic Local Alignment Search Tool
Query
Database
Output
Multiple Sequence Alignment(MSA): What & Where
input
Output
Multiple Sequence Alignment(MSA): What & Where
S7=
… AGCCTAGCAGCCCGTACT
Greedy search with suffix tree
S=GTCCGAAGCTCCGG
T=GTCCTGAAGCTCCGT 1234567890123456
(1,1,4) (5,6,9)
Extreme MSA for Very Similar DNA Sequences