Lanzhou University Bioinformatics Lecture Slides: 6 - Genome Assembly - Wang Mingcheng (王明成)
(1) Close gaps using paired-end information (one end maps to a contig and the other end falls in the gap).
(2) Perform a local assembly of the reads that fall in the gap to obtain a sequence connecting the edges of the two contigs. Note: gap closure here also extends contigs.
GTTATTGATCTCC
AGATC
ATCTT TCTTG CTTGT GATCT
TGATC TTGAT ATTGA
TTGTT TATTG
TGTTA TTATT
GTTAT
ATCTC TCTCC
1. Slide a window along each read to extract K-mers, storing the links between neighboring K-mers.
Read 1 and read 2 of two paired-end reads that were completely identical (and thus considered to be the products of PCR duplication).
Error correction before assembly
2. If a K-mer already exists, merge its links with those of the first occurrence.
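The two construction steps (slide a window over each read, then merge the links of K-mers that were already seen) can be sketched as follows. This is a minimal illustration using the two example reads from these slides with K = 5. The function name `build_kmer_graph` and the dictionary-of-sets layout are assumptions made for the example, not SOAPdenovo's actual data structures, and reverse complements are ignored.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    links = defaultdict(set)
    for read in reads:
        previous = None
        # Step 1: slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Step 2: a k-mer seen before reuses its existing entry,
            # so links from different reads are merged automatically.
            links[kmer]
            if previous is not None:
                links[previous].add(kmer)
            previous = kmer
    return dict(links)


reads = ["AGATCTTGTTATT", "GTTATTGATCTCC"]
graph = build_kmer_graph(reads, k=5)
for kmer, followers in graph.items():
    print(kmer, "->", sorted(followers))
```

The K-mer GATCT occurs in both reads, so its entry ends up with two outgoing links (ATCTT and ATCTC), which is the branch point visible in the example graph.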
De Bruijn graph
2. Graph simplification
AGATC
ATCTT TCTTG GATCT
TGATC TTGAT
CTTGT ATTGA
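Because L/G is very small and the number of reads n_r is very large, the number of times a given K-mer is sequenced approximately follows a Poisson distribution. So, with d_k the expected K-mer depth and n_k the total number of K-mers in the reads:
d_k = (L-K+1)/G * n_r
n_k = (L-K+1) * n_r
then G = n_k / d_k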
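The relation G = n_k / d_k can be checked with a few lines of arithmetic. The sketch below assumes all reads have the same length L and that d_k has already been obtained (in practice it is usually read off the main peak of the K-mer frequency histogram). The function names and the numbers in the example are hypothetical.

```python
def total_kmers(n_reads, read_length, k):
    """n_k = (L - K + 1) * n_r: number of K-mers contained in all reads."""
    return (read_length - k + 1) * n_reads


def expected_kmer_depth(n_reads, read_length, k, genome_size):
    """d_k = (L - K + 1) / G * n_r: expected number of times a K-mer is seen."""
    return (read_length - k + 1) / genome_size * n_reads


def estimate_genome_size(n_reads, read_length, k, kmer_depth):
    """G = n_k / d_k."""
    return total_kmers(n_reads, read_length, k) / kmer_depth


# Hypothetical survey: 10 million reads of 100 bp, K = 17, observed depth 21.
size = estimate_genome_size(n_reads=10_000_000, read_length=100, k=17, kmer_depth=21)
print(f"Estimated genome size: {size:,.0f} bp")
```

With these made-up inputs, n_k is 84 * 10,000,000 = 840,000,000, so the estimate is 40,000,000 bp.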
Quality control and filtering
◦ Genome assembly
Wang Mingcheng (王明成) 2015.10.29
I. Genome survey
K-mer: a contiguous nucleotide sequence of length K bp.
Suppose every K-mer in the genome is unique, so a genome of size G yields about G different K-mers. When a read of length L is generated, the probability that a given K-mer is sequenced is (L-K+1)/G.
II. SOAPdenovo algorithm
SOAPdenovo was developed to assemble large genomes such as human, and it also works well for small genomes such as bacteria. It includes five major steps:
Overlap → contig: Ge + en + no + om + mi + ic + cs → Genomics
Paired-end → scaffold: the pair nom ... sem links the contigs Genome and assembly into Genome****assembly
De bruijn graph construction
Reads : AGATCTTGTTATT
TTGTT TATTG
TGTTA
TTATT
GTTAT
ATCTC
TCTCC
[Figure: simplified graph with four nodes, AGATC, GATCT, ATCTTGTTATTGATC and ATCTCC, joined by arcs numbered 1 to 4]
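The simplified graph for the two example reads (AGATCTTGTTATT and GTTATTGATCTCC, K = 5) can be reproduced by merging every chain of K-mers that has no branching into a single node. The sketch below shows only this path-merging step. It is not SOAPdenovo's implementation, it ignores reverse complements and error-removal steps, and the names `build_kmer_graph` and `compact` are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Return successor and predecessor sets for every k-mer in the reads."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            succ[kmer]  # make sure every node exists in both maps
            pred[kmer]
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def compact(succ, pred):
    """Merge each maximal non-branching chain of k-mers into one sequence.

    Isolated cycles (no branching node at all) are not reported by this sketch.
    """
    def mergeable(a, b):
        # a -> b is merged only if it is a's sole exit and b's sole entry.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    sequences, visited = [], set()
    for node in list(succ):
        parents = pred[node]
        if len(parents) == 1 and mergeable(next(iter(parents)), node):
            continue  # node lies inside a chain, not at its start
        seq, current = node, node
        visited.add(node)
        while len(succ[current]) == 1:
            nxt = next(iter(succ[current]))
            if not mergeable(current, nxt) or nxt in visited:
                break
            seq += nxt[-1]  # consecutive k-mers overlap by k - 1 bases
            visited.add(nxt)
            current = nxt
        sequences.append(seq)
    return sequences


succ, pred = build_kmer_graph(["AGATCTTGTTATT", "GTTATTGATCTCC"], k=5)
nodes = compact(succ, pred)
print(sorted(nodes))
```

GATCT stays a separate node because it has two incoming and two outgoing links; resolving that repeat requires extra information such as the reads themselves.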
Read 1: AGATCTTGTTATT
Read 2: GTTATTGATCTCC
Set the -R parameter (resolve repeats using reads)
Contigs: GATCTTGTTATTGATCT GATCTCC AGATCT
Contigs: AGATCTTGTTATTGATCTCC
3. Paired-end read mapping to contigs
4. Construct scaffolds
Note: 1. For mate-pair libraries (>= 2 kb), the order of the two reads is reversed. 2. A reliable link is built between two contigs when the paired-end/mate-pair read
• De Bruijn graph construction
• Graph simplification and obtaining contigs
• Paired-end read mapping to contigs
• Scaffold construction
• Gap filling with paired-end reads
1. De Bruijn graph construction
Sequence assembly refers to aligning and merging fragments from a much longer DNA sequence in order to reconstruct the original sequence.
support exceeds the threshold that was set. 3. The gap size is estimated from the insert size of each read pair.
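The link and gap-size rules in the note can be illustrated with a small calculation. This is a hedged sketch, not SOAPdenovo's code: it assumes a pair whose first read starts at position `start_a` on contig A and whose second read ends at position `end_b` on contig B, so that insert size = (len_a - start_a) + gap + end_b. The function name `estimate_gap`, the coordinate convention, the use of a median, and the `min_support` default are all assumptions made for the example.

```python
from statistics import median


def estimate_gap(pairs, insert_size, len_a, min_support=3):
    """Estimate the gap between contigs A and B from the pairs that span it.

    pairs: list of (start_a, end_b) tuples, one per supporting read pair.
    Returns None when too few pairs support the link (assumed rule: fewer
    than min_support pairs means the link is not reliable).
    """
    if len(pairs) < min_support:
        return None
    # Each pair gives one estimate: what is left of the insert after
    # removing the parts that lie on contig A and on contig B.
    gaps = [insert_size - (len_a - start_a) - end_b for start_a, end_b in pairs]
    return median(gaps)


# Hypothetical data: contig A is 1000 bp long, the library insert size is 500 bp.
pairs = [(800, 150), (850, 210), (900, 250)]
print(estimate_gap(pairs, insert_size=500, len_a=1000))
```

Here the three pairs give estimates of 150, 140 and 150 bp, so the reported gap is 150 bp. With only two supporting pairs the function returns None and no link is built.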
5. Gap closure
• Get reads located in the gap and then do local assembly.
Reads in which 'N' bases make up more than 10% of the read length.
Reads from short insert-size libraries with more than 65% of bases at quality ≤ 7, and reads from large insert-size libraries with more than 80% of bases at quality ≤ 7.
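The two filtering criteria above (N content and fraction of low-quality bases) translate directly into a small predicate. The sketch assumes each read is given as a sequence string plus a list of already decoded integer quality scores. The function name `is_low_quality` and its parameters are hypothetical, and only the thresholds come from the slides.

```python
def is_low_quality(seq, quals, large_insert=False):
    """Return True if a read should be discarded under the two criteria.

    seq:   read sequence, assumed non-empty.
    quals: per-base quality scores as integers, same length as seq.
    large_insert: True for reads from large insert-size libraries.
    """
    # Criterion 1: 'N' makes up more than 10% of the read.
    if seq.upper().count("N") / len(seq) > 0.10:
        return True
    # Criterion 2: too many bases with quality <= 7
    # (more than 65% for short-insert, more than 80% for large-insert libraries).
    low_fraction = sum(q <= 7 for q in quals) / len(quals)
    limit = 0.80 if large_insert else 0.65
    return low_fraction > limit


# Hypothetical reads: (sequence, qualities)
reads = [
    ("ACGTACGTAC", [30] * 10),           # clean read
    ("ACGTNNACGT", [30] * 10),           # 20% N
    ("ACGTACGTAC", [5] * 7 + [30] * 3),  # 70% low-quality bases
]
kept = [seq for seq, quals in reads if not is_low_quality(seq, quals)]
print(kept)
```

The third read is discarded as a short-insert read but would be kept with `large_insert=True`, because 70% is below the 80% limit for those libraries.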