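Introduction

SOAPaligner/soap2 is a core member of SOAP (Short Oligonucleotide Analysis Package) and an upgraded version of the original SOAP aligner. The new program improves both running speed and alignment accuracy for the large data volumes produced by the Illumina/Solexa Genome Analyzer. Compared with soap version 1 it is far faster: aligning 1M reads against the human reference genome takes only 2 minutes. Another improvement in soap2 is that it supports reads of different lengths at the same time.

SOAPaligner achieves its time and space efficiency through optimized data structures and algorithms. Its core algorithm and index data structure (2way-BWT) were implemented by the computer science algorithms research group at the University of Hong Kong (T.W. Lam, Alan Tam, Simon Wong, Edward Wu and S.M. Yiu).

System requirements

Hardware:
a) A 64-bit x86-64 CPU with SSE support
b) 8 GB of main memory (for the human genome)
c) 8 GB of disk space (for the human genome)

Software: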
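Example (building the reference index with 2bwt-builder):

$ /leofs/noncode/xcl/SOAP2/2bwt-builder /leofs/noncode/xcl/References/Human/hg19/hg19.fa

This generates 13 index files that share the prefix hg19.fa.index and have the suffixes *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa and *.sai. In other words, it produces the set of index files hg19.fa.index.*.

2. Sequence alignment (paired-end sequencing is used as the example; for single-end sequencing see the instructions on the official SOAP website)

Command format for single-end reads:

./soap -a <reads_a> -D <index_prefix> -o <output>

Command format for paired-end reads:

./soap -a <reads_a> -b <reads_b> -D <index_prefix> -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>

Note: for the -D parameter, the program accepts only the index file prefix described above, i.e. hg19.fa.index.

Example:

$ /leofs/noncode/xcl/SOAP2/soap -a ERR188040_1.fastq -b ERR188040_2.fastq \
-D /leofs/noncode/xcl/References/Human/hg19/hg19.fa.index \
-o PE_output -2 \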
SE_output

3. Parameters:

-D STR   Prefix name of the reference index [*.index]
-a STR   Query file: SE reads, or one end of PE reads
-b STR   Query b file: the other end of PE reads
-o STR   Output file for alignment results
-2 STR   Output file for reads that are mapped but unpaired in PE alignment
-u STR   Output file for unmapped reads [none]
-m INT   Minimal insert size allowed for PE [400]
-x INT   Maximal insert size allowed for PE [600]
-n INT   Filter out low-quality reads that contain more than INT bp of Ns [5]
-t       Output read IDs instead of read names [none]
-r INT   How to report repeat hits: 0 = none; 1 = random one; 2 = all [1]
-R       RF alignment for long insert size (>= 2k bps) PE data [none]; when the option is not given, FR alignment is used
-l INT   For long reads with a high error rate at the 3'-end that cannot be aligned over their whole length: first align the 5' INT bp subsequence as a seed [256]; the default uses the whole length of the read
-v INT   Total number of mismatches allowed in one read [2]
-M INT   Match mode for each read, or for the seed part of a read, which should not contain more than 2 mismatches [4]
           0: exact match only
           1: 1 mismatch only
           2: 2 mismatches only
           3: [gap] (coming soon)
           4: find the best hits
-p INT   Number of threads [1]
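
Because -D takes only the index prefix, a missing or misnamed index file is an easy mistake to make. The following is a minimal sketch, not part of SOAP itself, that checks whether all 13 index files listed in this post exist for a given prefix before soap is run. The names SOAP_INDEX_SUFFIXES, soap_missing_index_files and soap_index_complete are hypothetical helpers introduced here for illustration. The suffix list is copied from the 2bwt-builder output described above and assumes your build produces the same set.

```shell
#!/usr/bin/env bash
# Sketch: check that every index file named in this post exists for a prefix.
# All names below are hypothetical helpers, not part of the SOAP distribution.

# The 13 suffixes that 2bwt-builder is described as producing.
SOAP_INDEX_SUFFIXES="amb ann bwt fmv hot lkt pac rev.bwt rev.fmv rev.lkt rev.pac sa sai"

# Print the suffix of each missing index file, one per line.
# Argument: the index prefix, e.g. /path/to/hg19.fa.index
soap_missing_index_files() {
    local prefix="$1" suffix
    for suffix in $SOAP_INDEX_SUFFIXES; do
        [ -e "${prefix}.${suffix}" ] || printf '%s\n' "$suffix"
    done
}

# Succeed (exit status 0) only when none of the 13 files is missing.
soap_index_complete() {
    [ -z "$(soap_missing_index_files "$1")" ]
}
```

One possible use is to call soap_index_complete with the same prefix that will be passed to -D, and start the alignment only if it succeeds. If it fails, soap_missing_index_files shows which suffixes are absent.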