[Repost] How to convert SRA format to FASTQ format

2017-06-01 11:24

http://blog.sina.cn/dpool/blog/u/3164679961

Original author: 胖小妖

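SRA is the format NCBI introduced for storing high-throughput sequencing data, whereas most day-to-day work uses the FASTQ format.
FASTA/FASTQ text files are the raw data of a transcriptome project, and all downstream analysis is run on these files.
NCBI SRA stands for Sequence Read Archive (originally Short Read Archive). Second-generation sequencing data is generally deposited in this database, so you can upload data you sequenced yourself, and if you want to analyze data that others have already sequenced, you should be able to download it directly from this database. The FASTQ quality values stored there have all been converted to ASCII offset 33 (Phred+33).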
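
Because the quality values use ASCII offset 33, a quality line can be turned back into numeric scores by subtracting 33 from each character code. The following is a minimal Python sketch of that idea; the function names and the sample string are made up for illustration and are not part of the SRA Toolkit.

```python
def decode_phred33(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    scores = []
    for ch in quality_line:
        score = ord(ch) - offset
        if score < 0:
            raise ValueError(f"character {ch!r} is below offset {offset}")
        scores.append(score)
    return scores


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    sample = "II5!"  # illustrative quality string, not real data
    for q in decode_phred33(sample):
        print(q, error_probability(q))
```

Passing `offset=64` instead would decode the older Phred+64 encoding, which is why the offset is kept as a parameter.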

To convert SRA to FASTQ, download the corresponding software (the SRA Toolkit) from
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
The toolkit documentation and the fastq-dump page are at
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Tool: fastq-dump

Name: fastq-dump - dump sra data in fastq format
Usage: fastq-dump [options] [ -A ] <accession>
fastq-dump [op

Input:
-A	|	--accession <accession>	Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
		--table <table-name>	Table name within cSRA object, default is 'SEQUENCE'
Processing:
Read Splitting. Sequence data may be used in raw form or split into individual reads
		--split-spot	Split spots into individual reads
Full Spot Filters. Applied to the full spot independently of --split-spot
-N	|	--minSpotId <rowid>	Minimum spot id
-X	|	--maxSpotId <rowid>	Maximum spot id
		--spot-groups <[list]>	Filter by SPOT_GROUP (member): name[,...]
-W	|	--clip	Apply left and right clips
Common Filters. Applied to spots when --split-spot is not set, otherwise to individual reads
-M	|	--minReadLen <len>	Minimum read length to output, default is 25
-R	|	--read-filter <[filter]>	Split into files by READ_FILTER value. Optionally filter by value: pass|reject|criteria|redacted
-E	|	--qual-filter	Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
Filters based on alignments. Filters are active when alignment data are present
		--aligned	Dump only aligned sequences
		--unaligned	Dump only unaligned sequences
		--aligned-region <name[:from-to]>	Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: 'chr1' or '1'). 'from' and 'to' are 1-based coordinates
		--matepair-distance <from-to|unknown>	Filter by distance between matepairs. Use 'unknown' to find matepairs split between the references. Use from-to to limit matepair distance on the same reference
Filters for individual reads. Applied only with --split-spot set
		--skip-technical	Applied only with --split-spot set. Dump only biological reads
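
The `--aligned-region` value has the shape `name[:from-to]` with 1-based coordinates. As a rough Python sketch of how such a string could be taken apart before handing it to the tool, and how it maps onto Python's 0-based, half-open slices: this assumes that `to` is inclusive, which the help text does not state, and both function names are hypothetical.

```python
def parse_region(region):
    """Split 'name[:from-to]' into (name, start, end).

    start and end stay 1-based as in the help text; both are None when
    only a reference name is given.
    """
    name, sep, span = region.rpartition(":")
    if not sep:
        return region, None, None
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected 'from-to' after ':' in {region!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based range in {region!r}")
    return name, start, end


def to_zero_based(start, end):
    """1-based inclusive (assumed) -> 0-based half-open, for slicing in Python."""
    return start - 1, end


if __name__ == "__main__":
    print(parse_region("chr1"))
    print(parse_region("NC_000001.10:100-200"))
```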
Output:
-O	|	--outdir <path>	Output directory, default is working directory '.'
-Z	|	--stdout	Output to stdout, all split data become joined into single stream
		--gzip	Compress output using gzip
		--bzip2	Compress output using bzip2
Multiple File Options. Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria
		--split-files	Dump each read into separate file. Files will receive suffix corresponding to read number
		--split-3	Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored
-G	|	--spot-group	Split into files by SPOT_GROUP (member name)
-R	|	--read-filter <[filter]>	Split into files by READ_FILTER value. Optionally filter by value: pass|reject|criteria|redacted
-T	|	--group-in-dirs	Split into subdirectories instead of files
-K	|	--keep-empty-files	Do not delete empty files
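
The `--split-3` rule is easier to follow when written out per spot. The sketch below is only a reading of the help text, not the tool's actual code: it assumes the file prefix is the accession, and `split3_targets` is a hypothetical name.

```python
def split3_targets(accession, biological_reads):
    """Map the biological reads of one spot to --split-3 style file names.

    biological_reads: reads of the spot that passed the dumping conditions,
    in order. Returns {file_name: read}.
    """
    if not biological_reads:
        return {}
    if len(biological_reads) == 1:
        # A lone biological read goes to the unsuffixed file.
        return {f"{accession}.fastq": biological_reads[0]}
    # The first two reads go to _1 and _2; any further reads are ignored.
    return {
        f"{accession}_1.fastq": biological_reads[0],
        f"{accession}_2.fastq": biological_reads[1],
    }


if __name__ == "__main__":
    # "SRR000001" is used purely as a placeholder accession.
    print(split3_targets("SRR000001", ["ACGT", "TTGA"]))
    print(split3_targets("SRR000001", ["ACGT"]))
```

Under this reading, a paired-end run yields two mate files plus an optional third file holding reads whose mate is missing.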
Formatting:
Sequence
-C	|	--dumpcs <[cskey]>	Formats sequence using color space (default for SOLiD), 'cskey' may be specified for translation
-B	|	--dumpbase	Formats sequence using base space (default for other than SOLiD)
Quality
-Q	|	--offset <integer>	Offset to use for quality conversion, default is 33
		--fasta <[line width]>	FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)
Defline
-F	|	--origfmt	Defline contains only original sequence name
-I	|	--readids	Append read id after spot id as 'accession.spot.readid' on defline
		--helicos	Helicos style defline
		--defline-seq <fmt>	Defline format specification for sequence
		--defline-qual <fmt>	Defline format specification for quality. <fmt> is string of characters and/or variables. The variables can be one of: $ac - accession, $si - spot id, $sn - spot name, $sg - spot group (barcode), $sl - spot length in bases, $ri - read number, $rn - read name, $rl - read length in bases. '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed. Empty value is empty string or 0 for numeric variables. Ex: @$sn[_$rn]/$ri ('_$rn' is omitted if name is empty)
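
To make the defline `<fmt>` rules concrete, here is a small Python sketch that expands a format string the way the help text describes. It is an interpretation only, not fastq-dump's implementation: it assumes numeric variables count as empty when they are 0, does not handle nested brackets, and `render_defline` is a hypothetical name.

```python
import re

# One token is either an optional group "[...]" or a single "$xx" variable.
TOKEN = re.compile(r"\[([^\[\]]*)\]|\$(ac|si|sn|sg|sl|ri|rn|rl)")
VARIABLE = re.compile(r"\$(ac|si|sn|sg|sl|ri|rn|rl)")
NUMERIC = {"si", "sl", "ri", "rl"}


def _is_empty(name, value):
    if value is None or value == "":
        return True
    return name in NUMERIC and value == 0


def _text(value):
    return "" if value is None else str(value)


def render_defline(fmt, values):
    """Expand $variables in fmt; drop a [...] group if all its variables are empty."""
    def expand(match):
        if match.group(2) is not None:  # plain variable outside a group
            return _text(values.get(match.group(2)))
        body = match.group(1)  # contents of an optional group
        names = VARIABLE.findall(body)
        if names and all(_is_empty(n, values.get(n)) for n in names):
            return ""
        return VARIABLE.sub(lambda m: _text(values.get(m.group(1))), body)

    return TOKEN.sub(expand, fmt)


if __name__ == "__main__":
    fmt = "@$sn[_$rn]/$ri"
    print(render_defline(fmt, {"sn": "spot7", "rn": "", "ri": 1}))
    print(render_defline(fmt, {"sn": "spot7", "rn": "fwd", "ri": 1}))
```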
Other:
-h	|	--help	Output brief explanation of program usage
-V	|	--version	Display the version of the program
-L	|	--log-level <level>	Logging level as number or enum string. One of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn
-v	|	--verbose	Increase the verbosity level of the program. Use multiple times for more verbosity
		--report	Control program execution environment report generation (if implemented). One of (never|error|always). Default is error
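
Putting a few of the options listed above together, a conversion could be scripted roughly as follows. This is a sketch under stated assumptions rather than a recommended pipeline: it uses only flags from the listing, `SRR000001` and the `fastq` output directory are placeholders, and the helper names are made up. Actually running it requires `fastq-dump` from the SRA Toolkit on the PATH.

```python
import shlex
import subprocess


def build_fastq_dump_command(accession, outdir=".", split_3=False,
                             gzip=False, min_read_len=None, max_spot_id=None):
    """Assemble a fastq-dump argument list from options shown in the help text."""
    cmd = ["fastq-dump"]
    if split_3:
        cmd.append("--split-3")
    if gzip:
        cmd.append("--gzip")
    if min_read_len is not None:
        cmd += ["--minReadLen", str(min_read_len)]
    if max_spot_id is not None:
        cmd += ["--maxSpotId", str(max_spot_id)]
    cmd += ["--outdir", outdir, accession]
    return cmd


def run_fastq_dump(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder accession and directory; limit to the first 1000 spots
    # so that a trial run stays small.
    cmd = build_fastq_dump_command("SRR000001", outdir="fastq",
                                   split_3=True, gzip=True, max_spot_id=1000)
    print(shlex.join(cmd))  # shows the equivalent shell command line
    # run_fastq_dump(cmd)   # uncomment once fastq-dump is installed
```

Building the command as a list instead of a single string avoids shell quoting problems when paths contain spaces.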
