| Class | Bio::FastaFormat |
| In: |
lib/bio/db/fasta.rb
(CVS)
|
| Parent: | DB |
Treats a FASTA formatted entry, such as:
>id and/or some comments <== comment line ATGCATGCATGCATGCATGCATGCATGCATGCATGC <== sequence lines ATGCATGCATGCATGCATGCATGCATGCATGCATGC ATGCATGCATGC
The precedent ’>’ can be omitted and the trailing ’>’ will be removed automatically.
f_str = <<END >sce:YBR160W CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit [EC:2.7.1.-] [SP:CC28_YEAST] MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEG VPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYME GIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNL KLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGC IFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFP QWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES >sce:YBR274W CHK1; probable serine/threonine-protein kinase [EC:2.7.1.-] [SP:KB9S_YEAST] MSLSQVSPLPHIKDVVLGDTVGQGAFACVKNAHLQMDPSIILAVKFIHVP TCKKMGLSDKDITKEVVLQSKCSKHPNVLRLIDCNVSKEYMWIILEMADG GDLFDKIEPDVGVDSDVAQFYFQQLVSAINYLHVECGVAHRDIKPENILL DKNGNLKLADFGLASQFRRKDGTLRVSMDQRGSPPYMAPEVLYSEEGYYA DRTDIWSIGILLFVLLTGQTPWELPSLENEDFVFFIENDGNLNWGPWSKI EFTHLNLLRKILQPDPNKRVTLKALKLHPWVLRRASFSGDDGLCNDPELL AKKLFSHLKVSLSNENYLKFTQDTNSNNRYISTQPIGNELAELEHDSMHF QTVSNTQRAFTSYDSNTNYNSGTGMTQEAKWTQFISYDIAALQFHSDEND CNELVKRHLQFNPNKLTKFYTLQPMDVLLPILEKALNLSQIRVKPDLFAN FERLCELLGYDNVFPLIINIKTKSNGGYQLCGSISIIKIEEELKSVGFER KTGDPLEWRRLFKKISTICRDIILIPN END f = Bio::FastaFormat.new(f_str) puts "### FastaFormat" puts "# entry" puts f.entry puts "# entry_id" p f.entry_id puts "# definition" p f.definition puts "# data" p f.data puts "# seq" p f.seq puts "# seq.type" p f.seq.type puts "# length" p f.length puts "# aaseq" p f.aaseq puts "# aaseq.type" p f.aaseq.type puts "# aaseq.composition" p f.aaseq.composition puts "# aalen" p f.aalen
| DELIMITER | = | RS = "\n>" | Entry delimiter in flatfile text. | |
| DELIMITER_OVERRUN | = | 1 | (Integer) excess read size included in DELIMITER. |
| data | [RW] | The seuqnce lines in text. |
| definition | [RW] | The comment line of the FASTA formatted data. |
| entry_overrun | [R] |
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
# File lib/bio/db/fasta.rb, line 155 def initialize(str) @definition = str[/.*/].sub(/^>/, '').strip # 1st line @data = str.sub(/.*/, '') # rests @data.sub!(/^>.*/m, '') # remove trailing entries for sure @entry_overrun = $& end
Returens the length of Bio::Sequence::AA.
# File lib/bio/db/fasta.rb, line 245 def aalen self.aaseq.length end
Returens the Bio::Sequence::AA.
# File lib/bio/db/fasta.rb, line 240 def aaseq Sequence::AA.new(seq) end
Parsing FASTA Defline (using identifiers method), and shows accession numbers. It returns an array of strings.
# File lib/bio/db/fasta.rb, line 298 def accessions identifiers.accessions end
Parsing FASTA Defline (using identifiers method), and shows a possibly unique identifier. It returns a string.
# File lib/bio/db/fasta.rb, line 277 def entry_id identifiers.entry_id end
Parsing FASTA Defline (using identifiers method), and shows GI/locus/accession/accession with version number. If a entry has more than two of such IDs, only the first ID are shown. It returns a string or nil.
# File lib/bio/db/fasta.rb, line 286 def gi identifiers.gi end
Parsing FASTA Defline, and extract IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. It returns a Bio::FastaDefline instance.
# File lib/bio/db/fasta.rb, line 267 def identifiers unless defined?(@ids) then @ids = FastaDefline.new(@definition) end @ids end
Returens the length of Bio::Sequence::NA.
# File lib/bio/db/fasta.rb, line 235 def nalen self.naseq.length end
Returens the Bio::Sequence::NA.
# File lib/bio/db/fasta.rb, line 230 def naseq Sequence::NA.new(seq) end
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
#!/usr/bin/env ruby
require 'bio'
factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
p entry.definition
result = entry.fasta(factory)
result.each do |hit|
print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
p hit.lap_at
end
end
# File lib/bio/db/fasta.rb, line 186 def query(factory) factory.query(@entry) end
Returns a joined sequence line as a String.
# File lib/bio/db/fasta.rb, line 193 def seq unless defined?(@seq) unless /\A\s*^\#/ =~ @data then @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up else a = @data.split(/(^\#.*$)/) i = 0 cmnt = {} s = [] a.each do |x| if /^# ?(.*)$/ =~ x then cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1 else x.tr!(" \t\r\n0-9", '') # lazy clean up i += x.length s << x end end @comment = cmnt @seq = Bio::Sequence::Generic.new(s.join('')) end end @seq end
Returns sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this FastaFormat object might also be changed (but not always be changed) because of efficiency.
# File lib/bio/db/fasta.rb, line 256 def to_seq seq obj = Bio::Sequence.new(@seq) obj.definition = self.definition obj end