Class Bio::FastaFormat
In: lib/bio/db/fasta.rb
Parent: DB

Treats a FASTA formatted entry, such as:

  >id and/or some comments                    <== comment line
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC
  ATGCATGCATGC

The leading '>' can be omitted, and a trailing '>' is removed automatically.

Examples

  f_str = <<END
  >sce:YBR160W  CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit [EC:2.7.1.-] [SP:CC28_YEAST]
  MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEG
  VPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYME
  GIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNL
  KLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGC
  IFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFP
  QWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES
  >sce:YBR274W  CHK1; probable serine/threonine-protein kinase [EC:2.7.1.-] [SP:KB9S_YEAST]
  MSLSQVSPLPHIKDVVLGDTVGQGAFACVKNAHLQMDPSIILAVKFIHVP
  TCKKMGLSDKDITKEVVLQSKCSKHPNVLRLIDCNVSKEYMWIILEMADG
  GDLFDKIEPDVGVDSDVAQFYFQQLVSAINYLHVECGVAHRDIKPENILL
  DKNGNLKLADFGLASQFRRKDGTLRVSMDQRGSPPYMAPEVLYSEEGYYA
  DRTDIWSIGILLFVLLTGQTPWELPSLENEDFVFFIENDGNLNWGPWSKI
  EFTHLNLLRKILQPDPNKRVTLKALKLHPWVLRRASFSGDDGLCNDPELL
  AKKLFSHLKVSLSNENYLKFTQDTNSNNRYISTQPIGNELAELEHDSMHF
  QTVSNTQRAFTSYDSNTNYNSGTGMTQEAKWTQFISYDIAALQFHSDEND
  CNELVKRHLQFNPNKLTKFYTLQPMDVLLPILEKALNLSQIRVKPDLFAN
  FERLCELLGYDNVFPLIINIKTKSNGGYQLCGSISIIKIEEELKSVGFER
  KTGDPLEWRRLFKKISTICRDIILIPN
  END

  f = Bio::FastaFormat.new(f_str)
  puts "### FastaFormat"
  puts "# entry"
  puts f.entry
  puts "# entry_id"
  p f.entry_id
  puts "# definition"
  p f.definition
  puts "# data"
  p f.data
  puts "# seq"
  p f.seq
  puts "# seq.class"
  p f.seq.class
  puts "# length"
  p f.length
  puts "# aaseq"
  p f.aaseq
  puts "# aaseq.class"
  p f.aaseq.class
  puts "# aaseq.composition"
  p f.aaseq.composition
  puts "# aalen"
  p f.aalen

Methods

aalen   aaseq   acc_version   accession   accessions   blast   comment   entry   entry_id   fasta   gi   identifiers   length   locus   nalen   naseq   new   query   seq   to_s   to_seq  

Constants

DELIMITER = RS = "\n>"   Entry delimiter in flatfile text.
DELIMITER_OVERRUN = 1   (Integer) excess read size included in DELIMITER.
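
The two constants can be read together: when a flatfile is split on "\n>", each chunk ends with one character (the '>') that really belongs to the next entry, which is why the overrun is 1. The following is a minimal plain-Ruby sketch of that idea, not the reader BioRuby itself uses; the method name each_fasta_chunk and its parameters are hypothetical.

```ruby
require 'stringio'

# Hypothetical helper: split FASTA text on the "\n>" delimiter and hand the
# one-character overrun (the '>') back to the following chunk.
def each_fasta_chunk(text, delimiter = "\n>", overrun = 1)
  io = StringIO.new(text)
  chunks = []
  carry = ''
  while (raw = io.gets(delimiter))
    chunk = carry + raw
    if chunk.end_with?(delimiter)
      carry = chunk[-overrun, overrun]   # the '>' of the next entry
      chunk = chunk[0...-overrun]        # drop it from the current entry
    else
      carry = ''
    end
    chunks << chunk
  end
  chunks
end

p each_fasta_chunk(">a one\nATGC\n>b two\nGGCC\n")
```

Each returned chunk starts with '>' and can be treated as one complete entry.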

Attributes

data  [RW]  The sequence lines in text.
definition  [RW]  The comment line of the FASTA formatted data.
entry_overrun  [R]  Text of any trailing entries stripped from the input by new (nil if there were none).

Public Class methods

Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.

[Source]

# File lib/bio/db/fasta.rb, line 155
    def initialize(str)
      @definition = str[/.*/].sub(/^>/, '').strip       # 1st line
      @data = str.sub(/.*/, '')                         # rests
      @data.sub!(/^>.*/m, '')   # remove trailing entries for sure
      @entry_overrun = $&
    end
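
To see what the three regular expressions above do to a multi-entry string, the same steps can be run as a standalone function. This is an illustrative sketch in plain Ruby that does not require the library; the name split_first_entry is hypothetical, and String#slice! is used here in place of the sub! and $& pair.

```ruby
# Hypothetical helper mirroring the steps in the constructor shown above.
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip   # first line, without the leading '>'
  data = str.sub(/.*/, '')                     # everything after the first line
  overrun = data.slice!(/^>.*/m)               # cut off any following entries (nil if none)
  [definition, data, overrun]
end

definition, data, overrun = split_first_entry(">seq1 first\nATGC\nATGC\n>seq2 second\nGGCC\n")
p definition
p data
p overrun
```

Only the first entry ends up in definition and data; the remainder of the input is returned separately rather than parsed.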

Public Instance methods

Returns the length of the sequence as a Bio::Sequence::AA.

[Source]

# File lib/bio/db/fasta.rb, line 245
    def aalen
      self.aaseq.length
    end

Returns the sequence as a Bio::Sequence::AA object.

[Source]

# File lib/bio/db/fasta.rb, line 240
    def aaseq
      Sequence::AA.new(seq)
    end

Returns accession number with version.

[Source]

# File lib/bio/db/fasta.rb, line 303
    def acc_version
      identifiers.acc_version
    end

Returns an accession number.

[Source]

# File lib/bio/db/fasta.rb, line 291
    def accession
      identifiers.accession
    end

Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.

[Source]

# File lib/bio/db/fasta.rb, line 298
    def accessions
      identifiers.accessions
    end
blast(factory)

Alias for query

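Returns comments embedded in the sequence data (lines beginning with '#') as a Hash that maps the sequence offset at which each comment appears to its text. Returns nil if the sequence data does not start with a comment line.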

[Source]

# File lib/bio/db/fasta.rb, line 219
    def comment
      seq
      @comment
    end

Returns the stored entry as a FASTA-formatted string (same as to_s).

[Source]

# File lib/bio/db/fasta.rb, line 163
    def entry
      @entry = ">#{@definition}\n#{@data.strip}\n"
    end
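
The output format is simply the definition on a '>' line followed by the stripped sequence text and a final newline. A one-function sketch in plain Ruby, with the hypothetical name format_fasta_entry, makes the normalization of surrounding whitespace visible:

```ruby
# Hypothetical helper: rebuild a FASTA entry from its two parts.
def format_fasta_entry(definition, data)
  ">#{definition}\n#{data.strip}\n"
end

print format_fasta_entry("seq1 first", "\nATGC\nATGC\n\n")
```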

Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.

[Source]

# File lib/bio/db/fasta.rb, line 277
    def entry_id
      identifiers.entry_id
    end
fasta(factory)

Alias for query

Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has more than one such ID, only the first is returned. Returns a string or nil.

[Source]

# File lib/bio/db/fasta.rb, line 286
    def gi
      identifiers.gi
    end

Parses the FASTA defline and extracts IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or ":"-separated IDs. Returns a Bio::FastaDefline instance.

[Source]

# File lib/bio/db/fasta.rb, line 267
    def identifiers
      unless defined?(@ids) then
        @ids = FastaDefline.new(@definition)
      end
      @ids
    end
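
The two ID styles mentioned above differ only in their separator. The sketch below is a deliberately simplified plain-Ruby illustration and is not the algorithm used by Bio::FastaDefline; the method name split_defline_ids is hypothetical, and the NCBI-style identifier in the example is made up.

```ruby
# Hypothetical helper: split the first word of a defline into ID tokens.
# '|'-separated words are treated as NCBI-style, anything else as ':'-separated.
def split_defline_ids(definition)
  first_word = definition.split(/\s+/, 2).first.to_s
  if first_word.include?('|')
    first_word.split('|').reject(&:empty?)
  else
    first_word.split(':')
  end
end

p split_defline_ids("sce:YBR160W  CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit")
p split_defline_ids("gi|12345|ref|NP_000001.1| example protein")   # made-up identifiers
```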

Returns sequence length.

[Source]

# File lib/bio/db/fasta.rb, line 225
    def length
      seq.length
    end

Returns locus.

[Source]

# File lib/bio/db/fasta.rb, line 308
    def locus
      identifiers.locus
    end

Returns the length of the sequence as a Bio::Sequence::NA.

[Source]

# File lib/bio/db/fasta.rb, line 235
    def nalen
      self.naseq.length
    end

Returns the sequence as a Bio::Sequence::NA object.

[Source]

# File lib/bio/db/fasta.rb, line 230
    def naseq
      Sequence::NA.new(seq)
    end

Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.

  #!/usr/bin/env ruby
  require 'bio'

  factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
  flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
  flatfile.each do |entry|
    p entry.definition
    result = entry.fasta(factory)
    result.each do |hit|
      print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
      p hit.lap_at
    end
  end

[Source]

# File lib/bio/db/fasta.rb, line 186
    def query(factory)
      factory.query(@entry)
    end

Returns the sequence lines joined into a single Bio::Sequence::Generic string, with whitespace and digits removed.

[Source]

# File lib/bio/db/fasta.rb, line 193
    def seq
      unless defined?(@seq)
        unless /\A\s*^\#/ =~ @data then
          @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
        else
          a = @data.split(/(^\#.*$)/)
          i = 0
          cmnt = {}
          s = []
          a.each do |x|
            if /^# ?(.*)$/ =~ x then
              cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
            else
              x.tr!(" \t\r\n0-9", '') # lazy clean up
              i += x.length
              s << x
            end
          end
          @comment = cmnt
          @seq = Bio::Sequence::Generic.new(s.join(''))
        end
      end
      @seq
    end
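
The clean-up and comment bookkeeping above can be tried without the library. The following plain-Ruby sketch is a simplified variant: it works line by line and always looks for '#' lines, whereas the source above only does so when the data begins with a comment. The method name join_sequence_lines is hypothetical.

```ruby
# Hypothetical helper: strip whitespace and digits from sequence lines and
# record each '#' comment under the sequence offset at which it appears.
def join_sequence_lines(data)
  comments = {}
  pieces = []
  offset = 0
  data.each_line do |line|
    if (m = line.match(/^# ?(.*)$/))
      comments[offset] = comments[offset] ? "#{comments[offset]}\n#{m[1]}" : m[1]
    else
      cleaned = line.tr(" \t\r\n0-9", '')   # same "lazy clean up" character set
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

seq, comments = join_sequence_lines("\n# signal peptide\nATGC ATGC\n# mature chain\n  10 GGCC\n")
p seq
p comments
```

The hash keys are offsets into the cleaned sequence, so a comment can be related back to the residue that follows it.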
to_s()

Alias for entry

Returns sequence as a Bio::Sequence object.

Note: For efficiency, the returned Bio::Sequence object may share data with this FastaFormat object, so modifying it might (but will not always) also change the sequence or definition stored here.

[Source]

# File lib/bio/db/fasta.rb, line 256
    def to_seq
      seq
      obj = Bio::Sequence.new(@seq)
      obj.definition = self.definition
      obj
    end
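
The note about shared data describes ordinary Ruby object aliasing. The sketch below shows, in plain Ruby, one way such sharing can arise; it makes no claim about how Bio::Sequence stores its data, and the Holder struct and sharing_demo method are hypothetical.

```ruby
# Hypothetical stand-in for an object that wraps an existing string.
Holder = Struct.new(:seq, :definition)

def sharing_demo
  cached = String.new("ATGC")           # the original object's cached sequence
  holder = Holder.new(cached, "demo")   # wrapper keeps a reference, not a copy

  holder.seq << "GG"                    # in-place change: visible through both names
  after_mutation = cached.dup

  holder.seq = holder.seq + "TT"        # reassignment: the original is left alone
  [after_mutation, cached, holder.seq]
end

p sharing_demo
```

Calling dup on the returned sequence before editing it is a simple way to avoid depending on either behavior.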
