| Path: | doc/KEGG_API.rd (CVS) |
| Last Update: | Wed Dec 27 22:40:45 +0900 2006 |
$Id: KEGG_API.rd,v 1.5 2006/12/27 13:40:45 k Exp $
Copyright (C) 2003-2006 Toshiaki Katayama <k@bioruby.org>
KEGG API is a web service to use the KEGG system from your program via SOAP/WSDL.
We have been making the ((<KEGG|URL:/kegg/>)) system available at ((<GenomeNet|URL:/>)). KEGG is a suite of databases including GENES, SSDB, PATHWAY, LIGAND, LinkDB, etc. for genome research and related research areas in molecular and cellular biology. These databases and associated computation services are available via WWW and the user interfaces are built on web browsers. Thus, the interfaces are designed to be accessed by humans, not by machines, which makes it troublesome for researchers who want to use KEGG in an automated manner. Besides, from the database developer's side, it is impossible to prepare all the CGI programs that would satisfy a variety of users' needs.
In recent years, the Internet technology for application-to-application communication referred to as the ((<web service|URL:www.oreillynet.com/lpt/a/webservices/2002/02/12/webservicefaqs.html>)) has been improving at a rapid rate. For example, Google, a popular Internet search engine, provides a web service called the ((<Google Web API|URL:www.google.com/apis/>)). The service enables users to develop software that accesses and manipulates a massive amount of web documents that are constantly refreshed. In the field of genome research, a similar kind of web service called ((<DAS|URL:www.biodas.org/>)) (distributed annotation system) has been used on several web sites, including ((<Ensembl|URL:www.ensembl.org/>)), ((<Wormbase|URL:www.wormbase.org/>)), ((<Flybase|URL:www.flybase.org/>)), ((<SGD|URL:www.yeastgenome.org/>)) and ((<TIGR|URL:www.tigr.org/>)).
Given the background and trends noted above, we have started developing a new web service called KEGG API using ((<SOAP|URL:www.w3.org/TR/SOAP/>)) and ((<WSDL|URL:www.w3.org/TR/wsdl20/>)). The service has been tested with ((<Ruby|URL:www.ruby-lang.org/>)) (Ruby 1.8.2 or Ruby 1.6.8 with ((<SOAP4R|URL:raa.ruby-lang.org/project/soap4r/>)) version 1.4.8.1) and ((<Perl|URL:www.perl.org/>)) (((<SOAP::Lite|URL:www.soaplite.com/>)) version 0.55). Although the service has not been tested with clients written in other languages, it should work with any language that can handle SOAP/WSDL.
The ((<BioRuby|URL:bioruby.org/>)) project prepared a Ruby library to handle the KEGG API, so users of the Ruby language should check out the latest release of the BioRuby distribution.
For general information on the KEGG API, see the following page at GenomeNet:
* ((<URL:http://www.genome.jp/kegg/soap/>))
This guide explains how to use the KEGG API in your programs for searching and retrieving data from the KEGG database.
As always, the best way to become familiar with it is by looking at an example. In this document, sample code written in several languages is shown. After understanding the first example, try the other APIs.
First, you have to install the SOAP-related libraries for the programming language of your choice.
In the case of Perl, you need to install the following packages:
* ((<SOAP Lite|URL:http://www.soaplite.com/>)) (tested with 0.60)
* Note: SOAP Lite > 0.60 is reported to have errors in some methods for now.
* ((<MIME-Base64|URL:http://search.cpan.org/author/GAAS/MIME-Base64/>))
* ((<LWP|URL:http://search.cpan.org/author/GAAS/libwww-perl/>))
* ((<URI|URL:http://search.cpan.org/author/GAAS/URI/>))
Here's a first example in Perl.
#!/usr/bin/env perl
use SOAP::Lite;
$wsdl = 'http://soap.genome.jp/KEGG.wsdl';
$serv = SOAP::Lite->service($wsdl);
$offset = 1;
$limit = 5;
$top5 = $serv->get_best_neighbors_by_gene('eco:b0002', $offset, $limit);
foreach $hit (@{$top5}) {
  print "$hit->{genes_id1}\t$hit->{genes_id2}\t$hit->{sw_score}\n";
}
The output will be
eco:b0002    eco:b0002     5283
eco:b0002    ecj:JW0001    5283
eco:b0002    sfx:S0002     5271
eco:b0002    sfl:SF0002    5271
eco:b0002    ecc:c0003     5269
showing that eco:b0002 has a Smith-Waterman score of 5271 with sfl:SF0002 as the 4th hit in the entire KEGG/GENES database (here, prefixes such as "eco" and "sfl" are KEGG organism codes).
The method internally searches KEGG/SSDB (Sequence Similarity Database), which contains information about the amino acid sequence similarities among all protein-coding genes in the complete genomes, together with information about best hits and bidirectional best hits (best-best hits). Gene x in genome A and gene y in genome B are called bidirectional best hits when x is the best hit of query y against all genes in A and vice versa; this relation is often used as an operational definition of ortholog.
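The bidirectional best hit definition above can be illustrated with a short sketch. It is written in Python and is not part of the KEGG API: the function name, the two best-hit tables and the gene names are all hypothetical, standing in for tables you might build from SSDB results.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (x, y) pairs where y is x's best hit and x is y's best hit."""
    pairs = []
    for x, y in best_a_to_b.items():
        # Keep the pair only if the reverse search points back to x.
        if best_b_to_a.get(y) == x:
            pairs.append((x, y))
    return sorted(pairs)

# Hypothetical best-hit tables for genome A -> B and genome B -> A.
best_a_to_b = {"a:g1": "b:h1", "a:g2": "b:h1", "a:g3": "b:h3"}
best_b_to_a = {"b:h1": "a:g1", "b:h3": "a:g9"}

print(bidirectional_best_hits(best_a_to_b, best_b_to_a))
```

Only the pair whose best hits point at each other survives; one-directional best hits such as a:g2 are dropped.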
The next example simply lists the PATHWAY entries for E. coli ("eco") in the KEGG database.
#!/usr/bin/env perl
use SOAP::Lite;
$wsdl = 'http://soap.genome.jp/KEGG.wsdl';
$results = SOAP::Lite
             -> service($wsdl)
             -> list_pathways("eco");
foreach $path (@{$results}) {
  print "$path->{entry_id}\t$path->{definition}\n";
}
This example colors the boxes corresponding to the E. coli genes b1002 and b2388 on the glycolysis pathway of E. coli (path:eco00010).
#!/usr/bin/env perl
use SOAP::Lite;
$wsdl = 'http://soap.genome.jp/KEGG.wsdl';
$serv = SOAP::Lite -> service($wsdl);
$genes = SOAP::Data->type(array => ["eco:b1002", "eco:b2388"]);
$result = $serv -> mark_pathway_by_objects("path:eco00010", $genes);
print $result; # URL of the generated image
If you use KEGG API methods that require arguments of the ArrayOfstring data type, you need the following modifications, depending on the version of SOAP::Lite.
With SOAP::Lite 0.60 and earlier, as in the above example, you always need to convert a Perl array into a SOAP object explicitly by
SOAP::Data->type(array => [value1, value2, .. ])
when you pass an array as the argument for any KEGG API method.
With SOAP::Lite versions after 0.60, you should use version 0.69 or newer, as versions 0.61 through 0.68 contain bugs.
You also need to add the following code to your program to pass arrays of string and/or int data to the SOAP server.
sub SOAP::Serializer::as_ArrayOfstring{
  my ($self, $value, $name, $type, $attr) = @_;
  return [$name, {'xsi:type' => 'array', %$attr}, $value];
}
sub SOAP::Serializer::as_ArrayOfint{
  my ($self, $value, $name, $type, $attr) = @_;
  return [$name, {'xsi:type' => 'array', %$attr}, $value];
}
By adding the above, you can write
$genes = ["eco:b1002", "eco:b2388"];
instead of the following (writing as follows is also permitted).
$genes = SOAP::Data->type(array => ["eco:b1002", "eco:b2388"]);
You can test your setup with the following script for SOAP::Lite v0.69. If it works, the URL of the generated image will be returned.
#!/usr/bin/env perl
use SOAP::Lite +trace => [qw(debug)];
print "SOAP::Lite = ", $SOAP::Lite::VERSION, "\n";
my $serv = SOAP::Lite -> service("http://soap.genome.jp/KEGG.wsdl");
my $genes = ["eco:b1002", "eco:b2388"];
my $result = $serv->mark_pathway_by_objects("path:eco00010", $genes);
print $result, "\n";
# subroutines implicitly used in the above code
sub SOAP::Serializer::as_ArrayOfstring{
  my ($self, $value, $name, $type, $attr) = @_;
  return [$name, {'xsi:type' => 'array', %$attr}, $value];
}
sub SOAP::Serializer::as_ArrayOfint{
  my ($self, $value, $name, $type, $attr) = @_;
  return [$name, {'xsi:type' => 'array', %$attr}, $value];
}
If you are using Ruby 1.8.1 or later, you are ready to use KEGG API as Ruby already supports SOAP in its standard library.
If your Ruby is 1.6.8 or older, you need to install the following:
* ((<SOAP4R|URL:http://raa.ruby-lang.org/list.rhtml?name=soap4r>)) 1.5.1 or later
* One of the following XML processing libraries
  * ((<rexml|URL:http://raa.ruby-lang.org/list.rhtml?name=rexml>))
  * ((<xmlparser|URL:http://raa.ruby-lang.org/list.rhtml?name=xmlparser>))
  * ((<xmlscan|URL:http://raa.ruby-lang.org/list.rhtml?name=xmlscan>))
* ((<date2|URL:http://raa.ruby-lang.org/list.rhtml?name=date2>))
* ((<devel-logger|URL:http://raa.ruby-lang.org/list.rhtml?name=devel-logger>))
* ((<uconv|URL:http://raa.ruby-lang.org/list.rhtml?name=uconv>))
* ((<http-access2|URL:http://raa.ruby-lang.org/list.rhtml?name=http-access2>))
Here's sample code in Ruby with the same functionality as the first Perl example shown above.
#!/usr/bin/env ruby
require 'soap/wsdlDriver'
wsdl = "http://soap.genome.jp/KEGG.wsdl"
serv = SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
serv.generate_explicit_type = true
# uncomment the next line to dump the SOAP transactions for debugging
#serv.wiredump_dev = STDERR
offset = 1
limit = 5
top5 = serv.get_best_neighbors_by_gene('eco:b0002', offset, limit)
top5.each do |hit|
  print hit.genes_id1, "\t", hit.genes_id2, "\t", hit.sw_score, "\n"
end
You may need to iterate to obtain all the results by increasing offset and/or limit.
#!/usr/bin/env ruby
require 'soap/wsdlDriver'
wsdl = "http://soap.genome.jp/KEGG.wsdl"
serv = SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
serv.generate_explicit_type = true
offset = 1
limit = 100
loop do
  results = serv.get_best_neighbors_by_gene('eco:b0002', offset, limit)
  # stop on nil or on an empty array (an empty array is truthy in Ruby)
  break if results.nil? || results.empty?
  results.each do |hit|
    print hit.genes_id1, "\t", hit.genes_id2, "\t", hit.sw_score, "\n"
  end
  offset += limit
end
This iteration is done automatically by the ((<BioRuby|URL:bioruby.org/>)) library, which implements get_all_* methods for this purpose. BioRuby also provides a filter for selecting the needed fields from a complex data type.
#!/usr/bin/env ruby
require 'bio'
serv = Bio::KEGG::API.new
results = serv.get_all_best_neighbors_by_gene('eco:b0002')
results.each do |hit|
  print hit.genes_id1, "\t", hit.genes_id2, "\t", hit.sw_score, "\n"
end
# Same as above but using filter to select fields
fields = [:genes_id1, :genes_id2, :sw_score]
results.each do |hit|
  puts hit.filter(fields).join("\t")
end
# Different filters to pick additional fields for each amino acid sequence
fields1 = [:genes_id1, :start_position1, :end_position1, :best_flag_1to2]
fields2 = [:genes_id2, :start_position2, :end_position2, :best_flag_2to1]
results.each do |hit|
  print "> score: ", hit.sw_score, ", identity: ", hit.identity, "\n"
  print "1:\t", hit.filter(fields1).join("\t"), "\n"
  print "2:\t", hit.filter(fields2).join("\t"), "\n"
end
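To make the offset/limit iteration behind the get_all_* methods explicit, here is a language-neutral sketch in Python. The names `fetch_all` and `fake_fetch` are hypothetical: `fake_fetch` stands in for any KEGG API method that takes 'offset' and 'limit', and is stubbed with a local list so the sketch runs without a SOAP server. It assumes that an exhausted result set comes back as an empty (or missing) list.

```python
def fetch_all(fetch_page, limit=100):
    """Collect every result by advancing a 1-based offset until a page is empty."""
    results = []
    offset = 1
    while True:
        page = fetch_page(offset, limit)
        if not page:  # None or an empty list ends the iteration
            break
        results.extend(page)
        offset += limit
    return results

# Hypothetical data standing in for the hits a remote method would return.
DATA = ["hit%d" % i for i in range(1, 251)]

def fake_fetch(offset, limit):
    """Stub for a remote call: return up to 'limit' items starting at 'offset'."""
    return DATA[offset - 1 : offset - 1 + limit]

all_hits = fetch_all(fake_fetch, limit=100)
print(len(all_hits))
```

Checking for both a missing and an empty page keeps the loop from running forever when the server signals the end of the results with an empty array.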
The Ruby equivalent of the second Perl example described above is
#!/usr/bin/env ruby
require 'bio'
serv = Bio::KEGG::API.new
list = serv.list_pathways("eco")
list.each do |path|
  print path.entry_id, "\t", path.definition, "\n"
end
and the equivalent of the last Perl example is as follows.
#!/usr/bin/env ruby
require 'bio'
serv = Bio::KEGG::API.new
genes = ["eco:b1002", "eco:b2388"]
result = serv.mark_pathway_by_objects("path:eco00010", genes)
print result # URL of the generated image
In the case of Python, you have to install
* ((<SOAPpy|URL:http://pywebsvcs.sourceforge.net/>))
plus some extra packages required for SOAPpy ( ((<fpconst|URL:www.analytics.washington.edu/Zope/projects/fpconst>)), ((<PyXML|URL:pyxml.sourceforge.net/>)) etc.).
Here's sample code using the KEGG API with Python.
#!/usr/bin/env python
from SOAPpy import WSDL
wsdl = 'http://soap.genome.jp/KEGG.wsdl'
serv = WSDL.Proxy(wsdl)
results = serv.get_genes_by_pathway('path:eco00020')
print results
In the case of Java, you need to obtain the Apache Axis library, version axis-1_2alpha or newer (axis-1_1 doesn't work properly with the KEGG API)
* ((<Apache Axis|URL:http://ws.apache.org/axis/>))
and put required jar files in an appropriate directory.
For the binary distribution of the Apache axis-1_2alpha release, copy the jar files stored under the axis-1_2alpha/lib/ to the directory of your choice.
% cp axis-1_2alpha/lib/*.jar /path/to/lib/
You can use the WSDL2Java tool that comes with Apache Axis to generate the classes needed for the KEGG API automatically.
To generate classes and documents for the KEGG API, download the script ((<axisfix.pl|URL:www.genome.jp/kegg/soap/support/axisfix.pl>)) and follow the steps below:
% java -classpath /path/to/lib/axis.jar:/path/to/lib/jaxrpc.jar:/path/to/lib/commons-logging.jar:/path/to/lib/commons-discovery.jar:/path/to/lib/saaj.jar:/path/to/lib/wsdl4j.jar:. org.apache.axis.wsdl.WSDL2Java -p keggapi http://soap.genome.jp/KEGG.wsdl
% perl -i axisfix.pl keggapi/KEGGBindingStub.java
% javac -classpath /path/to/lib/axis.jar:/path/to/lib/jaxrpc.jar:/path/to/lib/wsdl4j.jar:. keggapi/KEGGLocator.java
% jar cvf keggapi.jar keggapi/*
% javadoc -classpath /path/to/lib/axis.jar:/path/to/lib/jaxrpc.jar -d keggapi_javadoc keggapi/*.java
This program does the same job as the Python example (extended to accept a pathway_id as the argument).
import keggapi.*;
class GetGenesByPathway {
    public static void main(String[] args) throws Exception {
        KEGGLocator locator = new KEGGLocator();
        KEGGPortType serv = locator.getKEGGPort();
        String query = args[0];
        String[] results = serv.get_genes_by_pathway(query);
        for (int i = 0; i < results.length; i++) {
            System.out.println(results[i]);
        }
    }
}
This is another example, which uses the ArrayOfSSDBRelation data type.
import keggapi.*;
class GetBestNeighborsByGene {
    public static void main(String[] args) throws Exception {
        KEGGLocator locator = new KEGGLocator();
        KEGGPortType serv = locator.getKEGGPort();
        String query = args[0];
        SSDBRelation[] results = null;
        results = serv.get_best_neighbors_by_gene(query, 1, 50);
        for (int i = 0; i < results.length; i++) {
            String gene1 = results[i].getGenes_id1();
            String gene2 = results[i].getGenes_id2();
            int score = results[i].getSw_score();
            System.out.println(gene1 + "\t" + gene2 + "\t" + score);
        }
    }
}
Compile and execute this program (don't forget to include the keggapi.jar file in your classpath) as follows:
% javac -classpath /path/to/lib/axis.jar:/path/to/lib/jaxrpc.jar:/path/to/lib/wsdl4j.jar:/path/to/keggapi.jar GetBestNeighborsByGene.java
% java -classpath /path/to/lib/axis.jar:/path/to/lib/jaxrpc.jar:/path/to/lib/commons-logging.jar:/path/to/lib/commons-discovery.jar:/path/to/lib/saaj.jar:/path/to/lib/wsdl4j.jar:/path/to/keggapi.jar:. GetBestNeighborsByGene eco:b0002
You may wish to set the CLASSPATH environment variable.
bash/zsh:
% for i in /path/to/lib/*.jar
do
  CLASSPATH="${CLASSPATH}:${i}"
done
% export CLASSPATH
tcsh:
% foreach i ( /path/to/lib/*.jar )
  setenv CLASSPATH ${CLASSPATH}:${i}
end
For the other cases, consult the javadoc pages generated by WSDL2Java.
* ((<URL:http://www.genome.jp/kegg/soap/doc/keggapi_javadoc/>))
Users can use a WSDL file to create a SOAP client driver. The WSDL file for the KEGG API can be found at:
* ((<URL:http://soap.genome.jp/KEGG.wsdl>))
* 'org' is a three-letter (or four-letter) organism code used in KEGG.
The list can be found at (see the description of the list_organisms
method below):
* ((<URL:http://www.genome.jp/kegg/catalog/org_list.html>))
* 'db' is a database name used in GenomeNet service. See the
description of the list_databases method below.
* 'entry_id' is a unique identifier whose format is the combination of
the database name and the identifier of an entry joined by a colon sign
as 'database:entry' (e.g. 'embl:J00231' means an EMBL entry 'J00231').
'entry_id' includes 'genes_id', 'enzyme_id', 'compound_id', 'drug_id',
'glycan_id', 'reaction_id', 'pathway_id' and 'motif_id' described below.
* 'genes_id' is a gene identifier used in KEGG/GENES which consists of
'keggorg' (the KEGG organism code) and a gene name (e.g. 'eco:b0001' means an E. coli gene 'b0001').
* 'enzyme_id' is an enzyme identifier consisting of database name 'ec'
and an enzyme code used in KEGG/LIGAND ENZYME database.
(e.g. 'ec:1.1.1.1' means an alcohol dehydrogenase enzyme)
* 'compound_id' is a compound identifier consisting of database name
'cpd' and a compound number used in KEGG COMPOUND / LIGAND database
(e.g. 'cpd:C00158' means citric acid). Note that some compounds
also have 'glycan_id' and both IDs are accepted and converted internally
by the corresponding methods.
* 'drug_id' is a drug identifier consisting of database name 'dr'
and a drug number used in KEGG DRUG / LIGAND database
(e.g. 'dr:D00201' means tetracycline).
* 'glycan_id' is a glycan identifier consisting of database name 'gl'
and a glycan number used in KEGG GLYCAN database (e.g. 'gl:G00050'
means paragloboside). Note that some glycans also have 'compound_id'
and both IDs are accepted and converted internally by the corresponding
methods.
* 'reaction_id' is a reaction identifier consisting of database name 'rn'
and a reaction number used in KEGG/REACTION (e.g. 'rn:R00959' is a
reaction which converts cpd:C00103 into cpd:C00668).
* 'pathway_id' is a pathway identifier consisting of 'path' and a pathway
number used in KEGG/PATHWAY. Pathway numbers prefixed by 'map' specify
the reference pathway and pathways prefixed by the 'keggorg' (the KEGG
organism code) specify pathways specific to the organism (e.g.
'path:map00020' means the reference pathway for the citrate cycle and
'path:eco00020' means the same pathway with the E. coli genes marked).
* 'motif_id' is a motif identifier consisting of motif database names
('ps' for prosite, 'bl' for blocks, 'pr' for prints, 'pd' for prodom,
and 'pf' for pfam) and a motif entry name. (e.g. 'pf:DnaJ' means a Pfam
database entry 'DnaJ').
* 'ko_id' is a KO identifier consisting of 'ko' and a KO number used in
KEGG/KO. KO (KEGG Orthology) is a classification of orthologous genes
defined by KEGG (e.g. 'ko:K02598' means a KO group for nitrite transporter
NirC genes).
* 'ko_class_id' is a KO class identifier which is used to classify
'ko_id' hierarchically (e.g. '01110' means a 'Carbohydrate Metabolism'
class).
* ((<URL:http://www.genome.jp/dbget-bin/get_htext?KO>))
* 'offset' and 'limit' are both integers, used to control the
number of results returned at once. Methods having these arguments
will return the first 'limit' results starting from the 'offset'th.
* 'fg_color_list' is a list of colors for the foreground (corresponding
to the texts and borders of the objects on the KEGG pathway map).
* 'bg_color_list' is a list of colors for the background (corresponding
to the inside of the objects on the KEGG pathway map).
Related site:
* ((<URL:http://www.genome.jp/kegg/kegg3.html>))
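Because the identifiers described above share the 'database:entry' pattern, splitting one into its two parts is straightforward. The helper below is an illustrative sketch in Python; `split_entry_id` is a hypothetical name, not a KEGG API method.

```python
def split_entry_id(entry_id):
    """Split 'database:entry' into (database, entry) at the first colon."""
    database, sep, entry = entry_id.partition(":")
    if not sep or not database or not entry:
        raise ValueError("not a 'database:entry' identifier: %r" % entry_id)
    return database, entry

# Example identifiers taken from the descriptions above.
for example in ("embl:J00231", "eco:b0001", "ec:1.1.1.1", "path:eco00020"):
    print(split_entry_id(example))
```

Values without a prefix, such as the 'ko_class_id' example '01110', are rejected by this sketch and would need separate handling.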
Many of the KEGG API methods return a set of values in a complex data structure as described below. This section summarizes all of these data types. Note that the returned values for an empty result will be
* an empty array -- for the methods which return ArrayOf'OBJ'
* an empty string -- for the methods which return String
* -1 -- for the methods which return int
* NULL -- for the methods which return any other 'OBJ'
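A client can test for these empty-result conventions in one place. The following Python sketch is only an illustration: `is_empty_result` is a hypothetical helper, and it assumes the SOAP library maps NULL to `None` and arrays to lists or tuples.

```python
def is_empty_result(value):
    """True if value matches one of the empty-result conventions listed above."""
    if value is None:  # NULL for methods returning other object types
        return True
    if isinstance(value, (list, tuple, str)):
        return len(value) == 0  # empty array or empty string
    if isinstance(value, int) and not isinstance(value, bool):
        return value == -1  # -1 for methods returning int
    return False

print(is_empty_result([]), is_empty_result(""), is_empty_result(-1))
print(is_empty_result(["eco:b0002"]), is_empty_result(0))
```

Note that 0 is treated as a real value here, since only -1 is listed as the empty marker for int results.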
+ SSDBRelation
SSDBRelation data type contains the following fields:
genes_id1        genes_id of the query (string)
genes_id2        genes_id of the target (string)
sw_score         Smith-Waterman score between genes_id1 and genes_id2 (int)
bit_score        bit score between genes_id1 and genes_id2 (float)
identity         identity between genes_id1 and genes_id2 (float)
overlap          overlap length between genes_id1 and genes_id2 (int)
start_position1  start position of the alignment in genes_id1 (int)
end_position1    end position of the alignment in genes_id1 (int)
start_position2  start position of the alignment in genes_id2 (int)
end_position2    end position of the alignment in genes_id2 (int)
best_flag_1to2   best flag from genes_id1 to genes_id2 (boolean)
best_flag_2to1   best flag from genes_id2 to genes_id1 (boolean)
definition1      definition string of the genes_id1 (string)
definition2      definition string of the genes_id2 (string)
length1          amino acid length of the genes_id1 (int)
length2          amino acid length of the genes_id2 (int)
+ ArrayOfSSDBRelation
ArrayOfSSDBRelation data type is a list of the SSDBRelation data type.
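For clients without generated stubs, an SSDBRelation record can be mirrored by a small class. The Python sketch below models only a subset of the fields and adds a hypothetical convenience property, `is_best_best`, on the assumption that a bidirectional best hit corresponds to both best flags being set. The flag values and the second record are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class SSDBRelation:
    """Subset of the SSDBRelation fields described above."""
    genes_id1: str
    genes_id2: str
    sw_score: int
    best_flag_1to2: bool
    best_flag_2to1: bool

    @property
    def is_best_best(self):
        # Assumption: both flags set means a bidirectional best hit.
        return self.best_flag_1to2 and self.best_flag_2to1

# Illustrative records; the flags and the second entry are invented.
hits = [
    SSDBRelation("eco:b0002", "sfl:SF0002", 5271, True, True),
    SSDBRelation("eco:b0002", "xxx:0001", 1200, False, True),
]
best_best = [hit.genes_id2 for hit in hits if hit.is_best_best]
print(best_best)
```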
+ MotifResult
MotifResult data type contains the following fields:
motif_id        motif_id of the motif (string)
definition      definition of the motif (string)
genes_id        genes_id of the gene containing the motif (string)
start_position  start position of the motif match (int)
end_position    end position of the motif match (int)
score           score of the motif match for TIGRFAM and PROSITE (float)
evalue          E-value of the motif match for Pfam (double)
Note: 'score' and/or 'evalue' is set to -1 if the corresponding value is not applicable.
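Since -1 marks a value that is not applicable, a client may prefer to turn it into an explicit "missing" value before doing arithmetic. A minimal Python sketch follows; `clean_motif_values` is a hypothetical helper and the numbers are made up.

```python
def clean_motif_values(score, evalue):
    """Map the -1 'not applicable' marker to None for score and evalue."""
    return (None if score == -1 else score,
            None if evalue == -1 else evalue)

# A Pfam-style match (E-value only) and a PROSITE-style match (score only).
print(clean_motif_values(-1, 1e-5))
print(clean_motif_values(12.5, -1))
```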
+ ArrayOfMotifResult
ArrayOfMotifResult data type is a list of the MotifResult data type.
+ Definition
Definition data type contains the following fields:
entry_id    database entry_id (string)
definition  definition of the entry (string)
+ ArrayOfDefinition
ArrayOfDefinition data type is a list of the Definition data type.
+ LinkDBRelation
LinkDBRelation data type contains the following fields:
entry_id1  entry_id of the starting entry (string)
entry_id2  entry_id of the terminal entry (string)
type       type of the link as "direct" or "indirect" (string)
path       link path information across the databases (string)
+ ArrayOfLinkDBRelation
ArrayOfLinkDBRelation data type is a list of the LinkDBRelation data type.
+ PathwayElement
PathwayElement represents the object on the KEGG PATHWAY map. PathwayElement data type contains the following fields:
element_id unique identifier of the object on the pathway (int)
type type of the object ("gene", "enzyme" etc.) (string)
names array of names of the object (ArrayOfstring)
components array of element_ids of the group components (ArrayOfint)
+ ArrayOfPathwayElement
ArrayOfPathwayElement data type is a list of the PathwayElement data type.
+ PathwayElementRelation
PathwayElementRelation represents the relationship between PathwayElements. PathwayElementRelation data type contains the following fields:
element_id1 unique identifier of the object on the pathway (int)
element_id2 unique identifier of the object on the pathway (int)
type type of relation ("ECrel", "maplink" etc.) (string)
subtypes array of objects involved in the relation (ArrayOfSubtype)
+ ArrayOfPathwayElementRelation
ArrayOfPathwayElementRelation data type is a list of the PathwayElementRelation data type.
++ Subtype
Subtype is used in the PathwayElementRelation data type to represent the object involved in the relation. Subtype data type contains the following fields:
element_id unique identifier of the object on the pathway (int)
relation kind of relation ("compound", "inhibition" etc.) (string)
type type of relation ("+p", "--|" etc.) (string)
++ ArrayOfSubtype
ArrayOfSubtype data type is a list of the Subtype data type.
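One common way to work with a list of PathwayElementRelation records is to index them by their first element. The sketch below, in Python, assumes each relation has already been reduced to an (element_id1, element_id2, type) triple; the function name and the element ids are hypothetical, while "ECrel" and "maplink" are the relation types mentioned above.

```python
from collections import defaultdict

def relations_by_source(relations):
    """Group (element_id1, element_id2, type) triples by element_id1."""
    grouped = defaultdict(list)
    for element_id1, element_id2, rel_type in relations:
        grouped[element_id1].append((element_id2, rel_type))
    return dict(grouped)

# Made-up element ids for illustration.
relations = [(1, 2, "ECrel"), (1, 3, "maplink"), (2, 3, "ECrel")]
index = relations_by_source(relations)
print(index[1])
```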
+ StructureAlignment
StructureAlignment represents structural alignment of nodes between two molecules with score. StructureAlignment data type contains the following fields:
target_id     entry_id of the target (string)
score         alignment score (float)
query_nodes   indices of aligned nodes in the query molecule (ArrayOfint)
target_nodes  indices of aligned nodes in the target molecule (ArrayOfint)
+ ArrayOfStructureAlignment
ArrayOfStructureAlignment data type is a list of the StructureAlignment data type.
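To inspect an alignment node by node, the two index arrays can be paired up. This Python sketch assumes that the i-th entry of query_nodes is aligned with the i-th entry of target_nodes, which the field descriptions suggest but do not state outright. `aligned_pairs` is a hypothetical helper and the indices are made up.

```python
def aligned_pairs(query_nodes, target_nodes):
    """Pair query and target node indices position by position."""
    if len(query_nodes) != len(target_nodes):
        raise ValueError("query_nodes and target_nodes differ in length")
    return list(zip(query_nodes, target_nodes))

print(aligned_pairs([1, 2, 5], [3, 4, 6]))
```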
This section describes the APIs for retrieving general information concerning the latest version of the KEGG database.