Friday, August 20, 2010

RDP MultiClassifier

On Ubuntu, download and unpack the zip file
url: http://rdp.cme.msu.edu/classifier/classifier.jsp
$ wget http://rdp.cme.msu.edu/download/rdp_multiclassifier.zip
$ unzip rdp_multiclassifier.zip
$ cd rdp_multiclassifier


NOTE: read the README file
$ ls dist
- lists multiclassifier.jar, the file needed to run the classifier; note the full path to this jar file


Input file:
- a fasta file called 'v2bBar8L.fsa'
- It has 1054 sequences
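
The sequence count can be checked by counting the FASTA header lines. A minimal sketch on a tiny made-up file (the temporary file and its three records are hypothetical stand-ins for v2bBar8L.fsa):

```shell
# Build a small FASTA file for illustration (hypothetical data, not the real input).
sample_fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nACGT\n' > "$sample_fasta"

# Every record starts with a '>' header line, so counting those lines counts sequences.
seq_count=$(grep -c '^>' "$sample_fasta")
echo "$seq_count"
```

Running the same grep on v2bBar8L.fsa should report the number of sequences in that file.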


Testing
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar v2bBar8L.fsa > foo
$ cat foo
===

taxid lineage name  rank  v2bBar8L.fsa
0 null  Root  no rank 1054
1 Root;norank;Bacteria;domain;  Bacteria  domain  1054
1142  Root;norank;Bacteria;domain;"Fusobacteria";phylum;  "Fusobacteria"  phylum  91
1143  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria"  class 91
1144  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 91
1151  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae"  family  91
1154  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus;  Sneathia  genus 91
61  Root;norank;Bacteria;domain;"Firmicutes";phylum;  "Firmicutes"  phylum  901
===
- the output has 5 columns: taxid, lineage, name, rank, and the number of sequences assigned to that taxon (the header of the last column is the input file name)
- the third line shows that all 1054 sequences in the file belong to the domain Bacteria, whose taxid is 1 and whose lineage is "Root;norank;Bacteria;domain;"
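
To pull out a single rank from this table, awk can filter on the rank column. A rough sketch, assuming the columns are tab-separated (the spacing shown above does not confirm this); the rows are copied from the output above into a temporary file:

```shell
# Recreate a few rows of the hierarchy output; tab-separated fields are an assumption.
hier=$(mktemp)
printf 'taxid\tlineage\tname\trank\tv2bBar8L.fsa\n' > "$hier"
printf '1\tRoot;norank;Bacteria;domain;\tBacteria\tdomain\t1054\n' >> "$hier"
printf '1142\tRoot;norank;Bacteria;domain;"Fusobacteria";phylum;\t"Fusobacteria"\tphylum\t91\n' >> "$hier"
printf '61\tRoot;norank;Bacteria;domain;"Firmicutes";phylum;\t"Firmicutes"\tphylum\t901\n' >> "$hier"

# Skip the header, keep rows whose rank (column 4) is phylum, print name and count.
phyla=$(awk -F'\t' 'NR > 1 && $4 == "phylum" { print $3 "\t" $5 }' "$hier")
echo "$phyla"
```

Replacing "$hier" with foo, and "phylum" with another rank, would give the same view of the real output.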

Try using the option --conf
- set the confidence cutoff value at 0.9
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar --conf=0.9 v2bBar8L.fsa > foo1
$ cat foo1
===
taxid lineage name rank v2bBar8L.fsa
0 null Root no rank 1054
1 Root;norank;Bacteria;domain; Bacteria domain 1054
1142 Root;norank;Bacteria;domain;"Fusobacteria";phylum; "Fusobacteria" phylum 88
1143 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria" class 88
1144 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 88
1151 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae" family 88
1154 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus; Sneathia genus 87
-1152 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;unclassified_"Leptotrichiaceae";; unclassified_"Leptotrichiaceae" 1
61 Root;norank;Bacteria;domain;"Firmicutes";phylum; "Firmicutes" phylum 899
===
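- the counts drop with the higher cutoff, e.g. "Fusobacteria" goes from 91 to 88 and "Firmicutes" from 901 to 899
- the one "Leptotrichiaceae" sequence no longer assigned to the genus Sneathia is counted under unclassified_"Leptotrichiaceae"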
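
The two runs can also be compared taxon by taxon. A sketch, again assuming tab-separated columns; the rows are copied from foo and foo1 above into two temporary files:

```shell
# Rows copied from the two outputs above; tab-separated fields are an assumption.
run_default=$(mktemp)
run_conf09=$(mktemp)
printf '1\tRoot;norank;Bacteria;domain;\tBacteria\tdomain\t1054\n' > "$run_default"
printf '1142\tRoot;norank;Bacteria;domain;"Fusobacteria";phylum;\t"Fusobacteria"\tphylum\t91\n' >> "$run_default"
printf '61\tRoot;norank;Bacteria;domain;"Firmicutes";phylum;\t"Firmicutes"\tphylum\t901\n' >> "$run_default"
printf '1\tRoot;norank;Bacteria;domain;\tBacteria\tdomain\t1054\n' > "$run_conf09"
printf '1142\tRoot;norank;Bacteria;domain;"Fusobacteria";phylum;\t"Fusobacteria"\tphylum\t88\n' >> "$run_conf09"
printf '61\tRoot;norank;Bacteria;domain;"Firmicutes";phylum;\t"Firmicutes"\tphylum\t899\n' >> "$run_conf09"

# First file: remember the count per taxid. Second file: print taxa whose count changed.
changed=$(awk -F'\t' '
  NR == FNR { before[$1] = $5; next }
  ($1 in before) && before[$1] != $5 { print $3 "\t" before[$1] "\t" $5 }
' "$run_default" "$run_conf09")
echo "$changed"
```

Each printed line gives the taxon name, the count from the first run, and the count from the --conf=0.9 run; Bacteria is not printed because its count is unchanged.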


Using the option --assign_outfile=<file>
- writes the RDP classifier assignment for each sequence in the input file to <file>
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --assign_outfile=assign_outfile --conf=0.9 v2bBar8L.fsa > foo1
$ cat assign_outfile 
===
GAV21LS02D9RRO - Root norank 1.0 Bacteria domain 1.0 "Fusobacteria" phylum 1.0 "Fusobacteria" class 1.0 "Fusobacteriales" order 1.0 "Leptotrichiaceae" family 1.0 Sneathia genus 1.0
GAV21LS02DRIZP - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02DP3EU - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 "Aerococcaceae" family 1.0 Aerococcus genus 1.0
GAV21LS02C89VS - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02C89KB - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 Lactobacillaceae family 1.0 Lactobacillus genus 1.0
===
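
The per-sequence file can be tallied by genus. A sketch, assuming the fields are tab-separated and that each taxon is written as a name, rank, confidence triplet as in the lines above; three of those lines are copied into a temporary file:

```shell
# Three assignment lines copied from the output above; tab separation is an assumption.
assign=$(mktemp)
printf 'GAV21LS02D9RRO\t-\tRoot\tnorank\t1.0\tBacteria\tdomain\t1.0\t"Fusobacteria"\tphylum\t1.0\t"Fusobacteria"\tclass\t1.0\t"Fusobacteriales"\torder\t1.0\t"Leptotrichiaceae"\tfamily\t1.0\tSneathia\tgenus\t1.0\n' > "$assign"
printf 'GAV21LS02DRIZP\t-\tRoot\tnorank\t1.0\tBacteria\tdomain\t1.0\t"Firmicutes"\tphylum\t1.0\t"Clostridia"\tclass\t1.0\tClostridiales\torder\t1.0\tVeillonellaceae\tfamily\t1.0\tVeillonella\tgenus\t1.0\n' >> "$assign"
printf 'GAV21LS02C89VS\t-\tRoot\tnorank\t1.0\tBacteria\tdomain\t1.0\t"Firmicutes"\tphylum\t1.0\t"Clostridia"\tclass\t1.0\tClostridiales\torder\t1.0\tVeillonellaceae\tfamily\t1.0\tVeillonella\tgenus\t1.0\n' >> "$assign"

# The field just before the word "genus" is the genus name; count how often each occurs.
genus_counts=$(awk -F'\t' '
  { for (i = 3; i <= NF; i++) if ($i == "genus") count[$(i - 1)]++ }
  END { for (g in count) print g "\t" count[g] }
' "$assign" | LC_ALL=C sort)
echo "$genus_counts"
```

The sort is there only because awk does not guarantee the order of a for-in loop.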



Trying option --bootstrap_out=<file>
- default is null
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --bootstrap_out=bootstrap_out --conf=0.9 v2bBar8L.fsa > foo1
$ cat bootstrap_out
===
sample: v2bBar8L.fsa
  Number of matching assignments out of 100 bootstraps
Rank  >90 90-81 80-71 70-61 60-51 50-41 40-31 30-21 20-11 10-1
domain  1054  0 0 0 0 0 0 0 0 0
phylum  1029  13  4 3 3 2 0 0 0 0
class 985 27  17  13  5 3 4 0 0 0
order 978 30  17  13  8 4 3 1 0 0
family  954 29  14  21  16  11  4 3 2 0
genus 931 34  13  17  17  19  8 7 8 0
===
- Each row sums to 1054, i.e. the number of sequences in the file
- For example, at the genus level 931 sequences were classified with >90% confidence, 34 with 81-90% confidence, and so on.
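
The row sums and the share of sequences in the >90 bin can be checked with awk. A small sketch; the table rows are copied from the bootstrap_out file above, without its two title lines:

```shell
# Table rows copied from bootstrap_out above (title lines dropped).
boot=$(mktemp)
cat > "$boot" <<'EOF'
Rank >90 90-81 80-71 70-61 60-51 50-41 40-31 30-21 20-11 10-1
domain 1054 0 0 0 0 0 0 0 0 0
phylum 1029 13 4 3 3 2 0 0 0 0
class 985 27 17 13 5 3 4 0 0 0
order 978 30 17 13 8 4 3 1 0 0
family 954 29 14 21 16 11 4 3 2 0
genus 931 34 13 17 17 19 8 7 8 0
EOF

# For each rank: sum all bins, then report the percentage that falls in the >90 bin.
summary=$(awk 'NR > 1 {
  total = 0
  for (i = 2; i <= NF; i++) total += $i
  printf "%s %d %.1f\n", $1, total, 100 * $2 / total
}' "$boot")
echo "$summary"
```

Each output line is rank, row total, and percent in the >90 bin; for genus that is 931 of 1054, about 88.3%.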
