To get the zip file
url: http://rdp.cme.msu.edu/classifier/classifier.jsp
$ wget http://rdp.cme.msu.edu/download/rdp_multiclassifier.zip
$ unzip rdp_multiclassifier.zip
$ cd rdp_multiclassifier
NOTE: read the README file
$ ls dist
- lists the multiclassifier.jar file, which is required; note the full path to this jar file
Input file:
- a fasta file called 'v2bBar8L.fsa'
- It has 1054 sequences
Testing
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar v2bBar8L.fsa > foo
$ cat foo
===
taxid lineage name rank v2bBar8L.fsa
0 null Root no rank 1054
1 Root;norank;Bacteria;domain; Bacteria domain 1054
1142 Root;norank;Bacteria;domain;"Fusobacteria";phylum; "Fusobacteria" phylum 91
1143 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria" class 91
1144 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 91
1151 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae" family 91
1154 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus; Sneathia genus 91
61 Root;norank;Bacteria;domain;"Firmicutes";phylum; "Firmicutes" phylum 901
===
- the output has 5 columns: taxid, lineage, name, rank, and the count of sequences (this last column is headed by the input file name)
- the third line indicates that all 1054 sequences in the file belong to the domain Bacteria, whose taxid is 1 and whose lineage is "Root;norank;Bacteria;domain;"
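The five-column layout described above can be read programmatically. The following is a minimal sketch, not part of the RDP tools: it assumes the real output file is tab-separated (the listing above shows the columns flattened to spaces), and the function names parse_hierarchy and counts_at_rank are hypothetical helpers introduced here for illustration.

```python
def parse_hierarchy(text):
    """Parse hierarchy output (assumed tab-separated) into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    for line in lines[1:]:  # lines[0] is the header row
        taxid, lineage, name, rank, count = line.split("\t")
        records.append({
            "taxid": int(taxid),
            "lineage": lineage,
            "name": name,
            "rank": rank,
            "count": int(count),
        })
    return records


def counts_at_rank(records, rank):
    """Return {taxon name: sequence count} for one rank, e.g. 'phylum'."""
    return {r["name"]: r["count"] for r in records if r["rank"] == rank}


# A few rows copied from the output above, rebuilt with tab separators.
SAMPLE = "\n".join([
    "taxid\tlineage\tname\trank\tv2bBar8L.fsa",
    "0\tnull\tRoot\tno rank\t1054",
    "1\tRoot;norank;Bacteria;domain;\tBacteria\tdomain\t1054",
    '1142\tRoot;norank;Bacteria;domain;"Fusobacteria";phylum;\t"Fusobacteria"\tphylum\t91',
    '61\tRoot;norank;Bacteria;domain;"Firmicutes";phylum;\t"Firmicutes"\tphylum\t901',
])

records = parse_hierarchy(SAMPLE)
print(counts_at_rank(records, "phylum"))
```

Splitting on tabs rather than on whitespace matters here because the rank of the Root row ("no rank") contains a space.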
Try using the option --conf
- set the confidence cutoff value at 0.9
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar --conf=0.9 v2bBar8L.fsa > foo1
$ cat foo1
===
taxid lineage name rank v2bBar8L.fsa
0 null Root no rank 1054
1 Root;norank;Bacteria;domain; Bacteria domain 1054
1142 Root;norank;Bacteria;domain;"Fusobacteria";phylum; "Fusobacteria" phylum 88
1143 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria" class 88
1144 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 88
1151 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae" family 88
1154 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus; Sneathia genus 87
-1152 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;unclassified_"Leptotrichiaceae";; unclassified_"Leptotrichiaceae" 1
61 Root;norank;Bacteria;domain;"Firmicutes";phylum; "Firmicutes" phylum 899
===
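- observe the reduction in counts due to the increased cutoff: "Fusobacteria" drops from 91 to 88 and "Firmicutes" from 901 to 899
- one sequence that reaches family level but not genus level is now reported as unclassified_"Leptotrichiaceae" (taxid -1152)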
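To see exactly which taxa change between the two runs, the two hierarchy files can be compared by taxid. This is only a sketch under assumptions: it assumes tab-separated files (foo and foo1 above), and read_counts and count_changes are hypothetical helper names. The sample rows are copied from the two outputs above; taxa are keyed by taxid because names repeat across ranks ("Fusobacteria" is both a phylum and a class).

```python
def read_counts(text):
    """Map taxid -> sequence count from hierarchy output (assumed tab-separated)."""
    counts = {}
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        fields = line.split("\t")
        counts[int(fields[0])] = int(fields[-1])
    return counts


def count_changes(before, after):
    """Return {taxid: (count_before, count_after)} for taxa whose count differs."""
    changes = {}
    for taxid in sorted(set(before) | set(after)):
        b, a = before.get(taxid, 0), after.get(taxid, 0)
        if a != b:
            changes[taxid] = (b, a)
    return changes


HEADER = "taxid\tlineage\tname\trank\tv2bBar8L.fsa"
FUSO = 'Root;norank;Bacteria;domain;"Fusobacteria";phylum;'
LEPTO = FUSO + '"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;'
FIRM = 'Root;norank;Bacteria;domain;"Firmicutes";phylum;'

# Rows from the default run (foo).
DEFAULT_RUN = "\n".join([
    HEADER,
    "\t".join(["1142", FUSO, '"Fusobacteria"', "phylum", "91"]),
    "\t".join(["1154", LEPTO + "Sneathia;genus;", "Sneathia", "genus", "91"]),
    "\t".join(["61", FIRM, '"Firmicutes"', "phylum", "901"]),
])

# Rows from the --conf=0.9 run (foo1); the unclassified row has an empty rank.
CONF_RUN = "\n".join([
    HEADER,
    "\t".join(["1142", FUSO, '"Fusobacteria"', "phylum", "88"]),
    "\t".join(["1154", LEPTO + "Sneathia;genus;", "Sneathia", "genus", "87"]),
    "\t".join(["-1152", LEPTO + 'unclassified_"Leptotrichiaceae";;',
               'unclassified_"Leptotrichiaceae"', "", "1"]),
    "\t".join(["61", FIRM, '"Firmicutes"', "phylum", "899"]),
])

changes = count_changes(read_counts(DEFAULT_RUN), read_counts(CONF_RUN))
for taxid, (b, a) in changes.items():
    print(taxid, b, "->", a)
```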
Using the option --assign_outfile=<file>
- to get the RDP classifier output for each sequence in the file
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --assign_outfile=assign_outfile --conf=0.9 v2bBar8L.fsa > foo1
$ cat assign_outfile
===
GAV21LS02D9RRO - Root norank 1.0 Bacteria domain 1.0 "Fusobacteria" phylum 1.0 "Fusobacteria" class 1.0 "Fusobacteriales" order 1.0 "Leptotrichiaceae" family 1.0 Sneathia genus 1.0
GAV21LS02DRIZP - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02DP3EU - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 "Aerococcaceae" family 1.0 Aerococcus genus 1.0
GAV21LS02C89VS - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02C89KB - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 Lactobacillaceae family 1.0 Lactobacillus genus 1.0
===
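Each line of the per-sequence output consists of a sequence ID, a second field (shown as "-" above), and then repeated triplets of taxon name, rank, and confidence. The sketch below shows one way to read such a line and find the deepest rank that still meets a cutoff. It assumes tab-separated fields; parse_assignment and deepest_confident are hypothetical names, and the second sample line is a hypothetical variant of the first (genus confidence lowered to 0.85) made up purely to exercise the cutoff.

```python
def parse_assignment(line):
    """Split one assignment line (assumed tab-separated) into its parts."""
    fields = line.rstrip("\n").split("\t")
    seq_id, flag = fields[0], fields[1]
    rest = fields[2:]
    # The remaining fields come in (name, rank, confidence) triplets.
    ranks = [
        (rest[i], rest[i + 1], float(rest[i + 2]))
        for i in range(0, len(rest) - 2, 3)
    ]
    return seq_id, flag, ranks


def deepest_confident(ranks, cutoff):
    """Walk from Root downwards; return the last triplet meeting the cutoff."""
    best = None
    for name, rank, conf in ranks:
        if conf < cutoff:
            break
        best = (name, rank, conf)
    return best


# First line of assign_outfile above, rebuilt with tab separators.
REAL_LINE = "\t".join([
    "GAV21LS02D9RRO", "-",
    "Root", "norank", "1.0",
    "Bacteria", "domain", "1.0",
    '"Fusobacteria"', "phylum", "1.0",
    '"Fusobacteria"', "class", "1.0",
    '"Fusobacteriales"', "order", "1.0",
    '"Leptotrichiaceae"', "family", "1.0",
    "Sneathia", "genus", "1.0",
])

# HYPOTHETICAL line (not from the real file): same lineage, genus at 0.85.
HYPOTHETICAL_LINE = REAL_LINE.rsplit("\t", 1)[0] + "\t0.85"

for line in (REAL_LINE, HYPOTHETICAL_LINE):
    seq_id, flag, ranks = parse_assignment(line)
    print(seq_id, deepest_confident(ranks, 0.9))
```

With a cutoff of 0.9, the hypothetical line stops at the family level, which is the kind of sequence that the hierarchy output counts under an unclassified_ node.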
Trying the option --bootstrap_out=<file>
- default is null
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --bootstrap_out=bootstrap_out --conf=0.9 v2bBar8L.fsa > foo1
$ cat bootstrap_out
===
sample: v2bBar8L.fsa
Number of matching assignments out of 100 bootstraps
Rank >90 90-81 80-71 70-61 60-51 50-41 40-31 30-21 20-11 10-1
domain 1054 0 0 0 0 0 0 0 0 0
phylum 1029 13 4 3 3 2 0 0 0 0
class 985 27 17 13 5 3 4 0 0 0
order 978 30 17 13 8 4 3 1 0 0
family 954 29 14 21 16 11 4 3 2 0
genus 931 34 13 17 17 19 8 7 8 0
===
- Each row sums to 1054, i.e. the number of sequences in the input file
- For example, at the genus rank, 931 sequences are classified with >90% confidence (more than 90 of the 100 bootstraps agree), 34 with 81-90% confidence, and so on.
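The row-sum check can be done mechanically. The following sketch copies the table from bootstrap_out above and adds up each row; row_sums is a hypothetical helper name, and the parser simply assumes whitespace-separated columns with one rank label followed by ten integer bins.

```python
TABLE = """\
Rank >90 90-81 80-71 70-61 60-51 50-41 40-31 30-21 20-11 10-1
domain 1054 0 0 0 0 0 0 0 0 0
phylum 1029 13 4 3 3 2 0 0 0 0
class 985 27 17 13 5 3 4 0 0 0
order 978 30 17 13 8 4 3 1 0 0
family 954 29 14 21 16 11 4 3 2 0
genus 931 34 13 17 17 19 8 7 8 0
"""


def row_sums(table_text):
    """Return {rank: total sequences} for every data row of the table."""
    sums = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Data rows have a rank label plus ten integer bins; skip anything else.
        if len(parts) == 11 and all(p.isdigit() for p in parts[1:]):
            sums[parts[0]] = sum(int(p) for p in parts[1:])
    return sums


for rank, total in row_sums(TABLE).items():
    print(rank, total)
```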