Friday, August 20, 2010

RDP MultiClassifier

On Ubuntu, to get the zip file:
url: http://rdp.cme.msu.edu/classifier/classifier.jsp
$ wget http://rdp.cme.msu.edu/download/rdp_multiclassifier.zip
$ unzip rdp_multiclassifier.zip
$ cd rdp_multiclassifier


NOTE: read the README file
$ ls dist
- shows the multiclassifier.jar file which is required, note the path to this jar file


Input file:
- a fasta file called 'v2bBar8L.fsa'
- It has 1054 sequences


Testing
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar v2bBar8L.fsa > foo
$ cat foo
===

taxid lineage name  rank  v2bBar8L.fsa
0 null  Root  no rank 1054
1 Root;norank;Bacteria;domain;  Bacteria  domain  1054
1142  Root;norank;Bacteria;domain;"Fusobacteria";phylum;  "Fusobacteria"  phylum  91
1143  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria"  class 91
1144  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 91
1151  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae"  family  91
1154  Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus;  Sneathia  genus 91
61  Root;norank;Bacteria;domain;"Firmicutes";phylum;  "Firmicutes"  phylum  901
===
- the output has 5 columns: taxid, lineage, name, rank, and the count of sequences (this last column is headed by the input file name)
- the third line indicates that all 1054 sequences in the file belong to the domain Bacteria, whose taxid is 1 and lineage is "Root;norank;Bacteria;domain;"
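
To look at just one rank from this table, something like the following could work. It is only a sketch: it assumes the columns are tab-separated (check your own output), and rank_counts is a helper name made up for this note, not part of the RDP tool.

```shell
# Print "name<TAB>count" for every taxon at one rank in the hierarchy output.
# Assumption: columns are tab-separated (taxid, lineage, name, rank, count).
# rank_counts is a hypothetical helper, not something shipped with RDP.
rank_counts() {
    rank="$1"
    file="$2"
    awk -F'\t' -v r="$rank" 'NF >= 5 && $4 == r { print $3 "\t" $5 }' "$file"
}

# Usage: rank_counts phylum foo
```

With the output above, asking for phylum should list "Fusobacteria" and "Firmicutes" with their counts.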

Try using the option --conf
- set the confidence cutoff value at 0.9
$ java -Xmx1g -jar /path/to/rdp_multiclassifier/dist/multiclassifier.jar --conf=0.9 v2bBar8L.fsa > foo1
$ cat foo1
===
taxid lineage name rank v2bBar8L.fsa
0 null Root no rank 1054
1 Root;norank;Bacteria;domain; Bacteria domain 1054
1142 Root;norank;Bacteria;domain;"Fusobacteria";phylum; "Fusobacteria" phylum 88
1143 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class; "Fusobacteria" class 88
1144 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order; "Fusobacteriales" order 88
1151 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family; "Leptotrichiaceae" family 88
1154 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;Sneathia;genus; Sneathia genus 87
-1152 Root;norank;Bacteria;domain;"Fusobacteria";phylum;"Fusobacteria";class;"Fusobacteriales";order;"Leptotrichiaceae";family;unclassified_"Leptotrichiaceae";; unclassified_"Leptotrichiaceae" 1
61 Root;norank;Bacteria;domain;"Firmicutes";phylum; "Firmicutes" phylum 899
===
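- observe the reduction in counts due to the increased cutoff: "Fusobacteria" drops from 91 to 88 and "Firmicutes" from 901 to 899
- of the 88 "Leptotrichiaceae" sequences, 87 are still assigned to the genus Sneathia and 1 is reported as unclassified_"Leptotrichiaceae"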
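
To see which taxa changed between the two runs without comparing by eye, a small awk join on the taxid could be used. This is a sketch under the assumption that both files are tab-separated with taxid in column 1 and the count in column 5; count_diff is a made-up name.

```shell
# List taxa whose count differs between two hierarchy files.
# Assumption: tab-separated columns, taxid in column 1, name in 3, count in 5.
# count_diff is a hypothetical helper, not part of the RDP tool.
count_diff() {
    awk -F'\t' '
        # first file: remember the count for every taxid
        NR == FNR { before[$1] = $5; next }
        # second file: report taxids seen before whose count changed
        $1 != "taxid" && ($1 in before) && before[$1] != $5 {
            print $3 "\t" before[$1] "\t" $5
        }' "$1" "$2"
}

# Usage: count_diff foo foo1
```

Taxa that appear only in the second file (such as the unclassified_ rows) are not reported by this sketch.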


Using the option --assign_outfile=<file>
- to get the RDP classifier output for each sequence in the file
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --assign_outfile=assign_outfile --conf=0.9 v2bBar8L.fsa > foo1
$ cat assign_outfile 
===
GAV21LS02D9RRO - Root norank 1.0 Bacteria domain 1.0 "Fusobacteria" phylum 1.0 "Fusobacteria" class 1.0 "Fusobacteriales" order 1.0 "Leptotrichiaceae" family 1.0 Sneathia genus 1.0
GAV21LS02DRIZP - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02DP3EU - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 "Aerococcaceae" family 1.0 Aerococcus genus 1.0
GAV21LS02C89VS - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Clostridia" class 1.0 Clostridiales order 1.0 Veillonellaceae family 1.0 Veillonella genus 1.0
GAV21LS02C89KB - Root norank 1.0 Bacteria domain 1.0 "Firmicutes" phylum 1.0 "Bacilli" class 1.0 "Lactobacillales" order 1.0 Lactobacillaceae family 1.0 Lactobacillus genus 1.0
===
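
A quick tally of how many sequences landed in each genus can be made from this file. The sketch below assumes the fields are tab-separated and that each assignment is a name / rank / confidence triple as in the lines above; genus_tally is a name invented here.

```shell
# Count sequences per genus in the per-sequence assignment file.
# Assumption: tab-separated fields, with each taxon name immediately
# followed by its rank (e.g. "Sneathia<TAB>genus<TAB>1.0").
# genus_tally is a hypothetical helper, not part of the RDP tool.
genus_tally() {
    awk -F'\t' '
        { for (i = 2; i <= NF; i++) if ($i == "genus") count[$(i - 1)]++ }
        END { for (g in count) print count[g] "\t" g }' "$1" |
        sort -k1,1nr -k2,2
}

# Usage: genus_tally assign_outfile
```

The output is sorted with the most frequent genus first.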



Trying option --bootstrap_out=<file>
- default is null
$ java -Xmx1g -jar /home/krevanna/Desktop/TEST/RDPMultiClassifier/rdp_multiclassifier/dist/multiclassifier.jar --bootstrap_out=bootstrap_out --conf=0.9 v2bBar8L.fsa > foo1
$ cat bootstrap_out
===
sample: v2bBar8L.fsa
  Number of matching assignments out of 100 bootstraps
Rank  >90 90-81 80-71 70-61 60-51 50-41 40-31 30-21 20-11 10-1
domain  1054  0 0 0 0 0 0 0 0 0
phylum  1029  13  4 3 3 2 0 0 0 0
class 985 27  17  13  5 3 4 0 0 0
order 978 30  17  13  8 4 3 1 0 0
family  954 29  14  21  16  11  4 3 2 0
genus 931 34  13  17  17  19  8 7 8 0
===
- The sum of each row is 1054, i.e. the 1054 sequences
- E.g. for genus, 931 sequences had a matching assignment in more than 90 of the 100 bootstraps, 34 had one in 81-90 of them, and so on.
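
The row sums can be checked with awk instead of by hand. A minimal sketch, assuming the table is whitespace-separated as shown above; bootstrap_row_sums is a made-up name.

```shell
# Print each rank with the sum of its bin counts.
# Data rows are recognised by having an integer in the second field,
# which skips the "sample:", description and header lines.
# bootstrap_row_sums is a hypothetical helper, not part of the RDP tool.
bootstrap_row_sums() {
    awk '$2 ~ /^[0-9]+$/ {
            s = 0
            for (i = 2; i <= NF; i++) s += $i
            print $1, s
        }' "$1"
}

# Usage: bootstrap_row_sums bootstrap_out
```

Every row should report the number of sequences in the input file.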

Monday, August 16, 2010

Install :: Java :: Ubuntu 10.04

To install Java on Ubuntu 10.04

In /etc/apt/sources.list, add these 2 lines:
deb http://http.us.debian.org/debian lenny main contrib non-free 
deb-src http://http.us.debian.org/debian lenny main contrib non-free
$ sudo apt-get update
$ sudo apt-cache search sun-java
$ sudo apt-get install sun-java6-source sun-java6-jre sun-java6-javadb sun-java6-fonts sun-java6-plugin sun-java6-jdk sun-java6-demo sun-java6-bin


Wednesday, August 11, 2010

ESPRIT :: Installation :: Cluster

Obtain the source code


On *nix, steps to install ESPRIT:
$ unzip ESPRIT_distribution.zip
$ cd ESPRIT_distribution


Read esprit_user_guide.pdf and README.txt
$ cd source
$ vim Makefile
Choose the platform by commenting/uncommenting the matching lines


To make the package
$ make esprit_cc


It's always better to do a clean build:
$ make clean 
$ make esprit_cc


Precaution: 
- make sure that in the fasta file each header is on one line and each sequence is on a single line
- if a sequence is spread over multiple lines, convert the file so that each sequence is on just one line
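
One way to do that conversion is with awk. This is a sketch, not something from the ESPRIT package; unwrap_fasta is a name made up for this note and the output file name in the usage line is only an example.

```shell
# Join wrapped sequence lines so each record is one header line
# followed by exactly one sequence line.
# unwrap_fasta is a hypothetical helper, not part of ESPRIT.
unwrap_fasta() {
    awk '
        # header line: flush the sequence collected so far, then print the header
        /^>/ { if (s != "") print s; print; s = ""; next }
        # sequence line: append to the current sequence
        { s = s $0 }
        END { if (s != "") print s }' "$1"
}

# Usage: unwrap_fasta wrapped.fas > sequence.fas
```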


The steps below are written as pseudocode, using shell scripts and a cluster job manager.
Copy the sequence file here. If you have more than one, then group them into a single file.
$ cp /path/to/sequence.fas .


To run preproc
$ /path/to/ESPRIT_distribution/source/preproc -f sequence.fas
160794 Seqs Match Primer
160794 Seqs Valid Len


31072 Seqs After Process
      1.63 secs in Purging Strings.


flag:
-f this prevents the program from trimming.


Files created:
sequence_Clean.fas
sequence_Clean.frq


To check
$ awk -F' ' '{ s+=$2 } END { print s }' sequence_Clean.frq
160794
$ grep -c '>' sequence.fas
160794
- Make sure that these two numbers are the same.
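
The same check can be wrapped in a function that compares the two numbers for you. A sketch only: check_preproc_counts is a made-up name, and it assumes (as the awk command above does) that the second column of the .frq file holds the count.

```shell
# Compare the number of FASTA headers with the total of column 2 in the .frq file.
# Returns 0 when they agree and 1 otherwise.
# check_preproc_counts is a hypothetical helper, not part of ESPRIT.
check_preproc_counts() {
    fasta="$1"
    frq="$2"
    n_fasta=$(grep -c '^>' "$fasta")
    n_frq=$(awk '{ s += $2 } END { print s + 0 }' "$frq")
    if [ "$n_fasta" -eq "$n_frq" ]; then
        echo "OK: $n_fasta sequences in both files"
    else
        echo "MISMATCH: $n_fasta headers vs $n_frq in .frq"
        return 1
    fi
}

# Usage: check_preproc_counts sequence.fas sequence_Clean.frq
```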


To run kmerdist_par
$ cat submit_kmer_jobs.sh
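for i in $(seq 1 10)
do
  for j in $(seq $i 10)
  do
    # one job per block pair (i, j); the job file is named after the pair,
    # because re-seeding RANDOM in every iteration gives the same number each time
    job="/path/to/ESPRIT_distribution/source/kmerdist_par sequence_Clean.fas 10 $i $j"
    echo "$job" > kmer_job_${i}_${j}.clusterJob
    clusterJobSubmission < kmer_job_${i}_${j}.clusterJob
  done
done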
- where clusterJobSubmission is your cluster job submission manager
- the extension .clusterJob can be replaced with the extension required
- variable job can include other details if required.
$ cat jobs.clusterJob
## ..
sh submit_kmer_jobs.sh
$ jobsubmit < jobs.clusterJob
- this will submit the job.


Output:
sequence_Clean_[*]_[*].dist
- make sure that numbers are 1_[1-10], 2_[2-10], 3_[3-10], 4_[4-10], 5_[5-10], 6_[6-10], 7_[7-10], 8_[8-10], 9_[9-10], 10_10
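
Checking 55 file names by eye is error-prone, so a small loop can list whatever is missing. This is a sketch that assumes the files are named sequence_Clean_<i>_<j>.dist as above; missing_kmer_blocks is a name invented for this note.

```shell
# Print the name of every expected block file i_j (i <= j <= n) that does not exist.
# Arguments: directory to look in (default .), number of blocks (default 10).
# missing_kmer_blocks is a hypothetical helper, not part of ESPRIT.
missing_kmer_blocks() {
    dir="${1:-.}"
    n="${2:-10}"
    i=1
    while [ "$i" -le "$n" ]; do
        j="$i"
        while [ "$j" -le "$n" ]; do
            f="$dir/sequence_Clean_${i}_${j}.dist"
            [ -e "$f" ] || echo "$f"
            j=$((j + 1))
        done
        i=$((i + 1))
    done
}

# Usage: missing_kmer_blocks . 10
```

No output means all the block files are there and it is safe to merge them.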


Merge all the .dist files
$ cat sequence_Clean_*.dist >> kmer.dist


Split the kmer.dist file into 100 files
$ /path/to/ESPRIT_distribution/source/splitdist -s 100 kmer.dist
Counting Total Records....
71249223 Records Found, Splitting...


Output:
kmer.dist_[0-99]


Submit parallel jobs for needledist
$ cat submit_needle_job.sh
for i in $(seq 0 99)
do
  # one job per split file; the job file is named after the split index,
  # because re-seeding RANDOM in every iteration gives the same number each time
  job="/path/to/ESPRIT_distribution/source/needledist sequence_Clean.fas kmer.dist_$i needle.dist_$i"
  echo "$job" > needle_job_$i.clusterJob
  clusterJobSubmission < needle_job_$i.clusterJob
done


Output
needle.dist_[0-99]


Group all the needle.dist files
$ cat needle.dist_* >> sequence.ndist


To run hcluster
$ /path/to/ESPRIT_distribution/source/hcluster -t 15000 sequence.ndist sequence_Clean.frq
- flag -t is used to increase the size of the linked table, default is 10000


Output
sequence.ndist_sort
sequence.OTU
sequence.Outliers
sequence.Rarefaction
sequence.Cluster
sequence.Cluster_List
sequence.ACE
sequence.CHAO1

Ubuntu :: Printer :: Access

If the printer doesn't show up in the printer list:

Steps:
- Login as admin
- Open a browser
- use url http://localhost:631
- "CUPS for Administrators" > Adding Printers and Classes
- provide the username and password.
- It searches over the network and lists the printer
- Select the printer and continue
- Select the driver to use and continue

To print a test page,
Maintenance > print test page



Tuesday, August 3, 2010

R :: install :: gdata :: qvalue

To install gdata
> install.packages("gdata", dependencies=TRUE)
Check the installation
> library('gdata')

To install qvalue
> install.packages("qvalue", dependencies=TRUE)
Check the installation
> library('qvalue')