This chapter covers what DEODAS does, step by step, in the order the actions occur. Please have the program deodas.py running while reading it. The deodas.py interface was designed to be as self-explanatory as possible.
All files ending in `.py' are Python scripts. These are run by the Python interpreter. If there is any problem starting these scripts, make sure the `python' interpreter is on the executable path.
DEODAS starts with collections of related protein sequences, or protein families, each in its own FASTA format file. Related proteins can be found by starting with a base sequence and conducting a BLASTP search (see section References). The BLASTP search compares the base sequence with protein sequences translated from nucleic acid databases (see section References). Because of automated sequencing, the number of nucleic acid sequences far exceeds the number of directly sequenced proteins. More than one FASTA file and protein family can be processed at once, but they will all receive the same settings (described later). Make sure all protein sequences for a family are together in one FASTA format file (see section Tutorial). All FASTA files to be processed are placed in a single directory.
BLASTP search results are the recommended input for DEODAS, which has been extensively tested with BLASTP protein input. In these files the second sequence is the closest relative of the top sequence, and the sequences diverge further going down the file. If the sequences are entered in random order in the input file, the splittree program may not group subfamilies correctly when close relatives are far apart in the FASTA file, but this has not been well tested.
A utility program for changing file extensions, extension.py, is included with DEODAS. It changes every file with one extension in a directory to another extension. It runs with the arguments extension.py file-extension-from file-extension-to. For example, if there are several `*.fas' files in a directory, run the command extension.py fas fasta. The extension.py command also runs interactively: run extension.py and enter the file extensions at the command prompts.
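The renaming that extension.py performs can be pictured with the following minimal sketch. It is not the source of extension.py; the function name change_extensions is hypothetical, and it assumes that only the final extension of each file name is compared.

```python
import os

def change_extensions(directory, ext_from, ext_to):
    """Rename every file ending in .ext_from so it ends in .ext_to.

    Hypothetical helper, not the real extension.py. Returns the new names.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        # Only the last extension is compared, e.g. 'fas' in 'uspA.fas'.
        if ext == "." + ext_from:
            new_name = base + "." + ext_to
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed
```

Calling change_extensions(".", "fas", "fasta") would correspond to running extension.py fas fasta in the current directory.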
The DEODAS interface program, deodas.py, selects a directory on which to run the designer process, deodasdesigner.py. Every `*.fasta' file is operated on by deodasdesigner.py; files with other extensions are not recognized even if they are FASTA files. deodasdesigner.py runs as a separate process because the design and analysis can take over a day if many sequences are involved. The deodas.py interface can be closed once a design process is started, or it can be used to start a new process in another directory.
See figure 1 next page, DEODAS main interface.
deodas.py
Settings for DEODAS include the Codon usage table tool bar item, seen on the interface. The codon usage table is used by codehop to back-translate amino-acid sequences into nucleic-acid sequences (see section References). Codon usage tables are kept in `$BLIMPS_DIR/docs'. Additional codon usage tables are available from http://blocks.fhcrc.org/blocks/help/CODEHOP/codon.html.
The Maximum subfamily dissimilarity range is used to divide the FASTA protein family files into subfamilies based on their dissimilarity. A family tree of each protein family is created by clustalw, and splittree reads the tree and looks for sections branching above the dissimilarity set in deodas.py. Branches of the tree exceeding the dissimilarity setting are broken off into subfamilies. A new FASTA file is written for each subfamily with the name original-base-file-name.sub-number.fasta. For example, `uspA.fasta' could split into `uspA.1.fasta' and `uspA.2.fasta'. The subfamily splitting is needed because no oligonucleotides can be produced if the sequences in the original `*.fasta' file are too diverse. It may be necessary to try designing oligonucleotides for a family with a few different dissimilarity values.
The Minimum oligonucleotide length range sets deodasdesigner.py to filter oligonucleotides shorter than this length out of the final data. These oligonucleotides still appear in the intermediary files discussed later.
The Maximum mismatches range adjusts the number of mismatches allowed when searching the designed oligonucleotides against Genbank flat files. After the oligonucleotides are designed, each one that meets the Minimum oligonucleotide length is searched against Genbank for matches by the fuzznuc program from EMBOSS, using this mismatch limit.
Genbank databases entry selects the Genbank flat files that the oligonucleotides are searched against. The flat files are entered with their full path names separated by commas with no spaces. Genbank flat files are at ftp://ncbi.nlm.nih.gov/genbank/. An Expect script, `ftp.genbank.exp', can be used to automatically download and install Genbank flat files (see section Genbank installation).
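The format of the Genbank databases entry can be checked with a small sketch like the one below. The function name parse_flat_file_entry is hypothetical, the example paths in the usage note are invented, and the check reads the rule strictly: any space or relative path is rejected. How deodas.py itself validates the entry is not described here.

```python
import posixpath

def parse_flat_file_entry(entry):
    """Split a comma-separated list of full path names (hypothetical helper).

    Raises ValueError for empty items, spaces, or paths that are not absolute.
    """
    paths = entry.split(",")
    for path in paths:
        if not path or " " in path:
            raise ValueError("empty item or space in entry: %r" % entry)
        if not posixpath.isabs(path):
            raise ValueError("not a full path name: %r" % path)
    return paths
```

For instance, an entry such as /data/genbank/gbbct1.seq,/data/genbank/gbbct2.seq (made-up locations) yields a list of two paths, while the same entry with a space after the comma is rejected.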
The Create a new results database entry creates a new database with the Postgresql database management system (DBMS). To use this, the user must have been enabled in the Postgresql DBMS (see section Genbank installation). The results of the oligonucleotide design are saved to this database. The Save results to an existing database entry sets deodasdesigner.py to save its results to an existing Postgresql database. The user must be allowed to write to this database under the Postgresql DBMS's authority. If multiple users save to the same database, consult the Postgresql manual.
The OK button starts deodasdesigner.py. As mentioned before, this is a separate process, and the deodas.py interface can be used again in another directory or closed. Be careful not to start a second process in the same directory by pressing OK again, or the second deodasdesigner.py process will overwrite the earlier process's files.
The first released version of DEODAS uses the operating system's normal process control system. The command ps, run from an X terminal, displays the running processes. This can be used to monitor the DEODAS processes. To stop deodasdesigner.py, run ps from the terminal running deodasdesigner.py to get the process number. Then run the command kill process-number. As an alternative, the K Desktop Environment (KDE) comes with a graphical process monitoring system, kpm, that can be used to monitor and kill processes. It comes in the `kde-utils' package on many Linux systems and requires that KDE is installed. Closing the terminal running deodasdesigner.py will stop everything if needed. Future versions of DEODAS may contain more process control.
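For anyone supervising a design run from a script rather than a terminal, the kill step has a direct Python equivalent. This is only a sketch for POSIX systems; the function name stop_process is hypothetical and the process number must still be found first, for example with ps.

```python
import os
import signal

def stop_process(pid):
    """Send SIGTERM, the default signal of the kill command, to process pid.

    Hypothetical helper for POSIX systems; pid comes from the caller.
    """
    os.kill(pid, signal.SIGTERM)
```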
The design and screening of the oligonucleotides is handled by deodasdesigner.py. As noted before, it runs in its own process. It reads all `*.fasta' files in the starting directory (see section Input files), then processes the files one at a time until finished.
Next, deodasdesigner.py creates a subdirectory with the name of the FASTA file minus the ending `.fasta'. For example, `uspA.fasta' creates the directory `uspA/'. This keeps all intermediate files generated for a FASTA file together in one directory. The settings for the program are written to a file named `settings.txt' in each subdirectory. The program cleanfasta removes any long descriptions, which can interfere with many sequence analysis programs, and creates a stripped copy of the FASTA file in the subdirectory.
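The subdirectory and stripped-copy step might look roughly like the sketch below. Both function names are hypothetical, and two details are assumptions rather than documented behavior: that stripping a description means keeping only the first word of each header line, and that the stripped copy keeps the original file name.

```python
import os

def clean_fasta_text(text):
    """Keep only the first word of each FASTA header line (assumed rule)."""
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            line = ">" + (words[0] if words else "")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"

def prepare_workdir(fasta_path):
    """Create <base>/ beside <base>.fasta and write a stripped copy into it."""
    base, ext = os.path.splitext(fasta_path)
    if ext != ".fasta":
        raise ValueError("only *.fasta files are processed")
    os.makedirs(base, exist_ok=True)
    with open(fasta_path) as src:
        cleaned = clean_fasta_text(src.read())
    # Assumption: the stripped copy keeps the original file name.
    out_path = os.path.join(base, os.path.basename(fasta_path))
    with open(out_path, "w") as dst:
        dst.write(cleaned)
    return out_path
```

With this sketch, prepare_workdir("uspA.fasta") creates `uspA/' and writes `uspA/uspA.fasta' with shortened header lines.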
Next, clustalw reads the FASTA file in the subdirectory and outputs a multiple sequence alignment in the file file-name.aln and a phylogenetic tree in Newick nested-parenthesis format in the file file-name.dnd (see section References). Each pair of sequences is first aligned and assigned a similarity score. The pairwise alignments are used to build the phylogenetic tree. The multiple alignment then starts with the most closely aligned pair and adds the next sequence in the tree one at a time (see section References).
The phylogenetic tree and the FASTA file are read together by splittree, which uses the dissimilarity setting to divide the FASTA file into subfiles. splittree begins by copying the FASTA sequences from `*.fasta' into `*.1.fasta'. When a branch over the dissimilarity setting is found in the phylogenetic tree file, `*.dnd', a new file, `*.2.fasta', is created and the following FASTA sequences are copied into it (see section Maximum subfamily dissimilarity). Every time a branch over the set dissimilarity value is encountered, a new subfamily file is created. This process continues until the entire `*.dnd' tree file has been read. Splitting the sequences into subfamilies means the oligonucleotides are designed on more closely related sequences.
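The splitting rule can be illustrated with a simplified sketch. It does not parse a Newick `*.dnd' file as splittree does; instead it assumes the sequences are already listed in file order with one branch length each, and that a sequence whose branch exceeds the setting opens the next subfamily. The function names and that reading of the rule are assumptions.

```python
def split_subfamilies(ordered, max_dissimilarity):
    """Group (name, branch_length) pairs into subfamilies in file order.

    A branch longer than max_dissimilarity starts a new group (assumed rule).
    """
    groups = [[]]
    for name, branch in ordered:
        if branch > max_dissimilarity and groups[-1]:
            groups.append([])
        groups[-1].append(name)
    return groups

def subfamily_filenames(base, count):
    """Build names of the form base.sub-number.fasta, numbered from 1."""
    return ["%s.%d.fasta" % (base, i) for i in range(1, count + 1)]
```

Four sequences with branch lengths 0.1, 0.2, 0.9 and 0.1 and a setting of 0.5 would be split into two groups of two, written to `uspA.1.fasta' and `uspA.2.fasta' for a family file named `uspA.fasta'.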
Next, clustalw realigns the sequences in each subfamily file `*.number.fasta'. The oligonucleotides are designed from these sub-alignments.
The Blimps package's mablock program finds highly conserved blocks within the subfamily protein alignment (see section References). The blocks are ungapped regions of similarity, often separated by gapped regions. The blocks are found by two systems in parallel. One system searches for protein motifs and extends outward until similarity disappears. The other is an iterative approach that tries all combinations of position and length of the blocks (see section References). Using two approaches in parallel double-checks that each block is real. The blocks are saved in the file `*.blks'. Another Blimps program, codehop, reads the `*.blks' file. codehop forms a consensus protein sequence and back-translates it into nucleic acid using the codon usage table selected earlier (see section Codon Usage table). Oligonucleotides are designed on the back-translated nucleic-acid consensus sequence.
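Back-translation with a codon usage table can be sketched as follows. This is an illustration of the general idea, not codehop's algorithm: it simply picks the most frequently used codon for each amino acid. The codons listed are from the standard genetic code, but the usage fractions are invented for the example and the table covers only four amino acids.

```python
# Illustrative table only: the usage fractions below are made up.
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "W": {"TGG": 1.00},
    "F": {"TTT": 0.58, "TTC": 0.42},
}

def back_translate(protein, usage=CODON_USAGE):
    """Replace each amino acid with its most frequently used codon."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)
```

With this table, back_translate("MKWF") gives "ATGAAATGGTTT". A different codon usage table can change which codons are chosen, which is why the table is a setting in the interface.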
The oligonucleotides designed by codehop are consensus-degenerate sequences. The 3' region is a short (11-12 bp) degenerate core based on 3 or 4 highly conserved amino acids. The 5' region is a longer (18-25 bp) consensus clamp (see section References). The degenerate core helps the oligonucleotide pair with a range of related genes. The consensus clamp reduces the degeneracy of the oligonucleotide population. If the degeneracy is too high, any single oligonucleotide sequence may be too dilute to produce a detectable signal in hybridization reactions even if there is a match. The results from codehop are saved in `*.codehop' files. These files are translated into HTML by htmlize-codehop and given the name `*.codehop.html'. They can be viewed with any web browser to see the consensus block sequences and the locations of the oligonucleotides against those blocks.
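The dilution argument is easier to see with numbers. The degeneracy of an oligonucleotide written in IUPAC nucleotide codes is the product of the number of bases each position can take, as in this small sketch. The function name degeneracy is hypothetical and this is not how codehop reports the value, only the underlying arithmetic.

```python
# Number of bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(oligo):
    """Count the distinct sequences represented by a degenerate oligo."""
    total = 1
    for base in oligo.upper():
        total *= len(IUPAC[base])
    return total
```

For example, degeneracy("GGNAAY") is 4 x 2 = 8, so each individual sequence makes up one eighth of the pool. A fully consensus (non-degenerate) clamp contributes a factor of 1, which is why only the short core adds to the total.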
The next step is to analyze the oligonucleotides produced by codehop electronically. This is done by searching the oligonucleotide sequences against Genbank flat files of nucleic acid sequences. The deodas.py interface allows the selection of the Genbank flat files, the minimum length, and the number of mismatches. Sequences under the minimum length are skipped by deodasdesigner.py at this step. The fuzznuc program, from EMBOSS, searches each sequence against the genes in the specified Genbank flat files, including complementary sequences, and the results are written to files named fasta-file-base-name.subfamily-number.probe-number.Genbank-flat-file.fuzznuc (EMBOSS, submitted for publication) (see section References). The data output by fuzznuc includes each matching target's locus, description, accession number, starting position of the match, mismatches, and matching sequence. All of this data is read and formatted into the structured query language (SQL) file fasta-file-base-name.sql. This file is loaded into the Postgresql database selected in the interface (see section Results databases). An SQL file is written, instead of loading the data directly into the Postgresql database, so that it can be loaded into another DBMS if the laboratory already uses one.
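What a mismatch-tolerant search on both strands means can be shown with a brute-force sketch. This is not fuzznuc and does not call it; the function names are hypothetical, only the unambiguous bases A, C, G and T are handled, and positions are reported 1-based on the target as given.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_matches(oligo, target, max_mismatches):
    """Return (start, strand, mismatches) for each window within the limit.

    The '-' strand is searched by sliding the oligo's reverse complement
    along the target.
    """
    hits = []
    for strand, pattern in (("+", oligo), ("-", reverse_complement(oligo))):
        for start in range(len(target) - len(pattern) + 1):
            window = target[start:start + len(pattern)]
            mismatches = sum(a != b for a, b in zip(pattern, window))
            if mismatches <= max_mismatches:
                hits.append((start + 1, strand, mismatches))
    return hits
```

Searching the oligonucleotide ACGTG against the target GGACGTGTTCACCTAA finds one hit with no mismatches allowed, and three hits when one mismatch is allowed. Raising the Maximum mismatches setting therefore widens the set of targets an oligonucleotide is reported against.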
The message `Degenerate oligonucleotide design process finished' is displayed when the design process is finished. At this point the results can be analyzed interactively.
The analysis program, deodasquery, can be started from the deodas.py interface, where it will automatically connect to the selected Postgresql database, or it can be started by itself. A different database is connected by typing its name in the Connect to database entry and pressing enter; the current database is automatically disconnected. The way to analyze the results is to search for keywords in the Search target descriptions entry. This finds oligonucleotides matching interesting targets. The next step is to search for the name of an interesting oligonucleotide to find all the targets it matched. This screens out oligonucleotides matching targets other than the desired one. Prescreening the designed oligonucleotides against gene databases can be used to reduce cross-hybridization (see section References).
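The two-step screening described here corresponds to two simple queries. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgresql so that it runs without a server. The table name matches, its columns, and the sample rows are all invented for illustration; the real DEODAS schema is not described in this chapter.

```python
import sqlite3

def demo_connection():
    """Build an in-memory database with a hypothetical 'matches' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches (oligo_name TEXT, accession TEXT, description TEXT)")
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", [
        ("uspA.1.3", "ACC0001", "universal stress protein A"),  # made-up rows
        ("uspA.1.3", "ACC0002", "hypothetical protein"),
        ("uspA.2.1", "ACC0003", "universal stress protein A"),
    ])
    return conn

def search_descriptions(conn, keyword):
    """Step 1: names of oligonucleotides whose targets mention the keyword."""
    rows = conn.execute(
        "SELECT DISTINCT oligo_name FROM matches "
        "WHERE description LIKE ? ORDER BY oligo_name",
        ("%" + keyword + "%",))
    return [row[0] for row in rows]

def targets_of(conn, oligo_name):
    """Step 2: every target accession matched by one oligonucleotide."""
    rows = conn.execute(
        "SELECT accession FROM matches WHERE oligo_name = ? ORDER BY accession",
        (oligo_name,))
    return [row[0] for row in rows]
```

In this toy data, searching for "stress" returns both oligonucleotides, but listing the targets of uspA.1.3 shows that it also matched an unrelated entry, which is the kind of cross-match the second step is meant to reveal.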
See figure 2 next page, Deodasquery interface
Once an oligonucleotide is selected for use, its name is added to a list kept in the database by entering the name in the Select an oligonucleotide name entry. Names that don't exist in the database can't be added to the list, so after entering a name press the List selections button to make sure the name was entered correctly. This button lists all selected sequences with information including name, sequence, block, degeneracy, melting temperature, and length (GC content will be added later). CODEHOP calculates the melting temperature from the melting temperature of the least stable oligonucleotide-template pair, as in the method by Rychlik (see section References). The text box listing the results can be printed using the Print button. Oligonucleotides can be deleted from the selected list with the Deselect an oligonucleotide entry. The deodasquery program includes pop-up help boxes, so try running the program with results to discover more about how it works.