Chapter 5.
Osprey: Improving Spotted Oligo Microarrays
The Importance of Design Parameters
Contributed by Paul Gordon, Sun Center of Excellence for Visual Genomics, University of Calgary
Gordon PM, Sensen CW. Osprey: a comprehensive tool employing
novel methods for the design of oligonucleotides for DNA sequencing and
microarrays. Nucleic Acids Res. 2004 Sep 29;32(17):e133.
While it is still fairly common to design oligo PCR primers manually
using the so-called Wallace Rule (each G or C contributes 4°C and each
A or T contributes 2°C, summed to estimate the melting temperature),
more accurate formulae exist that closely model the thermodynamics of
nucleotide binding. The need to calculate large numbers of primers for
genomic sequencing and the growing use of microarrays have led to the
development of increasingly sophisticated algorithms to improve the
automation of oligo design.
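As a small illustration of the Wallace Rule described above, the following Python sketch sums 4°C per G/C and 2°C per A/T. The function name `wallace_tm` is chosen here for illustration and is not part of Osprey.

```python
def wallace_tm(oligo: str) -> int:
    """Estimate the melting temperature (in °C) with the Wallace Rule."""
    seq = oligo.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at != len(seq):
        raise ValueError("oligo must contain only A, C, G and T")
    # 4 °C for every G or C, 2 °C for every A or T
    return 4 * gc + 2 * at


if __name__ == "__main__":
    # A 20-mer with 10 G/C and 10 A/T bases
    print(wallace_tm("ACGTACGTACGTACGTACGT"))
```

The rule looks only at base composition, so two oligos with the same base counts but different base order receive the same estimate. The nearest neighbor models discussed below address this limitation.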
Non-target binding can render sequencing reactions unusable and give
false mRNA expression level readings in microarrays.
Bioinformaticians have approached this problem in a variety of
ways. Oligodb filters out low-complexity regions using dustn
without checking whether the sequences are repeated. ProbeWiz
explicitly disables filtering, while the manuals of other systems do
not document this aspect of the computation. Most programs use a simple
BLAST search to filter secondary binding based on percent mismatch, but
this method has disadvantages: short sequence stretches with evenly
spaced mismatches may not be found because of the heuristic nature of
BLAST. Even if these methods could find all matches, the Sarani
documentation shows that the duplex melting temperatures of two 20-base
targets against the same oligo sequence can differ by 20°C when both
targets have only two mismatches. In addition, high-GC regions bind
with much higher energy than low-GC regions of similar length. The
number of pairwise matches is therefore not necessarily a good measure
of melting thermodynamics.
SantaLucia's "unified" free energy parameter model was derived
from the unification of previously described nearest neighbor (NN)
methods, and is generally considered the best model yet of DNA binding
thermodynamics for melting temperature and duplex stability. The NN
model assumes that summing the interaction energies of adjacent
nucleotides on a strand is the best predictor of the whole duplex's
stability. Popular programs such as Primer3 are based on older NN
models that often work, but can deviate significantly (6°C) from the
unified model, causing problems for experiments like microarrays, where
achieving a uniform melting temperature is very desirable.
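The NN summation can be sketched in a few lines of Python. This is a minimal illustration, not Osprey's implementation: the free energy values (kcal/mol at 37°C) are transcribed from the unified parameter table in SantaLucia (1998) and should be checked against the paper before being relied on, and the function names are chosen here for illustration.

```python
# Unified NN free energies at 37 °C in kcal/mol, transcribed from
# SantaLucia (1998); verify against the paper before relying on them.
_UNIQUE_STACKS = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


# Expand the 10 unique stacks to all 16 dimers (e.g. TT shares AA's value).
NN_DG = {}
for _dimer, _dg in _UNIQUE_STACKS.items():
    NN_DG[_dimer] = _dg
    NN_DG[reverse_complement(_dimer)] = _dg

INIT_TERMINAL_GC = 0.98  # initiation term for a duplex end that is G or C
INIT_TERMINAL_AT = 1.03  # initiation term for a duplex end that is A or T
SYMMETRY = 0.43          # correction for self-complementary duplexes


def duplex_delta_g(oligo: str) -> float:
    """Sum NN stacking energies for an oligo bound to its perfect complement."""
    seq = oligo.upper()
    if len(seq) < 2 or set(seq) - set("ACGT"):
        raise ValueError("need at least two bases, A/C/G/T only")
    total = sum(NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1))
    for end in (seq[0], seq[-1]):
        total += INIT_TERMINAL_GC if end in "GC" else INIT_TERMINAL_AT
    if seq == reverse_complement(seq):
        total += SYMMETRY
    return round(total, 2)


if __name__ == "__main__":
    print(duplex_delta_g("CGTTGA"))
```

Unlike a composition-only rule, this sum depends on the order of the bases, which is why two oligos with identical base counts can differ in predicted stability.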
Osprey: Simplifying and Improving Large-Scale Design
Osprey is
new software that includes automated techniques to provide
higher quality oligos with less human intervention. It also introduces
the novel use of Position Specific Scoring Matrices (PSSMs) to encode
the free energy model, improving the specificity and sensitivity of
oligo secondary binding searches.
With regard to Web interfaces for oligonucleotide design, there are
many choices. Many of the most popular are wrappers around Primer3 and
can have many parameters. SantaLucia provides a thermodynamics
calculation service called HyTher (based on the unified model), which
works for single sequences. Osprey attempts to simplify the process for
the user: it uses the unified model thermodynamics, and most parameters
are calculated automatically from the input sequence (such as the ideal
oligo length for a given melting temperature, or vice versa, as in
Figure 1 below).

Figure 1. Osprey Web interface for microarray oligo design.
Optimal Oligonucleotide Criteria
Osprey incorporates a
series of fitness tests, in the following
order: melting temperature, dimer potential, hairpin potential, and
secondary (non-specific) binding. Ordered from computationally simple
to computationally expensive, the tests filter out unsuitable
candidates as quickly as possible.
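The cheap-to-expensive ordering can be sketched as a short-circuiting cascade in Python. Everything here is hypothetical scaffolding: the melting temperature check uses the simple Wallace Rule as a stand-in, and the remaining three checks are placeholders that always pass, since Osprey's actual tests are not described in enough detail here to reproduce.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[str], bool]]


def tm_in_range(oligo: str, low: int = 50, high: int = 70) -> bool:
    """Stand-in melting temperature test (Wallace Rule, not the unified model)."""
    seq = oligo.upper()
    tm = 4 * (seq.count("G") + seq.count("C")) + 2 * (seq.count("A") + seq.count("T"))
    return low <= tm <= high


def placeholder_pass(oligo: str) -> bool:
    """Placeholder for a test that is not implemented in this sketch."""
    return True


# Ordered from computationally simple to computationally expensive.
CHECKS: List[Check] = [
    ("melting temperature", tm_in_range),
    ("dimer potential", placeholder_pass),
    ("hairpin potential", placeholder_pass),
    ("secondary binding", placeholder_pass),
]


def first_failure(oligo: str, checks: Sequence[Check]) -> Optional[str]:
    """Return the name of the first failed test, or None if all pass.

    Later (more expensive) tests are never run once an earlier one fails.
    """
    for name, passes in checks:
        if not passes(oligo):
            return name
    return None


def screen(candidates: Sequence[str], checks: Sequence[Check]) -> List[str]:
    """Keep only the candidates that pass every test."""
    return [oligo for oligo in candidates if first_failure(oligo, checks) is None]


if __name__ == "__main__":
    print(screen(["AAAA", "ACGTACGTACGTACGTACGT"], CHECKS))
```

Because `first_failure` returns as soon as one test fails, a candidate rejected on melting temperature never reaches the secondary binding search.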
For prokaryotes, a random hexamer nucleotide mixture is typically
used to prime reverse transcription to cDNAs from the sample mRNA,
placing little constraint on the location of oligonucleotide binding.
For eukaryotic microarrays, a 3' hybridization site bias is maintained,
since poly(dT) is used to prime reverse transcription starting at the
gene's 3' mRNA poly(A) tail. Checks for secondary binding are
restricted to transcribed sequences in the genome.
A program from the popular Mfold package, quikfold, can be used to
confirm the absence of significant secondary structure. Osprey is
configured to use this check by default, similar to other oligo design
programs including OligoArray.
Removing Sequence Redundancy Automatically
Using
the MegaBLAST program from the BLAST package, all repetitive
elements larger than a user-defined threshold are quickly (less than
one minute for 3800 Sulfolobus genes) identified in
the query
sequences. For whole-genome analysis, the query file and the database
are the same. If the user is iteratively searching for oligos using the
"rejects" from a previous run, the database remains the whole genome,
while the query consists of the sequences that do not yet have a
suitable
candidate. In either case, Osprey filters the query down to unique
sequence, plus one copy of each repetitive section. This setup allows
the secondary binding checks to be performed without interference from
multi-copy elements. No user intervention or preprocessing of the
dataset is required, facilitating the use of Osprey with redundant data
derived from GenBank and other sources.
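The idea of keeping unique sequence plus one copy of each repeat can be sketched as follows. This is an assumption-laden illustration, not Osprey's code: it assumes repeat hits have already been parsed from the MegaBLAST output into hypothetical tuples with 0-based, half-open coordinates, and it masks the later copy of each hit with "N" rather than editing the sequences in any other way.

```python
from typing import Dict, Iterable, Tuple

# (query_id, q_start, q_end, subject_id, s_start, s_end); 0-based, half-open.
Hit = Tuple[str, int, int, str, int, int]


def mask_repeat_copies(seqs: Dict[str, str], hits: Iterable[Hit],
                       mask_char: str = "N") -> Dict[str, str]:
    """Mask all but the first copy of each repeated region.

    "First" means the copy that sorts earliest by (sequence id, start),
    an arbitrary but deterministic choice made for this sketch.
    """
    masked = {name: list(seq) for name, seq in seqs.items()}
    for q_id, q_start, q_end, s_id, s_start, s_end in hits:
        query_copy = (q_id, q_start, q_end)
        subject_copy = (s_id, s_start, s_end)
        if query_copy == subject_copy:
            continue  # a sequence trivially matches itself
        # Keep the earlier copy and mask the later one, so that hits
        # reported in either direction give the same result.
        name, start, end = max(query_copy, subject_copy)
        for i in range(start, end):
            masked[name][i] = mask_char
    return {name: "".join(chars) for name, chars in masked.items()}


if __name__ == "__main__":
    genes = {"geneA": "ACGTACGTTTGACC", "geneB": "GGGACGTACGTCCC"}
    repeat_hits = [("geneB", 3, 11, "geneA", 0, 8)]
    print(mask_repeat_copies(genes, repeat_hits))
```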
PSSMs: Improving Non-target Binding Checks
Osprey introduces a novel method of calculating and accelerating
secondary binding checks using Position Specific Scoring Matrices
(PSSMs). The models are compatible with the method established by
Gribskov et al. Appropriately setting the position specific
scores allows the raw profile score to encode the significant caloric
(thermodynamic) values of the DNA binding. The interface available on
the Osprey Web site uses the DeCypher bioinformatics accelerator,
funded by the Genome Canada Bioinformatics Platform, to provide timely
PSSM searches.
This
representation reflects the thermodynamics of oligo duplexes,
and compensates for dangling ends, as well as interspersed mismatches
and bulges. Such a search is advantageous over a BLAST-type search
because, unlike pairwise alignments, the match, mismatch and gap scores
are context sensitive (following the NN model). It also overcomes
inherent limitations of the alignment heuristic when dealing with short
oligo sequences, such as missing DNA matches with gaps (duplex bulges),
and interspersed mismatches. Due to these limitations, oligos where no
apparent secondary binding was identified with heuristic pairwise
alignments may in fact show some using profiles (increased
sensitivity). Also, candidates rejected due to a percentage similarity
cutoff exceeded in BLAST may in fact not bind strongly to those sites
when the NN thermodynamics are calculated (improved specificity).
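To make the idea of context-sensitive scores concrete, here is a deliberately simplified Python sketch. It is not Osprey's matrix construction: each matching position is credited with half of the NN stacking energy of its two neighbouring dimers, mismatches score zero, gaps (bulges) and dangling ends are ignored, and candidate sites are assumed to be given in the same sense as the oligo. The NN values are transcribed from SantaLucia (1998) and should be verified before use.

```python
# Unified NN free energies (kcal/mol, 37 °C) transcribed from SantaLucia
# (1998); verify against the paper before relying on them.
_UNIQUE_STACKS = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
NN_DG = {}
for _dimer, _dg in _UNIQUE_STACKS.items():
    NN_DG[_dimer] = _dg
    NN_DG[_dimer.translate(_COMPLEMENT)[::-1]] = _dg


def build_pssm(oligo):
    """One column per oligo position; only the matching base scores > 0."""
    seq = oligo.upper()
    stacks = [-NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]
    pssm = []
    for i, base in enumerate(seq):
        left = stacks[i - 1] if i > 0 else 0.0
        right = stacks[i] if i < len(stacks) else 0.0
        column = {b: 0.0 for b in "ACGT"}
        column[base] = 0.5 * (left + right)  # half of each flanking stack
        pssm.append(column)
    return pssm


def score_site(pssm, site):
    """Raw profile score of one ungapped site of the same length."""
    return sum(col.get(base, 0.0) for col, base in zip(pssm, site.upper()))


def best_site(pssm, target):
    """Slide the profile along the target; return (best score, offset)."""
    width = len(pssm)
    if len(target) < width:
        raise ValueError("target is shorter than the profile")
    return max((score_site(pssm, target[i:i + width]), i)
               for i in range(len(target) - width + 1))


if __name__ == "__main__":
    profile = build_pssm("GCGCGCATATAT")
    # Two sites, each with exactly two mismatches against the oligo:
    print(score_site(profile, "GCAAGCATATAT"))  # mismatches in the GC-rich half
    print(score_site(profile, "GCGCGCATCCAT"))  # mismatches in the AT-rich half
```

The two example sites have the same percent identity, yet the profile scores them differently because the lost positions carry different stacking energies. A percent-mismatch cutoff cannot make this distinction.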
Availability
A
manuscript detailing Osprey's implementation has been submitted for
publication. Upon acceptance,
the source code will be freely available to academic users. Please feel
free to use the Osprey Web Interface (http://osprey.ucalgary.ca) and provide feedback
on bugs and areas for improvement!
Suggested readings on oligonucleotide thermodynamics and PSSM searches:
- Eddy, S. R. (1998) Profile
Hidden Markov Models. Bioinformatics 14:755-763.
- Gribskov M., McLachlan A., Eisenberg D. (1987) Profile
analysis: detection of distantly related proteins. Proc Natl Acad
Sci U S A 84(13):4355-4358.
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD,
Madore SJ. (2000) Assessment
of the sensitivity and specificity of oligonucleotide (50mer)
microarrays. Nucleic Acids Research 28:4552-4557.
- Li F., Stormo G.D. (2001) Selection
of optimal DNA oligos for gene expression arrays. Bioinformatics
17(11):1067-1076.
- Nielsen H.B., Knudsen, S. (2002) Avoiding
cross hybridization by choosing nonredundant targets on cDNA arrays.
Bioinformatics 18(2):321-322.
- SantaLucia J. (1998) A
unified view of polymer, dumbbell, and oligonucleotide DNA
nearest-neighbour thermodynamics. Proceedings of the National
Academy of Sciences 95:1460-1465.
- Zuker, M. (2003) Mfold
web server for nucleic acid folding and hybridization prediction.
Nucleic Acids Res. 31(13), 3406-3415.
Chapter 6. BioMOBY
Contributed by Ian J. Forsythe,
Canadian Bioinformatics Help Desk, University of Alberta
Wilkinson MD, Links M.
BioMOBY: an open source biological web services proposal. Brief
Bioinform. 2002 Dec;3(4):331-41.
Many biologists face a plethora of web service providers on a daily
basis: NCBI BLAST, PubMed, SwissProt, PROSITE, PRINTS, ProDom,
Pfam, TIGRFAM, SMART, to name a few. Most web service providers present
colourful, graphics-rich entry forms, each with its own data input
requirements. With luck, one can locate help pages that offer tips on
how to use the entry forms. Often, it is unclear what types of services
the provider offers or how to use them. It can be even more difficult
to work out how to use the services from within a computer program
instead of through a web form, or how to take the output from one
service provider and send it to another. Many of these steps can be
automated using MOBY, removing the frustration of having to work with
customized, often changing interfaces from hundreds of different
service providers.
MOBY simplifies the web services creation process by employing a simple
application programming interface (API) and providing a central
registry, called MOBY-Central, that informs users about all of the
services that are available (Figure 1).
If one has an algorithm,
a program, or information that they want to provide, they can make it
available using MOBY. Instead of having to create an elaborate web
form, the provider just registers their URL with MOBY and provides a MOBY-compliant
interface. MOBY-Central is analogous to a phone book. If the user
inquires about a certain data type (e.g. a GenBank accession number or
a FASTA-formatted sequence), MOBY lists all of the services that
require this type of input. MOBY appears to have been designed with
biologists in mind. MOBY objects consist of lightweight XML, providing
a hierarchy of input/output objects that are particularly well suited
to bioinformatics. Objects are provided for storing sequences, BLAST
results, and other common biological data types. MOBY's extensibility
allows biologists to add their own custom data types.
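The phone book analogy can be illustrated with a toy in-memory registry in Python. This is not the BioMOBY API: the class names, data type labels and services below are all hypothetical, and the real system exchanges XML objects between separate machines rather than calling local functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Service:
    """A hypothetical service description: what it takes and what it returns."""
    name: str
    input_type: str
    output_type: str
    run: Callable[[str], str]


class Registry:
    """Toy stand-in for a central registry, indexed by input data type."""

    def __init__(self) -> None:
        self._by_input: Dict[str, List[Service]] = {}

    def register(self, service: Service) -> None:
        # A provider registers its service once.
        self._by_input.setdefault(service.input_type, []).append(service)

    def find_by_input(self, data_type: str) -> List[Service]:
        # A client asks: which services accept the data I have in hand?
        return list(self._by_input.get(data_type, []))


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


registry = Registry()
registry.register(Service("reverse_complement", "DNASequence", "DNASequence",
                          reverse_complement))
registry.register(Service("sequence_length", "DNASequence", "Integer",
                          lambda seq: str(len(seq))))

if __name__ == "__main__":
    for service in registry.find_by_input("DNASequence"):
        print(service.name, "->", service.run("AACG"))
```

The client never needs to know in advance which providers exist; it only needs to know the type of the data it holds.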
BioMOBY is an international, Open Source web service integration
project, sponsored in part by Genome Prairie and Genome Canada. It aims
to create an architecture for the discovery and distribution of
biological data across the internet. BioMOBY involves biological data
hosts, biological data service providers, and computer programmers.
BioMOBY seeks to integrate data from disparate biological data sources
via a central registry, MOBY-Central. Mark Wilkinson, a National
Bioinformatics Platform team member, leads the MOBY Services Branch of
the project. Lincoln Stein and Damian Gessler lead the Semantic MOBY
Branch of the project.
Figure 1. The dynamics of a MOBY transaction. Three computers are
involved: MOBY-Central, MOBY-Server, and MOBY-Client. MOBY Servers
register (once) their services with MOBY-Central. A MOBY client has a
piece of data in hand, and queries MOBY-Central for the services that
are able to use that piece of data as input. MOBY-Central returns one
or more Web Services Description Language (WSDL) service descriptions
for the applicable services. The client then chooses one, sends its
data to the service, and is returned another form of data; the
transaction is done according to the specifications in the WSDL
document. The data passed is in the form of MOBY Objects: lightweight
(generally) XML documents that conform to MOBY object descriptions. Not
shown in this diagram is the ability to query MOBY-Central based on a
service-type ontology or an output-type ontology, in addition to the
(shown) input-type query.
Source: http://biomoby.org