An Introduction to Bioinformatics
Fall Semester 2003. Lectures - Tuesdays and Thursdays from
9:30 to 10:45 AM, in Dirac 499; optional Laboratory Section -
Tuesdays from 3:45 to 5:45 PM in Conradi 223.
Credit hours: three (or two without lab)
This course introduces the emerging topic of bioinformatics. It is
designed primarily for life science students who do not have an extensive
background in mathematics, statistics, or computer science but who are
interested in survey-level knowledge of bioinformatics and its techniques.
The course is also suitable for students in mathematics, computer science,
or statistics who wish to learn how methods drawn from their discipline
are applied in bioinformatics. It is a 'hands-on,' 'how-to' lecture,
demonstration, and laboratory (optional) course that includes both
sequence- and structure-based methods.
The lecture meets twice weekly for 75 minutes apiece; the
optional lab component meets once weekly for two hours.
Students have accounts on the Mendel biocomputing server
for access to computational biology databases and
software. Steve Thompson coordinates the course,
delivers several lectures, and instructs in the
laboratory. Guest faculty members lecture in
their individual areas of expertise and knowledge.
Participating Faculty
Instructors
(lecture allocations listed below)
| David Swofford | CSIT/Biology | course director |
| Steve Thompson | CSIT | course coordinator |
| David Banks | CSIT/Computer Science | biosciences visualization |
| Peter Beerli | CSIT/Biology | population evolution |
| Michael Chapman | IMB/Chemistry | AA attributes and secondary structure |
| Ross Ellington | IMB/Biology | homology modeling |
| Hong Li | IMB/Chemistry | nucleic acid structure |
| Jack Quine | Mathematics | alignment algorithms |
| David Swofford | CSIT/Biology | molecular evolution |
| Robert van Engelen | CSIT/Computer Science | computer science fundamentals |
| Jim Wilgenbusch | CSIT/Biology | molecular evolution |
| Huan-Xiang Zhou | CSIT/Physics | molecular mechanics/dynamics |
No textbook is required,
but I have some suggestions. "Developing
Bioinformatics Computer Skills," by Cynthia
Gibas and Per Jambeck is a practical, first-pass, introduction
to the field, a part of the O'Reilly series; and
"Fundamental Concepts of
Bioinformatics," by Dan E. Krane and Michael L. Raymer
is a good start on the basics of computational biology theory.
"Introduction to Bioinformatics: A Theoretical and
Practical Approach," edited by Stephen A. Krawetz
and David D. Womble, is an extremely comprehensive,
perhaps overly so, review of the area, particularly sequence analysis,
from biochemistry and cellular and molecular biology, through
the UNIX environment, to biocomputing software applications.
David W. Mount has an excellent, intermediate-level text,
"Bioinformatics: Sequence and Genome Analysis,"
with more depth into the algorithmic basis of sequence analysis.
Check out the
FSU Bookstore, Amazon.com, and
StudentMarket for
availability.
Organizational Meeting,
David Swofford:
Overview of Course Format and Content
(pdf
from Michael Chapman, Spring 2003).
Week 1, Tues. Aug. 26, 2003.
What have you gotten yourself into?
Lecture #1,
Robert van Engelen:
Introduction -- Computer Science and Biology
(pdf).
suggested reading from Krawetz and Womble:
chapters 1 and 13, and appendix 3;
from Krane and Raymer: appendix 1;
from Gibas and Jambeck: chapters 1 and 2.
Week 1, Tues. Aug. 26 (cont.), 2003.
I. Why are Computers Used?
What's the Advantage? Data-Mining!
- The tremendous growth of data, especially
DNA sequence data.
- The nature of the process, especially pattern
searching.
- The need for complex calculations and sorted
data subsets.
- The increased speed of the computer over
the human brain.
II. String Searches and
Their Uses.
- Because of huge databases -- need for methods
of inquiry and access.
- How do simple one-dimensional pattern text
searches work?
- How does a computer recognize a
"match?"
- What is a simple pattern --
regular pattern expressions and ambiguity?
- What type of information gathering is just
simple pattern searching?
- Searches of the databases for text strings,
e.g. entry, organism, or gene names, authors, etc. vs.
- Searches for primary sequence information,
e.g. restriction enzyme cut sites, regulatory patterns
("signals"), and searches to identify known
functional or structural patterns (motifs).
- e.g. PROSITE -- A Dictionary of Protein
Sites and Patterns.
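As a small illustration of the pattern searching outlined above, the sketch below uses Python's standard `re` module to locate a restriction site and to turn a PROSITE-style pattern into a regular expression. The helper names, the example sequences, and the simplified conversion (which ignores the `<` and `>` anchors) are assumptions made for illustration, not part of the course materials.

```python
import re

def find_sites(sequence, site):
    """Return 0-based start positions of every match, overlapping ones included."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    return [m.start() for m in re.finditer(f"(?={site})", sequence)]

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a Python regular expression.

    Handles x (any residue), [..] (allowed residues), {..} (forbidden
    residues), (n) or (n,m) repeats, and the '-' separators.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")
    # Forbidden-residue braces must be rewritten before repeat counts,
    # because repeat counts become braces in regex syntax.
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = regex.replace("(", "{").replace(")", "}")
    return regex

# EcoRI recognition site in a made-up DNA sequence.
ecori_hits = find_sites("AAGAATTCGGGAATTCTT", "GAATTC")

# N-glycosylation site pattern (PROSITE PS00001) in a made-up peptide.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
glyco_hits = find_sites("MKNGSAANPTL", glyco_regex)
```

An ambiguous nucleotide site can be written the same way with a character class, for example `GT[CT][AG]AC`.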
Week 1, Thurs. Aug. 28, 2003.
III. Advanced Techniques: Neural Nets,
Genetic Algorithms, and Supercomputers.
- Descriptions of the concepts and several
examples of application methods and derived results.
- What are neural nets and genetic algorithms
and how do they help solve biocomputing problems?
- What is a cluster; what is grid computing?
- What is supercomputing's influence and what
is massively parallel?
- What does the future hold for biocomputing?
Lecture #2,
Steve Thompson: Biological Databases
(PowerPoint
or pdf).
suggested reading from Krawetz and Womble:
chapters 11, 26, and 28 to p 469;
from Krane and Raymer: chapter 1;
from Gibas and Jambeck: chapters 6 and 13.
Week 2, Tues. Sept. 2, 2003.
I. Sequence Databases: Content and
Organization.
- Primarily 'flatfile' ASCII data, often
with binary indexing.
- What are primary sequences? What are sequence
databases?
- What information do they contain and how
is it organized?
- How is this information accessed?
- Who maintains them and where are they kept?
- Changes and effects -- history and
development over the years.
II. Structural Databases: Content and
Organization.
- Primarily 'flatfile' ASCII data, rarely binary
indices.
- What are structures? What are structural
databases? The Protein Data Bank (PDB).
- What information do they contain and how
is it organized?
- How is this information accessed?
- Who maintains them and where are they kept?
- Changes and effects -- history and
development over the years.
Week 2, Thurs. Sept. 4, 2003.
III. With guest
Misha Taylor:
Higher Order Databases and Database Structure
(PowerPoint or
pdf).
- Seldom 'simple' databases, often relational
and/or object oriented.
- Examples include SQL, Oracle, and WWW
databases like NCBI's ASN.1 Entrez database, metabolic pathway
databases, genome databases such as GDB, etc.
Lecture #3,
Steve Thompson: Dot Matrix Methods
(PowerPoint
or pdf).
suggested reading from Krane and Raymer: chapter 2, pp 33-34.
Week 3, Tues. Sept. 9, 2003.
- Why? Provides Gestalt of all comparisons
possible.
- What approaches are used? Windowing,
filters, and word techniques.
- What can be determined with dot matrixing?
Repeats and elements 'less than best.'
- What is the significance of the parameters
employed -- very important!
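The windowing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the program used in class: the function names and the `window` and `stringency` parameters are assumed names for the two settings that most affect a dot plot.

```python
def dot_matrix(seq_a, seq_b, window=3, stringency=3):
    """Return (i, j) points where the windows starting at seq_a[i] and
    seq_b[j] share at least `stringency` identical positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots

def render(seq_a, seq_b, dots):
    """Draw the dot plot as text, seq_b across the top and seq_a down the side."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

# Comparing a sequence with itself: the main diagonal is the identity,
# and off-diagonal dots reveal the internal repeat.
repeat_dots = dot_matrix("ACGTACGT", "ACGTACGT", window=4, stringency=4)
plot_text = render("ACGTACGT", "ACGTACGT", repeat_dots)
```

Lowering `stringency` relative to `window` admits imperfect matches, which adds sensitivity at the cost of more background noise.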
Lecture #4,
Jack Quine:
Alignments and Substitution Matrices
(pdf).
suggested reading from Krawetz and Womble: chapter 27, pp 443-449,
and chapter 31, pp 539-544;
from Krane and Raymer: chapter 2, pp 35-47;
from Gibas and Jambeck: chapter 7, through p 181.
Week 3, Thurs. Sept. 11, 2003.
I. Pair-wise Sequence Alignment.
- What are pair-wise alignments? How are they
made?
- What sort of approaches are used?
- What is the dynamic programming
algorithm?
- Scoring matrix --> path matrix -->
traceback --> alignment.
- What are gap penalties and what difference
do they make?
- What is the difference between local and
global alignment?
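The scoring matrix, path matrix, traceback, and alignment steps listed above can be sketched as a small global alignment routine. This is a minimal sketch assuming a simple match/mismatch score and a linear gap penalty; the function name and default scores are illustrative choices, and real programs use substitution matrices and affine gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # The first row and column hold the cost of leading gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the path matrix: each cell keeps the best of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Traceback from the bottom-right corner to the origin.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

result = global_align("ACGT", "AGT")
```

A local alignment differs mainly in that cell scores are floored at zero and the traceback starts from the highest-scoring cell rather than the corner.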
Week 4, Tues. Sept. 16, 2003.
II. Substitution Matrices.
- What are scoring matrices and
log-odds matrices and what ones are available?
- What is the difference between the Dayhoff,
BLOSUM, Gonnet, and other available matrices?
- Why one and not another? Which are
appropriate for which situations?
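To make the log-odds idea concrete, the sketch below computes scores from target pair frequencies and background frequencies. The two-letter alphabet and all frequencies are made-up toy numbers, and `pair_freq` is assumed to hold ordered-pair frequencies, so the expected frequency of a pair is simply the product of the two background frequencies.

```python
import math

def log_odds(pair_freq, background, scale=2.0):
    """Score each pair as scale * log2(observed / expected), rounded.

    With scale=2.0 the scores are in half-bit units, the unit in which
    BLOSUM62 is usually reported.
    """
    scores = {}
    for (a, b), observed in pair_freq.items():
        expected = background[a] * background[b]
        scores[(a, b)] = round(scale * math.log2(observed / expected))
    return scores

# Toy data: identities are seen more often than chance, mismatches less.
background = {"L": 0.5, "K": 0.5}
pair_freq = {
    ("L", "L"): 0.4, ("K", "K"): 0.4,
    ("L", "K"): 0.1, ("K", "L"): 0.1,
}
toy_scores = log_odds(pair_freq, background)
```

Pairs observed more often than expected by chance receive positive scores, and pairs observed less often receive negative scores.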
Lecture #5,
Steve Thompson:
Multiple Sequence Alignment -- Methods and Problems
(PowerPoint or
pdf).
suggested reading from Krawetz and Womble: chapter 31, pp 544-558,
and chapter 33, pp 602-611;
from Krane and Raymer: chapter 2, p 52;
from Gibas and Jambeck: chapter 8, through p 198.
Week 4, Thurs. Sept. 18, 2003.
- Global versus local methods; limitations
of all methods.
- Manual alignment vs. automated, progressive
pairwise alignment.
- Refinement, refinement, refinement . . . .
- The nature of a consensus; how are they
generated?
Lecture #6,
Steve Thompson:
Database Searching -- Old Methods and New
Developments
(PowerPoint or
pdf).
suggested reading from Krawetz and Womble: chapter 27, pp 449-461,
chapter 28, pp 469-487;
from Krane and Raymer: chapter 2, pp 48-51;
from Gibas and Jambeck: chapter 7, remainder.
Week 5, Tues. Sept. 23, 2003.
I. Heuristic Database
Searching.
- What sequences in the databases are similar
to yours?
- Methods of database searching, traditional
and newer algorithms.
- What do all of these words mean:
"heuristics,"
"hashing," "word," "k-tup,"
"fast," "blast," etc?
- How do the algorithms work; how are they
different; how does parameter choice affect the
outcome?
- DNA vs. protein searches, which and
why?
- What program do I use for what
situation?
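The "hashing" and "word" (k-tup) vocabulary above can be illustrated with a short sketch: index every word of length k in a database sequence, look up the query's words, and count hits per diagonal. This is only the first step that word-based heuristics share, not an implementation of FASTA or BLAST; the function names and example sequences are assumptions for illustration.

```python
from collections import defaultdict

def build_word_index(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index

def best_diagonal(query, subject, k=3):
    """Return (diagonal, word_hits) for the diagonal with the most shared words.

    The diagonal is subject position minus query position, so word hits
    on one diagonal line up without gaps. Returns None if no word is shared.
    """
    index = build_word_index(subject, k)
    diagonals = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            diagonals[s_pos - q_pos] += 1
    if not diagonals:
        return None
    return max(diagonals.items(), key=lambda item: item[1])

hit = best_diagonal("GATTACA", "CCGATTACACC", k=3)
```

A larger k makes the lookup faster and more selective but misses weaker similarities, which is one reason parameter choice affects the outcome.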
Week 5, Thurs. Sept. 25, 2003.
II.
Similarity Searching and Significance
(Spring
2003 pdf from
Bill Pearson!).
- What do the scores mean?
- What is the significance of the results?
- What are Monte Carlo methods, Z scores, and
expectations?
- How does any of this relate to "homology" and
'real-life' biology?
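One way to see what a Monte Carlo Z score means is to compute one. The sketch below scores a query against a subject, then against shuffled copies of the subject, and expresses the real score in standard deviations above the shuffled mean. The scoring scheme, the peptide, and the function names are illustrative assumptions; production tools use substitution matrices and, more commonly, extreme-value statistics.

```python
import random
import statistics

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def z_score(query, subject, shuffles=200, seed=0):
    """Z score of the real score against scores for shuffled subjects."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    residues = list(subject)
    background = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # keeps composition, destroys order
        background.append(local_score(query, "".join(residues)))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / spread

peptide = "MKTAYIAKQRQISFVKSHFSRQLE"  # made-up example sequence
z_self = z_score(peptide, peptide)
```

Shuffling preserves length and composition, so a high Z score indicates that the order of residues, not just their frequencies, drives the similarity.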
Lecture #7,
Bjarne Knudsen:
Remote Relationships, the Profile Methods -- à la Gribskov, MEME,
and HMMER
(pdf
and a biostatistics
pdf from Dr. Lei Li, Spring, 2002).
suggested reading from Krawetz and Womble:
chapters 22, 23, 29, 30, and pp 611-618 from chapter 33;
from Gibas and Jambeck: chapter 8, pp 205-214.
Week 6, Tues. Sept. 30, 2003.
- What are profiles, Hidden Markov Models,
and expectation maximizations?
- What is the nature of complex pattern searching,
and how are complex patterns derived?
- Two-dimensional, position specific weight
matrices built from multiple, related sequences.
- What data is used to generate such a pattern
and what programs are available for building them?
- How does this relate to structure and function
-- evolutionarily conserved regions are those of functional
importance and therefore necessarily constrained
structure!
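A position-specific weight matrix of the kind described above can be sketched as follows: count residues in each alignment column, add pseudocounts, convert to log-odds against a uniform background, and slide the matrix along a target. The aligned sites and the target are made-up DNA examples, and the function names and uniform background are assumptions for illustration.

```python
import math

def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build one log2-odds column per alignment position."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(
                ((column.count(letter) + pseudocount) / total) / background
            )
            for letter in alphabet
        })
    return pssm

def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(pssm)
    scored = [
        (sum(pssm[k][sequence[i + k]] for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Made-up aligned sites resembling a bacterial -10 promoter element.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pssm = build_pssm(sites)
score, start = best_hit(pssm, "GGCGTATAATGCGC")
```

Unlike a single consensus string, the matrix tolerates the variation actually seen in each column while still penalizing residues never observed there.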
Lecture #8,
Steve Thompson:
Nucleic Acid Sequence Characterization -- Genomics
(PowerPoint or
pdf).
suggested reading from Krawetz and Womble: chapters 24 and 25;
from Krane and Raymer: chapter 6;
from Gibas and Jambeck: chapter 11.
Week 6, Thurs. Oct. 2, 2003.
- What sort of information can be determined
from a primary sequence?
- Easy -- restriction digests and
associated mapping.
- Harder -- fragment assembly and genome
mapping.
- Very hard -- gene finding and sequence
annotation.
- This is a much more complex problem than
simple pattern searching.
- Recognizing protein coding regions:
through "content approaches" --
"nonrandomness," codon usage/preference, and URFs
vs. ORFs;
and through homology-based methods -- "signal
searches"
and the weight matrix, and similarity-based alignments;
and via "combined methods" on the Internet, e.g.
Grail and GeneID.
- Easy -- translation to peptides.
- Hard again -- genome comparisons.
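The "easy" end of this list, translation and open reading frame detection, can be sketched with the standard genetic code. This minimal example scans only the three forward frames of an uppercase, unambiguous DNA string; the function names and the example sequence are assumptions, and real gene finders add the content and signal statistics discussed above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(dna):
    """Translate complete codons of an uppercase DNA string."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_orfs(dna, min_codons=2):
    """Return (start, end, peptide) for ATG...stop ORFs in the forward frames.

    `end` is the index just past the stop codon; the peptide excludes it.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE[codon] == "*":
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None
    return orfs

orfs = find_orfs("CCATGGCTAAATGACC")
```

Scanning the reverse complement in the same way covers the other three reading frames.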
Lecture #9:
Molecular Evolution.
suggested reading from Krawetz and Womble:
chapter 7, and chapter 33, pp 622-633;
from Krane and Raymer: chapters 3, 4, and 5;
from Gibas and Jambeck: chapter 8, pp 199-205;
especially see "Molecular Systematics," editors
Hillis, Moritz, Mable, chapter 11.
Week 7, Tues. Oct. 7, 2003.
I.
David Swofford: Background
(pdf).
- What is molecular phylogenetics?
- What are its applications and why would
anyone care?
- What is the importance of your multiple
sequence alignment input?
- What other assumptions must be made for these
approaches to be valid?
Week 7, Thurs. Oct. 9, 2003.
II. David Swofford
(cont.): Methods
(pdf).
- Overview of methods --
- parsimony, distance,
maximum likelihood, Bayesian inference.
- How do these methods work?
- What are their strengths and
weaknesses?
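As one small example of how a distance method begins, the sketch below computes the proportion of differing sites between two aligned sequences and applies the Jukes-Cantor correction for multiple substitutions at the same site. Jukes-Cantor is chosen here only because it is the simplest substitution model; the function names are illustrative, and gaps are not handled.

```python
import math

def p_distance(a, b):
    """Proportion of aligned sites that differ (equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(a, b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# One difference in ten sites: the corrected distance exceeds the raw one.
d = jukes_cantor("ACGTACGTAC", "ACGTACGTAT")
```

A tree-building step such as neighbor joining would then operate on the matrix of these pairwise distances.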
Week 8, Tues. Oct. 14, 2003.
III. David Swofford
(cont.): Models.
- Choosing appropriate models and their impact
on the results.
- Accuracy, reliability, robustness, and
confidence.
- (boot-strapping and other statistical
tests).
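The resampling step behind bootstrapping is simple enough to sketch: draw alignment columns at random, with replacement, until the replicate is as long as the original. The function name and toy alignment are assumptions for illustration; in practice a tree is inferred from each replicate and the support for a clade is the fraction of replicate trees that contain it.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows paired."""
    n_sites = len(alignment[0])
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in sampled) for seq in alignment]

alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]  # made-up toy alignment
replicate = bootstrap_replicate(alignment, random.Random(1))
```

Because whole columns are drawn, each site keeps its pattern across taxa; only the weight given to each site changes between replicates.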
Week 8, Thurs. Oct. 16, 2003.
IV.
Jim Wilgenbusch: PAUP* demonstration
(pdf).
- How do you use the PAUP* package?
- An overview of the conventions, input
and output formats, methods available, and models used in the
package.
- Some fun . . . .
Lecture #10,
Peter Beerli:
Population Coalescence
(Peter Beerli's
lecture).
suggested reading from Krawetz and Womble: chapter 12.
Week 9, Tues. Oct. 21, 2003.
I. What is the Coalescence?
- Population level evolution and
the coalescence.
- Effective population size, exponential
growth rate, migration rate, and per-site recombination
rate.
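For intuition about the coalescent, the sketch below simulates the time to the most recent common ancestor of a sample under the simplest setting: a constant-size population with no migration or recombination. Time is in standard coalescent units (2N generations for a diploid population of effective size N), and the function names are illustrative. It is not how LAMARC works, only a picture of the process those programs build on.

```python
import random

def expected_tmrca(n):
    """Expected time to the common ancestor of n samples: 2 * (1 - 1/n)."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, rng):
    """Draw one genealogy depth: while k lineages remain, the wait for
    the next coalescence is exponential with rate k * (k - 1) / 2."""
    time = 0.0
    for k in range(n, 1, -1):
        time += rng.expovariate(k * (k - 1) / 2.0)
    return time

rng = random.Random(42)
draws = [simulate_tmrca(10, rng) for _ in range(5000)]
mean_tmrca = sum(draws) / len(draws)
```

Most of the expected depth comes from the final two lineages, which is why adding more samples quickly stops changing the age of the common ancestor.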
Week 9, Thurs. Oct. 23, 2003.
II. What is LAMARC?
- LAMARC -- Likelihood analysis
using the Metropolis Monte Carlo sampling technique.
MidTerm TakeHome Exam -
distributed Thurs. Oct. 23 - due Thurs. Oct. 30.
Available online in
pdf
format.
Lecture #11,
Hong Li:
Prediction of Nucleic Acid Secondary Structure
(PowerPoint
or pdf).
suggested reading from Krane and Raymer: chapter 7, p 175.
Week 10, Tues. Oct. 28, 2003.
- What are the approaches used in predicting
RNA and DNA secondary structure?
- Simple repeats and stem-loops vs.
more complex folding programs that use energetics and/or
phylogenetics.
- How and when should these programs be
run?
- How can experimental approaches complement
nucleic acid structure prediction?
- What are the limitations of these
approaches?
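The simplest folding approach can be sketched as a dynamic program that maximizes the number of nested base pairs, in the style of the Nussinov algorithm. This is a deliberately crude stand-in for the energy-based programs mentioned above: it counts pairs rather than computing free energies, and the function name, the allowed pairs (Watson-Crick plus G-U), and the minimum loop size are assumptions for illustration.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases enclosed by every hairpin."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

stem_loop_pairs = max_base_pairs("GGGAAAUCCC")
```

Energy-based programs use the same recursive decomposition but score stacks and loops with measured thermodynamic parameters instead of counting pairs.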
Lecture #12,
Michael Chapman:
Amino Acids, Polypeptides, and Sequence Attributes
(pdf from
Huan-Xiang Zhou in Spring 2002 and
html from
Michael Chapman in Spring 2003).
suggested reading from Krane and Raymer: chapter 7, pp 155-160;
from Gibas and Jambeck: chapter 9.
Week 10, Thurs. Oct. 30, 2003.
MidTerm Exam due today!
I. The Covalent Structure of Proteins.
-
Going from sequence to function sometimes requires a structure.
- The peptide unit - covalent chemistry;
Conformational variability through torsion angles;
van der Waals repulsion.
- Physical/structural properties related to primary
sequence,
e.g. protease cut sites and fragment sizes, iso-electric point, HPLC
retention, molecular weight, amino acid composition.
Week 11, Tues. Nov. 4, 2003.
II. Amino Acid Properties and Sequence Attributes.
-
Structural and functional roles:
- hydrophobic (membrane spanning?) or hydrophilic
(polar),
amphiphilicity and the hydrophobic moment;
- charge -- acidic or basic
(catalytic reaction center?);
- antigenic sites prediction -- surface probability and
flexibility.
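A hydropathy profile is one of the simplest sequence attributes to compute. The sketch below averages Kyte-Doolittle hydropathy values over a sliding window; the function name, the window size, and the test peptide are illustrative choices. Sustained high values over windows of roughly 19 residues are the classic hint of a membrane-spanning segment.

```python
# Kyte-Doolittle hydropathy scale (positive means hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=5):
    """Average hydropathy of each full window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Made-up peptide: a hydrophobic stretch followed by a charged one.
profile = hydropathy_profile("LLLLLKKKKK", window=5)
```

The profile has one value per full window, so a sequence of length n yields n - window + 1 points.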
Week 11, Thurs. Nov. 6, 2003.
Lecture #13,
Michael Chapman: From Sequence to
Secondary Structure Prediction and Beyond
(see lecture notes
under Lecture #12).
suggested reading from Krane and Raymer: chapter 7, pp 160-164;
from Gibas and Jambeck: chapter 10, pp 277-282.
I. The Classics.
-
Primary, secondary, tertiary, and quaternary structure.
-
Structural motifs and their recognition, domains.
-
What are the classical approaches to secondary structure
prediction?
- Chou-Fasman, GOR, and others.
-
How were these approaches developed and how do they work?
-
What are the methods' limitations and problems?
II. Emerging Trends.
- What are the newer approaches and directions
to the problem?
- Combinatorial approaches e.g. profile methods,
neural nets, genetic algorithms, etc.
-
The relation between secondary and tertiary structure prediction:
an introduction to threading and Rosetta methods.
- What are the limitations of these newer
approaches?
III. Supersecondary Structure and Protein
Building Blocks.
- Concepts and findings -- relate to ideas
of super-secondary structure, lattice-models, and
threading.
- How do secondary structural elements interact
with each other?
Lecture #14,
Huan-Xiang Zhou:
Protein Tertiary Structure -- a Dynamic View.
(Michael
Chapman's pdf from Spring 2002 and
Huan-Xiang Zhou's PowerPoint from Spring 2003).
suggested reading from Krane and Raymer: chapter 7, pp 164-172,
174,
and chapter 8, pp 183, 191-198;
from Gibas and Jambeck: chapter 10.
Week 12, Tues. Nov. 11, 2003.
Veterans Day Holiday.
Week 12, Thurs. Nov. 13, 2003.
I. Overview.
- A general introduction to the field of molecular
modeling.
- The "Catch 22" problem -- how do
proteins fold? Calculation of Structure and Properties.
- What approaches are being used to study the
subject? What are their limitations?
- What are the new directions in this field?
II. Molecular Mechanics (Energy
Minimization) and Molecular Dynamics
(Simulations).
- Modeling, minimization, and molecular
dynamics.
- What are the underlying concepts behind molecular
mechanics, the minimization process, force fields, and
molecular dynamics?
- What are the advantages and disadvantages
of these techniques?
- What are the limitations of the various force
fields, the minimization process, and the molecular
simulation?
- Restricted nature of the molecules used,
e.g. limited metal ions and solvent, and the inability
to handle unusual functional groups.
- What does the data look like?
- What are new directions in this area and
what has been the influence of supercomputing?
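Energy minimization can be pictured with the smallest possible system: two atoms interacting through a Lennard-Jones potential, relaxed by steepest descent. This is a toy sketch in reduced units, and the function names, step size, and starting distance are assumptions. A real force field sums many bonded and non-bonded terms over thousands of atoms, but the idea of following the gradient downhill is the same.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy: 4 * eps * ((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative of the energy with respect to the distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r

def minimize(r, step=0.01, tolerance=1e-10, max_iter=10000):
    """Steepest descent: move against the gradient until it vanishes."""
    for _ in range(max_iter):
        grad = lj_gradient(r)
        if abs(grad) < tolerance:
            break
        r -= step * grad
    return r

r_min = minimize(1.5)
e_min = lj_energy(r_min)
```

The minimum lies at r = 2^(1/6) sigma with energy -epsilon, so the result can be checked analytically. Minimization only finds the nearest local minimum, which is one of the limitations raised above.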
Week 13, Tues. Nov. 18, 2003.
III. Example 1.
Acetylcholinesterase (AChE)
- Elucidating functional mechanism through
molecular dynamics.
IV. Example 2.
the Trp Cage
- Exploring folding pathways with
the world's smallest protein.
V. Example 3.
Src Kinase and Potassium Channels
- Investigating protein-protein interactions
and time scale effects.
Lecture #15,
Ross Ellington:
Prediction of Protein Tertiary Structure
(pdf).
suggested reading from Krawetz and Womble:
chapter 31, pp 558-560;
from Krane and Raymer: chapter 7, pp 172-174,
and chapter 8, pp 187-191;
from Gibas and Jambeck: chapter 10 review.
Week 13, Thurs. Nov. 20, 2003.
I. Homology Modeling I.
- What is homology modeling?
- How does information from solved structures
relate to non-determined structures?
- What are the underlying concepts behind the
idea of homology modeling?
- When can this technique be used? What are
its limitations?
Week 14, Tues. Nov. 25, 2003.
II. Homology Modeling II.
- What are some of homology modeling's
applications?
- Ligand docking and drug design.
- Vital application of technology to the
pharmaceutical industry.
- What are the new directions in this area?
Week 14, Thurs. Nov. 27, 2003.
Thanksgiving Day Holiday.
Lecture #16,
David Banks:
Scientific Visualization -- Problems and Trends
in Biological Systems
(html).
suggested reading from Krawetz and Womble: chapter 32;
from Gibas and Jambeck: chapter 14.
Week 15, Tues. Dec. 2, 2003.
I. Visualization of Biological Molecules.
- What information can be determined visually
from a sequence or structure?
- Physical information such as:
size, surface area, volume, and charge distribution, and
relationships to other molecules via super-positioning.
- Realistic representations:
backbones, alpha carbon traces, ribbon cartoons, color coding
for elements such as secondary structures.
- What modifications of this information make
the data easier to understand?
Week 15, Thurs. Dec. 4, 2003.
II. Larger Systems Visualization.
- The big picture: applications of visualization
for complete systems integration.
- How does visualization help with
systems larger than molecules:
metabolic interactions, viruses, cells, tissues, organs,
organisms, populations, ecozones . . . .
Optional Laboratory Component:
See the
Laboratory Syllabus.
The lecture classroom has the facilities to project
live Internet links and most lectures include considerable demonstration
of actual biocomputing techniques. Therefore, the optional lab section is
not absolutely essential; however, it is strongly recommended, as
experience has shown that most student learning occurs when using real
data with real software. Students apply the theory learned in lecture to
experimental settings yielding an advanced understanding of evolution,
form, and function.
Once lab students have been introduced to basic
UNIX concepts, utility operations, editing procedures, and molecular
databases during the first couple of weeks, they choose a protein of
current interest from a list of molecules for which complete structural
coordinates are known. They then perform all of the laboratory computer
exercises upon that particular molecule. This way they are able to gain
experience in all aspects of biocomputing in the course in a
project-oriented fashion using the same natural progression as would be
used in an actual experimental setting.
Resultant predictive data derived from sequence
analysis will no doubt conflict with aspects of the known structural data,
but elements of truth will also be found. In this way the strengths and
weaknesses of each approach can be better understood, and a greater
empathy can be found for the tremendous problems encountered in the
all-too-common case of a newly discovered gene product without any
structural information available. With this approach to computerized
molecular biology, students "come full circle," gaining
appreciation for the full biocomputing spectrum available.
This structured exercise tutorial sequence lasts for
the first two thirds of the semester, ten weeks. After the laboratory
tutorial portion of the course has finished, lab students devote scheduled
lab sessions to working on their individual research projects. They will
be required to begin dialogue with the lab instructor regarding their
project topic early on in the semester and then will be required to submit
a project proposal as part of the midterm exam. Lab students are
encouraged to choose semester projects related to their own
undergraduate or graduate research. This helps to ensure excellence by
providing vested interest.
If students are not taking the lab, they are encouraged
to participate in the non-credit GCG Bioinformatics Workshop Series taught
every semester, but they are not required to do so. Steve Thompson is
available to assist students in using their own laboratory and the Biology
and CSIT Teaching Computing Lab computers to access the Mendel
biocomputing server, and to help with their projects throughout the
semester.
Student Evaluation:
There are three graded requirements, split in thirds
(individuals not taking the lab are evaluated on only the first
two items, split 50/50) --
A written, take-home, essay style exam that
covers the fundamentals of sequence analysis as presented in
the course up through Week 9 and that is due the class
meeting Thurs. Oct. 30. This will contribute one
third of the grade (half for those without the lab).
An end of semester final project in the form
of a research paper (10+ pages) in which students use any
combination of the techniques taught in the course to answer
a 'real' biological question or to build a 'real' biocomputing
tool. Students should have vested interest in their chosen
topic by making it relevant to their own academic research.
This will contribute the second third of the grade for those
students participating in the lab (and be the only
other component of the grade for those not involved in the
lab). The final project is due in my office (Dirac 150G),
printed on paper, not later than 5:00 PM the
Tuesday of Finals Week (Dec. 9).
The lab participants will have their final third
grade contribution come from laboratory reports covering each of the
required structured tutorials completed during the first ten weeks of
the course. These lab reports are due via e-mail to me the
week after a tutorial has been performed.
This lecture testing strategy will not cover the
structural topics presented in the last portion of the semester,
but these methods can certainly be used in the term project.
The term project is very important
as it demonstrates the practical aspects of the methodology.
It "brings it on home." Therefore,
rather than forcing a final exam and a term paper simultaneously,
we require a single exam at midterm, and the final project,
as well as the ten lab reports for those individuals
participating in the lab section.
All assignments will
be graded based on quality of understanding, originality of thought,
and clearness of presentation. Good writing skills certainly
help!