BSC4933/BSC5936 (Section 01) Syllabus


An Introduction to Bioinformatics



Fall Semester 2003. Lectures - Tuesdays and Thursdays from 9:30 to 10:45 AM, in Dirac 499; optional Laboratory Section - Tuesdays from 3:45 to 5:45 PM in Conradi 223. Credit hours: three (or two without lab)

This course introduces the emerging topic of bioinformatics. It is designed primarily for life science students who do not have an extensive background in mathematics, statistics, or computer science but who are interested in survey-level knowledge of bioinformatics and its techniques. The course is also suitable for students in mathematics, computer science, or statistics who wish to learn how methods drawn from their discipline are applied in bioinformatics. It is a 'hands on,' 'how to' lecture, demonstration, and laboratory (optional) course that includes both sequence- and structure-based methods.

The lecture meets twice weekly for 75 minutes apiece; the optional lab component meets once weekly for two hours. Students have accounts on the Mendel biocomputing server for access to computational biology databases and software. Steve Thompson coordinates the course, delivers several lectures, and instructs in the laboratory. Guest faculty members lecture in their individual areas of expertise and knowledge.

Participating Faculty Instructors (lecture allocations listed below)

David Swofford CSIT/Biology course director
Steve Thompson CSIT course coordinator
David Banks CSIT/Computer Science biosciences visualization
Peter Beerli CSIT/Biology population evolution
Michael Chapman IMB/Chemistry AA attributes and secondary structure
Ross Ellington IMB/Biology homology modeling
Hong Li IMB/Chemistry nucleic acid structure
Jack Quine Mathematics alignment algorithms
David Swofford CSIT/Biology molecular evolution
Robert van Engelen CSIT/Computer Science computer science fundamentals
Jim Wilgenbusch CSIT/Biology molecular evolution
Huan-Xiang Zhou CSIT/Physics molecular mechanics/dynamics

No textbook is required, but I have some suggestions. "Developing Bioinformatics Computer Skills," by Cynthia Gibas and Per Jambeck, is a practical, first-pass introduction to the field and part of the O'Reilly series. "Fundamental Concepts of Bioinformatics," by Dan E. Krane and Michael L. Raymer, is a good start on the basics of computational biology theory. "Introduction to Bioinformatics: A Theoretical and Practical Approach," edited by Stephen A. Krawetz and David D. Womble, is an extremely comprehensive, perhaps too much so, review of the area, particularly sequence analysis, from biochemistry and cellular and molecular biology, through the UNIX environment, to biocomputing software applications. David W. Mount has an excellent, intermediate-level text, "Bioinformatics: Sequence and Genome Analysis," with more depth into the algorithmic basis of sequence analysis. Check out the FSU Bookstore, Amazon.com, and StudentMarket for availability.

Organizational Meeting, David Swofford: Overview of Course Format and Content (pdf from Michael Chapman, Spring 2003).

Week 1, Tues. Aug. 26, 2003.

What have you gotten yourself into?

Lecture #1, Robert van Engelen: Introduction -- Computer Science and Biology (pdf).

suggested reading from Krawetz and Womble: chapters 1 and 13, and appendix 3;
from Krane and Raymer: appendix 1; from Gibas and Jambeck: chapters 1 and 2.

Week 1, Tues. Aug. 26 (cont.), 2003.

I. Why are Computers Used? What's the Advantage? Data-Mining!

    The tremendous growth of data, especially DNA sequence data.
    The nature of the process, especially pattern searching.
    The need for complex calculations and sorted data subsets.
    The increased speed of the computer over the human brain.

II. String Searches and Their Uses.

    Because of huge databases -- need for methods of inquiry and access.
    How do simple one-dimensional pattern text searches work?
    How does a computer recognize a "match?"
    What is a simple pattern -- regular pattern expressions and ambiguity?
    What type of information gathering is just simple pattern searching?
    Searches of the databases for text strings, e.g. entry, organism, or gene names, authors, etc. vs.
    Searches for primary sequence information, e.g. restriction enzyme cut sites, regulatory patterns ("signals"), and searches to identify known functional or structural patterns (motifs).
    e.g. PROSITE -- A Dictionary of Protein Sites and Patterns.
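
As a rough illustration of how a simple pattern with ambiguity becomes a computer search, the sketch below translates a PROSITE-style pattern into a regular expression and scans a protein for matches. The function names and the test peptide are made up for illustration; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site. Only the basic syntax is handled (x, [set], {excluded set}, and repeat counts); the < and > anchors are not.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Optional repeat count such as x(2) or x(2,4).
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or a [set] of residues
            core = element
        parts.append(core + repeat)
    return "".join(parts)

def find_sites(pattern: str, protein: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched text) for every match, overlaps included."""
    # A lookahead lets the scan report matches that overlap one another.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]

if __name__ == "__main__":
    # Hypothetical peptide used only to exercise the functions.
    print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```

The second asparagine in the test peptide is followed by a proline, so the excluded-residue element {P} rejects it; this is the kind of ambiguity a plain text search cannot express.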

Week 1, Thurs. Aug. 28, 2003.

III. Advanced Techniques: Neural Nets, Genetic Algorithms, and Supercomputers.

    Descriptions of the concepts and several examples of application methods and derived results.
    What are neural nets and genetic algorithms and how do they help solve biocomputing problems?
    What is a cluster; what is grid computing?
    What is supercomputing's influence and what is massively parallel?
    What does the future hold for biocomputing?

Lecture #2, Steve Thompson: Biological Databases (PowerPoint or pdf).

suggested reading from Krawetz and Womble: chapters 11, 26, and 28 to p 469;
from Krane and Raymer: chapter 1; from Gibas and Jambeck: chapters 6 and 13.

Week 2, Tues. Sept. 2, 2003.

I. Sequence Databases: Content and Organization.

    Primarily 'flatfile' ASCII data, often with binary indexing.
    What are primary sequences? What are sequence databases?
    What information do they contain and how is it organized?
    How is this information accessed?
    Who maintains them and where are they kept?
    Changes and effects -- history and development over the years.

II. Structural Databases: Content and Organization.

    Primarily 'flatfile' ASCII data, rarely binary indices.
    What are structures? What are structural databases? The Protein Data Bank (PDB).
    What information do they contain and how is it organized?
    How is this information accessed?
    Who maintains them and where are they kept?
    Changes and effects -- history and development over the years.

Week 2, Thurs. Sept. 4, 2003.

III. With guest Misha Taylor: Higher Order Databases and Database Structure (PowerPoint or pdf).

    Seldom 'simple' databases, often relational and/or object oriented.
    Examples include SQL, Oracle, and WWW databases like NCBI's ASN.1 Entrez database, metabolic pathway databases, genome databases such as GDB, etc.

Lecture #3, Steve Thompson: Dot Matrix Methods (PowerPoint or pdf).

suggested reading from Krane and Raymer: chapter 2, pp 33-34.

Week 3, Tues. Sept. 9, 2003.

    Why? Provides Gestalt of all comparisons possible.
    What approaches are used? Windowing, filters, and word techniques.
    What can be determined with dot matrixing? Repeats and elements 'less than best.'
    What is the significance of the parameters employed -- very important!
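
To make the windowing idea concrete, here is a minimal sketch of a dot matrix comparison. It is not the method used by any particular program; the function names and the parameters `window` and `stringency` are chosen here for illustration. A dot is recorded wherever at least `stringency` of the `window` aligned positions match.

```python
def dot_matrix(seq_a, seq_b, window=3, stringency=3):
    """Return the set of (i, j) window starts (0-based) that pass the filter."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots

def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    rows = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        cells = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        rows.append(residue + " " + cells)
    return "\n".join(rows)

if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGT"
    print(render(a, b, dot_matrix(a, b, window=4, stringency=4)))
```

Running the same comparison with a lower stringency or a shorter window adds dots; that noise is why the parameter choice matters so much. A repeat shows up as more than one diagonal against the same stretch of the other sequence.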

Lecture #4, Jack Quine: Alignments and Substitution Matrices (pdf).

suggested reading from Krawetz and Womble: chapter 27, pp 443-449, and chapter 31, pp 539-544;
from Krane and Raymer: chapter 2, pp 35-47;
from Gibas and Jambeck: chapter 7, through p 181.

Week 3, Thurs. Sept. 11, 2003.

I. Pair-wise Sequence Alignment.

    What are pair-wise alignments? How are they made?
    What sort of approaches are used?
    What is the dynamic programming algorithm?
    Scoring matrix --> path matrix --> traceback --> alignment.
    What are gap penalties and what difference do they make?
    What is the difference between local and global alignment?
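
The progression scoring matrix --> path matrix --> traceback --> alignment can be sketched in a few lines. This is a minimal global (Needleman-Wunsch style) alignment that assumes a simple match/mismatch score and a linear gap penalty; the default values are arbitrary, and real programs use substitution matrices and separate gap opening and extension penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Path matrix: score[i][j] is the best score for a[:i] versus b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align the pair
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Traceback from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

A local (Smith-Waterman style) variant differs mainly in flooring each cell at zero and starting the traceback from the highest-scoring cell rather than the corner.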

Week 4, Tues. Sept. 16, 2003.

II. Substitution Matrices.

    What are scoring matrices and log-odds matrices and what ones are available?
    What is the difference between the Dayhoff, BLOSUM, Gonnet, and other available matrices?
    Why one and not another? Which are appropriate for what sort of different situations?
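
As a small worked example of what "log-odds" means, the sketch below computes one matrix entry as the logarithm of how often a residue pair is observed in trusted alignments divided by how often it would occur by chance. The function name, the `scale` parameter, and the frequencies in the usage lines are hypothetical; published matrices apply their own scaling and rounding.

```python
import math

def log_odds(pair_freq, freq_i, freq_j, scale=1.0):
    """scale * log2(observed pair frequency / frequency expected by chance)."""
    expected = freq_i * freq_j
    return scale * math.log2(pair_freq / expected)

if __name__ == "__main__":
    # Hypothetical frequencies, not taken from any real matrix.
    print(log_odds(0.125, 0.25, 0.25))     # pair seen twice as often as chance
    print(log_odds(0.03125, 0.25, 0.25))   # pair seen half as often as chance
```

A positive entry means the substitution is seen more often than chance predicts, a negative entry means less often, and zero means no information either way.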

Lecture #5, Steve Thompson: Multiple Sequence Alignment -- Methods and Problems (PowerPoint or pdf).

suggested reading from Krawetz and Womble: chapter 31, pp 544-558, and chapter 33, pp 602-611;
from Krane and Raymer: chapter 2, p 52;
from Gibas and Jambeck: chapter 8, through p 198.

Week 4, Thurs. Sept. 18, 2003.

    Global versus local methods; limitations of all methods.
    Manual alignment vs. automated, progressive pairwise alignment.
    Refinement, refinement, refinement . . . .
    The nature of a consensus; how are they generated?

Lecture #6, Steve Thompson: Database Searching -- Old Methods and New Developments (PowerPoint or pdf).

suggested reading from Krawetz and Womble: chapter 27, pp 449-461, chapter 28, pp 469-487;
from Krane and Raymer: chapter 2, pp 48-51;
from Gibas and Jambeck: chapter 7, remainder.

Week 5, Tues. Sept. 23, 2003.

I. Heuristic Database Searching.

    What sequences in the databases are similar to yours?
    Methods of database searching, traditional and newer algorithms.
    What do all of these words mean: "heuristics," "hashing," "word," "k-tup," "fast," "blast," etc?
    How do the algorithms work; how are they different; how does parameter choice affect the outcome?
    DNA vs. protein searches, which and why?
    What program do I use for what situation?

Week 5, Thurs. Sept. 25, 2003.

II. Similarity Searching and Significance (Spring 2003 pdf from Bill Pearson!).

    What do the scores mean?
    What is the significance of the results?
    What are Monte Carlo methods, Z scores, and expectations?
    How does any of this relate to "homology" and 'real-life' biology?

Lecture #7, Bjarne Knudsen: Remote Relationships, the Profile Methods -- à la Gribskov, MEMEs, and HMMRs (pdf and a biostatistics pdf from Dr. Lei Li, Spring 2002).

suggested reading from Krawetz and Womble: chapters 22, 23, 29, 30, and pp 611-618 from chapter 33;
from Gibas and Jambeck: chapter 8, pp 205-214.

Week 6, Tues. Sept. 30, 2003.

    What are profiles, Hidden Markov Models, and expectation maximizations?
    What is the nature of complex pattern searching, and how are complex patterns derived?
    Two-dimensional, position specific weight matrices built from multiple, related sequences.
    What data is used to generate such a pattern and what programs are available for building them?
    How does this relate to structure and function -- evolutionarily conserved regions are those of functional importance and therefore necessarily constrained structure!
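
A two-dimensional, position specific weight matrix can be sketched as follows. This is a simplified illustration, not the Gribskov profile method itself: it assumes ungapped, equal-length aligned sequences, a uniform background, and a pseudocount of 1. The function names and the example sites are made up.

```python
import math
from collections import Counter

def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return one dict per alignment column: symbol -> log2-odds weight."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    pwm = []
    for column in zip(*aligned):              # walk the alignment column by column
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pwm.append({s: math.log2(((counts[s] + pseudocount) / total) / background)
                    for s in alphabet})
    return pwm

def score(pwm, segment):
    """Sum the column weights for a segment as wide as the matrix."""
    return sum(column[symbol] for column, symbol in zip(pwm, segment))

def best_hit(pwm, sequence):
    """Return (score, 0-based start) of the best-scoring window in sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

if __name__ == "__main__":
    # Hypothetical aligned sites used only to exercise the functions.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
    print(best_hit(build_pwm(sites), "GGCGTATAATGCGC"))
```

Unlike a regular expression, the matrix weights every residue at every position, so a candidate that differs from the consensus at a poorly conserved column is penalized less than one that differs at an invariant column.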

Lecture #8, Steve Thompson: Nucleic Acid Sequence Characterization -- Genomics (PowerPoint or pdf).

suggested reading from Krawetz and Womble: chapters 24 and 25;
from Krane and Raymer: chapter 6; from Gibas and Jambeck: chapter 11.

Week 6, Thurs. Oct. 2, 2003.

    What sort of information can be determined from a primary sequence?
    Easy -- restriction digests and associated mapping.
    Harder -- fragment assembly and genome mapping.
    Very hard -- gene finding and sequence annotation.
    This is a much more complex problem than simple pattern searching.
    Recognizing protein coding regions:
    through "content approaches" -- "nonrandomness," codon usage/preference, and URF's vs. ORF's;
    and through homology based methods -- "signal searches" and the weight matrix, and similarity based alignments; and via "combined methods" on the Internet, e.g. Grail and GeneID.
    Easy -- translation to peptides.
    Hard again -- genome comparisons.
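
The "easy" end of this list can be sketched directly: translation with the standard genetic code and a naive scan for open reading frames. This is only a starting point and nothing like a real gene finder. It looks at the three forward frames only, requires an ATG start, and ignores codon usage, splicing, and the reverse strand. The function names and the `min_codons` parameter are chosen here for illustration.

```python
BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def find_orfs(dna, min_codons=2):
    """Return (start, end, peptide) for each ATG...stop ORF in the forward frames.

    start and end are 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None                   # close it at the stop codon
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGGCTTAAGG"))
```

Short ORFs like the one in this toy sequence occur by chance all the time, which is one reason gene finding needs the content and signal methods listed above rather than a simple pattern scan.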

Lecture #9: Molecular Evolution.

suggested reading from Krawetz and Womble: chapter 7, and chapter 33, pp 622-633;
from Krane and Raymer: chapters 3, 4, and 5;
from Gibas and Jambeck: chapter 8, pp 199-205;
especially see "Molecular Systematics," editors Hillis, Moritz, Mable, chapter 11.

Week 7, Tues. Oct. 7, 2003.

I. David Swofford: Background (pdf).

    What is molecular phylogenetics?
    What are its applications and why would anyone care?
    What is the importance of your multiple sequence alignment input?
    What other assumptions must be made for these approaches to be valid?

Week 7, Thurs. Oct. 9, 2003.

II. David Swofford (cont.): Methods (pdf).

    Overview of methods --
    parsimony, distance, maximum likelihood, Bayesian inference.
    How do these methods work?
    What are their strengths and weaknesses?

Week 8, Tues. Oct. 14, 2003.

III. David Swofford (cont.): Models.

    Choosing appropriate models and their impact on the results.
    Accuracy, reliability, robustness, and confidence.
    (boot-strapping and other statistical tests).

Week 8, Thurs. Oct. 16, 2003.

IV. Jim Wilgenbusch: PAUP* demonstration (pdf).

    How do you use the PAUP* package?
    An overview of the conventions, input and output formats, methods available, and models used in the package.
    Some fun . . . .

Lecture #10, Peter Beerli: Population Coalescence (Peter Beerli's lecture).

suggested reading from Krawetz and Womble: chapter 12.

Week 9, Tues. Oct. 21, 2003.

I. What is the Coalescence?

    Population level evolution and the coalescence.
    Effective population size, exponential growth rate, migration rate, and per-site recombination rate.

Week 9, Thurs. Oct. 23, 2003.

II. What is LAMARC?

    LAMARC -- Likelihood analysis using the Metropolis Monte Carlo sampling technique.

MidTerm TakeHome Exam - distributed Thurs. Oct. 23 - due Thurs. Oct. 30. Available online in pdf format.

Lecture #11, Hong Li: Prediction of Nucleic Acid Secondary Structure (PowerPoint or pdf).

suggested reading from Krane and Raymer: chapter 7, p 175.

Week 10, Tues. Oct. 28, 2003.

    What are the approaches used in predicting RNA and DNA secondary structure?
    Simple repeats and stem-loops vs.
    more complex folding programs that use energetics and/or phylogenetics.
    How and when should these programs be run?
    How can experimental approaches complement nucleic acid structure prediction?
    What are the limitations of these approaches?

Lecture #12, Michael Chapman: Amino Acids, Polypeptides, and Sequence Attributes (pdf from Huan-Xiang Zhou in Spring 2002 and html from Michael Chapman in Spring 2003).

suggested reading from Krane and Raymer: chapter 7, pp 155-160;
from Gibas and Jambeck: chapter 9.

Week 10, Thurs. Oct. 30, 2003.

MidTerm Exam due today!

I. The Covalent Structure of Proteins.

    Going from sequence to function sometimes requires a structure.
    The peptide unit - covalent chemistry;
    Conformational variability through torsion angles; van der Waals repulsion.
    Physical/structural properties related to primary sequence,
    e.g. protease cut sites and fragment sizes, iso-electric point, HPLC retention, molecular weight, amino acid composition.

Week 11, Tues. Nov. 4, 2003.

II. Amino Acid Properties and Sequence Attributes.

    Structural and functional roles:
    hydrophobic (membrane spanning?) or hydrophilic (polar),
    amphiphilicity and the hydrophobic moment;
    charge -- acidic or basic (catalytic reaction center?);
    antigenic sites prediction -- surface probability and flexibility.

Week 11, Thurs. Nov. 6, 2003.

Lecture #13, Michael Chapman: From Sequence to Secondary Structure Prediction and Beyond (see lecture notes under Lecture #12).

suggested reading from Krane and Raymer: chapter 7, pp 160-164;
from Gibas and Jambeck: chapter 10, pp 277-282.

I. The Classics.

    Primary, secondary, tertiary, and quaternary structure.
    Structural motifs and their recognition, domains.
    What are the classical approaches to secondary structure prediction?
    Chou-Fasman, GOR, and others.
    How were these approaches developed and how do they work?
    What are the methods' limitations and problems?

II. Emerging Trends.

    What are the newer approaches and directions to the problem?
    Combinatorial approaches, e.g. profile methods, neural nets, genetic algorithms, etc.
    The relation between secondary and tertiary structure prediction:
    an introduction to threading and Rosetta methods.
    What are the limitations of these newer approaches?

III. Supersecondary Structure and Protein Building Blocks.

    Concepts and findings -- relate to ideas of super-secondary structure, lattice-models, and threading.
    How do secondary structural elements interact with each other?

Lecture #14, Huan-Xiang Zhou: Protein Tertiary Structure -- a Dynamic View. (Michael Chapman's pdf from Spring 2002 and Huan-Xiang Zhou's PowerPoint from Spring 2003).

suggested reading from Krane and Raymer: chapter 7, pp 164-172, 174,
and chapter 8, pp 183, 191-198; from Gibas and Jambeck: chapter 10.

Week 12, Tues. Nov. 11, 2003. Veterans Day Holiday.

Week 12, Thurs. Nov. 13, 2003.

I. Overview.

    A general introduction to the field of molecular modeling.
    The "Catch 22" problem -- how do proteins fold? Calculation of Structure and Properties.
    What approaches are being used to study the subject? What are their limitations?
    What are the new directions in this field?

II. Molecular Mechanics (Energy Minimization) and Molecular Dynamics (Simulations).

    Modeling, minimization, and molecular dynamics.
    What are the underlying concepts behind molecular mechanics, the minimization process, force fields, and molecular dynamics?
    What are the advantages and disadvantages of these techniques?
    What are the limitations of the various force fields, the minimization process, and the molecular simulation?
    Restricted nature of the molecules used, e.g. limited metal ions and solvent, and the inability to handle unusual functional groups.
    What does the data look like?
    What are new directions in this area and what has been the influence of supercomputing?

Week 13, Tues. Nov. 18, 2003.

III. Example 1. Acetylcholinesterase (AchE)

    Elucidating functional mechanism through molecular dynamics.

IV. Example 2. the Trp Cage

    Exploring folding pathways with the world's smallest protein.

V. Example 3. Src Kinase and Potassium Channels

    Investigating protein-protein interactions and time scale effects.

Lecture #15, Ross Ellington: Prediction of Protein Tertiary Structure (pdf).

suggested reading from Krawetz and Womble: chapter 31, pp 558-560;
from Krane and Raymer: chapter 7, pp 172-174, and chapter 8, pp 187-191;
from Gibas and Jambeck: chapter 10 review.

Week 13, Thurs. Nov. 20, 2003.

I. Homology Modeling I.

    What is homology modeling?
    How does information from solved structures relate to non-determined structures?
    What are the underlying concepts behind the idea of homology modeling?
    When can this technique be used? What are its limitations?

Week 14, Tues. Nov. 25, 2003.

II. Homology Modeling II.

    What are some of homology modeling's applications?
    Ligand docking and drug design.
    Vital application of technology to the pharmaceutical industry.
    What are the new directions in this area?

Week 14, Thurs. Nov. 27, 2003. Thanksgiving Day Holiday.

Lecture #16, David Banks: Scientific Visualization -- Problems and Trends in Biological Systems (html).

suggested reading from Krawetz and Womble: chapter 32; from Gibas and Jambeck: chapter 14.

Week 15, Tues. Dec. 2, 2003.

I. Visualization of Biological Molecules.

    What information can be determined visually from a sequence or structure?
    Physical information such as:
    size, surface area, volume, and charge distribution, and relationships to other molecules via super-positioning.
    Realistic representations:
    backbones, alpha carbon traces, ribbon cartoons, color coding for elements such as secondary structures.
    What modifications of this information makes the data easier to understand?

Week 15, Thurs. Dec. 4, 2003.

II. Larger Systems Visualization.

    The big picture: applications of visualization for complete systems integration.
    How does visualization help with systems larger than molecules:
    metabolic interactions, viruses, cells, tissues, organs, organisms, populations, ecozones . . . .

Optional Laboratory Component:

Click here to go to the Laboratory Syllabus.

The lecture classroom has the facilities to project live Internet links and most lectures include considerable demonstration of actual biocomputing techniques. Therefore, the optional lab section is not absolutely essential; however, it is strongly recommended, as experience has shown that most student learning occurs when using real data with real software. Students apply the theory learned in lecture to experimental settings yielding an advanced understanding of evolution, form, and function.

After lab students have had their introduction to basic UNIX concepts, utility operations, editing procedures, and molecular databases within the first couple of weeks, they decide on a protein of current interest from a list of molecules for which complete structural coordinates are known. They then perform all of the laboratory computer exercises upon that particular molecule. This way they are able to gain experience in all aspects of biocomputing in the course in a project-oriented fashion using the same natural progression as would be used in an actual experimental setting.

Resultant predictive data derived from sequence analysis will no doubt conflict with aspects of the known structural data, but elements of truth will also be found. In this way the strengths and weaknesses of each approach can be better understood, and a greater empathy can be found for the tremendous problems encountered in the all-too-common case of a newly discovered gene product without any structural information available. With this approach to computerized molecular biology, students "come full swing" gaining appreciation for the full biocomputing spectrum available.

This structured exercise tutorial sequence lasts for the first two thirds of the semester, ten weeks. After the laboratory tutorial portion of the course has finished, lab students devote scheduled lab sessions to working on their individual research projects. They are required to begin a dialogue with the lab instructor regarding their project topic early in the semester and then to submit a project proposal as part of the midterm exam. Lab students are encouraged to choose semester projects related to their own undergraduate or graduate research. This helps to ensure excellence by providing a vested interest.

If students are not taking the lab, they are encouraged to participate in the non-credit GCG Bioinformatics Workshop Series taught every semester, but they are not required to do so. Steve Thompson is available to assist students in using their own laboratory and the Biology and CSIT Teaching Computing Lab computers to access the Mendel biocomputing server, and to help with their projects throughout the semester.

Student Evaluation:

There are three graded requirements, split in thirds (individuals not taking the lab are evaluated on only the first two items, split 50/50) --

  1. A written, take-home, essay style exam that covers the fundamentals of sequence analysis as presented in the course up through Week 9 and that is due the class meeting Thurs. Oct. 30. This will contribute one third of the grade (half for those without the lab).

  2. An end of semester final project in the form of a research paper (10+ pages) in which students use any combination of the techniques taught in the course to answer a 'real' biological question or to build a 'real' biocomputing tool. Students should have a vested interest in their chosen topic by making it relevant to their own academic research. This will contribute the second third of the grade for those students participating in the lab (and be the only other component of the grade for those not involved in the lab). The final project is due in my office (Dirac 150G), printed on paper, not later than 5:00 PM the Tuesday of Finals Week (Dec. 9).

  3. The lab participants will have the final third of their grade come from laboratory reports covering each of the required structured tutorials completed during the first ten weeks of the course. These lab reports are due via e-mail to me the week after a tutorial has been performed.

This lecture testing strategy will not cover the structural topics presented in the last portion of the semester, but these methods can certainly be used in the term project. The term project is very important as it demonstrates the practical aspects of the methodology. It "brings it on home." Therefore, rather than forcing a final exam and a term paper simultaneously, we require a single exam at midterm, and the final project, as well as the ten lab reports for those individuals participating in the lab section.

All assignments will be graded based on quality of understanding, originality of thought, and clarity of presentation. Good writing skills certainly help!

 
   
 
© 2002 Florida State University, stevet@bio.fsu.edu