Sponsors

GdR BIM
Université de Rennes 1
Centre de recherche commun Inria Microsoft Research

JOBIM 2012

Journées Ouvertes en Biologie, Informatique et Mathématiques

3-6 July 2012


Abstracts of the invited talks

David B. Searls, University of Pennsylvania
"Molecules, Languages, and Automata"

Molecular biology is replete with linguistic metaphors, from the language of DNA to the genome as “book of life”. Certainly the organization of genes and other functional modules along the DNA sequence invites a syntactic view, which can be adopted for purposes of pattern-matching search via parsing. This in turn has led to the development of novel grammar formalisms specially adapted to the biological domain. It has also been shown that folding of RNA structures is neatly expressed by grammars that require expressive power beyond context-free on the Chomsky hierarchy, an approach that has been conceptually extended with other grammar formalisms to the much more complex structures of proteins. Grammars and their cognate automata have even been adopted to describe evolutionary processes and algorithms for their reconstruction via sequence alignment, and indeed the analogy between the evolution of species and of languages (first noted by Darwin) has been exploited by applying bioinformatics tools to human languages as well. Processive enzymes and other “molecular machines” can also be cast in terms of automata, and thus of grammars, opening up new possibilities for the formal specification, modeling, and simulation of biological processes, and perhaps tools useful in the fields of DNA computing and nanotechnology. This talk will review linguistic approaches to molecular biology, and perspectives on potential future applications of grammars and automata in this field.
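
As a small illustration of the syntactic view, the sketch below recognises an RNA hairpin with a toy context-free grammar. The grammar, the function name `parse_hairpin` and the `min_stem` and `min_loop` defaults are assumptions made for this example and are not taken from the talk. Structures such as pseudoknots would need the more expressive formalisms mentioned in the abstract.

```python
# Toy context-free grammar for an RNA hairpin (an illustrative assumption):
#   S -> x S y   where (x, y) is a Watson-Crick pair
#   S -> loop    where loop is a run of at least min_loop unpaired bases
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def parse_hairpin(seq, min_stem=2, min_loop=3):
    """Return (stem_length, loop) if seq derives from S, else None.

    seq is assumed to be an uppercase RNA string.
    """
    i, j = 0, len(seq) - 1
    # Apply S -> x S y while the outer bases pair and at least
    # min_loop bases would remain between them.
    while j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIRS:
        i += 1
        j -= 1
    stem = i
    loop = seq[i:j + 1]
    if stem >= min_stem and len(loop) >= min_loop:
        return stem, loop
    return None
```

For example, `parse_hairpin("GGGAAAACCC")` peels off three G-C pairs and leaves the loop `AAAA`, while a sequence with no complementary ends is rejected.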

Hugues Roest-Crollius, Ecole Normale Supérieure
"The 4th dimension of Biology. A historical perspective of biological processes"

Biology is an experimental science: hypotheses aiming at a better understanding of molecular, cellular or physiological functions are tested through experiments in living models. But living models (animals, plants, microorganisms) are contemporary, and therefore only present-day biological processes may be studied directly. Yet biological processes have been shaped during hundreds of millions of years of evolution under the influence of multiple forces, including mutational events, population genetics, adaptation, genetic drift, etc. Clearly, taking into account this historical perspective to understand biology would add tremendous power to our ability to interpret present-day experimental results. Most importantly, documenting the precise historical succession of events leading to molecular interactions, cellular development and physiological equilibria observed in contemporary experiments would help us understand “why” biological processes are organised the way we see them.
Unfortunately, the historical records required to achieve this have been erased, and no biological samples old and abundant enough to cover the evolution of even one biological function remain today. The only possible approach is to infer ancestral biological states from contemporary data. Genomic data is particularly suited to this undertaking because it is generally highly precise, highly accurate, highly abundant and centralised in publicly available databases. The genome also provides fundamental insights into the functional properties of an organism, which in turn inform us about the likelihood that specific metabolic or developmental pathways may have existed, and about the importance of specific functions for the species. Genomes thus represent the foundation upon which many insights may be gained into ancestral biological processes, partly replacing the missing historical records mentioned above.

I will present an algorithmic framework developed to reconstruct the organisation of ancestral genomes from their modern descendants. The method relies first on phylogenetic trees built from more than 50 vertebrate proteomes to infer the presence of genes in ancestors, and second on the systematic identification of conserved adjacent gene sets between modern genomes to reconstruct the order of genes in ancestral chromosomes. The method has been extensively validated using realistic simulations of genome rearrangements in vertebrate genomes. I will next describe how this information can be used to examine specific biological questions of interest. First, successive ancestral chromosomes provide a direct way of identifying branch-specific genomic rearrangements, including duplications and inversions. Ancestral gene duplications conserved in modern genomes represent a proxy for the selection of advantageous functions, such as those involved in the immune response, in sexual reproduction, or in specific metabolic pathways. Their precise mapping in genomes and during the course of evolution represents a global resource to document the history of these adaptive events. Second, the reconstruction of ancestral coding sequences represents a new resource to observe the distribution of GC content along ancestral chromosomes, and should help answer important questions on the evolution of the compositional landscape of chromosomes. Third, ancestral vertebrate genomes reveal ways of studying ancient events such as the two whole genome duplications at the origin of this group, thus providing an unprecedented resource to examine the molecular signatures left by these founding evolutionary events.
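
The idea of conserved gene adjacencies can be sketched for the simplest case of two genomes. This is a deliberately reduced illustration and not the framework presented in the talk: it ignores the phylogenetic trees, gene orientation and duplications, and it assumes that each gene-family identifier occurs at most once per genome. The function names and the input representation (a genome as a list of chromosomes, each an ordered list of identifiers) are hypothetical.

```python
def adjacencies(genome):
    """Return the set of unordered neighbouring gene pairs in a genome.

    genome: list of chromosomes, each an ordered list of gene-family IDs.
    """
    pairs = set()
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:]):
            pairs.add(frozenset((a, b)))
    return pairs


def ancestral_blocks(genome_a, genome_b):
    """Chain adjacencies shared by both genomes into candidate blocks."""
    shared = adjacencies(genome_a) & adjacencies(genome_b)
    neighbours = {}
    for pair in shared:
        a, b = tuple(pair)
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    blocks, seen = [], set()
    # Block ends are genes with exactly one conserved neighbour.
    ends = sorted(g for g, n in neighbours.items() if len(n) == 1)
    for start in ends:
        if start in seen:
            continue
        block, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = [g for g in neighbours[cur] if g != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            block.append(cur)
            seen.add(cur)
        blocks.append(block)
    return blocks
```

With `genome_a = [["a", "b", "c", "d", "e"]]` and `genome_b = [["c", "b", "a"], ["d", "e", "f"]]`, the shared adjacencies are a-b, b-c and d-e, which chain into the two blocks `["a", "b", "c"]` and `["d", "e"]`.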

Pierre Baldi, University of California, Irvine
"Machine Learning Approaches in Proteomics"

Over the past three decades machine learning approaches have had a profound influence on many fields, including bioinformatics. Here we will provide a brief historical perspective on machine learning and its applications to proteomics, particularly structural proteomics, and discuss why structural proteomics is important for machine learning. We will then present state-of-the-art machine learning methods for predicting protein structures and structural features, from secondary structure to contact maps.
We will stress and demonstrate the importance of combining supervised and unsupervised learning, and using deep and modular architectures capable of integrating information over space and "time" at multiple scales. Finally, we will describe two proteomic applications that have benefited from statistical machine learning methods: (1) the discovery of new drug leads for neglected diseases; and (2) the development of high-throughput platforms to study the immune response with applications to antigen discovery and vaccine development.

Ivo Hofacker, Institute for Theoretical Chemistry
"RNA Structures, Interactions, and Folding Kinetics"

The talk will give an overview of recent advances in the computational analysis of RNA secondary structures, focusing in particular on three specific areas: (i) Classical RNA structure prediction considers only Watson-Crick type base pairings. Tertiary structures, however, contain a multitude of non-canonical pairings that determine 3D fold and tertiary interactions. I will present a recent method that includes these non-canonical pairs in the prediction without sacrificing the efficiency of classical algorithms. (ii) Most RNAs function by interacting with other RNAs. The prediction of interaction targets is therefore a promising step towards understanding the function of novel non-coding RNAs. Several recent methods address this problem while striking a balance between speed and accuracy. (iii) Even relatively short RNAs may exhibit high energy barriers in their folding landscape that make folding to the minimum free energy structure prohibitively slow. We will therefore present methods to predict the folding kinetics, focusing in particular on the problem of co-transcriptional folding.
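
To make the "classical" setting of point (i) concrete, the sketch below implements Nussinov-style base-pair maximisation, a textbook dynamic programme that considers only Watson-Crick pairs. It is a simplified stand-in chosen for brevity: it counts pairs instead of minimising free energy, and it is not one of the methods presented in the talk. The function name and the `min_loop` default are assumptions.

```python
WC_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs in an RNA string.

    A pair (k, j) is allowed only if at least min_loop bases lie
    between positions k and j.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: k pairs with j
                if (seq[k], seq[j]) in WC_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

The table is filled in order of increasing subsequence length, which gives cubic time and quadratic memory in the sequence length.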

Toni Gabaldon, Centre for Genomic Regulation
"Phylogenomics in the light of ever-growing sequencing data"

Comparative Genomics Group, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain.


A pressing challenge in phylogenomics is the need to cope with the massive production of complete genomic sequences, especially after recent technological developments. Problems that are particularly affected by the increasing flow of genomic data and that require continuous updating are: i) the establishment of evolutionary relationships between species (the so-called Tree Of Life (TOL)), ii) the inference of orthology and paralogy relationships across genomes, and iii) the study of the evolution of large, widespread super-families that evolved through complex patterns of duplications and losses. To face such challenges we have developed two sophisticated pipelines that allow high scalability and continuous updating, while achieving the highest levels of accuracy. The first pipeline automatically reconstructs entire species-centric collections of gene phylogenies (the so-called phylome), and combines this with phylogenetic information from various other sources to derive unique orthology and paralogy predictions. The second pipeline, which we apply to the superfamily and the Tree of Life assembly problems, is able to reconstruct large phylogenies by means of an iterative strategy that provides scalable resolution and allows continuous updating. In this talk, I will illustrate the use of such approaches in the context of the assessment of the evolution of important traits in fungi, and the reconstruction of a genome-based, eukaryotic tree of life.

Martin Vingron, Max Planck Institute for Molecular Genetics
"Computational Regulatory Genomics"

The genome sequence encodes not only genes but also the regulatory relationships among genes. Thus, the temporal and spatial patterns of gene expression are also encrypted in the DNA sequence. In order to unravel this other genetic code, regulatory genomics attempts to integrate functional genomics data with sequence data. This talk will summarize several approaches developed in our group, starting with a biophysically motivated method for the prediction of transcription factor binding sites. The main applications are the identification of tissue-specific transcription factors and the prediction of regulatory changes due to SNPs. Further, the talk will describe some indications that the division of promoters into two classes with high and low CpG content, respectively, is of functional importance and helps in understanding mammalian promoters. In fact, the two classes of promoters display different features when it comes to binding site usage and tissue-specific regulation. The dichotomy is further supported by an analysis of histone modifications in the promoters. Taken together, we interpret this as an indication that different regulatory mechanisms govern transcription in these two classes of promoters.
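
The abstract does not say how CpG content is measured. One widely used quantity is the observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). The sketch below computes it and splits sequences into two classes. The function names are hypothetical, and the cutoff of 0.5 is an arbitrary illustrative value, not one taken from the talk.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA sequence.

    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    return cpg * len(seq) / (c * g)


def classify_promoter(seq, threshold=0.5):
    """Label a promoter sequence by CpG ratio (threshold is an assumption)."""
    return "high-CpG" if cpg_obs_exp(seq) >= threshold else "low-CpG"
```

In practice the ratio would be computed over a fixed window around the transcription start site, and the cutoff would be chosen from the distribution of ratios across all promoters.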

Bertil Schmidt, Johannes Gutenberg University Mainz
"Scalable Algorithms and Tools for Biological

High-throughput techniques for DNA sequencing have led to a rapid growth in the amount of digital biological data. The current state-of-the-art technology produces 600 billion nucleotides per machine run. Furthermore, the speed and yield of NGS (Next-generation sequencing) instruments continue to increase at a rate beyond Moore’s Law, with updates in 2012 enabling 1 trillion nucleotides per run. Correspondingly, sequencing costs (per sequenced nucleotide) continue to fall rapidly, from several billion dollars for the first human genome in 2000 to a forecast US$1000 per genome by the end of 2012. However, to be effective, the usage of NGS for medical treatment will require algorithms and tools for sequence analysis that can scale to billions of short reads. In this talk I will demonstrate how parallel computing platforms based on CUDA-enabled GPUs, multi-core CPUs, and heterogeneous CPU/GPU clusters can be used as efficient computational platforms to design and implement scalable tools for sequence analysis. I will present solutions for classical sequence alignment problems (such as pairwise sequence alignment, BLAST, multiple sequence analysis, motif finding) as well as for NGS algorithms (such as short-read error correction, short-read mapping, short-read assembly, short-read clustering).
Keywords: sequence alignment, next-generation sequencing, parallel computing, GPUs.
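
For readers unfamiliar with the pairwise alignment problem named in the abstract, the following is a minimal CPU-only sketch of Smith-Waterman local alignment scoring with a linear gap penalty. It is not the GPU implementation discussed in the talk, and the scoring values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Each cell depends only on its upper, left and upper-left neighbours, so all cells on one anti-diagonal can be computed independently. Many cells can therefore be evaluated at the same time, which is why this recurrence is a natural target for the parallel platforms described above.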

