binary translator numbers
Transcription
binary translator numbers
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie Olivier Gascuel Centre National de la Recherche Scientifique LIRMM, Montpellier, France www.lirmm.fr/gascuel 10 permanent researchers 2 technical staff 3 postdocs, 10 PhDs www.lirmm.fr/mab 1 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Phylogenetics • Tree building algorithms and software (PhyML) • Tree combinatorics (triplets, quartets, minimum evolution) • Supertrees, gene trees/species trees • Probabilistic modeling of substitutions • Branch testing 2005 2007 2 Text algorithmics • Alignment • Searching for repeats, motifs, tags • Comparative genomics • New generation sequencing Machine learning and clustering • Expression data analysis • Hidden Markov Models Mining pathogens • Plasmodium falciparum (malaria) • HIV, Flu Biodiversity Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie • • • • • • • A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees 3 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Charles Darwin 1809-1859-1882 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Charles Darwin 1809-1859-1882 4 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Charles Darwin 1809-1859-1882 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Charles Darwin 1809-1859-1882 5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbre d’Haeckel ~1875 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Jakob Steiner 1796-1863 6 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Watson et Crick 1962 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Replication DNA duplicates Séquence Transcription RNA synthesis Translation Protein synthesis Protein folding Fonction 7 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Replication DNA duplicates Mutations Transcription RNA synthesis Translation Protein synthesis Protein folding Sélection Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Theodosius Dobzhansky 1900-1973-1975 Nothing in Biology Makes Sense Except in the Light of Evolution 8 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 The Tree(s) of Life Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 The Tree(s) of Life 9 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 HIV subtype A – Eastern & Southern Europe Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 The Tree(s) of Life 10 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie • • • • • • • A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 A rooted phylogenetic tree Molecular clock assumption Tax1 Tax2 Tax5 Tax3 Tax4 11 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 A rooted phylogenetic tree Outgroup species OutG. Tax1 Tax2 Tax5 Tax3 Tax4 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 An unrooted phylogenetic tree • Inferred by most methods • Topology • Branch lengths • Binary Tax4 Tax1 Tax2 Tax3 Tax5 12 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 An unrooted phylogenetic tree • Inferred by some methods • Topology • Branch lengths • Unresolved Tax4 Tax1 Tax2 Tax3 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Graphs (vertices, leaves, labels, edges, weights) Adjacency table Ngbr 1 A A B C C 1 3 4 Tax1 Tax2 Tax3 Tax4 Tax5 A B C Ngbr 2 Ngbr 3 2 A 5 B C B Tax1 A Tax4 B C Tax2 Tax3 Tax5 13 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Graphs (vertices, leaves, labels, edges, weights) Ngbr1 A A B C C 1 3 4 Tax1 Tax2 Tax3 Tax4 Tax5 A B C Tax1 W1 A Lgth1 W1 W2 W3 W4 W5 W1 W3 W4 Wab Lgth2 Ngbr2 W2 Wab W5 2 A 5 B B C B W4 Wbc Wab Wbc Wbc Tax4 W5 W3 Tax2 Lgth3 Computer representation C W2 Ngbr3 Tax3 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions, newick format) ((1 X 2) + ((3 X (4 – 5))) = -1 + X X _ 1 2 5 3 4 14 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions and recursions) Value Expression ( LeftExp Operator RightExp ) compute (Exp) If Exp = Value, then return: Value Else Exp = (LeftExp Op RightExp) x = compute (LeftExp) y = compute (RightExp) Return: x Op y Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions and recursions) Expression Value • Recursive procedure ( LeftExp Operator RightExp • Strongly connected to )trees • Widely used • E.g. with parsimony, ML… compute (Exp) If Exp = Value, then return: Value Else Exp = (LeftExp Op RightExp) x = compute (LeftExp) y = compute (RightExp) Return: x Op y 15 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions, newick format) ((Tax1 Tax2)(Tax3 (Tax4 Tax5))) Tax1 Tax2 Tax5 Tax3 Tax4 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions, newick format) ((Tax1 Tax2) Tax3 (Tax4 tax5)) (Tax1 Tax2 (Tax3 (Tax4 Tax 5))) (Tax5 Tax4 ((Tax2 Tax1) Tax3)) …. Tax1 Tax4 Tax2 Tax3 Tax5 16 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Parentheses (expressions, newick format) Output format ((Tax1:W1, Tax2:W2):Wab, Tax3:W3, ((Tax4:W4, Tax5:W5):Wbc); Tax1 W1 W2 Tax2 A Wab B W4 Wbc C W3 Tax4 W5 Tax3 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Bipartitions (binary characters, splits) Topology { {Tax1, Tax2} | {Tax3, Tax4, Tax5} {Tax1, Tax2, Tax3} | {Tax4, Tax5} Tree building {Tax1} | {Tax2, Tax3, Tax4, Tax5} and comparison {Tax2} | L - {Tax2}, {Tax3} | L - {Tax3} {Tax4} | L - {Tax4}, {Tax5} | L - {Tax5} } Tax1 Tax4 Tax2 Tax3 Tax5 17 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Bipartitions (binary characters, splits) • A topology defines a bipartition set • Given a bipartition set, it’s easy to check that it defines a unique topology, using a local condition: A | B and A' | B' are tree compatible iff one of A A', A B', B A', B B' is empty Tax1 B A’ Tax4 A Tax2 Tax3 A topology is “equal” to its Tax5 bipartition set Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Bipartitions (binary characters, splits) • A phylogenetic tree defines a bipartition set • Given a bipartition set, it’s easy to check that it defines a unique topology, which may be unresolved { Tax1 {Tax1, Tax2} | {Tax3, Tax4, Tax5} {Tax1} | L - {Tax1} … } Tax4 Tax2 Tax5 Tax3 18 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Bipartitions (binary characters, splits) • L = {eagle, duck, dog, mouse, kiwi) • wings = {eagle, duck, kiwi} | {mouse, dog} • fly = {eagle, duck} | {kiwi, mouse, dog} • {eagle, duck} {mouse, dog} = Maximizing character compatibility is hard! eagle mouse duck kiwi dog Four Characters Suffice Katharina Huber, The Swedish University of Agricultural Sciences, and The Linnaeus Centre for Bioinformatics, Uppsala University, Sweden Vincent Moulton, The Linnaeus Centre for Bioinformatics, and Mike Steel, University of Canterbury, New Zealand 19 Phylogenetic trees Throughout this talk, we let X denote a finite set (of taxa). A tree T= (V,E ) together with a map f : X V is called a phylogenetic tree (on X) if f is a bijection of X onto the leaf set of T and all interior vertices of T have degree 3. 6 5 X = {1,2,…,6} 4 1 2 3 Partitions from characters We can associate a collection of partitions of X to any given collection of characters defined on X. 1 ATCGCTC 2 ATGCCGC 3 AGCTAGA 4 TCCAGTA 12|3|4 20 Convexity A partition P is called convex on a phylogenetic tree T if for distinct parts A, B of Pthe subtrees TA and TB (i.e. the minimal subtrees of T containing A, B, respectively) have no vertex in common. Consider the partition 12 | 35 | 4 | 6 5 6 4 1 2 T 5 6 3 1 4 T{35} 2 3 Defining trees A collection of partitions P defines a phylogenetic tree T if P is convex on T, and T is the only phylogenetic tree with this property. 5 6 4 1 2 T 3 {56 | 134 | 2 , 12 | 345 | 6 , 34 | 125 | 6} defines T 21 but if P = {12 | 35 | 4 | 6, 34 | 16 | 2 | 5, 56 | 24 | 1 | 3 } then P is convex on both of the following trees 6 1 5 1 4 2 T 3 6 4 3 2 T' 5 Question How many partitions suffice to define any given phylogenetic tree? 22 2 suffices for caterpillars 1 3 2 4 5 6 7 8 … Three partitions do not suffice…. 12 1 11 2 10 3 9 4 8 7 6 5 23 …..but at most five do! Theorem (Semple and Steel, 2002) Every phylogenetic tree can be defined by at most five partitions. Four does suffice! Theorem (Huber, Moulton, Steel, 2003) Any phylogenetic tree can be defined by at most four partitions. 24 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 1 Quartets (4-trees) 3 2 Topology Tax1 1 4 2 5 4 1 3 2 5 1 4 3 5 2 4 3 5 Tax4 Tax2 Tax3 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Quartets (4-trees) Topology { 12|34, 12|35, 12|45, 13|45, 23|45 } Tree building and comparison Tax1 Tax4 Tax2 Tax3 Tax5 25 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Quartets (4-trees) • A complete quartet set: for every quadruple {i, j, k, l} we have one resolved 4-tree, eg ij | kl • A binary topology defines a complete quartet set • It easy to check that a complete quartet set is tree compatible, and then defines a unique tree. A tree is “equal” to its quartet set. Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Quartets (4-trees) • It’s easy to infer 4-trees for all quadruples (eg ML) • But: 4-trees are not reliable • It is computationally hard to check that an incomplete quartet set is tree compatible • It is computationally hard to select the maximum number of compatible 4-trees Heuristics needed! 26 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Additive distances Tree with branch lengths Tree building (and comparison) Tax1 0 3 7 11 12 3 0 6 10 11 7 6 0 8 9 11 10 8 0 5 12 11 9 5 0 2 3 2 4 1 2 Tax2 Tax4 3 Tax3 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Additive distances 17 i = 1, j = 2, k = 3, l = 4 • A tree with lengths defines an additive distance. • A distance is additive iff it satisfies the local “4-point” condition: For every quadruple i, j , k , l , the two largests of 11 17 ij kl , ik jl , il jk are equal Tax1 2 3 1 Tax2 4 2 Tax4 2 Tax3 27 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Additive distances • A tree with lengths defines an additive distance. • A distance is additive iff it satisfies the local 4-point condition, which is easily checked. • An additive distance defines a unique tree, which is easily built. A tree is “equal” to its path length distance Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Additive distances • Estimating evolutionary distances between all taxon pairs is easy (ML) • But these distances are never 100% additive • This induces hard optimization problems • Numerous approaches and heuristics 28 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Summary of Definitions • Numerous definitions of trees (graphs, parentheses, bipartitions, characters, quartets, distances) • Used to represent, compare and infer trees • These definitions involve easy (polynomial) algorithms to recognize trees and change of representation • But hard problems to infer trees from data Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie • • • • • • • A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees 29 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Numbers Number of edges in a binary tree with n taxa: 2 tax: 1, 3 tax: 3, 4 tax: 5, 5 tax : 7 … n tax: e(n) = e(n-1) + 2 = 2n - 3 n Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Numbers Number of unrooted binary trees with n taxa: : 2 tax: 1, 3 tax: 1, 4 tax: 3, 5 tax : 15, 5 tax : 105 … 53 tax: 1080 atoms in the universe n tax: t(n) = t(n - 1) x e(n - 1) = (2n – 5)(2n – 7) … 6 6 6 6 hard optimization problems! 6 6 6 30 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Numbers Number of rooted binary trees with n taxa: 2 tax: 1, 3 tax: 3, 4 tax: 15, 5 tax: 105 … n tax: r(n) = t(n) x e(n) = t(n+1) = (2n – 3) (2n – 5) … root root Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie • • • • • • • A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees 31 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Topogical distance • Measure the distance between two topologies with the same taxon set • To analyze alternative trees (e.g. with parsimony) • To compare reconstruction methods with simulated data • To infer horizontal gene transfers •… Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Robinson & Foulds topogical distance • Number of moves to transform one tree into the other • Moves = edge contraction, unresolved node expansion Tax1 Tree1 Tax4 Tax2 Tax3 Tax5 32 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Robinson & Foulds topogical distance • Number of moves to transform one tree into the other • Moves = edge contraction, unresolved node expansion Tax1 Tree2 Tax3 Tax2 Tax4 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Robinson & Foulds topogical distance • Number of moves to transform one tree into the other • Moves = edge contraction, unresolved node expansion Tax1 contraction Tax3 Tax2 Tax4 Tax5 33 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Robinson & Foulds topogical distance • Number of moves to transform one tree into the other • Moves = edge contraction, unresolved node expansion expansion Tree1 Tax1 R&F(Tree1, Tree2) = 2 Tax3 Tax2 Tax4 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Bipartition distance • Number of bipartitions in one tree but not the other • Bipartition and R&F distances are equal (easy calculation) Tree1: {12|345, 123|45} Tax1 Tree2: {12|345, 124|35} Tax4 Tax3 Tax2 Tax3 Tax4 Tax5 34 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Quartet distance • Number of 4-trees in one tree but not the other • Easy to compute, more refined than R&F distance Tree1: {12|34, 12|35, 12|45, 13|45, 23|45} Tax1 Tree2: {12|34, 12|35, 12|45, 14|35, 24|35} Tax4 Tax3 QD = 4 Tax2 Tax3 Tax4 Tax5 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Horizontal gene transfers and SPR distance Species tree Tax1 Tax2 Tax5 Tax3 Tax4 35 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Horizontal gene transfers and SPR distance Gene transfer Tax1 Tax2 Tax5 Tax3 Tax4 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Horizontal gene transfers and SPR distance Subtree (Tax4, Tax5) is Pruned and Regraft 1 HGT 1 SPR Gene tree Tax1 Tax2 Tax3 Tax5 Tax4 36 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Horizontal gene transfers and SPR distance • SPR distance: minimum number of SPR moves required to tansform one tree into the other. • Biologically relevant: number of HGTs • Very hard to compute! Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Exercice Compute the RF and SPR distance between: Tax1 Tax2 Tax5 Tax3 Tax4 Tax2 Tax4 Tax3 Tax5 Tax1 37 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Arbres formels et Arbre de la Vie • • • • • • • A bit of history and biology Definitions Numbers Topological distances Consensus Random models Algorithms to build trees Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Consensus • We aim at estimating the consensus of a family of trees with the same taxon set. • Most consensus problems are hard (think to elections…) • But it’s easy to define and compute the majority rule consensus tree 38 Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Majority rule consensus tree • n trees with the same taxon set • every tree t defines a bipartition set Bt = {b} • collect B = { b seen in > n/2 sets Bt } • any pair b, b’ is seen in at least one common set Bt • therefore, b and b’ are tree compatible • and B defines a unique tree! Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011 Exercice: Compute the majority consensus tree between Tax1 Tax2 Tax5 Tax3 Tax2 Tax1 Tax4 Tax4 Tax5 Tax3 Tax2 Tax3 Tax1 Tax5 Tax4 39