binary translator numbers

Transcription

binary translator numbers
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
Olivier Gascuel
Centre National de la Recherche Scientifique
LIRMM, Montpellier, France
www.lirmm.fr/gascuel
10 permanent researchers
2 technical staff
3 postdocs, 10 PhDs
www.lirmm.fr/mab
1
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Phylogenetics
• Tree building algorithms and software (PhyML)
• Tree combinatorics (triplets, quartets, minimum evolution)
• Supertrees, gene trees/species trees
• Probabilistic modeling of substitutions
• Branch testing
2005
2007
2
Text algorithmics
• Alignment
• Searching for repeats, motifs, tags
• Comparative genomics
• New generation sequencing
Machine learning and clustering
• Expression data analysis
• Hidden Markov Models
Mining pathogens
• Plasmodium falciparum (malaria)
• HIV, Flu
Biodiversity
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
•
•
•
•
•
•
•
A bit of history and biology
Definitions
Numbers
Topological distances
Consensus
Random models
Algorithms to build trees
3
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Charles Darwin 1809-1859-1882
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Charles Darwin 1809-1859-1882
4
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Charles Darwin 1809-1859-1882
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Charles Darwin 1809-1859-1882
5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbre d’Haeckel ~1875
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Jakob Steiner 1796-1863
6
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Watson et Crick 1962
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Replication
DNA duplicates
Séquence
Transcription
RNA synthesis
Translation
Protein synthesis
Protein folding
Fonction
7
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Replication
DNA duplicates
Mutations
Transcription
RNA synthesis
Translation
Protein synthesis
Protein folding
Sélection
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Theodosius Dobzhansky 1900-1973-1975
Nothing in Biology Makes Sense
Except in the Light of Evolution
8
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
The Tree(s) of Life
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
The Tree(s) of Life
9
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
HIV subtype A – Eastern & Southern Europe
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
The Tree(s) of Life
10
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
•
•
•
•
•
•
•
A bit of history and biology
Definitions
Numbers
Topological distances
Consensus
Random models
Algorithms to build trees
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
A rooted phylogenetic tree
Molecular clock assumption
Tax1
Tax2
Tax5
Tax3
Tax4
11
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
A rooted phylogenetic tree
Outgroup species
OutG.
Tax1
Tax2
Tax5
Tax3
Tax4
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
An unrooted phylogenetic tree
• Inferred by most methods
• Topology
• Branch lengths
• Binary
Tax4
Tax1
Tax2
Tax3
Tax5
12
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
An unrooted phylogenetic tree
• Inferred by some methods
• Topology
• Branch lengths
• Unresolved
Tax4
Tax1
Tax2
Tax3
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Graphs (vertices, leaves, labels, edges, weights)
Adjacency table
Ngbr 1
A
A
B
C
C
1
3
4
Tax1
Tax2
Tax3
Tax4
Tax5
A
B
C
Ngbr 2
Ngbr 3
2
A
5
B
C
B
Tax1
A
Tax4
B
C
Tax2
Tax3
Tax5
13
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Graphs (vertices, leaves, labels, edges, weights)
Ngbr1
A
A
B
C
C
1
3
4
Tax1
Tax2
Tax3
Tax4
Tax5
A
B
C
Tax1
W1
A
Lgth1
W1
W2
W3
W4
W5
W1
W3
W4
Wab
Lgth2
Ngbr2
W2
Wab
W5
2
A
5
B
B
C
B
W4
Wbc
Wab
Wbc
Wbc
Tax4
W5
W3
Tax2
Lgth3
Computer
representation
C
W2
Ngbr3
Tax3
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions, newick format)
((1 X 2) + ((3 X (4 – 5)))
= -1
+
X
X
_
1
2
5
3
4
14
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions and recursions)
Value
Expression
( LeftExp Operator RightExp )
compute (Exp)
If Exp = Value, then return: Value
Else Exp = (LeftExp Op RightExp)
x = compute (LeftExp)
y = compute (RightExp)
Return: x Op y
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions and recursions)
Expression
Value
• Recursive procedure
( LeftExp
Operator
RightExp
• Strongly
connected
to )trees
• Widely used
• E.g. with parsimony, ML…
compute (Exp)
If Exp = Value, then return: Value
Else Exp = (LeftExp Op RightExp)
x = compute (LeftExp)
y = compute (RightExp)
Return: x Op y
15
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions, newick format)
((Tax1 Tax2)(Tax3 (Tax4 Tax5)))
Tax1
Tax2
Tax5
Tax3
Tax4
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions, newick format)
((Tax1 Tax2) Tax3 (Tax4 tax5))
(Tax1 Tax2 (Tax3 (Tax4 Tax 5)))
(Tax5 Tax4 ((Tax2 Tax1) Tax3))
….
Tax1
Tax4
Tax2
Tax3
Tax5
16
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Parentheses (expressions, newick format)
Output format
((Tax1:W1, Tax2:W2):Wab,
Tax3:W3,
((Tax4:W4, Tax5:W5):Wbc);
Tax1
W1
W2
Tax2
A
Wab
B
W4
Wbc
C
W3
Tax4
W5
Tax3
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Bipartitions (binary characters, splits)
Topology 
{
{Tax1, Tax2} | {Tax3, Tax4, Tax5}
{Tax1, Tax2, Tax3} | {Tax4, Tax5}
Tree building
{Tax1} | {Tax2, Tax3, Tax4, Tax5}
and comparison
{Tax2} | L - {Tax2}, {Tax3} | L - {Tax3}
{Tax4} | L - {Tax4}, {Tax5} | L - {Tax5}
}
Tax1
Tax4
Tax2
Tax3
Tax5
17
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Bipartitions (binary characters, splits)
• A topology defines a bipartition set
• Given a bipartition set, it’s easy to check
that it defines a unique topology, using
a local condition:
A | B and A' | B' are tree compatible iff
one of A  A', A  B', B  A', B  B' is empty
Tax1
B
A’
Tax4
A
Tax2
Tax3
A topology is “equal” to its
Tax5
bipartition
set
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Bipartitions (binary characters, splits)
• A phylogenetic tree defines a bipartition set
• Given a bipartition set, it’s easy to check
that it defines a unique topology, which
may be unresolved
{
Tax1
{Tax1, Tax2} | {Tax3, Tax4, Tax5}
{Tax1} | L - {Tax1} … }
Tax4
Tax2
Tax5
Tax3
18
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Bipartitions (binary characters, splits)
• L = {eagle, duck, dog, mouse, kiwi)
• wings = {eagle, duck, kiwi} | {mouse, dog}
• fly = {eagle, duck} | {kiwi, mouse, dog}
• {eagle, duck}  {mouse, dog} = 
Maximizing character
compatibility is hard!
eagle
mouse
duck
kiwi
dog
Four Characters Suffice
Katharina Huber,
The Swedish University of Agricultural Sciences, and
The Linnaeus Centre for Bioinformatics,
Uppsala University, Sweden
Vincent Moulton,
The Linnaeus Centre for Bioinformatics,
and
Mike Steel,
University of Canterbury, New Zealand
19
Phylogenetic trees
Throughout this talk, we let X denote a finite set (of taxa).
A tree T= (V,E ) together with a map f : X
V is
called a phylogenetic tree (on X) if f is a bijection
of X onto the leaf set of T and all interior vertices of T
have degree 3.
6
5
X = {1,2,…,6}
4
1
2
3
Partitions from characters
We can associate a collection of partitions of X to
any given collection of characters defined on X.
1
ATCGCTC
2
ATGCCGC
3
AGCTAGA
4
TCCAGTA
12|3|4
20
Convexity
A partition P is called convex on a phylogenetic tree T if for distinct
parts A, B of Pthe subtrees TA and TB (i.e. the minimal
subtrees of T containing A, B, respectively) have no vertex in common.
Consider the partition 12 | 35 | 4 | 6
5
6
4
1
2
T
5
6
3
1
4
T{35}
2
3
Defining trees
A collection of partitions P defines a phylogenetic tree T if P is
convex on T, and T is the only phylogenetic tree with this property.
5
6
4
1
2
T
3
{56 | 134 | 2 , 12 | 345 | 6 , 34 | 125 | 6} defines T
21
but if
P = {12 | 35 | 4 | 6, 34 | 16 | 2 | 5, 56 | 24 | 1 | 3 }
then P is convex on both of the following trees
6
1
5
1
4
2
T
3
6
4
3
2
T'
5
Question
How many partitions suffice to define any
given phylogenetic tree?
22
2 suffices for caterpillars
1
3
2
4
5
6
7
8
…
Three partitions do not suffice….
12
1
11
2
10
3
9
4
8
7
6
5
23
…..but at most five do!
Theorem (Semple and Steel, 2002)
Every phylogenetic tree can be defined by
at most five partitions.
Four does suffice!
Theorem (Huber, Moulton, Steel, 2003)
Any phylogenetic tree can be defined by at most
four partitions.
24
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
1
Quartets (4-trees)
3
2
Topology 
Tax1
1
4
2
5
4
1
3
2
5
1
4
3
5
2
4
3
5
Tax4
Tax2
Tax3
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Quartets (4-trees)
Topology  { 12|34, 12|35, 12|45, 13|45, 23|45 }
Tree building
and comparison
Tax1
Tax4
Tax2
Tax3
Tax5
25
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Quartets (4-trees)
• A complete quartet set: for every quadruple
{i, j, k, l} we have one resolved 4-tree, eg ij | kl
• A binary topology defines a complete quartet set
• It easy to check that a complete quartet set is
tree compatible, and then defines a unique tree.
A tree is “equal”
to its quartet set.
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Quartets (4-trees)
• It’s easy to infer 4-trees for all quadruples (eg ML)
• But: 4-trees are not reliable
• It is computationally hard to check that an
incomplete quartet set is tree compatible
• It is computationally hard to select the maximum
number of compatible 4-trees
Heuristics needed!
26
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Additive distances
Tree with branch lengths 
Tree building
(and comparison)
Tax1
 0 3 7 11 12 
 3 0 6 10 11 


7 6 0 8 9
 11 10 8 0 5 


 12 11 9 5 0 


2
3
2
4
1
2
Tax2
Tax4
3
Tax3
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Additive distances
17
i = 1, j = 2, k = 3, l = 4
• A tree with lengths defines an additive distance.
• A distance is additive iff it satisfies the local
“4-point” condition:
For every quadruple i, j , k , l , the two largests of
11
17
 ij  kl  ,  ik   jl  ,  il   jk  are equal
Tax1
2
3
1
Tax2
4
2
Tax4
2
Tax3
27
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Additive distances
• A tree with lengths defines an additive distance.
• A distance is additive iff it satisfies the local
4-point condition, which is easily checked.
• An additive distance defines a unique tree,
which is easily built.
A tree is “equal” to
its path length
distance
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Additive distances
• Estimating evolutionary distances between all
taxon pairs is easy (ML)
• But these distances are never 100% additive
• This induces hard optimization problems
• Numerous approaches and heuristics
28
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Summary of Definitions
• Numerous definitions of trees (graphs, parentheses,
bipartitions, characters, quartets, distances)
• Used to represent, compare and infer trees
• These definitions involve easy (polynomial)
algorithms to recognize trees and change of
representation
• But hard problems to infer trees from data
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
•
•
•
•
•
•
•
A bit of history and biology
Definitions
Numbers
Topological distances
Consensus
Random models
Algorithms to build trees
29
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Numbers
Number of edges in a binary tree with n taxa:
2 tax: 1, 3 tax: 3, 4 tax: 5, 5 tax : 7 …
n tax: e(n) = e(n-1) + 2 = 2n - 3
n
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Numbers
Number of unrooted binary trees with n taxa: :
2 tax: 1, 3 tax: 1, 4 tax: 3, 5 tax : 15, 5 tax : 105 …
53 tax:  1080  atoms in the universe
n tax: t(n) = t(n - 1) x e(n - 1) = (2n – 5)(2n – 7) …
6
6
6
6
 hard optimization
problems!
6
6
6
30
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Numbers
Number of rooted binary trees with n taxa:
2 tax: 1, 3 tax: 3, 4 tax: 15, 5 tax: 105 …
n tax: r(n) = t(n) x e(n) = t(n+1) = (2n – 3) (2n – 5) …
root
root
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
•
•
•
•
•
•
•
A bit of history and biology
Definitions
Numbers
Topological distances
Consensus
Random models
Algorithms to build trees
31
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Topogical distance
• Measure the distance between two topologies with
the same taxon set
• To analyze alternative trees (e.g. with parsimony)
• To compare reconstruction methods with simulated data
• To infer horizontal gene transfers
•…
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Robinson & Foulds topogical distance
• Number of moves to transform one tree into the other
• Moves = edge contraction, unresolved node expansion
Tax1
Tree1
Tax4
Tax2
Tax3
Tax5
32
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Robinson & Foulds topogical distance
• Number of moves to transform one tree into the other
• Moves = edge contraction, unresolved node expansion
Tax1
Tree2
Tax3
Tax2
Tax4
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Robinson & Foulds topogical distance
• Number of moves to transform one tree into the other
• Moves = edge contraction, unresolved node expansion
Tax1
contraction
Tax3
Tax2
Tax4
Tax5
33
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Robinson & Foulds topogical distance
• Number of moves to transform one tree into the other
• Moves = edge contraction, unresolved node expansion
expansion  Tree1
Tax1
R&F(Tree1, Tree2) = 2
Tax3
Tax2
Tax4
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Bipartition distance
• Number of bipartitions in one tree but not the other
• Bipartition and R&F distances are equal (easy calculation)
Tree1: {12|345, 123|45}
Tax1
Tree2: {12|345, 124|35}
Tax4
Tax3
Tax2
Tax3
Tax4
Tax5
34
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Quartet distance
• Number of 4-trees in one tree but not the other
• Easy to compute, more refined than R&F distance
Tree1: {12|34, 12|35, 12|45, 13|45, 23|45}
Tax1
Tree2: {12|34, 12|35, 12|45, 14|35, 24|35}
Tax4
Tax3
QD = 4
Tax2
Tax3
Tax4
Tax5
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Horizontal gene transfers and SPR distance
Species tree
Tax1
Tax2
Tax5
Tax3
Tax4
35
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Horizontal gene transfers and SPR distance
Gene transfer
Tax1
Tax2
Tax5
Tax3
Tax4
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Horizontal gene transfers and SPR distance
Subtree (Tax4, Tax5)
is Pruned and Regraft
1 HGT  1 SPR
Gene tree
Tax1
Tax2
Tax3
Tax5
Tax4
36
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Horizontal gene transfers and SPR distance
• SPR distance: minimum number of SPR moves required
to tansform one tree into the other.
• Biologically relevant:  number of HGTs
• Very hard to compute!
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Exercice
Compute the RF and SPR distance between:
Tax1
Tax2
Tax5
Tax3
Tax4
Tax2
Tax4
Tax3
Tax5
Tax1
37
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Arbres formels et Arbre de la Vie
•
•
•
•
•
•
•
A bit of history and biology
Definitions
Numbers
Topological distances
Consensus
Random models
Algorithms to build trees
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Consensus
• We aim at estimating the consensus of a family of trees
with the same taxon set.
• Most consensus problems are hard (think to elections…)
• But it’s easy to define and compute the majority rule
consensus tree
38
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Majority rule consensus tree
• n trees with the same taxon set
• every tree t defines a bipartition set Bt = {b}
• collect B = { b seen in > n/2 sets Bt }
• any pair b, b’ is seen in at least one common set Bt
• therefore, b and b’ are tree compatible
• and B defines a unique tree!
Olivier Gascuel – Arbres formels et Arbre de la Vie – Conférence ENS Cachan, septembre 2011
Exercice: Compute the majority consensus tree between
Tax1
Tax2
Tax5
Tax3
Tax2
Tax1
Tax4
Tax4
Tax5
Tax3
Tax2
Tax3
Tax1
Tax5
Tax4
39