Database v0.1 Notions and notation About this project AlgebraicPhylogenetics.jl documentation in Small Phylogenetic Trees (2007)
Notions and notation

Introduction

The algebraic-geometry theory behind the computations performed for display in this database was first presented by Lior Pachter and Bernd Sturmfels (2005; Algebraic Statistics for computational biology. Cambridge University Press. doi: 10.1017/CBO9780511610684). In particular, the 2006 version of this database is discussed in Chapter 15 "Catalog of Small Trees" of that book, written by Marta Casanellas, Luis David Garcia Puente, and Seth Sullivant. We link to their original database version, called Small Phylogenetic Trees, in the upper-right hand corner of this website.

The models studied here are the following.

Cavender-Ferris-Neyman (CFN) Cavender, James A. and Joseph Felsenstein (1987). “Invariants of phylogenies in a simple case with discrete states”. In: Journal of Classification 4(1), pp. 57–71. issn: 0176-4268. doi: 10.1007/BF01890075.

Jukes-Cantor (JC69) Jukes, Thomas H. and Charles R. Cantor (1969). "Evolution of protein molecules". In: Mammalian protein metabolism 3, pp. 21–132. doi: 10.1016/B978-1-4832-3211-9.50009-7.

Kimura 3 (K3P) Kimura, Motoo (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". In: Journal of Molecular Evolution 16(2), pp. 111–120. issn: 0022-2844. doi: 10.1007/BF01731581.

Kimura 2 (K2P) Kimura, Motoo (1981). "Estimation of evolutionary distances between homologous nucleotide sequences." In: Proceedings of the National Academy of Sciences 78(1), pp. 454–458. issn: 0027-8424. doi: 10.1073/pnas.78.1.454

General Markov (GMM) Allman, Elizabeth S., John A. Rhodes, and Amelia Taylor (2014). "A Semialgebraic Description of the General Markov Model on Phylogenetic Trees". In: SIAM Journal on Discrete Mathematics 28(2), pp. 736–755. issn: 0895-4801. doi: 10.1137/120901568. eprint: 1212.1200 (q-bio.PE).

Labelling conventions

If $T$ is a rooted tree, there is a distinguished vertex of $T$ called the root. The tree $T$ is drawn with the root at the top and all its children and their respective children below, with the leaves on the bottom-most level. The edges are numbered $e_1, e_2, ...$ starting at the upper left hand corner, proceeding left to right and top to bottom. The leaves are labeled with the numbers $1,2,3,...$ starting with the left-most leaf and proceeding left to right.

Unrooted trees will be featured in a follow-up version of the current website.

Modelling assumptions

A phylogenetic model is uniquely identified by its tree graph together with its name in natural language: Neyman 2-state model, Jukes Cantor, Kimura 2-parameter or Kimura 3-parameter model or the General Markov Model.

To each node in a model, we associate a random variable with two or four states: in the case of binary data these states are ${0,1}$, and for DNA data they are $\{$A,C,G,T$\}$, in this order. The root distribution is hence a vector of length two or four, throughout assumed to be uniform. The inner nodes are assumed to be hidden variables with the respective same number of states. The leaves $1,2,3...$ are associated to random variables called $X_1,X_2,X_3,...$ respectively.

To each edge $e_1,e_2,...$, we associate a transition matrix $M_1,M_2,...$, respectively. These matrices are of dimension $2\times 2$ or $4\times 4$, depending on whether the model is for binary data or DNA data. They code the probabilities of passing from the head-random variable of an edge to its tail-random variable. These probabilities in matrix $M_i$ are parameters $a_i,b_i,c_i,d_i$, some equal, depending on the model. We assume mutual independence of all random variables. We further assume that the parameters satisfy additional linear constraints ensuring that the transition matrices are stochastic. We do not, however, use these linear relations to eliminate parameters.

The distribution of each leaf is calculated using a root-to-leaf product of transition matrices. This may be expressed in a probability or Fourier parametrisation. The implicitisation we calculate of each model represents their joint distribution.

Computations

The software AlgebraicPhylogenetics.jl provides a coherent, open-source computational environment for defining phylogenetic models from trees, eliminating parameters to find a algebro-geometric representation of these models, and commands to calculate their properties as listed below. Code snippets and an in-depth introduction to all functionalities are presented in the software documentation.

An upcoming version of this website will feature a running example using that software.

Invariants

The notation and explanation of invariants presented below is taken from the original Small Phylogenetic Trees website. It reflects conventions and knowledge within the theory cited above.

$np$ Number of equivalences classes of the probability coordinates.
$nq$ Number of equivalence classes of Fourier coordinates without counting class 0. This is also the dimension of the smallest linear space that contains the model.
$p_i$ The sum of all probability coordinates in the $i$th equivalence class. These indeterminates form a set of sufficient statistics for the given model.
$q_j$ The average of all Fourier coordinates in the $j$th equivalence class.
dim The dimension of the model. This corresponds to the least number of parameters needed to specify the model plus one. This numerical invariant is actually the algebraic dimension of the toric ideal. After slicing the corresponding variety with the hyperplane given by the polynomial obtained from adding all indeterminates qj minus 1, the dimension drops by one giving exactly the minimum number of parameters needed to specify the model. This algebraic operation corresponds to considering only the points in the variety whose coordinates sum up to 1.
deg The degree of the model. Algebraically, this is defined as the number of points in the intersection of the model and a generic (i.e. "random") subspace of complementary dimension. This is one of the most important numerical invariants of a variety. In particular, when the variety is zero-dimensional (a collection of points), this numerical invariant gives the actual number of points.
MLdeg The maximum likelihood degree of the model. This is the degree of the ML equations of the model. Given some data and a model, the likelihood function is a rational function on the corresponding (projective) variety. The set of solutions to the ML equations consists of all critical points of the Likelihood function. These gives an algebraic algorithm to compute symbolically all local maxima of the Likelihood function. the ML degree is a measure of how many local maxima are there. This numerical invariant should be useful when applying hill-climbing algorithms to compute local maxima if it is used as a measure of confidence on the chances that the local maxima approximated is actually a global maximum.
dim Sing The dimension of the set of singular points of the model. The singular locus of a model is very important to understand the geometry of the model. In particular, it is necessary for the ML degree computation.
deg Sing The algebraic degree of the set of singular points of the model.