Database v0.3
Notions and notation

Introduction

Algebraic phylogenetics studies Markov processes of sequence evolution through the algebraic varieties they define. Understanding the geometric properties of these varieties is essential for robust statistical inference and model selection. However, this requires computing algebraic objects and invariants that are expensive to obtain, such as the vanishing ideal of the variety, the maximum likelihood (ML) degree, and the Euclidean distance (ED) degree.

Deriving these invariants for phylogenetic trees and networks is highly resource-intensive. The original 2007 version of this database, discussed in the chapter Catalog of Small Trees in [6], addressed this by compiling tables of pre-computed invariants for trees with up to five species. The current website modernizes and extends that legacy, incorporating novel data structures, extended support for phylogenetic networks, and natively integrated data via the mrdi file format.

The computations underlying all models presented here are performed using the OSCAR computer algebra system [13]. For further detail, see the documentation for the phylogenetics section and the accompanying paper [5]. The architecture and design of this website are discussed in our associated paper, Making mathematical online resources FAIR: at the example of small phylogenetic trees [4].

Phylogenetic models

Evolutionary processes along a phylogeny are modeled as Markov processes on a phylogenetic tree or network. In this framework, the tree or network represents the evolutionary history: nodes correspond to species (extant at the leaves and ancestral at internal nodes), while edges represent the evolutionary lineages connecting them. Mutations along each edge are governed by a Markov transition matrix acting on a discrete state space, typically the four nucleotide bases {A, C, G, T}.

A phylogenetic model is uniquely identified by its underlying tree or network graph, together with an evolutionary model that imposes specific symmetry constraints on the transition matrices. The models considered here are: Cavender-Farris-Neyman (CFN) [7], Jukes-Cantor (JC) [10], Kimura 2-parameter (K2P) [11], Kimura 3-parameter (K3P) [12], and General Markov (GM) [1, 3].

The parametrization of a phylogenetic model is defined by the entries of the transition matrices associated with each edge and the state distribution at the root node. In standard probability coordinates, the joint probability of observing a nucleotide pattern at the leaves is obtained by marginalizing over these transition probabilities across the unobserved internal nodes, yielding a polynomial map from the parameter space to the probability distribution.
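As a minimal sketch of this marginalization, written in plain Julia and independent of OSCAR, consider the CFN model on a 4-leaf tree whose interior edge separates leaves 1, 2 from leaves 3, 4. The names cfn_matrix and quartet_distribution, the edge ordering, and the numerical flip probabilities are illustrative assumptions and are not part of the database or of OSCAR.

```julia
# Symmetric 2x2 CFN transition matrix: keep the state with probability 1 - t,
# flip it with probability t.
cfn_matrix(t) = [1-t t; t 1-t]

# Joint leaf distribution on the quartet tree with split 12|34.
# M[1..4] sit on the pendant edges of leaves 1..4 and M[5] on the interior edge.
# The root (with distribution `root`) is the interior vertex adjacent to leaves 1 and 2.
function quartet_distribution(root::Vector{Float64}, M::Vector{Matrix{Float64}})
    n = length(root)
    p = zeros(n, n, n, n)
    for i in 1:n, j in 1:n, k in 1:n, l in 1:n   # observed states at the leaves
        for u in 1:n, v in 1:n                   # hidden states at the two interior vertices
            p[i, j, k, l] += root[u] * M[1][u, i] * M[2][u, j] *
                             M[5][u, v] * M[3][v, k] * M[4][v, l]
        end
    end
    return p
end

M = [cfn_matrix(t) for t in (0.1, 0.2, 0.05, 0.15, 0.3)]
p = quartet_distribution([0.5, 0.5], M)

println(sum(p))                           # total probability, 1 up to rounding
println(p[1, 1, 1, 1] ≈ p[2, 2, 2, 2])    # true: CFN is invariant under swapping the two states
```

Every entry of p is a polynomial in the root distribution and the matrix entries, which is the polynomial map described above evaluated at one parameter choice.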

Group-based models (such as JC, CFN, K2P, and K3P) impose symmetries coming from a group structure on the state space. Their transition matrices can be simultaneously diagonalized, so each matrix is described by its eigenvalues, known as Fourier parameters. Applying a discrete Fourier transform, which is a linear change of coordinates, turns the standard probability coordinates into Fourier coordinates. In this coordinate system the parametrization becomes a monomial map. This monomial structure identifies the group-based model as a toric variety, which greatly reduces the cost of algebraic computations such as finding the vanishing ideal.
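The following plain-Julia sketch illustrates this for the CFN model on a 4-leaf tree with split 12|34 and uniform root distribution. It computes the probability coordinates numerically, applies the discrete Fourier transform over the two-element group in each leaf coordinate, and compares the result with the expected monomials in the eigenvalues 1 - 2t of the transition matrices. The helper names (sgn, edgeterm, monomial), the edge ordering, and the parameter values are assumptions made for illustration only.

```julia
# Flip probabilities for edges 1..4 (pendant) and 5 (interior); illustrative values.
t = [0.1, 0.2, 0.05, 0.15, 0.3]
M = [[1-s s; s 1-s] for s in t]      # CFN transition matrices
x = [1 - 2s for s in t]              # nontrivial eigenvalue of each matrix (Fourier parameter)

# Probability coordinates: sum over the hidden states (u, v) of the interior vertices.
p = [sum(0.5 * M[1][u, i] * M[2][u, j] * M[5][u, v] * M[3][v, k] * M[4][v, l]
         for u in 1:2, v in 1:2)
     for i in 1:2, j in 1:2, k in 1:2, l in 1:2]

# Discrete Fourier transform in every leaf coordinate.
# Index 1 stands for the trivial group element, index 2 for the nontrivial one.
sgn(g, i) = (g == 2 && i == 2) ? -1 : 1
q = [sum(sgn(g1, i) * sgn(g2, j) * sgn(g3, k) * sgn(g4, l) * p[i, j, k, l]
         for i in 1:2, j in 1:2, k in 1:2, l in 1:2)
     for g1 in 1:2, g2 in 1:2, g3 in 1:2, g4 in 1:2]

# Expected monomial: an edge contributes its eigenvalue exactly when the group
# element carried by that edge is nontrivial; the interior edge carries g1 + g2.
edgeterm(xe, g) = g == 2 ? xe : 1.0
function monomial(g1, g2, g3, g4)
    iseven(g1 + g2 + g3 + g4) || return 0.0   # these coordinates vanish identically
    g5 = g1 == g2 ? 1 : 2
    return edgeterm(x[1], g1) * edgeterm(x[2], g2) * edgeterm(x[3], g3) *
           edgeterm(x[4], g4) * edgeterm(x[5], g5)
end

err = maximum(abs(q[g...] - monomial(g...))
              for g in Iterators.product(1:2, 1:2, 1:2, 1:2))
println(err < 1e-12)   # true
```

The comparison uses an absolute tolerance because half of the Fourier coordinates are zero, where a relative comparison would not be meaningful.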

We refer to [2], [14], and [15] for a detailed introduction to algebraic phylogenetics and the technical background of the computations performed to derive these invariants.

Notation

To formalize the components of our phylogenetic models, we employ a consistent labeling scheme for the underlying trees and their corresponding transition matrices:

  • We name leaves from $1$ to $n$, starting from the top-left of the tree and proceeding counter-clockwise.
  • The transition matrix associated to the pendant edge connecting leaf $i$ is denoted by $M_i$.
  • The root of the tree is assumed to be the interior vertex that is the parent of leaf $1$.
  • Interior edges are labeled with transition matrices $M_i$, where $i$ ranges from $n+1$ onwards.
[Figure: Example of labeling notation for a 4-leaf binary tree.]

Model identifiers

Each phylogenetic model in the database is uniquely identified by an id that captures its structural and evolutionary properties.
This identifier follows the format L-C-maxC-minC-HL-CE-idx-M; for example, 4-0-0-0-0-1-0-CFN.
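A minimal sketch of how such an identifier could be split into its eight fields in plain Julia is shown below. The function parse_model_id is hypothetical and not part of OSCAR. The field names are copied from the format string without interpreting their meaning, and treating the first seven fields as integers is an assumption based on the identifiers shown on this page.

```julia
# Hypothetical helper: split an id of the form L-C-maxC-minC-HL-CE-idx-M into named fields.
function parse_model_id(id::AbstractString)
    fields = (:L, :C, :maxC, :minC, :HL, :CE, :idx, :M)
    parts = split(id, '-')
    length(parts) == length(fields) ||
        throw(ArgumentError("expected $(length(fields)) fields, got $(length(parts))"))
    # Assumption: all fields except the last one (the evolutionary model) are integers.
    nums = parse.(Int, parts[1:end-1])
    return NamedTuple{fields}((nums..., String(parts[end])))
end

id = parse_model_id("4-0-0-0-0-1-0-CFN")
println(id.L, " ", id.M)   # prints "4 CFN"
```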

What you can find in this website

This website is organized into a main database index and individual model pages:

Algebraic invariants of phylogenetic models

The numerical invariants presented in the main database are defined as follows:

  • Dimension: The algebraic dimension of the model. This corresponds to the minimum number of parameters needed to specify the model plus one. After slicing the variety to consider only points whose coordinates sum to 1, the dimension drops by one, yielding exactly the number of parameters.
  • Degree: The degree of the model. Algebraically, this is the number of points in the intersection of the model with a generic linear subspace of complementary dimension.
  • No. of probability coordinates: The number of equivalence classes of probability coordinates.
  • No. of Fourier coordinates: The number of equivalence classes of Fourier coordinates (excluding class 0). This is also the dimension of the smallest linear space containing the model.
  • Dimension of the Singular Locus: The dimension of the set of singular points of the model. The singular locus is critical for understanding the geometry of the model and computing the ML degree.
  • Degree of the Singular Locus: The algebraic degree of the set of singular points of the model.
  • ML Degree: The maximum likelihood degree of the model, that is, the number of complex critical points of the likelihood function. It measures the algebraic complexity of maximum likelihood estimation, for instance when hill-climbing algorithms are used to search for global maxima.
  • ED Degree: The Euclidean distance degree of the model. Similar to the ML degree, it measures the algebraic complexity of the nearest-point problem on the variety with respect to Euclidean distance. ML and ED degrees have been extracted from [9].
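For group-based models, the dimension can be cross-checked from the monomial Fourier parametrization, since the dimension of the affine cone over a toric variety equals the rank of the exponent matrix of its monomial map. The sketch below does this in plain Julia for the CFN model on a 4-leaf tree with split 12|34. It assumes two homogeneous Fourier parameters per edge and the edge ordering 1..4 pendant, 5 interior; the name exponent_vector is illustrative and not an OSCAR function.

```julia
using LinearAlgebra

# Leaf labels (0 or 1) of the Fourier coordinates that do not vanish identically:
# those whose entries sum to an even number.
labels = filter(g -> iseven(sum(g)),
                vec(collect(Iterators.product(0:1, 0:1, 0:1, 0:1))))

# Exponent vector of one coordinate: for each of the 5 edges, the pair
# (exponent of the trivial parameter, exponent of the nontrivial parameter).
function exponent_vector(g)
    g5 = (g[1] + g[2]) % 2                      # label carried by the interior edge
    edge_labels = (g[1], g[2], g[3], g[4], g5)
    return reduce(vcat, [[1 - ge, ge] for ge in edge_labels])
end

A = reduce(hcat, map(exponent_vector, labels))

println(size(A))   # (10, 8): 10 Fourier parameters, 8 Fourier coordinates
println(rank(A))   # 6 = 5 edge parameters + 1
```

The rank agrees with the description of the dimension above: one parameter per edge, plus one.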

Additionally, the individual model pages provide the following invariants:

  • Probability parametrization: The explicit mapping from the model's parameter space to its joint leaf probabilities. Each variable $p_{i_1, \ldots, i_n}$ represents the probability of observing the nucleotide pattern $(i_1, \ldots, i_n)$, with one representative pattern provided for each equivalence class.
  • Fourier parametrization: The monomial mapping from the Fourier parameters to the Fourier coordinates. The list contains one representative for each equivalence class.
  • Equivalence classes of probability parametrization: A partition of all possible nucleotide patterns at the leaves into classes that are algebraically equivalent under the probability parametrization.
  • Equivalence classes of Fourier parametrization: A partition of the Fourier coordinates into classes such that coordinates within each class share the same monomial parametrization.
  • Minimal generating set of the vanishing ideal: A generating set of smallest size for the vanishing ideal, that is, polynomials that together define the algebraic variety representing the phylogenetic model.
  • Gröbner basis of the vanishing ideal: A Gröbner basis for the vanishing ideal, calculated with respect to a specific term order.
  • Probability $\to$ Fourier: The explicit linear transformation used to convert the joint probability distribution into the Fourier coordinate system.
  • Fourier $\to$ Probability: The inverse linear transformation used to recover standard probability coordinates from Fourier coordinates.
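To illustrate what the equivalence classes of probability coordinates look like, the sketch below groups the leaf patterns of the CFN model on a 4-leaf tree with split 12|34 by the exact value of their probability at one rational parameter point. This is a heuristic stand-in for the symbolic computation: the parameter values are arbitrary assumptions, and a special choice of parameters could merge classes that are distinct for generic ones.

```julia
# Exact rational flip probabilities for edges 1..4 (pendant) and 5 (interior).
t = [1//3, 1//5, 1//7, 1//11, 1//13]
M = [[1-s s; s 1-s] for s in t]

# Probability coordinates with uniform root distribution, computed exactly.
p = [sum(1//2 * M[1][u, i] * M[2][u, j] * M[5][u, v] * M[3][v, k] * M[4][v, l]
         for u in 1:2, v in 1:2)
     for i in 1:2, j in 1:2, k in 1:2, l in 1:2]

# Group the 16 leaf patterns by the value of their probability.
classes = Dict{Rational{Int}, Vector{NTuple{4, Int}}}()
for idx in CartesianIndices(p)
    push!(get!(classes, p[idx], NTuple{4, Int}[]), Tuple(idx))
end

for (value, patterns) in classes
    println(value, " => ", patterns)
end

# Swapping the two states never changes a CFN probability, so every pattern
# shares its class with its state-swapped copy and there are at most 8 classes.
swap(pattern) = map(s -> 3 - s, pattern)
println(length(classes))
```

For generic parameters one expects each class to consist of exactly one pattern together with its state-swapped copy.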

Phylogenetics in OSCAR

All invariants stored in this database were computed using the OSCAR computer algebra system [13]. In addition, OSCAR can load the downloaded .mrdi files directly, giving access to the pre-computed algebraic objects (parametrizations, ideals, etc.) in a fully interactive Julia session.

Setting up OSCAR

The algebraic phylogenetics functionality used to generate and load the data on this website is currently available in a development branch of OSCAR that has not yet been merged into the official release. To use it, you need to install OSCAR directly from that branch.

Option A: Install from the development branch

The simplest approach is to add the development version of OSCAR directly to your Julia installation. Open a Julia session and run:

julia> using Pkg
julia> Pkg.add(url="https://github.com/marinagarrote/Oscar.jl", rev="load_phylogenetics_data-dev")
julia> using Oscar

⚠ Note: this installs the development branch into your global Julia environment, replacing any previously installed version of OSCAR. To revert to the latest official release afterwards, run Pkg.add("Oscar") or Pkg.free("Oscar") in a Julia session. If you prefer not to affect your global setup, use Option B instead.

Option B: Use an isolated Julia environment (recommended)

For reproducibility and to avoid conflicts with your main Julia installation, we recommend setting up an isolated environment using Julia's built-in package manager. This approach pins the exact version of OSCAR needed and keeps it separate from your other projects.

Create a new directory for your project and activate it:

# In your terminal, navigate to (or create) your project folder:
$ mkdir my-phylogenetics-project && cd my-phylogenetics-project

# Start Julia and activate the local environment:
$ julia

julia> using Pkg
julia> Pkg.activate(".")           # Creates a local Project.toml
julia> Pkg.add(url="https://github.com/marinagarrote/Oscar.jl", rev="load_phylogenetics_data-dev")
julia> Pkg.instantiate()           # Resolves and downloads all dependencies
julia> using Oscar

After this first setup, simply start Julia with julia --project=. from the same directory (or run Pkg.activate(".") at the start of each session) to automatically use the pinned version of OSCAR without reinstalling.

Option C: Use our Pluto notebook (quickest start)

Pluto.jl is a reactive Julia notebook environment that manages its own package environment automatically. We provide a ready-to-use Pluto notebook that sets up OSCAR and includes worked examples for loading and querying the database.

To get started, download the notebook, then open it in Pluto:

julia> using Pkg; Pkg.add("Pluto")
julia> import Pluto; Pluto.run()

Then open the downloaded .jl file from within the Pluto browser interface. Pluto will automatically install the correct version of OSCAR and all dependencies — no manual environment setup required.

Loading data in OSCAR

The data available for download on this site uses the mrdi file format, a JSON-based format developed for the long-term storage of algebraic data that preserves the necessary algebraic context [8]. If you have downloaded a .mrdi file from this website, you can load it into your OSCAR session using the built-in load function. This returns a SmallPhylogeneticModel or SmallGroupBasedModel object (see the OSCAR documentation), which provides access to all pre-computed algebraic properties:

julia> using Oscar

julia> model = load("SmallGroupBasedModel_4-0-0-0-0-1-0-CFN.mrdi")
Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN

julia> graph(model)
Phylogenetic tree with QQFieldElem type coefficients

julia> # Extract the interactive model object:
julia> # (Use phylogenetic_model for general, non-group-based models)
julia> PM = group_based_phylogenetic_model(model)
Group-based phylogenetic model on a tree with 4 leaves and 5 edges 
  with root distribution [1//2, 1//2], 
    transition matrices of the form 
   [:a :b;
    :b :a]
  and fourier parameters of the form [:x, :y].

julia> vanishing_ideal(model)
Ideal generated by
  -q[2,2,2,2]*q[1,1,1,1] + q[2,2,1,1]*q[1,1,2,2]
  -q[2,1,2,1]*q[1,2,1,2] + q[2,1,1,2]*q[1,2,2,1]

Querying directly from OscarDB

Alternatively, you can query the OscarDB directly without manually downloading files. Currently, only trees and networks under the CFN model are available in OscarDB; support for other models is planned for future updates.

julia> using Oscar;

julia> db = Oscar.OscarDB.get_db();

julia> sgbms_collection = db["AlgebraicStatistics.SmallGroupBasedModels"]
Oscar.OscarDB.Collection: AlgebraicStatistics.SmallGroupBasedModels

julia> query = Oscar.OscarDB.find(sgbms_collection, Dict("data.model_type" => "CFN", "data.level" => 0, "data.n_leaves" => 4));

julia> SGBM = collect(query)
2-element Vector{Any}:
 Small group-based phylogenetic model 4-0-0-0-0-0-0-CFN
 Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN

julia> query = Oscar.OscarDB.find(sgbms_collection, Dict("data.model_encoding" => "4-0-0-0-0-1-0-CFN"));

julia> SGBM = first(query)
Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN

Computing directly in OSCAR

While OSCAR provides powerful tools to compute these algebraic invariants from scratch, we do not encourage re-computing them for models that are already in the database. The implicitization algorithms and algebraic computations involved can be extremely resource-intensive and slow, which is why we offer them pre-computed here!

However, if you want to explore models that are not yet part of this database, or apply different algebraic techniques, you can compute them directly in OSCAR. Please refer to the Algebraic Phylogenetics section of the OSCAR documentation for details on how to use the relevant functionality.

References

  1. Allman, E. S., and Rhodes, J. A. (2004). "Quartets and Parameter Recovery for the General Markov Model of Sequence Mutation". Applied Mathematics Research eXpress, 2004(4), 107–131. https://doi.org/10.1155/S1687120004020283.
  2. Allman, E. S., and Rhodes, J. A. (2005). The Mathematics of Phylogenetics. University of Alaska Fairbanks.
  3. Allman, E. S., and Rhodes, J. A. (2008). "Phylogenetic ideals and varieties for the general Markov model". Advances in Applied Mathematics, 40(2), 127–148. https://doi.org/10.1016/j.aam.2006.09.002.
  4. Bacher, T., Garrote-López, M., Görgen, C., and Neubert, M. J. (2026). "Making mathematical online resources FAIR: at the example of small phylogenetic trees". Notices of the American Mathematical Society, May (2026), pp 377–387. https://www.ams.org/journals/notices/202605/rnoti-p377.pdf.
  5. Boege, T., Della Vecchia, A., Garrote-López, M., and Hollering, B. (2026). "Algebraic statistics in OSCAR". arXiv preprint. https://arxiv.org/abs/2601.15807.
  6. Casanellas, M., Garcia, L. D., and Sullivant, S. (2005). "Catalog of Small Trees". In: Pachter, L. and Sturmfels, B. (eds.) Algebraic Statistics for Computational Biology, pp. 291–304. Cambridge University Press.
  7. Cavender, J. A., and Felsenstein, J. (1987). "Invariants of phylogenies in a simple case with discrete states". Journal of Classification, 4(1), 57–71. https://doi.org/10.1007/BF01890075.
  8. Della Vecchia, A., Joswig, M., and Lorenz, B. (2024). "A FAIR file format for mathematical software". In: Buzzard, K. et al. (eds.) Mathematical software – ICMS 2024, pp. 234–244. No. 14749 in Lecture Notes in Computer Science, Springer. https://doi.org/10.1007/978-3-031-64529-7_25.
  9. García-Puente, L. D., Garrote-López, M., and Shehu, E. (2023). "Computing algebraic degrees of phylogenetic varieties". Algebraic Statistics, 14(2), 215–236. https://doi.org/10.2140/astat.2023.14.215.
  10. Jukes, T. H., and Cantor, C. R. (1969). "Evolution of protein molecules". Mammalian protein metabolism, 3, 21–132. https://doi.org/10.1016/B978-1-4832-3211-9.50009-7.
  11. Kimura, M. (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". Journal of Molecular Evolution, 16(2), 111–120. https://doi.org/10.1007/BF01731581.
  12. Kimura, M. (1981). "Estimation of evolutionary distances between homologous nucleotide sequences". PNAS, 78(1), 454–458. https://doi.org/10.1073/pnas.78.1.454.
  13. The OSCAR Team. (2024). OSCAR – Open Source Computer Algebra Research system, Version 1.8.0. https://www.oscar-system.org.
  14. Semple, C., and Steel, M. A. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and its Applications, Vol. 24. Oxford University Press.
  15. Sullivant, S. (2018). Algebraic Statistics. Graduate Studies in Mathematics, American Mathematical Society, Providence, RI.