Introduction
Algebraic phylogenetics studies Markov models of sequence evolution through the algebraic varieties they define. Understanding the geometric properties of these varieties is essential for robust statistical inference and model selection. However, this requires computing algebraic invariants such as the vanishing ideal of the variety, its maximum likelihood (ML) degree, and its Euclidean distance (ED) degree.
Deriving these invariants for phylogenetic trees and networks is highly resource-intensive. The original 2007 version of this database, discussed in the chapter "Catalog of Small Trees" (Casanellas, Garcia, and Sullivant, 2005), addressed this by compiling tables of pre-computed invariants for trees with up to five species. The current website modernizes and extends that legacy, incorporating novel data structures, extended support for phylogenetic networks, and natively integrated data via the mrdi file format.
The computations underlying all models presented here are performed using the OSCAR computer algebra system (The OSCAR Team, 2024). For further detail, please refer to the documentation for the phylogenetics section and the accompanying paper. For a detailed discussion of the architecture and design of this website, please refer to our associated paper, "Making mathematical online resources FAIR: at the example of small phylogenetic trees" (Bacher et al., 2026).
Phylogenetic models
Evolutionary processes along a phylogeny are modeled as Markov processes on a phylogenetic tree or network. In this framework, the tree or network represents the evolutionary history: nodes correspond to species (extant at the leaves and ancestral at internal nodes), while edges represent the evolutionary lineages connecting them. Mutations along each edge are governed by a Markov transition matrix acting on a discrete state space, typically the four nucleotide bases {A, C, G, T}.
A phylogenetic model is uniquely identified by its underlying tree or network graph, together with an evolutionary model that imposes specific symmetry constraints on these transition matrices. The models considered here are Cavender-Farris-Neyman (CFN; Cavender and Felsenstein, 1987), Jukes-Cantor (JC; Jukes and Cantor, 1969), Kimura 2-parameter (K2P; Kimura, 1980), Kimura 3-parameter (K3P; Kimura, 1981), and General Markov (GM; Allman and Rhodes, 2004, 2008).
The parametrization of a phylogenetic model is defined by the entries of the transition matrices associated with each edge and the state distribution at the root node. In standard probability coordinates, the joint probability of observing a nucleotide pattern at the leaves is obtained by marginalizing over these transition probabilities across the unobserved internal nodes, yielding a polynomial map from the parameter space to the probability distribution.
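As a sketch in standard notation (the symbols below are introduced for illustration and are not taken from the website), consider a tree $T$ with leaves $1, \ldots, n$, root $r$, root distribution $\pi$, and a transition matrix $M_e$ on each edge $e = (u \to v)$ directed away from the root. The marginalization described above can then be written as

$$
p_{i_1, \ldots, i_n} \;=\; \sum_{(j_v)_{v \in \mathrm{Int}(T)}} \pi_{j_r} \prod_{e = (u \to v) \in E(T)} M_e(j_u, j_v),
$$

where $\mathrm{Int}(T)$ is the set of interior nodes (including the root), the sum ranges over all assignments of states to the interior nodes, and $j_v = i_v$ whenever $v$ is the leaf labeled $v$. Every coordinate $p_{i_1, \ldots, i_n}$ is a polynomial in the entries of $\pi$ and of the matrices $M_e$.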
Group-based models (such as CFN, JC, K2P, and K3P) impose specific group symmetries on the transition matrices. As a result, all transition matrices of the model are simultaneously diagonalizable, and their eigenvalues, known as Fourier parameters, replace the matrix entries as model parameters. A discrete Fourier transform maps the standard probability coordinates linearly to Fourier coordinates. In this new coordinate system, the parametrization becomes a monomial map. This monomial structure identifies the group-based model as a toric variety, which substantially reduces the cost of algebraic computations such as finding the vanishing ideal.
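The monomial structure can be checked numerically on a very small case. The following plain-Julia sketch (no OSCAR required) uses the CFN model on a star tree with three leaves; the numeric values, the uniform root distribution, and all function names are assumptions chosen for illustration and are not part of the database or of OSCAR.

```julia
# Hypothetical off-diagonal entries b_k for the three pendant edges; a_k = 1 - b_k.
b = [0.1, 0.2, 0.3]
M = [[(1 - bk) bk; bk (1 - bk)] for bk in b]   # CFN matrices of the form [a b; b a]
root = [0.5, 0.5]                               # uniform root distribution (assumed)

# Joint leaf probabilities: marginalize over the state j of the single interior node.
prob(i1, i2, i3) = sum(root[j] * M[1][j, i1] * M[2][j, i2] * M[3][j, i3] for j in 1:2)
p = [prob(i1, i2, i3) for i1 in 1:2, i2 in 1:2, i3 in 1:2]

# Discrete Fourier transform over (Z/2Z)^3. States and group elements are
# indexed by 1 and 2, so index 1 plays the role of the identity.
chi(g, i) = (-1)^((g - 1) * (i - 1))
fourier(g1, g2, g3) = sum(
    chi(g1, i1) * chi(g2, i2) * chi(g3, i3) * p[i1, i2, i3] for i1 in 1:2, i2 in 1:2, i3 in 1:2
)
q = [fourier(g1, g2, g3) for g1 in 1:2, g2 in 1:2, g3 in 1:2]

# Fourier parameters of edge k: the eigenvalues a_k + b_k = 1 and a_k - b_k.
y = [1 - 2 * bk for bk in b]
fourier_param(k, g) = g == 1 ? 1.0 : y[k]

# Each Fourier coordinate equals a monomial in the Fourier parameters,
# or zero when the group elements do not sum to the identity.
for g in Iterators.product(1:2, 1:2, 1:2)
    monomial = prod(fourier_param(k, gk) for (k, gk) in enumerate(g))
    expected = iseven(sum(g) - 3) ? monomial : 0.0
    @assert isapprox(q[g...], expected; atol = 1e-12)
end
```

The loop completes without an assertion error, which is consistent with the claim that the Fourier coordinates are monomials in the Fourier parameters for this model.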
We refer to the works listed in the References section for a detailed introduction to algebraic phylogenetics and the technical background of the computations performed to derive these invariants.
Notation
To formalize the components of our phylogenetic models, we employ a consistent labeling scheme for the underlying trees and their corresponding transition matrices:
- Leaves are labeled from $1$ to $n$, starting from the top-left of the tree and proceeding counter-clockwise.
- The transition matrix associated with the pendant edge connecting leaf $i$ is denoted by $M_i$.
- The root of the tree is assumed to be the interior vertex that is the parent of leaf $1$.
- Interior edges are labeled with transition matrices $M_i$, where $i$ ranges from $n+1$ onwards.
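As an illustration of this labeling, consider a tree with $n = 4$ leaves in which leaves $1$ and $2$ share a parent, and leaves $3$ and $4$ share a parent (this shape, the root distribution symbol $\pi$, and the node names $r$ and $u$ are assumptions made for the example). The root $r$ is the parent of leaf $1$, the pendant edges carry $M_1, \ldots, M_4$, and the single interior edge from $r$ to the other interior node $u$ carries $M_5$. Under these assumptions the joint leaf probabilities read

$$
p_{i_1, i_2, i_3, i_4} \;=\; \sum_{j_r,\, j_u} \pi_{j_r}\, M_1(j_r, i_1)\, M_2(j_r, i_2)\, M_5(j_r, j_u)\, M_3(j_u, i_3)\, M_4(j_u, i_4),
$$

where $j_r$ and $j_u$ range over the states of the two interior nodes.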
Model identifiers
Each phylogenetic model in the database is uniquely identified by an id that captures its structural and evolutionary properties. This identifier follows the format L-C-maxC-minC-HL-CE-idx-M, where:
- L: Number of leaves of the graph.
- C: Number of cycles of the graph.
- maxC/minC: Size of the largest and smallest cycles of the graph, respectively.
- HL: Hybrid level, representing the maximum level of hybrid nodes in the graph.
- CE: Number of cut edges (edges whose removal increases the number of connected components).
- idx: A unique index assigned to distinguish between different topologies sharing the same structural parameters.
- M: The evolutionary model (e.g., CFN, JC, K2P, K3P, or GM).
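To make the identifier format concrete, here is a minimal plain-Julia sketch that splits an id into its eight fields. The helper `parse_model_id` and its field names are hypothetical and are not part of OSCAR; the sketch assumes that the model name itself contains no hyphen, which holds for CFN, JC, K2P, K3P, and GM.

```julia
# Hypothetical helper (not an OSCAR function): split an identifier of the form
# L-C-maxC-minC-HL-CE-idx-M into a named tuple.
function parse_model_id(id::AbstractString)
    parts = split(id, '-')
    length(parts) == 8 || throw(ArgumentError("expected 8 fields, got $(length(parts))"))
    numbers = parse.(Int, parts[1:7])   # the first seven fields are integers
    return (
        leaves = numbers[1],
        cycles = numbers[2],
        max_cycle = numbers[3],
        min_cycle = numbers[4],
        hybrid_level = numbers[5],
        cut_edges = numbers[6],
        index = numbers[7],
        model = String(parts[8]),       # evolutionary model, e.g. "CFN"
    )
end

info = parse_model_id("4-0-0-0-0-1-0-CFN")
println(info.leaves, " leaves, ", info.cycles, " cycles, model ", info.model)
```

Returning a named tuple keeps the sketch dependency-free while still allowing access by field name, for example `info.cut_edges`.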
What you can find on this website
This website is organized into a main database index and individual model pages:
- Main Database: The landing page presents a comprehensive table of all computed phylogenetic models on small trees and networks. It summarizes their fundamental algebraic invariants, such as degrees, dimensions, and the number of coordinates.
- Individual Model Pages: By clicking on a specific model name, you can access its detailed parameters. These pages provide explicit parametrizations in both probability and Fourier coordinates (for group-based models), algebraic properties of the vanishing ideal (including its minimal generating set and Gröbner basis), and linear coordinate transformations.
Algebraic invariants of phylogenetic models
The numerical invariants presented in the main database are defined as follows:
Additionally, the individual model pages provide the following invariants:
| Probability parametrization | The explicit mapping from the model's parameter space to its joint leaf probabilities. Each variable $p_{i_1, \ldots, i_n}$ represents the probability of observing the nucleotide pattern $(i_1, \ldots, i_n)$, with one representative pattern provided for each equivalence class. |
| Fourier parametrization | The monomial mapping from the Fourier parameters to the Fourier coordinates. The list contains one representative for each equivalence class. |
| Equivalence classes of probability parametrization | A partition of all possible nucleotide patterns at the leaves into classes that are algebraically equivalent under the probability parametrization. |
| Equivalence classes of Fourier parametrization | A partition of the Fourier coordinates into classes such that coordinates within each class share the same monomial parametrization. |
| Minimal generating set of the vanishing ideal | A minimal set of polynomials generating the vanishing ideal of the algebraic variety that represents the phylogenetic model. |
| Gröbner basis of the vanishing ideal | A Gröbner basis for the vanishing ideal, calculated with respect to a specific term order. |
| Probability $\to$ Fourier | The explicit linear transformation used to convert the joint probability distribution into the Fourier coordinate system. |
| Fourier $\to$ Probability | The inverse linear transformation used to recover standard probability coordinates from Fourier coordinates. |
Phylogenetics in OSCAR
All invariants stored in this database were computed using the OSCAR computer algebra system (The OSCAR Team, 2024). In addition, OSCAR can load the downloaded .mrdi files directly, giving access to the pre-computed algebraic objects (parametrizations, ideals, etc.) in a fully interactive Julia session.
Setting up OSCAR
The algebraic phylogenetics functionality used to generate and load the data on this website is currently available in a development branch of OSCAR that has not yet been merged into the official release. To use it, you need to install OSCAR directly from that branch.
Option A: Install from the development branch
The simplest approach is to add the development version of OSCAR directly to your Julia installation. Open a Julia session and run:
julia> using Pkg
julia> Pkg.add(url="https://github.com/marinagarrote/Oscar.jl", rev="load_phylogenetics_data-dev")
julia> using Oscar
⚠ Note: this installs the development branch into your global Julia environment, replacing any previously installed version of OSCAR. To revert to the latest official release afterwards, run Pkg.add("Oscar") or Pkg.free("Oscar") in a Julia session. If you prefer not to affect your global setup, use Option B instead.
Option B: Use an isolated Julia environment (recommended)
For reproducibility and to avoid conflicts with your main Julia installation, we recommend setting up an isolated environment using Julia's built-in package manager. This approach pins the exact version of OSCAR needed and keeps it separate from your other projects.
Create a new directory for your project and activate it:
# In your terminal, navigate to (or create) your project folder:
$ mkdir my-phylogenetics-project && cd my-phylogenetics-project
# Start Julia and activate the local environment:
$ julia
julia> using Pkg
julia> Pkg.activate(".") # Creates a local Project.toml
julia> Pkg.add(url="https://github.com/marinagarrote/Oscar.jl", rev="load_phylogenetics_data-dev")
julia> Pkg.instantiate() # Resolves and downloads all dependencies
julia> using Oscar
After this first setup, simply start Julia with julia --project=. from the same directory (or run Pkg.activate(".") at the start of each session) to automatically use the pinned version of OSCAR without reinstalling.
Option C: Use our Pluto notebook (quickest start)
Pluto.jl is a reactive Julia notebook environment that manages its own package environment automatically. We provide a ready-to-use Pluto notebook that sets up OSCAR and includes worked examples for loading and querying the database.
To get started, download the notebook, then open it in Pluto:
julia> using Pkg; Pkg.add("Pluto")
julia> import Pluto; Pluto.run()
Then open the downloaded .jl file from within the Pluto browser interface. Pluto will automatically install the correct version of OSCAR and all dependencies, so no manual environment setup is required.
Loading data in OSCAR
The data available for download on this site uses the mrdi file format, a JSON-based format developed for the long-term storage of algebraic data that preserves the necessary algebraic context (Della Vecchia, Joswig, and Lorenz, 2024). If you have downloaded a .mrdi file from this website, you can load it into your OSCAR session using the built-in load function. This returns a SmallPhylogeneticModel or SmallGroupBasedModel object (see the documentation), which provides access to all pre-computed algebraic properties:
julia> using Oscar
julia> model = load("SmallGroupBasedModel_4-0-0-0-0-1-0-CFN.mrdi")
Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN
julia> graph(model)
Phylogenetic tree with QQFieldElem type coefficients
julia> # Extract the interactive model object:
julia> # (Use phylogenetic_model for general, non-group-based models)
julia> PM = group_based_phylogenetic_model(model)
Group-based phylogenetic model on a tree with 4 leaves and 5 edges
with root distribution [1//2, 1//2],
transition matrices of the form
[:a :b;
:b :a]
and fourier parameters of the form [:x, :y].
julia> vanishing_ideal(model)
Ideal generated by
-q[2,2,2,2]*q[1,1,1,1] + q[2,2,1,1]*q[1,1,2,2]
-q[2,1,2,1]*q[1,2,1,2] + q[2,1,1,2]*q[1,2,2,1]
Querying directly from OscarDB
Alternatively, you can query the OscarDB directly without manually downloading files. Currently, only trees and networks under the CFN model are available in OscarDB; support for other models is planned for future updates.
julia> using Oscar;
julia> db = Oscar.OscarDB.get_db();
julia> sgbms_collection = db["AlgebraicStatistics.SmallGroupBasedModels"]
Oscar.OscarDB.Collection: AlgebraicStatistics.SmallGroupBasedModels
julia> query = Oscar.OscarDB.find(sgbms_collection, Dict("data.model_type" => "CFN", "data.level" => 0, "data.n_leaves" => 4));
julia> SGBM = collect(query)
2-element Vector{Any}:
Small group-based phylogenetic model 4-0-0-0-0-0-0-CFN
Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN
julia> query = Oscar.OscarDB.find(sgbms_collection, Dict("data.model_encoding" => "4-0-0-0-0-1-0-CFN"));
julia> SGBM = first(query)
Small group-based phylogenetic model 4-0-0-0-0-1-0-CFN
Computing directly in OSCAR
While OSCAR provides powerful tools to compute these algebraic invariants from scratch, we do not encourage re-computing them for models that are already in the database. The implicitization algorithms and algebraic computations involved can be extremely resource-intensive and slow, which is why we offer them pre-computed here!
However, if you want to explore models that are not yet part of this database, or apply different algebraic techniques, you can compute them directly in OSCAR. Please refer to the Algebraic Phylogenetics documentation in OSCAR for details on how to use the relevant functions.
References
- Allman, E. S., and Rhodes, J. A. (2004). "Quartets and Parameter Recovery for the General Markov Model of Sequence Mutation". Applied Mathematics Research eXpress, 2004(4), 107–131. https://doi.org/10.1155/S1687120004020283.
- Allman, E. S., and Rhodes, J. A. (2005). The Mathematics of Phylogenetics. University of Alaska Fairbanks.
- Allman, E. S., and Rhodes, J. A. (2008). "Phylogenetic ideals and varieties for the general Markov model". Advances in Applied Mathematics, 40(2), 127–148. https://doi.org/10.1016/j.aam.2006.09.002.
- Bacher, T., Garrote-López, M., Görgen, C., and Neubert, M. J. (2026). "Making mathematical online resources FAIR: at the example of small phylogenetic trees". Notices of the American Mathematical Society, May 2026, pp. 377–387. https://www.ams.org/journals/notices/202605/rnoti-p377.pdf.
- Boege, T., Della Vecchia, A., Garrote-López, M., and Hollering, B. (2026). "Algebraic statistics in OSCAR". arXiv preprint. https://arxiv.org/abs/2601.15807.
- Casanellas, M., Garcia, L. D., and Sullivant, S. (2005). "Catalog of Small Trees". In: Pachter, L. and Sturmfels, B. (eds.) Algebraic Statistics for Computational Biology, pp. 291–304. Cambridge University Press.
- Cavender, J. A., and Felsenstein, J. (1987). "Invariants of phylogenies in a simple case with discrete states". Journal of Classification, 4(1), 57–71. https://doi.org/10.1007/BF01890075.
- Della Vecchia, A., Joswig, M., and Lorenz, B. (2024). "A FAIR file format for mathematical software". In: Buzzard, K. et al. (eds.) Mathematical software – ICMS 2024, pp. 234–244. No. 14749 in Lecture Notes in Computer Science, Springer. https://doi.org/10.1007/978-3-031-64529-7_25.
- García-Puente, L. D., Garrote-López, M., and Shehu, E. (2023). "Computing algebraic degrees of phylogenetic varieties". Algebraic Statistics, 14(2), 215–236. https://doi.org/10.2140/astat.2023.14.215.
- Jukes, T. H., and Cantor, C. R. (1969). "Evolution of protein molecules". Mammalian protein metabolism, 3, 21–132. https://doi.org/10.1016/B978-1-4832-3211-9.50009-7.
- Kimura, M. (1980). "A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences". Journal of Molecular Evolution, 16(2), 111–120. https://doi.org/10.1007/BF01731581.
- Kimura, M. (1981). "Estimation of evolutionary distances between homologous nucleotide sequences". PNAS, 78(1), 454–458. https://doi.org/10.1073/pnas.78.1.454.
- The OSCAR Team. (2024). OSCAR – Open Source Computer Algebra Research system, Version 1.8.0. https://www.oscar-system.org.
- Semple, C., and Steel, M. A. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and its Applications, Vol. 24. Oxford University Press.
- Sullivant, S. (2018). Algebraic Statistics. Graduate Studies in Mathematics, American Mathematical Society, Providence, RI.