anjl is a Python package providing fast implementations of the neighbour-joining algorithm of Saitou and Nei and some associated utilities.

Installation¶

anjl is available from PyPI and can be installed via pip, e.g.:

%pip install anjl

Demo¶

import anjl

anjl requires a numpy array containing a distance matrix as input, either in square or condensed form. Optionally, anjl can also make use of a pandas DataFrame containing data about the original observations.

Load some demo data to try it out:

D, leaf_data = anjl.data.mosquitoes()

D is a square distance matrix containing genetic distances between mosquitoes:

array([[ 0., 69., 74., ..., 68., 65., 65.],
       [69.,  0., 75., ..., 73., 76., 70.],
       [74., 75.,  0., ..., 56., 69., 69.],
       ...,
       [68., 73., 56., ...,  0., 59., 63.],
       [65., 76., 69., ..., 59.,  0., 68.],
       [65., 70., 69., ..., 63., 68.,  0.]], dtype=float32)

D.shape

(181, 181)

leaf_data contains additional data about the mosquitoes:

leaf_data.head()

Compute a neighbour-joining tree from the given distance matrix, using the rapid algorithm:

Z = anjl.dynamic_nj(D)

The return value Z is a numpy array containing an encoding of the resulting tree.

Create an interactive plot to visualise the tree:

anjl.plot(
    Z,
    leaf_data=leaf_data,
    color="taxon",
    hover_name="sample_id",
    hover_data=["country", "location", "year", "month"],
)

This plot is rendered using plotly and is interactive - try hovering over leaf nodes for more information, pan and zoom, or click on the legend to hide or show items.

Benchmark¶

Here is a performance comparison between implementations of neighbour-joining in Python.

import pandas as pd
import plotly.express as px

df_results = pd.read_csv("../benchmark/results.csv")
df_plot = (
    df_results.groupby(["implementation", "size"])
    .agg(
        time_mean=("time", "mean"),
        time_std=("time", "std"),
    )
    .reset_index()
)
fig = px.line(
    df_plot,
    x="size",
    y="time_mean",
    error_y="time_std",
    width=750,
    height=500,
    color="implementation",
    category_orders=dict(
        implementation=[
            "BioPython 1.84",
            "scikit-bio 0.6.2",
            "biotite 1.0.1",
            "anjl 1.4.0 canonical_nj",
            "anjl 1.4.0 canonical_nj parallel",
            "anjl 1.4.0 rapid_nj",
            "anjl 1.4.0 dynamic_nj",
            "anjl 1.4.0 dynamic_nj parallel",
        ]
    ),
)
fig.update_layout(
    title="Neighbour-joining performance comparison",
    xaxis_title="No. observations (distance matrix size)",
    yaxis_title="Time (s)",
    legend_title="Implementation",
)
fig

This benchmark was run on a laptop with an Intel Xeon E3-1505M CPU with 4 cores and 2 threads per core, using data from the Malaria Vector Genome Observatory. The source data are available here.

API¶

Dynamic neighbour-joining¶

anjl.dynamic_nj?

Signature:
anjl.dynamic_nj(
    D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
    disallow_negative_distances: bool = True,
    progress: Optional[Callable] = None,
    progress_options: collections.abc.Mapping = {},
    copy: bool | None = True,
    parallel: bool = True,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using the dynamic algorithm of Clausen [1]_.

This is the fastest and most scalable implementation currently
available. The dynamic algorithm exploits the fact that the neighbour-
joining criterion Q is gradually weakened with each iteration, and
therefore the minimum value of Q found initially within a given row
provides a lower bound for all values within the same row in subsequent
iterations. This allows many rows of the distance matrix to be skipped
in each iteration.

Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
    A distance matrix, either in square form, or condensed upper triangle
    form (e.g., as returned by scipy's pdist function). To minimise memory
    usage, provide a condensed distance matrix, and also pass copy=False,
    although be aware that the input data will be overwritten during tree
    construction.
disallow_negative_distances : bool, optional, default: True
    If True, set any negative distances to zero.
progress : Callable or None, optional
    A function which will be used to wrap the main loop iterator to
    provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
    Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
    Passed through to numpy.array(). For numpy version 2.0 and later, if
    True (default), then the array data is copied. If None, a copy will
    only be made if necessary. If False it raises a ValueError if a copy
    cannot be avoided. If False, please note that the input data will be
    overwritten during tree construction.
parallel : bool, optional, default: True
    If True, attempt to use multiple CPU threads to accelerate the
    computation.

Returns
-------
ndarray[Any, dtype[float32]]
    A neighbour-joining tree encoded as a numpy array. Each row in the
    array contains data for one internal node in the tree, in the order
    in which they were created by the neighbour-joining algorithm.
    Within each row there are five values: left child node identifier,
    right child node identifier, distance to left child, distance to
    right child, total number of leaves. This data structure is similar
    to that returned by scipy's hierarchical clustering functions,
    except that here we have two distance values for each internal node
    rather than one because distances to the children may be different.

References
----------
.. [1] Https://doi.org/10.1093/bioinformatics/btac774.
File:      ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_dynamic.py
Type:      function

Rapid neighbour-joining¶

anjl.rapid_nj?

Signature:
anjl.rapid_nj(
    D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
    disallow_negative_distances: bool = True,
    progress: Optional[Callable] = None,
    progress_options: collections.abc.Mapping = {},
    copy: bool | None = True,
    gc: int | None = 100,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using an implementation based on the rapid
algorithm of Simonsen et al. [1]_.

This implementation builds and maintains a sorted copy of the distance
matrix and uses heuristics to avoid searching pairs that cannot possibly
be neighbours in each iteration.

Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
    A distance matrix, either in square form, or condensed upper triangle
    form (e.g., as returned by scipy's pdist function). To minimise memory
    usage, provide a condensed distance matrix, and also pass copy=False,
    although be aware that the input data will be overwritten during tree
    construction.
disallow_negative_distances : bool, optional, default: True
    If True, set any negative distances to zero.
progress : Callable or None, optional
    A function which will be used to wrap the main loop iterator to
    provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
    Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
    Passed through to numpy.array(). For numpy version 2.0 and later, if
    True (default), then the array data is copied. If None, a copy will
    only be made if necessary. If False it raises a ValueError if a copy
    cannot be avoided. If False, please note that the input data will be
    overwritten during tree construction.
gc : UnionType[int, None], optional, default: 100
    Number of iterations to perform between compacting data structures to
    remove any data corresponding to nodes that have been clustered.

Returns
-------
ndarray[Any, dtype[float32]]
    A neighbour-joining tree encoded as a numpy array. Each row in the
    array contains data for one internal node in the tree, in the order
    in which they were created by the neighbour-joining algorithm.
    Within each row there are five values: left child node identifier,
    right child node identifier, distance to left child, distance to
    right child, total number of leaves. This data structure is similar
    to that returned by scipy's hierarchical clustering functions,
    except that here we have two distance values for each internal node
    rather than one because distances to the children may be different.

Notes
-----
The ordering of the internal nodes may be different between the
canonical and the rapid algorithms, because these algorithms search
the distance matrix in a different order. However, the resulting trees
will be topologically equivalent.

References
----------
.. [1] Https://pure.au.dk/ws/files/19821675/rapidNJ.pdf.
File:      ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_rapid.py
Type:      function

Canonical neighbour-joining¶

anjl.canonical_nj?

Signature:
anjl.canonical_nj(
    D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
    disallow_negative_distances: bool = True,
    progress: Optional[Callable] = None,
    progress_options: collections.abc.Mapping = {},
    copy: bool | None = True,
    parallel: bool = True,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using the canonical algorithm.

This implementation performs a full scan of the distance matrix in each
iteration of the algorithm to find the pair of nearest neighbours. It is
therefore slower and scales with the cube of the number of original
observations in the distance matrix, i.e., O(n^3).

Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
    A distance matrix, either in square form, or condensed upper triangle
    form (e.g., as returned by scipy's pdist function). To minimise memory
    usage, provide a condensed distance matrix, and also pass copy=False,
    although be aware that the input data will be overwritten during tree
    construction.
disallow_negative_distances : bool, optional, default: True
    If True, set any negative distances to zero.
progress : Callable or None, optional
    A function which will be used to wrap the main loop iterator to
    provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
    Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
    Passed through to numpy.array(). For numpy version 2.0 and later, if
    True (default), then the array data is copied. If None, a copy will
    only be made if necessary. If False it raises a ValueError if a copy
    cannot be avoided. If False, please note that the input data will be
    overwritten during tree construction.
parallel : bool, optional, default: True
    If True, attempt to use multiple CPU threads to accelerate the
    computation.

Returns
-------
ndarray[Any, dtype[float32]]
    A neighbour-joining tree encoded as a numpy array. Each row in the
    array contains data for one internal node in the tree, in the order
    in which they were created by the neighbour-joining algorithm.
    Within each row there are five values: left child node identifier,
    right child node identifier, distance to left child, distance to
    right child, total number of leaves. This data structure is similar
    to that returned by scipy's hierarchical clustering functions,
    except that here we have two distance values for each internal node
    rather than one because distances to the children may be different.
File:      ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_canonical.py
Type:      function

Plotting¶

anjl.plot?

Signature:
anjl.plot(
    Z: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]],
    leaf_data: pandas.core.frame.DataFrame | None = None,
    color: str | None = None,
    symbol: str | None = None,
    hover_name: str | None = None,
    hover_data: list[str] | None = None,
    center_x: int | float = 0,
    center_y: int | float = 0,
    arc_start: int | float = 0,
    arc_stop: int | float = 6.283185307179586,
    count_sort: bool = True,
    distance_sort: bool = False,
    line_width: int | float = 1,
    marker_size: int | float = 5,
    internal_marker_size: int | float = 0,
    color_discrete_sequence: list | None = None,
    color_discrete_map: collections.abc.Mapping | None = None,
    category_orders: list | collections.abc.Mapping | None = None,
    leaf_legend: bool = True,
    edge_legend: bool = False,
    default_line_color: str = 'gray',
    na_color: str = 'black',
    width: int | float | None = 700,
    height: int | float | None = 600,
    render_mode: Literal['auto', 'svg', 'webgl'] = 'auto',
    legend_sizing: Literal['constant', 'trace'] = 'constant',
) -> plotly.graph_objs._figure.Figure
Docstring:
Plot a neighbour-joining tree using the equal angles layout algorithm.

This function uses plotly to create an interactive plot. This allows
interactions like panning and zooming, hovering over leaf nodes to
inspect additional information, and showing or hiding entries from the
legend.

Parameters
----------
Z : ndarray[Any, dtype[float32]]
    A neighbour-joining tree encoded as a numpy array. Each row in the
    array contains data for one internal node in the tree, in the order in
    which they were created by the neighbour-joining algorithm. Within
    each row there are five values: left child node identifier, right
    child node identifier, distance to left child, distance to right
    child, total number of leaves. This data structure is similar to that
    returned by scipy's hierarchical clustering functions, except that
    here we have two distance values for each internal node rather than
    one because distances to the children may be different.
leaf_data : UnionType[DataFrame, None], optional
    A pandas DataFrame containing additional data about the original
    observations. Length of the dataframe should be the same as the size
    of each dimension of the distance matrix.
color : UnionType[str, None], optional
    Name of variable to use to color the markers.
symbol : UnionType[str, None], optional
    Name of the variable to use to choose marker symbols.
hover_name : UnionType[str, None], optional
    Name of variable to use as main label in hover tooltips.
hover_data : UnionType[list of str, None], optional
    Names of addition variables to show in hover tooltips.
center_x : UnionType[int, float], optional, default: 0
    X coordinate where plotting is centered.
center_y : UnionType[int, float], optional, default: 0
    Y coordinate where plotting is centered.
arc_start : UnionType[int, float], optional, default: 0
    Angle where tree layout begins.
arc_stop : UnionType[int, float], optional, default: 6.283185307179586
    Angle where tree layout ends.
count_sort : bool, optional, default: True
    If True, for each internal node, the child with the minimum number of
    descendants is plotted first. Note distance_sort and count_sort cannot
    both be True.
distance_sort : bool, optional, default: False
    If True, for each internal node, the child with the minimum distance
    is plotted first. Note distance_sort and count_sort cannot both be
    True.
line_width : UnionType[int, float], optional, default: 1
    Edge line width.
marker_size : UnionType[int, float], optional, default: 5
    Leaf node marker size.
internal_marker_size : UnionType[int, float], optional, default: 0
    Internal node marker size.
color_discrete_sequence : UnionType[list, None], optional
    A list of colours to use.
color_discrete_map : UnionType[Mapping, None], optional
    An explicit mapping from values to colours.
category_orders : UnionType[list, Mapping, None], optional
    Control the order in which values appear in the legend.
leaf_legend : bool, optional, default: True
    Show legend entries for the different leaf node (scatter) colors and
    symbols.
edge_legend : bool, optional, default: False
    Show legend entries for the different edge (line) colors.
default_line_color : str, optional, default: 'gray'
    Line color to use for edges where descendants are different colors.
na_color : str, optional, default: 'black'
    Color to use where data are missing.
width : UnionType[int, float, None], optional, default: 700
    Figure width in pixels (px).
height : UnionType[int, float, None], optional, default: 600
    Figure height in pixels (px).
render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'auto'
    The type of rendering backend to use. See also
    https://plotly.com/python/webgl-vs-svg/.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
    Determines if the legend items symbols scale with their corresponding
    "trace" attributes or remain "constant" independent of the symbol size
    on the graph.

Returns
-------
Figure
    A plotly figure.
File:      ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_plot.py
Type:      function

Usage notes¶

WebGL vs SVG¶

For smaller trees, plotly uses SVG for rendering, but for larger trees plotly automatically switches to use WebGL if possible. The trees rendered with WebGL should look identical, but the hover labels do not include all of the same information. If you need to maintain the same hover labels, you can override this automatic switching behaviour by passing render_mode="svg" to the anjl.plot() function. See also the plotly docs on webgl vs svg in Python.

Multithreaded execution¶

The anjl.dynamic_nj() and anjl.canonical_nj() functions will attempt to use multiple threads to speed up execution by default. If you prefer to disable this and force single-threaded execution, pass the parameter parallel=False to these functions.

Memory usage¶

All anjl neighbour-joining functions can accept a distance matrix either in square form or in condensed upper triangle form (e.g., as returned by scipy’s pdist function). To reduce memory usage, provide a condensed distance matrix. You can further reduce memory usage by passing the parameter copy=False, although be aware that the input data will be overwritten during tree construction.