Skip to article content

anjl 😇

A neighbour-joining library for Python

anjl is a Python package providing fast implementations of the neighbour-joining algorithm of Saitou and Nei and some associated utilities.

Installation¶

anjl is available from PyPI and can be installed via pip, e.g.:

%pip install anjl

Demo¶

import anjl

anjl requires a numpy array containing a distance matrix as input, either in square or condensed form. Optionally, anjl can also make use of a pandas DataFrame containing data about the original observations.

Load some demo data to try it out:

D, leaf_data = anjl.data.mosquitoes()

D is a square distance matrix containing genetic distances between mosquitoes:

D
array([[ 0., 69., 74., ..., 68., 65., 65.], [69., 0., 75., ..., 73., 76., 70.], [74., 75., 0., ..., 56., 69., 69.], ..., [68., 73., 56., ..., 0., 59., 63.], [65., 76., 69., ..., 59., 0., 68.], [65., 70., 69., ..., 63., 68., 0.]], dtype=float32)
D.shape
(181, 181)

leaf_data contains additional data about the mosquitoes:

leaf_data.head()
Loading...

Compute a neighbour-joining tree from the given distance matrix, using the rapid algorithm:

Z = anjl.dynamic_nj(D)

The return value Z is a numpy array containing an encoding of the resulting tree.

Create an interactive plot to visualise the tree:

anjl.plot(
    Z,
    leaf_data=leaf_data,
    color="taxon",
    hover_name="sample_id",
    hover_data=["country", "location", "year", "month"],
)
Loading...

This plot is rendered using plotly and is interactive - try hovering over leaf nodes for more information, pan and zoom, or click on the legend to hide or show items.

Benchmark¶

Here is a performance comparison between implementations of neighbour-joining in Python.

Loading...

This benchmark was run on a laptop with an Intel Xeon E3-1505M CPU with 4 cores and 2 threads per core, using data from the Malaria Vector Genome Observatory. The source data are available here.

API¶

Dynamic neighbour-joining¶

anjl.dynamic_nj?
Signature: anjl.dynamic_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, parallel: bool = True, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using the dynamic algorithm of Clausen [1]_. This is the fastest and most scalable implementation currently available. The dynamic algorithm exploits the fact that the neighbour- joining criterion Q is gradually weakened with each iteration, and therefore the minimum value of Q found initially within a given row provides a lower bound for all values within the same row in subsequent iterations. This allows many rows of the distance matrix to be skipped in each iteration. Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix, either in square form, or condensed upper triangle form (e.g., as returned by scipy's pdist function). To minimise memory usage, provide a condensed distance matrix, and also pass copy=False, although be aware that the input data will be overwritten during tree construction. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. If False, please note that the input data will be overwritten during tree construction. parallel : bool, optional, default: True If True, attempt to use multiple CPU threads to accelerate the computation. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. References ---------- .. [1] Https://doi.org/10.1093/bioinformatics/btac774. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_dynamic.py Type: function

Rapid neighbour-joining¶

anjl.rapid_nj?
Signature: anjl.rapid_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, gc: int | None = 100, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using an implementation based on the rapid algorithm of Simonsen et al. [1]_. This implementation builds and maintains a sorted copy of the distance matrix and uses heuristics to avoid searching pairs that cannot possibly be neighbours in each iteration. Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix, either in square form, or condensed upper triangle form (e.g., as returned by scipy's pdist function). To minimise memory usage, provide a condensed distance matrix, and also pass copy=False, although be aware that the input data will be overwritten during tree construction. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. If False, please note that the input data will be overwritten during tree construction. gc : UnionType[int, None], optional, default: 100 Number of iterations to perform between compacting data structures to remove any data corresponding to nodes that have been clustered. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. Notes ----- The ordering of the internal nodes may be different between the canonical and the rapid algorithms, because these algorithms search the distance matrix in a different order. However, the resulting trees will be topologically equivalent. References ---------- .. [1] Https://pure.au.dk/ws/files/19821675/rapidNJ.pdf. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_rapid.py Type: function

Canonical neighbour-joining¶

anjl.canonical_nj?
Signature: anjl.canonical_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, parallel: bool = True, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using the canonical algorithm. This implementation performs a full scan of the distance matrix in each iteration of the algorithm to find the pair of nearest neighbours. It is therefore slower and scales with the cube of the number of original observations in the distance matrix, i.e., O(n^3). Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix, either in square form, or condensed upper triangle form (e.g., as returned by scipy's pdist function). To minimise memory usage, provide a condensed distance matrix, and also pass copy=False, although be aware that the input data will be overwritten during tree construction. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. If False, please note that the input data will be overwritten during tree construction. parallel : bool, optional, default: True If True, attempt to use multiple CPU threads to accelerate the computation. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_canonical.py Type: function

Plotting¶

anjl.plot?
Signature: anjl.plot( Z: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]], leaf_data: pandas.core.frame.DataFrame | None = None, color: str | None = None, symbol: str | None = None, hover_name: str | None = None, hover_data: list[str] | None = None, center_x: int | float = 0, center_y: int | float = 0, arc_start: int | float = 0, arc_stop: int | float = 6.283185307179586, count_sort: bool = True, distance_sort: bool = False, line_width: int | float = 1, marker_size: int | float = 5, internal_marker_size: int | float = 0, color_discrete_sequence: list | None = None, color_discrete_map: collections.abc.Mapping | None = None, category_orders: list | collections.abc.Mapping | None = None, leaf_legend: bool = True, edge_legend: bool = False, default_line_color: str = 'gray', na_color: str = 'black', width: int | float | None = 700, height: int | float | None = 600, render_mode: Literal['auto', 'svg', 'webgl'] = 'auto', legend_sizing: Literal['constant', 'trace'] = 'constant', ) -> plotly.graph_objs._figure.Figure Docstring: Plot a neighbour-joining tree using the equal angles layout algorithm. This function uses plotly to create an interactive plot. This allows interactions like panning and zooming, hovering over leaf nodes to inspect additional information, and showing or hiding entries from the legend. Parameters ---------- Z : ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. leaf_data : UnionType[DataFrame, None], optional A pandas DataFrame containing additional data about the original observations. Length of the dataframe should be the same as the size of each dimension of the distance matrix. color : UnionType[str, None], optional Name of variable to use to color the markers. symbol : UnionType[str, None], optional Name of the variable to use to choose marker symbols. hover_name : UnionType[str, None], optional Name of variable to use as main label in hover tooltips. hover_data : UnionType[list of str, None], optional Names of addition variables to show in hover tooltips. center_x : UnionType[int, float], optional, default: 0 X coordinate where plotting is centered. center_y : UnionType[int, float], optional, default: 0 Y coordinate where plotting is centered. arc_start : UnionType[int, float], optional, default: 0 Angle where tree layout begins. arc_stop : UnionType[int, float], optional, default: 6.283185307179586 Angle where tree layout ends. count_sort : bool, optional, default: True If True, for each internal node, the child with the minimum number of descendants is plotted first. Note distance_sort and count_sort cannot both be True. distance_sort : bool, optional, default: False If True, for each internal node, the child with the minimum distance is plotted first. Note distance_sort and count_sort cannot both be True. line_width : UnionType[int, float], optional, default: 1 Edge line width. marker_size : UnionType[int, float], optional, default: 5 Leaf node marker size. internal_marker_size : UnionType[int, float], optional, default: 0 Internal node marker size. color_discrete_sequence : UnionType[list, None], optional A list of colours to use. color_discrete_map : UnionType[Mapping, None], optional An explicit mapping from values to colours. category_orders : UnionType[list, Mapping, None], optional Control the order in which values appear in the legend. leaf_legend : bool, optional, default: True Show legend entries for the different leaf node (scatter) colors and symbols. edge_legend : bool, optional, default: False Show legend entries for the different edge (line) colors. default_line_color : str, optional, default: 'gray' Line color to use for edges where descendants are different colors. na_color : str, optional, default: 'black' Color to use where data are missing. width : UnionType[int, float, None], optional, default: 700 Figure width in pixels (px). height : UnionType[int, float, None], optional, default: 600 Figure height in pixels (px). render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'auto' The type of rendering backend to use. See also https://plotly.com/python/webgl-vs-svg/. legend_sizing : {'constant', 'trace'}, optional, default: 'constant' Determines if the legend items symbols scale with their corresponding "trace" attributes or remain "constant" independent of the symbol size on the graph. Returns ------- Figure A plotly figure. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_plot.py Type: function

Usage notes¶

WebGL vs SVG¶

For smaller trees, plotly uses SVG for rendering, but for larger trees plotly automatically switches to use WebGL if possible. The trees rendered with WebGL should look identical, but the hover labels do not include all of the same information. If you need to maintain the same hover labels, you can override this automatic switching behaviour by passing render_mode="svg" to the anjl.plot() function. See also the plotly docs on webgl vs svg in Python.

Multithreaded execution¶

The anjl.dynamic_nj() and anjl.canonical_nj() functions will attempt to use multiple threads to speed up execution by default. If you prefer to disable this and force single-threaded execution, pass the parameter parallel=False to these functions.

Memory usage¶

All anjl neighbour-joining functions can accept a distance matrix either in square form or in condensed upper triangle form (e.g., as returned by scipy’s pdist function). To reduce memory usage, provide a condensed distance matrix. You can further reduce memory usage by passing the parameter copy=False, although be aware that the input data will be overwritten during tree construction.