Skip to article content

anjl 😇

A neighbour-joining library for Python

anjl is a Python package providing fast implementations of the neighbour-joining algorithm of Saitou and Nei and some associated utilities.

Installation¶

anjl is available from PyPI and can be installed via pip, e.g.:

%pip install anjl

Demo¶

import anjl

anjl requires a numpy array containing a square distance matrix as input. Optionally, anjl can also make use of a pandas DataFrame containing data about the original observations.

Load some demo data to try it out:

D, leaf_data = anjl.data.mosquitoes()

D is a square distance matrix containing genetic distances between mosquitoes:

D
array([[ 0., 69., 74., ..., 68., 65., 65.], [69., 0., 75., ..., 73., 76., 70.], [74., 75., 0., ..., 56., 69., 69.], ..., [68., 73., 56., ..., 0., 59., 63.], [65., 76., 69., ..., 59., 0., 68.], [65., 70., 69., ..., 63., 68., 0.]], dtype=float32)
D.shape
(181, 181)

leaf_data contains additional data about the mosquitoes:

leaf_data.head()
Loading...

Compute a neighbour-joining tree from the given distance matrix, using the rapid algorithm:

Z = anjl.dynamic_nj(D)

The return value Z is a numpy array containing an encoding of the resulting tree.

Create an interactive plot to visualise the tree:

anjl.plot(
    Z,
    leaf_data=leaf_data,
    color="taxon",
    hover_name="sample_id",
    hover_data=["country", "location", "year", "month"],
)
Loading...

This plot is rendered using plotly and is interactive - try hovering over leaf nodes for more information, pan and zoom, or click on the legend to hide or show items.

Benchmark¶

Here is a performance comparison between anjl and other currently available implementations of neighbour-joining in Python.

Loading...

This benchmark was run on pairwise genetic distances between mosquitoes using data from the Malaria Vector Genome Observatory. The source data is available here.

API¶

Dynamic neighbour-joining¶

anjl.dynamic_nj?
Signature: anjl.dynamic_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using the dynamic algorithm of Clausen [1]_. This is the fastest and most scalable implementation currently available. The dynamic algorithm exploits the fact that the neighbour- joining criterion Q is gradually weakened with each iteration, and therefore the minimum value of Q found initially within a given row provides a lower bound for all values within the same row in subsequent iterations. This allows many rows of the distance matrix to be skipped in each iteration. Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix in square form. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. References ---------- .. [1] Https://doi.org/10.1093/bioinformatics/btac774. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_dynamic.py Type: function

Rapid neighbour-joining¶

anjl.rapid_nj?
Signature: anjl.rapid_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, gc: int | None = 100, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using an implementation based on the rapid algorithm of Simonsen et al. [1]_. This implementation builds and maintains a sorted copy of the distance matrix and uses heuristics to avoid searching pairs that cannot possibly be neighbours in each iteration. Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix in square form. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. gc : UnionType[int, None], optional, default: 100 Number of iterations to perform between compacting data structures to remove any data corresponding to nodes that have been clustered. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. Notes ----- The ordering of the internal nodes may be different between the canonical and the rapid algorithms, because these algorithms search the distance matrix in a different order. However, the resulting trees will be topologically equivalent. References ---------- .. [1] Https://pure.au.dk/ws/files/19821675/rapidNJ.pdf. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_rapid.py Type: function

Canonical neighbour-joining¶

anjl.canonical_nj?
Signature: anjl.canonical_nj( D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]], disallow_negative_distances: bool = True, progress: Optional[Callable] = None, progress_options: collections.abc.Mapping = {}, copy: bool | None = True, ) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]] Docstring: Perform neighbour-joining using the canonical algorithm. This implementation performs a full scan of the distance matrix in each iteration of the algorithm to find the pair of nearest neighbours. It is therefore slower and scales with the cube of the number of original observations in the distance matrix, i.e., O(n^3). Parameters ---------- D : ndarray[Any, dtype[_ScalarType_co]] A distance matrix in square form. disallow_negative_distances : bool, optional, default: True If True, set any negative distances to zero. progress : Callable or None, optional A function which will be used to wrap the main loop iterator to provide information on progress. E.g., could be tqdm. progress_options : Mapping, optional, default: {} Any options to be passed into the progress function. copy : UnionType[bool, None], optional, default: True Passed through to numpy.array(). For numpy version 2.0 and later, if True (default), then the array data is copied. If None, a copy will only be made if necessary. If False it raises a ValueError if a copy cannot be avoided. Returns ------- ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_canonical.py Type: function

Plotting¶

anjl.plot?
Signature: anjl.plot( Z: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]], leaf_data: pandas.core.frame.DataFrame | None = None, color: str | None = None, symbol: str | None = None, hover_name: str | None = None, hover_data: list[str] | None = None, center_x: int | float = 0, center_y: int | float = 0, arc_start: int | float = 0, arc_stop: int | float = 6.283185307179586, count_sort: bool = True, distance_sort: bool = False, line_width: int | float = 1, marker_size: int | float = 5, internal_marker_size: int | float = 0, color_discrete_sequence: list | None = None, color_discrete_map: collections.abc.Mapping | None = None, category_orders: list | collections.abc.Mapping | None = None, leaf_legend: bool = True, edge_legend: bool = False, default_line_color: str = 'gray', na_color: str = 'black', width: int | float | None = 700, height: int | float | None = 600, render_mode: Literal['auto', 'svg', 'webgl'] = 'auto', legend_sizing: Literal['constant', 'trace'] = 'constant', ) -> plotly.graph_objs._figure.Figure Docstring: Plot a neighbour-joining tree using the equal angles layout algorithm. This function uses plotly to create an interactive plot. This allows interactions like panning and zooming, hovering over leaf nodes to inspect additional information, and showing or hiding entries from the legend. Parameters ---------- Z : ndarray[Any, dtype[float32]] A neighbour-joining tree encoded as a numpy array. Each row in the array contains data for one internal node in the tree, in the order in which they were created by the neighbour-joining algorithm. Within each row there are five values: left child node identifier, right child node identifier, distance to left child, distance to right child, total number of leaves. This data structure is similar to that returned by scipy's hierarchical clustering functions, except that here we have two distance values for each internal node rather than one because distances to the children may be different. leaf_data : UnionType[DataFrame, None], optional A pandas DataFrame containing additional data about the original observations. Length of the dataframe should be the same as the size of each dimension of the distance matrix. color : UnionType[str, None], optional Name of variable to use to color the markers. symbol : UnionType[str, None], optional Name of the variable to use to choose marker symbols. hover_name : UnionType[str, None], optional Name of variable to use as main label in hover tooltips. hover_data : UnionType[list of str, None], optional Names of addition variables to show in hover tooltips. center_x : UnionType[int, float], optional, default: 0 X coordinate where plotting is centered. center_y : UnionType[int, float], optional, default: 0 Y coordinate where plotting is centered. arc_start : UnionType[int, float], optional, default: 0 Angle where tree layout begins. arc_stop : UnionType[int, float], optional, default: 6.283185307179586 Angle where tree layout ends. count_sort : bool, optional, default: True If True, for each internal node, the child with the minimum number of descendants is plotted first. Note distance_sort and count_sort cannot both be True. distance_sort : bool, optional, default: False If True, for each internal node, the child with the minimum distance is plotted first. Note distance_sort and count_sort cannot both be True. line_width : UnionType[int, float], optional, default: 1 Edge line width. marker_size : UnionType[int, float], optional, default: 5 Leaf node marker size. internal_marker_size : UnionType[int, float], optional, default: 0 Internal node marker size. color_discrete_sequence : UnionType[list, None], optional A list of colours to use. color_discrete_map : UnionType[Mapping, None], optional An explicit mapping from values to colours. category_orders : UnionType[list, Mapping, None], optional Control the order in which values appear in the legend. leaf_legend : bool, optional, default: True Show legend entries for the different leaf node (scatter) colors and symbols. edge_legend : bool, optional, default: False Show legend entries for the different edge (line) colors. default_line_color : str, optional, default: 'gray' Line color to use for edges where descendants are different colors. na_color : str, optional, default: 'black' Color to use where data are missing. width : UnionType[int, float, None], optional, default: 700 Figure width in pixels (px). height : UnionType[int, float, None], optional, default: 600 Figure height in pixels (px). render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'auto' The type of rendering backend to use. See also https://plotly.com/python/webgl-vs-svg/. legend_sizing : {'constant', 'trace'}, optional, default: 'constant' Determines if the legend items symbols scale with their corresponding "trace" attributes or remain "constant" independent of the symbol size on the graph. Returns ------- Figure A plotly figure. File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_plot.py Type: function