anjl
is a Python package providing fast implementations of the neighbour-joining algorithm of Saitou and Nei and some associated utilities.
%pip install anjl
Demo¶
import anjl
anjl
requires a numpy array containing a distance matrix as input, either in square or condensed form. Optionally, anjl
can also make use of a pandas DataFrame containing data about the original observations.
Load some demo data to try it out:
D, leaf_data = anjl.data.mosquitoes()
D
is a square distance matrix containing genetic distances between mosquitoes:
D
array([[ 0., 69., 74., ..., 68., 65., 65.],
[69., 0., 75., ..., 73., 76., 70.],
[74., 75., 0., ..., 56., 69., 69.],
...,
[68., 73., 56., ..., 0., 59., 63.],
[65., 76., 69., ..., 59., 0., 68.],
[65., 70., 69., ..., 63., 68., 0.]], dtype=float32)
D.shape
(181, 181)
leaf_data
contains additional data about the mosquitoes:
leaf_data.head()
Compute a neighbour-joining tree from the given distance matrix, using the rapid algorithm:
Z = anjl.dynamic_nj(D)
The return value Z
is a numpy array containing an encoding of the resulting tree.
Create an interactive plot to visualise the tree:
anjl.plot(
Z,
leaf_data=leaf_data,
color="taxon",
hover_name="sample_id",
hover_data=["country", "location", "year", "month"],
)
This plot is rendered using plotly and is interactive - try hovering over leaf nodes for more information, pan and zoom, or click on the legend to hide or show items.
Benchmark¶
Here is a performance comparison between implementations of neighbour-joining in Python.
This benchmark was run on a laptop with an Intel Xeon E3-1505M CPU with 4 cores and 2 threads per core, using data from the Malaria Vector Genome Observatory. The source data are available here.
API¶
Dynamic neighbour-joining¶
anjl.dynamic_nj?
Signature:
anjl.dynamic_nj(
D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
disallow_negative_distances: bool = True,
progress: Optional[Callable] = None,
progress_options: collections.abc.Mapping = {},
copy: bool | None = True,
parallel: bool = True,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using the dynamic algorithm of Clausen [1]_.
This is the fastest and most scalable implementation currently
available. The dynamic algorithm exploits the fact that the neighbour-
joining criterion Q is gradually weakened with each iteration, and
therefore the minimum value of Q found initially within a given row
provides a lower bound for all values within the same row in subsequent
iterations. This allows many rows of the distance matrix to be skipped
in each iteration.
Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
A distance matrix, either in square form, or condensed upper triangle
form (e.g., as returned by scipy's pdist function). To minimise memory
usage, provide a condensed distance matrix, and also pass copy=False,
although be aware that the input data will be overwritten during tree
construction.
disallow_negative_distances : bool, optional, default: True
If True, set any negative distances to zero.
progress : Callable or None, optional
A function which will be used to wrap the main loop iterator to
provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
Passed through to numpy.array(). For numpy version 2.0 and later, if
True (default), then the array data is copied. If None, a copy will
only be made if necessary. If False it raises a ValueError if a copy
cannot be avoided. If False, please note that the input data will be
overwritten during tree construction.
parallel : bool, optional, default: True
If True, attempt to use multiple CPU threads to accelerate the
computation.
Returns
-------
ndarray[Any, dtype[float32]]
A neighbour-joining tree encoded as a numpy array. Each row in the
array contains data for one internal node in the tree, in the order
in which they were created by the neighbour-joining algorithm.
Within each row there are five values: left child node identifier,
right child node identifier, distance to left child, distance to
right child, total number of leaves. This data structure is similar
to that returned by scipy's hierarchical clustering functions,
except that here we have two distance values for each internal node
rather than one because distances to the children may be different.
References
----------
.. [1] Https://doi.org/10.1093/bioinformatics/btac774.
File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_dynamic.py
Type: function
Rapid neighbour-joining¶
anjl.rapid_nj?
Signature:
anjl.rapid_nj(
D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
disallow_negative_distances: bool = True,
progress: Optional[Callable] = None,
progress_options: collections.abc.Mapping = {},
copy: bool | None = True,
gc: int | None = 100,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using an implementation based on the rapid
algorithm of Simonsen et al. [1]_.
This implementation builds and maintains a sorted copy of the distance
matrix and uses heuristics to avoid searching pairs that cannot possibly
be neighbours in each iteration.
Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
A distance matrix, either in square form, or condensed upper triangle
form (e.g., as returned by scipy's pdist function). To minimise memory
usage, provide a condensed distance matrix, and also pass copy=False,
although be aware that the input data will be overwritten during tree
construction.
disallow_negative_distances : bool, optional, default: True
If True, set any negative distances to zero.
progress : Callable or None, optional
A function which will be used to wrap the main loop iterator to
provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
Passed through to numpy.array(). For numpy version 2.0 and later, if
True (default), then the array data is copied. If None, a copy will
only be made if necessary. If False it raises a ValueError if a copy
cannot be avoided. If False, please note that the input data will be
overwritten during tree construction.
gc : UnionType[int, None], optional, default: 100
Number of iterations to perform between compacting data structures to
remove any data corresponding to nodes that have been clustered.
Returns
-------
ndarray[Any, dtype[float32]]
A neighbour-joining tree encoded as a numpy array. Each row in the
array contains data for one internal node in the tree, in the order
in which they were created by the neighbour-joining algorithm.
Within each row there are five values: left child node identifier,
right child node identifier, distance to left child, distance to
right child, total number of leaves. This data structure is similar
to that returned by scipy's hierarchical clustering functions,
except that here we have two distance values for each internal node
rather than one because distances to the children may be different.
Notes
-----
The ordering of the internal nodes may be different between the
canonical and the rapid algorithms, because these algorithms search
the distance matrix in a different order. However, the resulting trees
will be topologically equivalent.
References
----------
.. [1] Https://pure.au.dk/ws/files/19821675/rapidNJ.pdf.
File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_rapid.py
Type: function
Canonical neighbour-joining¶
anjl.canonical_nj?
Signature:
anjl.canonical_nj(
D: numpy.ndarray[typing.Any, numpy.dtype[+_ScalarType_co]],
disallow_negative_distances: bool = True,
progress: Optional[Callable] = None,
progress_options: collections.abc.Mapping = {},
copy: bool | None = True,
parallel: bool = True,
) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]
Docstring:
Perform neighbour-joining using the canonical algorithm.
This implementation performs a full scan of the distance matrix in each
iteration of the algorithm to find the pair of nearest neighbours. It is
therefore slower and scales with the cube of the number of original
observations in the distance matrix, i.e., O(n^3).
Parameters
----------
D : ndarray[Any, dtype[_ScalarType_co]]
A distance matrix, either in square form, or condensed upper triangle
form (e.g., as returned by scipy's pdist function). To minimise memory
usage, provide a condensed distance matrix, and also pass copy=False,
although be aware that the input data will be overwritten during tree
construction.
disallow_negative_distances : bool, optional, default: True
If True, set any negative distances to zero.
progress : Callable or None, optional
A function which will be used to wrap the main loop iterator to
provide information on progress. E.g., could be tqdm.
progress_options : Mapping, optional, default: {}
Any options to be passed into the progress function.
copy : UnionType[bool, None], optional, default: True
Passed through to numpy.array(). For numpy version 2.0 and later, if
True (default), then the array data is copied. If None, a copy will
only be made if necessary. If False it raises a ValueError if a copy
cannot be avoided. If False, please note that the input data will be
overwritten during tree construction.
parallel : bool, optional, default: True
If True, attempt to use multiple CPU threads to accelerate the
computation.
Returns
-------
ndarray[Any, dtype[float32]]
A neighbour-joining tree encoded as a numpy array. Each row in the
array contains data for one internal node in the tree, in the order
in which they were created by the neighbour-joining algorithm.
Within each row there are five values: left child node identifier,
right child node identifier, distance to left child, distance to
right child, total number of leaves. This data structure is similar
to that returned by scipy's hierarchical clustering functions,
except that here we have two distance values for each internal node
rather than one because distances to the children may be different.
File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_canonical.py
Type: function
Plotting¶
anjl.plot?
Signature:
anjl.plot(
Z: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]],
leaf_data: pandas.core.frame.DataFrame | None = None,
color: str | None = None,
symbol: str | None = None,
hover_name: str | None = None,
hover_data: list[str] | None = None,
center_x: int | float = 0,
center_y: int | float = 0,
arc_start: int | float = 0,
arc_stop: int | float = 6.283185307179586,
count_sort: bool = True,
distance_sort: bool = False,
line_width: int | float = 1,
marker_size: int | float = 5,
internal_marker_size: int | float = 0,
color_discrete_sequence: list | None = None,
color_discrete_map: collections.abc.Mapping | None = None,
category_orders: list | collections.abc.Mapping | None = None,
leaf_legend: bool = True,
edge_legend: bool = False,
default_line_color: str = 'gray',
na_color: str = 'black',
width: int | float | None = 700,
height: int | float | None = 600,
render_mode: Literal['auto', 'svg', 'webgl'] = 'auto',
legend_sizing: Literal['constant', 'trace'] = 'constant',
) -> plotly.graph_objs._figure.Figure
Docstring:
Plot a neighbour-joining tree using the equal angles layout algorithm.
This function uses plotly to create an interactive plot. This allows
interactions like panning and zooming, hovering over leaf nodes to
inspect additional information, and showing or hiding entries from the
legend.
Parameters
----------
Z : ndarray[Any, dtype[float32]]
A neighbour-joining tree encoded as a numpy array. Each row in the
array contains data for one internal node in the tree, in the order in
which they were created by the neighbour-joining algorithm. Within
each row there are five values: left child node identifier, right
child node identifier, distance to left child, distance to right
child, total number of leaves. This data structure is similar to that
returned by scipy's hierarchical clustering functions, except that
here we have two distance values for each internal node rather than
one because distances to the children may be different.
leaf_data : UnionType[DataFrame, None], optional
A pandas DataFrame containing additional data about the original
observations. Length of the dataframe should be the same as the size
of each dimension of the distance matrix.
color : UnionType[str, None], optional
Name of variable to use to color the markers.
symbol : UnionType[str, None], optional
Name of the variable to use to choose marker symbols.
hover_name : UnionType[str, None], optional
Name of variable to use as main label in hover tooltips.
hover_data : UnionType[list of str, None], optional
Names of addition variables to show in hover tooltips.
center_x : UnionType[int, float], optional, default: 0
X coordinate where plotting is centered.
center_y : UnionType[int, float], optional, default: 0
Y coordinate where plotting is centered.
arc_start : UnionType[int, float], optional, default: 0
Angle where tree layout begins.
arc_stop : UnionType[int, float], optional, default: 6.283185307179586
Angle where tree layout ends.
count_sort : bool, optional, default: True
If True, for each internal node, the child with the minimum number of
descendants is plotted first. Note distance_sort and count_sort cannot
both be True.
distance_sort : bool, optional, default: False
If True, for each internal node, the child with the minimum distance
is plotted first. Note distance_sort and count_sort cannot both be
True.
line_width : UnionType[int, float], optional, default: 1
Edge line width.
marker_size : UnionType[int, float], optional, default: 5
Leaf node marker size.
internal_marker_size : UnionType[int, float], optional, default: 0
Internal node marker size.
color_discrete_sequence : UnionType[list, None], optional
A list of colours to use.
color_discrete_map : UnionType[Mapping, None], optional
An explicit mapping from values to colours.
category_orders : UnionType[list, Mapping, None], optional
Control the order in which values appear in the legend.
leaf_legend : bool, optional, default: True
Show legend entries for the different leaf node (scatter) colors and
symbols.
edge_legend : bool, optional, default: False
Show legend entries for the different edge (line) colors.
default_line_color : str, optional, default: 'gray'
Line color to use for edges where descendants are different colors.
na_color : str, optional, default: 'black'
Color to use where data are missing.
width : UnionType[int, float, None], optional, default: 700
Figure width in pixels (px).
height : UnionType[int, float, None], optional, default: 600
Figure height in pixels (px).
render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'auto'
The type of rendering backend to use. See also
https://plotly.com/python/webgl-vs-svg/.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
Determines if the legend items symbols scale with their corresponding
"trace" attributes or remain "constant" independent of the symbol size
on the graph.
Returns
-------
Figure
A plotly figure.
File: ~/github/alimanfoo/anjl/docs/.pixi/envs/default/lib/python3.12/site-packages/anjl/_plot.py
Type: function
Usage notes¶
WebGL vs SVG¶
For smaller trees, plotly uses SVG for rendering, but for larger trees plotly automatically switches to use WebGL if possible. The trees rendered with WebGL should look identical, but the hover labels do not include all of the same information. If you need to maintain the same hover labels, you can override this automatic switching behaviour by passing render_mode="svg"
to the anjl.plot()
function. See also the plotly docs on webgl vs svg in Python.
Multithreaded execution¶
The anjl.dynamic_nj()
and anjl.canonical_nj()
functions will attempt to use multiple threads to speed up execution by default. If you prefer to disable this and force single-threaded execution, pass the parameter parallel=False
to these functions.
Memory usage¶
All anjl
neighbour-joining functions can accept a distance matrix either in square form or in condensed upper triangle form (e.g., as returned by scipy’s pdist function). To reduce memory usage, provide a condensed distance matrix. You can further reduce memory usage by passing the parameter copy=False
, although be aware that the input data will be overwritten during tree construction.