pycea.tl.tree_distance

pycea.tl.tree_distance(tdata, depth_key='depth', obs=None, metric='path', sample_n=None, connect_key=None, random_state=None, key_added=None, update=True, tree=None, copy=False)
Overloads:
  • tdata (td.TreeData), depth_key (str), obs (str | int | Sequence[Any] | None), metric (_TreeMetric), sample_n (int | None), connect_key (str | None), random_state (int | None), key_added (str | None), update (bool), tree (str | Sequence[Any] | None), copy (Literal[True, False]) → sp.sparse.csr_matrix | np.ndarray

  • tdata (td.TreeData), depth_key (str), obs (str | int | Sequence[Any] | None), metric (_TreeMetric), sample_n (int | None), connect_key (str | None), random_state (int | None), key_added (str | None), update (bool), tree (str | Sequence[Any] | None), copy (Literal[True, False]) → None

Computes tree distances between observations.

This function calculates distances between observations based on their positions and depths in the tree. For tdata.alignment == "leaves", this computes distances between leaf nodes. For tdata.alignment == "nodes" or "subset", distances are computed between all observed nodes (leaves and internal nodes in tdata.obs). It supports lowest common ancestor (lca) and path distances.

Given two nodes $i$ and $j$ in a rooted tree, with depths $d_i$ and $d_j$, and with their lowest common ancestor having depth $d_{LCA(i,j)}$:

$$D_{ij}^{lca} = d_{LCA(i,j)}$$
$$D_{ij}^{path} = \left| d_i + d_j - 2 d_{LCA(i,j)} \right|$$

$D_{ij}^{lca}$ is the depth of the two nodes' lowest common ancestor (larger values indicate greater shared ancestry). In contrast, $D_{ij}^{path}$ measures the distance along the tree between the two nodes (smaller values indicate closer proximity).
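To make the two definitions concrete, here is a minimal pure-Python sketch that evaluates both metrics on a toy rooted tree. The tree, the `parent` and `depth` dictionaries, and the helper names are hypothetical illustrations; this is not how pycea computes distances internally.

```python
# Toy rooted tree as a child -> parent map (hypothetical example data).
parent = {"root": None, "a": "root", "b": "root", "c": "a", "d": "a", "e": "b"}
# Depth of each node; in this sketch, depth is the number of edges from the root.
depth = {"root": 0, "a": 1, "b": 1, "c": 2, "d": 2, "e": 2}


def ancestors(node):
    """Return the node followed by its ancestors, ending at the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def lca(i, j):
    """Lowest common ancestor: first node on j's path to the root that is also above (or equal to) i."""
    seen = set(ancestors(i))
    for node in ancestors(j):
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def lca_distance(i, j):
    """Depth of the lowest common ancestor of i and j."""
    return depth[lca(i, j)]


def path_distance(i, j):
    """abs(depth of i + depth of j - 2 * depth of their lowest common ancestor)."""
    return abs(depth[i] + depth[j] - 2 * depth[lca(i, j)])


# Siblings c and d share the ancestor a (depth 1).
print(lca_distance("c", "d"), path_distance("c", "d"))  # 1 2
# c and e only share the root (depth 0).
print(lca_distance("c", "e"), path_distance("c", "e"))  # 0 4
```

Note that the two metrics move in opposite directions: the sibling pair has the larger LCA value and the smaller path value.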

Parameters:
  • tdata (TreeData) – The TreeData object.

  • depth_key (str (default: 'depth')) – Attribute of tdata.obst[tree].nodes where depth is stored.

  • obs (str | int | Sequence[Any] | None (default: None)) –

    The observations to use:

    • If None, pairwise distance for all observed nodes is stored in tdata.obsp.

    • If a string, distance to all other observed nodes is stored in tdata.obs.

    • If a sequence, pairwise distance is stored in tdata.obsp.

    • If a sequence of pairs, distance between pairs is stored in tdata.obsp.

  • metric (Literal['lca', 'path'] (default: 'path')) –

    The type of tree distance to compute:

    • 'lca': lowest common ancestor depth.

    • 'path': abs(node1 depth + node2 depth - 2 * lca depth).

  • sample_n (int | None (default: None)) – If specified, randomly sample sample_n pairs of observations.

  • connect_key (str | None (default: None)) – If specified, compute distances only between connected observations specified by tdata.obsp['{connect_key}_connectivities'].

  • random_state (int | None (default: None)) – Random seed for sampling.

  • key_added (str | None (default: None)) – Distances are stored in tdata.obsp['{key_added}_distances'] and connectivities in tdata.obsp['{key_added}_connectivities']. Defaults to ‘tree’.

  • update (bool (default: True)) – If True, updates existing distances instead of overwriting.

  • tree (str | Sequence[Any] | None (default: None)) – The obst key or keys of the trees to use. If None, all trees are used.

  • copy (Literal[True, False] (default: False)) – If True, returns an ndarray or csr_matrix with distances.
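The interaction of sample_n, random_state and connect_key can be pictured with a small standalone sketch. It assumes unordered pairs drawn without replacement and represents connectivity as a set of pairs; both choices, along with the helper name `sample_pairs`, are illustrative assumptions and may differ from what pycea actually does.

```python
import random
from itertools import combinations


def sample_pairs(obs_names, sample_n, random_state=None, connected=None):
    """Pick up to `sample_n` unordered pairs of observations (hypothetical helper).

    `connected` is an optional set of frozenset pairs; when given, only
    those pairs are eligible, mimicking a connectivity restriction.
    """
    candidates = [
        pair
        for pair in combinations(sorted(obs_names), 2)
        if connected is None or frozenset(pair) in connected
    ]
    # A fixed seed makes the draw reproducible across calls.
    rng = random.Random(random_state)
    return rng.sample(candidates, min(sample_n, len(candidates)))


names = ["c", "d", "e", "f"]
first = sample_pairs(names, 3, random_state=0)
second = sample_pairs(names, 3, random_state=0)
print(first == second)  # True: same seed, same pairs

# Restrict sampling to two connected pairs.
links = {frozenset(("c", "d")), frozenset(("e", "f"))}
print(sorted(sample_pairs(names, 5, random_state=0, connected=links)))
# [('c', 'd'), ('e', 'f')]
```

In this sketch, requesting more pairs than are eligible simply returns every eligible pair.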

Returns:

Returns None if copy=False, else returns ndarray/csr_matrix.

Sets the following fields:

  • tdata.obsp['{key_added}_distances'] : ndarray/csr_matrix (dtype float) if obs is None or a sequence.
    • Distances between observations.

  • tdata.obsp['{key_added}_connectivities'] : csr_matrix (dtype float) if distance is sparse.
    • Connectivity between observations.

  • tdata.obs['{key_added}_distances'] : Series (dtype float) if obs is a string.
    • Distance from specified observation to others.

Examples

Compute full pairwise path distances for tree leaves:

>>> import pycea as py
>>> tdata = py.datasets.koblan25()
>>> py.tl.tree_distance(tdata, metric="path")

Sample 1000 random LCA distances using node ‘time’ as depth:

>>> py.tl.tree_distance(tdata, metric="lca", sample_n=1000, depth_key="time")