pycea.tl.tree_distance


pycea.tl.tree_distance(tdata, depth_key='depth', obs=None, metric='path', sample_n=None, connect_key=None, random_state=None, key_added=None, update=True, tree=None, copy=False)

Computes tree distances between observations.

This function calculates distances between observations (typically tree leaves) based on their positions and depths in the tree. It supports lowest common ancestor (lca) and path distances.

Given two nodes \(i\) and \(j\) in a rooted tree, with depths \(d_i\) and \(d_j\), and with their lowest common ancestor having depth \(d_{LCA(i,j)}\):

\[D_{ij}^{lca} = d_{LCA(i,j)}\]
\[D_{ij}^{path} = \left| d_i + d_j - 2 d_{LCA(i,j)} \right|\]

\(D_{ij}^{lca}\) is the depth of the two nodes' lowest common ancestor (larger values indicate greater shared ancestry). In contrast, \(D_{ij}^{path}\) measures the distance along the tree between the two nodes (smaller values indicate closer proximity).
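The two formulas above can be checked by hand on a small tree. The following is a minimal sketch in plain Python, not pycea's implementation: the toy tree, its depth values, and the helper names (`ancestors`, `lca`, `lca_distance`, `path_distance`) are all hypothetical and chosen only for illustration.

```python
# Toy rooted tree as a child -> parent map (hypothetical example data).
parent = {"root": None, "a": "root", "b": "root", "x": "a", "y": "a", "z": "b"}
# Depth of each node, playing the role of d_i in the formulas.
depth = {"root": 0.0, "a": 1.0, "b": 1.5, "x": 2.0, "y": 3.0, "z": 2.5}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(i, j):
    """Lowest common ancestor: first node on j's root path that is also on i's."""
    seen = set(ancestors(i))
    for node in ancestors(j):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def lca_distance(i, j):
    """D^lca: depth of the lowest common ancestor."""
    return depth[lca(i, j)]


def path_distance(i, j):
    """D^path: |d_i + d_j - 2 * d_LCA(i, j)|."""
    return abs(depth[i] + depth[j] - 2 * depth[lca(i, j)])


print(lca_distance("x", "y"), path_distance("x", "y"))  # 1.0 3.0
print(lca_distance("x", "z"), path_distance("x", "z"))  # 0.0 4.5
```

Leaves `x` and `y` share the ancestor `a`, so their LCA distance is larger and their path distance smaller than for `x` and `z`, which only meet at the root.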

Parameters:
  • tdata (TreeData) – The TreeData object.

  • depth_key (str (default: 'depth')) – Attribute of tdata.obst[tree].nodes where depth is stored.

  • obs (str | int | Sequence[Any] | None (default: None)) –

    The observations to use:

    • If None, pairwise distance for tree leaves is stored in tdata.obsp.

    • If a string, distance to all other tree leaves is stored in tdata.obs.

    • If a sequence, pairwise distance is stored in tdata.obsp.

    • If a sequence of pairs, distance between pairs is stored in tdata.obsp.

  • metric (Literal['lca', 'path'] (default: 'path')) –

    The type of tree distance to compute:

    • 'lca': lowest common ancestor depth.

    • 'path': abs(node1 depth + node2 depth - 2 * lca depth).

  • sample_n (int | None (default: None)) – If specified, randomly sample sample_n pairs of observations.

  • connect_key (str | None (default: None)) – If specified, compute distances only between connected observations specified by tdata.obsp['{connect_key}_connectivities'].

  • random_state (int | None (default: None)) – Random seed for sampling.

  • key_added (str | None (default: None)) – Distances are stored in tdata.obsp['{key_added}_distances'] and connectivities in tdata.obsp['{key_added}_connectivities']. Defaults to ‘tree’.

  • update (bool (default: True)) – If True, updates existing distances instead of overwriting.

  • tree (str | Sequence[Any] | None (default: None)) – The obst key or keys of the trees to use. If None, all trees are used.

  • copy (Literal[True, False] (default: False)) – If True, returns an ndarray or csr_matrix with distances.
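To illustrate how sample_n and random_state interact, here is a minimal sketch of seeded pair sampling in plain Python. It is an assumption-laden illustration, not pycea's code: the helper `sample_pairs` is hypothetical, and it assumes unordered pairs drawn without replacement, which this page does not specify.

```python
import itertools
import random


def sample_pairs(leaves, sample_n, random_state=None):
    """Draw up to sample_n distinct unordered leaf pairs (hypothetical helper).

    Assumption: sampling is without replacement; the library may differ.
    """
    pairs = list(itertools.combinations(leaves, 2))
    rng = random.Random(random_state)  # a fixed seed makes the draw repeatable
    return rng.sample(pairs, min(sample_n, len(pairs)))


leaves = ["x", "y", "z", "w"]
first = sample_pairs(leaves, 3, random_state=0)
second = sample_pairs(leaves, 3, random_state=0)
print(first == second)  # True: the same seed yields the same pairs
```

Passing the same random_state on repeated calls is what makes a sampled distance computation reproducible; leaving it as None gives a different draw each time.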

Return type:

None | csr_matrix | ndarray

Returns:

Returns None if copy=False, else returns ndarray/csr_matrix.

Sets the following fields:

  • tdata.obsp['{key_added}_distances'] : ndarray/csr_matrix (dtype float) if obs is None or a sequence.
    • Distances between observations.

  • tdata.obsp['{key_added}_connectivities'] : csr_matrix (dtype float) if distance is sparse.
    • Connectivity between observations.

  • tdata.obs['{key_added}_distances'] : Series (dtype float) if obs is a string.
    • Distance from specified observation to others.

Examples

Compute full pairwise path distances for tree leaves:

>>> import pycea as py
>>> tdata = py.datasets.koblan25()
>>> py.tl.tree_distance(tdata, metric="path")

Sample 1000 random pairs of observations and compute their LCA distances, using the node attribute 'time' as depth:

>>> py.tl.tree_distance(tdata, metric="lca", sample_n=1000, depth_key="time")