pycea.tl.partition_test


pycea.tl.partition_test(tdata, keys, comparison='siblings', test='permutation', aggregate='mean', metric='mean_difference', metric_kwds=None, n_permutations=100, random_state=None, equal_var=True, min_group_leaves=10, keys_added=None, tree=None, copy=True)
Overloads:
  • tdata (td.TreeData), keys (str | Sequence[str]), comparison (Literal['siblings', 'rest']), test (Literal['permutation', 't-test'] | None), aggregate (_AggregatorFn | _Aggregator), metric (_MetricFn | _Metric | Literal['mean_difference']), metric_kwds (Mapping | None), n_permutations (int), random_state (int | None), equal_var (bool), min_group_leaves (int), keys_added (str | Sequence[str] | None), tree (str | Sequence[str] | None), copy (Literal[True, False]) → pd.DataFrame

  • tdata (td.TreeData), keys (str | Sequence[str]), comparison (Literal['siblings', 'rest']), test (Literal['permutation', 't-test'] | None), aggregate (_AggregatorFn | _Aggregator), metric (_MetricFn | _Metric | Literal['mean_difference']), metric_kwds (Mapping | None), n_permutations (int), random_state (int | None), equal_var (bool), min_group_leaves (int), keys_added (str | Sequence[str] | None), tree (str | Sequence[str] | None), copy (Literal[True, False]) → None

Test for differences between leaf partitions.

For each requested observation key, this function compares the set of leaves descended from each internal node (group1) to the set of leaves defined by the comparison parameter (group2):

  • comparison='siblings':

    Compare to the descendants of sibling nodes. When there is more than one sibling (i.e., a non-binary split), each child node is compared individually to the pooled set of all other siblings.

  • comparison='rest':

    Compare to all other leaves in the tree not descended from the given node.
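
The two comparison modes can be illustrated on a toy tree. This is a minimal sketch and not pycea's implementation: it assumes the tree is given as a hypothetical `children` dict mapping each node to its child nodes, and the helper names `leaves_under` and `comparison_groups` are invented for illustration.

```python
def leaves_under(children, node):
    """Return the set of leaves descended from `node` (a leaf returns itself)."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    leaves = set()
    for kid in kids:
        leaves |= leaves_under(children, kid)
    return leaves


def comparison_groups(children, root, node, comparison="siblings"):
    """Return (group1, group2) leaf sets for `node` under the given comparison."""
    group1 = leaves_under(children, node)
    if comparison == "rest":
        # Every leaf of the tree that is not descended from `node`.
        return group1, leaves_under(children, root) - group1
    if comparison == "siblings":
        # Pool the leaves of all other children of the same parent.
        parent = next(p for p, kids in children.items() if node in kids)
        group2 = set()
        for sibling in children[parent]:
            if sibling != node:
                group2 |= leaves_under(children, sibling)
        return group1, group2
    raise ValueError(f"unknown comparison: {comparison!r}")


# Toy tree (hypothetical): root -> (A, B); A -> (a1, a2, C); C -> (c1, c2); B -> (b1, b2)
children = {
    "root": ["A", "B"],
    "A": ["a1", "a2", "C"],
    "C": ["c1", "c2"],
    "B": ["b1", "b2"],
}
sib = comparison_groups(children, "root", "C", comparison="siblings")
rest = comparison_groups(children, "root", "C", comparison="rest")
```

Node `C` sits under a three-way split, so with `comparison="siblings"` its leaves are compared to the pooled leaves of its two siblings, while `comparison="rest"` also pulls in the leaves under `B`.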

The test parameter defines how the two groups are compared:

  • test='permutation':

    A two-sided permutation test is performed by repeatedly shuffling the pooled rows (group1 + group2), applying the aggregate function to each shuffled group, and recomputing the split statistic with the metric function. The number of permutations executed is the minimum of the user-requested n_permutations and the number of distinct labelings, comb(n_left + n_right, n_left). The p-value is computed with standard +1 smoothing:

    p_\text{val} = \frac{ \#\{\,|\mathrm{perm\_stat}| \ge |\mathrm{observed}|\,\} + 1 }{ N_\text{perm} + 1 }

  • test='t-test':

    A two-sided t-test is performed between the two groups. Note that for small numbers of leaves the p-value of this t-test can be unreliable.

  • test=None:

    No statistical test is performed; only the partition statistic is computed.

P-values are calculated as long as both groups have at least min_group_leaves leaves; otherwise, no test is performed for that partition and the p-value is set to NaN.
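
The permutation procedure and the min_group_leaves rule can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not the library's code: it handles scalar values only, hard-codes the mean as aggregate and the mean difference as metric, and the function name `permutation_pval` is hypothetical.

```python
import math
import random
from statistics import fmean


def permutation_pval(group1, group2, n_permutations=100, min_group_leaves=10, seed=None):
    """Two-sided permutation p-value for a difference in group means.

    Returns (observed, pval); pval is NaN when either group is too small.
    """
    observed = fmean(group1) - fmean(group2)
    n1, n2 = len(group1), len(group2)
    if min(n1, n2) < min_group_leaves:
        return observed, math.nan  # too few leaves: no test is performed

    # Cap the number of permutations at the number of distinct labelings.
    n_perm = min(n_permutations, math.comb(n1 + n2, n1))
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stat = fmean(pooled[:n1]) - fmean(pooled[n1:])
        if abs(perm_stat) >= abs(observed):
            n_extreme += 1
    # +1 smoothing in both numerator and denominator.
    return observed, (n_extreme + 1) / (n_perm + 1)


# Made-up values for two well-separated groups of 12 leaves each.
high = [5.0 + 0.1 * i for i in range(12)]
low = [1.0 + 0.1 * i for i in range(12)]
stat, pval = permutation_pval(high, low, n_permutations=100, seed=0)

# With only 3 leaves per group the default threshold of 10 is not met.
small_stat, small_pval = permutation_pval(high[:3], low[:3])
```

Because of the +1 smoothing, the smallest p-value this sketch can return is 1 / (n_perm + 1), so with the default of 100 permutations it never drops below roughly 0.0099.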

Parameters:
  • tdata (TreeData) – TreeData object.

  • keys (str | Sequence[str]) – One or more obs.keys(), var_names, obsm.keys(), or obsp.keys() to test.

  • comparison (Literal['siblings', 'rest'] (default: 'siblings')) –

    Set of leaves to compare to:

    • 'siblings' : leaves descending from a given node are compared to leaves descending from its siblings.

    • 'rest' : leaves descending from a given node are compared to all other leaves of the tree.

  • test (Optional[Literal['permutation', 't-test']] (default: 'permutation')) – Type of test to perform to compare the two groups. 't-test' can only be used for scalar keys.

  • aggregate (Union[Callable[[ndarray], ndarray | float], Literal['mean', 'median', 'sum', 'min', 'max', 'var']] (default: 'mean')) – Function to reduce the data from all the leaves of a given group to a vector or scalar. Can be a known aggregator or a callable. Only used for test='permutation'.

  • metric (Union[Callable[[ndarray, ndarray], float], Literal['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'cosine', 'correlation', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'l1', 'l2', 'mahalanobis', 'minkowski', 'manhattan', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'], Literal['mean_difference']] (default: 'mean_difference')) – Metric used to compare the aggregated values of the two groups. Can be a known metric or a callable. Only used for test='permutation'.

  • metric_kwds (Mapping | None (default: None)) – Options for the metric.

  • equal_var (bool (default: True)) – Boolean indicating if the variance in the two groups should be assumed to be equal. Only used for test='t-test'.

  • n_permutations (int (default: 100)) – Upper bound on the number of permutations to run. The number actually run is min(n_permutations, comb(n_left + n_right, n_left)) per group.

  • random_state (int | None (default: None)) – Random seed to ensure reproducibility of permutation test.

  • min_group_leaves (int (default: 10)) – Minimum number of leaves required in each group to perform a statistical test. The t-test may be particularly unreliable with small sample sizes.

  • keys_added (str | Sequence[str] | None (default: None)) – Attribute keys of tdata.obst[tree].nodes where group statistics will be stored. If None, keys are used.

  • tree (str | Sequence[str] | None (default: None)) – The obst key or keys of the trees to use. If None, all trees are used.

  • copy (Literal[True, False] (default: True)) – If True, returns a DataFrame with group statistics.

Returns:

Returns None if copy=False; otherwise returns a DataFrame with columns:
  • 'tree' - Tree name.

  • 'key' - Observation key.

  • 'parent' - Parent of group1 node.

  • 'group1' - Node defining group1 leaf set.

  • 'group2' - Node(s) defining group2 leaf set or 'rest'.

  • 'value1' - Aggregate leaf value for group1.

  • 'value2' - Aggregate leaf value for group2.

  • 'pval' - p-value from the statistical test (if performed).

Sets the following fields:

  • tdata.obst[tree].nodes[f"{key_added}_value"] : float/ndarray
    • Aggregate value of leaves descended from that node.

  • tdata.obst[tree].edges[f"{key_added}_pval"] : float
    • P-value for the partition test at that edge (if performed).

  • tdata.obst[tree].edges[f"{key_added}_metric"] : float
    • Metric value for the partition at that edge (only if test='permutation').

Examples

Identify clades with the highest expression of "elt-2":

>>> import pycea as py
>>> tdata = py.datasets.packer19()
>>> py.tl.partition_test(tdata, keys=["elt-2"], test="t-test", comparison="rest")
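
With copy=True the call returns a DataFrame with the columns listed under Returns, which can then be filtered with ordinary pandas operations. The sketch below is one possible way to pick out enriched clades; the rows are made-up placeholders that only mimic the documented column layout, and the 0.05 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rows mimicking the documented columns; not real output.
result = pd.DataFrame(
    {
        "tree": ["tree"] * 3,
        "key": ["elt-2"] * 3,
        "parent": ["n0", "n0", "n1"],
        "group1": ["n1", "n2", "n3"],
        "group2": ["rest", "rest", "rest"],
        "value1": [2.5, 0.1, 3.0],
        "value2": [0.4, 1.6, 0.9],
        "pval": [0.001, 0.20, float("nan")],
    }
)

# Drop partitions where no test was performed (pval is NaN).
tested = result.dropna(subset=["pval"])

# Keep clades whose aggregate value is higher than the comparison group's.
enriched = tested[(tested["pval"] < 0.05) & (tested["value1"] > tested["value2"])]
top_clades = enriched.sort_values("value1", ascending=False)["group1"].tolist()
```

Since a p-value is computed for every tested partition, a multiple-testing correction is usually worth considering before interpreting individual clades.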