abaco.metrics module

abaco.metrics.ARI(data, interest_label='tissue', n_clusters=None)

Compute the Adjusted Rand Index (ARI) between true labels and KMeans clusters.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • interest_label (str, optional) – Column name for the label of interest. Defaults to 'tissue'.

  • n_clusters (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.

Returns:

Adjusted Rand Index score.

Return type:

float
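
The score itself can be illustrated without a clustering library. The following is a minimal sketch, not abaco's implementation: it computes the ARI from two label sequences with the standard pair-counting formula and assumes the cluster assignments (for example from KMeans) are already available. The function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts: samples sharing a (true label, cluster) pair.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Agreement expected by chance and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The score depends only on how samples are grouped, not on the label names, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is a perfect match.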

abaco.metrics.ASW(data, interest_label='tissue')

Compute the average silhouette width (ASW) for a given label.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • interest_label (str, optional) – Column name for the label of interest. Defaults to 'tissue'.

Returns:

Average silhouette score.

Return type:

float
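
As a rough illustration of what the silhouette width measures, the sketch below computes the mean silhouette coefficient with Euclidean distance in plain Python. It is an assumption-laden stand-in, not abaco's code: the distance metric, the function name `average_silhouette_width`, and the convention that a sample alone in its group scores 0 are all choices made for this example.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    if len(groups) < 2:
        raise ValueError("silhouette needs at least two distinct labels")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            # Convention: a sample alone in its group scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own group.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other group.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values near 1 indicate that samples sit close to their own group and far from the others; values below 0 indicate that the labels cut across the geometric structure.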

abaco.metrics.NMI(data, bio_label, n_cluster=None, average_method='arithmetic')

Compute the Normalized Mutual Information (NMI) between true labels and KMeans clusters.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • bio_label (str) – Column name for biological labels.

  • n_cluster (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.

  • average_method (str, optional) – Method for averaging NMI. Defaults to 'arithmetic'.

Returns:

Normalized Mutual Information score.

Return type:

float
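
The sketch below shows one way the score can be computed from two label sequences. It assumes that 'arithmetic' means normalizing the mutual information by the arithmetic mean of the two label entropies, which is how scikit-learn's `normalized_mutual_info_score` defines that option; whether abaco delegates to scikit-learn is not stated here. The helper names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """NMI normalized by the arithmetic mean of the two entropies."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum of p_ij * log(p_ij / (p_i * p_j)).
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    if denom == 0:
        # Both labelings consist of a single group.
        return 1.0
    return mi / denom
```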

abaco.metrics.PERMANOVA(data, sample_label, batch_label, bio_label)

Perform PERMANOVA (permutational multivariate analysis of variance) using Bray-Curtis and Aitchison distances.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • sample_label (str) – Column name for sample identifiers.

  • batch_label (str) – Column name for batch identifiers.

  • bio_label (str) – Column name for biological group identifiers.

Returns:

(res_bc, res_ait)

  • res_bc (pandas.Series) – PERMANOVA results for Bray-Curtis distance.

  • res_ait (pandas.Series) – PERMANOVA results for Aitchison distance.

Return type:

tuple
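
The two distances named above can be sketched for a single pair of samples as follows. This is an illustration of the standard definitions only; it does not perform the permutation test, and it is not abaco's implementation. The Aitchison distance is taken as the Euclidean distance between centered log-ratio (CLR) transformed vectors, which requires strictly positive values. How abaco handles zero counts (for example with a pseudocount) is not stated here, so this sketch simply assumes zeros were dealt with beforehand.

```python
from math import dist, log


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def clr(x):
    """Centered log-ratio transform; all entries must be > 0."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison(x, y):
    """Aitchison distance: Euclidean distance in CLR space."""
    return dist(clr(x), clr(y))
```

Because the CLR transform removes the overall scale, `aitchison([1, 2, 3], [2, 4, 6])` is zero up to rounding, whereas Bray-Curtis is sensitive to the difference in total counts.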

abaco.metrics.all_metrics(data, bio_label, batch_label, n_cluster=None)

Compute all batch correction and biological conservation metrics.

Batch correction:

kBET, ARI (batch), ASW (batch)

Biological conservation:

NMI, ARI (bio), ASW (bio)

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • bio_label (str) – Column name for biological labels.

  • batch_label (str) – Column name for batch identifiers.

  • n_cluster (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.

Returns:

(batch_metrics, bio_metrics)

  • batch_metrics (dict) – Dictionary with keys 'kBET', 'ARI', 'ASW'.

  • bio_metrics (dict) – Dictionary with keys 'NMI', 'ARI', 'ASW'.

Return type:

tuple of dict

abaco.metrics.cLISI_full_rank(data: DataFrame, bio_label: str, n_bio=None, perplexities: list = None)

Compute normalized cLISI for a range of perplexities.

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • bio_label (str) – Name of the column with the categorical bio-type labels.

  • n_bio (int, optional) – Number of biological labels. If not provided, inferred from data.

  • perplexities (list of int, optional) – List of neighborhood sizes (k) to use. If None, uses all possible k.

Returns:

DataFrame with columns ['perplexity', 'cLISI'] for each k.

Return type:

pandas.DataFrame

abaco.metrics.cLISI_norm(data: DataFrame, bio_label: str, k: int = None, n_bio=None)

Compute normalized cell-type LISI (cLISI) in [0, 1].

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • bio_label (str) – Name of the column with the categorical bio-type labels.

  • k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.

  • n_bio (int, optional) – Number of biological labels. If not provided, inferred from data.

Returns:

Mean normalized cLISI across all samples (range: 0 to 1).

Return type:

float

abaco.metrics.cLISI_raw(data: DataFrame, bio_label: str, k: int = None)

Compute non-normalized cell-type LISI (cLISI).

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • bio_label (str) – Name of the column with the categorical bio-type labels.

  • k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.

Returns:

Mean raw cLISI across all samples (range: 1 to number of unique bio-type labels).

Return type:

float
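
To make the stated range concrete, the sketch below computes a simplified LISI: for each sample it takes the k nearest neighbours by Euclidean distance and returns the inverse Simpson index of their labels, which lies between 1 (one label in the neighbourhood) and the number of distinct labels (all labels equally represented). This is a simplification and an assumption: the published LISI weights neighbours with a perplexity-based Gaussian kernel, and this section does not say which variant abaco uses. The function name `mean_lisi` and the rounding of the default k are hypothetical.

```python
from collections import Counter
from math import dist, sqrt


def mean_lisi(points, labels, k=None):
    """Mean inverse Simpson index of labels among each sample's k neighbours."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    if k is None:
        # Default described in the docs: square root of the sample count
        # (truncation to an integer is an assumption of this sketch).
        k = max(1, int(sqrt(n)))
    k = min(k, n - 1)

    scores = []
    for i in range(n):
        # Neighbours sorted by distance, excluding the sample itself.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        counts = Counter(labels[j] for j in others[:k])
        simpson = sum((c / k) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / n
```

The same computation applies to batch labels; only the interpretation changes (low values are desirable for biological labels, high values for batch labels).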

abaco.metrics.iLISI_full_rank(data: DataFrame, batch_label: str, n_batch=None, perplexities: list = None)

Compute normalized iLISI for a range of perplexities.

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • batch_label (str) – Name of the column with the categorical batch-type labels.

  • n_batch (int, optional) – Number of batch labels. If not provided, inferred from data.

  • perplexities (list of int, optional) – List of neighborhood sizes (k) to use. If None, uses all possible k.

Returns:

DataFrame with columns ['perplexity', 'iLISI'] for each k.

Return type:

pandas.DataFrame

abaco.metrics.iLISI_norm(data: DataFrame, batch_label: str, k: int = None, n_batch: int = None)

Compute normalized batch-mixing LISI (iLISI) in [0, 1].

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • batch_label (str) – Name of the column with the categorical batch-type labels.

  • k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.

  • n_batch (int, optional) – Number of batch labels. If not provided, inferred from data.

Returns:

Mean normalized iLISI across all samples (range: 0 to 1).

Return type:

float
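
A raw LISI value lies between 1 and the number of labels, so one common way to obtain a score in [0, 1] is a linear rescaling. This section does not state which formula abaco applies, so the sketch below is an assumption, and `rescale_lisi` is a hypothetical helper.

```python
def rescale_lisi(raw_lisi, n_labels):
    """Linearly map a raw LISI value from [1, n_labels] onto [0, 1]."""
    if n_labels < 2:
        raise ValueError("need at least two labels to rescale")
    return (raw_lisi - 1.0) / (n_labels - 1.0)
```

Under this rescaling, 0 corresponds to neighbourhoods containing a single batch and 1 to neighbourhoods in which all batches are equally represented.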

abaco.metrics.iLISI_raw(data: DataFrame, batch_label: str, k: int = None)

Compute non-normalized batch-mixing LISI (iLISI).

Parameters:
  • data (pandas.DataFrame) – DataFrame containing normalized numeric columns for the taxonomic groups, plus categorical metadata columns.

  • batch_label (str) – Name of the column with the categorical batch-type labels.

  • k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.

Returns:

Mean raw iLISI across all samples (range: 1 to number of unique batch-type labels).

Return type:

float

abaco.metrics.kBET(data, batch_label='batch')

Compute the k-nearest neighbor batch effect test (kBET).

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • batch_label (str, optional) – Column name for batch identifiers. Defaults to 'batch'.

Returns:

Proportion of samples with p-value > 0.05 (null hypothesis not rejected).

Return type:

float
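
The returned value is an acceptance rate over per-sample tests. The sketch below shows only that final aggregation step and assumes the per-sample p-values have already been produced by a neighbourhood-level test; `acceptance_rate` is a hypothetical helper, not part of abaco.

```python
def acceptance_rate(p_values, alpha=0.05):
    """Fraction of samples whose test does not reject the null at level alpha."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(p > alpha for p in p_values) / len(p_values)
```

A value close to 1 means that most local neighbourhoods have a batch composition consistent with the global one, which is the behaviour expected after successful batch correction.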

abaco.metrics.pairwise_distance(data, sample_label, batch_label, bio_label)

Compute mean pairwise Euclidean distances within and between biological groups.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • sample_label (str) – Column name for sample identifiers.

  • batch_label (str) – Column name for batch identifiers.

  • bio_label (str) – Column name for biological group identifiers.

Returns:

(mean_all_dists, mean_within_dists, mean_between_dists)

Return type:

tuple of float
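
The three means can be illustrated with a small standalone sketch. It assumes "within" refers to pairs of samples sharing a biological label and "between" to pairs with different labels, as the description suggests; the function name `mean_pairwise_distances` is hypothetical and the code is not abaco's implementation.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distances(points, bio_labels):
    """Return (mean_all, mean_within, mean_between) Euclidean distances."""
    all_d, within, between = [], [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        all_d.append(d)
        # A pair is "within" when both samples share a biological label.
        (within if bio_labels[i] == bio_labels[j] else between).append(d)

    def mean(values):
        # NaN signals that no pair of that kind exists.
        return sum(values) / len(values) if values else float("nan")

    return mean(all_d), mean(within), mean(between)
```

A between-group mean that is clearly larger than the within-group mean suggests that biological groups remain separated in the data.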

abaco.metrics.pairwise_distance_multi_run(data, sample_label, batch_label, bio_label)

Compute all pairwise Euclidean distances and return as a long-form DataFrame.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • sample_label (str) – Column name for sample identifiers.

  • batch_label (str) – Column name for batch identifiers.

  • bio_label (str) – Column name for biological group identifiers.

Returns:

Long-form DataFrame with columns ['pointA', 'pointB', 'distance', 'pointA_bio', 'pointB_bio'].

Return type:

pandas.DataFrame

abaco.metrics.pairwise_distance_std(data, sample_label, batch_label, bio_label)

Compute standard deviation of pairwise Euclidean distances within and between biological groups.

Parameters:
  • data (pandas.DataFrame) – Input data containing OTU counts and metadata.

  • sample_label (str) – Column name for sample identifiers.

  • batch_label (str) – Column name for batch identifiers.

  • bio_label (str) – Column name for biological group identifiers.

Returns:

(std_all_dists, std_within_dists, std_between_dists)

Return type:

tuple of float