abaco.metrics module
- abaco.metrics.ARI(data, interest_label='tissue', n_clusters=None)
Compute the Adjusted Rand Index (ARI) between true labels and KMeans clusters.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
interest_label (str, optional) – Column name for the label of interest, by default ‘tissue’.
n_clusters (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.
- Returns:
Adjusted Rand Index score.
- Return type:
float
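As a rough illustration of what the score measures, the sketch below computes the Adjusted Rand Index from two label sequences with the standard pair-counting formula. It is not abaco's implementation: the function name `adjusted_rand_index` and the example labels are hypothetical, and the KMeans clustering step that `ARI` performs on the OTU counts is left out.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two equal-length label sequences."""
    n = len(labels_true)
    # Count sample pairs that share a cell, a true label, or a predicted label.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treated here as trivial agreement
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every sample in its own cluster
    return (same_both - expected) / (maximum - expected)


# Hypothetical labels: the cluster ids are arbitrary, only the grouping matters.
tissue = ["gut", "gut", "gut", "skin", "skin", "skin"]
ari_perfect = adjusted_rand_index(tissue, [1, 1, 1, 0, 0, 0])
ari_partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

A clustering that reproduces the labels exactly scores 1, while splitting one true group in two lowers the score without driving it to 0.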
- abaco.metrics.ASW(data, interest_label='tissue')
Compute the average silhouette width (ASW) for a given label.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
interest_label (str, optional) – Column name for the label of interest, by default ‘tissue’.
- Returns:
Average silhouette score.
- Return type:
float
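The silhouette width of a sample compares its mean distance to samples with the same label (a) against its mean distance to the nearest other label group (b), as (b - a) / max(a, b). The following is a minimal sketch of that definition using Euclidean distance; the function name, the toy points, and the choice to score singleton groups as 0 are assumptions, and the distance metric abaco uses is not stated here.

```python
from math import dist
from statistics import mean


def average_silhouette_width(points, labels):
    """Mean silhouette width over all samples, using Euclidean distance."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    if len(groups) < 2:
        raise ValueError("need at least two distinct labels")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton groups
            continue
        # a: cohesion within the sample's own group.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: separation from the closest other group.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in groups.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical 2-D embedding of four samples.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
asw_separated = average_silhouette_width(points, ["gut", "gut", "skin", "skin"])
asw_mixed = average_silhouette_width(points, ["gut", "skin", "gut", "skin"])
```

Labels that follow the geometry give a score close to 1; labels that cut across it give a negative score.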
- abaco.metrics.NMI(data, bio_label, n_cluster=None, average_method='arithmetic')
Compute the Normalized Mutual Information (NMI) between true labels and KMeans clusters.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
bio_label (str) – Column name for biological labels.
n_cluster (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.
average_method (str, optional) – Method for averaging NMI, by default ‘arithmetic’.
- Returns:
Normalized Mutual Information score.
- Return type:
float
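To make the 'arithmetic' averaging concrete, here is a small sketch that divides the mutual information of two labelings by the arithmetic mean of their entropies. The helper names are hypothetical and the clustering step is omitted; it only illustrates the normalization, not abaco's code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # With a single label on both sides the entropies vanish; score that as 1.
    return mi / denom if denom > 0 else 1.0


nmi_perfect = normalized_mutual_info(["a", "a", "b", "b"], [1, 1, 0, 0])
nmi_independent = normalized_mutual_info(["a", "a", "b", "b"], [0, 1, 0, 1])
```

Identical groupings score 1 regardless of how the cluster ids are named, and statistically independent groupings score 0.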
- abaco.metrics.PERMANOVA(data, sample_label, batch_label, bio_label)
Perform PERMANOVA (permutational multivariate analysis of variance) using Bray-Curtis and Aitchison distances.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
sample_label (str) – Column name for sample identifiers.
batch_label (str) – Column name for batch identifiers.
bio_label (str) – Column name for biological group identifiers.
- Returns:
(res_bc, res_ait)
res_bc (pandas.Series) – PERMANOVA results for Bray-Curtis distance.
res_ait (pandas.Series) – PERMANOVA results for Aitchison distance.
- Return type:
tuple
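The two distances named above can be sketched in a few lines. Bray-Curtis works on the raw counts, while the Aitchison distance is the Euclidean distance between centred log-ratio (CLR) transformed samples. The function names and the `pseudocount` parameter are assumptions made for illustration; how abaco handles zero counts is not stated here, and the permutation test itself is not shown.

```python
from math import dist, log


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform; the pseudocount (an assumption) avoids log(0)."""
    logs = [log(c + pseudocount) for c in counts]
    centre = sum(logs) / len(logs)
    return [x - centre for x in logs]


def aitchison(u, v, pseudocount=1.0):
    """Aitchison distance: Euclidean distance in CLR space."""
    return dist(clr(u, pseudocount), clr(v, pseudocount))


# Hypothetical OTU count vectors for two samples.
bc = bray_curtis([10, 0, 5], [5, 5, 5])
# Doubling every count changes sequencing depth but not composition.
ait_scaled = aitchison([1, 2, 4], [2, 4, 8], pseudocount=0.0)
```

The second example shows why the Aitchison distance suits compositional data: without a pseudocount, a sample and a uniformly rescaled copy of it are at distance 0.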
- abaco.metrics.all_metrics(data, bio_label, batch_label, n_cluster=None)
Compute all batch correction and biological conservation metrics.
- Batch correction:
kBET, ARI (batch), ASW (batch)
- Biological conservation:
NMI, ARI (bio), ASW (bio)
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
bio_label (str) – Column name for biological labels.
batch_label (str) – Column name for batch identifiers.
n_cluster (int, optional) – Number of clusters for KMeans. If None, uses the number of unique labels.
- Returns:
(batch_metrics, bio_metrics)
batch_metrics (dict) – Keys ‘kBET’, ‘ARI’, ‘ASW’.
bio_metrics (dict) – Keys ‘NMI’, ‘ARI’, ‘ASW’.
- Return type:
tuple
- abaco.metrics.cLISI_full_rank(data: DataFrame, bio_label: str, n_bio=None, perplexities: list = None)
Compute normalized cLISI for a range of perplexities.
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
bio_label (str) – Name of the column with the categorical bio-type labels.
n_bio (int, optional) – Number of biological labels. If not provided, inferred from data.
perplexities (list of int, optional) – List of neighborhood sizes (k) to use. If None, uses all possible k.
- Returns:
DataFrame with columns [‘perplexity’, ‘cLISI’] for each k.
- Return type:
pandas.DataFrame
- abaco.metrics.cLISI_norm(data: DataFrame, bio_label: str, k: int = None, n_bio=None)
Compute normalized cell-type LISI (cLISI) in [0, 1].
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
bio_label (str) – Name of the column with the categorical bio-type labels.
k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.
n_bio (int, optional) – Number of biological labels. If not provided, inferred from data.
- Returns:
Mean normalized cLISI across all samples (range: 0 to 1).
- Return type:
float
- abaco.metrics.cLISI_raw(data: DataFrame, bio_label: str, k: int = None)
Compute non-normalized cell-type LISI (cLISI).
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
bio_label (str) – Name of the column with the categorical bio-type labels.
k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.
- Returns:
Mean raw cLISI across all samples (range: 1 to number of unique bio-type labels).
- Return type:
float
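The stated range, 1 up to the number of unique labels, follows from LISI being an inverse Simpson index over the labels found in each sample's neighborhood. The sketch below is a simplified, unweighted variant: it takes the k nearest neighbors by Euclidean distance and counts each one equally, whereas LISI as usually defined weights neighbors with a perplexity-based kernel. The function name, the integer square-root default for k, and the toy data are assumptions, not abaco's implementation.

```python
from collections import Counter
from math import dist, isqrt


def mean_knn_inverse_simpson(points, labels, k=None):
    """Mean inverse Simpson index of labels among each sample's k nearest neighbors."""
    n = len(points)
    if k is None:
        k = max(1, isqrt(n))  # assumed rounding of "square root of the number of samples"
    scores = []
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        m = len(neighbours)
        counts = Counter(labels[j] for j in neighbours)
        # Inverse Simpson: 1 for a pure neighborhood, up to the number of labels.
        scores.append(1.0 / sum((c / m) ** 2 for c in counts.values()))
    return sum(scores) / n


# Two well-separated groups of three hypothetical samples each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
lisi_pure = mean_knn_inverse_simpson(points, ["a", "a", "a", "b", "b", "b"], k=2)
lisi_mixed = mean_knn_inverse_simpson(points, ["a", "b", "a", "b", "a", "b"], k=2)
```

When labels follow the geometry every neighborhood contains a single label and the score is 1; interleaved labels push it toward the number of labels.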
- abaco.metrics.iLISI_full_rank(data: DataFrame, batch_label: str, n_batch=None, perplexities: list = None)
Compute normalized iLISI for a range of perplexities.
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
batch_label (str) – Name of the column with the categorical batch-type labels.
n_batch (int, optional) – Number of batch labels. If not provided, inferred from data.
perplexities (list of int, optional) – List of neighborhood sizes (k) to use. If None, uses all possible k.
- Returns:
DataFrame with columns [‘perplexity’, ‘iLISI’] for each k.
- Return type:
pandas.DataFrame
- abaco.metrics.iLISI_norm(data: DataFrame, batch_label: str, k: int = None, n_batch: int = None)
Compute normalized batch-mixing LISI (iLISI) in [0, 1].
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
batch_label (str) – Name of the column with the categorical batch-type labels.
k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.
n_batch (int, optional) – Number of batch labels. If not provided, inferred from data.
- Returns:
Mean normalized iLISI across all samples (range: 0 to 1).
- Return type:
float
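One common way to map a raw LISI value from its native range (1 to the number of labels) onto [0, 1] is a linear rescaling. The sketch below shows that rescaling only as an assumption: the exact formula abaco applies, and whether the cLISI variant is additionally flipped so that higher means better, is not stated here. The function name `rescale_lisi` is hypothetical.

```python
def rescale_lisi(raw_lisi, n_labels):
    """Linearly map a raw LISI in [1, n_labels] onto [0, 1] (assumed convention)."""
    if n_labels < 2:
        raise ValueError("rescaling needs at least two labels")
    return (raw_lisi - 1.0) / (n_labels - 1.0)


# With three batches, a raw iLISI of 2.0 sits halfway between no mixing and full mixing.
no_mixing = rescale_lisi(1.0, 3)
half_mixing = rescale_lisi(2.0, 3)
full_mixing = rescale_lisi(3.0, 3)
```

Under this convention 0 means every neighborhood contains a single batch and 1 means all batches are equally represented.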
- abaco.metrics.iLISI_raw(data: DataFrame, batch_label: str, k: int = None)
Compute non-normalized batch-mixing LISI (iLISI).
- Parameters:
data (pandas.DataFrame) – DataFrame with normalized numeric columns for the taxonomic groups plus categorical label columns.
batch_label (str) – Name of the column with the categorical batch-type labels.
k (int, optional) – Neighborhood size. By default, uses the square root of the number of samples.
- Returns:
Mean raw iLISI across all samples (range: 1 to number of unique batch-type labels).
- Return type:
float
- abaco.metrics.kBET(data, batch_label='batch')
Compute the k-nearest neighbor batch effect test (kBET).
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
batch_label (str, optional) – Column name for batch identifiers, by default ‘batch’.
- Returns:
Proportion of samples with p-value > 0.05 (null hypothesis not rejected).
- Return type:
float
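The returned proportion can be illustrated with a stripped-down version of the test: for each sample, compare the batch composition of its k nearest neighbors with the global batch frequencies using a chi-square statistic, and count the samples whose p-value exceeds 0.05. This sketch is restricted to exactly two batches so that the p-value (1 degree of freedom) can be written with `math.erfc`; the function name, the explicit `k` argument, and the toy data are assumptions, and abaco's neighborhood size and test details may differ.

```python
from collections import Counter
from math import dist, erfc, sqrt


def kbet_acceptance_rate(points, batches, k, alpha=0.05):
    """Share of samples whose neighborhood batch mix is not rejected at level alpha."""
    n = len(points)
    levels = sorted(set(batches))
    if len(levels) != 2:
        raise ValueError("this sketch handles exactly two batches")
    global_counts = Counter(batches)
    accepted = 0
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        observed = Counter(batches[j] for j in neighbours)
        stat = 0.0
        for level in levels:
            expected = len(neighbours) * global_counts[level] / n
            stat += (observed[level] - expected) ** 2 / expected
        # Survival function of a chi-square distribution with 1 degree of freedom.
        p_value = erfc(sqrt(stat / 2.0))
        if p_value > alpha:
            accepted += 1
    return accepted / n


# Batches interleaved along a line: every neighborhood is close to a 50/50 mix.
points_mixed = [(float(x), 0.0) for x in range(20)]
batches_mixed = ["A" if x % 2 == 0 else "B" for x in range(20)]
kbet_mixed = kbet_acceptance_rate(points_mixed, batches_mixed, k=4)

# Batches far apart: every neighborhood contains a single batch.
points_separated = [(float(x), 0.0) for x in range(10)] + [(100.0 + x, 0.0) for x in range(10)]
batches_separated = ["A"] * 10 + ["B"] * 10
kbet_separated = kbet_acceptance_rate(points_separated, batches_separated, k=4)
```

A value near 1 indicates well-mixed batches, and a value near 0 indicates that local neighborhoods are dominated by a single batch.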
- abaco.metrics.pairwise_distance(data, sample_label, batch_label, bio_label)
Compute mean pairwise Euclidean distances within and between biological groups.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
sample_label (str) – Column name for sample identifiers.
batch_label (str) – Column name for batch identifiers.
bio_label (str) – Column name for biological group identifiers.
- Returns:
(mean_all_dists, mean_within_dists, mean_between_dists)
- Return type:
tuple
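The three returned values can be pictured as means over different subsets of sample pairs. The sketch below splits all unordered pairs by whether the two samples share a biological label; the function name and the toy data are hypothetical, and it assumes each pair is counted once, which may not match abaco's bookkeeping.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_pairwise_distances(points, bio_labels):
    """Mean Euclidean distance over all, within-group, and between-group pairs."""
    all_d, within, between = [], [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        all_d.append(d)
        if bio_labels[i] == bio_labels[j]:
            within.append(d)
        else:
            between.append(d)
    return mean(all_d), mean(within), mean(between)


# Hypothetical feature vectors for four samples in two biological groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
bio = ["gut", "gut", "skin", "skin"]
mean_all, mean_within, mean_between = mean_pairwise_distances(points, bio)
```

When biological structure is preserved, the within-group mean should sit clearly below the between-group mean, as it does here.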
- abaco.metrics.pairwise_distance_multi_run(data, sample_label, batch_label, bio_label)
Compute all pairwise Euclidean distances and return as a long-form DataFrame.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
sample_label (str) – Column name for sample identifiers.
batch_label (str) – Column name for batch identifiers.
bio_label (str) – Column name for biological group identifiers.
- Returns:
Long-form DataFrame with columns [‘pointA’, ‘pointB’, ‘distance’, ‘pointA_bio’, ‘pointB_bio’].
- Return type:
pandas.DataFrame
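To show the long-form layout, the following sketch builds one record per unordered pair of samples with the same five column names. It returns a plain list of dicts rather than a DataFrame, and the input structure (a dict from sample id to feature vector and biological label) is an assumption made to keep the example self-contained; whether abaco also emits both orderings of each pair or self-pairs is not stated here.

```python
from itertools import combinations
from math import dist


def long_form_distances(samples):
    """One record per unordered sample pair.

    `samples` maps a sample id to a (feature_vector, bio_label) tuple.
    """
    rows = []
    for a, b in combinations(samples, 2):
        (vec_a, bio_a), (vec_b, bio_b) = samples[a], samples[b]
        rows.append(
            {
                "pointA": a,
                "pointB": b,
                "distance": dist(vec_a, vec_b),
                "pointA_bio": bio_a,
                "pointB_bio": bio_b,
            }
        )
    return rows


# Hypothetical samples; ids and labels are made up for illustration.
samples = {
    "s1": ((0.0, 0.0), "gut"),
    "s2": ((3.0, 4.0), "gut"),
    "s3": ((6.0, 8.0), "skin"),
}
rows = long_form_distances(samples)
```

A list of records in this shape can be passed directly to `pandas.DataFrame(rows)` to obtain a table with one row per pair.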
- abaco.metrics.pairwise_distance_std(data, sample_label, batch_label, bio_label)
Compute standard deviation of pairwise Euclidean distances within and between biological groups.
- Parameters:
data (pandas.DataFrame) – Input data containing OTU counts and metadata.
sample_label (str) – Column name for sample identifiers.
batch_label (str) – Column name for batch identifiers.
bio_label (str) – Column name for biological group identifiers.
- Returns:
(std_all_dists, std_within_dists, std_between_dists)
- Return type:
tuple