Hypothesis Testing

multi_tax_test

Per-taxon differential abundance test between two groups. Applies a statistical test to each taxon independently, then corrects for multiple comparisons:

from pyloseq import multi_tax_test

results = multi_tax_test(ps, grouping_var="SampleType", test="t", method="BH")
print(results.head(10))

The function requires sample_data with a column that has exactly two distinct non-NaN values. Samples with NaN in the grouping column are dropped silently.
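The grouping requirement can be checked up front before calling the function. The sketch below is plain Python rather than pyloseq code: `check_two_groups` is a hypothetical helper, and the minimum of two samples per group is taken from the Raises table in the API reference for this function.

```python
import math
from collections import Counter


def check_two_groups(values):
    """Hypothetical helper: count samples per group after dropping missing labels."""
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(kept)
    if len(counts) != 2:
        raise ValueError(f"expected exactly 2 groups, found {len(counts)}")
    if min(counts.values()) < 2:
        raise ValueError("each group needs at least 2 samples")
    return dict(counts)


# One label per sample; the NaN entry is dropped before counting.
labels = ["Soil", "Soil", "Feces", float("nan"), "Feces"]
print(check_two_groups(labels))
```

Running a check like this first makes the silent NaN drop visible, since the returned counts show how many samples actually enter each group.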

Test statistics

`test`	Method
`"t"`	Welch's t-test (`equal_var=False`)
`"wilcoxon"`	Wilcoxon rank-sum test

Welch's t-test is appropriate when group variances may differ and both groups have at least a few samples. The Wilcoxon rank-sum test is a non-parametric alternative; use it when count distributions are highly skewed or sample sizes are very small.
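To make the Welch statistic concrete, here is a minimal standard-library sketch of the t statistic and the Welch-Satterthwaite degrees of freedom for one taxon. It illustrates the textbook formula only; `welch_t` is a hypothetical name, and how pyloseq computes the statistic internally is not shown in this page.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    se2_a = variance(a) / len(a)  # squared standard error (sample variance, n - 1)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Abundances of a single taxon in two groups (made-up numbers).
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group1, group2)
print(t_stat, dof)
```

Because the two variances are kept separate, the degrees of freedom fall below the pooled value of `n1 + n2 - 2` whenever the group variances differ.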

Multiple-testing correction

`method`	Type	Description
`"BH"`	FDR	Benjamini-Hochberg (default). Controls false discovery rate.
`"BY"`	FDR	Benjamini-Yekutieli. More conservative than BH; valid under arbitrary correlation.
`"holm"`	FWER	Holm step-down. Controls family-wise error rate without assuming independence.
`"bonferroni"`	FWER	Bonferroni. Most conservative; appropriate when any false positive is unacceptable.
`"westfall_young"`	FWER	Permutation-based step-down (Westfall & Young 1993). Equivalent to R's `multtest::mt.minP`. Respects the correlation structure of the test statistics.
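The difference between FDR and FWER control is easiest to see on a small set of p-values. The following sketch implements the standard Benjamini-Hochberg and Holm adjustments in plain Python. `bh_adjust` and `holm_adjust` are hypothetical names for illustration and are not part of pyloseq.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p-value up; the multiplier shrinks at each step.
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvals[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(raw))
print(holm_adjust(raw))
```

On this input all four BH-adjusted values stay below 0.05, while Holm pushes two of them above it, which reflects the stricter family-wise guarantee.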

Return value

A DataFrame with one row per taxon, sorted by ascending adjp:

Column	Description
`statistic`	Per-taxon test statistic
`rawp`	Uncorrected p-value
`adjp`	Corrected p-value
`mean_<group1>`	Mean abundance in group 1
`mean_<group2>`	Mean abundance in group 2

Group names in the mean columns come from the sorted unique values of grouping_var.

Examples

# Default: Welch t-test, BH correction
results = multi_tax_test(ps, "SampleType")
significant = results[results["adjp"] < 0.05]

# Wilcoxon with Holm FWER control
results = multi_tax_test(ps, "SampleType", test="wilcoxon", method="holm")

# Permutation FWER — use more permutations for stable estimates
results = multi_tax_test(
    ps, "SampleType",
    method="westfall_young",
    n_permutations=5000,
    rng_seed=0,
)

Note

westfall_young re-runs the per-taxon test on every one of the n_permutations label permutations, so its cost scales as O(n_taxa × n_permutations). For datasets with tens of thousands of taxa, use a smaller n_permutations (e.g. 500–1000) for exploration and increase it only for the final analysis.
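The cost structure of a permutation-based FWER adjustment can be sketched in a few lines. The example below is a deliberately simplified single-step max-statistic variant, not the step-down minP procedure named above, and it uses the absolute difference in group means instead of a t or Wilcoxon statistic. `max_stat_adjusted_p` and its arguments are hypothetical.

```python
import random
from statistics import mean


def max_stat_adjusted_p(table, labels, n_permutations=200, seed=0):
    """Single-step max-statistic adjusted p-values (simplified illustration).

    table: one list of abundances per taxon (one value per sample).
    labels: one group label per sample, with exactly two distinct values.
    """
    first, second = sorted(set(labels))

    def stats(current_labels):
        out = []
        for row in table:  # one statistic per taxon
            g1 = [x for x, lab in zip(row, current_labels) if lab == first]
            g2 = [x for x, lab in zip(row, current_labels) if lab == second]
            out.append(abs(mean(g1) - mean(g2)))
        return out

    observed = stats(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = [0] * len(table)
    for _ in range(n_permutations):  # n_taxa statistics per permutation
        rng.shuffle(shuffled)
        max_null = max(stats(shuffled))
        for i, obs in enumerate(observed):
            if max_null >= obs:
                exceed[i] += 1
    # Add-one correction keeps p-values strictly positive.
    return [(count + 1) / (n_permutations + 1) for count in exceed]


table = [
    [1, 2, 1, 2, 10, 11, 10, 12],  # taxon with a clear group difference
    [5, 5, 6, 5, 5, 6, 5, 6],      # taxon with almost none
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
adjusted = max_stat_adjusted_p(table, labels)
print(adjusted)
```

Comparing every taxon against the maximum statistic over all taxa is what lets the adjustment reflect the correlation between taxa: the whole table is permuted together, so the joint structure is preserved in each null draw.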

pyloseq.multi_tax_test

multi_tax_test(
    ps: Phyloseq,
    grouping_var: str,
    test: Literal["t", "wilcoxon"] = "t",
    method: Literal[
        "BH", "BY", "holm", "bonferroni", "westfall_young"
    ] = "BH",
    alternative: Literal[
        "two-sided", "greater", "less"
    ] = "two-sided",
    n_permutations: int = 1000,
    rng_seed: int | None = 42,
) -> pd.DataFrame

Test each taxon for differential abundance between two groups.

R reference: phyloseq::mt()

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	`Phyloseq` object (must have `sample_data`).	required
`grouping_var`	`str`	Column in `sample_data` defining the two groups to compare. Samples with `NaN` in this column are silently dropped.	required
`test`	`Literal['t', 'wilcoxon']`	Per-taxon test statistic. `"t"` uses Welch's t-test (`equal_var=False`); `"wilcoxon"` uses the Wilcoxon rank-sum test.	`'t'`
`method`	`Literal['BH', 'BY', 'holm', 'bonferroni', 'westfall_young']`	Multiple-testing correction method. `"BH"`: Benjamini-Hochberg FDR (default); `"BY"`: Benjamini-Yekutieli FDR; `"holm"`: Holm step-down FWER; `"bonferroni"`: Bonferroni FWER; `"westfall_young"`: permutation-based step-down FWER (R `multtest::mt.minP`).	`'BH'`
`alternative`	`Literal['two-sided', 'greater', 'less']`	Direction of the alternative hypothesis.	`'two-sided'`
`n_permutations`	`int`	Number of label permutations for `method="westfall_young"`.	`1000`
`rng_seed`	`int \| None`	Seed for the permutation RNG (`"westfall_young"` only). Pass `None` for non-reproducible draws.	`42`

Returns:

Type	Description
`DataFrame`	One row per taxon, sorted by ascending `adjp`. Columns: `statistic` (per-taxon test statistic), `rawp` (uncorrected p-value), `adjp` (p-value corrected using `method`), `mean_<g1>` (mean abundance in group 1), `mean_<g2>` (mean abundance in group 2).

Raises:

Type	Description
`pyloseqValidationError`	If `sample_data` is missing, `grouping_var` is not found, the variable does not have exactly 2 distinct non-NaN levels, or either group has fewer than 2 samples.

permanova

PERMANOVA (Permutational Multivariate Analysis of Variance) tests whether the centroids of two or more groups differ in multivariate space. Thin wrapper around skbio.stats.distance.permanova that extracts group labels from sample_data automatically:

from pyloseq import distance, permanova

dm = distance(ps, "bray")
result = permanova(dm, ps, grouping_var="SampleType", permutations=999)
print(result["p-value"])
print(result["test statistic"])  # pseudo-F
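For intuition about the pseudo-F value, the statistic can be computed directly from squared pairwise distances. The sketch below follows the standard one-way PERMANOVA partition of sums of squares in plain Python. `pseudo_f` is a hypothetical helper and not the scikit-bio implementation, and the p-value step (recomputing the statistic under shuffled labels) is left out.

```python
from itertools import combinations


def pseudo_f(dist, groups):
    """Pseudo-F from a square symmetric distance matrix and one label per sample."""
    n = len(groups)
    levels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same quantity inside each group.
    ss_within = 0.0
    for level in levels:
        members = [i for i, lab in enumerate(groups) if lab == level]
        ss_within += (
            sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
        )
    ss_among = ss_total - ss_within
    a = len(levels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


# Four samples on a line: two tight clusters far apart.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]
print(pseudo_f(dist, ["A", "A", "B", "B"]))  # 200.0
```

A large pseudo-F means the between-group sum of squares dominates the within-group one. The permutation test then asks how often a random relabelling produces a value at least as large.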

The distance matrix and the Phyloseq object do not need to have identical sample sets. Only samples present in distance_matrix.ids are used; the rest of ps is ignored. This means you can compute a distance matrix on a filtered subset and still pass the original ps:

ps_sub = subset_samples(ps, ps.sample_data.to_frame()["Env"] == "Soil")
dm_sub = distance(ps_sub, "bray")
result = permanova(dm_sub, ps, "Treatment")  # original ps; only samples in dm_sub.ids are used
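The alignment described above amounts to looking up one group label per distance-matrix ID, in ID order. A minimal sketch with plain dictionaries follows. `align_groups` is a hypothetical helper, and raising on IDs that are missing from the metadata is an assumption here, not documented pyloseq behaviour.

```python
def align_groups(dm_ids, sample_groups):
    """Hypothetical helper: one group label per distance-matrix ID, in ID order."""
    missing = [s for s in dm_ids if s not in sample_groups]
    if missing:
        # Assumed behaviour: fail loudly rather than guess a label.
        raise KeyError(f"IDs absent from sample metadata: {missing}")
    return [sample_groups[s] for s in dm_ids]


# Metadata covers five samples; the distance matrix only covers four of them.
sample_groups = {
    "S1": "Control",
    "S2": "Treated",
    "S3": "Control",
    "S4": "Treated",
    "S5": "Control",
}
dm_ids = ["S4", "S1", "S3", "S2"]
print(align_groups(dm_ids, sample_groups))
```

Sample `S5` is simply never looked up, which is why extra samples in the metadata are harmless.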

The return value is a pd.Series from scikit-bio with keys method name, test statistic name, sample size, number of groups, test statistic, p-value, and number of permutations.

R reference: vegan::adonis2(dist ~ group, data = sample_data(physeq))

pyloseq.permanova

permanova(
    distance_matrix: DistanceMatrix,
    ps: Phyloseq,
    grouping_var: str,
    permutations: int = 999,
) -> pd.Series

PERMANOVA test on a precomputed distance matrix.

Thin wrapper around skbio.stats.distance.permanova that extracts the grouping variable from ps.sample_data and aligns it to the distance matrix IDs automatically.

R reference: vegan::adonis2(dist ~ group, data=sample_data(physeq))

Parameters:

Name	Type	Description	Default
`distance_matrix`	`DistanceMatrix`	Pairwise distance matrix (e.g. from `pyloseq.distance` or `pyloseq.gunifrac`).	required
`ps`	`Phyloseq`	`Phyloseq` object whose `sample_data` contains `grouping_var`. Only the samples present in `distance_matrix.ids` are used; the rest of `ps` is ignored, so a filtered/subsetted distance matrix works correctly alongside the original `ps`.	required
`grouping_var`	`str`	Column name in `sample_data` defining the groups to compare.	required
`permutations`	`int`	Number of permutations for the pseudo-F null distribution.	`999`

Returns:

Type	Description
`Series`	scikit-bio PERMANOVA result with keys `method name`, `test statistic name`, `sample size`, `number of groups`, `test statistic`, `p-value`, `number of permutations`.

Raises:

Type	Description
`pyloseqValidationError`	If `sample_data` is missing or `grouping_var` is not found.

betadisper

PERMDISP test for homogeneity of multivariate dispersions. Tests whether groups have similar spread around their centroids — a complement to PERMANOVA that checks the equal-dispersion assumption. Thin wrapper around skbio.stats.distance.permdisp:

from pyloseq import betadisper, distance

dm = distance(ps, "bray")
result = betadisper(dm, ps, grouping_var="SampleType", permutations=999)
print(result["p-value"])
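The idea behind the dispersion test is an ordinary one-way ANOVA on each sample's distance to its own group centroid. The sketch below shows that idea on explicit coordinates. It assumes coordinates are already available (for example from an ordination), which the wrapped function does not require since it starts from the distance matrix. `dispersion_f` is a hypothetical helper and omits the permutation step.

```python
from math import dist
from statistics import mean


def dispersion_f(coords, groups):
    """ANOVA F statistic on distances from each sample to its group centroid."""
    levels = sorted(set(groups))
    to_centroid = {}
    for level in levels:
        points = [c for c, lab in zip(coords, groups) if lab == level]
        centroid = [mean(axis) for axis in zip(*points)]
        to_centroid[level] = [dist(p, centroid) for p in points]

    all_d = [d for level in levels for d in to_centroid[level]]
    grand_mean = mean(all_d)
    ss_between = sum(
        len(to_centroid[level]) * (mean(to_centroid[level]) - grand_mean) ** 2
        for level in levels
    )
    ss_within = sum(
        (d - mean(to_centroid[level])) ** 2
        for level in levels
        for d in to_centroid[level]
    )
    df_between = len(levels) - 1
    df_within = len(all_d) - len(levels)
    return (ss_between / df_between) / (ss_within / df_within)


# Group B has the same shape as group A but ten times the spread.
coords = [(0, 0), (1, 0), (3, 0), (0, 0), (10, 0), (30, 0)]
groups = ["A", "A", "A", "B", "B", "B"]
print(dispersion_f(coords, groups))
```

The statistic grows when one group's samples sit systematically farther from their centroid than another's, regardless of where the centroids themselves are.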

Use betadisper together with permanova to distinguish centroid differences (PERMANOVA) from dispersion differences (betadisper):

from pyloseq import betadisper, distance, permanova

dm = distance(ps, "bray")
perm = permanova(dm, ps, "Treatment")
disp = betadisper(dm, ps, "Treatment")

# Significant PERMANOVA + non-significant betadisper → centroid shift
# Significant betadisper → group dispersions differ, which can confound PERMANOVA
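The two-test reading can be written down as a small decision helper. This is only a sketch of the rule of thumb stated in the comments: `interpret` is a hypothetical function, and the 0.05 threshold is an assumed default rather than anything pyloseq enforces.

```python
def interpret(permanova_p, betadisper_p, alpha=0.05):
    """Hypothetical helper: combine the two p-values into a short reading."""
    if betadisper_p < alpha:
        # Unequal spread: PERMANOVA may be picking up dispersion, not location.
        return "dispersions differ; PERMANOVA may reflect spread as well as location"
    if permanova_p < alpha:
        return "centroid shift with homogeneous dispersions"
    return "no evidence of a centroid or dispersion difference"


print(interpret(permanova_p=0.001, betadisper_p=0.40))
print(interpret(permanova_p=0.001, betadisper_p=0.01))
```

With real results, the inputs would be `perm["p-value"]` and `disp["p-value"]` from the two calls.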

R reference: vegan::betadisper() + vegan::permutest()

pyloseq.betadisper

betadisper(
    distance_matrix: DistanceMatrix,
    ps: Phyloseq,
    grouping_var: str,
    permutations: int = 999,
) -> pd.Series

PERMDISP test for homogeneity of multivariate dispersions.

Thin wrapper around skbio.stats.distance.permdisp that extracts the grouping variable from ps.sample_data and aligns it to the distance matrix IDs automatically.

R reference: vegan::betadisper() + vegan::permutest()

Parameters:

Name	Type	Description	Default
`distance_matrix`	`DistanceMatrix`	Pairwise distance matrix (e.g. from `pyloseq.distance` or `pyloseq.gunifrac`).	required
`ps`	`Phyloseq`	`Phyloseq` object whose `sample_data` contains `grouping_var`. Only samples present in `distance_matrix.ids` are used.	required
`grouping_var`	`str`	Column name in `sample_data` defining the groups.	required
`permutations`	`int`	Number of permutations for the null distribution.	`999`

Returns:

Type	Description
`Series`	scikit-bio PERMDISP result with keys `method name`, `test statistic name`, `sample size`, `number of groups`, `test statistic`, `p-value`, `number of permutations`.

Raises:

Type	Description
`pyloseqValidationError`	If `sample_data` is missing or `grouping_var` is not found.