Diversity¶

Alpha diversity¶

estimate_richness¶

Estimates within-sample diversity. Returns a DataFrame indexed by sample name:

from pyloseq import estimate_richness

alpha = estimate_richness(ps, measures=["Observed", "Chao1", "Shannon"])

Available measures:

Measure	Description
`Observed`	Count of taxa with non-zero abundance
`Chao1`	Chao1 richness estimator (Chao 1984)
`se.chao1`	Standard error of Chao1
`ACE`	Abundance-based Coverage Estimator (Chao & Lee 1992); matches vegan
`se.ACE`	SE of ACE — not implemented, always `NaN`
`Shannon`	Shannon entropy (H = −Σ p log p)
`Simpson`	Simpson's diversity (1 − Σ p²)
`InvSimpson`	Inverse Simpson (1 / Σ p²)
`Fisher`	Fisher's log-series alpha
`PD`	Faith's phylogenetic diversity (Faith 1992); requires `phy_tree`

Pass measures=None (the default) to compute all available measures. When ps.phy_tree is None, PD is excluded from the defaults. Unrecognized measure names raise pyloseqValidationError.
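
The formulas in the table can be checked by hand on a single sample. The sketch below is a plain-Python illustration, not pyloseq's implementation. The helper name alpha_metrics is hypothetical, Shannon is assumed to use the natural logarithm, and Chao1 is assumed to use the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons.

```python
import math


def alpha_metrics(counts):
    """Compute a few alpha-diversity measures for one sample of raw counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    props = [c / total for c in present]

    f1 = sum(1 for c in present if c == 1)  # singletons
    f2 = sum(1 for c in present if c == 2)  # doubletons
    sum_p2 = sum(p * p for p in props)

    return {
        "Observed": len(present),
        # Bias-corrected Chao1 (an assumption about the variant used)
        "Chao1": len(present) + f1 * (f1 - 1) / (2 * (f2 + 1)),
        "Shannon": -sum(p * math.log(p) for p in props),
        "Simpson": 1 - sum_p2,
        "InvSimpson": 1 / sum_p2,
    }


sample_counts = [10, 5, 1, 1, 2, 0]
metrics = alpha_metrics(sample_counts)
```

For these made-up counts, five taxa are observed and the two singletons and one doubleton add 0.5 to the Chao1 estimate.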

Pooled richness:

Pass split=False to pool all samples into a single community before computing:

pooled = estimate_richness(ps, split=False)
# Returns a single row labelled "pooled"
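
Conceptually, pooling sums each taxon's counts across samples before any measure is computed. A minimal sketch of that step with made-up counts (the dictionary layout is illustrative only and is not pyloseq's internal representation):

```python
# Hypothetical counts: sample name -> per-taxon counts in a fixed taxon order
counts_by_sample = {
    "S1": [4, 0, 1],
    "S2": [0, 3, 1],
}

# Pool by summing each taxon across all samples
pooled_counts = [sum(taxon) for taxon in zip(*counts_by_sample.values())]

# Observed richness per sample versus for the pooled community
observed_per_sample = {
    name: sum(1 for c in counts if c > 0)
    for name, counts in counts_by_sample.items()
}
observed_pooled = sum(1 for c in pooled_counts if c > 0)
```

Each sample contains two taxa, while the pooled community contains three, which is why pooled and per-sample results are not interchangeable.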

Faith's Phylogenetic Diversity:

# PD requires a phylogenetic tree on the Phyloseq object
pd_df = estimate_richness(ps_with_tree, measures=["PD"])

# Mix PD with standard measures in one call
alpha = estimate_richness(ps_with_tree, measures=["Observed", "Shannon", "PD"])

PD is the total branch length of the minimal subtree connecting all observed taxa and the root. If the tree is unrooted, it is midpoint-rooted internally, matching the convention of R's phangorn::midpoint.
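
The definition can be sketched on a toy tree. The code below is a plain-Python illustration under stated assumptions, not pyloseq's implementation: the tree is already rooted, it is encoded as a hypothetical mapping from each node to its parent and branch length, and the function name faith_pd is made up for this example.

```python
# Hypothetical rooted tree: node -> (parent, branch length to parent)
tree = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}


def faith_pd(tree, observed_tips):
    """Sum the lengths of all branches on paths from observed tips to the root."""
    visited = set()
    for tip in observed_tips:
        node = tip
        # Walk towards the root, stopping early on branches already counted
        while node in tree and node not in visited:
            visited.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in visited)


pd_ab = faith_pd(tree, ["A", "B"])
pd_ac = faith_pd(tree, ["A", "C"])
```

Tips A and B share the branch above n1, so it is counted once: 1.0 + 2.0 + 0.5. Tips A and C only meet at the root, giving 1.0 + 0.5 + 3.0.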

Note

Chao1, ACE, Observed, and Fisher are count-based: pass raw integer counts, not relative abundances. Calling estimate_richness on data that transform_sample_counts has already converted to proportions produces incorrect estimates for those measures.
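
One way to see the problem: Chao1 depends on the number of singletons and doubletons, and those only exist in integer count data. A small illustration with made-up counts (the helper name is hypothetical):

```python
raw_counts = [10, 5, 1, 1, 2]

# Convert to relative abundances, as a proportion transform would
total = sum(raw_counts)
proportions = [c / total for c in raw_counts]


def singletons_and_doubletons(values):
    """Count entries exactly equal to 1 and exactly equal to 2."""
    return sum(1 for v in values if v == 1), sum(1 for v in values if v == 2)


f_raw = singletons_and_doubletons(raw_counts)
f_prop = singletons_and_doubletons(proportions)
```

With raw counts there are two singletons and one doubleton. After the transform there are none, so the information a Chao1-style correction relies on is lost.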

pyloseq.estimate_richness ¶

estimate_richness(
    ps: Phyloseq,
    measures: list[str] | None = None,
    split: bool = True,
) -> pd.DataFrame

Estimate richness (alpha diversity) for each sample.

R reference: estimate_richness(physeq, split, measures)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	`Phyloseq` object. OTU table values should be integer counts.	required
`measures`	`list[str] \| None`	Subset of `["Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "PD"]`. `None` (default) computes all available measures; `PD` is included in the default set only when the object has a `phy_tree`.	`None`
`split`	`bool`	If `True` (default), compute per sample. If `False`, pool all samples first (matches R behavior). The single pooled row is labelled `"pooled"`.	`True`

Returns:

Type	Description
`DataFrame`	Indexed by sample name (or `"pooled"` when `split=False`); columns are the requested measures.

Raises:

Type	Description
`pyloseqValidationError`	If `"PD"` is requested but `ps.phy_tree` is `None`, or if any measure name is not recognised.

Notes

"PD" (Faith's phylogenetic diversity, Faith 1992) is the total branch length of the minimal subtree connecting all observed taxa and the root. It requires a rooted tree; an unrooted tree is midpoint-rooted internally, matching R's phangorn::midpoint convention.

Beta diversity¶

distance¶

Computes a pairwise distance matrix between samples (or taxa). Returns an skbio.stats.distance.DistanceMatrix:

from pyloseq import distance

dm = distance(ps, "bray")
dm = distance(ps, "unifrac")
dm = distance(ps, "jaccard", kind="samples")

Available methods:

Method	Backend	Notes
`"bray"`	scipy	Bray-Curtis dissimilarity
`"jaccard"`	scipy	Binary Jaccard (presence/absence)
`"euclidean"`	scipy
`"manhattan"`	scipy	City-block / L1
`"canberra"`	scipy
`"minkowski"`	scipy	Pass `p=` to control exponent
`"cosine"`	scipy
`"correlation"`	scipy	Pearson correlation distance
`"maximum"`	scipy	Chebyshev / L∞
`"binary"`	scipy	Synonym for `"jaccard"`
`"sorensen"`	scipy	Sørensen-Dice (presence/absence)
`"unifrac"`	scikit-bio	Unweighted UniFrac; requires `phy_tree`
`"wunifrac"`	scikit-bio	Weighted UniFrac; requires `phy_tree`
`"jsd"`	scipy	Jensen-Shannon distance (square root of the Jensen-Shannon divergence, log base 2)
`"dpcoa"`	custom	Double PCoA patristic distance; requires `phy_tree`
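
For orientation, three of the measures in the table can be written out directly. This is a plain-Python sketch of the textbook definitions, not the scipy-backed code path, and the function names are hypothetical.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a - b| / sum(a + b)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def binary_jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |intersection| / |union|."""
    in_a = {i for i, x in enumerate(a) if x > 0}
    in_b = {i for i, y in enumerate(b) if y > 0}
    return 1 - len(in_a & in_b) / len(in_a | in_b)


def js_distance(a, b):
    """Square root of the Jensen-Shannon divergence (log base 2)."""
    p = [x / sum(a) for x in a]
    q = [y / sum(b) for y in b]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing to the divergence
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    # Clamp at zero to guard against tiny negative rounding errors
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


bc = bray_curtis([1, 2, 3], [3, 2, 1])
jac = binary_jaccard([1, 0, 2, 0], [0, 0, 3, 1])
jsd_disjoint = js_distance([5, 0], [0, 7])
jsd_same = js_distance([1, 3], [2, 6])
```

Two samples with no taxa in common reach the maximum Jensen-Shannon distance of 1 in base 2, and two samples with identical proportions have distance 0 regardless of sequencing depth.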

kind parameter:

kind="samples" (default) computes an n_samples × n_samples matrix. kind="taxa" transposes before computing, yielding an n_taxa × n_taxa matrix. Most phylogenetic methods (unifrac, wunifrac, dpcoa) only support kind="samples".
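
The effect of kind can be pictured as choosing which axis of the abundance table supplies the observations. A minimal sketch with a hypothetical samples-by-taxa table and Manhattan distance (chosen only for brevity; the orientation of pyloseq's internal table is not assumed here):

```python
from itertools import combinations

# Hypothetical table: rows are samples, columns are taxa
table = [
    [1, 0, 2],  # sample S1
    [0, 3, 2],  # sample S2
]


def pairwise_manhattan(rows):
    """Return {(i, j): distance} for every pair of rows."""
    return {
        (i, j): sum(abs(x - y) for x, y in zip(rows[i], rows[j]))
        for i, j in combinations(range(len(rows)), 2)
    }


# kind="samples": compare rows as they are
between_samples = pairwise_manhattan(table)

# kind="taxa": transpose first, so each taxon becomes an observation
taxa_rows = [list(col) for col in zip(*table)]
between_taxa = pairwise_manhattan(taxa_rows)
```

Two samples yield a single pairwise distance, while the three taxa yield three.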

Passing kwargs to scipy:

# Minkowski with p=1 (= Manhattan)
dm = distance(ps, "minkowski", p=1)
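
The equivalence noted in the comment follows from the Minkowski definition. A short plain-Python check (the helper is written out here for illustration and does not call scipy):

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum |a_i - b_i|**p) ** (1 / p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = [0, 3], [4, 0]
d_p1 = minkowski(a, b, p=1)  # Manhattan: |0 - 4| + |3 - 0|
d_p2 = minkowski(a, b, p=2)  # Euclidean: sqrt(16 + 9)
```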

pyloseq.distance ¶

distance(
    ps: Phyloseq,
    method: str,
    kind: str = "samples",
    **kwargs: Any,
) -> DistanceMatrix

Compute a pairwise distance (or dissimilarity) matrix.

R reference: distance(physeq, method, type, ...)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	`Phyloseq` object.	required
`method`	`str`	Distance method. See `distance_method_list` for all options.	required
`kind`	`str`	`"samples"` (default) or `"taxa"`. Most phylogenetic methods require `"samples"`.	`'samples'`
`**kwargs`	`Any`	Passed to the underlying implementation. For UniFrac, only `normalized` and `n_jobs` are accepted. For scipy metrics, extra keywords are forwarded to `scipy.spatial.distance.pdist` (e.g. `p=` for `"minkowski"`, which otherwise defaults to `p=2`).	`{}`

Returns:

Type	Description
`DistanceMatrix`	Pairwise distances as an `skbio.stats.distance.DistanceMatrix`.

unifrac¶

Direct interface to UniFrac, bypassing the distance dispatcher:

from pyloseq import unifrac

dm_uw = unifrac(ps, weighted=False)
dm_w  = unifrac(ps, weighted=True, normalized=True)

normalized=True divides by total branch length; this only affects weighted UniFrac. The n_jobs parameter controls parallelism in the scikit-bio implementation.
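
As a rough picture of what unweighted UniFrac measures, the sketch below computes it on a toy tree: the branch length found under exactly one of two samples, divided by the branch length found under either. This follows the Lozupone & Knight (2005) idea in plain Python and is not the scikit-bio code path. The tree encoding and the function names are hypothetical.

```python
# Hypothetical rooted tree: node -> (parent, branch length to parent)
tree = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}


def branches_above(tree, tips):
    """Collect every branch on the paths from the given tips to the root."""
    found = set()
    for node in tips:
        while node in tree and node not in found:
            found.add(node)
            node = tree[node][0]
    return found


def unweighted_unifrac(tree, tips_1, tips_2):
    """Unique branch length divided by total branch length of the two samples."""
    b1, b2 = branches_above(tree, tips_1), branches_above(tree, tips_2)
    unique = sum(tree[n][1] for n in b1 ^ b2)
    total = sum(tree[n][1] for n in b1 | b2)
    return unique / total


d_ab = unweighted_unifrac(tree, ["A"], ["B"])
d_same = unweighted_unifrac(tree, ["A", "C"], ["A", "C"])
```

A sample containing only A and one containing only B share just the 0.5-length branch above n1, so the distance is 3.0 / 3.5. Identical samples have distance 0.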

pyloseq.unifrac ¶

unifrac(
    ps: Phyloseq,
    weighted: bool = False,
    normalized: bool = True,
    n_jobs: int = 1,
) -> DistanceMatrix

Compute (weighted or unweighted) UniFrac distances.

R reference: UniFrac(physeq, weighted, normalized, parallel, fast)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	`Phyloseq` object with both `otu_table` and `phy_tree`.	required
`weighted`	`bool`	If `True`, compute weighted UniFrac; otherwise unweighted.	`False`
`normalized`	`bool`	Normalize by total branch length (meaningful only for weighted UniFrac).	`True`
`n_jobs`	`int`	Number of parallel workers (passed to scikit-bio).	`1`

Returns:

Type	Description
`DistanceMatrix`	Pairwise UniFrac distances between samples.

gunifrac¶

Computes the Generalized UniFrac family of distance matrices (Chen et al. 2012 Bioinformatics 28:2106–2113), matching the R GUniFrac package API:

from pyloseq import gunifrac

results = gunifrac(ps)                   # default alpha=(0, 0.5, 1)
dm_half = results["d_0.5"]              # GUniFrac at alpha=0.5
dm_uw   = results["d_UW"]               # unweighted UniFrac
dm_vaw  = results["d_VAW"]              # variance-adjusted weighted UniFrac

Return value — a GUnifracResult (exported from pyloseq). Supports subscript access, .keys(), .values(), and .items() like a dict, and exposes fixed matrices as attributes:

Key / attribute	Description
`result["d_{a}"]` / `result.d_0_5`	GUniFrac at exponent a for each value in `alpha` (e.g. `"d_0.5"`)
`result["d_UW"]` / `result.d_UW`	Unweighted UniFrac (Chen 2012 definition)
`result["d_VAW"]` / `result.d_VAW`	Variance-adjusted weighted UniFrac (Chang et al. 2011)

Note

isinstance(result, dict) and .get() no longer work — use subscript or attribute access instead.

Alpha = 0 up-weights rare lineages; alpha = 1 is equivalent to normalized weighted UniFrac. The default alpha=(0, 0.5, 1) covers the full range.
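
The role of alpha is easiest to see in the formula from Chen et al. (2012): each branch i with length b_i contributes its relative difference |pA_i - pB_i| / (pA_i + pB_i), weighted by b_i (pA_i + pB_i)^alpha, and the weighted sum is divided by the sum of the weights. Below is a plain-Python sketch of that formula on hand-written branch proportions. It is not pyloseq's implementation, and the function name and inputs are hypothetical.

```python
def generalized_unifrac(lengths, prop_a, prop_b, alpha):
    """GUniFrac distance between two samples from per-branch proportions."""
    numerator = 0.0
    denominator = 0.0
    for b, pa, pb in zip(lengths, prop_a, prop_b):
        s = pa + pb
        if s == 0:
            continue  # branch absent from both samples
        weight = b * s ** alpha
        numerator += weight * abs(pa - pb) / s
        denominator += weight
    return numerator / denominator


# Hypothetical branches: two tips and their shared parent branch
lengths = [1.0, 2.0, 0.5]
prop_a = [0.8, 0.2, 1.0]  # proportion of sample A descending from each branch
prop_b = [0.2, 0.8, 1.0]

d_alpha0 = generalized_unifrac(lengths, prop_a, prop_b, alpha=0)
d_alpha1 = generalized_unifrac(lengths, prop_a, prop_b, alpha=1)
```

With these inputs alpha = 0 gives 1.8 / 3.5 and alpha = 1 gives 1.8 / 4. The shared branch carries the most abundance, so its weight doubles at alpha = 1 and pulls the distance down.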

Piping into make_network:

from pyloseq import make_network, plot_network

g = make_network(ps, distance=results["d_0.5"], max_dist=0.5)
p = plot_network(g, ps, color="SampleType")
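
The max_dist argument acts as a threshold on the distance matrix. As an illustration only, assuming an edge joins two samples whose distance is at most max_dist (the exact rule and graph type used by make_network are not shown here), the thresholding step could look like this. The function name and the distances are made up.

```python
from itertools import combinations

# Hypothetical pairwise distances between three samples
sample_ids = ["S1", "S2", "S3"]
dist = {
    ("S1", "S2"): 0.30,
    ("S1", "S3"): 0.70,
    ("S2", "S3"): 0.45,
}


def threshold_edges(ids, dist, max_dist):
    """Keep the sample pairs whose distance does not exceed max_dist."""
    return [
        (u, v) for u, v in combinations(ids, 2)
        if dist[(u, v)] <= max_dist
    ]


edges = threshold_edges(sample_ids, dist, max_dist=0.5)
```

Lowering max_dist removes edges and can leave samples disconnected, so it is worth trying a few thresholds before plotting.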

Note

d_UW from gunifrac matches R's GUniFrac package definition, which counts any branch whose cumulative proportion differs between the two samples (including branches shared by both but at different abundances). This differs slightly from unifrac(), which uses the Lozupone & Knight (2005) definition counting only branches exclusive to one sample.

pyloseq.gunifrac ¶

gunifrac(
    ps: Phyloseq, alpha: tuple[float, ...] = (0, 0.5, 1)
) -> GUnifracResult

Compute Generalized UniFrac distance matrices.

Implements the GUniFrac family from Chen et al. (2012) Bioinformatics 28(16):2106-2113, matching the R GUniFrac package API.

R reference: GUniFrac(otu.tab, tree, alpha=c(0, 0.5, 1))$unifracs

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	`Phyloseq` object with both `otu_table` and `phy_tree`.	required
`alpha`	`tuple[float, ...]`	Exponents to compute. Each value in `[0, 1]` produces one distance matrix keyed `"d_{alpha}"` (e.g. `"d_0.5"`). Alpha = 0 weights branches without regard to abundance, which up-weights rare lineages (it is not the same matrix as `d_UW`); alpha = 1 gives normalized weighted UniFrac.	`(0, 0.5, 1)`

Returns:

Type	Description
`GUnifracResult`	Structured result with attributes `d_UW`, `d_VAW`, and `alpha_matrices` (keyed `"d_{a}"` for each requested alpha). Supports `result["d_0.5"]`, `result.keys()`, `result.values()`, and `result.items()` for backwards compatibility.

Note

d_UW from this function matches R's GUniFrac package definition, which counts any branch whose cumulative proportion differs between the two samples (including branches shared by both but in different amounts). This differs slightly from scikit-bio and unifrac(), which count only branches exclusive to one sample (the Lozupone & Knight 2005 definition).

distance_method_list¶

Returns all supported methods as a dict of lists, grouped by category:

from pyloseq import distance_method_list

methods = distance_method_list()
# {
#   "phylogenetic":    ["dpcoa", "unifrac", "wunifrac"],
#   "information":     ["jsd"],
#   "vegan-equivalent": ["bray", "canberra", ...]
# }

pyloseq.distance_method_list ¶

distance_method_list() -> dict[str, list[str]]

Return all supported distance methods as a dict of lists, grouped by category.

R reference: distanceMethodList