Diversity


Alpha diversity

estimate_richness

Estimates within-sample diversity. Returns a DataFrame indexed by sample name:

from pyloseq import estimate_richness

alpha = estimate_richness(ps, measures=["Observed", "Chao1", "Shannon"])

Available measures:

Measure Description
Observed Count of taxa with non-zero abundance
Chao1 Chao1 richness estimator (Chao 1984)
se.chao1 Standard error of Chao1
ACE Abundance-based Coverage Estimator (Chao & Lee 1992); matches vegan
se.ACE Standard error of ACE; not implemented, always NaN
Shannon Shannon entropy (H = −Σ p log p)
Simpson Simpson's diversity (1 − Σ p²)
InvSimpson Inverse Simpson (1 / Σ p²)
Fisher Fisher's log-series alpha

Pass measures=None (the default) to compute all nine. Unrecognized measure names raise pyloseqValidationError.
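As a rough illustration of the formulas in the table, the sketch below computes Observed, Shannon, Simpson, and InvSimpson for a single sample in plain Python. The helper name alpha_measures is hypothetical, and the natural logarithm is an assumption; this is not pyloseq's implementation.

```python
import math

def alpha_measures(counts):
    """Observed, Shannon, Simpson and InvSimpson for one sample's counts."""
    total = sum(counts)
    # Proportions of the taxa that are actually present in the sample
    p = [c / total for c in counts if c > 0]
    sum_p2 = sum(x * x for x in p)
    return {
        "Observed": len(p),
        "Shannon": -sum(x * math.log(x) for x in p),  # natural log (assumed)
        "Simpson": 1.0 - sum_p2,
        "InvSimpson": 1.0 / sum_p2,
    }

# Four equally abundant taxa and one absent taxon
result = alpha_measures([10, 10, 10, 10, 0])
```

For a perfectly even community of four taxa, Shannon equals ln 4, Simpson equals 0.75, and InvSimpson equals 4.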

Pooled richness:

Pass split=False to pool all samples into a single community before computing:

pooled = estimate_richness(ps, split=False)
# Returns a single row labelled "pooled"
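To show what pooling means for the result, here is a minimal sketch that assumes pooling sums each taxon's counts across samples (as taxa sums do in R phyloseq). The toy data and the helper pool_counts are hypothetical and do not use pyloseq.

```python
from collections import Counter

# Hypothetical toy counts: sample name -> {taxon: count}
samples = {
    "S1": {"OTU1": 5, "OTU2": 0, "OTU3": 2},
    "S2": {"OTU1": 1, "OTU2": 4, "OTU3": 0},
}

def pool_counts(samples):
    """Sum each taxon's counts over all samples (assumed pooling rule)."""
    pooled = Counter()
    for taxa in samples.values():
        pooled.update(taxa)  # Counter.update adds counts key by key
    return dict(pooled)

def observed(taxa):
    """Number of taxa with a non-zero count."""
    return sum(1 for c in taxa.values() if c > 0)

pooled = pool_counts(samples)
observed_per_sample = {name: observed(taxa) for name, taxa in samples.items()}
observed_pooled = observed(pooled)
```

Each sample contains two taxa, while the pooled community contains three, so pooled richness is not an average of the per-sample values.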

Note

Chao1, ACE, Observed, and Fisher are count-based; pass raw integer counts, not relative abundances. Running estimate_richness on the output of transform_sample_counts (for example, after conversion to relative abundances) produces incorrect estimates for those measures.
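The reason is easy to see with Chao1, which depends on the number of singletons and doubletons. The sketch below uses the bias-corrected form S_obs + F1(F1 − 1) / (2(F2 + 1)); whether pyloseq uses exactly this variant is an assumption, and the function chao1 here is a stand-alone illustration.

```python
def chao1(values):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    s_obs = sum(1 for v in values if v > 0)
    f1 = sum(1 for v in values if v == 1)  # singletons
    f2 = sum(1 for v in values if v == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

counts = [1, 1, 1, 2, 2, 10, 33]
total = sum(counts)
relative = [c / total for c in counts]

chao1_counts = chao1(counts)      # uses 3 singletons and 2 doubletons
chao1_relative = chao1(relative)  # no value equals 1 or 2 any more
```

With relative abundances the singleton and doubleton counts drop to zero, so the estimate collapses to the observed richness (7.0 instead of 8.0 here).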

pyloseq.estimate_richness

estimate_richness(
    ps: Phyloseq,
    measures: list[str] | None = None,
    split: bool = True,
) -> pd.DataFrame

Estimate richness (alpha diversity) for each sample.

R reference: estimate_richness(physeq, split, measures)

Parameters:

Name Type Description Default
ps Phyloseq

Phyloseq object. OTU table values should be integer counts.

required
measures list[str] | None

Subset of ["Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"]. None (default) returns all.

None
split bool

If True (default), compute per sample. If False, pool all samples first (matches R behavior). The single pooled row is labelled "pooled".

True

Returns:

Type Description
DataFrame

Indexed by sample name (or "pooled" when split=False); columns are the requested measures.


Beta diversity

distance

Computes a pairwise distance matrix between samples (or taxa). Returns an skbio.stats.distance.DistanceMatrix:

from pyloseq import distance

dm = distance(ps, "bray")
dm = distance(ps, "unifrac")
dm = distance(ps, "jaccard", kind="samples")
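For intuition about what some of these calls compute for one pair of samples, the following plain-Python sketch implements Bray-Curtis, binary Jaccard, and the Jensen-Shannon distance (base 2). The function names are hypothetical; pyloseq delegates to scipy for these metrics, so treat this as a reference calculation under that assumption rather than the library's code.

```python
import math

def bray_curtis(u, v):
    """Sum of absolute differences divided by the sum of all counts."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))

def jaccard_binary(u, v):
    """Share of taxa present in exactly one sample, among taxa present in either."""
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    differ = sum(1 for a, b in zip(u, v) if (a > 0) != (b > 0))
    return differ / either

def js_distance(u, v):
    """Square root of the base-2 Jensen-Shannon divergence of the proportions."""
    tu, tv = sum(u), sum(v)
    p = [a / tu for a in u]
    q = [b / tv for b in v]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the divergence
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

s1 = [6, 0, 2, 0]
s2 = [2, 4, 2, 0]
bc = bray_curtis(s1, s2)
jac = jaccard_binary(s1, s2)
jsd = js_distance(s1, s2)
```

Bray-Curtis uses the abundances, binary Jaccard only presence or absence, and the Jensen-Shannon distance compares the two samples as probability distributions.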

Available methods:

Method Backend Notes
"bray" scipy Bray-Curtis dissimilarity
"jaccard" scipy Binary Jaccard (presence/absence)
"euclidean" scipy
"manhattan" scipy City-block / L1
"canberra" scipy
"minkowski" scipy Pass p= to control exponent
"cosine" scipy
"correlation" scipy Pearson correlation distance
"maximum" scipy Chebyshev / L∞
"binary" scipy Synonym for "jaccard"
"sorensen" scipy Sørensen-Dice (presence/absence)
"unifrac" scikit-bio Unweighted UniFrac; requires phy_tree
"wunifrac" scikit-bio Weighted UniFrac; requires phy_tree
"jsd" scipy Jensen-Shannon distance (√JSD, the square root of the base-2 divergence)
"dpcoa" custom Double PCoA patristic distance; requires phy_tree

kind parameter:

kind="samples" (default) computes an n_samples × n_samples matrix. kind="taxa" transposes before computing, yielding an n_taxa × n_taxa matrix. Most phylogenetic methods (unifrac, wunifrac, dpcoa) only support kind="samples".
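The effect of the transpose can be sketched without pyloseq. The toy table below (rows are samples, columns are taxa) and the helpers manhattan and pairwise are hypothetical; the point is only the shape of the resulting matrices.

```python
# Hypothetical toy table: 3 samples (rows) x 4 taxa (columns)
table = [
    [6, 0, 2, 0],
    [2, 4, 2, 0],
    [0, 1, 0, 5],
]

def manhattan(u, v):
    """City-block distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pairwise(rows, metric):
    """Full square matrix of metric(row_i, row_j)."""
    return [[metric(r1, r2) for r2 in rows] for r1 in rows]

# Analogue of kind="samples": compare rows as they are
by_samples = pairwise(table, manhattan)

# Analogue of kind="taxa": transpose first, then compare
transposed = [list(col) for col in zip(*table)]
by_taxa = pairwise(transposed, manhattan)
```

Here by_samples is 3 × 3 and by_taxa is 4 × 4, matching the n_samples × n_samples and n_taxa × n_taxa shapes described above.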

Passing kwargs to scipy:

# Minkowski with p=1 (= Manhattan)
dm = distance(ps, "minkowski", p=1)
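The comment above can be checked by hand. This stand-alone sketch of the Minkowski formula (the function minkowski is illustrative, not the scipy implementation) shows that p=1 gives the Manhattan distance and p=2 the Euclidean distance.

```python
def minkowski(u, v, p=2):
    """(sum |u_i - v_i|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

s1 = [6, 0, 2, 0]
s2 = [2, 4, 2, 0]

d_p1 = minkowski(s1, s2, p=1)                     # Manhattan
d_p2 = minkowski(s1, s2)                          # Euclidean (default p=2)
manhattan = sum(abs(a - b) for a, b in zip(s1, s2))
```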

pyloseq.distance

distance(
    ps: Phyloseq,
    method: str,
    kind: str = "samples",
    **kwargs: Any,
) -> DistanceMatrix

Compute a pairwise distance (or dissimilarity) matrix.

R reference: distance(physeq, method, type, ...)

Parameters:

ps (Phyloseq, required): Phyloseq object.

method (str, required): Distance method. See distance_method_list for all options.

kind (str, default "samples"): "samples" or "taxa". Most phylogenetic methods require "samples".

**kwargs (Any): Passed to the underlying implementation. For UniFrac, only normalized and n_jobs are accepted. For scipy metrics, extra keywords are forwarded to scipy.spatial.distance.pdist (e.g. p= for "minkowski", which otherwise defaults to p=2).

Returns:

Type Description
DistanceMatrix

unifrac

Direct interface to UniFrac, bypassing the distance dispatcher:

from pyloseq import unifrac

dm_uw = unifrac(ps, weighted=False)
dm_w  = unifrac(ps, weighted=True, normalized=True)

normalized=True divides by total branch length; this only affects weighted UniFrac. The n_jobs parameter controls parallelism in the scikit-bio implementation.
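For orientation, unweighted UniFrac between two samples is the branch length leading to taxa from only one of the samples, divided by the branch length leading to taxa from either. The sketch below applies that definition to a hypothetical four-tip tree encoded as (length, tips below the branch) pairs. pyloseq delegates to scikit-bio, so this is only an illustration of the definition.

```python
# Toy tree ((A:1,B:1):1,(C:1,D:2):1) as (branch length, tips below it)
branches = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"A", "B"}),
    (1.0, {"C"}),
    (2.0, {"D"}),
    (1.0, {"C", "D"}),
]

def unweighted_unifrac(branches, present_1, present_2):
    """Unique branch length / total branch length observed in either sample."""
    unique = shared = 0.0
    for length, tips in branches:
        in_1 = bool(tips & present_1)
        in_2 = bool(tips & present_2)
        if in_1 and in_2:
            shared += length
        elif in_1 or in_2:
            unique += length
    return unique / (unique + shared)

d_overlap = unweighted_unifrac(branches, {"A", "B"}, {"B", "C"})
d_same = unweighted_unifrac(branches, {"A", "B"}, {"A", "B"})
d_disjoint = unweighted_unifrac(branches, {"A"}, {"D"})
```

Identical samples give 0, samples that share no branch give 1, and the overlapping pair gives 3 / 5 = 0.6.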

pyloseq.unifrac

unifrac(
    ps: Phyloseq,
    weighted: bool = False,
    normalized: bool = True,
    n_jobs: int = 1,
) -> DistanceMatrix

Compute (weighted or unweighted) UniFrac distances.

R reference: UniFrac(physeq, weighted, normalized, parallel, fast)

Parameters:

ps (Phyloseq, required): Phyloseq object with both otu_table and phy_tree.

weighted (bool, default False): If True, compute weighted UniFrac; otherwise unweighted.

normalized (bool, default True): Normalize by total branch length (meaningful only for weighted UniFrac).

n_jobs (int, default 1): Number of parallel workers (passed to scikit-bio).

Returns:

Type Description
DistanceMatrix

distance_method_list

Returns all supported methods grouped by backend:

from pyloseq import distance_method_list

methods = distance_method_list()
# {
#   "phylogenetic":    ["dpcoa", "unifrac", "wunifrac"],
#   "information":     ["jsd"],
#   "vegan-equivalent": ["bray", "canberra", ...]
# }
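One possible use of the grouped dict is to look up which group a method belongs to before calling distance. The sketch below works on a stand-in dict shaped like the return value; the "vegan-equivalent" list is deliberately left truncated, and the helper group_of is hypothetical.

```python
# Stand-in for the dict returned by distance_method_list()
methods = {
    "phylogenetic": ["dpcoa", "unifrac", "wunifrac"],
    "information": ["jsd"],
    "vegan-equivalent": ["bray", "canberra"],  # truncated stand-in, not the full list
}

def group_of(method, methods):
    """Return the group name containing `method`, or None if unsupported."""
    for group, names in methods.items():
        if method in names:
            return group
    return None

needs_tree = group_of("wunifrac", methods) == "phylogenetic"
unknown = group_of("not-a-method", methods)
```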

pyloseq.distance_method_list

distance_method_list() -> dict[str, list[str]]

Return all supported distance methods, grouped by backend.

R reference: distanceMethodList