Diversity¶
Alpha diversity¶
estimate_richness¶
Estimates within-sample diversity. Returns a DataFrame indexed by sample name:
from pyloseq import estimate_richness
alpha = estimate_richness(ps, measures=["Observed", "Chao1", "Shannon"])
Available measures:
| Measure | Description |
|---|---|
| Observed | Count of taxa with non-zero abundance |
| Chao1 | Chao1 richness estimator (Chao 1984) |
| se.chao1 | Standard error of Chao1 |
| ACE | Abundance-based Coverage Estimator (Chao & Lee 1992); matches vegan |
| se.ACE | SE of ACE; not implemented, always NaN |
| Shannon | Shannon entropy (H = −Σ p log p) |
| Simpson | Simpson's diversity (1 − Σ p²) |
| InvSimpson | Inverse Simpson (1 / Σ p²) |
| Fisher | Fisher's log-series alpha |
Pass measures=None (the default) to compute all nine. Unrecognized measure names raise pyloseqValidationError.
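The closed-form measures in the table can be checked by hand. The following is a minimal sketch, not pyloseq's implementation: `alpha_indices` is a hypothetical helper, and the natural logarithm is assumed for Shannon because the table does not state a base.

```python
import math

def alpha_indices(counts):
    """Observed, Shannon, Simpson and InvSimpson for one sample (hypothetical helper)."""
    nonzero = [c for c in counts if c > 0]           # zero-abundance taxa contribute nothing
    total = sum(nonzero)
    props = [c / total for c in nonzero]             # relative abundances p
    sum_sq = sum(p * p for p in props)               # Σ p²
    return {
        "Observed": len(nonzero),
        "Shannon": -sum(p * math.log(p) for p in props),  # assumes natural log
        "Simpson": 1.0 - sum_sq,
        "InvSimpson": 1.0 / sum_sq,
    }

# Four equally abundant taxa plus one absent taxon
result = alpha_indices([10, 10, 10, 10, 0])
```

For a perfectly even community of four taxa, Shannon equals ln 4, Simpson equals 0.75, and InvSimpson equals 4, which makes this a convenient sanity check.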
Pooled richness:
Pass split=False to pool all samples into a single community before computing:
pooled = estimate_richness(ps, split=False)
Note
Chao1, ACE, Observed, and Fisher are count-based; pass raw integer counts, not relative abundances. Running estimate_richness on data that transform_sample_counts has converted to relative abundances will produce incorrect estimates for those measures.
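To see why count-based estimators break on relative abundances, consider Chao1, which depends on the number of singletons and doubletons. The sketch below uses one common bias-corrected form of the estimator; whether pyloseq uses this exact variant is an assumption, and `chao1` is a hypothetical helper.

```python
def chao1(abundances):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for a in abundances if a > 0)   # observed taxa
    f1 = sum(1 for a in abundances if a == 1)     # singletons
    f2 = sum(1 for a in abundances if a == 2)     # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

counts = [1, 1, 1, 2, 5, 10]
total = sum(counts)
proportions = [c / total for c in counts]

chao1_counts = chao1(counts)        # sees 3 singletons and 1 doubleton
chao1_props = chao1(proportions)    # no value equals 1 or 2, so the correction vanishes
```

On raw counts the estimate is 7.5; on proportions it collapses to the observed count of 6, because the singleton and doubleton tallies are lost.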
pyloseq.estimate_richness ¶
estimate_richness(
ps: Phyloseq,
measures: list[str] | None = None,
split: bool = True,
) -> pd.DataFrame
Estimate richness (alpha diversity) for each sample.
R reference: estimate_richness(physeq, split, measures)
Parameters:
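| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | | required |
| measures | list[str] \| None | Subset of the available measures to compute; None computes all of them. | None |
| split | bool | If True, compute richness per sample; if False, pool all samples into a single community first. | True |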
Returns:
| Type | Description |
|---|---|
| DataFrame | Indexed by sample name when split=True; a single pooled result when split=False. |
Beta diversity¶
distance¶
Computes a pairwise distance matrix between samples (or taxa). Returns an skbio.stats.distance.DistanceMatrix:
from pyloseq import distance
dm = distance(ps, "bray")
dm = distance(ps, "unifrac")
dm = distance(ps, "jaccard", kind="samples")
Available methods:
| Method | Backend | Notes |
|---|---|---|
| "bray" | scipy | Bray-Curtis dissimilarity |
| "jaccard" | scipy | Binary Jaccard (presence/absence) |
| "euclidean" | scipy | |
| "manhattan" | scipy | City-block / L1 |
| "canberra" | scipy | |
| "minkowski" | scipy | Pass p= to control exponent |
| "cosine" | scipy | |
| "correlation" | scipy | Pearson correlation distance |
| "maximum" | scipy | Chebyshev / L∞ |
| "binary" | scipy | Synonym for "jaccard" |
| "sorensen" | scipy | Sørensen-Dice (presence/absence) |
| "unifrac" | scikit-bio | Unweighted UniFrac; requires phy_tree |
| "wunifrac" | scikit-bio | Weighted UniFrac; requires phy_tree |
| "jsd" | scipy | Jensen-Shannon divergence (√JSD, base 2) |
| "dpcoa" | custom | Double PCoA patristic distance; requires phy_tree |
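The "√JSD, base 2" note for "jsd" can be read as: normalize both samples to proportions, compute the Jensen-Shannon divergence with base-2 logarithms, and take the square root. The following is a minimal sketch of that reading, not pyloseq's code; `js_distance` is a hypothetical helper.

```python
import math

def js_distance(p_counts, q_counts):
    """Square root of the base-2 Jensen-Shannon divergence (hypothetical helper)."""
    p = [c / sum(p_counts) for c in p_counts]
    q = [c / sum(q_counts) for c in q_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]          # midpoint distribution

    def kl(x, ref):
        # Kullback-Leibler divergence in bits; zero-probability terms contribute 0
        return sum(a * math.log2(a / b) for a, b in zip(x, ref) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))                  # guard against tiny negative round-off

same = js_distance([3, 3], [5, 5])       # identical composition
disjoint = js_distance([4, 0], [0, 7])   # no shared taxa
```

With base 2 the value is bounded between 0 (identical composition) and 1 (no shared taxa).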
kind parameter:
kind="samples" (default) computes an n_samples × n_samples matrix. kind="taxa" transposes the abundance table before computing, yielding an n_taxa × n_taxa matrix. The phylogenetic methods (unifrac, wunifrac, dpcoa) support only kind="samples".
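The effect of kind is easiest to see on a small table. The sketch below is illustrative only: it assumes a plain list-of-lists table with samples as rows (pyloseq's internal orientation is not stated here), and `bray_curtis` and `pairwise` are hypothetical helpers.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: Σ|u - v| / Σ(u + v)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den

def pairwise(rows):
    """Square matrix of Bray-Curtis dissimilarities between all rows."""
    n = len(rows)
    return [[bray_curtis(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical abundance table: 2 samples (rows) x 3 taxa (columns)
table = [
    [10, 0, 5],
    [8, 2, 5],
]

by_samples = pairwise(table)                             # 2 x 2, analogous to kind="samples"
by_taxa = pairwise([list(col) for col in zip(*table)])   # 3 x 3, analogous to kind="taxa"
```

The same metric is applied in both cases; only the axis that plays the role of "observation" changes.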
Passing kwargs to scipy:
dm = distance(ps, "minkowski", p=3)
pyloseq.distance ¶
Compute a pairwise distance (or dissimilarity) matrix.
R reference: distance(physeq, method, type, ...)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | | required |
| method | str | Distance method. See distance_method_list for the supported names. | required |
| kind | str | Compute distances between 'samples' or 'taxa'. | 'samples' |
| **kwargs | Any | Passed to the underlying implementation. For UniFrac, only … | {} |
Returns:
| Type | Description |
|---|---|
| DistanceMatrix | |
unifrac¶
Direct interface to UniFrac, bypassing the distance dispatcher:
from pyloseq import unifrac
dm_uw = unifrac(ps, weighted=False)
dm_w = unifrac(ps, weighted=True, normalized=True)
normalized=True divides by total branch length; this only affects weighted UniFrac. The n_jobs parameter controls parallelism in the scikit-bio implementation.
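For intuition about what the unweighted variant measures, here is a minimal sketch on a toy tree. It is not the scikit-bio implementation: the flat branch list and the `unweighted_unifrac` helper are hypothetical. Each branch is stored with its length and the set of tips below it, and the distance is the fraction of covered branch length that leads to only one of the two samples.

```python
# Hypothetical tree: (branch length, tips descending from that branch)
branches = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (0.5, {"A", "B"}),   # internal branch above A and B
    (2.0, {"C"}),
]

def unweighted_unifrac(tips_1, tips_2, branches):
    """Unique branch length divided by total branch length covered by either sample."""
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_1 = bool(tips & tips_1)   # branch leads to a taxon present in sample 1
        in_2 = bool(tips & tips_2)
        if in_1 or in_2:
            total += length
            if in_1 != in_2:         # branch leads to exactly one sample
                unique += length
    return unique / total

d = unweighted_unifrac({"A", "B"}, {"B", "C"}, branches)
d_same = unweighted_unifrac({"A", "B"}, {"A", "B"}, branches)
d_disjoint = unweighted_unifrac({"A"}, {"C"}, branches)
```

Only presence or absence enters this calculation, which is why a normalization of abundance-weighted branch lengths has nothing to act on in the unweighted case.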
pyloseq.unifrac ¶
unifrac(
ps: Phyloseq,
weighted: bool = False,
normalized: bool = True,
n_jobs: int = 1,
) -> DistanceMatrix
Compute (weighted or unweighted) UniFrac distances.
R reference: UniFrac(physeq, weighted, normalized, parallel, fast)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | | required |
| weighted | bool | If True, compute weighted UniFrac; otherwise compute unweighted UniFrac. | False |
| normalized | bool | Normalize by total branch length (meaningful only for weighted UniFrac). | True |
| n_jobs | int | Number of parallel workers (passed to scikit-bio). | 1 |
Returns:
| Type | Description |
|---|---|
| DistanceMatrix | |
distance_method_list¶
Returns all supported methods grouped by backend:
from pyloseq import distance_method_list
methods = distance_method_list()
# {
# "phylogenetic": ["dpcoa", "unifrac", "wunifrac"],
# "information": ["jsd"],
# "vegan-equivalent": ["bray", "canberra", ...]
# }
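One possible use of the grouped result is to decide up front whether a method needs a tree. The sketch below works on a stand-in dictionary rather than a real call: the last group is shortened to the two names visible in the example above, and `needs_tree` is a hypothetical helper.

```python
# Stand-in for the distance_method_list() result; the "vegan-equivalent"
# group is truncated to the names shown in the example above.
methods = {
    "phylogenetic": ["dpcoa", "unifrac", "wunifrac"],
    "information": ["jsd"],
    "vegan-equivalent": ["bray", "canberra"],
}

# Invert the grouping: method name -> group name
group_of = {name: group for group, names in methods.items() for name in names}

def needs_tree(method):
    """True if the method is in the phylogenetic group (hypothetical helper)."""
    if method not in group_of:
        raise ValueError(f"unknown distance method: {method!r}")
    return group_of[method] == "phylogenetic"

tree_for_unifrac = needs_tree("unifrac")
tree_for_bray = needs_tree("bray")
```

This mirrors the methods table, where unifrac, wunifrac, and dpcoa are the entries marked as requiring phy_tree.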
pyloseq.distance_method_list ¶
Return all supported distance methods, grouped by backend.
R reference: distanceMethodList