Manipulation
All manipulation functions return new Phyloseq objects; inputs are never modified. Components that an operation does not directly affect are carried over unchanged: filtering taxa leaves the sample data as it was, filtering samples leaves the tax table as it was, and so on.
Subsetting
prune_taxa / prune_samples
Take an explicit list of names. Names not present in the object are silently ignored; the output preserves the order of the input list:
from pyloseq import prune_taxa, prune_samples
ps2 = prune_taxa(["OTU1", "OTU3", "OTU9"], ps)
ps2 = prune_samples(["S1", "S4", "S7"], ps)
Use these when you already have the names. Use subset_* when you need to filter by a condition.
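The name-matching rule described above (keep the order of the input list, silently skip unknown names) can be illustrated on a bare pandas table. This is a sketch of the semantics only, not pyloseq's implementation; the `otu` table and the helper `keep_named_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical taxa x samples count table.
otu = pd.DataFrame(
    {"S1": [5, 0, 2], "S2": [1, 3, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)


def keep_named_rows(table: pd.DataFrame, names: list[str]) -> pd.DataFrame:
    """Keep the rows listed in `names`, in that order, skipping unknown names."""
    present = [n for n in names if n in table.index]
    return table.loc[present].copy()  # copy, so the input is never modified


subset = keep_named_rows(otu, ["OTU3", "OTU9", "OTU1"])
print(list(subset.index))  # ['OTU3', 'OTU1']
```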
pyloseq.prune_taxa
Return a new Phyloseq containing only the specified taxa.
R reference: prune_taxa(taxa, x)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| names | list[str] \| Index | Taxa to keep. Order is preserved; names absent from | required |
| ps | Phyloseq | Source | required |
pyloseq.prune_samples
Return a new Phyloseq containing only the specified samples.
R reference: prune_samples(samples, x)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| names | list[str] \| Index | Samples to keep. Order is preserved; names absent from | required |
| ps | Phyloseq | Source | required |
subset_taxa / subset_samples
Filter by a predicate applied to the tax table or sample data. The predicate can be a callable or a pandas query string:
from pyloseq import subset_taxa, subset_samples
# Callable form — receives one row as a Series
ps2 = subset_taxa(ps, lambda t: t["Phylum"] == "Firmicutes")
ps2 = subset_samples(ps, lambda s: s["SampleType"] == "Soil")
# Query string form
ps2 = subset_taxa(ps, 'Phylum == "Bacteroidetes"')
ps2 = subset_samples(ps, 'SampleType == "Ocean" and Depth > 100')
subset_taxa requires tax_table; subset_samples requires sample_data.
For the string form, column names that contain spaces must be wrapped in backticks, for example '`Consensus Lineage` == "Bacteria"'. Use the callable form to avoid this.
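The quoting rule can be tried directly against pandas, which the notes further down name as the engine for the string form. The sketch below uses a hypothetical `tax` DataFrame in place of a real tax table and compares the two styles; it does not call pyloseq itself.

```python
import pandas as pd

# Hypothetical tax table with a column name that contains a space.
tax = pd.DataFrame(
    {
        "Phylum": ["Firmicutes", "Bacteroidetes", "Firmicutes"],
        "Consensus Lineage": ["Bacteria", "Bacteria", "Archaea"],
    },
    index=["OTU1", "OTU2", "OTU3"],
)

# String form: backticks are required around the spaced column name.
by_query = tax.query('`Consensus Lineage` == "Bacteria"')

# Callable form: a row-wise predicate needs no special quoting.
row_mask = tax.apply(lambda t: t["Consensus Lineage"] == "Bacteria", axis=1)
by_callable = tax[row_mask]

print(list(by_query.index))  # ['OTU1', 'OTU2']
```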
pyloseq.subset_taxa
Return a new Phyloseq with taxa matching a filter expression.
R reference: subset_taxa(physeq, ...)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| expr | Callable[..., Any] \| str | Either a callable applied row-wise to | required |
Notes
For the string form, rank columns with spaces (e.g. "Consensus Lineage") must be backtick-quoted for pandas.DataFrame.query. Use the callable form to avoid these quoting rules.
pyloseq.subset_samples
Return a new Phyloseq with samples matching a filter expression.
R reference: subset_samples(physeq, ...)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| expr | Callable[..., Any] \| str | Either a callable applied row-wise to | required |
Notes
For the string form, sample_data is evaluated with pandas.DataFrame.query. The sample ID is the index, so reference it as index (e.g. 'index == "S1"'), and column names containing spaces (such as "Consensus Lineage") must be wrapped in backticks. Use the callable form to avoid these quoting rules.
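As a rough illustration of the index rule, the following runs the same kinds of query strings against a plain pandas DataFrame. The `sam` table is a hypothetical stand-in for sample_data with an unnamed index; nothing here calls pyloseq.

```python
import pandas as pd

# Hypothetical sample metadata; sample IDs live in the (unnamed) index.
sam = pd.DataFrame(
    {"SampleType": ["Soil", "Ocean", "Ocean"], "Depth": [0, 50, 250]},
    index=["S1", "S2", "S3"],
)

# Select a sample by ID through the special name `index`.
one = sam.query('index == "S1"')

# Combine ordinary column conditions with `and`.
deep_ocean = sam.query('SampleType == "Ocean" and Depth > 100')

print(list(one.index), list(deep_ocean.index))  # ['S1'] ['S3']
```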
Filtering
filter_taxa
Applies a predicate to each taxon's abundance vector (one value per sample). Returns a pruned Phyloseq with only the taxa where the predicate returns True:
from pyloseq import filter_taxa
# Keep taxa with mean abundance > 0.001
ps2 = filter_taxa(ps, lambda x: x.mean() > 0.001)
This is equivalent to R's filter_taxa(physeq, flist, prune=TRUE). To get the boolean mask without pruning, use taxa_filter_mask.
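To make the mask-versus-prune distinction concrete, here is a minimal pandas sketch of what "apply a predicate to each taxon's abundance vector" amounts to. It assumes a taxa-by-samples layout and a hypothetical `otu` table; it illustrates the idea and is not pyloseq's code.

```python
import pandas as pd

# Hypothetical relative-abundance table: rows are taxa, columns are samples.
otu = pd.DataFrame(
    {"S1": [0.50, 0.0005, 0.4995], "S2": [0.70, 0.0, 0.30]},
    index=["OTU1", "OTU2", "OTU3"],
)

# The predicate sees one taxon at a time: a Series with one value per sample.
mask = otu.apply(lambda x: x.mean() > 0.001, axis=1)

# Pruning is then just row selection with that boolean mask.
pruned = otu.loc[mask]

print(mask.tolist())       # [True, False, True]
print(list(pruned.index))  # ['OTU1', 'OTU3']
```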
pyloseq.filter_taxa
Return a new Phyloseq containing only taxa that satisfy predicate.
R reference: filter_taxa(physeq, flist, prune=TRUE)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| predicate | Callable[[Series], bool] | A callable accepting a | required |
Notes
This corresponds to R's filter_taxa(physeq, flist, prune=TRUE). For the prune=FALSE behaviour (return the boolean mask without pruning), use taxa_filter_mask.
kOverA
Factory for a common filter that keeps taxa whose abundance is greater than A in at least k samples:
from pyloseq import kOverA, filter_taxa
# Keep taxa with raw count > 10 in at least 3 samples
ps2 = filter_taxa(ps, kOverA(3, 10))
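The rule stated in the reference entry (True if >= k samples have abundance > A) is small enough to write out by hand. The function `k_over_a` below is a hypothetical re-implementation for illustration, not the library's source.

```python
import pandas as pd


def k_over_a(k: int, a: float):
    """Return a predicate that is True when at least k values exceed a."""

    def predicate(x: pd.Series) -> bool:
        return bool((x > a).sum() >= k)

    return predicate


# One taxon's counts across five samples.
taxon = pd.Series([12, 0, 11, 30, 10], index=["S1", "S2", "S3", "S4", "S5"])

print(k_over_a(3, 10)(taxon))  # True: 12, 11 and 30 exceed 10
print(k_over_a(4, 10)(taxon))  # False: the count of exactly 10 is not > 10
```

Note the strict inequality: a count equal to the threshold does not count toward k.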
pyloseq.kOverA
Return a predicate: True if >= k samples have abundance > A.
R reference: kOverA(k, A)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k | int | Minimum number of samples that must exceed threshold. | required |
| A | float | Abundance threshold. | required |
taxa_filter_mask
Returns the boolean pd.Series from a predicate without pruning. Useful for inspecting which taxa would be removed before committing:
from pyloseq import taxa_filter_mask, kOverA
mask = taxa_filter_mask(ps, kOverA(3, 10))
print(mask.sum(), "taxa would be kept")
pyloseq.taxa_filter_mask
Return a boolean pd.Series (indexed by taxa) for a per-taxon predicate.
Use this when you want to inspect the mask before pruning.
To obtain a pruned Phyloseq directly, use filter_taxa.
R reference: filter_taxa(physeq, flist, prune=FALSE)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| predicate | Callable[[Series], bool] | A callable accepting a | required |
Transformation
transform_sample_counts
Applies a function column-by-column across the OTU table. Each call receives a pd.Series of abundances for one sample (indexed by taxa name) and must return a Series of equal length:
from pyloseq import transform_sample_counts
import numpy as np
# Relative abundance
ps_rel = transform_sample_counts(ps, lambda x: x / x.sum())
# Log-transform (adding a pseudocount)
ps_log = transform_sample_counts(ps, lambda x: np.log1p(x))
Warning
Samples where the total count is zero will produce NaN or inf after division. Filter out zero-count samples before normalizing, or handle them explicitly in the function.
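The warning can be reproduced on a plain pandas table. The sketch below shows the NaN column produced by an all-zero sample and two possible ways around it; the `otu` table and the helper `safe_relative` are hypothetical, and whether to keep or drop such samples is an analysis decision.

```python
import pandas as pd

# Hypothetical counts: S2 has no reads at all.
otu = pd.DataFrame(
    {"S1": [3, 1, 0], "S2": [0, 0, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

# Naive relative abundance: 0 / 0 turns every entry of S2 into NaN.
naive = otu.apply(lambda x: x / x.sum(), axis=0)


def safe_relative(x: pd.Series) -> pd.Series:
    """Divide by the sample total, leaving all-zero samples as zeros."""
    total = x.sum()
    return x / total if total > 0 else x.astype(float)


# Option 1: handle the zero total inside the function.
guarded = otu.apply(safe_relative, axis=0)

# Option 2: drop zero-count samples before normalizing.
nonzero = otu.loc[:, otu.sum(axis=0) > 0]
```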
pyloseq.transform_sample_counts
Apply a per-sample transformation to the abundance table.
R reference: transform_sample_counts(physeq, function(x) x / sum(x))
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| fn | Callable[[Series], Series] | A callable accepting a | required |
rarefy_even_depth
Random subsampling (rarefaction) to a uniform sequencing depth:
from pyloseq import rarefy_even_depth
ps_rare = rarefy_even_depth(ps, sample_size=10000, replace=False, rng_seed=42)
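Samples with fewer than sample_size counts are dropped. The default sample_size is the minimum sample sum in the dataset. Note that replace defaults to True (sampling with replacement), as the signature below shows. Sampling with replacement is not recommended for most analyses, so pass replace=False explicitly, as in the example above.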
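The difference between the two sampling modes can be sketched for a single sample with NumPy. This is an illustration of the statistical idea only: the source does not say how pyloseq draws its subsample, so the calls below are an assumption, and the `counts` vector is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical read counts for one sample across three taxa (1000 reads).
counts = np.array([500, 300, 200])
depth = 100

# Without replacement: pick `depth` of the observed reads, so no taxon
# can end up with more reads than it started with.
without = rng.multivariate_hypergeometric(counts, depth)

# With replacement: draw `depth` reads from the taxon proportions.
with_repl = rng.multinomial(depth, counts / counts.sum())

print(without.sum(), with_repl.sum())  # both equal the target depth
```

At depths far below the sample total the two modes give similar results; they diverge as the target depth approaches the number of reads available.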
pyloseq.rarefy_even_depth
rarefy_even_depth(
    ps: Phyloseq,
    sample_size: int | None = None,
    rng_seed: int | None = 42,
    replace: bool = True,
    trim_otus: bool = True,
    verbose: bool = True,
    compat: str | None = None,
) -> Phyloseq
Rarefy all samples to even sequencing depth by subsampling.
R reference: rarefy_even_depth(physeq, sample.size, rngseed, replace, trimOTUs, verbose)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| sample_size | int \| None | Target depth. Defaults to | None |
| rng_seed | int \| None | Seed for | 42 |
| replace | bool | If | True |
| trim_otus | bool | Remove taxa that are zero in all samples after rarefaction. | True |
| verbose | bool | Emit a warning when samples are dropped. | True |
| compat | str \| None | Reserved for future R-compatible RNG mode; currently ignored. | None |
Merging
merge_phyloseq
Combines multiple Phyloseq objects by taking the union of taxa and samples, summing counts where a taxon and sample pair is present in more than one object.
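The union-and-sum rule for the abundance table can be mimicked with two small pandas frames. The sketch covers only the count matrix (not sample data, taxonomy, or trees), and the tables `a` and `b` are hypothetical.

```python
import pandas as pd

# Two hypothetical taxa x samples tables that overlap in OTU2 and S2.
a = pd.DataFrame({"S1": [5, 2], "S2": [1, 0]}, index=["OTU1", "OTU2"])
b = pd.DataFrame({"S2": [4, 7], "S3": [3, 3]}, index=["OTU2", "OTU3"])

# Align on the union of rows and columns and add. fill_value=0 covers cells
# present in only one table; fillna(0) covers cells present in neither.
merged = a.add(b, fill_value=0).fillna(0).astype(int)

print(merged)
```

Here the only overlapping cell is (OTU2, S2), which becomes 0 + 4 = 4; every other cell is copied from whichever table defines it, or set to zero.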
pyloseq.merge_phyloseq
Merge two or more Phyloseq objects into one.
OTU abundances are summed for overlapping (taxa, sample) pairs. The union of all taxa and all samples is included (filling zeros where absent).
R reference: merge_phyloseq(...)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *objs | Phyloseq | Two or more | () |
Notes
For sample_data, tax_table, and refseq, overlapping keys are
resolved first-wins (the earliest object in objs that defines the key).
For the tree, the first non-null phy_tree is kept and pruned to the
merged taxa by the constructor; differing trees across inputs are not
reconciled.
merge_samples
Collapses samples that share the same value of a metadata variable. Abundance counts are summed across samples within each group; numeric metadata columns are averaged; non-numeric columns that are constant within a group are retained, others become NaN:
from pyloseq import merge_samples
# One row per SampleType
ps_grouped = merge_samples(ps, "SampleType")
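In plain pandas terms, the count side of this operation is a group-by over samples, and the numeric metadata side is a group-wise mean. The following sketch assumes a taxa-by-samples layout with hypothetical `otu` and `sam` tables; it leaves out the handling of non-numeric columns.

```python
import pandas as pd

# Hypothetical counts (taxa x samples) and sample metadata.
otu = pd.DataFrame(
    {"S1": [5, 0], "S2": [1, 3], "S3": [2, 2]},
    index=["OTU1", "OTU2"],
)
sam = pd.DataFrame(
    {"SampleType": ["Soil", "Soil", "Ocean"], "Depth": [10.0, 30.0, 200.0]},
    index=["S1", "S2", "S3"],
)

# Sum counts over the samples in each group: transpose so samples are rows,
# group by the metadata column, then transpose back.
grouped_counts = otu.T.groupby(sam["SampleType"]).sum().T

# Numeric metadata is averaged within each group.
grouped_depth = sam.groupby("SampleType")["Depth"].mean()

print(grouped_counts)
print(grouped_depth)
```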
pyloseq.merge_samples
merge_samples(
    ps: Phyloseq,
    group_var: str,
    fn: Callable[[Series], Any] | None = None,
) -> Phyloseq
Collapse samples that share a value in a metadata variable.
OTU abundances are summed within each group. Sample metadata is aggregated: numeric columns via fn (default numpy.mean), other columns via mode.
R reference: merge_samples(x, group, fun)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| group_var | str | Column name in | required |
| fn | Callable[[Series], Any] \| None | Aggregation function for numeric metadata columns. | None |
Notes
As in R's merge_samples, every numeric metadata column is aggregated
with fn (default mean). Numeric columns that are really identifiers
(e.g. a numeric subject ID) will be averaged into meaningless values; drop
or stringify such columns before merging if that is not desired.
merge_taxa
Collapses a list of taxa into a single representative, summing their counts. The archetype parameter names which taxon inherits the taxonomy annotation and reference sequence of the merged group.
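A minimal pandas sketch of the count side of this merge follows, using the default archetype rule given in the parameter table (the most abundant member, summed across all samples). The `otu` table is hypothetical, and taxonomy and reference sequences are not modelled.

```python
import pandas as pd

# Hypothetical taxa x samples counts.
otu = pd.DataFrame(
    {"S1": [5, 1, 9], "S2": [2, 8, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)
eqtaxa = ["OTU1", "OTU2"]

# Default archetype: the member with the largest total across all samples.
archetype = otu.loc[eqtaxa].sum(axis=1).idxmax()

# Sum the group per sample and store the result under the archetype's name.
summed = otu.loc[eqtaxa].sum(axis=0).rename(archetype)
merged = pd.concat([otu.drop(index=eqtaxa), summed.to_frame().T])

print(archetype)  # OTU2 (total 9 versus 7 for OTU1)
print(merged)
```

In this sketch the merged row is appended at the end; the row order pyloseq produces is not stated in the source.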
pyloseq.merge_taxa
Merge a set of taxa into a single representative.
Abundances are summed; the archetype's taxonomy row is retained.
R reference: merge_taxa(x, eqtaxa, archetype)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| eqtaxa | list[str] | Taxa to merge. | required |
| archetype | str \| None | The taxon whose metadata row is retained. Defaults to the most-abundant member (sum across all samples). | None |
Aggregation
tax_glom
Collapses taxa that share the same annotation at a given taxonomic rank. Abundance counts are summed within each unique value. This is the primary way to work at phylum or genus level.
By default (na_rm=True, as in the signature below), taxa with NaN at the target rank are dropped. Pass na_rm=False to keep them, collected into a single "Unknown" bin.
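The summing step, and the two ways of treating taxa with a missing rank, can be sketched with a pandas group-by. This is a simplification under stated assumptions: it groups on a single hypothetical `phylum` column and labels rows by rank value, whereas the reference entry says coarser ranks are also considered and the most abundant member's row is kept.

```python
import pandas as pd

# Hypothetical counts and a single rank column; OTU4 has no Phylum.
otu = pd.DataFrame(
    {"S1": [5, 1, 9, 4], "S2": [2, 8, 0, 6]},
    index=["OTU1", "OTU2", "OTU3", "OTU4"],
)
phylum = pd.Series(
    ["Firmicutes", "Firmicutes", "Bacteroidetes", None],
    index=otu.index,
    name="Phylum",
)

# Dropping taxa with a missing rank: groupby skips NaN keys by default.
dropped = otu.groupby(phylum).sum()

# Keeping them: relabel the missing values so they form one shared bin.
kept = otu.groupby(phylum.fillna("Unknown")).sum()

print(dropped)
print(kept)
```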
pyloseq.tax_glom
tax_glom(
    ps: Phyloseq,
    taxrank: str,
    na_rm: bool = True,
    bad_empty: tuple[Any, ...] = (None, "", " ", "\t"),
) -> Phyloseq
Agglomerate taxa to a specified taxonomic rank.
Taxa that share the same value at taxrank (and all coarser ranks) are summed; the most-abundant member's row is kept as the archetype.
R reference: tax_glom(physeq, taxrank, NArm, bad_empty)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| taxrank | str | Rank to agglomerate at (e.g. | required |
| na_rm | bool | Drop taxa whose value at | True |
| bad_empty | tuple[Any, ...] | Values treated as missing at | (None, '', ' ', '\t') |
tip_glom
Collapses taxa that are within a given patristic distance of each other on the phylogenetic tree. Requires phy_tree.
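The grouping step can be sketched with SciPy's hierarchical clustering on a small distance matrix. Which clustering routines pyloseq actually calls is not recoverable from this page, so the use of `linkage` and `fcluster` here is an assumption, and the `dist` matrix is a hypothetical set of patristic distances rather than one computed from a tree.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

tips = ["OTU1", "OTU2", "OTU3", "OTU4"]

# Hypothetical symmetric patristic distances between the four tips.
dist = np.array(
    [
        [0.00, 0.10, 0.80, 0.90],
        [0.10, 0.00, 0.85, 0.95],
        [0.80, 0.85, 0.00, 0.15],
        [0.90, 0.95, 0.15, 0.00],
    ]
)

# Average-linkage clustering on the condensed distance vector,
# then cut the dendrogram at height h = 0.2.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=0.2, criterion="distance")

# Collect tip names per cluster; each group would then be merged into one taxon.
groups: dict[int, list[str]] = {}
for tip, label in zip(tips, labels):
    groups.setdefault(int(label), []).append(tip)

print(sorted(groups.values()))  # [['OTU1', 'OTU2'], ['OTU3', 'OTU4']]
```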
pyloseq.tip_glom
Agglomerate taxa by phylogenetic distance.
Hierarchical clustering on pairwise patristic distances groups tips whose within-cluster distance is <= h; each group is merged via merge_taxa.
R reference: tip_glom(physeq, h, hcfun)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ps | Phyloseq | Source | required |
| h | float | Height cutoff for | 0.2 |
| hcfun | str | Linkage method passed to | 'average' |
Reshaping
psmelt
Converts from wide format (taxa × samples matrix) to long format (one row per taxon-sample combination). The output DataFrame includes columns for OTU, Sample, Abundance, all sample metadata variables, and all taxonomic rank columns:
from pyloseq import psmelt
long_df = psmelt(ps)
# Columns: OTU, Sample, Abundance, Kingdom, Phylum, ..., SampleType, ...
This is the same as calling ps.melt(). Use the resulting DataFrame for custom ggplot layers or non-phyloseq statistical tests.
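For readers who want to see what the wide-to-long conversion involves, the sketch below rebuilds a comparable long table from three hypothetical pandas frames. The column names follow the description above; the column order, and everything else about pyloseq's actual output, is an assumption.

```python
import pandas as pd

# Hypothetical components: counts, taxonomy, and sample metadata.
otu = pd.DataFrame({"S1": [5, 0], "S2": [1, 3]}, index=["OTU1", "OTU2"])
tax = pd.DataFrame({"Phylum": ["Firmicutes", "Bacteroidetes"]}, index=otu.index)
sam = pd.DataFrame({"SampleType": ["Soil", "Ocean"]}, index=["S1", "S2"])

long_df = (
    otu.rename_axis("OTU")
    .reset_index()
    # One row per (taxon, sample) pair.
    .melt(id_vars="OTU", var_name="Sample", value_name="Abundance")
    # Attach sample metadata, then taxonomy, by their respective IDs.
    .merge(sam, left_on="Sample", right_index=True)
    .merge(tax, left_on="OTU", right_index=True)
)

print(long_df)
```

The result has one row for each of the 2 x 2 taxon and sample combinations, including the zero count.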