
Manipulation

All manipulation functions return new Phyloseq objects; inputs are never modified. Every operation propagates all components that are not directly affected — so filtering taxa keeps the sample data unchanged, filtering samples keeps the tax table unchanged, and so on.


Subsetting

prune_taxa / prune_samples

Take an explicit list of names. Names not present in the object are silently ignored; the output preserves the order of the input list:

from pyloseq import prune_taxa, prune_samples

ps2 = prune_taxa(["OTU1", "OTU3", "OTU9"], ps)
ps2 = prune_samples(["S1", "S4", "S7"], ps)

Use these when you already have the names. Use subset_* when you need to filter by a condition.
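The name-handling rule described above (unknown names are ignored, the caller's order is kept) can be illustrated with plain pandas. This is a minimal sketch on a hypothetical taxa × samples table, not pyloseq's implementation; `otu` and `keep_known` are illustrative names that do not come from the library.

```python
import pandas as pd

# Hypothetical taxa x samples count table (rows = taxa, columns = samples).
otu = pd.DataFrame(
    {"S1": [5, 0, 3], "S2": [1, 2, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

def keep_known(names, index):
    """Return the requested names that exist in `index`, in request order."""
    present = set(index)
    return [n for n in names if n in present]

requested = ["OTU3", "OTU9", "OTU1"]  # OTU9 is not in the table
kept = keep_known(requested, otu.index)
pruned = otu.loc[kept]  # rows follow the order of `requested`
```

Here `OTU9` is dropped without an error, and the remaining rows come back as `OTU3`, `OTU1` rather than in the table's original order.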

pyloseq.prune_taxa

prune_taxa(
    names: list[str] | Index, ps: Phyloseq
) -> Phyloseq

Return a new Phyloseq containing only the specified taxa.

R reference: prune_taxa(taxa, x)

Parameters:

Name Type Description Default
names list[str] | Index

Taxa to keep. Order is preserved; names absent from ps are ignored.

required
ps Phyloseq

Source Phyloseq object.

required

pyloseq.prune_samples

prune_samples(
    names: list[str] | Index, ps: Phyloseq
) -> Phyloseq

Return a new Phyloseq containing only the specified samples.

R reference: prune_samples(samples, x)

Parameters:

Name Type Description Default
names list[str] | Index

Samples to keep. Order is preserved; names absent from ps are ignored.

required
ps Phyloseq

Source Phyloseq object.

required

subset_taxa / subset_samples

Filter by a predicate applied to the tax table or sample data. The predicate can be a callable or a pandas query string:

from pyloseq import subset_taxa, subset_samples

# Callable form — receives one row as a Series
ps2 = subset_taxa(ps, lambda t: t["Phylum"] == "Firmicutes")
ps2 = subset_samples(ps, lambda s: s["SampleType"] == "Soil")

# Query string form
ps2 = subset_taxa(ps, 'Phylum == "Bacteroidetes"')
ps2 = subset_samples(ps, 'SampleType == "Ocean" and Depth > 100')

subset_taxa requires tax_table; subset_samples requires sample_data.

For the string form, column names with spaces must be backtick-quoted: '`Consensus Lineage` == "Bacteria"'. Use the callable form to avoid this.
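The quoting difference between the two forms can be seen with pandas alone. The sketch below assumes a hypothetical tax table held in a DataFrame named `tax`; it uses `DataFrame.query` and a row-wise callable directly and does not call pyloseq.

```python
import pandas as pd

# Hypothetical tax table with a column name that contains a space.
tax = pd.DataFrame(
    {
        "Phylum": ["Firmicutes", "Bacteroidetes", "Firmicutes"],
        "Consensus Lineage": ["Bacteria", "Bacteria", "Archaea"],
    },
    index=["OTU1", "OTU2", "OTU3"],
)

# Query-string form: backticks are required around the spaced column name.
by_query = tax.query('`Consensus Lineage` == "Bacteria" and Phylum == "Firmicutes"')

# Callable form: ordinary Python indexing, no special quoting.
mask = tax.apply(
    lambda t: t["Consensus Lineage"] == "Bacteria" and t["Phylum"] == "Firmicutes",
    axis=1,
)
by_callable = tax[mask]
```

Both selections keep only `OTU1`.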

pyloseq.subset_taxa

subset_taxa(
    ps: Phyloseq, expr: Callable[..., Any] | str
) -> Phyloseq

Return a new Phyloseq with taxa matching a filter expression.

R reference: subset_taxa(physeq, ...)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object (must have tax_table).

required
expr Callable[..., Any] | str

Either a callable applied row-wise to tax_table returning a bool, or a pandas-query string evaluated against tax_table.

required
Notes

For the string form, rank columns with spaces (e.g. "Consensus Lineage") must be backtick-quoted for pandas.DataFrame.query. Use the callable form to avoid these quoting rules.

Examples:

>>> subset_taxa(ps, lambda t: t["Phylum"] == "Chlamydiae")
>>> subset_taxa(ps, 'Phylum == "Chlamydiae"')

pyloseq.subset_samples

subset_samples(
    ps: Phyloseq, expr: Callable[..., Any] | str
) -> Phyloseq

Return a new Phyloseq with samples matching a filter expression.

R reference: subset_samples(physeq, ...)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object (must have sample_data).

required
expr Callable[..., Any] | str

Either a callable applied row-wise to sample_data returning a bool, or a pandas-query string evaluated against sample_data.

required
Notes

For the string form, sample_data is evaluated with pandas.DataFrame.query. The sample ID is the index, so reference it as index (e.g. 'index == "S1"'), and column names containing spaces (such as "Consensus Lineage") must be wrapped in backticks. Use the callable form to avoid these quoting rules.

Examples:

>>> subset_samples(ps, lambda s: s["SampleType"] == "Soil")
>>> subset_samples(ps, 'SampleType == "Soil"')

Filtering

filter_taxa

Applies a predicate to each taxon's abundance vector (one value per sample). Returns a pruned Phyloseq with only the taxa where the predicate returns True:

from pyloseq import filter_taxa

# Keep taxa with mean abundance > 0.001
ps2 = filter_taxa(ps, lambda x: x.mean() > 0.001)

This is equivalent to R's filter_taxa(physeq, flist, prune=TRUE). To get the boolean mask without pruning, use taxa_filter_mask.

pyloseq.filter_taxa

filter_taxa(
    ps: Phyloseq, predicate: Callable[[Series], bool]
) -> Phyloseq

Return a new Phyloseq containing only taxa that satisfy predicate.

R reference: filter_taxa(physeq, flist, prune=TRUE)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required
predicate Callable[[Series], bool]

A callable accepting a pd.Series of abundances across all samples for one taxon, returning True to keep, False to drop.

required
Notes

This corresponds to R's filter_taxa(physeq, flist, prune=TRUE). For the prune=FALSE behaviour (return the boolean mask without pruning), use taxa_filter_mask.

kOverA

Factory for a common filter: keep taxa present in at least k samples with abundance greater than A:

from pyloseq import kOverA, filter_taxa

# Keep taxa with raw count > 10 in at least 3 samples
ps2 = filter_taxa(ps, kOverA(3, 10))
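To make the "k samples over A" rule concrete, the following sketch builds an equivalent predicate in plain pandas and applies it to each taxon's row. It is an illustration under assumptions, not the library's code: `k_over_a` and `otu` are hypothetical names, and the table is invented.

```python
import pandas as pd

def k_over_a(k, a):
    """Build a predicate: True when at least `k` values are greater than `a`."""
    def predicate(x):
        return bool((x > a).sum() >= k)
    return predicate

# Hypothetical taxa x samples counts.
otu = pd.DataFrame(
    {"S1": [12, 0, 50], "S2": [15, 11, 0], "S3": [30, 2, 0], "S4": [0, 40, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

# Apply the predicate to each taxon's abundance vector (one row per taxon).
mask = otu.apply(k_over_a(3, 10), axis=1)
filtered = otu[mask]
```

Only `OTU1` exceeds 10 in three samples, so it is the only taxon kept. Note that `OTU3` is dropped despite its single large count, because the rule counts samples rather than total abundance.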

pyloseq.kOverA

kOverA(k: int, A: float) -> Callable[[pd.Series], bool]

Return a predicate: True if >= k samples have abundance > A.

R reference: kOverA(k, A)

Parameters:

Name Type Description Default
k int

Minimum number of samples that must exceed threshold.

required
A float

Abundance threshold.

required

taxa_filter_mask

Returns the boolean pd.Series from a predicate without pruning. Useful for inspecting which taxa would be removed before committing:

from pyloseq import taxa_filter_mask, kOverA

mask = taxa_filter_mask(ps, kOverA(3, 10))
print(mask.sum(), "taxa would be kept")

pyloseq.taxa_filter_mask

taxa_filter_mask(
    ps: Phyloseq, predicate: Callable[[Series], bool]
) -> pd.Series

Return a boolean pd.Series (indexed by taxa) for a per-taxon predicate.

Use this when you want to inspect the mask before pruning. To obtain a pruned Phyloseq directly, use filter_taxa.

R reference: filter_taxa(physeq, flist, prune=FALSE)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required
predicate Callable[[Series], bool]

A callable accepting a pd.Series of abundances across all samples for one taxon, returning True to keep, False to drop.

required

Transformation

transform_sample_counts

Applies a function column-by-column across the OTU table. Each call receives a pd.Series of abundances for one sample (indexed by taxa name) and must return a Series of equal length:

from pyloseq import transform_sample_counts
import numpy as np

# Relative abundance
ps_rel = transform_sample_counts(ps, lambda x: x / x.sum())

# Log-transform (adding a pseudocount)
ps_log = transform_sample_counts(ps, lambda x: np.log1p(x))

Warning

Samples where the total count is zero will produce NaN or inf after division. Filter out zero-count samples before normalizing, or handle them explicitly in the function.
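Both remedies mentioned in the warning can be sketched with pandas on a hypothetical taxa × samples table. The names `otu` and `relative_abundance` are illustrative, and returning zeros for an empty sample is one possible convention, not something the library prescribes.

```python
import pandas as pd

# Hypothetical taxa x samples counts; sample S3 has no reads at all.
otu = pd.DataFrame(
    {"S1": [5, 15], "S2": [1, 3], "S3": [0, 0]},
    index=["OTU1", "OTU2"],
)

def relative_abundance(x):
    """Divide by the sample total, returning all zeros for an empty sample."""
    total = x.sum()
    if total == 0:
        return x.astype(float)  # avoid 0/0 -> NaN
    return x / total

# Option 1: guard inside the per-sample function (one column per sample).
rel = otu.apply(relative_abundance, axis=0)

# Option 2: drop zero-count samples before normalizing.
nonzero = otu.loc[:, otu.sum(axis=0) > 0]
rel_nonzero = nonzero / nonzero.sum(axis=0)
```

A function like `relative_abundance` could presumably be passed as the `fn` argument of `transform_sample_counts`, since it takes and returns a Series of equal length.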

pyloseq.transform_sample_counts

transform_sample_counts(
    ps: Phyloseq, fn: Callable[[Series], Series]
) -> Phyloseq

Apply a per-sample transformation to the abundance table.

R reference: transform_sample_counts(physeq, function(x) x / sum(x))

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required
fn Callable[[Series], Series]

A callable accepting a pd.Series of abundances for one sample (indexed by taxa name), returning a transformed series of equal length.

required

Examples:

>>> transform_sample_counts(ps, lambda x: x / x.sum())

rarefy_even_depth

Random subsampling (rarefaction) to a uniform sequencing depth:

from pyloseq import rarefy_even_depth

ps_rare = rarefy_even_depth(ps, sample_size=10000, replace=False, rng_seed=42)

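Samples with fewer than sample_size counts are dropped. If sample_size is omitted, it defaults to the minimum sample sum in the dataset. According to the signature below, replace defaults to True (sampling with replacement); since that is not recommended for most analyses, pass replace=False explicitly, as in the example above, to sample without replacement.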
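The difference between the two sampling modes can be sketched with NumPy for a single sample. This is an illustration under assumptions, not pyloseq's code: the counts are invented, and the read-pool construction follows the description of the without-replacement path in the parameter notes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical counts for one sample across four taxa (depth = 1000 reads).
counts = np.array([500, 300, 150, 50])
depth = 200

# With replacement: multinomial draw using the observed proportions.
with_repl = rng.multinomial(depth, counts / counts.sum())

# Without replacement: build the read pool (one entry per read, labelled by
# taxon index), draw `depth` reads, then count the reads per taxon.
pool = np.repeat(np.arange(counts.size), counts)
drawn = rng.choice(pool, size=depth, replace=False)
without_repl = np.bincount(drawn, minlength=counts.size)
```

Both results sum to `depth`, but only the without-replacement draw is guaranteed never to exceed a taxon's original count. The pool has one element per read, which is why that path can be memory-hungry for very deep samples.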

pyloseq.rarefy_even_depth

rarefy_even_depth(
    ps: Phyloseq,
    sample_size: int | None = None,
    rng_seed: int | None = 42,
    replace: bool = True,
    trim_otus: bool = True,
    verbose: bool = True,
    compat: str | None = None,
) -> Phyloseq

Rarefy all samples to even sequencing depth by subsampling.

R reference: rarefy_even_depth(physeq, sample.size, rngseed, replace, trimOTUs, verbose)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required
sample_size int | None

Target depth. Defaults to min(sample_sums(ps)).

None
rng_seed int | None

Seed for numpy.random.default_rng. Pass None for non-reproducible draws.

42
replace bool

If True (default), use multinomial sampling (with replacement). If False, sample without replacement from the read pool. Note that the without-replacement path materialises a read pool of size equal to each sample's depth, which can be large for very deep samples.

True
trim_otus bool

Remove taxa that are zero in all samples after rarefaction.

True
verbose bool

Emit a warning when samples are dropped.

True
compat str | None

Reserved for future R-compatible RNG mode; currently ignored.

None

Merging

merge_phyloseq

Combines multiple Phyloseq objects by taking the union of taxa and samples, summing counts where both are present:

from pyloseq import merge_phyloseq

merged = merge_phyloseq(ps1, ps2, ps3)
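The union-and-sum rule for the abundance table can be illustrated with two small pandas DataFrames. This sketch covers only the count matrix (not sample data, taxonomy, or trees), and the tables `otu_a` and `otu_b` are hypothetical.

```python
import pandas as pd

# Two hypothetical taxa x samples tables that overlap on OTU2 / S2.
otu_a = pd.DataFrame({"S1": [4, 1], "S2": [0, 7]}, index=["OTU1", "OTU2"])
otu_b = pd.DataFrame({"S2": [3, 2], "S3": [9, 5]}, index=["OTU2", "OTU3"])

# Union of rows and columns; overlapping cells are summed, and cells that
# exist in neither input are filled with zero.
merged = otu_a.add(otu_b, fill_value=0).fillna(0).astype(int)
```

The overlapping cell (`OTU2`, `S2`) becomes 7 + 3 = 10, while a cell such as (`OTU1`, `S3`), present in neither table, is filled with 0.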

pyloseq.merge_phyloseq

merge_phyloseq(*objs: Phyloseq) -> Phyloseq

Merge two or more Phyloseq objects into one.

OTU abundances are summed for overlapping (taxa, sample) pairs. The union of all taxa and all samples is included (filling zeros where absent).

R reference: merge_phyloseq(...)

Parameters:

Name Type Description Default
*objs Phyloseq

Two or more Phyloseq objects.

()
Notes

For sample_data, tax_table, and refseq, overlapping keys are resolved first-wins (the earliest object in objs that defines the key). For the tree, the first non-null phy_tree is kept and pruned to the merged taxa by the constructor; differing trees across inputs are not reconciled.

merge_samples

Collapses samples that share the same value of a metadata variable. Abundance counts are summed across samples within each group; numeric metadata columns are averaged; non-numeric columns that are constant within a group are retained, others become NaN:

from pyloseq import merge_samples

# One row per SampleType
ps_grouped = merge_samples(ps, "SampleType")
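As a rough picture of what grouping does to the two tables, the sketch below sums counts and averages a numeric metadata column with pandas `groupby`. The data and variable names are hypothetical, and handling of non-numeric columns is left out.

```python
import pandas as pd

# Hypothetical taxa x samples counts and matching sample metadata.
otu = pd.DataFrame(
    {"S1": [1, 0], "S2": [2, 5], "S3": [4, 4]},
    index=["OTU1", "OTU2"],
)
meta = pd.DataFrame(
    {"SampleType": ["Soil", "Soil", "Ocean"], "Depth": [10.0, 30.0, 200.0]},
    index=["S1", "S2", "S3"],
)

# Sum counts over samples sharing a SampleType (transpose so samples are rows).
grouped_otu = otu.T.groupby(meta["SampleType"]).sum().T

# Average numeric metadata within each group.
grouped_meta = meta.groupby("SampleType")[["Depth"]].mean()
```

After grouping, the columns of `grouped_otu` are the group labels (`Ocean`, `Soil`) rather than the original sample IDs.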

pyloseq.merge_samples

merge_samples(
    ps: Phyloseq,
    group_var: str,
    fn: Callable[[Series], Any] | None = None,
) -> Phyloseq

Collapse samples that share a value in a metadata variable.

OTU abundances are summed within each group. Sample metadata is aggregated: numeric columns via fn (default numpy.mean), other columns via mode.

R reference: merge_samples(x, group, fun)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object (must have sample_data).

required
group_var str

Column name in sample_data that defines the grouping.

required
fn Callable[[Series], Any] | None

Aggregation function for numeric metadata columns.

None
Notes

As in R's merge_samples, every numeric metadata column is aggregated with fn (default mean). Numeric columns that are really identifiers (e.g. a numeric subject ID) will be averaged into meaningless values; drop or stringify such columns before merging if that is not desired.

merge_taxa

Collapses a list of taxa into a single representative, summing their counts. The archetype parameter names the taxon whose taxonomy annotation and reference sequence are kept for the merged group; if it is omitted, the most-abundant member is used:

from pyloseq import merge_taxa

ps2 = merge_taxa(ps, ["OTU1", "OTU2", "OTU7"], archetype="OTU1")
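The effect on the count and taxonomy tables can be sketched in pandas. `merge_rows` is a hypothetical helper written for this illustration, not part of pyloseq; its default archetype follows the "most-abundant member" rule given in the parameter description.

```python
import pandas as pd

# Hypothetical taxa x samples counts and a one-rank tax table.
otu = pd.DataFrame(
    {"S1": [1, 10, 3], "S2": [2, 20, 0]},
    index=["OTU1", "OTU2", "OTU7"],
)
tax = pd.DataFrame(
    {"Genus": ["Bacillus", "Bacillus", "Listeria"]},
    index=["OTU1", "OTU2", "OTU7"],
)

def merge_rows(otu, tax, eqtaxa, archetype=None):
    """Sum `eqtaxa` rows into one row named after the archetype."""
    if archetype is None:
        # Default: the member with the largest total count across samples.
        archetype = otu.loc[eqtaxa].sum(axis=1).idxmax()
    dropped = [t for t in eqtaxa if t != archetype]
    merged_otu = otu.drop(index=dropped)
    merged_otu.loc[archetype] = otu.loc[eqtaxa].sum(axis=0)
    return merged_otu, tax.drop(index=dropped)

merged_otu, merged_tax = merge_rows(otu, tax, ["OTU1", "OTU2"])
```

With no archetype given, `OTU2` (total 30) is chosen over `OTU1` (total 3); its row then holds the summed counts and its taxonomy row is the one that survives.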

pyloseq.merge_taxa

merge_taxa(
    ps: Phyloseq,
    eqtaxa: list[str],
    archetype: str | None = None,
) -> Phyloseq

Merge a set of taxa into a single representative.

Abundances are summed; the archetype's taxonomy row is retained.

R reference: merge_taxa(x, eqtaxa, archetype)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required
eqtaxa list[str]

Taxa to merge.

required
archetype str | None

The taxon whose metadata row is retained. Defaults to the most-abundant member (sum across all samples).

None

Aggregation

tax_glom

Collapses taxa that share the same annotation at a given taxonomic rank. Abundance counts are summed within each unique value. This is the primary way to work at phylum or genus level:

from pyloseq import tax_glom

ps_phylum = tax_glom(ps, "Phylum")
ps_genus  = tax_glom(ps, "Genus")

By default (na_rm=True, per the signature below), taxa whose value at the target rank is missing (None, empty, or listed in bad_empty) are dropped. Pass na_rm=False to keep them, collected into a single "Unknown" bin.
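Agglomeration at a rank amounts to a grouped sum over the count table. The sketch below shows the na_rm=True behaviour (taxa with a missing value at the target rank are dropped) using pandas only. The tables are hypothetical, and choosing an archetype row per group is omitted.

```python
import pandas as pd

# Hypothetical taxa x samples counts and a two-rank tax table.
otu = pd.DataFrame(
    {"S1": [5, 1, 2, 7], "S2": [0, 4, 2, 1]},
    index=["OTU1", "OTU2", "OTU3", "OTU4"],
)
tax = pd.DataFrame(
    {
        "Phylum": ["Firmicutes", "Firmicutes", "Bacteroidetes", "Firmicutes"],
        "Genus": ["Bacillus", "Bacillus", "Bacteroides", None],
    },
    index=otu.index,
)

ranks = ["Phylum", "Genus"]  # group on the target rank and all coarser ranks

# na_rm=True behaviour: drop taxa with a missing Genus before grouping.
known = tax.dropna(subset=["Genus"])
keys = [known[r] for r in ranks]
genus_counts = otu.loc[known.index].groupby(keys).sum()
```

Grouping on the coarser ranks as well keeps two genera with the same name in different phyla from being merged by accident.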

pyloseq.tax_glom

tax_glom(
    ps: Phyloseq,
    taxrank: str,
    na_rm: bool = True,
    bad_empty: tuple[Any, ...] = (None, "", " ", "\t"),
) -> Phyloseq

Agglomerate taxa to a specified taxonomic rank.

Taxa that share the same value at taxrank (and all coarser ranks) are summed; the most-abundant member's row is kept as the archetype.

R reference: tax_glom(physeq, taxrank, NArm, bad_empty)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object (must have tax_table).

required
taxrank str

Rank to agglomerate at (e.g. "Genus").

required
na_rm bool

Drop taxa whose value at taxrank is None, empty, or in bad_empty.

True
bad_empty tuple[Any, ...]

Values treated as missing at taxrank.

(None, '', ' ', '\t')

tip_glom

Collapses taxa that are within a given patristic distance of each other on the phylogenetic tree. Requires phy_tree:

from pyloseq import tip_glom

ps2 = tip_glom(ps, h=0.05)
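The clustering step can be sketched with SciPy on a hand-written distance matrix. This is an assumption-laden illustration: the matrix is invented rather than computed from a tree, and cutting with `criterion="distance"` is one plausible reading of "height cutoff", not a confirmed detail of pyloseq.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

tips = ["OTU1", "OTU2", "OTU3", "OTU4"]

# Hypothetical symmetric patristic distance matrix between tips.
dist = np.array(
    [
        [0.00, 0.02, 0.50, 0.50],
        [0.02, 0.00, 0.50, 0.50],
        [0.50, 0.50, 0.00, 0.04],
        [0.50, 0.50, 0.04, 0.00],
    ]
)

# linkage expects the condensed (upper-triangle) form of the matrix.
tree = linkage(squareform(dist), method="average")

# Cut the dendrogram at height h; tips sharing a label would be merged.
labels = fcluster(tree, t=0.05, criterion="distance")
groups = {}
for tip, label in zip(tips, labels):
    groups.setdefault(int(label), []).append(tip)
```

With h = 0.05 the tips fall into two groups, `OTU1` with `OTU2` and `OTU3` with `OTU4`; each such group would then be collapsed into one taxon.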

pyloseq.tip_glom

tip_glom(
    ps: Phyloseq, h: float = 0.2, hcfun: str = "average"
) -> Phyloseq

Agglomerate taxa by phylogenetic distance.

Hierarchical clustering on pairwise patristic distances groups tips whose within-cluster distance is <= h; each group is merged via merge_taxa.

R reference: tip_glom(physeq, h, hcfun)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object (must have phy_tree).

required
h float

Height cutoff for scipy.cluster.hierarchy.fcluster.

0.2
hcfun str

Linkage method passed to scipy.cluster.hierarchy.linkage (e.g. "average", "complete", "ward").

'average'

Reshaping

psmelt

Converts from wide format (taxa × samples matrix) to long format (one row per taxon-sample combination). The output DataFrame includes columns for OTU, Sample, Abundance, all sample metadata variables, and all taxonomic rank columns:

from pyloseq import psmelt

long_df = psmelt(ps)
# Columns: OTU, Sample, Abundance, SampleType, ..., Kingdom, Phylum, ...

This is the same as calling ps.melt(). Use the resulting DataFrame for custom ggplot layers or non-phyloseq statistical tests.
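The wide-to-long reshaping can be reproduced with pandas on three small hypothetical tables. This sketch shows the shape of the result only and is not pyloseq's implementation; the tables `otu`, `meta`, and `tax` are invented.

```python
import pandas as pd

# Hypothetical count table, sample metadata, and tax table.
otu = pd.DataFrame({"S1": [5, 0], "S2": [1, 2]}, index=["OTU1", "OTU2"])
meta = pd.DataFrame({"SampleType": ["Soil", "Ocean"]}, index=["S1", "S2"])
tax = pd.DataFrame({"Phylum": ["Firmicutes", "Bacteroidetes"]}, index=["OTU1", "OTU2"])

# Wide -> long: one row per (OTU, Sample) pair.
long_df = (
    otu.rename_axis(index="OTU")
    .reset_index()
    .melt(id_vars="OTU", var_name="Sample", value_name="Abundance")
)

# Attach sample metadata, then taxonomy, by joining on the ID columns.
long_df = long_df.join(meta, on="Sample").join(tax, on="OTU")
```

Two taxa and two samples give four rows, each carrying its abundance together with the matching sample metadata and taxonomy.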

pyloseq.psmelt

psmelt(ps: Phyloseq) -> pd.DataFrame

Melt a Phyloseq into a long-form tidy DataFrame.

Returns one row per (OTU, Sample) pair with columns: ["OTU", "Sample", "Abundance", *sample_variables, *rank_names].

R reference: psmelt(physeq)

Parameters:

Name Type Description Default
ps Phyloseq

Source Phyloseq object.

required