Manipulation

All manipulation functions return new Phyloseq objects; inputs are never modified. Every operation propagates all components that are not directly affected — so filtering taxa keeps the sample data unchanged, filtering samples keeps the tax table unchanged, and so on.

Subsetting

prune_taxa / prune_samples

Take an explicit list of names. Names not present in the object are silently ignored; the output preserves the order of the input list:

from pyloseq import prune_taxa, prune_samples

ps2 = prune_taxa(["OTU1", "OTU3", "OTU9"], ps)
ps2 = prune_samples(["S1", "S4", "S7"], ps)

Use these when you already have the names. Use subset_* when you need to filter by a condition.
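The name-handling rule can be illustrated with plain pandas. This is only a sketch of the documented behaviour (unknown names ignored, request order kept), not pyloseq's implementation; the helper `keep_known` and the toy table are hypothetical.

```python
import pandas as pd

# Toy taxa x samples abundance table (illustrative data only).
otu = pd.DataFrame(
    {"S1": [5, 0, 2], "S2": [1, 3, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

def keep_known(names, index):
    """Return the requested names that exist in `index`, in request order."""
    present = set(index)
    return [n for n in names if n in present]

requested = ["OTU3", "OTU9", "OTU1"]      # OTU9 is not in the table
kept = keep_known(requested, otu.index)   # unknown name is skipped
pruned = otu.loc[kept]                    # new frame; `otu` is untouched
```

Because the result follows the order of the requested names, the same pattern can also be used to reorder taxa or samples.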

pyloseq.prune_taxa

prune_taxa(
    names: list[str] | Index, ps: Phyloseq
) -> Phyloseq

Return a new Phyloseq containing only the specified taxa.

R reference: prune_taxa(taxa, x)

Parameters:

Name	Type	Description	Default
`names`	`list[str] \| Index`	Taxa to keep. Order is preserved; names absent from `ps` are ignored.	required
`ps`	`Phyloseq`	Source `Phyloseq` object.	required

pyloseq.prune_samples

prune_samples(
    names: list[str] | Index, ps: Phyloseq
) -> Phyloseq

Return a new Phyloseq containing only the specified samples.

R reference: prune_samples(samples, x)

Parameters:

Name	Type	Description	Default
`names`	`list[str] \| Index`	Samples to keep. Order is preserved; names absent from `ps` are ignored.	required
`ps`	`Phyloseq`	Source `Phyloseq` object.	required

subset_taxa / subset_samples

Filter by a predicate applied to the tax table or sample data. The predicate can be a callable or a pandas query string:

from pyloseq import subset_taxa, subset_samples

# Callable form — receives one row as a Series
ps2 = subset_taxa(ps, lambda t: t["Phylum"] == "Firmicutes")
ps2 = subset_samples(ps, lambda s: s["SampleType"] == "Soil")

# Query string form
ps2 = subset_taxa(ps, 'Phylum == "Bacteroidetes"')
ps2 = subset_samples(ps, 'SampleType == "Ocean" and Depth > 100')

subset_taxa requires tax_table; subset_samples requires sample_data.

For the string form, column names with spaces must be backtick-quoted. An unquoted or backslash-escaped name such as 'Consensus\ Lineage == "Bacteria"' does not parse; write '`Consensus Lineage` == "Bacteria"' instead. Use the callable form to avoid this.
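The quoting rule comes from pandas itself and can be tried on a plain DataFrame. The sketch below uses a hypothetical toy tax table and calls pandas directly rather than pyloseq; it shows the backtick-quoted query next to the equivalent row-wise callable.

```python
import pandas as pd

# Toy tax table with a column name that contains a space.
tax = pd.DataFrame(
    {
        "Consensus Lineage": ["Bacteria", "Archaea", "Bacteria"],
        "Phylum": ["Firmicutes", "Euryarchaeota", "Bacteroidetes"],
    },
    index=["OTU1", "OTU2", "OTU3"],
)

# Backticks let query() treat the spaced column name as one identifier.
bacteria = tax.query('`Consensus Lineage` == "Bacteria"')

# The same filter as a row-wise callable needs no special quoting.
mask = tax.apply(lambda t: t["Consensus Lineage"] == "Bacteria", axis=1)
same = tax[mask]
```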

pyloseq.subset_taxa

subset_taxa(
    ps: Phyloseq, expr: Callable[..., Any] | str
) -> Phyloseq

Return a new Phyloseq with taxa matching a filter expression.

R reference: subset_taxa(physeq, ...)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object (must have `tax_table`).	required
`expr`	`Callable[..., Any] \| str`	Either a callable applied row-wise to `tax_table` returning a bool, or a pandas-query string evaluated against `tax_table`.	required

Notes

For the string form, rank columns with spaces (e.g. "Consensus Lineage") must be backtick-quoted for pandas.DataFrame.query. Use the callable form to avoid these quoting rules.

Examples:

>>> subset_taxa(ps, lambda t: t["Phylum"] == "Chlamydiae")
>>> subset_taxa(ps, 'Phylum == "Chlamydiae"')

pyloseq.subset_samples

subset_samples(
    ps: Phyloseq, expr: Callable[..., Any] | str
) -> Phyloseq

Return a new Phyloseq with samples matching a filter expression.

R reference: subset_samples(physeq, ...)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object (must have `sample_data`).	required
`expr`	`Callable[..., Any] \| str`	Either a callable applied row-wise to `sample_data` returning a bool, or a pandas-query string evaluated against `sample_data`.	required

Notes

For the string form, sample_data is evaluated with pandas.DataFrame.query. The sample ID is the index, so reference it as index (e.g. 'index == "S1"'), and column names containing spaces (such as "Consensus Lineage") must be wrapped in backticks. Use the callable form to avoid these quoting rules.

Examples:

>>> subset_samples(ps, lambda s: s["SampleType"] == "Soil")
>>> subset_samples(ps, 'SampleType == "Soil"')

Filtering

filter_taxa

Applies a predicate to each taxon's abundance vector (one value per sample). Returns a pruned Phyloseq with only the taxa where the predicate returns True:

from pyloseq import filter_taxa

# Keep taxa with mean abundance > 0.001
ps2 = filter_taxa(ps, lambda x: x.mean() > 0.001)

This is equivalent to R's filter_taxa(physeq, flist, prune=TRUE). To get the boolean mask without pruning, use taxa_filter_mask.

pyloseq.filter_taxa

filter_taxa(
    ps: Phyloseq, predicate: Callable[[Series], bool]
) -> Phyloseq

Return a new Phyloseq containing only taxa that satisfy predicate.

R reference: filter_taxa(physeq, flist, prune=TRUE)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required
`predicate`	`Callable[[Series], bool]`	A callable accepting a `pd.Series` of abundances across all samples for one taxon, returning `True` to keep, `False` to drop.	required

Notes

This corresponds to R's filter_taxa(physeq, flist, prune=TRUE). For the prune=FALSE behaviour (return the boolean mask without pruning), use taxa_filter_mask.

kOverA

Factory for a common filter: keep taxa present in at least k samples with abundance greater than A:

from pyloseq import kOverA, filter_taxa

# Keep taxa with raw count > 10 in at least 3 samples
ps2 = filter_taxa(ps, kOverA(3, 10))
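To make the "at least k samples above A" rule concrete, here is a minimal stand-alone sketch in pandas. The factory `k_over_a` is a hypothetical re-implementation written for illustration, not pyloseq's own code, and the counts are made up.

```python
import pandas as pd

def k_over_a(k, a):
    """Build a predicate: True when at least k values are greater than a."""
    def predicate(x):
        return int((x > a).sum()) >= k
    return predicate

# Toy taxa x samples counts.
otu = pd.DataFrame(
    {"S1": [12, 0, 50], "S2": [15, 11, 0], "S3": [30, 2, 0], "S4": [0, 40, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

# Apply the predicate to each taxon's row of per-sample counts.
keep = otu.apply(k_over_a(3, 10), axis=1)
filtered = otu.loc[keep]
```

Only OTU1 exceeds 10 in three samples. OTU3 has the single largest count but passes the threshold in one sample only, so a prevalence filter like this removes it.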

pyloseq.kOverA

kOverA(k: int, A: float) -> Callable[[pd.Series], bool]

Return a predicate: True if >= k samples have abundance > A.

R reference: kOverA(k, A)

Parameters:

Name	Type	Description	Default
`k`	`int`	Minimum number of samples that must exceed threshold.	required
`A`	`float`	Abundance threshold.	required

taxa_filter_mask

Returns the boolean pd.Series from a predicate without pruning. Useful for inspecting which taxa would be removed before committing:

from pyloseq import taxa_filter_mask, kOverA

mask = taxa_filter_mask(ps, kOverA(3, 10))
print(mask.sum(), "taxa would be kept")
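Since the mask is an ordinary boolean Series indexed by taxa, the usual pandas idioms apply to it. The sketch below uses a hand-built stand-in Series (hypothetical values) instead of calling pyloseq.

```python
import pandas as pd

# Stand-in for the boolean Series that taxa_filter_mask is documented to return.
mask = pd.Series(
    [True, False, True, False],
    index=["OTU1", "OTU2", "OTU3", "OTU4"],
)

n_kept = int(mask.sum())               # number of taxa that pass
dropped = mask[~mask].index.tolist()   # names of taxa that would be removed
kept_fraction = float(mask.mean())     # share of taxa retained
```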

pyloseq.taxa_filter_mask

taxa_filter_mask(
    ps: Phyloseq, predicate: Callable[[Series], bool]
) -> pd.Series

Return a boolean pd.Series (indexed by taxa) for a per-taxon predicate.

Use this when you want to inspect the mask before pruning. To obtain a pruned Phyloseq directly, use filter_taxa.

R reference: filter_taxa(physeq, flist, prune=FALSE)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required
`predicate`	`Callable[[Series], bool]`	A callable accepting a `pd.Series` of abundances across all samples for one taxon, returning `True` to keep, `False` to drop.	required

Transformation

transform_sample_counts

Applies a function column-by-column across the OTU table. Each call receives a pd.Series of abundances for one sample (indexed by taxa name) and must return a Series of equal length:

from pyloseq import transform_sample_counts
import numpy as np

# Relative abundance
ps_rel = transform_sample_counts(ps, lambda x: x / x.sum())

# Log-transform (adding a pseudocount)
ps_log = transform_sample_counts(ps, lambda x: np.log1p(x))

Warning

Samples where the total count is zero will produce NaN or inf after division. Filter out zero-count samples before normalizing, or handle them explicitly in the function.
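One way to handle empty samples explicitly is to guard the division inside the function. The sketch below demonstrates this on a plain DataFrame whose columns are samples; `relative_abundance` is a hypothetical helper, and returning zeros for an empty sample is one possible policy among several.

```python
import pandas as pd

def relative_abundance(x):
    """Divide by the sample total; return all zeros when the total is 0."""
    total = x.sum()
    if total == 0:
        return x.astype(float)
    return x / total

# Toy table with one sample that has no reads at all.
otu = pd.DataFrame(
    {"S1": [5, 15, 0], "Empty": [0, 0, 0]},
    index=["OTU1", "OTU2", "OTU3"],
)

naive = otu.apply(lambda x: x / x.sum())   # 'Empty' column becomes NaN
safe = otu.apply(relative_abundance)       # 'Empty' column stays 0.0
```

A function of this shape has the Series-in, Series-out signature that transform_sample_counts expects for fn.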

pyloseq.transform_sample_counts

transform_sample_counts(
    ps: Phyloseq, fn: Callable[[Series], Series]
) -> Phyloseq

Apply a per-sample transformation to the abundance table.

R reference: transform_sample_counts(physeq, function(x) x / sum(x))

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required
`fn`	`Callable[[Series], Series]`	A callable accepting a `pd.Series` of abundances for one sample (indexed by taxa name), returning a transformed series of equal length.	required

Examples:

>>> transform_sample_counts(ps, lambda x: x / x.sum())

rarefy_even_depth

Random subsampling (rarefaction) to a uniform sequencing depth:

from pyloseq import rarefy_even_depth

ps_rare = rarefy_even_depth(ps, sample_size=10000, replace=False, rng_seed=42)

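Samples with fewer than sample_size counts are dropped. The default sample_size is the minimum sample sum in the dataset. replace defaults to True (multinomial sampling with replacement); pass replace=False, as in the example above, to sample without replacement. Sampling with replacement is not recommended for most analyses.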
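The difference between the two sampling modes can be sketched with NumPy for a single sample. This is an illustration of the general technique under the assumption that the library follows the approach described in its parameter table; `rarefy_counts` is a hypothetical helper, not pyloseq code.

```python
import numpy as np

def rarefy_counts(counts, depth, replace, rng):
    """Subsample one sample's count vector to `depth` reads."""
    counts = np.asarray(counts)
    if replace:
        # With replacement: multinomial draw using the observed proportions.
        return rng.multinomial(depth, counts / counts.sum())
    # Without replacement: build the read pool (one taxon index per read),
    # draw `depth` reads from it, then count the reads per taxon.
    pool = np.repeat(np.arange(counts.size), counts)
    drawn = rng.choice(pool, size=depth, replace=False)
    return np.bincount(drawn, minlength=counts.size)

rng = np.random.default_rng(42)
sample = [60, 30, 10, 0]
with_repl = rarefy_counts(sample, 50, replace=True, rng=rng)
without_repl = rarefy_counts(sample, 50, replace=False, rng=rng)
```

Without replacement, a taxon can never end up with more reads than it started with, whereas the multinomial draw has no such bound. The read pool in that branch has one entry per read, which is the memory cost noted for replace=False.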

pyloseq.rarefy_even_depth

rarefy_even_depth(
    ps: Phyloseq,
    sample_size: int | None = None,
    rng_seed: int | None = 42,
    replace: bool = True,
    trim_otus: bool = True,
    verbose: bool = True,
    compat: str | None = None,
) -> Phyloseq

Rarefy all samples to even sequencing depth by subsampling.

R reference: rarefy_even_depth(physeq, sample.size, rngseed, replace, trimOTUs, verbose)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required
`sample_size`	`int \| None`	Target depth. Defaults to `min(sample_sums(ps))`.	`None`
`rng_seed`	`int \| None`	Seed for `numpy.random.default_rng`. Pass `None` for non-reproducible draws.	`42`
`replace`	`bool`	If `True` (default), use multinomial sampling (with replacement). If `False`, sample without replacement from the read pool. Note that the without-replacement path materialises a read pool of size equal to each sample's depth, which can be large for very deep samples.	`True`
`trim_otus`	`bool`	Remove taxa that are zero in all samples after rarefaction.	`True`
`verbose`	`bool`	Emit a warning when samples are dropped.	`True`
`compat`	`str \| None`	Reserved for future R-compatible RNG mode; currently ignored.	`None`

Merging

merge_phyloseq

Combines multiple Phyloseq objects by taking the union of taxa and samples, summing counts where both are present:

from pyloseq import merge_phyloseq

merged = merge_phyloseq(ps1, ps2, ps3)

pyloseq.merge_phyloseq

merge_phyloseq(*objs: Phyloseq) -> Phyloseq

Merge two or more Phyloseq objects into one.

OTU abundances are summed for overlapping (taxa, sample) pairs. The union of all taxa and all samples is included (filling zeros where absent).

R reference: merge_phyloseq(...)

Parameters:

Name	Type	Description	Default
`*objs`	`Phyloseq`	Two or more `Phyloseq` objects.	`()`

Notes

For sample_data, tax_table, and refseq, overlapping keys are resolved first-wins (the earliest object in objs that defines the key). For the tree, the first non-null phy_tree is kept and pruned to the merged taxa by the constructor; differing trees across inputs are not reconciled.
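The union-and-sum rule for abundances and the first-wins rule for metadata can both be mimicked with pandas. The sketch below works on bare DataFrames with made-up values and is meant to illustrate the semantics, not to reproduce pyloseq's internals.

```python
import pandas as pd

otu_a = pd.DataFrame({"S1": [4, 1], "S2": [0, 2]}, index=["OTU1", "OTU2"])
otu_b = pd.DataFrame({"S2": [3, 5], "S3": [7, 0]}, index=["OTU2", "OTU3"])

# Union of taxa and samples; overlapping cells are summed, gaps become 0.
merged_otu = otu_a.add(otu_b, fill_value=0).fillna(0).astype(int)

meta_a = pd.DataFrame({"SampleType": ["Soil", "Soil"]}, index=["S1", "S2"])
meta_b = pd.DataFrame({"SampleType": ["Ocean", "Ocean"]}, index=["S2", "S3"])

# First-wins: values from meta_a take precedence for shared sample IDs.
merged_meta = meta_a.combine_first(meta_b)
```

Sample S2 appears in both inputs with conflicting SampleType values. The merged metadata keeps "Soil" from the first object, while the counts for (OTU2, S2) are added together.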

merge_samples

Collapses samples that share the same value of a metadata variable. Abundance counts are summed across samples within each group; numeric metadata columns are averaged; non-numeric columns that are constant within a group are retained, others become NaN:

from pyloseq import merge_samples

# One row per SampleType
ps_grouped = merge_samples(ps, "SampleType")

pyloseq.merge_samples

merge_samples(
    ps: Phyloseq,
    group_var: str,
    fn: Callable[[Series], Any] | None = None,
) -> Phyloseq

Collapse samples that share a value in a metadata variable.

OTU abundances are summed within each group. Sample metadata is aggregated: numeric columns via fn (default numpy.mean), other columns via mode.

R reference: merge_samples(x, group, fun)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object (must have `sample_data`).	required
`group_var`	`str`	Column name in `sample_data` that defines the grouping.	required
`fn`	`Callable[[Series], Any] \| None`	Aggregation function for numeric metadata columns.	`None`

Notes

As in R's merge_samples, every numeric metadata column is aggregated with fn (default mean). Numeric columns that are really identifiers (e.g. a numeric subject ID) will be averaged into meaningless values; drop or stringify such columns before merging if that is not desired.
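The identifier pitfall is easy to reproduce with a pandas groupby. The sketch below uses a hypothetical metadata table with a numeric SubjectID column; it shows the summed counts, the meaningless averaged ID, and the stringify workaround.

```python
import pandas as pd

otu = pd.DataFrame(
    {"S1": [4, 0], "S2": [6, 2], "S3": [1, 9]},
    index=["OTU1", "OTU2"],
)
meta = pd.DataFrame(
    {"SampleType": ["Soil", "Soil", "Ocean"],
     "Depth": [10.0, 30.0, 200.0],
     "SubjectID": [101, 104, 250]},   # numeric, but really an identifier
    index=["S1", "S2", "S3"],
)

# Sum abundances over the samples in each group (columns are samples).
grouped_otu = otu.T.groupby(meta["SampleType"]).sum().T

# Averaging every numeric column also averages the identifier.
grouped_meta = meta.groupby("SampleType")[["Depth", "SubjectID"]].mean()

# Stringifying the identifier first keeps it out of the numeric columns.
meta_fixed = meta.assign(SubjectID=meta["SubjectID"].astype(str))
numeric_cols = meta_fixed.select_dtypes("number").columns.tolist()
```

The Soil group ends up with a SubjectID of 102.5, which identifies nobody. After the conversion only Depth is treated as numeric.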

merge_taxa

Collapses a list of taxa into a single representative, summing their counts. The archetype parameter names the taxon whose taxonomy annotation and reference sequence are retained for the merged group:

from pyloseq import merge_taxa

ps2 = merge_taxa(ps, ["OTU1", "OTU2", "OTU7"], archetype="OTU1")

pyloseq.merge_taxa

merge_taxa(
    ps: Phyloseq,
    eqtaxa: list[str],
    archetype: str | None = None,
) -> Phyloseq

Merge a set of taxa into a single representative.

Abundances are summed; the archetype's taxonomy row is retained.

R reference: merge_taxa(x, eqtaxa, archetype)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required
`eqtaxa`	`list[str]`	Taxa to merge.	required
`archetype`	`str \| None`	The taxon whose metadata row is retained. Defaults to the most-abundant member (sum across all samples).	`None`
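The default archetype choice and the summing step can be sketched on a bare abundance table. This is an illustrative pandas version with made-up counts, assuming "most abundant" means the largest row total as the parameter description states; it is not pyloseq's implementation.

```python
import pandas as pd

otu = pd.DataFrame(
    {"S1": [5, 20, 1, 8], "S2": [0, 30, 4, 2]},
    index=["OTU1", "OTU2", "OTU7", "OTU9"],
)
eqtaxa = ["OTU1", "OTU2", "OTU7"]

# Default archetype: the member with the largest total across samples.
archetype = otu.loc[eqtaxa].sum(axis=1).idxmax()

# Sum the group's counts into the archetype row, then drop the other members.
merged = otu.copy()
merged.loc[archetype] = otu.loc[eqtaxa].sum(axis=0)
merged = merged.drop(index=[t for t in eqtaxa if t != archetype])
```

Per-sample totals are unchanged by the merge; only the number of rows shrinks.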

Aggregation

tax_glom

Collapses taxa that share the same annotation at a given taxonomic rank. Abundance counts are summed within each unique value. This is the primary way to work at phylum or genus level:

from pyloseq import tax_glom

ps_phylum = tax_glom(ps, "Phylum")
ps_genus  = tax_glom(ps, "Genus")

Taxa with a missing value at the target rank (NaN, None, or anything in bad_empty) are dropped by default (na_rm=True). Pass na_rm=False to keep them, collected into a single "Unknown" bin.

pyloseq.tax_glom

tax_glom(
    ps: Phyloseq,
    taxrank: str,
    na_rm: bool = True,
    bad_empty: tuple[Any, ...] = (None, "", " ", "\t"),
) -> Phyloseq

Agglomerate taxa to a specified taxonomic rank.

Taxa that share the same value at taxrank (and all coarser ranks) are summed; the most-abundant member's row is kept as the archetype.

R reference: tax_glom(physeq, taxrank, NArm, bad_empty)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object (must have `tax_table`).	required
`taxrank`	`str`	Rank to agglomerate at (e.g. `"Genus"`).	required
`na_rm`	`bool`	Drop taxa whose value at `taxrank` is `None`, empty, or in `bad_empty`.	`True`
`bad_empty`	`tuple[Any, ...]`	Values treated as missing at `taxrank`.	`(None, '', ' ', '\t')`
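A rough pandas analogue of agglomeration at Genus level is shown below. It mirrors the na_rm=True behaviour from the parameter table and groups on the target rank plus the coarser rank; the tables are hypothetical, and the sketch omits the archetype bookkeeping that the real function performs.

```python
import pandas as pd

otu = pd.DataFrame(
    {"S1": [5, 3, 2, 7], "S2": [1, 0, 4, 6]},
    index=["OTU1", "OTU2", "OTU3", "OTU4"],
)
tax = pd.DataFrame(
    {"Phylum": ["Firmicutes", "Firmicutes", "Bacteroidetes", "Firmicutes"],
     "Genus": ["Bacillus", "Bacillus", "Bacteroides", None]},
    index=otu.index,
)

ranks = ["Phylum", "Genus"]        # target rank plus all coarser ranks
bad_strings = ["", " ", "\t"]      # string placeholders treated as missing

# na_rm=True analogue: drop taxa whose Genus is null or a bad placeholder.
missing = tax["Genus"].isna() | tax["Genus"].isin(bad_strings)
kept_tax = tax.loc[~missing]

# Sum counts over taxa that agree on every rank up to and including Genus.
keys = [kept_tax[r] for r in ranks]
genus_counts = otu.loc[kept_tax.index].groupby(keys).sum()
```

Grouping on the coarser ranks as well keeps two genera with the same name in different phyla from being merged by accident.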

tip_glom

Collapses taxa that are within a given patristic distance of each other on the phylogenetic tree. Requires phy_tree:

from pyloseq import tip_glom

ps2 = tip_glom(ps, h=0.05)

pyloseq.tip_glom

tip_glom(
    ps: Phyloseq, h: float = 0.2, hcfun: str = "average"
) -> Phyloseq

Agglomerate taxa by phylogenetic distance.

Hierarchical clustering on pairwise patristic distances groups tips whose within-cluster distance is <= h; each group is merged via merge_taxa.

R reference: tip_glom(physeq, h, hcfun)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object (must have `phy_tree`).	required
`h`	`float`	Height cutoff for `scipy.cluster.hierarchy.fcluster`.	`0.2`
`hcfun`	`str`	Linkage method passed to `scipy.cluster.hierarchy.linkage` (e.g. `"average"`, `"complete"`, `"ward"`).	`'average'`
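The clustering step can be reproduced with SciPy on a small distance matrix. The matrix below is hypothetical, and the use of criterion="distance" when cutting the dendrogram is an assumption about how h is applied, not something the reference states.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

tips = ["OTU1", "OTU2", "OTU3", "OTU4"]
# Hypothetical symmetric patristic distance matrix between tips.
dist = np.array([
    [0.00, 0.04, 0.60, 0.62],
    [0.04, 0.00, 0.58, 0.61],
    [0.60, 0.58, 0.00, 0.10],
    [0.62, 0.61, 0.10, 0.00],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
tree = linkage(squareform(dist), method="average")

# Cut the dendrogram at height h; tips with the same label form one group.
labels = fcluster(tree, t=0.2, criterion="distance")
groups = {}
for tip, label in zip(tips, labels):
    groups.setdefault(int(label), []).append(tip)
```

With h=0.2 the four tips fall into two groups, each of which would then be collapsed into one taxon.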

Reshaping

psmelt

Converts from wide format (taxa × samples matrix) to long format (one row per taxon-sample combination). The output DataFrame includes columns for OTU, Sample, Abundance, all sample metadata variables, and all taxonomic rank columns:

from pyloseq import psmelt

long_df = psmelt(ps)
# Columns: OTU, Sample, Abundance, SampleType, ..., Kingdom, Phylum, ...

This is the same as calling ps.melt(). Use the resulting DataFrame for custom ggplot layers or non-phyloseq statistical tests.
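The wide-to-long reshaping can be approximated with pandas melt and two merges. The sketch below uses tiny hypothetical tables and is meant to show the shape of the result, not how pyloseq builds it.

```python
import pandas as pd

otu = pd.DataFrame({"S1": [5, 0], "S2": [1, 3]}, index=["OTU1", "OTU2"])
meta = pd.DataFrame({"SampleType": ["Soil", "Ocean"]}, index=["S1", "S2"])
tax = pd.DataFrame({"Phylum": ["Firmicutes", "Bacteroidetes"]}, index=otu.index)

# Wide -> long: one row per (OTU, Sample) pair.
long_df = (
    otu.rename_axis(index="OTU")
    .reset_index()
    .melt(id_vars="OTU", var_name="Sample", value_name="Abundance")
    .merge(meta, left_on="Sample", right_index=True)   # sample variables
    .merge(tax, left_on="OTU", right_index=True)       # rank columns
    .reset_index(drop=True)
)
```

A 2 x 2 table yields four rows, and zero abundances are kept as rows rather than dropped.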

pyloseq.psmelt

psmelt(ps: Phyloseq) -> pd.DataFrame

Melt a Phyloseq into a long-form tidy DataFrame.

Returns one row per (OTU, Sample) pair with columns: ["OTU", "Sample", "Abundance", *sample_variables, *rank_names].

R reference: psmelt(physeq)

Parameters:

Name	Type	Description	Default
`ps`	`Phyloseq`	Source `Phyloseq` object.	required