
Containers

pyloseq represents microbiome data through a small set of typed container classes. Phyloseq is the top-level object that holds the others. All containers are immutable in the sense that manipulation functions never modify them in-place — they always return new objects.


Phyloseq

Phyloseq is the central data object. It bundles an OTU table with any combination of sample metadata, taxonomic annotations, a phylogenetic tree, and reference sequences. The constructor validates component consistency and prunes to the intersection of taxa and sample names across all attached components, emitting a warning by default when anything is dropped.

from pyloseq import Phyloseq, OtuTable, SampleData, TaxTable, PhyTree

# df, metadata_df, and taxonomy_df are pandas DataFrames; newick_str is a Newick string
ps = Phyloseq(
    otu=OtuTable(df, taxa_are_rows=True),
    sam=SampleData(metadata_df),
    tax=TaxTable(taxonomy_df),
    tree=PhyTree.from_newick(newick_str),
)

pyloseq.Phyloseq

Container for microbiome data: OTU table + optional metadata components.

Mirrors R's phyloseq-class. The constructor accepts any subset of components, runs the validator suite, and prunes to the intersection of names across components. With strict=True, a name mismatch raises an error instead of pruning.

By default, pruning during construction emits a warning so the data loss is discoverable; pass quiet=True to suppress it.

R reference: phyloseq::phyloseq(otu_table, sample_data, tax_table, phy_tree, refseq)

nsamples property

nsamples: int

Number of samples.

R reference: nsamples(x)

ntaxa property

ntaxa: int

Number of taxa.

R reference: ntaxa(x)

otu_table property writable

otu_table: OtuTable

The OTU/feature abundance table.

R reference: otu_table(x)

phy_tree property writable

phy_tree: PhyTree | None

Phylogenetic tree, or None if not provided.

R reference: phy_tree(x)

rank_names property

rank_names: list[str]

Taxonomic rank names, or [] if no tax table.

R reference: rank_names(x)

refseq property writable

refseq: RefSeq | None

Reference sequences, or None if not provided.

R reference: refseq(x)

sample_data property writable

sample_data: SampleData | None

Per-sample metadata, or None if not provided.

R reference: sample_data(x)

sample_names property

sample_names: Index

Sample identifiers from the OTU table.

R reference: sample_names(x)

sample_variables property

sample_variables: list[str]

Names of sample metadata columns, or [] if no sample data.

R reference: sample_variables(x)

tax_table property writable

tax_table: TaxTable | None

Taxonomic classification table, or None if not provided.

R reference: tax_table(x)

taxa_names property

taxa_names: Index

Taxon identifiers from the OTU table.

R reference: taxa_names(x)

distance

distance(
    method: str = "bray", **kwargs: Any
) -> DistanceMatrix

Compute a pairwise distance matrix.

Thin wrapper around pyloseq.distance. Returns a skbio.stats.distance.DistanceMatrix usable directly with skbio.stats.distance.permanova and skbio.stats.distance.anosim.

R reference: distance(physeq, method)

Parameters:

method (str, default 'bray'): Distance method string (e.g. "bray", "unifrac").

**kwargs (Any, default {}): Forwarded to the underlying implementation.
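As an illustration of what the default method computes, here is a minimal NumPy sketch of the Bray-Curtis dissimilarity between two samples. It assumes "bray" refers to the standard Bray-Curtis definition. The helper name bray_curtis is hypothetical and not part of pyloseq.

```python
import numpy as np


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u - v|) / sum(u + v). Hypothetical helper."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()


# Abundance vectors for two samples over the same three taxa
s1 = [10, 0, 5]
s2 = [0, 3, 7]

d = bray_curtis(s1, s2)
print(round(d, 3))
```

A value of 0 means the two samples have identical abundances; a value of 1 means they share no taxa.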

get_sample

get_sample(i: str) -> pd.Series

Return the abundance vector for a single sample across all taxa.

R reference: get_sample(x, i)

get_taxa

get_taxa(i: str) -> pd.Series

Return the abundance vector for a single taxon across all samples.

R reference: get_taxa(x, i)

get_variable

get_variable(v: str) -> pd.Series

Return a sample metadata column as a Series.

R reference: get_variable(x, v)

melt

melt() -> pd.DataFrame

Melt to a long-form tidy DataFrame (one row per OTU × Sample pair).

Equivalent to the free function pyloseq.psmelt.

R reference: psmelt(physeq)
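The long-form shape can be sketched with plain pandas. This covers only the abundance part of the melt. The column names "OTU", "Sample", and "Abundance" follow R's psmelt and are an assumption about what pyloseq emits.

```python
import pandas as pd

# Taxa-as-rows abundance table
counts = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [0, 3, 7]},
    index=["OTU1", "OTU2", "OTU3"],
)

# One row per OTU x Sample pair
long_df = (
    counts.rename_axis("OTU")
    .reset_index()
    .melt(id_vars="OTU", var_name="Sample", value_name="Abundance")
)
print(long_df)
```

Sample metadata or taxonomy columns could then be joined onto this frame with pandas merge on "Sample" or "OTU".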

ordinate

ordinate(
    method: str = "PCoA",
    distance: str = "bray",
    formula: str | None = None,
    **kwargs: Any,
) -> OrdinationResults

Run multivariate ordination.

Thin wrapper around pyloseq.ordinate. Returns an skbio.stats.ordination.OrdinationResults.

R reference: ordinate(physeq, method, distance, formula)

Parameters:

method (str, default 'PCoA'): Ordination method: "PCoA", "NMDS", "CCA", etc.

distance (str, default 'bray'): Distance method or pre-computed DistanceMatrix.

formula (str | None, default None): Model formula for constrained methods (e.g. "~SampleType").

**kwargs (Any, default {}): Forwarded to the underlying implementation.
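For orientation, the default "PCoA" method is classical multidimensional scaling of a distance matrix. The following NumPy sketch shows that textbook procedure. It is not pyloseq's or scikit-bio's implementation, and the function name pcoa is hypothetical.

```python
import numpy as np


def pcoa(dist):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each axis by sqrt(eigenvalue); clip tiny negatives from round-off
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return coords, eigvals


# Three samples placed on a line, so one axis should recover all distances
positions = np.array([0.0, 1.0, 3.0])
dist = np.abs(positions[:, None] - positions[None, :])

coords, eigvals = pcoa(dist)
axis1 = coords[:, 0]
recovered = np.abs(axis1[:, None] - axis1[None, :])
```

For Euclidean input such as this toy example, distances along the leading axes reproduce the original matrix. Non-Euclidean dissimilarities can yield negative eigenvalues, which this sketch simply clips to zero.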

sample_sums

sample_sums() -> pd.Series

Sum of abundances across all taxa for each sample.

R reference: sample_sums(x)

taxa_sums

taxa_sums() -> pd.Series

Sum of abundances across all samples for each taxon.

R reference: taxa_sums(x)
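On a taxa-as-rows table, the accessors and sums above correspond to simple pandas operations. This sketch uses a bare DataFrame rather than pyloseq objects, so it only shows the arithmetic involved.

```python
import pandas as pd

# Taxa-as-rows abundance table
counts = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [0, 3, 7]},
    index=["OTU1", "OTU2", "OTU3"],
)

sample_totals = counts.sum(axis=0)  # one total per sample (column)
taxa_totals = counts.sum(axis=1)    # one total per taxon (row)

one_sample = counts["S1"]       # abundances of every taxon in sample S1
one_taxon = counts.loc["OTU3"]  # abundances of OTU3 in every sample
```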

Validation and pruning

When taxa or sample names differ between components, the constructor prunes each to the intersection and emits a warning. Pass quiet=True to suppress the warning. Pass strict=True to raise pyloseqValidationError instead of pruning:

# Raises if otu and tax have mismatched taxa names
ps = Phyloseq(otu=otu, tax=tax, strict=True)

# Prune silently
ps = Phyloseq(otu=otu, tax=tax, quiet=True)

Component setters (e.g., ps.tax_table = new_tax) trigger re-validation using the same logic.
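The pruning rule can be illustrated with plain pandas. This is a minimal sketch of the behaviour described above, not pyloseq's validator. The function prune_to_common_taxa is hypothetical, and ValueError stands in for the library's own exception.

```python
import warnings

import pandas as pd


def prune_to_common_taxa(otu_df, tax_df, strict=False, quiet=False):
    """Hypothetical helper: align two taxa-indexed frames on shared taxa names."""
    common = otu_df.index.intersection(tax_df.index)
    mismatched = len(common) < len(otu_df.index) or len(common) < len(tax_df.index)
    if mismatched and strict:
        # Stand-in for the library's validation error
        raise ValueError("taxa names differ between components")
    if mismatched and not quiet:
        warnings.warn("pruning components to shared taxa names")
    return otu_df.loc[common], tax_df.loc[common]


otu_df = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [0, 3, 7]},
    index=["OTU1", "OTU2", "OTU3"],
)
tax_df = pd.DataFrame(
    {"Phylum": ["Firmicutes", "Bacteroidetes"]},
    index=["OTU1", "OTU2"],
)

# OTU3 has no taxonomy row, so it is dropped from the OTU table
pruned_otu, pruned_tax = prune_to_common_taxa(otu_df, tax_df, quiet=True)
```

The same intersection applies to sample names when sample metadata is attached.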


OtuTable

Stores the feature abundance matrix. Rows can be taxa or samples; the taxa_are_rows flag records which orientation is in use.

import pandas as pd
from pyloseq import OtuTable

df = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [0, 3, 7]},
    index=["OTU1", "OTU2", "OTU3"],
)
otu = OtuTable(df, taxa_are_rows=True)

Sparse input (any scipy.sparse matrix, such as CSR or CSC) is accepted and is always kept in sparse storage. Dense input whose density is below 50% is automatically converted to CSR format internally.
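To make the 50% threshold concrete, the sketch below computes matrix density with NumPy. It assumes density means the fraction of nonzero entries. The helper name density is hypothetical.

```python
import numpy as np


def density(matrix):
    """Fraction of nonzero entries in a dense matrix (hypothetical helper)."""
    arr = np.asarray(matrix)
    return np.count_nonzero(arr) / arr.size


mostly_filled = np.array([[10, 0], [0, 3], [5, 7]])
mostly_zero = np.array([[0, 0, 0, 1], [0, 0, 0, 0], [2, 0, 0, 0]])

for name, m in [("mostly_filled", mostly_filled), ("mostly_zero", mostly_zero)]:
    d = density(m)
    # Below 0.5, a dense input would be a candidate for sparse storage
    print(name, round(d, 3), "sparse" if d < 0.5 else "dense")
```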

to_dataframe() returns a fresh DataFrame in the table's current orientation, so taxa are rows when taxa_are_rows=True:

df = otu.to_dataframe()   # taxa × samples

Flip orientation without copying data:

otu.taxa_are_rows = False  # now samples are rows internally

pyloseq.OtuTable

Stores an OTU/feature abundance table with orientation tracking.

Internally stores dense data as a pd.DataFrame and sparse data (density < 50 %) as a scipy.sparse.csr_matrix with separate index/column arrays.

R reference: phyloseq::otu_table(object, taxa_are_rows)

nsamples property

nsamples: int

Number of samples.

R reference: nsamples(x)

ntaxa property

ntaxa: int

Number of taxa.

R reference: ntaxa(x)

sample_names property writable

sample_names: Index

Sample identifiers.

R reference: sample_names(x)

taxa_are_rows property writable

taxa_are_rows: bool

Whether taxa occupy rows (True) or columns (False).

taxa_names property writable

taxa_names: Index

Taxa (OTU/ASV) identifiers.

R reference: taxa_names(x)

__init__

__init__(
    data: ndarray | DataFrame | spmatrix | list[Any],
    taxa_are_rows: bool = True,
) -> None

Parameters:

data (ndarray | DataFrame | spmatrix | list[Any], required): Abundance matrix. Accepted types: pd.DataFrame, np.ndarray, any scipy.sparse matrix, or a list-of-lists. When a scipy.sparse matrix is supplied directly, sparse storage is always used regardless of matrix density (the 50 % density threshold only applies to dense inputs).

taxa_are_rows (bool, default True): If True (default), rows represent taxa and columns represent samples.

copy

copy() -> OtuTable

Return a deep copy.

R reference: otu_table(x) <- otu_table(x) (effectively)

sample_sums

sample_sums() -> pd.Series

Sum of abundances across all taxa for each sample.

R reference: sample_sums(x)

taxa_sums

taxa_sums() -> pd.Series

Sum of abundances across all samples for each taxon.

R reference: taxa_sums(x)

to_dataframe

to_dataframe() -> pd.DataFrame

Return the abundance matrix as a pd.DataFrame in current orientation.

Always returns a fresh copy; mutating the result never touches internal state.

R reference: as(otu_table(x), "matrix") then as.data.frame()


SampleData

Wraps per-sample metadata as a pandas.DataFrame. The DataFrame index is the sample identifier — it must be unique and must match sample names in the OTU table.

import pandas as pd
from pyloseq import SampleData

meta = pd.DataFrame(
    {"SampleType": ["Soil", "Ocean", "Skin"], "pH": [6.5, 8.1, 5.4]},
    index=["S1", "S2", "S3"],
)
sam = SampleData(meta)

Retrieve the underlying DataFrame with .to_frame().
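Because the index must be unique, it can help to check a metadata frame before wrapping it. This pandas-only sketch shows one way to find duplicated sample identifiers. It does not call pyloseq, and the library's own error for duplicates is not shown here.

```python
import pandas as pd

# Metadata with an accidental duplicate sample identifier
meta = pd.DataFrame(
    {"SampleType": ["Soil", "Ocean", "Skin"], "pH": [6.5, 8.1, 5.4]},
    index=["S1", "S2", "S2"],
)

is_unique = meta.index.is_unique
duplicated_ids = meta.index[meta.index.duplicated()].unique().tolist()

if not is_unique:
    print("duplicate sample identifiers:", duplicated_ids)
```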

pyloseq.SampleData

Wraps a pd.DataFrame of per-sample metadata.

The DataFrame index must be sample identifiers, and must be unique.

R reference: phyloseq::sample_data(object)

names property

names: Index

Deprecated alias for sample_names.

sample_names property

sample_names: Index

Sample identifiers (DataFrame index).

R reference: sample_names(x)

variables property

variables: Index

Sample variable names (DataFrame columns).

R reference: sample_variables(x)

copy

copy() -> SampleData

Return a deep copy of this SampleData.

to_frame

to_frame() -> pd.DataFrame

Return a copy of the underlying DataFrame.

R reference: as(sample_data(x), "data.frame")


TaxTable

Wraps the taxonomic classification table. Rows are taxa, indexed by the same taxa names as the OTU table; columns are taxonomic ranks.

import pandas as pd
from pyloseq import TaxTable

tax_df = pd.DataFrame(
    {
        "Kingdom": ["Bacteria", "Bacteria"],
        "Phylum":  ["Firmicutes", "Bacteroidetes"],
        "Genus":   ["Lactobacillus", "Bacteroides"],
    },
    index=["OTU1", "OTU2"],
)
tax = TaxTable(tax_df)
print(tax.rank_names)   # ['Kingdom', 'Phylum', 'Genus']

pyloseq.TaxTable

Wraps a pd.DataFrame of taxonomic classifications.

The DataFrame index must be taxa identifiers; columns are rank names (e.g. ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]). Rank names are user-supplied and not hardcoded.

R reference: phyloseq::tax_table(object)

rank_names property

rank_names: list[str]

Taxonomic rank names (column names).

R reference: rank_names(x)

taxa_names property

taxa_names: Index

Taxon identifiers (DataFrame index).

R reference: taxa_names(x)

copy

copy() -> TaxTable

Return a deep copy of this TaxTable.

to_frame

to_frame() -> pd.DataFrame

Return a copy of the underlying DataFrame.

R reference: as(tax_table(x), "matrix")


PhyTree

Wraps a skbio.TreeNode phylogenetic tree. Three constructors are available:

from pyloseq import PhyTree

# From a Newick string
tree = PhyTree.from_newick("((OTU1:0.1, OTU2:0.2):0.3, OTU3:0.4);")

# From a file
tree = PhyTree.from_newick_file("tree.nwk")

# From an existing skbio.TreeNode
import skbio
node = skbio.io.read("tree.nwk", format="newick", into=skbio.TreeNode)
tree = PhyTree(node)

Note

PhyTree.from_ape_rds() is not implemented. To use a tree saved from R with saveRDS, export it first: ape::write.tree(tree, "tree.nwk").

Prune to a specific set of tips:

pruned = tree.prune(["OTU1", "OTU3"])
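The idea behind pruning can be shown on a toy tree built from nested dicts. This is a conceptual sketch only: it does not use skbio.TreeNode or PhyTree, the helper names are hypothetical, and single-child internal nodes are left in place rather than collapsed.

```python
def prune(node, keep):
    """Return a pruned copy of node keeping only tips named in keep, or None."""
    children = node.get("children", [])
    if not children:  # tip node
        return dict(node) if node["name"] in keep else None
    kept = [c for c in (prune(child, keep) for child in children) if c is not None]
    if not kept:
        return None  # no retained tips below this node
    return {"name": node.get("name"), "length": node.get("length"), "children": kept}


def tip_names(node):
    """List tip names in left-to-right order."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    return [name for child in children for name in tip_names(child)]


# Same topology as ((OTU1:0.1, OTU2:0.2):0.3, OTU3:0.4);
tree = {"name": None, "length": None, "children": [
    {"name": None, "length": 0.3, "children": [
        {"name": "OTU1", "length": 0.1},
        {"name": "OTU2", "length": 0.2},
    ]},
    {"name": "OTU3", "length": 0.4},
]}

pruned = prune(tree, {"OTU1", "OTU3"})
print(tip_names(pruned))
```

The input tree is not modified, matching the return-a-new-object convention used by the containers.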

pyloseq.PhyTree

Wraps a skbio.tree.TreeNode with a phyloseq-compatible interface.

R reference: phyloseq::phy_tree(object)

internal_names property

internal_names: list[str]

Names of all internal (non-tip) nodes, excluding the root if unnamed.

R reference: phy_tree(x)$node.label

is_rooted property

is_rooted: bool

True if the root has exactly 2 children (bifurcating root).

R reference: is.rooted(phy_tree(x))

n_tips property

n_tips: int

Number of tip (leaf) nodes.

R reference: ntaxa(phy_tree(x))

tip_names property

tip_names: list[str]

Names of all leaf nodes.

R reference: taxa_names(phy_tree(x))

total_branch_length property

total_branch_length: float

Sum of all branch lengths in the tree.

R reference: sum(phy_tree(x)$edge.length)

copy

copy() -> PhyTree

Return a deep copy of this PhyTree via Newick round-trip.

from_ape_rds classmethod

from_ape_rds(path: str | Path) -> PhyTree

Construct from an R phylo object serialized as .rds. Not implemented at present; see the note in the PhyTree overview for the Newick export workaround.

Requires pyreadr (pip install pyreadr).

R reference: readRDS(path)

from_newick classmethod

from_newick(s: str) -> PhyTree

Construct from a Newick string.

R reference: phy_tree(read.tree(text=s))

from_newick_file classmethod

from_newick_file(path: str | Path) -> PhyTree

Construct from a Newick file on disk.

R reference: phy_tree(read.tree(file=path))

prune

prune(keep: list[str]) -> PhyTree

Return a new tree containing only the specified tips and their ancestors.

Equivalent to ape::drop.tip with the complement set.

R reference: prune_taxa(keep, ps) (on the tree component)

to_newick

to_newick() -> str

Serialize to a Newick string.

R reference: ape::write.tree(phy_tree(x))


RefSeq

Stores reference sequences as a dict-like mapping from taxon name to skbio.DNA. Used for representative sequences from DADA2 or QIIME 2 denoising pipelines.

import skbio
from pyloseq import RefSeq

seqs = RefSeq({
    "OTU1": skbio.DNA("ACGTACGT"),
    "OTU2": skbio.DNA("TGCATGCA"),
})

# Round-trip through FASTA
seqs.to_fasta("representatives.fasta")
seqs2 = RefSeq.from_fasta("representatives.fasta")
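For readers unfamiliar with the format, a FASTA round-trip can be sketched with the standard library alone. This uses plain strings instead of skbio.DNA and hypothetical helpers write_fasta and read_fasta. It is not how RefSeq is implemented.

```python
import os
import tempfile


def write_fasta(seqs, path):
    """Write a name -> sequence mapping as FASTA, one record per entry."""
    with open(path, "w") as handle:
        for name, seq in seqs.items():
            handle.write(f">{name}\n{seq}\n")


def read_fasta(path):
    """Read FASTA into a dict, joining wrapped sequence lines."""
    seqs = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]  # identifier up to first whitespace
                seqs[name] = ""
            elif name is not None:
                seqs[name] += line
    return seqs


seqs = {"OTU1": "ACGTACGT", "OTU2": "TGCATGCA"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "representatives.fasta")
    write_fasta(seqs, path)
    restored = read_fasta(path)
```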

pyloseq.RefSeq

Wraps a dictionary of reference sequences keyed by taxon ID.

R reference: phyloseq::refseq(object)

names property

names: Index

Deprecated alias for taxa_names.

taxa_names property

taxa_names: Index

Taxon identifiers for all stored sequences.

R reference: taxa_names(x)

copy

copy() -> RefSeq

Return a deep copy of this RefSeq.

from_fasta classmethod

from_fasta(path: str | Path) -> RefSeq

Load sequences from a FASTA file.

R reference: readDNAStringSet() then RefSeq(x)

to_fasta

to_fasta(path: str | Path) -> None

Write sequences to a FASTA file.

R reference: writeXStringSet(refseq(x), filepath)