Datasets

pyloseq ships with four reference datasets drawn from R's phyloseq package. They are stored as Parquet and Newick files under tests/golden/ and loaded without any R dependency.

The loaders return a plain dict of DataFrames, Series, and strings, not a pre-built Phyloseq. This is intentional: it lets you choose which components to include and in what form.

from pyloseq.datasets import (
    load_global_patterns_reference,
    load_enterotype_reference,
    load_esophagus_reference,
    load_soilrep_reference,
    load_golden,
)
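
Because the loaders hand back a plain dict, choosing components is ordinary dict handling. The following is a minimal sketch, not part of pyloseq: `select_components` is a hypothetical helper, and the stand-in dict holds placeholder strings where the real loaders return DataFrames, Series, and Newick text.

```python
def select_components(ref, wanted, optional=("phy_tree_newick",)):
    """Return only the requested keys from a loader-style dict.

    Hypothetical helper (not part of pyloseq). Keys listed in `optional`
    are skipped silently when absent; any other missing key raises KeyError.
    """
    selected = {}
    for key in wanted:
        if key in ref:
            selected[key] = ref[key]
        elif key not in optional:
            raise KeyError(f"missing required component: {key}")
    return selected


# Stand-in for a dataset without a tree; the values are placeholders only.
ref = {
    "otu_table": "<DataFrame>",
    "sample_data": "<DataFrame>",
    "tax_table": "<DataFrame>",
}
parts = select_components(ref, ["otu_table", "tax_table", "phy_tree_newick"])
print(sorted(parts))  # ['otu_table', 'tax_table']
```

Treating the tree as optional lets the same selection code run on datasets with and without a phy_tree_newick entry.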

Reference datasets

Dataset         Samples  Taxa    Has tree  Has refseq  Dict keys
GlobalPatterns  26       19,216  yes       no          otu_table, sample_data, tax_table, phy_tree_newick, taxa_sums, sample_sums
enterotype      280      553     no        no          otu_table, sample_data, tax_table, taxa_sums, sample_sums
esophagus       3        58      yes       no          otu_table, sample_data, tax_table, phy_tree_newick, taxa_sums, sample_sums
soilrep         56       16,825  no        no          otu_table, sample_data, tax_table, taxa_sums, sample_sums
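
The counts in this table can double as a quick sanity check after loading. The sketch below is an illustration only: `EXPECTED_DIMS` and `dims_match` are hypothetical names, and the orientation of the stored OTU table is passed in as a parameter rather than assumed. In practice `shape` would come from something like `ref["otu_table"].shape`.

```python
# Sample and taxon counts copied from the reference table above.
EXPECTED_DIMS = {
    "GlobalPatterns": (26, 19216),
    "enterotype": (280, 553),
    "esophagus": (3, 58),
    "soilrep": (56, 16825),
}


def dims_match(name, shape, taxa_are_rows=True):
    """Check a (rows, columns) shape against the documented counts.

    Hypothetical helper (not part of pyloseq). With taxa_are_rows=True the
    expected shape is (n_taxa, n_samples); otherwise (n_samples, n_taxa).
    """
    n_samples, n_taxa = EXPECTED_DIMS[name]
    expected = (n_taxa, n_samples) if taxa_are_rows else (n_samples, n_taxa)
    return tuple(shape) == expected


print(dims_match("esophagus", (58, 3)))  # True
```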

load_global_patterns_reference

26 samples from nine sample types (ocean, soil, freshwater, skin, mock community, etc.), with seven taxonomic ranks. It is the most commonly used reference dataset for testing ordination and diversity functions:

from pyloseq import Phyloseq, OtuTable, SampleData, TaxTable, PhyTree
from pyloseq.datasets import load_global_patterns_reference

ref = load_global_patterns_reference()

gp = Phyloseq(
    otu=OtuTable(ref["otu_table"], taxa_are_rows=True),
    sam=SampleData(ref["sample_data"]),
    tax=TaxTable(ref["tax_table"]),
    tree=PhyTree.from_newick(ref["phy_tree_newick"]),
)

pyloseq.datasets.load_global_patterns_reference

load_global_patterns_reference() -> dict[str, Any]

Return the GlobalPatterns reference data as a dict of DataFrames, Series, and a Newick string.

Keys: otu_table, sample_data, tax_table, taxa_sums, sample_sums, phy_tree_newick.

R reference: data("GlobalPatterns"); phyloseq::GlobalPatterns

load_enterotype_reference

280 human gut metagenome samples, 553 genera. No tree. Commonly used for enterotype clustering and genus-level analyses:

from pyloseq.datasets import load_enterotype_reference

ref = load_enterotype_reference()

pyloseq.datasets.load_enterotype_reference

load_enterotype_reference() -> dict[str, Any]

Return the enterotype reference data.

R reference: data("enterotype"); phyloseq::enterotype

load_esophagus_reference

3 human esophagus biopsy samples, 58 OTUs, with a phylogenetic tree. The smallest dataset; useful for quick tests requiring a tree:

from pyloseq.datasets import load_esophagus_reference

ref = load_esophagus_reference()

pyloseq.datasets.load_esophagus_reference

load_esophagus_reference() -> dict[str, Any]

Return the esophagus reference data.

R reference: data("esophagus"); phyloseq::esophagus

load_soilrep_reference

56 soil samples from a warming experiment, 16,825 OTUs. No tree. Useful for testing on a larger, sparse table:

from pyloseq.datasets import load_soilrep_reference

ref = load_soilrep_reference()

pyloseq.datasets.load_soilrep_reference

load_soilrep_reference() -> dict[str, Any]

Return the soilrep reference data.

R reference: data("soilrep"); phyloseq::soilrep


load_golden

Loads a pre-computed R output for a specific dataset and function. Used in tests to compare pyloseq results against R's reference values:

from pyloseq.datasets import load_golden

r_richness = load_golden("GlobalPatterns", "estimate_richness")

The function argument is the name of the R function used in scripts/generate_golden.R. Any **params must exactly match the parameters used when the file was generated, because they determine the file path.
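
This page does not document the exact path scheme. To illustrate why the keyword arguments have to match exactly, here is a sketch of one way a parameter-dependent file name could be derived. Everything in it is an assumption rather than pyloseq's actual implementation: the name `golden_filename`, the JSON canonicalization, the SHA-256 digest, the separator, and the `.parquet` suffix. The `distance` call and its parameters are illustrative as well.

```python
import hashlib
import json


def golden_filename(dataset, function, **params):
    """Derive a deterministic file name from dataset, function, and params.

    Hypothetical scheme for illustration only. Params are serialized with
    sorted keys so that keyword order does not affect the result.
    """
    if not params:
        return f"{dataset}__{function}.parquet"
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{dataset}__{function}__{digest}.parquet"


# Same params in a different keyword order map to the same name;
# a different value maps to a different name.
a = golden_filename("GlobalPatterns", "distance", method="bray", binary=False)
b = golden_filename("GlobalPatterns", "distance", binary=False, method="bray")
c = golden_filename("GlobalPatterns", "distance", method="jaccard", binary=False)
print(a == b, a == c)
```

Under a scheme like this, even a small mismatch in a parameter value produces a different file name, which is why a lookup with the wrong params fails to find the file.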

If the file does not exist, FileNotFoundError is raised with instructions for regenerating golden files. See the golden files developer guide for details.
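
In a test suite, a missing golden file is often better handled as "skip this comparison" than as a crash. A minimal sketch of that pattern follows. It assumes a hypothetical wrapper, `load_or_none`, which takes the loader as an argument so the example runs without pyloseq installed; `fake_loader` stands in for load_golden and always fails.

```python
def load_or_none(loader, *args, **kwargs):
    """Call a loader and return None if the golden file is missing.

    Hypothetical wrapper (not part of pyloseq). Only FileNotFoundError is
    caught; any other exception propagates unchanged.
    """
    try:
        return loader(*args, **kwargs)
    except FileNotFoundError as exc:
        print(f"golden file unavailable: {exc}")
        return None


def fake_loader(dataset, function, **params):
    # Stand-in for load_golden that simulates a file that was never generated.
    raise FileNotFoundError(f"{dataset}/{function} has not been generated")


result = load_or_none(fake_loader, "GlobalPatterns", "estimate_richness")
if result is None:
    print("skipping comparison against R output")
```

With the real loader, the call would look like `load_or_none(load_golden, "GlobalPatterns", "estimate_richness")`.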

pyloseq.datasets.load_golden

load_golden(
    dataset: str, function: str, **params: Any
) -> Any

Load a golden output file for an analysis function.

Parameters:

Name      Type  Default   Description
dataset   str   required  One of "GlobalPatterns", "enterotype", "esophagus", "soilrep".
function  str   required  Name of the R phyloseq function, e.g. "estimate_richness".
**params  Any   {}        Parameters passed to the function; used to compute the file path hash. Must match the params used when generate_golden.R was run.

Returns:

Type       Description
DataFrame  The golden output, with the original rownames restored as the index.

Raises:

Type               Description
FileNotFoundError  If the golden file does not exist for the given dataset/function/params.