Introduction

The scprep package provides a unified interface for reading and processing single-cell RNA-seq data from 10X Genomics Cell Ranger output. The package supports conversion to three popular single-cell data object types:

  • ExpressionSet (Biobase) - Default output format
  • Seurat - Popular for single-cell analysis workflows
  • SingleCellExperiment - Bioconductor standard for single-cell data

This vignette demonstrates how to read 10X data and convert it to each object type, then explore the structure of each object to understand where counts data, cell metadata, and gene metadata are stored.


Example Data Setup


For this example, we’ll use simulated sample paths and annotation data that match the expected format for scprep. In practice, these would point to your actual 10X Cell Ranger output directories.


# First let's load the required packages
library(scprep)
library(Biobase)
library(Seurat)
library(SingleCellExperiment)
library(Matrix)

# For this vignette, we will load data from 2 example samples from the 'scdata' package and store in a list (also accepted by 'scprep') (normally you would load data from directories where the 10X data is stored)
library(scdata)
# Load the PBMC 3p and 5p example sample data
data(PBMC_3p_10K, package="scdata")
data(PBMC_5p_10K, package="scdata")

# Store in example_data list - we will only use a subset of the data to keep it lightweight
example_data <- list(
  Sample_1 = PBMC_3p_10K[, 1:1000],
  Sample_2 = PBMC_5p_10K[, 1:1000]
)

# Clean up the original loaded objects to free memory
rm(PBMC_3p_10K, PBMC_5p_10K)
gc()  # Force garbage collection
#>            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
#> Ncells  7173896 383.2   12391758  661.8         NA   8679818  463.6
#> Vcells 19181055 146.4  146549777 1118.1      16384 158809916 1211.7

# We will create some fake sample paths for the purpose of this example, which will purely be stored in metadata (again, normally you would replace with your actual 10x data directories)
sample_paths <- c(
  "/path/to/Sample_1",
  "/path/to/Sample_2"
)

# Example annotation data (modify based on your experimental design)
annotation <- data.frame(
  Sample_ID = c("Sample_1", "Sample_2"),
  Index = c("SI-GA-A1", "SI-GA-B1"), 
  Sample_Project = c("scRNA_10X_Project", "scRNA_10X_Project"),
  Reference = c("refdata-cellranger-GRCh38-3.0.0", "refdata-cellranger-GRCh38-3.0.0"),
  Tissue = c("Blood", "Blood"),
  Type = c("PBMC", "PBMC"),
  Donor = c(1, 2),
  stringsAsFactors = FALSE
)

print("Sample annotation:")
#> [1] "Sample annotation:"
print(annotation)
#>   Sample_ID    Index    Sample_Project                       Reference Tissue
#> 1  Sample_1 SI-GA-A1 scRNA_10X_Project refdata-cellranger-GRCh38-3.0.0  Blood
#> 2  Sample_2 SI-GA-B1 scRNA_10X_Project refdata-cellranger-GRCh38-3.0.0  Blood
#>   Type Donor
#> 1 PBMC     1
#> 2 PBMC     2

Reading Data and Object Conversion

The scprep package provides two main approaches for reading and processing 10X data:

  1. Direct function calls using scprep_build()
  2. Template workflow using template_scprep() with parameter files

Direct Function Approach

The scprep_build() function is the core entry point for reading 10X data and can output different object types using the output_type parameter.


ExpressionSet


Build ExpressionSet:

Run the ‘scprep_build’ function to read the data and create an ExpressionSet object.

# Read data and create ExpressionSet object (default)
eset <- scprep::scprep_build(
  sample_paths = sample_paths,
  input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
  annotation = annotation,
  file_type = "h5",
  vdj = FALSE,
  cite = FALSE, 
  atac = FALSE,
  output_type = "ExpressionSet",  # This is the default
  verbose = TRUE
)
#> Building ExpressionSet object 
#> Using pre-loaded count matrices from input_data 
#> Loaded matrix for Sample_1 : 1000 cells 
#> Loaded matrix for Sample_2 : 1000 cells 
#> Using provided annotation matching input_data names 
#> Total cells loaded: 2000 
#> Calculate UMIs per cell 
#> Calculate Genes per cell 
#> Creating ExpressionSet object directly 
#> ExpressionSet object created with 2000 cells and 36601 genes

print(eset)
#> ExpressionSet (storageMode: environment)
#> assayData: 36601 features, 2000 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 ...
#>     ACGGCCATCACTGGGC-1-2 (2000 total)
#>   varLabels: ID Sample ... Genes (10 total)
#>   varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:

Explore ExpressionSet

Now let’s explore the structure of the ExpressionSet object to understand where different types of data are stored.

# Basic object information
cat("ExpressionSet dimensions:", dim(eset), "\n")
#> ExpressionSet dimensions: 36601 2000
cat("Number of samples:", ncol(eset), "\n") 
#> Number of samples: 2000
cat("Number of genes:", nrow(eset), "\n")
#> Number of genes: 36601

# Expression data (counts matrix)
counts_matrix <- exprs(eset)
cat("Counts matrix class:", class(counts_matrix), "\n")
#> Counts matrix class: matrix array
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#>                 AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485                    0                    0                    0
#> ENSG00000237613                    0                    0                    0
#> ENSG00000186092                    0                    0                    0
#> ENSG00000238009                    0                    0                    0
#> ENSG00000239945                    0                    0                    0

# Cell-level metadata (sample information)  
cell_metadata <- pData(eset)
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 10
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#>  [1] "ID"             "Sample"         "Index"          "Sample_Project"
#>  [5] "Reference"      "Tissue"         "Type"           "Donor"         
#>  [9] "UMIs"           "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#>                                        ID   Sample    Index    Sample_Project
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#>                                            Reference Tissue Type Donor UMIs
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1  860
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1 1548
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1 6387
#>                      Genes
#> AAACCCAGTATATGGA-1-1   350
#> AAACCCAGTATCGTAC-1-1   729
#> AAACCCAGTCGGTGAA-1-1  1827

# Gene-level metadata
gene_metadata <- fData(eset)
cat("Gene metadata dimensions:", dim(gene_metadata), "\n") 
#> Gene metadata dimensions: 0 0
if (ncol(gene_metadata) > 0) {
  print("Gene metadata columns:")
  print(colnames(gene_metadata))
  print("First few rows of gene metadata:")
  print(head(gene_metadata, 3))
}

# Additional assay data slots
print("Available assay data slots:")
#> [1] "Available assay data slots:"
print(names(assayData(eset)))
#> [1] "exprs"

# Quality control metrics (automatically calculated)
print("Automatically calculated QC metrics:")
#> [1] "Automatically calculated QC metrics:"
print(paste("UMIs per cell: min =", min(eset$UMIs), "... max =", max(eset$UMIs)))
#> [1] "UMIs per cell: min = 500 ... max = 61084"
print(paste("Genes per cell: min =", min(eset$Genes), "... max =", max(eset$Genes)))
#> [1] "Genes per cell: min = 75 ... max = 6883"
#>            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
#> Ncells  7193578 384.2   12391758  661.8         NA   8679818  463.6
#> Vcells 92455329 705.4  241076319 1839.3      16384 166293657 1268.8

Seurat


Build Seurat Object:

Run the ‘scprep_build’ function to read the data and create a Seurat object.

# Read data and create Seurat object
seurat_obj <- scprep::scprep_build(
  sample_paths = sample_paths,
  input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
  annotation = annotation,
  file_type = "h5", 
  vdj = FALSE,
  cite = FALSE,
  atac = FALSE,
  output_type = "seurat",
  verbose = TRUE
)
#> Building seurat object 
#> Using pre-loaded count matrices from input_data 
#> Loaded matrix for Sample_1 : 1000 cells 
#> Loaded matrix for Sample_2 : 1000 cells 
#> Using provided annotation matching input_data names 
#> Total cells loaded: 2000 
#> Calculate UMIs per cell 
#> Calculate Genes per cell 
#> Creating Seurat object directly 
#> Seurat object created with 2000 cells and 36601 genes

print(seurat_obj)
#> An object of class Seurat 
#> 36601 features across 2000 samples within 1 assay 
#> Active assay: RNA (36601 features, 0 variable features)
#>  1 layer present: counts

Explore Seurat Object

Now let’s explore the structure of the Seurat object to understand where different types of data are stored.

# Basic object information
cat("Seurat object dimensions:", dim(seurat_obj), "\n")
#> Seurat object dimensions: 36601 2000
cat("Number of cells:", ncol(seurat_obj), "\n")
#> Number of cells: 2000
cat("Number of genes:", nrow(seurat_obj), "\n")
#> Number of genes: 36601

# Expression data (counts matrix)
counts_matrix <- Seurat::GetAssayData(seurat_obj, layer = "counts")
cat("Counts matrix class:", class(counts_matrix), "\n")
#> Counts matrix class: dgCMatrix
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#> 5 x 3 sparse Matrix of class "dgCMatrix"
#>                 AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485                    .                    .                    .
#> ENSG00000237613                    .                    .                    .
#> ENSG00000186092                    .                    .                    .
#> ENSG00000238009                    .                    .                    .
#> ENSG00000239945                    .                    .                    .

# Cell-level metadata
cell_metadata <- seurat_obj@meta.data
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 13
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#>  [1] "orig.ident"     "nCount_RNA"     "nFeature_RNA"   "ID"            
#>  [5] "Sample"         "Index"          "Sample_Project" "Reference"     
#>  [9] "Tissue"         "Type"           "Donor"          "UMIs"          
#> [13] "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#>                         orig.ident nCount_RNA nFeature_RNA                   ID
#> AAACCCAGTATATGGA-1-1 SeuratProject        860          350 AAACCCAGTATATGGA-1-1
#> AAACCCAGTATCGTAC-1-1 SeuratProject       1548          729 AAACCCAGTATCGTAC-1-1
#> AAACCCAGTCGGTGAA-1-1 SeuratProject       6387         1827 AAACCCAGTCGGTGAA-1-1
#>                        Sample    Index    Sample_Project
#> AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#>                                            Reference Tissue Type Donor UMIs
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1  860
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1 1548
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0  Blood PBMC     1 6387
#>                      Genes
#> AAACCCAGTATATGGA-1-1   350
#> AAACCCAGTATCGTAC-1-1   729
#> AAACCCAGTCGGTGAA-1-1  1827

# Gene-level metadata  
gene_metadata <- seurat_obj[["RNA"]][[]]
cat("Gene metadata dimensions:", dim(gene_metadata), "\n")
#> Gene metadata dimensions: 36601 0
if (ncol(gene_metadata) > 0) {
  print("Gene metadata columns:")
  print(colnames(gene_metadata)) 
  print("First few rows of gene metadata:")
  print(head(gene_metadata, 3))
}

# Available assays
print("Available assays:")
#> [1] "Available assays:"
print(names(seurat_obj@assays))
#> [1] "RNA"

# Quality control metrics
if ("nCount_RNA" %in% colnames(cell_metadata)) {
  print("Seurat QC metrics:")
  print(paste("UMIs per cell: min =", min(cell_metadata$nCount_RNA), "... max =", max(cell_metadata$nCount_RNA)))
}
#> [1] "Seurat QC metrics:"
#> [1] "UMIs per cell: min = 500 ... max = 61084"
if ("nFeature_RNA" %in% colnames(cell_metadata)) {
  print(paste("Genes per cell: min =", min(cell_metadata$nFeature_RNA), "... max =", max(cell_metadata$nFeature_RNA)))
}
#> [1] "Genes per cell: min = 75 ... max = 6883"
#>            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
#> Ncells  7219453 385.6   12391758  661.8         NA   8679818  463.6
#> Vcells 26052258 198.8  231497266 1766.2      16384 239025481 1823.7

SingleCellExperiment


Build SingleCellExperiment object:

Run the ‘scprep_build’ function to read the data and create a SingleCellExperiment object.

# Read data and create SingleCellExperiment object
sce_obj <- scprep_build(
  sample_paths = sample_paths,
  input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
  annotation = annotation,
  file_type = "h5",
  vdj = FALSE, 
  cite = FALSE,
  atac = FALSE,
  output_type = "sce",
  verbose = TRUE
)
#> Building sce object 
#> Using pre-loaded count matrices from input_data 
#> Loaded matrix for Sample_1 : 1000 cells 
#> Loaded matrix for Sample_2 : 1000 cells 
#> Using provided annotation matching input_data names 
#> Total cells loaded: 2000 
#> Calculate UMIs per cell 
#> Calculate Genes per cell 
#> Creating SingleCellExperiment object directly 
#> SingleCellExperiment object created with 2000 cells and 36601 genes

Explore SingleCellExperiment Object

Now let’s explore the structure of the Seurat object to understand where different types of data are stored.

# Basic object information  
cat("SingleCellExperiment dimensions:", dim(sce_obj), "\n")
#> SingleCellExperiment dimensions: 36601 2000
cat("Number of cells:", ncol(sce_obj), "\n")
#> Number of cells: 2000
cat("Number of genes:", nrow(sce_obj), "\n")
#> Number of genes: 36601

# Expression data (counts matrix)
counts_matrix <- counts(sce_obj)
cat("Counts matrix class:", class(counts_matrix), "\n") 
#> Counts matrix class: matrix array
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#>                 AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485                    0                    0                    0
#> ENSG00000237613                    0                    0                    0
#> ENSG00000186092                    0                    0                    0
#> ENSG00000238009                    0                    0                    0
#> ENSG00000239945                    0                    0                    0

# Cell-level metadata
cell_metadata <- colData(sce_obj)
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 10
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#>  [1] "ID"             "Sample"         "Index"          "Sample_Project"
#>  [5] "Reference"      "Tissue"         "Type"           "Donor"         
#>  [9] "UMIs"           "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#> DataFrame with 3 rows and 10 columns
#>                                        ID   Sample    Index    Sample_Project
#>                               <character> <factor> <factor>          <factor>
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#>                                            Reference   Tissue     Type    Donor
#>                                             <factor> <factor> <factor> <factor>
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0    Blood     PBMC        1
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0    Blood     PBMC        1
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0    Blood     PBMC        1
#>                           UMIs     Genes
#>                      <numeric> <numeric>
#> AAACCCAGTATATGGA-1-1       860       350
#> AAACCCAGTATCGTAC-1-1      1548       729
#> AAACCCAGTCGGTGAA-1-1      6387      1827

# Gene-level metadata
gene_metadata <- rowData(sce_obj) 
cat("Gene metadata dimensions:", dim(gene_metadata), "\n")
#> Gene metadata dimensions: 36601 0
if (ncol(gene_metadata) > 0) {
  print("Gene metadata columns:")
  print(colnames(gene_metadata))
  print("First few rows of gene metadata:")
  print(head(gene_metadata, 3))
}

# Available assays
print("Available assays:")
#> [1] "Available assays:"
print(assayNames(sce_obj))
#> [1] "counts"

# Alternative experiments (if any)
if (length(altExpNames(sce_obj)) > 0) {
  print("Alternative experiments:")
  print(altExpNames(sce_obj))
}

# Quality control metrics (from scprep)
if ("UMIs" %in% colnames(cell_metadata)) {
  print("Quality control metrics:")
  print(paste("UMIs per cell: min =", min(cell_metadata$UMIs), "max =", max(cell_metadata$UMIs)))
}
#> [1] "Quality control metrics:"
#> [1] "UMIs per cell: min = 500 max = 61084"
if ("Genes" %in% colnames(cell_metadata)) {
  print(paste("Genes per cell: min =", min(cell_metadata$Genes), "max =", max(cell_metadata$Genes)))
}
#> [1] "Genes per cell: min = 75 max = 6883"
#>            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
#> Ncells  7239461 386.7   12391758  661.8         NA   9023639  482.0
#> Vcells 92624784 706.7  231497266 1766.2      16384 239025481 1823.7

Template Workflow


For automated processing with parameter files, use the template_scprep() function:

# Create parameter file (scprep_parameters.csv) with output_type specification
# Parameters file should include:
# - dir_input: path to input directory
# - dir_output: path to output directory
# - output_type: "eset" (ExpressionSet), "seurat", or "sce" (SingleCellExperiment)
# - file_type: "h5" or "mtx"
# - Other processing parameters...

# Run template workflow
dataset <- scprep::template_scprep(dir_output = "/path/to/output")

Template workflow benefits: - Parameter file-driven approach for reproducible analyses - Automatic parameter logging and seed generation - Built-in biomaRt gene annotation - Cell filtering - Gene filtering


Summary


Summary: Data Location by Object Type

Data Type ExpressionSet Seurat SingleCellExperiment
Counts Matrix exprs(obj) GetAssayData(obj, slot="counts") counts(obj)
Cell Metadata pData(obj) obj@meta.data colData(obj)
Gene Metadata fData(obj) obj[["RNA"]]@meta.features rowData(obj)
Additional Assays assayData(obj)$SlotName obj[["AssayName"]] assay(obj, "AssayName")

Conclusion

The scprep package provides a unified interface for reading 10X Genomics single-cell data and converting it to the most popular single-cell object formats. Each object type has its own conventions for storing expression data, metadata, and additional assays:

All three formats have the same information - choose the object type that works for your downstream analysis workflow