vignette_scprep_usage.Rmd
The scprep
package provides a unified interface for
reading and processing single-cell RNA-seq data from 10X Genomics Cell
Ranger output. The package supports conversion to three popular
single-cell data object types:
This vignette demonstrates how to read 10X data and convert it to each object type, then explore the structure of each object to understand where counts data, cell metadata, and gene metadata are stored.
For this example, we’ll use simulated sample paths and annotation
data that match the expected format for scprep
. In
practice, these would point to your actual 10X Cell Ranger output
directories.
# First let's load the required packages
library(scprep)
library(Biobase)
library(Seurat)
library(SingleCellExperiment)
library(Matrix)
# For this vignette, we will load data from 2 example samples from the 'scdata' package and store in a list (also accepted by 'scprep') (normally you would load data from directories where the 10X data is stored)
library(scdata)
# Load the PBMC 3p and 5p example sample data
data(PBMC_3p_10K, package="scdata")
data(PBMC_5p_10K, package="scdata")
# Store in example_data list - we will only use a subset of the data to keep it lightweight
example_data <- list(
Sample_1 = PBMC_3p_10K[, 1:1000],
Sample_2 = PBMC_5p_10K[, 1:1000]
)
# Clean up the original loaded objects to free memory
rm(PBMC_3p_10K, PBMC_5p_10K)
gc() # Force garbage collection
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 7173896 383.2 12391758 661.8 NA 8679818 463.6
#> Vcells 19181055 146.4 146549777 1118.1 16384 158809916 1211.7
# We will create some fake sample paths for the purpose of this example, which will purely be stored in metadata (again, normally you would replace with your actual 10x data directories)
sample_paths <- c(
"/path/to/Sample_1",
"/path/to/Sample_2"
)
# Example annotation data (modify based on your experimental design)
annotation <- data.frame(
Sample_ID = c("Sample_1", "Sample_2"),
Index = c("SI-GA-A1", "SI-GA-B1"),
Sample_Project = c("scRNA_10X_Project", "scRNA_10X_Project"),
Reference = c("refdata-cellranger-GRCh38-3.0.0", "refdata-cellranger-GRCh38-3.0.0"),
Tissue = c("Blood", "Blood"),
Type = c("PBMC", "PBMC"),
Donor = c(1, 2),
stringsAsFactors = FALSE
)
print("Sample annotation:")
#> [1] "Sample annotation:"
print(annotation)
#> Sample_ID Index Sample_Project Reference Tissue
#> 1 Sample_1 SI-GA-A1 scRNA_10X_Project refdata-cellranger-GRCh38-3.0.0 Blood
#> 2 Sample_2 SI-GA-B1 scRNA_10X_Project refdata-cellranger-GRCh38-3.0.0 Blood
#> Type Donor
#> 1 PBMC 1
#> 2 PBMC 2
The scprep
package provides two main approaches for
reading and processing 10X data:
scprep_build()
template_scprep()
with parameter filesThe scprep_build()
function is the core entry point for
reading 10X data and can output different object types using the
output_type
parameter.
Run the ‘scprep_build’ function to read the data and create an ExpressionSet object.
# Read data and create ExpressionSet object (default)
eset <- scprep::scprep_build(
sample_paths = sample_paths,
input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
annotation = annotation,
file_type = "h5",
vdj = FALSE,
cite = FALSE,
atac = FALSE,
output_type = "ExpressionSet", # This is the default
verbose = TRUE
)
#> Building ExpressionSet object
#> Using pre-loaded count matrices from input_data
#> Loaded matrix for Sample_1 : 1000 cells
#> Loaded matrix for Sample_2 : 1000 cells
#> Using provided annotation matching input_data names
#> Total cells loaded: 2000
#> Calculate UMIs per cell
#> Calculate Genes per cell
#> Creating ExpressionSet object directly
#> ExpressionSet object created with 2000 cells and 36601 genes
print(eset)
#> ExpressionSet (storageMode: environment)
#> assayData: 36601 features, 2000 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 ...
#> ACGGCCATCACTGGGC-1-2 (2000 total)
#> varLabels: ID Sample ... Genes (10 total)
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:
Now let’s explore the structure of the ExpressionSet object to understand where different types of data are stored.
# Basic object information
cat("ExpressionSet dimensions:", dim(eset), "\n")
#> ExpressionSet dimensions: 36601 2000
cat("Number of samples:", ncol(eset), "\n")
#> Number of samples: 2000
cat("Number of genes:", nrow(eset), "\n")
#> Number of genes: 36601
# Expression data (counts matrix)
counts_matrix <- exprs(eset)
cat("Counts matrix class:", class(counts_matrix), "\n")
#> Counts matrix class: matrix array
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485 0 0 0
#> ENSG00000237613 0 0 0
#> ENSG00000186092 0 0 0
#> ENSG00000238009 0 0 0
#> ENSG00000239945 0 0 0
# Cell-level metadata (sample information)
cell_metadata <- pData(eset)
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 10
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#> [1] "ID" "Sample" "Index" "Sample_Project"
#> [5] "Reference" "Tissue" "Type" "Donor"
#> [9] "UMIs" "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#> ID Sample Index Sample_Project
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> Reference Tissue Type Donor UMIs
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 860
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 1548
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 6387
#> Genes
#> AAACCCAGTATATGGA-1-1 350
#> AAACCCAGTATCGTAC-1-1 729
#> AAACCCAGTCGGTGAA-1-1 1827
# Gene-level metadata
gene_metadata <- fData(eset)
cat("Gene metadata dimensions:", dim(gene_metadata), "\n")
#> Gene metadata dimensions: 0 0
if (ncol(gene_metadata) > 0) {
print("Gene metadata columns:")
print(colnames(gene_metadata))
print("First few rows of gene metadata:")
print(head(gene_metadata, 3))
}
# Additional assay data slots
print("Available assay data slots:")
#> [1] "Available assay data slots:"
print(names(assayData(eset)))
#> [1] "exprs"
# Quality control metrics (automatically calculated)
print("Automatically calculated QC metrics:")
#> [1] "Automatically calculated QC metrics:"
print(paste("UMIs per cell: min =", min(eset$UMIs), "... max =", max(eset$UMIs)))
#> [1] "UMIs per cell: min = 500 ... max = 61084"
print(paste("Genes per cell: min =", min(eset$Genes), "... max =", max(eset$Genes)))
#> [1] "Genes per cell: min = 75 ... max = 6883"
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 7193578 384.2 12391758 661.8 NA 8679818 463.6
#> Vcells 92455329 705.4 241076319 1839.3 16384 166293657 1268.8
Run the ‘scprep_build’ function to read the data and create a Seurat object.
# Read data and create Seurat object
seurat_obj <- scprep::scprep_build(
sample_paths = sample_paths,
input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
annotation = annotation,
file_type = "h5",
vdj = FALSE,
cite = FALSE,
atac = FALSE,
output_type = "seurat",
verbose = TRUE
)
#> Building seurat object
#> Using pre-loaded count matrices from input_data
#> Loaded matrix for Sample_1 : 1000 cells
#> Loaded matrix for Sample_2 : 1000 cells
#> Using provided annotation matching input_data names
#> Total cells loaded: 2000
#> Calculate UMIs per cell
#> Calculate Genes per cell
#> Creating Seurat object directly
#> Seurat object created with 2000 cells and 36601 genes
print(seurat_obj)
#> An object of class Seurat
#> 36601 features across 2000 samples within 1 assay
#> Active assay: RNA (36601 features, 0 variable features)
#> 1 layer present: counts
Now let’s explore the structure of the Seurat object to understand where different types of data are stored.
# Basic object information
cat("Seurat object dimensions:", dim(seurat_obj), "\n")
#> Seurat object dimensions: 36601 2000
cat("Number of cells:", ncol(seurat_obj), "\n")
#> Number of cells: 2000
cat("Number of genes:", nrow(seurat_obj), "\n")
#> Number of genes: 36601
# Expression data (counts matrix)
counts_matrix <- Seurat::GetAssayData(seurat_obj, layer = "counts")
cat("Counts matrix class:", class(counts_matrix), "\n")
#> Counts matrix class: dgCMatrix
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#> 5 x 3 sparse Matrix of class "dgCMatrix"
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485 . . .
#> ENSG00000237613 . . .
#> ENSG00000186092 . . .
#> ENSG00000238009 . . .
#> ENSG00000239945 . . .
# Cell-level metadata
cell_metadata <- seurat_obj@meta.data
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 13
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#> [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "ID"
#> [5] "Sample" "Index" "Sample_Project" "Reference"
#> [9] "Tissue" "Type" "Donor" "UMIs"
#> [13] "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#> orig.ident nCount_RNA nFeature_RNA ID
#> AAACCCAGTATATGGA-1-1 SeuratProject 860 350 AAACCCAGTATATGGA-1-1
#> AAACCCAGTATCGTAC-1-1 SeuratProject 1548 729 AAACCCAGTATCGTAC-1-1
#> AAACCCAGTCGGTGAA-1-1 SeuratProject 6387 1827 AAACCCAGTCGGTGAA-1-1
#> Sample Index Sample_Project
#> AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> Reference Tissue Type Donor UMIs
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 860
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 1548
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1 6387
#> Genes
#> AAACCCAGTATATGGA-1-1 350
#> AAACCCAGTATCGTAC-1-1 729
#> AAACCCAGTCGGTGAA-1-1 1827
# Gene-level metadata
gene_metadata <- seurat_obj[["RNA"]][[]]
cat("Gene metadata dimensions:", dim(gene_metadata), "\n")
#> Gene metadata dimensions: 36601 0
if (ncol(gene_metadata) > 0) {
print("Gene metadata columns:")
print(colnames(gene_metadata))
print("First few rows of gene metadata:")
print(head(gene_metadata, 3))
}
# Available assays
print("Available assays:")
#> [1] "Available assays:"
print(names(seurat_obj@assays))
#> [1] "RNA"
# Quality control metrics
if ("nCount_RNA" %in% colnames(cell_metadata)) {
print("Seurat QC metrics:")
print(paste("UMIs per cell: min =", min(cell_metadata$nCount_RNA), "... max =", max(cell_metadata$nCount_RNA)))
}
#> [1] "Seurat QC metrics:"
#> [1] "UMIs per cell: min = 500 ... max = 61084"
if ("nFeature_RNA" %in% colnames(cell_metadata)) {
print(paste("Genes per cell: min =", min(cell_metadata$nFeature_RNA), "... max =", max(cell_metadata$nFeature_RNA)))
}
#> [1] "Genes per cell: min = 75 ... max = 6883"
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 7219453 385.6 12391758 661.8 NA 8679818 463.6
#> Vcells 26052258 198.8 231497266 1766.2 16384 239025481 1823.7
Run the ‘scprep_build’ function to read the data and create a SingleCellExperiment object.
# Read data and create SingleCellExperiment object
sce_obj <- scprep_build(
sample_paths = sample_paths,
input_data = example_data, # normally, you would set this to FALSE to read directly from sample_paths
annotation = annotation,
file_type = "h5",
vdj = FALSE,
cite = FALSE,
atac = FALSE,
output_type = "sce",
verbose = TRUE
)
#> Building sce object
#> Using pre-loaded count matrices from input_data
#> Loaded matrix for Sample_1 : 1000 cells
#> Loaded matrix for Sample_2 : 1000 cells
#> Using provided annotation matching input_data names
#> Total cells loaded: 2000
#> Calculate UMIs per cell
#> Calculate Genes per cell
#> Creating SingleCellExperiment object directly
#> SingleCellExperiment object created with 2000 cells and 36601 genes
Now let’s explore the structure of the Seurat object to understand where different types of data are stored.
# Basic object information
cat("SingleCellExperiment dimensions:", dim(sce_obj), "\n")
#> SingleCellExperiment dimensions: 36601 2000
cat("Number of cells:", ncol(sce_obj), "\n")
#> Number of cells: 2000
cat("Number of genes:", nrow(sce_obj), "\n")
#> Number of genes: 36601
# Expression data (counts matrix)
counts_matrix <- counts(sce_obj)
cat("Counts matrix class:", class(counts_matrix), "\n")
#> Counts matrix class: matrix array
cat("Counts matrix dimensions:", dim(counts_matrix), "\n")
#> Counts matrix dimensions: 36601 2000
print("First few genes and cells:")
#> [1] "First few genes and cells:"
print(counts_matrix[1:5, 1:3])
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATCGTAC-1-1 AAACCCAGTCGGTGAA-1-1
#> ENSG00000243485 0 0 0
#> ENSG00000237613 0 0 0
#> ENSG00000186092 0 0 0
#> ENSG00000238009 0 0 0
#> ENSG00000239945 0 0 0
# Cell-level metadata
cell_metadata <- colData(sce_obj)
cat("Cell metadata dimensions:", dim(cell_metadata), "\n")
#> Cell metadata dimensions: 2000 10
print("Cell metadata columns:")
#> [1] "Cell metadata columns:"
print(colnames(cell_metadata))
#> [1] "ID" "Sample" "Index" "Sample_Project"
#> [5] "Reference" "Tissue" "Type" "Donor"
#> [9] "UMIs" "Genes"
print("First few rows of cell metadata:")
#> [1] "First few rows of cell metadata:"
print(head(cell_metadata, 3))
#> DataFrame with 3 rows and 10 columns
#> ID Sample Index Sample_Project
#> <character> <factor> <factor> <factor>
#> AAACCCAGTATATGGA-1-1 AAACCCAGTATATGGA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTATCGTAC-1-1 AAACCCAGTATCGTAC-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> AAACCCAGTCGGTGAA-1-1 AAACCCAGTCGGTGAA-1-1 Sample_1 SI-GA-A1 scRNA_10X_Project
#> Reference Tissue Type Donor
#> <factor> <factor> <factor> <factor>
#> AAACCCAGTATATGGA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1
#> AAACCCAGTATCGTAC-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1
#> AAACCCAGTCGGTGAA-1-1 refdata-cellranger-GRCh38-3.0.0 Blood PBMC 1
#> UMIs Genes
#> <numeric> <numeric>
#> AAACCCAGTATATGGA-1-1 860 350
#> AAACCCAGTATCGTAC-1-1 1548 729
#> AAACCCAGTCGGTGAA-1-1 6387 1827
# Gene-level metadata
gene_metadata <- rowData(sce_obj)
cat("Gene metadata dimensions:", dim(gene_metadata), "\n")
#> Gene metadata dimensions: 36601 0
if (ncol(gene_metadata) > 0) {
print("Gene metadata columns:")
print(colnames(gene_metadata))
print("First few rows of gene metadata:")
print(head(gene_metadata, 3))
}
# Available assays
print("Available assays:")
#> [1] "Available assays:"
print(assayNames(sce_obj))
#> [1] "counts"
# Alternative experiments (if any)
if (length(altExpNames(sce_obj)) > 0) {
print("Alternative experiments:")
print(altExpNames(sce_obj))
}
# Quality control metrics (from scprep)
if ("UMIs" %in% colnames(cell_metadata)) {
print("Quality control metrics:")
print(paste("UMIs per cell: min =", min(cell_metadata$UMIs), "max =", max(cell_metadata$UMIs)))
}
#> [1] "Quality control metrics:"
#> [1] "UMIs per cell: min = 500 max = 61084"
if ("Genes" %in% colnames(cell_metadata)) {
print(paste("Genes per cell: min =", min(cell_metadata$Genes), "max =", max(cell_metadata$Genes)))
}
#> [1] "Genes per cell: min = 75 max = 6883"
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 7239461 386.7 12391758 661.8 NA 9023639 482.0
#> Vcells 92624784 706.7 231497266 1766.2 16384 239025481 1823.7
template_scprep()
function:
# Create parameter file (scprep_parameters.csv) with output_type specification
# Parameters file should include:
# - dir_input: path to input directory
# - dir_output: path to output directory
# - output_type: "eset" (ExpressionSet), "seurat", or "sce" (SingleCellExperiment)
# - file_type: "h5" or "mtx"
# - Other processing parameters...
# Run template workflow
dataset <- scprep::template_scprep(dir_output = "/path/to/output")
Template workflow benefits: - Parameter file-driven approach for reproducible analyses - Automatic parameter logging and seed generation - Built-in biomaRt gene annotation - Cell filtering - Gene filtering
Data Type | ExpressionSet | Seurat | SingleCellExperiment |
---|---|---|---|
Counts Matrix | exprs(obj) |
GetAssayData(obj, slot="counts") |
counts(obj) |
Cell Metadata | pData(obj) |
obj@meta.data |
colData(obj) |
Gene Metadata | fData(obj) |
obj[["RNA"]]@meta.features |
rowData(obj) |
Additional Assays | assayData(obj)$SlotName |
obj[["AssayName"]] |
assay(obj, "AssayName") |
The scprep
package provides a unified interface for
reading 10X Genomics single-cell data and converting it to the most
popular single-cell object formats. Each object type has its own
conventions for storing expression data, metadata, and additional
assays:
exprs()
, pData()
, and
fData()
GetAssayData()
, @meta.data
, and
assay-specific slotscounts()
, colData()
,
and rowData()
All three formats have the same information - choose the object type that works for your downstream analysis workflow