An R package for aggregating single-cell RNA-Seq data and metadata in ExpressionSet, Seurat, or SingleCellExperiment objects, along with biomaRt gene annotation, basic cell filtering and QC metric calculation.
Test data for scprep is available in the scdata package.
For post-processing and secondary analysis of data ingested by scprep, this package can be paired with scpost.
The scprep R package can be installed from GitHub using devtools:
devtools::install_github("g-duclos/scprep")
For a containerized environment with all dependencies pre-installed:
Quick Start with Docker:
# Build the Docker image
docker build -t scprep .
# Run interactive R session with scprep
docker run -it --rm scprep
# Run with data directory mounted
docker run -it --rm -v /path/to/your/data:/scprep/data scprep
Using Docker Compose:
# Start the scprep container
docker-compose up scprep
# For RStudio interface (optional)
docker-compose up scprep-rstudio
# Then navigate to http://localhost:8788
Docker Image Features:
- Pre-installed R 4.2.1+ with all required dependencies
- Seurat v5, Matrix, Biobase, biomaRt, and other dependencies configured
- Optimized for single-cell RNA-Seq workflows
- Volume mounting for data input/output
📖 Complete Usage Guide & Object Types Vignette - Comprehensive guide demonstrating how to read 10X data, convert between object types, and explore data structure
Specify the sample metadata - view the annotation file here: scprep_annotation.csv (click “Raw file” at the top right to download)
* Column 1: “Sample_ID” corresponds to the name of each sample
* Column 2: “Index” corresponds to the name of the PCR index used during library preparation
* Column 3: “Sample_Project” corresponds to the name of the project affiliated with all samples
* Column 4: “Reference” corresponds to the name of the reference used when running Cell Ranger
* Additional columns can be added for any other metadata that is useful for a particular experiment
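As a rough illustration of the layout described above, the following R sketch builds a small annotation table and checks that the four required columns are present and in order. The sample, index, project, and reference values are made up for the example, `check_annotation()` is a hypothetical helper rather than part of scprep, and the uniqueness check on Sample_ID is an assumption based on each sample needing its own subdirectory.

```r
# Hypothetical helper (not part of scprep): verify the annotation layout
required_cols <- c("Sample_ID", "Index", "Sample_Project", "Reference")

check_annotation <- function(annotation) {
  # The first four columns must match the required names, in order
  if (ncol(annotation) < length(required_cols) ||
      !identical(colnames(annotation)[seq_along(required_cols)], required_cols)) {
    stop("First four columns must be: ", paste(required_cols, collapse = ", "))
  }
  # Assumption: sample names double as subdirectory names, so they must be unique
  if (anyDuplicated(annotation$Sample_ID) > 0) {
    stop("Sample_ID values must be unique")
  }
  invisible(TRUE)
}

# Made-up example values; "Treatment" is an optional extra metadata column
annotation <- data.frame(
  Sample_ID = c("sample_A", "sample_B"),
  Index = c("SI-GA-A1", "SI-GA-A2"),
  Sample_Project = c("demo_project", "demo_project"),
  Reference = c("GRCh38", "GRCh38"),
  Treatment = c("control", "treated"),
  stringsAsFactors = FALSE
)
check_annotation(annotation)

# Round-trip through CSV to mimic reading scprep_annotation.csv
csv_path <- file.path(tempdir(), "scprep_annotation.csv")
write.csv(annotation, csv_path, row.names = FALSE)
annotation_from_file <- read.csv(csv_path, stringsAsFactors = FALSE)
check_annotation(annotation_from_file)
```

Running a check like this before the pipeline catches misnamed or reordered columns early, when they are still cheap to fix.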
Specify the pipeline parameters - view the parameters file here: scprep_parameters.csv (click “Raw file” at the top right to download)
Critical Parameters
* dir_input (path to input directory)
* dir_output (path to output directory)
* file_type (input format: “h5” or “mtx”)
* output_type (object type: “eset”, “seurat”, or “sce”)
* gene_id (gene identifier type: “ensembl” or “symbol”)
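The allowed values listed above can be checked before running the pipeline. The sketch below is only an illustration: the layout of scprep_parameters.csv is not shown here, so the parameters are represented as a plain named list, the paths are placeholders, and `validate_parameters()` is a hypothetical helper rather than an scprep function.

```r
# Allowed values for the restricted parameters, as listed above
allowed <- list(
  file_type = c("h5", "mtx"),
  output_type = c("eset", "seurat", "sce"),
  gene_id = c("ensembl", "symbol")
)

# Hypothetical helper (not part of scprep): check the critical parameters
validate_parameters <- function(params) {
  required <- c("dir_input", "dir_output", names(allowed))
  missing_params <- setdiff(required, names(params))
  if (length(missing_params) > 0) {
    stop("Missing parameters: ", paste(missing_params, collapse = ", "))
  }
  # Each restricted parameter must be a single string from its allowed set
  for (name in names(allowed)) {
    value <- params[[name]]
    if (!is.character(value) || length(value) != 1 || !(value %in% allowed[[name]])) {
      stop(name, " must be one of: ", paste(allowed[[name]], collapse = ", "))
    }
  }
  invisible(params)
}

# Placeholder paths; replace with real directories
params <- list(
  dir_input = "/path/to/input",
  dir_output = "/path/to/output",
  file_type = "h5",
  output_type = "seurat",
  gene_id = "ensembl"
)
validated <- validate_parameters(params)
```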
Define the input directory (dir_input), which must contain a subdirectory named after each sample. The input directory (dir_input) must also contain the following file(s) produced by 10X Genomics’ Cell Ranger pipeline:
Option 1: H5 Format (Recommended)
* filtered_feature_bc_matrix.h5 (the filtered gene counts matrix for each sample)

Option 2: MTX Format
* matrix.mtx (gene expression count matrix in Matrix Market format)
* barcodes.tsv (cell barcodes)
* genes.tsv or features.tsv (gene identifiers and names)

Additional Files:
* If working with 10x Genomics Immune Profiling assay that includes 5’ RNA-Seq with TCR or Ig V(D)J data: filtered_contig_annotations.csv
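Before running the pipeline it can help to confirm that the expected files are in place. The sketch below assumes the Cell Ranger files sit directly inside each sample's subdirectory and use the uncompressed names listed above; `find_missing_inputs()` is a hypothetical helper, not an scprep function, and the demo directory is created under `tempdir()` purely for illustration.

```r
# Hypothetical helper (not part of scprep): list expected input files that are absent
find_missing_inputs <- function(dir_input, sample_ids, file_type = c("h5", "mtx")) {
  file_type <- match.arg(file_type)
  not_found <- character(0)
  for (sample_id in sample_ids) {
    sample_dir <- file.path(dir_input, sample_id)
    if (file_type == "h5") {
      expected <- "filtered_feature_bc_matrix.h5"
    } else {
      expected <- c("matrix.mtx", "barcodes.tsv")
      # Either genes.tsv or features.tsv is acceptable
      gene_files <- file.path(sample_dir, c("genes.tsv", "features.tsv"))
      if (!any(file.exists(gene_files))) {
        not_found <- c(not_found, file.path(sample_dir, "genes.tsv or features.tsv"))
      }
    }
    paths <- file.path(sample_dir, expected)
    not_found <- c(not_found, paths[!file.exists(paths)])
  }
  not_found
}

# Demo: two sample subdirectories, only one of which holds an H5 file
dir_input <- file.path(tempdir(), "scprep_input_demo")
dir.create(file.path(dir_input, "sample_A"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(dir_input, "sample_B"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(dir_input, "sample_A", "filtered_feature_bc_matrix.h5")))

missing_files <- find_missing_inputs(dir_input, c("sample_A", "sample_B"), file_type = "h5")
print(missing_files)
```

An empty result means every listed sample has the files required for the chosen format.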
Format Selection: Set the file_type parameter in scprep_parameters.csv:
- file_type="h5" for H5 format (supports multimodal data like CITE-seq/ATAC-seq)
- file_type="mtx" for MTX format (RNA expression data only)
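To show what the two formats mean in practice, here is a minimal sketch of reading one sample with Seurat's own readers, `Seurat::Read10X_h5()` and `Seurat::Read10X()`. It is not scprep's internal code: `read_sample_counts()` is a hypothetical wrapper, and it assumes the files sit directly inside the sample's subdirectory.

```r
# Hypothetical wrapper (not part of scprep): read one sample's counts by file_type
read_sample_counts <- function(sample_dir, file_type) {
  if (identical(file_type, "h5")) {
    # Read10X_h5() needs the hdf5r package; it returns a sparse matrix, or a
    # named list of matrices when the file holds several feature types
    Seurat::Read10X_h5(file.path(sample_dir, "filtered_feature_bc_matrix.h5"))
  } else if (identical(file_type, "mtx")) {
    # Read10X() takes the directory holding the matrix, barcodes, and
    # genes/features files
    Seurat::Read10X(data.dir = sample_dir)
  } else {
    stop("file_type must be \"h5\" or \"mtx\"")
  }
}

# Usage (placeholder path; requires Seurat and real Cell Ranger output):
# counts <- read_sample_counts("/path/to/input/sample_A", file_type = "h5")
```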
scprep requires the following R packages:
- Seurat (>= 5.0.0): For reading 10X Genomics data formats and Seurat object creation
- Matrix (>= 1.2-0): For sparse matrix operations
- Biobase: For ExpressionSet data structures
- SingleCellExperiment: For SingleCellExperiment object creation
- R (>= 4.0.0): Minimum R version
Note: All dependencies are automatically installed when using devtools::install_github() or the Docker container.
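If an installation misbehaves, the versions above can be checked by hand. The following sketch uses only base R; `check_dependencies()` is a hypothetical helper, not an scprep function, and an `NA` minimum means no version floor is stated above.

```r
# Minimum versions taken from the list above; NA means no minimum is stated
minimum_versions <- c(
  Seurat = "5.0.0",
  Matrix = "1.2-0",
  Biobase = NA,
  SingleCellExperiment = NA
)

# Hypothetical helper (not part of scprep): report "ok", "too old", or "missing"
check_dependencies <- function(minimum_versions) {
  vapply(names(minimum_versions), function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      return("missing")
    }
    required <- minimum_versions[[pkg]]
    if (!is.na(required) && utils::packageVersion(pkg) < required) {
      return("too old")
    }
    "ok"
  }, character(1))
}

# R itself must be at least 4.0.0
r_version_ok <- getRversion() >= "4.0.0"

dependency_status <- check_dependencies(minimum_versions)
print(dependency_status)
```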
scprep supports three output object types: ExpressionSet ("eset"), Seurat ("seurat"), and SingleCellExperiment ("sce").
Template function to aggregate gene counts matrices from multiple samples, store aggregated counts matrices and metadata (for samples and genes) in an ExpressionSet, Seurat, or SingleCellExperiment object, add biomaRt gene annotation, perform cell filtering, and calculate select QC metrics.
Output type is specified in the scprep_parameters.csv file:
library(Biobase)
# Placeholder path to the output directory; replace with your own
dir_output <- "/path/to/output"
# Output type determined by output_type parameter in scprep_parameters.csv
dataset <- scprep::template_scprep(dir_output=dir_output)
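Because the class of `dataset` depends on output_type, downstream code may want to check which kind of object it received. The sketch below is one way to do that with base R only; `summarise_dataset()` is a hypothetical helper, not an scprep function, and it relies on `dim()` returning features by cells for all three classes.

```r
# Hypothetical helper (not part of scprep): summarise whichever object type was returned
summarise_dataset <- function(dataset) {
  output_type <- if (inherits(dataset, "ExpressionSet")) {
    "eset"
  } else if (inherits(dataset, "Seurat")) {
    "seurat"
  } else if (inherits(dataset, "SingleCellExperiment")) {
    "sce"
  } else {
    stop("Unsupported object class: ", paste(class(dataset), collapse = ", "))
  }
  # dim() gives features (rows) by cells (columns) for all three classes
  dims <- dim(dataset)
  list(output_type = output_type, n_features = dims[[1]], n_cells = dims[[2]])
}

# Usage, once `dataset` has been created by template_scprep():
# str(summarise_dataset(dataset))
```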
Core functions:
# Build Seurat object with RNA data (no additional modalities) dataset
# Add gene-level metadata to the object dataset
# Assign status of high quality "Cell", "Dead", or "Debris" to each barcode and add to object metadata dataset