Quick Start Guide

Abstract

scMaSigPro is an R package designed for the serial analysis of single-cell RNA-seq (scRNA-Seq) data along inferred pseudotime. It builds upon the maSigPro Bioconductor package to identify genes with significant expression changes across different branching paths in a pseudotime-ordered scRNA-Seq dataset. This vignette illustrates the basic workflow of scMaSigPro, providing a step-by-step guide for users using a small simulated dataset.

Introduction

scMaSigPro is a polynomial regression-based approach inspired by the maSigPro Bioconductor package tailored for scRNA-Seq data. It first discretizes single cell expression along inferred pseudotime while preserving order. Afterwards it applies the maSigPro model to pinpoint genes exhibiting significant expression profile differences among branching paths and pseudotime.

Installation

Currently, scMaSigPro is available on GitHub and can be installed as follows:

Bioconductor and Dependencies

# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")

BiocManager::install(c('SingleCellExperiment', 'maSigPro', 'MatrixGenerics', 'S4Vectors'))

scMaSigPro latest version

To install scMaSigPro from GitHub, use the following R code:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
                         ref = "main",
                         build_vignettes = FALSE,
                         build_manual = TRUE,
                         upgrade = "never",
                         force = TRUE,
                         quiet = TRUE)

Basic Workflow

Here, we demonstrate the basic workflow of scMaSigPro using a simulated dataset included in the package, simulated via splatter.

Load the `scMaSigPro` package and the dataset

# Set Seed for Reproducibility
set.seed(123)

# Load Package
library(scMaSigPro)

# Load example data
data("splat.sim", package = "scMaSigPro")

Plot the data (Optional)

# Load
suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(SingleCellExperiment))

# Extract PCA
df <- as.data.frame(reducedDim(splat.sim)[, c(1:2)])
df$Step <- splat.sim@colData$Step
df$Group <- splat.sim@colData$Group

# Plot the data
ggplot(
  df,
  aes(x = PC1, y = PC2, color = Step, shape = Group)
) +
  geom_point() +
  theme_minimal()

If you are interested in learning how this data is simulated and what information it contains, you can try ?splat.sim in the R console to read about the details.

The UMAP visualization of our dataset reveals a bifurcating trajectory, illustrating how one cell type diverges into two distinct cell types. This bifurcation is driven by the selective expression of genes along the pseudotime. In the UMAP plot, the two branching paths of differentiation are represented by points with distinct shapes, each corresponding to a different cell type.

Furthermore, the points in the UMAP are color-coded based on a simulated pseudotime-like variable, termed Step, from the Splatter package. This color gradient effectively captures the progression of cells along the pseudotime axis, providing insights into the temporal dynamics of cellular differentiation.

Convert to `scMaSigPro` Object

scMaSigPro offers various options to directly convert widely used single-cell related S4 objects directly to the scmpClass class (now referred to as scmpObject), such as SingleCellExperiment or Monocle3’s cell_data_set class. Here, using the simulated object (of the SingleCellExperiment class) that we have just loaded into the R environment, we will demonstrate the direct conversion to scmpObject object.

In our case, we already have pseudotime and path information stored in the colData slot of the splat.sim. The Step column holds the information for the simulated steps of each path, which we can treat as a pseudotime variable, and the Group column has information regarding branching path, i.e., to which branching path each of the cell belongs. We can view this information by executing View(as.data.frame(splat.sim@colData)).

##        Cell  Batch Group ExpLibSize Step sizeFactor
## Cell1 Cell1 Batch1 Path2   56391.37   42  0.8847501
## Cell2 Cell2 Batch1 Path1   54923.49   41  0.7779386
## Cell3 Cell3 Batch1 Path1   52122.28   23  0.7687344
## Cell4 Cell4 Batch1 Path1   59039.15   42  0.9017475
## Cell5 Cell5 Batch1 Path2   64317.37   79  1.0111086

We will use as_scmp() from scMaSigPro and direct it to use the columns present in the dataset as the columns for Pseudotime and Path. To do this, we have to enable the parameter labels_exist and then pass the existing column names as a named list.

# Helper Function to convert annotated SCE object to scmpObject
scmp_ob <- as_scmp(
  object = splat.sim, from = "sce",
  align_pseudotime = FALSE,
  verbose = TRUE,
  additional_params = list(
    labels_exist = TRUE,
    exist_ptime_col = "Step",
    exist_path_col = "Group"
  )
)

## Supplied object: SingleCellExperiment object

## Overwritting columns in cell level metadata, 'Group'
##         is replaced by 'Path' and 'Step' is replaced by 'Pseudotime'.

Once the object is created, we can type the name of the object in the R console to view various attributes, such as the dimensions (number of cells and number of genes), the available branching paths, and the range of pseudotime.

## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path2, Path1

Pseudo-bulking along the continuum with `sc.squeeze()`

scMaSigPro provides a comprehensive function for pseudo-bulking along the pseudotime continuum, named sc.squeeze(). This function discretizes the original pseudotime values into bins using histogram binning methods. The sc.squeeze() function includes various parameters that can be tailored depending on the characteristics of the data. Here, we will show the basic usage with the default parameters:

scmp_ob <- sc.squeeze(scmp_ob)

The console output of scMaSigPro is dynamic and displays more attributes as the analysis progresses. To view the results of the binning procedure, we can simply type the object’s name into the console:

## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean), 
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13

Here, we observe that the original pseudotime, which ranged from 1 to 100 with two cells at each step and different paths, is now rescaled to a range of 1 to 8, with each path having an equal number of bins (8+8 in total). We also see the average bin sizes per path, indicating the average number of cells that make up a single bin for each path.

Visualize bins

We can also visually inspect the binning process using a tile plot:

plotBinTile(scmp_ob)

The tile plot provides a clear view of the number of cells in each bin and how the sizes of bins compare with those of other paths in the same scmp_binned_pseudotime.

Setting up the Polynomial Model

scMaSigPro, being the successor of maSigPro, utilizes the same polynomial regression model. Let’s consider a case with two branching paths along the same pseudotime scale, modeled with a quadratic polynomial. The model can be represented as follows:

## Loading required package: knitr

\[\begin{align*} Y_{i} &\sim \text{NegativeBinomial}(\mu_{i}, \theta = 10) \\ \log(\mu_{i}) &= \beta_{0} + \beta_{1} \cdot (\text{Path}_{B}\text{vsPath}_{A})_{i} + \beta_{2} \cdot T_{\text{Pseudotime}}^{\text{Binned}_{i}} \\ &\quad + \beta_{3} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}} \cdot \text{Path}_{B_{i}}) + \beta_{4} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \\ &\quad + \beta_{5} \cdot ((T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \cdot \text{Path}_{B_{i}}) + \text{Offset}_{i} + \omega_{i} \cdot \epsilon_{i}\\ \end{align*}\]

To construct this model, we use sc.set.poly() to include quadratic terms:

# Polynomial Degree 2
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 2)

Visualize the model

Once the model is stored, we can visualize the corresponding polynomial using the showPoly() function:

showPoly(scmp_ob)

## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2"

Similarly, we can fit a cubic polynomial by setting the polynomial degree to 3:

# Polynomial Degree 3
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 3)
showPoly(scmp_ob)

## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2 + beta6*scmp_binned_pseudotime3 + beta7*scmp_binned_pseudotime3xPath2"

However, for simplicity, we will explore a polynomial of degree 1. Note that increasing the polynomial degree enhances scMaSigPro’s performance in capturing exponential and nonlinear gene expression patterns:

# Polynomial Degree 1
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 1)
showPoly(scmp_ob)

## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2"

In the above model we have:

beta0: Accounts for differences in expression from the start to the end.
_beta1*Path2vsPath1_: Captures differences between the branching paths, assuming path-1 does not change along pseudotime.
_beta2*scmp_binned_pseudotime_: Reflects differences across pseudotime.
_beta3*scmp_binned_pseudotimexPath2_: Represents interaction between pseudotime and path differences.

Detecting Genes with Non-Flat Profiles

To identify genes that demonstrate significant changes along the pseudotime contniuum, we use the sc.p.vector(). This function, adapted from the original maSigPro, includes additional parameters such as the use of offsets and weights.

scMaSigPro expects raw counts as input because it models data using a negative binomial distribution, a count distribution, so the counts should not be normalized to continuous values. To account for library size differences, it uses offsets, which are the log of size factors, similar to DESeq2.

For those who prefer to supply continuous data normalized using methods like log(1+x), the distribution can be changed during model fitting.

We can execute sc.p.vector() as follows:

# Detect non-flat profiles
scmp_ob <- sc.p.vector(scmp_ob,
  offset = TRUE, p_value = 0.05, verbose = FALSE,
  log_offset = TRUE
)
scmp_ob

## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean), 
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50

The console output reveals that scMaSigPro detected 51 genes with non-flat profiles.

Having identified genes with significant profiles, we can refine their polynomial models using sc.t.fit(). This function evaluates each term of the polynomial model. In our case, it will assess which among “beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2” significantly contributes to the differences. To execute sc.t.fit(), we proceed as follows:

# Model refinement
scmp_ob <- sc.t.fit(scmp_ob, verbose = FALSE)
scmp_ob

## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean), 
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8

Selection of Genes

With our refined models in hand, we now focus on identifying genes showing significant differences with pseudotime, among paths, or both. For this purpose, we use the sc.filter() function. Our aim is to select models with a relatively high \(R^2\), indicating simple linear relationships. The vars parameter in sc.filter() allows us to extract different sets of significant genes. Setting vars = 'all' retrieves all non-flat profiles identified in sc.p.vector() with \(R^2>=\) the specified threshold. The option vars = 'groups fetches genes per path, resulting in two gene lists that demonstrate associative significance among paths, helping us identify genes associated with one path or the other along the pseudotime continuum. The vars = 'each' option finds significance for each term in the polynomial. In our case, we are interested in genes differentially expressed between paths and over pseudotime continuum, so we will choose vars = 'groups.

scmp_ob <- sc.filter(
  scmpObj = scmp_ob,
  rsq = 0.7,
  vars = "groups",
  intercept = "dummy",
  includeInflu = TRUE
)

By setting the vars parameter to “groups”, the function will add genes with \(R^2\) >= 0.7 to the object. To explore the number of genes per group, we will make an upset plot:

plotIntersect(scmp_ob)

Here, we observe that 23 genes belong to both Path2vsPath1 and Path1, indicating that these genes not only change along pseudotime but also exhibit significantly different expression between the two paths. Additionally, there are 10 genes uniquely associated with Path2vsPath1. This implies that Path2 has 10 genes that are significantly differentially expressed over time, using Path1 genes as a reference. Let’s explore a few of these genes:

FigureA <- plotTrend(scmp_ob, "Gene9", logs = TRUE, logType = "log", lines = TRUE)
FigureB <- plotTrend(scmp_ob, "Gene95", logs = TRUE, logType = "log", lines = TRUE)
FigureC <- plotTrend(scmp_ob, "Gene10", logs = TRUE, logType = "log", lines = TRUE)
FigureD <- plotTrend(scmp_ob, "Gene92", logs = TRUE, logType = "log", lines = TRUE)
ggpubr::ggarrange(FigureA, FigureB, FigureC, FigureD,
  ncol = 2, nrow = 2,
  labels = c("A", "B", "C", "D")
)

These plots illustrate the gene expression trends for selected genes, providing insights into their behavior in different paths. Fortunately, since we are using simulated data, we can validate these trends by extracting gene-level metadata from our simulated object:

groundTruth <- as.data.frame(splat.sim@rowRanges@elementMetadata)
print(groundTruth[groundTruth$Gene %in% c("Gene9", "Gene95", "Gene10", "Gene92"), c(1, 2, 6, 8)])

##      Gene BaseGeneMean DEFacPath1 DEFacPath2
## 9   Gene9    0.5478075  1.5743095  1.0000000
## 10 Gene10    2.7941952  0.5499756  2.8090798
## 92 Gene92    2.4912557  0.4996190  0.4264288
## 95 Gene95    0.4676235  1.0000000  2.4591537

The ground truth data reveals, for example, that Gene9 has a base gene mean of 0.5, a fold change of 1.5 in Path1, and remains at the same expression level for Path2. This trend is accurately captured in our analysis, as shown in Figure-A. Similarly, Gene92 does not show a significant difference between the paths but demonstrates a downtrend along pseudotime in both paths, a finding we successfully recapitulate in Figure-D.

Cluster Trends

To discern general trends among the genes, we can cluster and visualize them together using the sc.cluster.trend() and plotTrendCluster(). By default, this function employs hierarchical clustering (hclust) and we will divide the genes into 4 clusters.

# Cluster Trend
scmp_ob <- sc.cluster.trend(
  scmp_ob,
  geneSet = "union",
  k = 4
)

# Plot
plotTrendCluster(
  scmpObj = scmp_ob,
  plot = "coeff",
  logs = TRUE,
  verbose = FALSE,
  lines = TRUE
)

This visualization helps in understanding the collective behavior of genes within each cluster, highlighting patterns and trends that might be relevant for further biological interpretation.

This concludes the basic usage quick start guide of scMaSigPro. Please refer to other vignettes for more in-depth analysis.

Session Info

sessionInfo(package = "scMaSigPro")

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## character(0)
## 
## other attached packages:
## [1] scMaSigPro_0.0.4
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-7                gridExtra_2.3              
##   [3] rlang_1.1.3                 magrittr_2.0.3             
##   [5] matrixStats_1.2.0           e1071_1.7-14               
##   [7] compiler_4.3.0              mgcv_1.9-1                 
##   [9] venn_1.11                   systemfonts_1.0.5          
##  [11] vctrs_0.6.5                 stringr_1.5.1              
##  [13] pkgconfig_2.0.3             crayon_1.5.2               
##  [15] fastmap_1.1.1               backports_1.4.1            
##  [17] XVector_0.42.0              ellipsis_0.3.2             
##  [19] labeling_0.4.3              utf8_1.2.4                 
##  [21] promises_1.2.1              rmarkdown_2.25             
##  [23] grDevices_4.3.0             UpSetR_1.4.0               
##  [25] ragg_1.2.7                  purrr_1.0.2                
##  [27] xfun_0.42                   zlibbioc_1.48.0            
##  [29] cachem_1.0.8                graphics_4.3.0             
##  [31] GenomeInfoDb_1.38.6         jsonlite_1.8.8             
##  [33] highr_0.10                  later_1.3.2                
##  [35] DelayedArray_0.28.0         broom_1.0.5                
##  [37] parallel_4.3.0              R6_2.5.1                   
##  [39] bslib_0.6.1                 stringi_1.8.3              
##  [41] car_3.1-2                   parallelly_1.37.0          
##  [43] GenomicRanges_1.54.1        jquerylib_0.1.4            
##  [45] Rcpp_1.0.12                 bookdown_0.37              
##  [47] assertthat_0.2.1            SummarizedExperiment_1.32.0
##  [49] knitr_1.45                  IRanges_2.36.0             
##  [51] splines_4.3.0               httpuv_1.6.14              
##  [53] Matrix_1.6-5                igraph_1.6.0               
##  [55] tidyselect_1.2.0            rstudioapi_0.15.0          
##  [57] abind_1.4-5                 yaml_2.3.8                 
##  [59] admisc_0.34                 plyr_1.8.9                 
##  [61] lattice_0.22-5              tibble_3.2.1               
##  [63] Biobase_2.62.0              shiny_1.8.0                
##  [65] withr_3.0.0                 evaluate_0.23              
##  [67] base_4.3.0                  desc_1.4.3                 
##  [69] proxy_0.4-27                mclust_6.0.1               
##  [71] ggpubr_0.6.0                pillar_1.9.0               
##  [73] BiocManager_1.30.22         carData_3.0-5              
##  [75] MatrixGenerics_1.14.0       stats4_4.3.0               
##  [77] plotly_4.10.4               generics_0.1.3             
##  [79] RCurl_1.98-1.14             S4Vectors_0.40.2           
##  [81] ggplot2_3.5.1               munsell_0.5.0              
##  [83] scales_1.3.0                BiocStyle_2.30.0           
##  [85] stats_4.3.0                 xtable_1.8-4               
##  [87] class_7.3-22                glue_1.7.0                 
##  [89] lazyeval_0.2.2              tools_4.3.0                
##  [91] datasets_4.3.0              data.table_1.15.0          
##  [93] ggsignif_0.6.4              fs_1.6.3                   
##  [95] cowplot_1.1.2               grid_4.3.0                 
##  [97] utils_4.3.0                 tidyr_1.3.1                
##  [99] methods_4.3.0               colorspace_2.1-0           
## [101] SingleCellExperiment_1.24.0 nlme_3.1-164               
## [103] GenomeInfoDbData_1.2.11     cli_3.6.2                  
## [105] textshaping_0.3.7           fansi_1.0.6                
## [107] S4Arrays_1.2.0              viridisLite_0.4.2          
## [109] dplyr_1.1.4                 gtable_0.3.4               
## [111] rstatix_0.7.2               sass_0.4.8                 
## [113] digest_0.6.34               BiocGenerics_0.48.1        
## [115] SparseArray_1.2.4           htmlwidgets_1.6.4          
## [117] farver_2.1.1                memoise_2.0.1              
## [119] maSigPro_1.74.0             entropy_1.3.1              
## [121] htmltools_0.5.7             pkgdown_2.0.7              
## [123] lifecycle_1.0.4             httr_1.4.7                 
## [125] mime_0.12                   MASS_7.3-60

Basic Steps

Priyansh Srivastava

Introduction

Installation

Bioconductor and Dependencies

scMaSigPro latest version

Basic Workflow

Load the `scMaSigPro` package and the dataset

Plot the data (Optional)

Convert to `scMaSigPro` Object

Pseudo-bulking along the continuum with `sc.squeeze()`

Visualize bins

Setting up the Polynomial Model

Visualize the model

Detecting Genes with Non-Flat Profiles

Model Refinement

Selection of Genes

Cluster Trends

Session Info

Quick Start Guide

Basic Steps

Priyansh Srivastava

Introduction

Installation

Bioconductor and Dependencies

scMaSigPro latest version

Basic Workflow

Load the scMaSigPro package and the dataset

Plot the data (Optional)

Convert to scMaSigPro Object

Pseudo-bulking along the continuum with sc.squeeze()

Visualize bins

Setting up the Polynomial Model

Visualize the model

Detecting Genes with Non-Flat Profiles

Model Refinement

Selection of Genes

Cluster Trends

Session Info

Load the `scMaSigPro` package and the dataset

Convert to `scMaSigPro` Object

Pseudo-bulking along the continuum with `sc.squeeze()`