vignettes/Basic-Workflow.Rmd
Basic-Workflow.Rmd
Abstract
scMaSigPro
is an R package designed for the serial
analysis of single-cell RNA-seq (scRNA-Seq) data along inferred
pseudotime. It builds upon the maSigPro
Bioconductor package to identify genes with significant expression
changes across different branching paths in a pseudotime-ordered
scRNA-Seq dataset. This vignette illustrates the basic workflow of
scMaSigPro
, providing a step-by-step guide for users
using a small simulated dataset.
scMaSigPro
is a polynomial regression-based approach
inspired by the maSigPro
Bioconductor package tailored for scRNA-Seq data. It first discretizes
single cell expression along inferred pseudotime while preserving order.
Afterwards it applies the maSigPro
model to pinpoint genes exhibiting significant expression profile
differences among branching paths and pseudotime.
Currently, scMaSigPro
is available on GitHub and can be
installed as follows:
# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
BiocManager::install(c('SingleCellExperiment', 'maSigPro', 'MatrixGenerics', 'S4Vectors'))
To install scMaSigPro
from GitHub, use the following R
code:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
ref = "main",
build_vignettes = FALSE,
build_manual = TRUE,
upgrade = "never",
force = TRUE,
quiet = TRUE)
Here, we demonstrate the basic workflow of scMaSigPro
using a simulated dataset included in the package, simulated via splatter.
scMaSigPro
package and the dataset
# Set Seed for Reproducibility
set.seed(123)
# Load Package
library(scMaSigPro)
# Load example data
data("splat.sim", package = "scMaSigPro")
# Load
suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(SingleCellExperiment))
# Extract PCA
df <- as.data.frame(reducedDim(splat.sim)[, c(1:2)])
df$Step <- splat.sim@colData$Step
df$Group <- splat.sim@colData$Group
# Plot the data
ggplot(
df,
aes(x = PC1, y = PC2, color = Step, shape = Group)
) +
geom_point() +
theme_minimal()
If you are interested in learning how this data is simulated and what
information it contains, you can try ?splat.sim
in the R
console to read about the details.
The UMAP visualization of our dataset reveals a bifurcating trajectory, illustrating how one cell type diverges into two distinct cell types. This bifurcation is driven by the selective expression of genes along the pseudotime. In the UMAP plot, the two branching paths of differentiation are represented by points with distinct shapes, each corresponding to a different cell type.
Furthermore, the points in the UMAP are color-coded based on a simulated pseudotime-like variable, termed Step, from the Splatter package. This color gradient effectively captures the progression of cells along the pseudotime axis, providing insights into the temporal dynamics of cellular differentiation.
scMaSigPro
Object
scMaSigPro
offers various options to directly convert
widely used single-cell related S4 objects directly to the scmpClass
class (now referred to as scmpObject
), such as SingleCellExperiment
or Monocle3
’s
cell_data_set
class. Here, using the simulated object (of the SingleCellExperiment
class) that we have just loaded into the R environment, we will
demonstrate the direct conversion to scmpObject
object.
In our case, we already have pseudotime and path information stored
in the colData
slot of the splat.sim
. The
Step column holds the information for the simulated steps of
each path, which we can treat as a pseudotime variable, and the
Group column has information regarding branching path, i.e., to
which branching path each of the cell belongs. We can view this
information by executing
View(as.data.frame(splat.sim@colData)).
## Cell Batch Group ExpLibSize Step sizeFactor
## Cell1 Cell1 Batch1 Path2 56391.37 42 0.8847501
## Cell2 Cell2 Batch1 Path1 54923.49 41 0.7779386
## Cell3 Cell3 Batch1 Path1 52122.28 23 0.7687344
## Cell4 Cell4 Batch1 Path1 59039.15 42 0.9017475
## Cell5 Cell5 Batch1 Path2 64317.37 79 1.0111086
We will use as_scmp()
from scMaSigPro
and
direct it to use the columns present in the dataset as the columns for
Pseudotime
and Path
. To do this, we have to
enable the parameter labels_exist
and then pass the
existing column names as a named list.
# Helper Function to convert annotated SCE object to scmpObject
scmp_ob <- as_scmp(
object = splat.sim, from = "sce",
align_pseudotime = FALSE,
verbose = TRUE,
additional_params = list(
labels_exist = TRUE,
exist_ptime_col = "Step",
exist_path_col = "Group"
)
)
## Supplied object: SingleCellExperiment object
## Overwritting columns in cell level metadata, 'Group'
## is replaced by 'Path' and 'Step' is replaced by 'Pseudotime'.
Once the object is created, we can type the name of the object in the R console to view various attributes, such as the dimensions (number of cells and number of genes), the available branching paths, and the range of pseudotime.
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path2, Path1
sc.squeeze()
scMaSigPro
provides a comprehensive function for
pseudo-bulking along the pseudotime continuum, named
sc.squeeze()
. This function discretizes the original
pseudotime values into bins using histogram binning methods. The
sc.squeeze()
function includes various parameters that can
be tailored depending on the characteristics of the data. Here, we will
show the basic usage with the default parameters:
scmp_ob <- sc.squeeze(scmp_ob)
The console output of scMaSigPro
is dynamic and displays
more attributes as the analysis progresses. To view the results of the
binning procedure, we can simply type the object’s name into the
console:
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
Here, we observe that the original pseudotime, which ranged from 1 to 100 with two cells at each step and different paths, is now rescaled to a range of 1 to 8, with each path having an equal number of bins (8+8 in total). We also see the average bin sizes per path, indicating the average number of cells that make up a single bin for each path.
We can also visually inspect the binning process using a tile plot:
plotBinTile(scmp_ob)
The tile plot provides a clear view of the number of cells in each bin and how the sizes of bins compare with those of other paths in the same scmp_binned_pseudotime.
scMaSigPro
, being the successor of maSigPro
,
utilizes the same polynomial regression model. Let’s consider a case
with two branching paths along the same pseudotime scale, modeled with a
quadratic polynomial. The model can be represented as follows:
## Loading required package: knitr
\[\begin{align*} Y_{i} &\sim \text{NegativeBinomial}(\mu_{i}, \theta = 10) \\ \log(\mu_{i}) &= \beta_{0} + \beta_{1} \cdot (\text{Path}_{B}\text{vsPath}_{A})_{i} + \beta_{2} \cdot T_{\text{Pseudotime}}^{\text{Binned}_{i}} \\ &\quad + \beta_{3} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}} \cdot \text{Path}_{B_{i}}) + \beta_{4} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \\ &\quad + \beta_{5} \cdot ((T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \cdot \text{Path}_{B_{i}}) + \text{Offset}_{i} + \omega_{i} \cdot \epsilon_{i}\\ \end{align*}\]
To construct this model, we use sc.set.poly()
to include
quadratic terms:
# Polynomial Degree 2
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 2)
Once the model is stored, we can visualize the corresponding
polynomial using the showPoly()
function:
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2"
Similarly, we can fit a cubic polynomial by setting the polynomial degree to 3:
# Polynomial Degree 3
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 3)
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2 + beta6*scmp_binned_pseudotime3 + beta7*scmp_binned_pseudotime3xPath2"
However, for simplicity, we will explore a polynomial of degree 1.
Note that increasing the polynomial degree enhances
scMaSigPro
’s performance in capturing exponential and
nonlinear gene expression patterns:
# Polynomial Degree 1
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 1)
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2"
In the above model we have:
beta0: Accounts for differences in expression from the start to the end.
_beta1*Path2vsPath1_: Captures differences between the branching paths, assuming path-1 does not change along pseudotime.
_beta2*scmp_binned_pseudotime_: Reflects differences across pseudotime.
_beta3*scmp_binned_pseudotimexPath2_: Represents interaction between pseudotime and path differences.
To identify genes that demonstrate significant changes along the
pseudotime contniuum, we use the sc.p.vector()
. This
function, adapted from the original maSigPro
,
includes additional parameters such as the use of offsets
and weights
.
scMaSigPro
expects raw counts as input because it models
data using a negative binomial distribution, a count distribution, so
the counts should not be normalized to continuous values. To account for
library size differences, it uses offsets
, which are the
log of size factors, similar to DESeq2.
For those who prefer to supply continuous data normalized using methods like log(1+x), the distribution can be changed during model fitting.
We can execute sc.p.vector()
as follows:
# Detect non-flat profiles
scmp_ob <- sc.p.vector(scmp_ob,
offset = TRUE, p_value = 0.05, verbose = FALSE,
log_offset = TRUE
)
scmp_ob
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
The console output reveals that scMaSigPro
detected 51
genes with non-flat profiles.
Having identified genes with significant profiles, we can refine
their polynomial models using sc.t.fit()
. This function
evaluates each term of the polynomial model. In our case, it will assess
which among “beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime +
beta3*scmp_binned_pseudotimexPath2” significantly contributes to the
differences. To execute sc.t.fit()
, we proceed as
follows:
# Model refinement
scmp_ob <- sc.t.fit(scmp_ob, verbose = FALSE)
scmp_ob
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8
With our refined models in hand, we now focus on identifying genes
showing significant differences with pseudotime, among paths, or both.
For this purpose, we use the sc.filter()
function. Our aim
is to select models with a relatively high \(R^2\), indicating simple linear
relationships. The vars
parameter in
sc.filter()
allows us to extract different sets of
significant genes. Setting vars = 'all'
retrieves all
non-flat profiles identified in sc.p.vector()
with \(R^2>=\) the specified threshold. The
option vars = 'groups
fetches genes per path, resulting in
two gene lists that demonstrate associative significance among paths,
helping us identify genes associated with one path or the other along
the pseudotime continuum. The vars = 'each'
option finds
significance for each term in the polynomial. In our case, we are
interested in genes differentially expressed between paths and over
pseudotime continuum, so we will choose vars = 'groups
.
scmp_ob <- sc.filter(
scmpObj = scmp_ob,
rsq = 0.7,
vars = "groups",
intercept = "dummy",
includeInflu = TRUE
)
By setting the vars parameter to “groups”, the function will add genes with \(R^2\) >= 0.7 to the object. To explore the number of genes per group, we will make an upset plot:
plotIntersect(scmp_ob)
Here, we observe that 23 genes belong to both Path2vsPath1 and Path1, indicating that these genes not only change along pseudotime but also exhibit significantly different expression between the two paths. Additionally, there are 10 genes uniquely associated with Path2vsPath1. This implies that Path2 has 10 genes that are significantly differentially expressed over time, using Path1 genes as a reference. Let’s explore a few of these genes:
FigureA <- plotTrend(scmp_ob, "Gene9", logs = TRUE, logType = "log", lines = TRUE)
FigureB <- plotTrend(scmp_ob, "Gene95", logs = TRUE, logType = "log", lines = TRUE)
FigureC <- plotTrend(scmp_ob, "Gene10", logs = TRUE, logType = "log", lines = TRUE)
FigureD <- plotTrend(scmp_ob, "Gene92", logs = TRUE, logType = "log", lines = TRUE)
ggpubr::ggarrange(FigureA, FigureB, FigureC, FigureD,
ncol = 2, nrow = 2,
labels = c("A", "B", "C", "D")
)
These plots illustrate the gene expression trends for selected genes, providing insights into their behavior in different paths. Fortunately, since we are using simulated data, we can validate these trends by extracting gene-level metadata from our simulated object:
groundTruth <- as.data.frame(splat.sim@rowRanges@elementMetadata)
print(groundTruth[groundTruth$Gene %in% c("Gene9", "Gene95", "Gene10", "Gene92"), c(1, 2, 6, 8)])
## Gene BaseGeneMean DEFacPath1 DEFacPath2
## 9 Gene9 0.5478075 1.5743095 1.0000000
## 10 Gene10 2.7941952 0.5499756 2.8090798
## 92 Gene92 2.4912557 0.4996190 0.4264288
## 95 Gene95 0.4676235 1.0000000 2.4591537
The ground truth data reveals, for example, that Gene9 has a base gene mean of 0.5, a fold change of 1.5 in Path1, and remains at the same expression level for Path2. This trend is accurately captured in our analysis, as shown in Figure-A. Similarly, Gene92 does not show a significant difference between the paths but demonstrates a downtrend along pseudotime in both paths, a finding we successfully recapitulate in Figure-D.
To discern general trends among the genes, we can cluster and
visualize them together using the sc.cluster.trend()
and
plotTrendCluster()
. By default, this function employs
hierarchical clustering (hclust) and we will divide the genes into 4
clusters.
# Cluster Trend
scmp_ob <- sc.cluster.trend(
scmp_ob,
geneSet = "union",
k = 4
)
# Plot
plotTrendCluster(
scmpObj = scmp_ob,
plot = "coeff",
logs = TRUE,
verbose = FALSE,
lines = TRUE
)
This visualization helps in understanding the collective behavior of genes within each cluster, highlighting patterns and trends that might be relevant for further biological interpretation.
This concludes the basic usage quick start guide of
scMaSigPro
. Please refer to other vignettes for more
in-depth analysis.
sessionInfo(package = "scMaSigPro")
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] scMaSigPro_0.0.4
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 gridExtra_2.3
## [3] rlang_1.1.3 magrittr_2.0.3
## [5] matrixStats_1.2.0 e1071_1.7-14
## [7] compiler_4.3.0 mgcv_1.9-1
## [9] venn_1.11 systemfonts_1.0.5
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.2
## [15] fastmap_1.1.1 backports_1.4.1
## [17] XVector_0.42.0 ellipsis_0.3.2
## [19] labeling_0.4.3 utf8_1.2.4
## [21] promises_1.2.1 rmarkdown_2.25
## [23] grDevices_4.3.0 UpSetR_1.4.0
## [25] ragg_1.2.7 purrr_1.0.2
## [27] xfun_0.42 zlibbioc_1.48.0
## [29] cachem_1.0.8 graphics_4.3.0
## [31] GenomeInfoDb_1.38.6 jsonlite_1.8.8
## [33] highr_0.10 later_1.3.2
## [35] DelayedArray_0.28.0 broom_1.0.5
## [37] parallel_4.3.0 R6_2.5.1
## [39] bslib_0.6.1 stringi_1.8.3
## [41] car_3.1-2 parallelly_1.37.0
## [43] GenomicRanges_1.54.1 jquerylib_0.1.4
## [45] Rcpp_1.0.12 bookdown_0.37
## [47] assertthat_0.2.1 SummarizedExperiment_1.32.0
## [49] knitr_1.45 IRanges_2.36.0
## [51] splines_4.3.0 httpuv_1.6.14
## [53] Matrix_1.6-5 igraph_1.6.0
## [55] tidyselect_1.2.0 rstudioapi_0.15.0
## [57] abind_1.4-5 yaml_2.3.8
## [59] admisc_0.34 plyr_1.8.9
## [61] lattice_0.22-5 tibble_3.2.1
## [63] Biobase_2.62.0 shiny_1.8.0
## [65] withr_3.0.0 evaluate_0.23
## [67] base_4.3.0 desc_1.4.3
## [69] proxy_0.4-27 mclust_6.0.1
## [71] ggpubr_0.6.0 pillar_1.9.0
## [73] BiocManager_1.30.22 carData_3.0-5
## [75] MatrixGenerics_1.14.0 stats4_4.3.0
## [77] plotly_4.10.4 generics_0.1.3
## [79] RCurl_1.98-1.14 S4Vectors_0.40.2
## [81] ggplot2_3.5.1 munsell_0.5.0
## [83] scales_1.3.0 BiocStyle_2.30.0
## [85] stats_4.3.0 xtable_1.8-4
## [87] class_7.3-22 glue_1.7.0
## [89] lazyeval_0.2.2 tools_4.3.0
## [91] datasets_4.3.0 data.table_1.15.0
## [93] ggsignif_0.6.4 fs_1.6.3
## [95] cowplot_1.1.2 grid_4.3.0
## [97] utils_4.3.0 tidyr_1.3.1
## [99] methods_4.3.0 colorspace_2.1-0
## [101] SingleCellExperiment_1.24.0 nlme_3.1-164
## [103] GenomeInfoDbData_1.2.11 cli_3.6.2
## [105] textshaping_0.3.7 fansi_1.0.6
## [107] S4Arrays_1.2.0 viridisLite_0.4.2
## [109] dplyr_1.1.4 gtable_0.3.4
## [111] rstatix_0.7.2 sass_0.4.8
## [113] digest_0.6.34 BiocGenerics_0.48.1
## [115] SparseArray_1.2.4 htmlwidgets_1.6.4
## [117] farver_2.1.1 memoise_2.0.1
## [119] maSigPro_1.74.0 entropy_1.3.1
## [121] htmltools_0.5.7 pkgdown_2.0.7
## [123] lifecycle_1.0.4 httr_1.4.7
## [125] mime_0.12 MASS_7.3-60