scMaSigPro
is an R package designed for the serial analysis of single-cell
RNA-seq (scRNA-Seq) data along inferred pseudotime. It builds upon the maSigPro
Bioconductor package to identify genes with significant expression changes across different
branching paths in a pseudotime-ordered scRNA-Seq dataset. This vignette illustrates
the S4 Object of the scMaSigPro-Class
and generic methods.
scMaSigPro 0.0.4
scMaSigPro
is a polynomial regression-based approach inspired by the maSigPro
Bioconductor package tailored for scRNA-Seq data. It first discretizes single cell
expression along inferred pseudotime while preserving order. Afterwards it applies
the maSigPro model
to pinpoint genes exhibiting significant expression profile differences among
branching paths and pseudotime.
Currently, scMaSigPro
is available on GitHub and can be installed as follows:
# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(version = "3.14")
BiocManager::install(c("SingleCellExperiment", "maSigPro", "MatrixGenerics", "S4Vectors"))
To install scMaSigPro
from GitHub, use the following R code:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
ref = "main",
build_vignettes = FALSE,
build_manual = TRUE,
upgrade = "never",
force = TRUE,
quiet = TRUE
)
Here, we will use scmp.ob
that already has all the data inside it. The scmp.ob
is included in the package and is simulated via splatter.
If you are interested in learning how this data is processed, please see the
vignette Basic-Workflow
.
scMaSigPro
package and the dataset# Set Seed for Reproducibility
set.seed(123)
# Load Package
library(scMaSigPro)
# Load example data
data("scmp.ob", package = "scMaSigPro")
As the object scmp.ob
is already processed, we can type the name of the object
in the R console to view various attributes, such as the dimensions, number of
non-flat profiles etc.
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8
Sparse
& Dense
SlotsThe scmp.ob
object has two slots, Sparse
and Dense
both which are of
SingleCellExperiment Class
.
The Sparse
slot contains the raw count data (i.e. non-pseudo-bulked) and the
Dense
slot contains the counts that are pseudo-bulked.
To access the counts we can simply use the eDense()
and eSparse()
methods. These
two are generic methods and have simillary functionality to the assays()
from the
SummarizedExperiment Class
.
These methods can be used to get or set the expression counts in the scMaSigPro
object. We will only show first 3 rows and columns of the Sparse
and Dense
slot.
# Get the Dense counts
eDense(scmp.ob)[c(1:3), c(1:3)]
## Path1_bin_1 Path1_bin_2 Path1_bin_3
## Gene1 16 24 11
## Gene2 6335 7337 3375
## Gene3 201 245 105
# Get the Sparse counts
eSparse(scmp.ob)[c(1:3), c(1:3)]
## Cell1 Cell2 Cell3
## Gene1 0 2 1
## Gene2 436 371 432
## Gene3 35 13 17
Just like the colData()
method, the scMaSigPro
object also has cDense()
and
cSparse()
methods. These are also generics and can get the cell level information
from the Sparse
and Dense
slots. Specifically in the case of cDense()
, the
bin level information is accessed.
# Get bin level metadata
cDense(scmp.ob)[c(1:3), ]
## scmp_binned_pseudotime
## Path1_bin_1 1
## Path1_bin_2 2
## Path1_bin_3 3
## scmp_bin_members
## Path1_bin_1 Cell6|Cell37|Cell47|Cell55|Cell95|Cell102|Cell107|Cell135|Cell136|Cell156|Cell159|Cell190|Cell197|Cell200
## Path1_bin_2 Cell3|Cell7|Cell49|Cell61|Cell77|Cell97|Cell101|Cell119|Cell122|Cell144|Cell157|Cell165|Cell173|Cell175|Cell182
## Path1_bin_3 Cell38|Cell39|Cell108|Cell124|Cell133|Cell146|Cell188
## scmp_bin Path scmp_bin_size scmp_l_bound scmp_u_bound
## Path1_bin_1 Path1_bin_1 Path1 14 1.0 13.2
## Path1_bin_2 Path1_bin_2 Path1 15 13.2 25.5
## Path1_bin_3 Path1_bin_3 Path1 7 25.5 37.8
# Get cell level metadata
cSparse(scmp.ob)[c(1:3), c(1:3)]
## Cell Batch Path
## Cell2 Cell2 Batch1 Path1
## Cell3 Cell3 Batch1 Path1
## Cell4 Cell4 Batch1 Path1
Design
SlotThe two most important slots of Design
are,
predictor_matrix: A matrix of predictors for the polynomial regression model.
It can be accessed with generic method predictors()
.
assignment_matrix: A matrix containing the hard-assignment of each bin to
binned-Pseudotime. It can be accessed with generic method pathAssign()
.
Profile
SlotThe Profile
slot stores the results from the sc.p.vector()
, which performs the
global fitting of the polynomial-GLM and test the significance against the intercept
only model. Below is the brief description of the data stored in Profile
slot.
non_flat: A list of genes with non-flat profiles in pseudotime.
pvalue and adj_p_values: Significance levels.
fdr: False Discovery Rate.
Estimate
SlotThe Estimate
slot stores the results from the sc.t.fit()
, which performs the
optimization of the polynomial-GLM and selects most significant terms.
Below is the brief description of the data stored in Estimate
slot.
significance_matrix: This matrix stores the p-values
for the optimized
model, terms of the model and the coefficient of determination i.e. \(R^2\) for
each of the genes.
coefficient_matrix: This matrix stores the \(\beta\) estimates for all the
genes in the significance_matrix
. The coefficient reflect changes associated
with each of the terms in optimized polynomial-GLM.
path_coefficient_matrix: This matrix stores the \(\beta\) estimates for all the
genes in the significance_matrix
. The coefficient reflect changes associated
with each of the branching paths.
influential: Stores information of the genes which have influential observations as per cook’s distance.
Significant
SlotSignificant
slot stores the final results of the workflow. Once specified
significant genes are selected based on the sc.filter()
and sc.cluster.trend
.
The Significant
stores the gene and the cluster information.
Parameters
SlotThe last slot is the Parameters
. It stores all the parameters used during the
analysis. The showParams()
returns all the parameters in a dataframe.
showParams(scmp.ob)[c(10:15), ] # only 5
## parameters value
## 10 bin_mem_col scmp_bin_members
## 11 anno_col cell_type
## 12 g 100
## 13 p_value 0.05
## 14 min_na 6
## 15 mt_correction BH
queryCoeff()
In contrast to the sc.filter()
function, which identifies significant genes
based on \(R^2\) values and p-values for each group or polynomial term, the
queryCoeff()
function enables the extraction of genes based on their estimated
coefficients (\(\beta\)) in a poly-GLM.
A positive value of \(\beta\) that there is a positive association between the independent variable and the response variable, meaning as the independent variable increases, the response variable tends to increase. Conversely, a negative \(\beta\) value indicates a negative association, implying that an increase in the independent variable is associated with a decrease in the response variable.
In the scMaSigPro
framework, each gene deemed significant is evaluated for its
association with the response variable. When a model includes multiple explanatory
terms, such as \(Pseudotime\) (linear) and \(Pseudotime^2\) (quadratic) components, the
overall effect is assessed by considering the combined influence of these
coefficients. This approach allows for a nuanced understanding of how each term
contributes to the response. Additionally, users can specify an interest in
identifying relationships where all significant terms contribute in the same
direction, i.e., either all increasing or all decreasing the response.
This means that for a gene to be classified under this category, each term (like
\(Pseudotime\) and \(Pseudotime^2\) must exhibit a consistent effect, either both
positively or both negatively influencing the gene expression.
query
parameterThe queryCoeff()
function’s query
parameter can be set to one of three
values: ‘path’, ‘pseudotime’, or ‘path_pseudotime’, each specifying a different
subset of terms in the poly-GLM for evaluation.
‘query’ = path
: This setting focuses on evaluating the coefficients of
terms that describe the influence of each branching path in the model. It aims
to quantify how different paths within the model contribute to the response
variable, independently of other factors.
‘query’ = pseudotime
: When set to ‘pseudotime’, the function assesses the
coefficients of terms that capture the effect of pseudotime on the response
variable. This approach is centered on understanding how changes over pseudotime
affect the response.
‘query’ = path_pseudotime
: This option is for evaluating the coefficients
of terms that represent the interaction effects between pseudotime and branching
paths. It examines how the combined influence of both pseudotime and branching
paths impacts the response variable, focusing on the synergistic or interaction
effects rather than their independent contributions.
Now we will make use of the path_pseudotime
and pseudotime
and plot
top-4 genes for this tutorial.
# Increasing effect along Pseudotime
pTime_up <- queryCoeff(scmp.ob,
query = "pseudotime",
change = "increasing",
verbose = FALSE
)
## Polynomial-GLM Formula: beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2
# Order
pTime_up <- pTime_up[order(pTime_up$betascmp_binned_pseudotime,
decreasing = TRUE
), , drop = FALSE]
# View
head(pTime_up, 4)
## betascmp_binned_pseudotime
## Gene46 0.11753328
## Gene86 0.11606199
## Gene18 0.08252258
## Gene89 0.06665634
# Plot
pTime_up.plots <- list()
for (i in rownames(head(pTime_up, 4))) {
pTime_up.plots[[i]] <- plotTrend(scmp.ob, i)
}
# Plot combined
ggpubr::ggarrange(pTime_up.plots[[1]], pTime_up.plots[[2]], pTime_up.plots[[3]], pTime_up.plots[[4]],
ncol = 2, nrow = 2, labels = c("A", "B", "C", "D")
)
# Increasing effect along Pseudotime * Path
pTime_path_up <- queryCoeff(scmp.ob,
query = "pseudotime_path",
change = "increasing",
verbose = FALSE
)
## Polynomial-GLM Formula: beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2
# Order
pTime_path_up <- pTime_path_up[order(pTime_path_up$betascmp_binned_pseudotimexPath2,
decreasing = TRUE
), , drop = FALSE]
# View
head(pTime_path_up, 4)
## betascmp_binned_pseudotimexPath2
## Gene10 0.1989694
## Gene3 0.1617452
## Gene95 0.1254349
## Gene72 0.1092380
# Plot
pTime_path_up.plots <- list()
for (i in rownames(head(pTime_path_up, 4))) {
pTime_path_up.plots[[i]] <- plotTrend(scmp.ob, i)
}
# Plot combined
ggpubr::ggarrange(pTime_path_up.plots[[1]], pTime_path_up.plots[[2]], pTime_path_up.plots[[3]], pTime_path_up.plots[[4]], ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"))
This concludes the vignette of scMaSigPro-Class
. Please refer
to other vignettes for more in-depth analysis.
sessionInfo(package = "scMaSigPro")
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] scMaSigPro_0.0.4
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 gridExtra_2.3
## [3] rlang_1.1.3 magrittr_2.0.3
## [5] RcppAnnoy_0.0.22 matrixStats_1.2.0
## [7] e1071_1.7-14 compiler_4.3.0
## [9] mgcv_1.9-1 venn_1.11
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.2
## [15] fastmap_1.1.1 backports_1.4.1
## [17] XVector_0.42.0 ellipsis_0.3.2
## [19] labeling_0.4.3 utf8_1.2.4
## [21] promises_1.2.1 rmarkdown_2.25
## [23] grDevices_4.3.0 UpSetR_1.4.0
## [25] purrr_1.0.2 xfun_0.42
## [27] zlibbioc_1.48.0 cachem_1.0.8
## [29] graphics_4.3.0 GenomeInfoDb_1.38.6
## [31] jsonlite_1.8.8 highr_0.10
## [33] later_1.3.2 DelayedArray_0.28.0
## [35] terra_1.7-71 broom_1.0.5
## [37] parallel_4.3.0 R6_2.5.1
## [39] bslib_0.6.1 stringi_1.8.3
## [41] car_3.1-2 parallelly_1.37.0
## [43] GenomicRanges_1.54.1 jquerylib_0.1.4
## [45] Rcpp_1.0.12 bookdown_0.37
## [47] assertthat_0.2.1 SummarizedExperiment_1.32.0
## [49] knitr_1.45 IRanges_2.36.0
## [51] splines_4.3.0 httpuv_1.6.14
## [53] Matrix_1.6-5 igraph_1.6.0
## [55] tidyselect_1.2.0 rstudioapi_0.15.0
## [57] abind_1.4-5 yaml_2.3.8
## [59] codetools_0.2-19 admisc_0.34
## [61] lattice_0.22-5 tibble_3.2.1
## [63] plyr_1.8.9 Biobase_2.62.0
## [65] shiny_1.8.0 withr_3.0.0
## [67] evaluate_0.23 base_4.3.0
## [69] proxy_0.4-27 mclust_6.0.1
## [71] pillar_1.9.0 BiocManager_1.30.22
## [73] ggpubr_0.6.0 carData_3.0-5
## [75] MatrixGenerics_1.14.0 stats4_4.3.0
## [77] plotly_4.10.4 generics_0.1.3
## [79] RCurl_1.98-1.14 S4Vectors_0.40.2
## [81] ggplot2_3.5.1 munsell_0.5.0
## [83] scales_1.3.0 BiocStyle_2.30.0
## [85] stats_4.3.0 xtable_1.8-4
## [87] class_7.3-22 glue_1.7.0
## [89] lazyeval_0.2.2 tools_4.3.0
## [91] datasets_4.3.0 data.table_1.15.0
## [93] ggsignif_0.6.4 cowplot_1.1.2
## [95] grid_4.3.0 utils_4.3.0
## [97] tidyr_1.3.1 methods_4.3.0
## [99] colorspace_2.1-0 SingleCellExperiment_1.24.0
## [101] nlme_3.1-164 GenomeInfoDbData_1.2.11
## [103] cli_3.6.2 fansi_1.0.6
## [105] S4Arrays_1.2.0 viridisLite_0.4.2
## [107] dplyr_1.1.4 gtable_0.3.4
## [109] rstatix_0.7.2 sass_0.4.8
## [111] digest_0.6.34 BiocGenerics_0.48.1
## [113] SparseArray_1.2.4 htmlwidgets_1.6.4
## [115] farver_2.1.1 maSigPro_1.74.0
## [117] entropy_1.3.1 htmltools_0.5.7
## [119] lifecycle_1.0.4 httr_1.4.7
## [121] mime_0.12 MASS_7.3-60