Contents

0.1 Introduction

scMaSigPro is a polynomial regression-based approach inspired by the maSigPro Bioconductor package tailored for scRNA-Seq data. It first discretizes single cell expression along inferred pseudotime while preserving order. Afterwards it applies the maSigPro model to pinpoint genes exhibiting significant expression profile differences among branching paths and pseudotime.

0.2 Installation

Currently, scMaSigPro is available on GitHub and can be installed as follows:

0.2.1 Bioconductor and Dependencies

# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(version = "3.14")

BiocManager::install(c("SingleCellExperiment", "maSigPro", "MatrixGenerics", "S4Vectors"))

0.2.2 scMaSigPro latest version

To install scMaSigPro from GitHub, use the following R code:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
  ref = "main",
  build_vignettes = FALSE,
  build_manual = TRUE,
  upgrade = "never",
  force = TRUE,
  quiet = TRUE
)

0.3 Dataset

Here, we will use scmp.ob that already has all the data inside it. The scmp.ob is included in the package and is simulated via splatter. If you are interested in learning how this data is processed, please see the vignette Basic-Workflow.

0.3.1 Load the scMaSigPro package and the dataset

# Set Seed for Reproducibility
set.seed(123)

# Load Package
library(scMaSigPro)

# Load example data
data("scmp.ob", package = "scMaSigPro")

As the object scmp.ob is already processed, we can type the name of the object in the R console to view various attributes, such as the dimensions, number of non-flat profiles etc.

## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean), 
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8

0.4 Slots of the scMaSigPro object

0.4.1 Sparse & Dense Slots

The scmp.ob object has two slots, Sparse and Dense both which are of SingleCellExperiment Class. The Sparse slot contains the raw count data (i.e. non-pseudo-bulked) and the Dense slot contains the counts that are pseudo-bulked.

To access the counts we can simply use the eDense() and eSparse() methods. These two are generic methods and have simillary functionality to the assays() from the SummarizedExperiment Class. These methods can be used to get or set the expression counts in the scMaSigPro object. We will only show first 3 rows and columns of the Sparse and Dense slot.

# Get the Dense counts
eDense(scmp.ob)[c(1:3), c(1:3)]
##       Path1_bin_1 Path1_bin_2 Path1_bin_3
## Gene1          16          24          11
## Gene2        6335        7337        3375
## Gene3         201         245         105
# Get the Sparse counts
eSparse(scmp.ob)[c(1:3), c(1:3)]
##       Cell1 Cell2 Cell3
## Gene1     0     2     1
## Gene2   436   371   432
## Gene3    35    13    17

Just like the colData() method, the scMaSigPro object also has cDense() and cSparse() methods. These are also generics and can get the cell level information from the Sparse and Dense slots. Specifically in the case of cDense(), the bin level information is accessed.

# Get bin level metadata
cDense(scmp.ob)[c(1:3), ]
##             scmp_binned_pseudotime
## Path1_bin_1                      1
## Path1_bin_2                      2
## Path1_bin_3                      3
##                                                                                                            scmp_bin_members
## Path1_bin_1       Cell6|Cell37|Cell47|Cell55|Cell95|Cell102|Cell107|Cell135|Cell136|Cell156|Cell159|Cell190|Cell197|Cell200
## Path1_bin_2 Cell3|Cell7|Cell49|Cell61|Cell77|Cell97|Cell101|Cell119|Cell122|Cell144|Cell157|Cell165|Cell173|Cell175|Cell182
## Path1_bin_3                                                           Cell38|Cell39|Cell108|Cell124|Cell133|Cell146|Cell188
##                scmp_bin  Path scmp_bin_size scmp_l_bound scmp_u_bound
## Path1_bin_1 Path1_bin_1 Path1            14          1.0         13.2
## Path1_bin_2 Path1_bin_2 Path1            15         13.2         25.5
## Path1_bin_3 Path1_bin_3 Path1             7         25.5         37.8
# Get cell level metadata
cSparse(scmp.ob)[c(1:3), c(1:3)]
##        Cell  Batch  Path
## Cell2 Cell2 Batch1 Path1
## Cell3 Cell3 Batch1 Path1
## Cell4 Cell4 Batch1 Path1

0.4.2 Design Slot

The two most important slots of Design are,

  1. predictor_matrix: A matrix of predictors for the polynomial regression model. It can be accessed with generic method predictors().

  2. assignment_matrix: A matrix containing the hard-assignment of each bin to binned-Pseudotime. It can be accessed with generic method pathAssign().

0.4.3 Profile Slot

The Profile slot stores the results from the sc.p.vector(), which performs the global fitting of the polynomial-GLM and test the significance against the intercept only model. Below is the brief description of the data stored in Profile slot.

  1. non_flat: A list of genes with non-flat profiles in pseudotime.

  2. pvalue and adj_p_values: Significance levels.

  3. fdr: False Discovery Rate.

0.4.4 Estimate Slot

The Estimate slot stores the results from the sc.t.fit(), which performs the optimization of the polynomial-GLM and selects most significant terms. Below is the brief description of the data stored in Estimate slot.

  1. significance_matrix: This matrix stores the p-values for the optimized model, terms of the model and the coefficient of determination i.e. \(R^2\) for each of the genes.

  2. coefficient_matrix: This matrix stores the \(\beta\) estimates for all the genes in the significance_matrix. The coefficient reflect changes associated with each of the terms in optimized polynomial-GLM.

  3. path_coefficient_matrix: This matrix stores the \(\beta\) estimates for all the genes in the significance_matrix. The coefficient reflect changes associated with each of the branching paths.

  4. influential: Stores information of the genes which have influential observations as per cook’s distance.

0.4.5 Significant Slot

Significant slot stores the final results of the workflow. Once specified significant genes are selected based on the sc.filter() and sc.cluster.trend. The Significant stores the gene and the cluster information.

0.4.6 Parameters Slot

The last slot is the Parameters. It stores all the parameters used during the analysis. The showParams() returns all the parameters in a dataframe.

showParams(scmp.ob)[c(10:15), ] # only 5
##       parameters            value
## 10   bin_mem_col scmp_bin_members
## 11      anno_col        cell_type
## 12             g              100
## 13       p_value             0.05
## 14        min_na                6
## 15 mt_correction               BH

0.6 Session Info

sessionInfo(package = "scMaSigPro")
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## character(0)
## 
## other attached packages:
## [1] scMaSigPro_0.0.4
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-7                gridExtra_2.3              
##   [3] rlang_1.1.3                 magrittr_2.0.3             
##   [5] RcppAnnoy_0.0.22            matrixStats_1.2.0          
##   [7] e1071_1.7-14                compiler_4.3.0             
##   [9] mgcv_1.9-1                  venn_1.11                  
##  [11] vctrs_0.6.5                 stringr_1.5.1              
##  [13] pkgconfig_2.0.3             crayon_1.5.2               
##  [15] fastmap_1.1.1               backports_1.4.1            
##  [17] XVector_0.42.0              ellipsis_0.3.2             
##  [19] labeling_0.4.3              utf8_1.2.4                 
##  [21] promises_1.2.1              rmarkdown_2.25             
##  [23] grDevices_4.3.0             UpSetR_1.4.0               
##  [25] purrr_1.0.2                 xfun_0.42                  
##  [27] zlibbioc_1.48.0             cachem_1.0.8               
##  [29] graphics_4.3.0              GenomeInfoDb_1.38.6        
##  [31] jsonlite_1.8.8              highr_0.10                 
##  [33] later_1.3.2                 DelayedArray_0.28.0        
##  [35] terra_1.7-71                broom_1.0.5                
##  [37] parallel_4.3.0              R6_2.5.1                   
##  [39] bslib_0.6.1                 stringi_1.8.3              
##  [41] car_3.1-2                   parallelly_1.37.0          
##  [43] GenomicRanges_1.54.1        jquerylib_0.1.4            
##  [45] Rcpp_1.0.12                 bookdown_0.37              
##  [47] assertthat_0.2.1            SummarizedExperiment_1.32.0
##  [49] knitr_1.45                  IRanges_2.36.0             
##  [51] splines_4.3.0               httpuv_1.6.14              
##  [53] Matrix_1.6-5                igraph_1.6.0               
##  [55] tidyselect_1.2.0            rstudioapi_0.15.0          
##  [57] abind_1.4-5                 yaml_2.3.8                 
##  [59] codetools_0.2-19            admisc_0.34                
##  [61] lattice_0.22-5              tibble_3.2.1               
##  [63] plyr_1.8.9                  Biobase_2.62.0             
##  [65] shiny_1.8.0                 withr_3.0.0                
##  [67] evaluate_0.23               base_4.3.0                 
##  [69] proxy_0.4-27                mclust_6.0.1               
##  [71] pillar_1.9.0                BiocManager_1.30.22        
##  [73] ggpubr_0.6.0                carData_3.0-5              
##  [75] MatrixGenerics_1.14.0       stats4_4.3.0               
##  [77] plotly_4.10.4               generics_0.1.3             
##  [79] RCurl_1.98-1.14             S4Vectors_0.40.2           
##  [81] ggplot2_3.5.1               munsell_0.5.0              
##  [83] scales_1.3.0                BiocStyle_2.30.0           
##  [85] stats_4.3.0                 xtable_1.8-4               
##  [87] class_7.3-22                glue_1.7.0                 
##  [89] lazyeval_0.2.2              tools_4.3.0                
##  [91] datasets_4.3.0              data.table_1.15.0          
##  [93] ggsignif_0.6.4              cowplot_1.1.2              
##  [95] grid_4.3.0                  utils_4.3.0                
##  [97] tidyr_1.3.1                 methods_4.3.0              
##  [99] colorspace_2.1-0            SingleCellExperiment_1.24.0
## [101] nlme_3.1-164                GenomeInfoDbData_1.2.11    
## [103] cli_3.6.2                   fansi_1.0.6                
## [105] S4Arrays_1.2.0              viridisLite_0.4.2          
## [107] dplyr_1.1.4                 gtable_0.3.4               
## [109] rstatix_0.7.2               sass_0.4.8                 
## [111] digest_0.6.34               BiocGenerics_0.48.1        
## [113] SparseArray_1.2.4           htmlwidgets_1.6.4          
## [115] farver_2.1.1                maSigPro_1.74.0            
## [117] entropy_1.3.1               htmltools_0.5.7            
## [119] lifecycle_1.0.4             httr_1.4.7                 
## [121] mime_0.12                   MASS_7.3-60