Publications

Abstract. RNA-seq data analysis relies on many different tools, each tailored to specific applications and coming with unique assumptions and restrictions. Indeed, tools for differential transcript usage, or diagnosing patients with rare diseases through splicing and expression outliers, either lack in performance, discard information, or do not scale to massive data compendia. Here, we show that replacing the normalisation offsets unlocks bulk RNA-seq workflows for scalable differential usage, aberrant splicing and expression analyses. Our method, saseR, is much faster than state-of-the-art methods, dramatically outperforms these to detect aberrant splicing, and provides a single workflow for various short- and long-read RNA-seq applications.

Abstract. Single-cell RNA sequencing allows the quantification of gene expression at the individual cell level, enabling the study of cellular heterogeneity and gene expression dynamics. Dimensionality reduction is a common preprocessing step critical for the visualization, clustering, and phenotypic characterization of samples. This step, often performed using principal component analysis or closely related methods, is challenging because of the size and complexity of the data. In this work, we present a generalized matrix factorization model assuming a general exponential dispersion family distribution and we show that many of the proposed approaches in the single-cell dimensionality reduction literature can be seen as special cases of this model. Furthermore, we propose a scalable adaptive stochastic gradient descent algorithm that allows us to estimate the model efficiently, enabling the analysis of millions of cells. We benchmark the proposed algorithm through extensive numerical experiments against state-of-the-art methods and showcase its use in real-world biological applications. The proposed method systematically outperforms existing methods of both generalized and non-negative matrix factorization, demonstrating faster execution times and parsimonious memory usage, while maintaining, or even enhancing, matrix reconstruction fidelity and accuracy in biological signal extraction. On real data, we show that our method scales seamlessly to millions of cells, enabling dimensionality reduction in large single-cell datasets. Finally, all the methods discussed here are implemented in an efficient open-source R package, sgdGMF, available on CRAN.

Abstract. The olfactory epithelium is one of the few regions of the nervous system that sustains neurogenesis throughout life. Its experimental accessibility makes it especially tractable for studying molecular mechanisms that drive neural regeneration in response to injury. In this study, we used single-cell sequencing to identify the transcriptional cascades and epigenetic processes involved in determining olfactory epithelial stem cell fate during injury-induced regeneration. By combining gene expression and accessible chromatin profiles of individual lineage-traced olfactory stem cells, we identified transcriptional heterogeneity among activated stem cells at a stage when cell fates are being specified. We further identified a subset of resting cells that appears poised for activation, characterized by accessible chromatin around wound response and lineage-specific genes prior to their later expression in response to injury. Together these results provide evidence for a latent activated stem cell state, in which a subset of quiescent olfactory epithelial stem cells are epigenetically primed to support injury-induced regeneration.

Abstract. Single-cell gene expression data are often characterized by large matrices, where the number of cells may be lower than the number of genes of interest. Factorization models have emerged as powerful tools to condense the available information through a sparse decomposition into lower rank matrices. In this work, we adapt and implement a recent Bayesian class of generalized factor models to count data and, specifically, to model the covariance between genes. The developed methodology also allows one to include exogenous information within the prior, such that recognition of covariance structures between genes is favoured. In this work, we use biological pathways as external information to induce sparsity patterns within the loadings matrix. This approach facilitates the interpretation of loadings columns and the corresponding latent factors, which can be regarded as unobserved cell covariates. We demonstrate the effectiveness of our model on single-cell RNA sequencing data obtained from lung adenocarcinoma cell lines, revealing promising insights into the role of pathways in characterizing gene relationships and extracting valuable information about unobserved cell traits.

Abstract. Spatial transcriptomics technologies provide spatially-resolved measurements of gene expression through assays that can either target selected genes or capture transcriptome-wide expression profiles. The complexity and variability of these technologies and their associated data necessitate multi-step workflows integrating diverse computational methods and software packages. We provide a freely accessible, open-source, continuously updated and tested online book containing reproducible code examples, datasets, and discussion about data analysis workflows for spatial omics data using Bioconductor in R, including interoperability with Python.

Abstract. The increasing size of single-cell RNA sequencing (scRNA-seq) datasets poses major computational challenges. This work benchmarks the scalability, efficiency, and accuracy of five widely used analysis frameworks (Seurat, OSCA, scrapper, Scanpy, and rapids_singlecell), focusing on the impact of algorithmic and infrastructural choices on performance. We performed a systematic comparison of these workflows using representative datasets, including a 1.3 million mouse brain cell dataset for scalability and three smaller datasets (BE1, scMixology, and cord blood CITE-seq) with ground truth labels to assess clustering accuracy. Principal Component Analysis (PCA) was used as a paradigmatic step to evaluate the computational performance of six SVD algorithms (exact, ARPACK, IRLBA, randomized, Jacobi, and incremental PCA) across multiple data representations (dense, sparse, HDF5) and hardware configurations (CPU vs GPU). All methods showed high concordance in PCA results, with negligible loss of accuracy in truncated approaches. GPU-based computation using rapids_singlecell provided a 15× speed-up over the best CPU methods, with moderate memory usage. On CPU, ARPACK and IRLBA were the most efficient for sparse matrices, while randomized SVD performed best for HDF5-backed data. Among full pipelines, rapids_singlecell was the fastest, whereas OSCA and scrapper achieved the highest clustering accuracy (ARI up to 0.97) in datasets with known cell identities. Performance differences were largely driven by the choice of highly variable genes (HVGs) and PCA implementation. The study highlights that scalability in scRNA-seq analysis depends critically on both algorithmic and infrastructural factors. GPU acceleration and optimized BLAS/LAPACK configurations markedly enhance performance, while Bioconductor-based pipelines remain robust in accuracy. The provided benchmarks offer practical guidelines for efficient and reliable analysis of large-scale single-cell datasets.

Abstract. Spatial transcriptomics measures the expression of thousands of genes in a tissue sample while preserving its spatial structure. This class of technologies has enabled the investigation of the spatial variation of gene expression and its impact on specific biological processes. Identifying genes with similar expression profiles is of utmost importance, thus motivating the development of flexible methods leveraging spatial data structure to cluster genes. Here, we propose a modeling framework for clustering observations measured over numerous spatial locations via Gaussian processes. Rather than specifying their covariance kernels as a function of the spatial structure, we use it to inform a generalized Cholesky decomposition of their precision matrices. This approach prevents issues with kernel misspecification and facilitates the estimation of a non-stationary spatial covariance structure. Applied to spatial transcriptomic data, our model identifies gene clusters with distinctive spatial correlation patterns across tissue areas comprising different cell types, like tumoral and stromal areas.

Abstract. Traditional gene expression deconvolution methods assess a limited number of cell types and therefore do not capture the full complexity of the tumor microenvironment (TME). Here, we integrate nine deconvolution tools to assess 79 TME cell types in 10,592 tumors across 33 different cancer types, creating the most comprehensive analysis of the TME. In total, we found 41 patterns of immune infiltration and stroma profiles, identifying heterogeneous yet unique TME portraits for each cancer and several new findings. Our findings indicate that leukocytes play a major role in distinguishing various tumor types, and that a shared immune-rich TME cluster predicts better survival in bladder cancer for luminal and basal squamous subtypes, as well as in melanoma for RAS-hotspot subtypes. Our detailed deconvolution and mutational correlation analyses uncover 35 therapeutic target and candidate response biomarker hypotheses (including CASP8 and RAS pathway genes).

Abstract. The unprecedented speed and sensitivity of mass spectrometry (MS) unlocked large-scale applications of proteomics and even enabled proteome profiling of single cells. However, this fast-evolving field is hindered by a lack of scalable dimensionality reduction tools that can compensate for substantial batch effects and missingness across MS runs. Therefore, we present omicsGMF, a fast, scalable, and interpretable matrix factorization method, tailored for bulk and single-cell proteomics data. Unlike current workflows that sequentially apply imputation, batch correction, and principal component analysis, omicsGMF integrates these steps into a unified framework, dramatically enhancing data processing and dimensionality reduction. Additionally, omicsGMF provides robust imputation of missing values, outperforming bespoke state-of-the-art imputation tools. We further demonstrate how this integrated approach increases statistical power to detect differentially abundant proteins in the downstream data analysis. Hence, omicsGMF is a highly scalable approach to dimensionality reduction in proteomics, that dramatically improves many important steps in proteomics data analysis.

Abstract. Mass spectrometry imaging techniques measure molecular abundance in a tissue sample at a cellular resolution, all while preserving the spatial structure of the tissue. This kind of technology offers a detailed understanding of the role of several molecular factors in biological systems. For this reason, the development of fast and efficient computational methods that can extract relevant signals from massive experiments has become necessary. A key goal in mass spectrometry data analysis is the identification of molecules with similar functions in the analyzed biological system. This result can be achieved by studying the spatial distribution of the molecules' abundance patterns. To do so, one can perform coclustering, that is, dividing the molecules into groups according to their expression patterns over the tissue and segmenting the tissue according to the molecules' abundance levels. We present TRIFASE, a semi-nonnegative matrix trifactorization technique that performs coclustering while accounting for the spatial correlation of the data. We propose an estimation algorithm that solves the proposed matrix trifactorization problem. Moreover, to improve scalability, we also propose two heuristic approximations of the most expensive steps, which help the algorithm converge while significantly streamlining the computational cost. We validated our method on a series of simulation experiments, comparing the different estimating strategies discussed in the article. Last, we analyzed a mouse brain tissue sample processed with MALDI-MSI technology, showing how TRIFASE extracts specific expression patterns of molecule abundance in localized tissue areas and discovers blocks of proteins whose activation is directly linked to specific biological mechanisms.

Abstract. In single-cell transcriptomics, differential gene expression (DE) analyses typically focus on testing differences in the average expression of genes between cell types or conditions of interest. Single-cell transcriptomics, however, also has the promise to prioritise genes for which the expression differs in other aspects of the distribution. Here we develop a workflow for assessing differential detection (DD), which tests for differences in the average fraction of samples or cells in which a gene is detected. After benchmarking eight different DD data analysis strategies, we provide a unified workflow for jointly assessing DE and DD. Using simulations and two case studies, we show that DE and DD analyses provide complementary information, both in terms of the individual genes they report and in the functional interpretation of those genes.

Abstract. In this paper, we tackle structure learning of Directed Acyclic Graphs (DAGs), with the idea of exploiting available prior knowledge of the domain at hand to guide the search for the best structure. In particular, we assume that the topological ordering of the variables is known in addition to the given data. We study a new algorithm for learning the structure of DAGs, proving its theoretical consistency in the limit of infinite observations. Furthermore, we experimentally compare the proposed algorithm to a number of popular competitors, in order to study its behavior in finite samples.

Abstract. Conformal inference is a method that provides prediction sets for machine learning models, operating independently of the underlying distributional assumptions and relying solely on the exchangeability of training and test data. Despite its wide applicability and popularity, its application in graph-structured problems remains underexplored. This paper addresses this gap by developing an approach that leverages the rich information encoded in the graph structure of predicted classes to enhance the interpretability of conformal sets. Using a motivating example from genomics, specifically imaging-based spatial transcriptomics data and single-cell RNA sequencing data, we demonstrate how incorporating graph-structured constraints can improve the interpretation of cell type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the distribution of the response variable changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://github.com/ccb-hms/scConform.

Abstract. Understanding cancer mechanisms, defining subtypes, predicting prognosis and assessing therapy efficacy are crucial aspects of cancer research. Gene-expression signatures derived from bulk gene expression data have played a significant role in these endeavors over the past decade. However, recent advancements in high-resolution transcriptomic technologies, such as single-cell RNA sequencing and spatial transcriptomics, have revealed the complex cellular heterogeneity within tumors, necessitating the development of computational tools to characterize tumor mass heterogeneity accurately. Thus we implemented signifinder, a novel R Bioconductor package designed to streamline the collection and use of cancer transcriptional signatures across bulk, single-cell, and spatial transcriptomics data. Leveraging publicly available signatures curated by signifinder, users can assess a wide range of tumor characteristics, including hallmark processes, therapy responses, and tumor microenvironment peculiarities. Through three case studies, we demonstrate the utility of transcriptional signatures in bulk, single-cell, and spatial transcriptomic data analyses, providing insights into cell-resolution transcriptional signatures in oncology. Signifinder represents a significant advancement in cancer transcriptomic data analysis, offering a comprehensive framework for interpreting high-resolution data and addressing tumor complexity.

Abstract. Sleep deprivation (SD) has negative effects on brain and body function. Sleep problems are prevalent in a variety of disorders, including neurodevelopmental and psychiatric conditions. Thus, understanding the molecular consequences of SD is of fundamental importance in biology. In this study, we present the first simultaneous bulk and single-nuclear RNA sequencing characterization of the effects of SD in the male mouse frontal cortex. We show that SD predominantly affects glutamatergic neurons, specifically in layers 4 and 5, and produces isoform switching of over 1500 genes, particularly those involved in splicing and RNA binding. At both the global and cell-type specific level, SD has a large repressive effect on transcription, downregulating thousands of genes and transcripts. As a resource we provide extensive characterizations of cell-types, genes, transcripts, and pathways affected by SD. We also provide publicly available tutorials aimed at allowing readers to adapt analyses performed in this study to their own datasets.

Abstract. BACKGROUND: Single-cell transcriptome sequencing (scRNA-Seq) has allowed new types of investigations at unprecedented levels of resolution. Among the primary goals of scRNA-Seq is the classification of cells into distinct types. Many approaches build on existing clustering literature to develop tools specific to single-cell. However, almost all of these methods rely on heuristics or user-supplied parameters to control the number of clusters. This affects both the resolution of the clusters within the original dataset as well as their replicability across datasets. While many recommendations exist, in general, there is little assurance that any given set of parameters will represent an optimal choice in the trade-off between cluster resolution and replicability. For instance, another set of parameters may result in more clusters that are also more replicable. RESULTS: Here, we propose Dune, a new method for optimizing the trade-off between the resolution of the clusters and their replicability. Our method takes as input a set of clustering results (or partitions) on a single dataset and iteratively merges clusters within each partition in order to maximize their concordance between partitions. As demonstrated on multiple datasets from different platforms, Dune outperforms existing techniques that rely on hierarchical merging for reducing the number of clusters, in terms of replicability of the resultant merged clusters as well as concordance with ground truth. Dune is available as an R package on Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/Dune.html. CONCLUSIONS: Cluster refinement by Dune helps improve the robustness of any clustering analysis and reduces the reliance on tuning parameters. This method provides an objective approach for borrowing information across multiple clusterings to generate replicable clusters most likely to represent common biological features across multiple datasets.

Abstract. SUMMARY: Recently, an increasing number of methodological approaches have been proposed to tackle the complexity of metagenomics and microbiome data. In this scenario, reproducibility and replicability have become two critical issues, and the development of computational frameworks for the comparative evaluations of such methods is of utmost importance. Here, we present benchdamic, a Bioconductor package to benchmark methods for the identification of differentially abundant taxa. AVAILABILITY AND IMPLEMENTATION: benchdamic is available as an open-source R package through the Bioconductor project at https://bioconductor.org/packages/benchdamic/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. This work defines a new correction for the likelihood ratio test for a two-sample problem within the multivariate normal context. This correction applies to decomposable graphical models, where testing equality of distributions can be decomposed into lower dimensional problems.

Abstract. BACKGROUND: The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes. RESULTS: We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor's Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor's ecosystem of hundreds of packages for single-cell and multimodal data. CONCLUSIONS: We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.

Abstract. Spatial transcriptomics is a groundbreaking technology that allows the measurement of the activity of thousands of genes in a tissue sample and maps where the activity occurs. This technology has enabled the study of the spatial variation of the genes across the tissue. Comprehending gene functions and interactions in different areas of the tissue is of great scientific interest, as it might lead to a deeper understanding of several key biological mechanisms, such as cell-cell communication or tumor-microenvironment interaction. To do so, one can group cells of the same type and genes that exhibit similar expression patterns. However, adequate statistical tools that exploit the previously unavailable spatial information to more coherently group cells and genes are still lacking. In this work we introduce SpaRTaCo, a new statistical model that clusters the spatial expression profiles of the genes according to a partition of the tissue. This is accomplished by performing a co-clustering, that is, inferring the latent block structure of the data and inducing two types of clustering: of the genes, using their expression across the tissue, and of the image areas, using the gene expression in the spots where the RNA is collected. Our proposed methodology is validated with a series of simulation experiments, and its usefulness in responding to specific biological questions is illustrated with an application to a human brain tissue sample processed with the 10X-Visium protocol.

Abstract. The problem of estimating the structure of a graph from observed data is of growing interest in the context of high-throughput genomic data and single-cell RNA sequencing in particular. These, however, are challenging applications, since the data consist of high-dimensional counts with high variance and overabundance of zeros. Here we present a general framework for learning the structure of a graph from single-cell RNA-seq data, based on the zero-inflated negative binomial distribution. We demonstrate with simulations that our approach is able to retrieve the structure of a graph in a variety of settings, and we show the utility of the approach on real data.

Abstract. Statistical evaluation of diagnostic tests, and, more generally, of biomarkers, is a constantly developing field, in which complexity of the assessment increases with the complexity of the design under which data are collected. One particularly prevalent type of data is clustered data, where individual units are naturally nested into clusters. In these cases, bias can arise from omission, in the evaluation process, of cluster-level effects and/or individual covariates. Focusing on the three-class case and for continuous-valued diagnostic tests, we investigate how to exploit the clustered structure of data within a linear-mixed model approach, both when the assumption of normality holds and when it does not. We provide a method for the estimation of covariate-specific receiver operating characteristic surfaces and discuss methods for the choice of optimal thresholds, proposing three possible estimators. A proof of consistency and asymptotic normality of the proposed threshold estimators is given. All considered methods are evaluated by extensive simulation experiments. As an application, we study the use of the Lysosomal Associated Membrane Protein Family Member 5 gene expression as a biomarker to distinguish among three types of glutamatergic neurons.

Abstract. SUMMARY: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. AVAILABILITY AND IMPLEMENTATION: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. MOTIVATION: Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurements at single-cell resolution, providing a comprehensive view of the compositions and dynamics of tissue and organism development. The evolution of scRNA-seq protocols has led to a dramatic increase in cell throughput, exacerbating many of the computational and statistical issues that previously arose for bulk sequencing. In particular, with scRNA-seq data all the analysis steps, including normalization, have become computationally intensive, both in terms of memory usage and computational time. In this perspective, new accurate methods able to scale efficiently are desirable. RESULTS: Here, we propose PsiNorm, a between-sample normalization method based on the power-law Pareto distribution parameter estimate. We show that the Pareto distribution closely resembles scRNA-seq data, especially those coming from platforms that use unique molecular identifiers. Motivated by this result, we implement PsiNorm, a simple and highly scalable normalization method. We benchmark PsiNorm against seven other methods in terms of cluster identification, concordance and computational resources required. We demonstrate that PsiNorm is among the top performing methods, showing a good trade-off between accuracy and scalability. Moreover, PsiNorm does not need a reference, a characteristic that makes it useful in supervised classification settings, in which new out-of-sample data need to be normalized. AVAILABILITY AND IMPLEMENTATION: PsiNorm is implemented in the scone Bioconductor package and available at https://bioconductor.org/packages/scone/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. SUMMARY: SpatialExperiment is a new data infrastructure for storing and accessing spatially-resolved transcriptomics data, implemented within the R/Bioconductor framework, which provides advantages of modularity, interoperability, standardized operations and comprehensive documentation. Here, we demonstrate the structure and user interface with examples from the 10x Genomics Visium and seqFISH platforms, and provide access to example datasets and visualization tools in the STexampleData, TENxVisiumData and ggspavis packages. AVAILABILITY AND IMPLEMENTATION: The SpatialExperiment, STexampleData, TENxVisiumData and ggspavis packages are available from Bioconductor. The package versions described in this manuscript are available in Bioconductor version 3.15 onwards. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. The assay for transposase-accessible chromatin using sequencing (ATAC-seq) allows the study of epigenetic regulation of gene expression by assessing chromatin configuration for an entire genome. Despite its popularity, there have been limited studies investigating the analytical challenges related to ATAC-seq data, with most studies leveraging tools developed for bulk transcriptome sequencing. Here, we show that GC-content effects are omnipresent in ATAC-seq datasets. Since the GC-content effects are sample specific, they can bias downstream analyses such as clustering and differential accessibility analysis. We introduce a normalization method based on smooth-quantile normalization within GC-content bins and evaluate it together with 11 different normalization procedures on 8 public ATAC-seq datasets. Accounting for GC-content effects in the normalization is crucial for common downstream ATAC-seq data analyses, improving accuracy and interpretability. Through case studies, we show that exploratory data analysis is essential to guide the choice of an appropriate normalization method for a given dataset.

Abstract. Here we report the generation of a multimodal cell census and atlas of the mammalian primary motor cortex as the initial product of the BRAIN Initiative Cell Census Network (BICCN). This was achieved by coordinated large-scale analyses of single-cell transcriptomes, chromatin accessibility, DNA methylomes, spatially resolved single-cell transcriptomes, morphological and electrophysiological properties and cellular resolution input–output mapping, integrated through cross-modal computational analysis. Our results advance the collective knowledge and understanding of brain cell-type organization [1–5]. First, our study reveals a unified molecular genetic landscape of cortical cell types that integrates their transcriptome, open chromatin and DNA methylation maps. Second, cross-species analysis achieves a consensus taxonomy of transcriptomic types and their hierarchical organization that is conserved from mouse to marmoset and human. Third, in situ single-cell transcriptomics provides a spatially resolved cell-type atlas of the motor cortex. Fourth, cross-modal analysis provides compelling evidence for the transcriptomic, epigenomic and gene regulatory basis of neuronal phenotypes such as their physiological and anatomical properties, demonstrating the biological validity and genomic underpinning of neuron types. We further present an extensive genetic toolset for targeting glutamatergic neuron types towards linking their molecular and developmental identity to their circuit function. Together, our results establish a unifying and mechanistic framework of neuronal cell-type organization that integrates multi-layered molecular genetic and spatial information with multi-faceted phenotypic properties.

Abstract. Single-cell transcriptomics can provide quantitative molecular signatures for large, unbiased samples of the diverse cell types in the brain [1–3]. With the proliferation of multi-omics datasets, a major challenge is to validate and integrate results into a biological understanding of cell-type organization. Here we generated transcriptomes and epigenomes from more than 500,000 individual cells in the mouse primary motor cortex, a structure that has an evolutionarily conserved role in locomotion. We developed computational and statistical methods to integrate multimodal data and quantitatively validate cell-type reproducibility. The resulting reference atlas, containing over 56 neuronal cell types that are highly replicable across analysis methods, sequencing technologies and modalities, is a comprehensive molecular and genomic account of the diverse neuronal and non-neuronal cell types in the mouse primary motor cortex. The atlas includes a population of excitatory neurons that resemble pyramidal cells in layer 4 in other cortical regions [4]. We further discovered thousands of concordant marker genes and gene regulatory elements for these cell types. Our results highlight the complex molecular regulation of cell types in the brain and will directly enable the design of reagents to target specific cell types in the mouse primary motor cortex for functional analysis.

Abstract. Splicing varies across brain regions, but the single-cell resolution of regional variation is unclear. We present a single-cell investigation of differential isoform expression (DIE) between brain regions using single-cell long-read sequencing in mouse hippocampus and prefrontal cortex in 45 cell types at postnatal day 7 (www.isoformAtlas.com). Isoform tests for DIE show better performance than exon tests. We detect hundreds of DIE events traceable to cell types, often corresponding to functionally distinct protein isoforms. Mostly, one cell type is responsible for brain-region specific DIE. However, for fewer genes, multiple cell types influence DIE. Thus, regional identity can, although rarely, override cell-type specificity. Cell types indigenous to one anatomic structure display distinctive DIE, e.g. the choroid plexus epithelium manifests distinct transcription-start-site usage. Spatial transcriptomics and long-read sequencing yield a spatially resolved splicing map. Our methods quantify isoform expression with cell-type and spatial resolution and contribute to furthering our understanding of how the brain integrates molecular and cellular complexity.

Abstract. MOTIVATION: Data transformations are an important step in the analysis of RNA-seq data. Nonetheless, the impact of transformation on the outcome of unsupervised clustering procedures is still unclear. RESULTS: Here, we present an Asymmetric Winsorization per-Sample Transformation (AWST), which is robust to data perturbations and removes the need for selecting the most informative genes prior to sample clustering. Our procedure leads to robust and biologically meaningful clusters both in bulk and in single-cell applications. AVAILABILITY AND IMPLEMENTATION: The AWST method is available at https://github.com/drisso/awst. The code to reproduce the analyses is available at https://github.com/drisso/awst_analysis. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Abstract. Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
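
The mini-batch idea mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the mbkmeans implementation: the function names, the `read_batch` callback (a hypothetical stand-in for reading selected rows from an on-disk store such as an HDF5 file), and all parameter defaults are assumptions made for illustration. The update rule is the usual mini-batch k-means step, in which each centre moves towards its assigned points with a per-centre learning rate of 1/count, so only one small batch needs to be in memory at a time.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the centre closest to `point`."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))


def mini_batch_kmeans(read_batch, n_points, k, batch_size=32,
                      n_iter=200, seed=0, init_idx=None):
    """Illustrative mini-batch k-means.

    read_batch(indices) -> list of points; a hypothetical callback standing
    in for on-disk access, so the full dataset is never loaded at once.
    init_idx optionally fixes which points seed the centres; by default the
    seeds are drawn at random, which is a naive choice (practical
    implementations typically use a more careful initialisation).
    """
    rng = random.Random(seed)
    if init_idx is None:
        init_idx = rng.sample(range(n_points), k)
    centers = [list(p) for p in read_batch(init_idx)]
    counts = [0] * k  # number of points each centre has absorbed so far

    for _ in range(n_iter):
        idx = rng.sample(range(n_points), min(batch_size, n_points))
        batch = read_batch(idx)
        # Assign the whole batch first, then update the centres.
        assignments = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centre learning rate
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    return centers


if __name__ == "__main__":
    # Toy in-memory data; in practice read_batch would read from disk.
    rng = random.Random(42)
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    points += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(500)]
    print(mini_batch_kmeans(lambda idx: [points[i] for i in idx],
                            len(points), k=2))
```

Because the algorithm touches the data only through `read_batch`, swapping the in-memory list for a reader over an on-disk matrix would not change the clustering code itself, which is the property the abstract highlights.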

Abstract. Editorial article. Front. Oncol., 15 September 2020. Sec. Cancer Genetics, Volume 10 - 2020. https://doi.org/10.3389/fonc.2020.01768

Abstract. Altered olfactory function is a common symptom of COVID-19, but its etiology is unknown. A key question is whether SARS-CoV-2 (CoV-2) - the causal agent in COVID-19 - affects olfaction directly, by infecting olfactory sensory neurons or their targets in the olfactory bulb, or indirectly, through perturbation of supporting cells. Here we identify cell types in the olfactory epithelium and olfactory bulb that express SARS-CoV-2 cell entry molecules. Bulk sequencing demonstrated that mouse, non-human primate and human olfactory mucosa expresses two key genes involved in CoV-2 entry, ACE2 and TMPRSS2. However, single cell sequencing revealed that ACE2 is expressed in support cells, stem cells, and perivascular cells, rather than in neurons. Immunostaining confirmed these results and revealed pervasive expression of ACE2 protein in dorsally-located olfactory epithelial sustentacular cells and olfactory bulb pericytes in the mouse. These findings suggest that CoV-2 infection of non-neuronal cell types leads to anosmia and related disturbances in odor perception in COVID-19 patients.

Abstract. BACKGROUND: The correct identification of differentially abundant microbial taxa between experimental conditions is a methodological and computational challenge. Recent work has produced methods to deal with the high sparsity and compositionality characteristic of microbiome data, but independent benchmarks comparing these to alternatives developed for RNA-seq data analysis are lacking. RESULTS: We compare methods developed for single-cell and bulk RNA-seq, and specifically for microbiome data, in terms of suitability of distributional assumptions, ability to control false discoveries, concordance, power, and correct identification of differentially abundant genera. We benchmark these methods using 100 manually curated datasets from 16S and whole metagenome shotgun sequencing. CONCLUSIONS: The multivariate and compositional methods developed specifically for microbiome analysis did not outperform univariate methods developed for differential expression analysis of RNA-seq data. We recommend a careful exploratory data analysis prior to application of any inferential model and we present a framework to help scientists make an informed choice of analysis methods in a dataset-specific manner.

Abstract. The neocortex is functionally organized into layers. Layer four receives the densest bottom up sensory inputs, while layers 2/3 and 5 receive top down inputs that may convey predictive information. A subset of cortical somatostatin (SST) neurons, the Martinotti cells, gate top down input by inhibiting the apical dendrites of pyramidal cells in layers 2/3 and 5, but it is unknown whether an analogous inhibitory mechanism controls activity in layer 4. Using high precision circuit mapping, in vivo optogenetic perturbations, and single cell transcriptional profiling, we reveal complementary circuits in the mouse barrel cortex involving genetically distinct SST subtypes that specifically and reciprocally interconnect with excitatory cells in different layers: Martinotti cells connect with layers 2/3 and 5, whereas non-Martinotti cells connect with layer 4. By enforcing layer-specific inhibition, these parallel SST subnetworks could independently regulate the balance between bottom up and top down input.

Abstract. Autism Spectrum Disorder (ASD) is the most prevalent neurodevelopmental disorder in the United States and often co-presents with sleep problems. Sleep problems in ASD predict the severity of ASD core diagnostic symptoms and have a considerable impact on the quality of life of caregivers. Little is known, however, about the underlying molecular mechanisms of sleep problems in ASD. We investigated the role of Shank3, a high confidence ASD gene candidate, in sleep architecture and regulation. We show that mice lacking exon 21 of Shank3 have problems falling asleep even when sleepy. Using RNA-seq we show that sleep deprivation increases the differences in prefrontal cortex gene expression between mutants and wild types, downregulating circadian transcription factors Per3, Bhlhe41, Hlf, Tef, and Nr1d1. Shank3 mutants also have trouble regulating wheel-running activity in constant darkness. Overall, our study shows that Shank3 is an important modulator of sleep and clock gene expression.

Abstract. Nowadays, the analysis of RNA-seq and BS-Seq can be considered well established, whereas the analysis of broad peaks data such as Sono-Seq/ATAC-Seq and histone modification (HM) ChIP-Seq is still challenging. To fill the gap in existing methods, we present DEScan2, a novel Bioconductor package [2] for the analysis of broad peaks data. The method consists of three main steps: 1) a peak caller, 2) peak filtering and alignment across replicates and 3) a method to efficiently compute a count matrix of the filtered peaks. Applied to a previously published ATAC-Seq chromatin accessibility dataset, our method shows interesting results, also in comparison with other well-known tools for this kind of data analysis.

Abstract. Dropout events in single-cell RNA sequencing (scRNA-seq) cause many transcripts to go undetected and induce an excess of zero read counts, leading to power issues in differential expression (DE) analysis. This has triggered the development of bespoke scRNA-seq DE methods to cope with zero inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce a weighting strategy, based on a zero-inflated negative binomial model, that identifies excess zero counts and generates gene- and cell-specific weights to unlock bulk RNA-seq DE pipelines for zero-inflated data, boosting performance for scRNA-seq.

Abstract. Midbrain dopamine neurons project to numerous targets throughout the brain to modulate various behaviors and brain states. Within this small population of neurons exists significant heterogeneity based on physiology, circuitry, and disease susceptibility. Recent studies have shown that dopamine neurons can be subdivided based on gene expression; however, the extent to which genetic markers represent functionally relevant dopaminergic subpopulations has not been fully explored. Here we performed single-cell RNA-sequencing of mouse dopamine neurons and validated studies showing that Neurod6 and Grp are selective markers for dopaminergic subpopulations. Using a combination of multiplex fluorescent in situ hybridization, retrograde labeling, and electrophysiology in mice of both sexes, we defined the anatomy, projection targets, physiological properties, and disease vulnerability of dopamine neurons based on Grp and/or Neurod6 expression. We found that the combinatorial expression of Grp and Neurod6 defines dopaminergic subpopulations with unique features. Grp+/Neurod6+ dopamine neurons reside in the ventromedial VTA, send projections to the medial shell of the nucleus accumbens, and have noncanonical physiological properties. Grp+/Neurod6- dopamine neurons are found in the VTA as well as in the ventromedial portion of the SNc, where they project selectively to the dorsomedial striatum. Grp-/Neurod6+ dopamine neurons represent a smaller VTA subpopulation, which is preferentially spared in a 6-OHDA model of Parkinson’s disease. Together, our work provides detailed characterization of Neurod6 and Grp expression in the midbrain and generates new insights into how these markers define functionally relevant dopaminergic subpopulations.

Abstract. Clustering of genes and/or samples is a common task in gene expression analysis. The goals in clustering can vary, but an important scenario is that of finding biologically meaningful subtypes within the samples. This is an application that is particularly appropriate when there are large numbers of samples, as in many human disease studies. With the increasing popularity of single-cell transcriptome sequencing (RNA-Seq), many more controlled experiments on model organisms are similarly creating large gene expression datasets with the goal of detecting previously unknown heterogeneity within cells. It is common in the detection of novel subtypes to run many clustering algorithms, as well as rely on subsampling and ensemble methods to improve robustness. We introduce a Bioconductor R package, clusterExperiment, that implements a general and flexible strategy we entitle Resampling-based Sequential Ensemble Clustering (RSEC). RSEC enables the user to easily create multiple, competing clusterings of the data based on different techniques and associated tuning parameters, including easy integration of resampling and sequential clustering, and then provides methods for consolidating the multiple clusterings into a final consensus clustering. The package is modular and allows the user to separately apply the individual components of the RSEC procedure, i.e., apply multiple clustering algorithms, create a consensus clustering or choose tuning parameters, and merge clusters. Additionally, clusterExperiment provides a variety of visualization tools for the clustering process, as well as methods for the identification of possible cluster signatures or biomarkers. The R package clusterExperiment is publicly available through the Bioconductor Project, with a detailed manual (vignette) as well as well-documented help pages for each function.

Abstract. Single-cell RNA-sequencing (scRNA-seq) is a powerful high-throughput technique that enables researchers to measure genome-wide transcription levels at the resolution of single cells. Because of the low amount of RNA present in a single cell, some genes may fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zero-inflated negative binomial model (ZINB-WaVE), which leads to low-dimensional representations of the data that account for zero inflation (dropouts), over-dispersion, and the count nature of the data. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate low-dimensional representation of the data than principal component analysis (PCA) and zero-inflated factor analysis (ZIFA), without the need for a preliminary normalization step.
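
For readers unfamiliar with the zero-inflated negative binomial distribution named in the abstract above, its standard textbook form is sketched below. The symbols are notation assumed here for illustration and are not taken from the abstract: π is the probability of an excess zero, μ the negative binomial mean, θ the inverse dispersion, and δ₀ the indicator of y = 0.

```latex
f_{\mathrm{ZINB}}(y;\mu,\theta,\pi)
  = \pi\,\delta_0(y) + (1-\pi)\,f_{\mathrm{NB}}(y;\mu,\theta),
  \qquad y \in \{0, 1, 2, \dots\}

f_{\mathrm{NB}}(y;\mu,\theta)
  = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\,\Gamma(\theta)}
    \left(\frac{\theta}{\theta+\mu}\right)^{\theta}
    \left(\frac{\mu}{\mu+\theta}\right)^{y}
```

The mixture weight π captures the excess zeros (dropouts), while the negative binomial component has variance μ + μ²/θ and so accommodates over-dispersed counts.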

Abstract. promoter 6) revealed that the SNP rs6010065 was associated with ASD. Our data support the idea that learning recapitulates development at the epigenetic level and demonstrate that behaviorally induced epigenetic changes in mice can highlight regulatory regions relevant to brain disorders in patients.

Abstract. BACKGROUND: Single-cell transcriptomics allows researchers to investigate complex communities of heterogeneous cells. It can be applied to stem cells and their descendants in order to chart the progression from multipotent progenitors to fully differentiated cells. While a variety of statistical and computational methods have been proposed for inferring cell lineages, the problem of accurately characterizing multiple branching lineages remains difficult to solve. RESULTS: We introduce Slingshot, a novel method for inferring cell lineages and pseudotimes from single-cell gene expression data. In previously published datasets, Slingshot correctly identifies the biological signal for one to three branching trajectories. Additionally, our simulation study shows that Slingshot infers more accurate pseudotimes than other leading methods. CONCLUSIONS: Slingshot is a uniquely robust and flexible tool which combines the highly stable techniques necessary for noisy single-cell data with the ability to identify multiple trajectories. Accurate lineage inference is a critical step in the identification of dynamic temporal gene expression.

Abstract. Novel single-cell transcriptome sequencing assays allow researchers to measure gene expression levels at the resolution of single cells and offer the unprecedented opportunity to investigate at the molecular level fundamental biological questions, such as stem cell differentiation or the discovery and characterization of rare cell types. However, such assays raise challenging statistical and computational questions and require the development of novel methodology and software. Using stem cell differentiation in the mouse olfactory epithelium as a case study, this integrated workflow provides a step-by-step tutorial to the methodology and associated software for the following four main tasks: (1) dimensionality reduction accounting for zero inflation and over-dispersion and adjusting for gene and cell-level covariates; (2) cell clustering using resampling-based sequential ensemble clustering; (3) inference of cell lineages and pseudotimes; and (4) differential expression analysis along lineages.

Abstract. Limiting potential for totipotency. Biological roles for microRNAs are not limited to RNA silencing and posttranscriptional regulation; they have now been shown to also regulate cell pluripotency. Choi et al. eliminated miR-34a from mouse embryonic stem cells and found that the cells exhibited a bidirectional cell fate potential, generating both embryonic and extraembryonic lineages (see the Perspective by Hasuwa and Siomi). During miR-34a deficiency, an endogenous retrovirus was induced, at least in part through Gata2-dependent transcriptional activation. Thus, the interplay of protein-coding genes, noncoding RNAs, and endogenous retroviruses can change cell fate plasticity and the developmental potential of pluripotent stem cells. Science, this issue p. eaag1927; see also p. 581

Abstract. BACKGROUND: Why we sleep is still one of the most perplexing mysteries in biology. Strong evidence indicates that sleep is necessary for normal brain function and that sleep need is a tightly regulated process. Surprisingly, molecular mechanisms that determine sleep need are incompletely described. Moreover, very little is known about transcriptional changes that specifically accompany the accumulation and discharge of sleep need. Several studies have characterized differential gene expression changes following sleep deprivation. Much less is known, however, about changes in gene expression during the compensatory response to sleep deprivation (i.e. recovery sleep). RESULTS: In this study we present a comprehensive analysis of the effects of sleep deprivation and subsequent recovery sleep on gene expression in the mouse cortex. We used a non-traditional analytical method for normalization of genome-wide gene expression data, Removal of Unwanted Variation (RUV). RUV improves detection of differential gene expression following sleep deprivation. We also show that RUV normalization is crucial to the discovery of differentially expressed genes associated with recovery sleep. Our analysis indicates that the majority of transcripts upregulated by sleep deprivation require 6 h of recovery sleep to return to baseline levels, while the majority of downregulated transcripts return to baseline levels within 1-3 h. We also find that transcripts that change rapidly during recovery (i.e. within 3 h) do so on average with a time constant that is similar to the time constant for the discharge of sleep need. CONCLUSIONS: We demonstrate that proper data normalization is essential to identify changes in gene expression that are specifically linked to sleep deprivation and recovery sleep. Our results provide the first evidence that recovery sleep is comprised of two waves of transcriptional regulation that occur at different times and affect functionally distinct classes of genes.

Abstract. The process of memory consolidation requires transcription and translation to form long-term memories. Significant effort has been dedicated to understanding changes in hippocampal gene expression after contextual fear conditioning. However, alternative splicing by differential transcript regulation during this time period has received less attention. Here, we use RNA-seq to determine exon-level changes in expression after contextual fear conditioning and retrieval. Our work reveals that a short variant of Homer1, Ania-3, is regulated by contextual fear conditioning. The ribosome biogenesis regulator Las1l, small nucleolar RNA Snord14e, and the RNA-binding protein Rbm3 also change specific transcript usage after fear conditioning. The changes in Ania-3 and Las1l are specific to either the new context or the context-shock association, while the changes in Rbm3 occur after context or shock only. Our analysis revealed novel transcript regulation of previously undetected changes after learning, revealing the importance of high throughput sequencing approaches in the study of gene expression changes after learning.

Abstract. The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression. Despite its widespread use, challenges remain in RNA-seq data analysis. One often-overlooked aspect is normalization. Despite the fact that a variety of factors or 'batch effects' can contribute unwanted variation to the data, commonly used RNA-seq normalization methods only correct for sequencing depth. The study of gene expression is particularly problematic when it is influenced simultaneously by a variety of biological factors in addition to the one of interest. Using examples from experimental neuroscience, we show that batch effects can dominate the signal of interest; and that the choice of normalization method affects the power and reproducibility of the results. While commonly used global normalization methods are not able to adequately normalize the data, more recently developed RNA-seq normalization can. We focus on one particular method, RUVSeq and show that it is able to increase power and biological insight of the results. Finally, we provide a tutorial outlining the implementation of RUVSeq normalization that is applicable to a broad range of studies as well as meta-analysis of publicly available data.

Abstract. BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. RESULTS: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. CONCLUSIONS: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.

Abstract. BACKGROUND: In the last decades, microarray technology has spread, leading to a dramatic increase of publicly available datasets. The first statistical tools developed were focused on the identification of significant differentially expressed genes. Later, researchers moved toward the systematic integration of gene expression profiles with additional biological information, such as chromosomal location, ontological annotations or sequence features. The analysis of gene expression linked to physical location of genes on chromosomes allows the identification of transcriptionally imbalanced regions, while Gene Set Analysis focuses on the detection of coordinated changes in transcriptional levels among sets of biologically related genes. In this field, meta-analysis offers the possibility to compare different studies, addressing the same biological question to fully exploit public gene expression datasets. RESULTS: We describe STEPath, a method that starts from gene expression profiles and integrates the analysis of imbalanced regions as an a priori step before performing gene set analysis. The application of STEPath in individual studies produced gene set scores weighted by chromosomal activation. As a final step, we propose a way to compare these scores across different studies (meta-analysis) on related biological issues. One complication with meta-analysis is batch effects, which occur because molecular measurements are affected by laboratory conditions, reagent lots and personnel differences. Major problems occur when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. We evaluated the power of combining chromosome mapping and gene set enrichment analysis, performing the analysis on a dataset of leukaemia (example of individual study) and on a dataset of skeletal muscle diseases (meta-analysis approach). In leukaemia, we identified the Hox gene set, a gene set closely related to the pathology that other algorithms of gene set analysis do not identify, while the meta-analysis approach on muscular disease discriminates between related pathologies and correlates similar ones from different studies. CONCLUSIONS: STEPath is a new method that integrates gene expression profiles, genomic co-expressed regions and the information about the biological function of genes. The usage of the STEPath-computed gene set scores overcomes batch effects in the meta-analysis approaches, allowing the direct comparison of different pathologies and different studies on a gene set activation level.

Abstract. BACKGROUND: Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables can be much higher than the number of observations. RESULTS: Here, we present a general framework to deal with the clustering of microarray data, based on a three-step procedure: (i) gene filtering; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Via a nonparametric model-based clustering approach we obtain promising results both in simulated and real data. CONCLUSIONS: The proposed algorithm is a simple and effective tool for the clustering of microarray data, in an unsupervised setting.

Abstract. MOTIVATION: Microarray normalization is a fundamental step in removing systematic bias and noise variability caused by technical and experimental artefacts. Several approaches, suitable for large-scale genome arrays, have been proposed and shown to be effective in the reduction of systematic errors. Most of these methodologies are based on specific assumptions that are reasonable for whole-genome arrays, but possibly unsuitable for small microRNA (miRNA) platforms. In this work, we propose a novel normalization (loessM), and we investigate, through simulated and real datasets, the influence that normalizations for two-colour miRNA arrays have on the identification of differentially expressed genes. RESULTS: We show that normalizations usually applied to large-scale arrays, in several cases, modify the actual structure of miRNA data, leading to large portions of false positives and false negatives. Nevertheless, loessM is able to outperform other techniques in most experimental scenarios. Moreover, when usual assumptions on differential expression distribution are missed, channel effect has a strikingly negative influence on small arrays, bias that cannot be removed by normalizations but rather by an appropriate experimental design. We find that the combination of loessM with eCADS, an experimental design based on biological replicates dye-swap recently proposed for channel-effect reduction, gives better results in most of the experimental conditions in terms of specificity/sensitivity both on simulated and real data. AVAILABILITY: LoessM R function is freely available at http://gefu.cribi.unipd.it/papers/miRNA-simulation/

Abstract. BACKGROUND: Various normalisation techniques have been developed in the context of microarray analysis to try to correct expression measurements for experimental bias and random fluctuations. Major techniques include: total intensity normalisation; intensity dependent normalisation; and variance stabilising normalisation. The aim of this paper is to discuss the impact of normalisation techniques for two-channel array technology on the process of identification of differentially expressed genes. RESULTS: Through three precise simulation plans, we quantify the impact of normalisations: (a) on the sensitivity and specificity of a specified test statistic for the identification of deregulated genes, (b) on the gene ranking induced by the statistic. CONCLUSION: Although we found a limited difference of sensitivities and specificities for the test after each normalisation, the study highlights a strong impact in terms of gene ranking agreement, resulting in different levels of agreement between competing normalisations. However, we show that the combination of two normalisations, such as glog and lowess, that handle different aspects of microarray data, is able to outperform other individual techniques.

Abstract. BACKGROUND: Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. RESULTS: This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. CONCLUSION: A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at http://compgen.bio.unipd.it/bioinfo/amadman/.