Preprocess data and add prior GRN information

In this notebook, we will go through the preprocessing steps needed prior to running RegVelo pipeline.

Library import

import scvelo as scv
import scanpy as sc
import pandas as pd
import numpy as np

import regvelo as rgv

Load data

Read murine neural crest data that contains .layers['spliced'] and .layers['unspliced'].

adata = rgv.datasets.murine_nc(data_type = "velocyto")
adata
AnnData object with n_obs × n_vars = 6788 × 30717
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'

Data preprocessing

We perform preprocessing steps, consisting of filtering, normalization, and logarithmizing. We also compute the neighborhood graph in PCA space and embedd it for visualization with UMAP. We further compute the first and second order moments (means and uncentered variances) using scv.pp.moments needed for velocity estimation.

Note

If preprocessing steps have already performed, you can skip this section and proceed directly to loading prior GRN.

scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
sc.pp.log1p(adata)

sc.pp.neighbors(adata, n_pcs=30, n_neighbors=50)
sc.tl.umap(adata)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 22217 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 3000 highly variable genes.
computing moments based on connectivities
    finished (0:00:02) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
adata
AnnData object with n_obs × n_vars = 6788 × 3000
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'log1p', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Load prior GRN created from notebook ‘Infer prior GRN from pySCENIC’ for RegVelo

In the following, we load the processed prior GRN infromation into our AnnData object. In this step .uns['skeleton'] and .var['TF'] are added, which will be needed for RegVelo’s velocity pipeline.

GRN = pd.read_parquet("regulon_mat_processed_all_regulons.parquet")
GRN.head()
0610005C13Rik 0610009L18Rik 0610010K14Rik 0610012G03Rik 0610030E20Rik 0610038B21Rik 0610040B10Rik 0610040J01Rik 0610043K17Rik 1110002L01Rik ... Zswim8 Zw10 Zwilch Zwint Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3
0610005C13Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0610009L18Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0610010K14Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0610012G03Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0610030E20Rik 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 13697 columns

Note

The function rgv.pp.set_prior_grn aligns the loaded GRN with the gene expression data in adata and by default, it removes genes without incoming or outgoing regulatory edges.

adata = rgv.pp.set_prior_grn(adata, GRN.T)
adata
AnnData object with n_obs × n_vars = 6788 × 2112
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Note

The following steps ensure that only velocity-informative genes and TF genes are considered and updates adata.uns["skeleton"] accordingly. The selection of velocity-informative genes is done using rgv.pp.preprocess_data, which in addition to min-max scaling of the spliced and unspliced layers, filters genes with non-negative fitted degradation rates \(\gamma\) and non-negative \(R^2\) values from scv.tl.velocity with mode=deterministic.

velocity_genes = rgv.pp.preprocess_data(adata.copy()).var_names.tolist()

# select TFs that regulate at least one gene
TF = adata.var_names[adata.uns["skeleton"].sum(1) != 0]
var_mask = np.union1d(TF, velocity_genes)

adata = adata[:, var_mask].copy()

# update skeleton matrix
adata.uns["skeleton"] = adata.uns["skeleton"].loc[adata.var_names.tolist(), adata.var_names.tolist()]
computing velocities
    finished (0:00:00) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

Note

The function rgv.pp.filter_genes further refines the GRN, such that each gene has at least one regulator. This step further reduces the number of genes considered.

adata = rgv.pp.filter_genes(adata)

# perform min-max scaling
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)

adata.var["velocity_genes"] = adata.var_names.isin(velocity_genes)
adata.var["TF"] = adata.var_names.isin(TF)

adata
Number of genes: 1187
Number of genes: 1164
AnnData object with n_obs × n_vars = 6788 × 1164
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'velocity_genes', 'TF'
    uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Note

If you are using a dataset, where you have already performed preprocessing steps following scVelo and wish to retain all genes for velocity estimations, you can choose to omit filtering for velocity-informative genes and TF genes, as well as the GRN refinement step in rgv.pp.filter_genes. In this case, you can set keep_dim=True in rgv.pp.set_prior_grn followed by running adata = rgv.pp.preprocess_data(adata, filter_on_r2=False) for min-max scaling of the spliced and unspliced layers, as shown below:

adata = rgv.pp.set_prior_grn(adata, GRN.T, keep_dim=True)
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)

Save data

This data, with the parameters chosen in this tutorial, can also be assessed by calling rgv.datasets.murine_nc(data_type = "preprocessed")

adata.write_h5ad("adata_processed_velo.h5ad")