Preprocess data and add prior GRN information¶

In this notebook, we will go through the preprocessing steps needed prior to running RegVelo pipeline.

Library import¶

import scvelo as scv
import scanpy as sc
import pandas as pd
import numpy as np

import regvelo as rgv

Load data¶

Read murine neural crest data that contains .layers['spliced'] and .layers['unspliced'].

adata = rgv.datasets.murine_nc(data_type = "velocyto")

adata

AnnData object with n_obs × n_vars = 6788 × 30717
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'

Data preprocessing¶

We perform preprocessing steps, consisting of filtering, normalization, and logarithmizing. We also compute the neighborhood graph in PCA space and embedd it for visualization with UMAP. We further compute the first and second order moments (means and uncentered variances) using scv.pp.moments needed for velocity estimation.

Note

If preprocessing steps have already performed, you can skip this section and proceed directly to loading prior GRN.

scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
sc.pp.log1p(adata)

sc.pp.neighbors(adata, n_pcs=30, n_neighbors=50)
sc.tl.umap(adata)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)

Filtered out 22217 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 3000 highly variable genes.
computing moments based on connectivities
    finished (0:00:02) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

adata

AnnData object with n_obs × n_vars = 6788 × 3000
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'log1p', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Load prior GRN created from notebook ‘Infer prior GRN from pySCENIC’ for RegVelo¶

In the following, we load the processed prior GRN infromation into our AnnData object. In this step .uns['skeleton'] and .var['TF'] are added, which will be needed for RegVelo’s velocity pipeline.

GRN = pd.read_parquet("regulon_mat_processed_all_regulons.parquet")

GRN.head()

	0610005C13Rik	0610009L18Rik	0610010K14Rik	0610012G03Rik	0610030E20Rik	0610038B21Rik	0610040B10Rik	0610040J01Rik	0610043K17Rik	1110002L01Rik	...	Zswim8	Zw10	Zwilch	Zwint	Zxdb	Zxdc	Zyg11b	Zyx	Zzef1	Zzz3
0610005C13Rik	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
0610009L18Rik	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
0610010K14Rik	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
0610012G03Rik	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
0610030E20Rik	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 13697 columns

Note

The function rgv.pp.set_prior_grn aligns the loaded GRN with the gene expression data in adata and by default, it removes genes without incoming or outgoing regulatory edges.

adata = rgv.pp.set_prior_grn(adata, GRN.T)
adata

AnnData object with n_obs × n_vars = 6788 × 2112
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Note

The following steps ensure that only velocity-informative genes and TF genes are considered and updates adata.uns["skeleton"] accordingly. The selection of velocity-informative genes is done using rgv.pp.preprocess_data, which in addition to min-max scaling of the spliced and unspliced layers, filters genes with non-negative fitted degradation rates \(\gamma\) and non-negative \(R^2\) values from scv.tl.velocity with mode=deterministic.

velocity_genes = rgv.pp.preprocess_data(adata.copy()).var_names.tolist()

# select TFs that regulate at least one gene
TF = adata.var_names[adata.uns["skeleton"].sum(1) != 0]
var_mask = np.union1d(TF, velocity_genes)

adata = adata[:, var_mask].copy()

# update skeleton matrix
adata.uns["skeleton"] = adata.uns["skeleton"].loc[adata.var_names.tolist(), adata.var_names.tolist()]

computing velocities
    finished (0:00:00) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

Note

The function rgv.pp.filter_genes further refines the GRN, such that each gene has at least one regulator. This step further reduces the number of genes considered.

adata = rgv.pp.filter_genes(adata)

# perform min-max scaling
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)

adata.var["velocity_genes"] = adata.var_names.isin(velocity_genes)
adata.var["TF"] = adata.var_names.isin(TF)

adata

Number of genes: 1187
Number of genes: 1164

AnnData object with n_obs × n_vars = 6788 × 1164
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'velocity_genes', 'TF'
    uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

Note

If you are using a dataset, where you have already performed preprocessing steps following scVelo and wish to retain all genes for velocity estimations, you can choose to omit filtering for velocity-informative genes and TF genes, as well as the GRN refinement step in rgv.pp.filter_genes. In this case, you can set keep_dim=True in rgv.pp.set_prior_grn followed by running adata = rgv.pp.preprocess_data(adata, filter_on_r2=False) for min-max scaling of the spliced and unspliced layers, as shown below:

adata = rgv.pp.set_prior_grn(adata, GRN.T, keep_dim=True)
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)

Save data¶

This data, with the parameters chosen in this tutorial, can also be assessed by calling rgv.datasets.murine_nc(data_type = "preprocessed")

adata.write_h5ad("adata_processed_velo.h5ad")