Preprocess data and add prior GRN information¶
In this notebook, we will go through the preprocessing steps needed prior to running RegVelo pipeline.
Library import¶
import scvelo as scv
import scanpy as sc
import pandas as pd
import numpy as np
import regvelo as rgv
Load data¶
Read murine neural crest data that contains .layers['spliced'] and .layers['unspliced'].
adata = rgv.datasets.murine_nc(data_type = "velocyto")
adata
AnnData object with n_obs × n_vars = 6788 × 30717
obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron'
var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
Data preprocessing¶
We perform preprocessing steps, consisting of filtering, normalization, and logarithmizing. We also compute the neighborhood graph in PCA space and embedd it for visualization with UMAP. We further compute the first and second order moments (means and uncentered variances) using scv.pp.moments needed for velocity estimation.
Note
If preprocessing steps have already performed, you can skip this section and proceed directly to loading prior GRN.
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
sc.pp.log1p(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=50)
sc.tl.umap(adata)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 22217 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 3000 highly variable genes.
computing moments based on connectivities
finished (0:00:02) --> added
'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
adata
AnnData object with n_obs × n_vars = 6788 × 3000
obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'log1p', 'neighbors', 'umap'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced', 'Ms', 'Mu'
obsp: 'distances', 'connectivities'
Load prior GRN created from notebook ‘Infer prior GRN from pySCENIC’ for RegVelo¶
In the following, we load the processed prior GRN infromation into our AnnData object. In this step .uns['skeleton'] and .var['TF'] are added, which will be needed for RegVelo’s velocity pipeline.
GRN = pd.read_parquet("regulon_mat_processed_all_regulons.parquet")
GRN.head()
| 0610005C13Rik | 0610009L18Rik | 0610010K14Rik | 0610012G03Rik | 0610030E20Rik | 0610038B21Rik | 0610040B10Rik | 0610040J01Rik | 0610043K17Rik | 1110002L01Rik | ... | Zswim8 | Zw10 | Zwilch | Zwint | Zxdb | Zxdc | Zyg11b | Zyx | Zzef1 | Zzz3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0610005C13Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610009L18Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610010K14Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610012G03Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610030E20Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 13697 columns
Note
The function rgv.pp.set_prior_grn aligns the loaded GRN with the gene expression data in adata and by default, it removes genes without incoming or outgoing regulatory edges.
adata = rgv.pp.set_prior_grn(adata, GRN.T)
adata
AnnData object with n_obs × n_vars = 6788 × 2112
obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced', 'Ms', 'Mu'
obsp: 'distances', 'connectivities'
Note
The following steps ensure that only velocity-informative genes and TF genes are considered and updates adata.uns["skeleton"] accordingly. The selection of velocity-informative genes is done using rgv.pp.preprocess_data, which in addition to min-max scaling of the spliced and unspliced layers, filters genes with non-negative fitted degradation rates \(\gamma\) and non-negative \(R^2\) values from scv.tl.velocity with mode=deterministic.
velocity_genes = rgv.pp.preprocess_data(adata.copy()).var_names.tolist()
# select TFs that regulate at least one gene
TF = adata.var_names[adata.uns["skeleton"].sum(1) != 0]
var_mask = np.union1d(TF, velocity_genes)
adata = adata[:, var_mask].copy()
# update skeleton matrix
adata.uns["skeleton"] = adata.uns["skeleton"].loc[adata.var_names.tolist(), adata.var_names.tolist()]
computing velocities
finished (0:00:00) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
Note
The function rgv.pp.filter_genes further refines the GRN, such that each gene has at least one regulator. This step further reduces the number of genes considered.
adata = rgv.pp.filter_genes(adata)
# perform min-max scaling
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)
adata.var["velocity_genes"] = adata.var_names.isin(velocity_genes)
adata.var["TF"] = adata.var_names.isin(TF)
adata
Number of genes: 1187
Number of genes: 1164
AnnData object with n_obs × n_vars = 6788 × 1164
obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'velocity_genes', 'TF'
uns: 'log1p', 'neighbors', 'umap', 'regulators', 'targets', 'skeleton', 'network'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced', 'Ms', 'Mu'
obsp: 'distances', 'connectivities'
Note
If you are using a dataset, where you have already performed preprocessing steps following scVelo and wish to retain all genes for velocity estimations, you can choose to omit filtering for velocity-informative genes and TF genes, as well as the GRN refinement step in rgv.pp.filter_genes. In this case, you can set keep_dim=True in rgv.pp.set_prior_grn followed by running adata = rgv.pp.preprocess_data(adata, filter_on_r2=False) for min-max scaling of the spliced and unspliced layers, as shown below:
adata = rgv.pp.set_prior_grn(adata, GRN.T, keep_dim=True)
adata = rgv.pp.preprocess_data(adata, filter_on_r2=False)
Save data¶
This data, with the parameters chosen in this tutorial, can also be assessed by calling rgv.datasets.murine_nc(data_type = "preprocessed")
adata.write_h5ad("adata_processed_velo.h5ad")