Infer prior GRN from pySCENIC

In this notebook, we use SCENIC to infer a prior gene regulatory network (GRN) for the RegVelo pipeline.

Library import

import os
import numpy as np
import pandas as pd
import scanpy as sc
import loompy as lp

import glob
sc.settings.verbosity = 3
sc.logging.print_versions()
-----
anndata     0.11.4
scanpy      1.10.4
-----
PIL                 11.2.1
asttokens           NA
charset_normalizer  3.4.1
cloudpickle         3.1.1
comm                0.2.1
cycler              0.12.1
cython_runtime      NA
cytoolz             1.0.1
dask                2025.4.1
dateutil            2.9.0.post0
debugpy             1.8.11
decorator           5.1.1
exceptiongroup      1.2.0
executing           0.8.3
h5py                3.13.0
ipykernel           6.29.5
jedi                0.19.2
jinja2              3.1.6
joblib              1.4.2
kiwisolver          1.4.8
legacy_api_wrap     NA
llvmlite            0.44.0
loompy              3.0.8
lz4                 4.4.4
markupsafe          3.0.2
matplotlib          3.10.1
mpl_toolkits        NA
natsort             8.4.0
numba               0.61.2
numexpr             2.10.2
numpy               2.2.5
numpy_groupies      0.11.2
packaging           24.2
pandas              2.2.3
parso               0.8.4
platformdirs        4.3.7
prompt_toolkit      3.0.43
psutil              5.9.0
pure_eval           0.2.2
pyarrow             20.0.0
pydev_ipython       NA
pydevconsole        NA
pydevd              3.2.3
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.19.1
pyparsing           3.2.3
pytz                2025.2
scipy               1.15.2
session_info        v1.0.1
six                 1.17.0
sklearn             1.6.1
stack_data          0.2.0
tblib               3.1.0
threadpoolctl       3.6.0
tlz                 1.0.1
toolz               1.0.0
tornado             6.4.2
traitlets           5.14.3
typing_extensions   NA
wcwidth             0.2.5
yaml                6.0.2
zipp                NA
zmq                 26.2.0
zoneinfo            NA
-----
IPython             8.30.0
jupyter_client      8.6.3
jupyter_core        5.7.2
-----
Python 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
Linux-5.14.0-427.37.1.el9_4.x86_64-x86_64-with-glibc2.34
-----
Session information updated at 2025-04-28 11:30

Load data and output to loom file

Read murine neural crest data.

adata = rgv.datasets.murine_nc(data_type = "normalized")
adata
AnnData object with n_obs × n_vars = 6788 × 30717
    obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron'
    var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
    obsm: 'X_pca', 'X_umap'
adata = sc.AnnData(adata.X, obs=adata.obs, var=adata.var)
adata.var["Gene"] = adata.var_names
adata.obs["CellID"] =  adata.obs_names
adata.write_loom("adata.loom")

SCENIC steps

In the following, we use SCENIC to infer prior regulation information. Installation and usage steps are given in pySCENIC and are demonstrated in SCENICprotocol.

# path to loom file created previously
f_loom_path_scenic = "adata.loom"
# path to list of transcription factors
f_tfs = "allTFs_mm.txt"
!pyscenic grn {f_loom_path_scenic} {f_tfs} -o "adj.csv" --num_workers 24
/bin/bash: line 1: pyscenic: command not found
# path to ranking databases in feather format
f_db_glob = "scenic/cisTarget_databases/*feather"
f_db_names = ' '.join( glob.glob(f_db_glob) )

# path to motif databases
f_motif_path = "scenic/cisTarget_databases/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
!pyscenic ctx "adj.csv" \
    {f_db_names} \
    --annotations_fname {f_motif_path} \
    --expression_mtx_fname {f_loom_path_scenic} \
    --output "reg.csv" \
    --all_modules \
    --num_workers 24
f_pyscenic_output = "pyscenic_output_all_regulon_no_mask.loom"
!pyscenic aucell \
    {f_loom_path_scenic} \
    "reg.csv" \
    --output {f_pyscenic_output} \
    --num_workers 4
# collect SCENIC AUCell output
lf = lp.connect(f_pyscenic_output, mode='r+', validate=False )
auc_mtx = pd.DataFrame(lf.ca.RegulonsAUC, index=lf.ca.CellID)
regulons = lf.ra.Regulons
res = pd.concat([pd.Series(r.tolist(), index=regulons.dtype.names) for r in regulons], axis=1)
res.columns = lf.row_attrs["SYMBOL"]
res.to_csv("regulon_mat_all_regulons.csv")

Create prior GRN for RegVelo

In the following, we preprocess the GRN inferred from SCENIC, saved as regulon_mat_all_regulons.csv. We first read the regulon file, where rows are regulators and columns are target genes. We further extract the names of the transcription factors (TFs) from the row indices using a regex and collapse dublicte TFs by summing their rows.

# load saved regulon-target matrix
reg = pd.read_csv("regulon_mat_all_regulons.csv", index_col = 0)

reg.index = reg.index.str.extract(r"(\w+)")[0]
reg = reg.groupby(reg.index).sum()

We further binarize the matrix, where 1 indicates the presence of regulation and 0 indicates otherwise and get the list of TFs and genes.

reg[reg != 0] = 1

TF = np.unique(list(map(lambda x: x.split("(")[0], reg.index.tolist())))
genes = np.unique(TF.tolist() + reg.columns.tolist())

For the prior GRN, we first construct an empty square matrix and populate it based on the previously binarized regulation information. We further remove the genes that are neither a TF nor a target gene (i.e. remove empty rows and comlumns) and save the cleaned and structured GRN to a .parquet file for RegVelo’s downstream pipeline.

GRN = pd.DataFrame(0, index=genes, columns=genes)
GRN.loc[TF,reg.columns.tolist()] = np.array(reg)

mask = (GRN.sum(0) != 0) | (GRN.sum(1) != 0)
GRN = GRN.loc[mask, mask].copy()

GRN.to_parquet("regulon_mat_processed_all_regulons.parquet")
print("Done! processed GRN with " + str(reg.shape[0]) + " TFs and " + str(reg.shape[1]) + " targets")
Done! processed GRN with 543 TF and 30717 targets