Infer prior GRN from pySCENIC¶
In this notebook, we use SCENIC to infer a prior gene regulatory network (GRN) for the RegVelo pipeline.
Library import¶
import os
import numpy as np
import pandas as pd
import scanpy as sc
import loompy as lp
import glob
sc.settings.verbosity = 3
sc.logging.print_versions()
-----
anndata 0.11.4
scanpy 1.10.4
-----
PIL 11.2.1
asttokens NA
charset_normalizer 3.4.1
cloudpickle 3.1.1
comm 0.2.1
cycler 0.12.1
cython_runtime NA
cytoolz 1.0.1
dask 2025.4.1
dateutil 2.9.0.post0
debugpy 1.8.11
decorator 5.1.1
exceptiongroup 1.2.0
executing 0.8.3
h5py 3.13.0
ipykernel 6.29.5
jedi 0.19.2
jinja2 3.1.6
joblib 1.4.2
kiwisolver 1.4.8
legacy_api_wrap NA
llvmlite 0.44.0
loompy 3.0.8
lz4 4.4.4
markupsafe 3.0.2
matplotlib 3.10.1
mpl_toolkits NA
natsort 8.4.0
numba 0.61.2
numexpr 2.10.2
numpy 2.2.5
numpy_groupies 0.11.2
packaging 24.2
pandas 2.2.3
parso 0.8.4
platformdirs 4.3.7
prompt_toolkit 3.0.43
psutil 5.9.0
pure_eval 0.2.2
pyarrow 20.0.0
pydev_ipython NA
pydevconsole NA
pydevd 3.2.3
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.19.1
pyparsing 3.2.3
pytz 2025.2
scipy 1.15.2
session_info v1.0.1
six 1.17.0
sklearn 1.6.1
stack_data 0.2.0
tblib 3.1.0
threadpoolctl 3.6.0
tlz 1.0.1
toolz 1.0.0
tornado 6.4.2
traitlets 5.14.3
typing_extensions NA
wcwidth 0.2.5
yaml 6.0.2
zipp NA
zmq 26.2.0
zoneinfo NA
-----
IPython 8.30.0
jupyter_client 8.6.3
jupyter_core 5.7.2
-----
Python 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
Linux-5.14.0-427.37.1.el9_4.x86_64-x86_64-with-glibc2.34
-----
Session information updated at 2025-04-28 11:30
Load data and output to loom file¶
Read murine neural crest data.
adata = rgv.datasets.murine_nc(data_type = "normalized")
adata
AnnData object with n_obs × n_vars = 6788 × 30717
obs: 'nCount_RNA', 'nFeature_RNA', 'cell_id', 'UMI_count', 'gene_count', 'major_trajectory', 'celltype_update', 'UMAP_1', 'UMAP_2', 'UMAP_3', 'UMAP_2d_1', 'UMAP_2d_2', 'terminal_state', 'nCount_intron', 'nFeature_intron'
var: 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
obsm: 'X_pca', 'X_umap'
adata = sc.AnnData(adata.X, obs=adata.obs, var=adata.var)
adata.var["Gene"] = adata.var_names
adata.obs["CellID"] = adata.obs_names
adata.write_loom("adata.loom")
SCENIC steps¶
In the following, we use SCENIC to infer prior regulation information. Installation and usage steps are given in pySCENIC and are demonstrated in SCENICprotocol.
# path to loom file created previously
f_loom_path_scenic = "adata.loom"
# path to list of transcription factors
f_tfs = "allTFs_mm.txt"
!pyscenic grn {f_loom_path_scenic} {f_tfs} -o "adj.csv" --num_workers 24
/bin/bash: line 1: pyscenic: command not found
# path to ranking databases in feather format
f_db_glob = "scenic/cisTarget_databases/*feather"
f_db_names = ' '.join( glob.glob(f_db_glob) )
# path to motif databases
f_motif_path = "scenic/cisTarget_databases/motifs-v9-nr.mgi-m0.001-o0.0.tbl"
!pyscenic ctx "adj.csv" \
{f_db_names} \
--annotations_fname {f_motif_path} \
--expression_mtx_fname {f_loom_path_scenic} \
--output "reg.csv" \
--all_modules \
--num_workers 24
f_pyscenic_output = "pyscenic_output_all_regulon_no_mask.loom"
!pyscenic aucell \
{f_loom_path_scenic} \
"reg.csv" \
--output {f_pyscenic_output} \
--num_workers 4
# collect SCENIC AUCell output
lf = lp.connect(f_pyscenic_output, mode='r+', validate=False )
auc_mtx = pd.DataFrame(lf.ca.RegulonsAUC, index=lf.ca.CellID)
regulons = lf.ra.Regulons
res = pd.concat([pd.Series(r.tolist(), index=regulons.dtype.names) for r in regulons], axis=1)
res.columns = lf.row_attrs["SYMBOL"]
res.to_csv("regulon_mat_all_regulons.csv")
Create prior GRN for RegVelo¶
In the following, we preprocess the GRN inferred from SCENIC, saved as regulon_mat_all_regulons.csv. We first read the regulon file, where rows are regulators and columns are target genes. We further extract the names of the transcription factors (TFs) from the row indices using a regex and collapse dublicte TFs by summing their rows.
# load saved regulon-target matrix
reg = pd.read_csv("regulon_mat_all_regulons.csv", index_col = 0)
reg.index = reg.index.str.extract(r"(\w+)")[0]
reg = reg.groupby(reg.index).sum()
We further binarize the matrix, where 1 indicates the presence of regulation and 0 indicates otherwise and get the list of TFs and genes.
reg[reg != 0] = 1
TF = np.unique(list(map(lambda x: x.split("(")[0], reg.index.tolist())))
genes = np.unique(TF.tolist() + reg.columns.tolist())
For the prior GRN, we first construct an empty square matrix and populate it based on the previously binarized regulation information. We further remove the genes that are neither a TF nor a target gene (i.e. remove empty rows and comlumns) and save the cleaned and structured GRN to a .parquet file for RegVelo’s downstream pipeline.
GRN = pd.DataFrame(0, index=genes, columns=genes)
GRN.loc[TF,reg.columns.tolist()] = np.array(reg)
mask = (GRN.sum(0) != 0) | (GRN.sum(1) != 0)
GRN = GRN.loc[mask, mask].copy()
GRN.to_parquet("regulon_mat_processed_all_regulons.parquet")
print("Done! processed GRN with " + str(reg.shape[0]) + " TFs and " + str(reg.shape[1]) + " targets")
Done! processed GRN with 543 TF and 30717 targets