This repository contains the code for running the experiments in the manuscript: Xiong, Junhao, et al. “Graph Independence Testing.” arXiv preprint arXiv:1906.03661 (2019).
The manuscript is currently under major revision, and so is the code, so you may not find the exact code needed to reproduce the figures in the manuscript. For more up-to-date results, you may consult the slides here.
The core functionality is in core.py, which contains functions and the necessary utilities to compute the test statistic, p-value and power of naive Pearson, gcorr (graph correlation) and gcorrDC (a DC-SBM version of gcorr). Note that gcorr is slightly modified from the test statistic in the manuscript, so that it is an unbiased estimate of the actual correlation (rather than differing from it by a constant under the SBM).
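As a rough illustration of the simplest of these statistics, the naive Pearson correlation between two undirected graphs on the same vertex set can be sketched as below. This is a minimal sketch and not the code in core.py: the function names are hypothetical, and gcorr and gcorrDC refine this idea in ways that are not reproduced here.

```python
import numpy as np


def upper_triangle(adj):
    """Return the strictly upper-triangular entries of a square adjacency matrix."""
    rows, cols = np.triu_indices(adj.shape[0], k=1)
    return adj[rows, cols]


def naive_pearson(adj_a, adj_b):
    """Pearson correlation between the edge indicators of two undirected graphs.

    Hypothetical helper for illustration; assumes both graphs share the same
    vertex ordering and that neither graph is empty or complete.
    """
    x = upper_triangle(np.asarray(adj_a, dtype=float))
    y = upper_triangle(np.asarray(adj_b, dtype=float))
    return float(np.corrcoef(x, y)[0, 1])
```

Only the upper triangle is used so that each undirected edge is counted once and the diagonal (self-loops) is ignored.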
simulations.py contains functions to simulate graphs (they are similar to the graspologic implementation but are more general).
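To give a sense of what such a simulation involves, here is a minimal sketch that samples a pair of correlated Erdős-Rényi graphs. It is an assumption about the kind of model being simulated, not the code in simulations.py; the function name and parameters are hypothetical.

```python
import numpy as np


def sample_correlated_er(n, p, rho, seed=None):
    """Sample two undirected ER(n, p) graphs whose matching edges have correlation rho.

    Hypothetical helper for illustration; only 0 <= rho <= 1 is supported here.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("p and rho must both lie in [0, 1]")
    rng = np.random.default_rng(seed)
    rows, cols = np.triu_indices(n, k=1)

    # Edges of the first graph are i.i.d. Bernoulli(p).
    edges_a = rng.random(rows.size) < p
    # Conditional probabilities chosen so that B is marginally Bernoulli(p)
    # and corr(A_ij, B_ij) = rho:
    #   P(B=1 | A=1) = p + rho * (1 - p),  P(B=1 | A=0) = p * (1 - rho)
    prob_b = np.where(edges_a, p + rho * (1 - p), p * (1 - rho))
    edges_b = rng.random(rows.size) < prob_b

    adj_a = np.zeros((n, n), dtype=int)
    adj_b = np.zeros((n, n), dtype=int)
    adj_a[rows, cols] = edges_a
    adj_b[rows, cols] = edges_b
    # Symmetrize so both graphs are undirected with an empty diagonal.
    return adj_a + adj_a.T, adj_b + adj_b.T
```

Setting rho to 0 gives two independent graphs, and setting it to 1 gives two identical graphs.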
The following files correspond (roughly) to figures in the manuscript. Results can be viewed here.
- experiments/sim_teststats.py and plotting/plot_sim_test_statistic.py are used to generate Figure 1.
- experiments/sim_power.py and plotting/plot_sim_power.py are used to generate Figures 3 and 4.
This directory currently contains the code to run experiments on the following datasets:
- mouse: a dataset containing connectomes of 4 different species of mice. See some results here.
- timeseries: a dataset containing the connectome of a single subject sequenced at many points in time.
- cpac200: a dataset with connectomes from different subjects.
- enron: a dataset where each graph represents email correspondence between subjects in a network.
To run experiments on the associated dataset, the standard workflow is as follows:
- Preprocess the raw dataset into a numpy.array with the following format: [# graphs, # vertices, # vertices]. You may need to write some code for this, but it should be straightforward using the functions available in data_utils/.
- (optional) Apply a transformation to the graphs using experiments/real_transfrom_data.py
- (optional) Estimate community assignments of the graphs using experiments/real_community_estimation.py, if the test statistic and p-value methods you are using require community assignments to be given.
- Run experiments/real_teststats_pval.py with the appropriate command-line arguments
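The preprocessing step can be sketched as follows, assuming each raw graph is available as an undirected, unweighted edge list over a shared set of integer vertex labels. The helper names are hypothetical and do not come from data_utils/; they only show the target array layout.

```python
import numpy as np


def edge_list_to_adjacency(edges, n_vertices):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = np.zeros((n_vertices, n_vertices), dtype=float)
    for u, v in edges:
        adj[u, v] = 1.0
        adj[v, u] = 1.0
    return adj


def stack_graphs(edge_lists, n_vertices):
    """Stack one adjacency matrix per graph into [# graphs, # vertices, # vertices]."""
    return np.stack([edge_list_to_adjacency(e, n_vertices) for e in edge_lists])


# Two toy graphs on three vertices (made-up data, for illustration only).
graphs = stack_graphs([[(0, 1), (1, 2)], [(0, 2)]], n_vertices=3)
```

The resulting array can then be written to disk with numpy.save, although the file layout that the experiment scripts expect is not specified here.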
Currently, simulation results look good, but the main problem is that we seem to have a large type I error inflation in the real data (the test almost always rejects the null, so we have very low p-values across the board, even when we don’t think there should be actual dependence). One proposed fix is to use a DC-SBM based test, which seems to work in simulation when the generating models are DC-SBMs, but in real data, it still doesn’t seem to decrease the test statistic or result in more reasonable p-values.
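One way to make "type I error inflation" concrete is to measure the rejection rate in a setting where the null is known to hold. The sketch below does this for independent Erdős-Rényi graphs, using the naive Pearson statistic and a vertex-permutation p-value. Both choices are assumptions made for illustration; the p-value methods in core.py may differ, and all names and default values are hypothetical.

```python
import numpy as np


def edge_correlation(adj_a, adj_b):
    """Pearson correlation between the upper-triangular entries of two graphs."""
    rows, cols = np.triu_indices(adj_a.shape[0], k=1)
    return float(np.corrcoef(adj_a[rows, cols], adj_b[rows, cols])[0, 1])


def permutation_pvalue(adj_a, adj_b, n_perm, rng):
    """Two-sided p-value obtained by relabelling the vertices of the second graph."""
    observed = abs(edge_correlation(adj_a, adj_b))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(adj_b.shape[0])
        permuted = adj_b[np.ix_(perm, perm)]
        if abs(edge_correlation(adj_a, permuted)) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, so the p-value is never 0.
    return (1 + exceed) / (1 + n_perm)


def sample_er(n, p, rng):
    """Sample an undirected ER(n, p) graph with an empty diagonal."""
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T


def empirical_type1_error(n_reps=40, n=30, p=0.3, alpha=0.05, n_perm=99, seed=0):
    """Fraction of independent graph pairs for which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        adj_a, adj_b = sample_er(n, p, rng), sample_er(n, p, rng)
        if permutation_pvalue(adj_a, adj_b, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_reps


rate = empirical_type1_error()
```

For a well-calibrated test the returned rate should sit near alpha. A rate far above alpha on data believed to be independent is the kind of inflation described above.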
Also, the test statistics seem to reflect meaningful differences in some datasets (e.g. mouse), but not in others (e.g. timeseries, cpac200). It is unclear whether this is because the signal is simply not in those datasets, because the test is not powerful enough to detect the signal, or because of some preprocessing choices (e.g. choosing the appropriate trimming values for DC-SBM).
Some attempts to address the aforementioned problems can be seen here.
To run the code in this repository, first install Python 3.6. You can use pyenv to manage the Python versions on your machine.
Next, set up the local environment in the ./venv directory:
python -m venv ./venv
To activate the environment, type:
. venv/bin/activate
Then, install the requirements in the local environment:
pip install --upgrade pip
pip install -r requirements.txt