An Open Source Project from the Data to AI Lab, at MIT

Metrics for Synthetic Data Generation Projects

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV
Repository: https://github.com/sdv-dev/SDMetrics
License: MIT
Development Status: Pre-Alpha

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

Single Columns: Compare one-dimensional numpy arrays representing individual columns.
Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
Single Table: Compare an entire table, represented as a pandas.DataFrame.
Multi Table: Compare multi-table and relational datasets represented as a Python dict with multiple tables passed as pandas.DataFrames.
Time Series: Compare tables representing ordered sequences of events.

It includes a variety of metrics such as:

Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
Detection metrics which use machine learning to try to distinguish between real and synthetic data.
Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
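
To make the idea of a statistical metric concrete, the following is a minimal pure-Python sketch of an inverted two-sample Kolmogorov-Smirnov D statistic, the quantity that the KSTest metric reports. The function name inverted_ks_statistic is hypothetical and this is not the SDMetrics implementation; it only illustrates how a score between 0 and 1 can be obtained, with 1 meaning that the two empirical distributions coincide.

```python
from bisect import bisect_right


def inverted_ks_statistic(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic.

    Hypothetical helper for illustration only, not part of SDMetrics.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")

    # D is the largest absolute gap between the two empirical CDFs.
    # The gap can only change at observed values, so checking those is enough.
    max_gap = 0.0
    for value in set(real_sorted) | set(synth_sorted):
        real_cdf = bisect_right(real_sorted, value) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))

    # Inverting the statistic turns "smaller is better" into "larger is better".
    return 1.0 - max_gap


identical = inverted_ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = inverted_ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = inverted_ks_statistic([1, 2, 3], [10, 11, 12])
print(identical, shifted, disjoint)
```

Inverting the statistic gives a score whose goal is to be maximized, which matches the 0 to 1 range and MAXIMIZE goal that the metric reports.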

Install

SDMetrics is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, SDMetrics can also be installed as a standalone library using the following commands:

Using pip:

pip install sdmetrics

Using conda:

conda install -c sdv-dev -c conda-forge sdmetrics

For more installation options, please visit the SDMetrics Installation Guide.
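
After installing, one way to confirm that the package is visible to the current Python environment is to query its installed version through the standard library. This is a small sketch, not an official SDMetrics check; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if sdmetrics is installed, otherwise None.
print(installed_version("sdmetrics"))
```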

Usage

SDMetrics is included as part of the framework offered by SDV to evaluate the quality of your synthetic dataset. For more details about how to use it please visit the corresponding User Guide:

Evaluating Synthetic Data

Standalone usage

SDMetrics can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its synthetic replica by running all the compatible multi-table metrics on it:

import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi-table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)

The output will be a table with all the details about the executed metrics and their scores:

metric	name	score	min_value	max_value	goal
CSTest	Chi-Squared	0.76651	0	1	MAXIMIZE
KSTest	Inverted Kolmogorov-Smirnov D statistic	0.75	0	1	MAXIMIZE
KSTestExtended	Inverted Kolmogorov-Smirnov D statistic	0.777778	0	1	MAXIMIZE
LogisticDetection	LogisticRegression Detection	0.882716	0	1	MAXIMIZE
SVCDetection	SVC Detection	0.833333	0	1	MAXIMIZE
BNLikelihood	BayesianNetwork Likelihood	nan	0	1	MAXIMIZE
BNLogLikelihood	BayesianNetwork Log Likelihood	nan	-inf	0	MAXIMIZE
LogisticParentChildDetection	LogisticRegression Detection	0.619444	0	1	MAXIMIZE
SVCParentChildDetection	SVC Detection	0.916667	0	1	MAXIMIZE
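
As a rough sketch of how such a report can be read, the snippet below copies the metric, score and goal columns from the table above into plain Python tuples, sets aside the metrics whose score is nan, and ranks the rest from weakest to strongest. The helper split_report is hypothetical, and the ranking assumes every goal is MAXIMIZE, as it is in this report.

```python
import math

# Rows copied from the report above: (metric, score, goal).
report = [
    ("CSTest", 0.76651, "MAXIMIZE"),
    ("KSTest", 0.75, "MAXIMIZE"),
    ("KSTestExtended", 0.777778, "MAXIMIZE"),
    ("LogisticDetection", 0.882716, "MAXIMIZE"),
    ("SVCDetection", 0.833333, "MAXIMIZE"),
    ("BNLikelihood", float("nan"), "MAXIMIZE"),
    ("BNLogLikelihood", float("nan"), "MAXIMIZE"),
    ("LogisticParentChildDetection", 0.619444, "MAXIMIZE"),
    ("SVCParentChildDetection", 0.916667, "MAXIMIZE"),
]


def split_report(rows):
    """Separate scored rows (sorted ascending by score) from nan rows.

    Hypothetical helper; assumes every row has the goal MAXIMIZE, so the
    lowest score is the weakest result.
    """
    scored = sorted(
        (row for row in rows if not math.isnan(row[1])),
        key=lambda row: row[1],
    )
    missing = [row[0] for row in rows if math.isnan(row[1])]
    return scored, missing


scored, missing = split_report(report)
weakest = scored[0][0]
strongest = scored[-1][0]
print("weakest:", weakest)
print("strongest:", strongest)
print("no score:", missing)
```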

What's next?

If you want to read more about each individual metric, please visit the following folders:

Single Column Metrics: sdmetrics/single_column
Single Table Metrics: sdmetrics/single_table
Multi Table Metrics: sdmetrics/multi_table
Time Series Metrics: sdmetrics/timeseries

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project.

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV
