store airlock info at upload time
vemonet committed Apr 10, 2024
1 parent 6ee3a81 commit 5ffe7bd
Showing 11 changed files with 92 additions and 50 deletions.
91 changes: 52 additions & 39 deletions README.md
@@ -1,11 +1,11 @@
-# 🫀 iCare4CVD Cohort Explorer
+# 🫀 iCARE4CVD Cohort Explorer

-Webapp built for the [iCare4CVD project](https://icare4cvd.eu).
+Webapp built for the [iCARE4CVD project](https://icare4cvd.eu).

It aims to enable data owners and data scientists to:

* 🔐 Login with their [Decentriq](https://www.decentriq.com/) account (OAuth-based authentication, which can easily be switched to other providers). Only accounts with the required permissions will be able to access the webapp.
-* ✉️ Contact [Decentriq](https://www.decentriq.com/) to request an account if you are part of the iCare4CVD project
+* ✉️ Contact [Decentriq](https://www.decentriq.com/) to request an account if you are part of the iCARE4CVD project
* 📤 Data owners upload CSV cohort metadata files describing the variables of a study cohort
* 🔎 Data scientists explore available cohorts and their variables through a web app:
* Full text search across all cohorts and variables
@@ -14,7 +14,7 @@ It aims to enable data owners and data scientists to:
* 🔗 Data owners can map each variable of their cohorts to standard concepts, sourced from [OHDSI Athena](https://athena.ohdsi.org/search-terms/terms?query=) API (SNOMEDCT, LOINC...) through the web app.
* Mapping variables will help with data processing and exploration (⚠️ work in progress)
* We use namespaces from the [Bioregistry](https://bioregistry.io) to convert concepts CURIEs to URIs.
-* 🛒 Data scientists can add the cohorts they need to perform their analysis to a Data Clean Room (DCR)
+* 🛒 Data scientists can add the cohorts they need to perform their analysis to a [Data Clean Room](https://www.decentriq.com/) (DCR) on the Decentriq platform.
* Once complete, the data scientists can publish their DCR to Decentriq in one click.
* The DCR will be automatically created with a data schema corresponding to the selected cohorts, generated from the metadata provided by the data owners.
* The data scientist can then access their DCR in Decentriq, write the code for their analysis, and request computation of this code on the provisioned cohorts.
@@ -113,7 +113,7 @@ pnpm dev
### 🧹 Code formatting and linting
-Automatically format Python code with ruff and black, and TypeScript code with prettier.
+Automatically format Python code with ruff and black, and TypeScript code with prettier:
```bash
./scripts/fmt.sh
```
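The script itself is not shown in this diff; a minimal sketch of what `scripts/fmt.sh` might run, assuming default tool configurations and the `backend`/`frontend` layout of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/fmt.sh (the actual script is not in this diff)
set -e

# Lint and format the Python backend with ruff and black
ruff check backend/ --fix
black backend/

# Format the TypeScript frontend with prettier
cd frontend && pnpm prettier --write src/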
@@ -125,47 +125,33 @@ Deploy on a server in production with docker compose.
Put the Excel spreadsheet with all cohorts metadata in `data/iCARE4CVD_Cohorts.xlsx`. Uploaded cohorts will go to separate folders in `data/cohorts/`

-Generate a secret key used to encode/decode JWT token for a secure authentication system:
-
-```bash
-python -c "import secrets ; print(secrets.token_urlsafe(32))"
-```
-
-Create a `.env` file with secret configuration:
-
-```bash
-AUTH_ENDPOINT=https://auth0.com
-CLIENT_ID=AAA
-CLIENT_SECRET=BBB
-DECENTRIQ_EMAIL=ccc@ddd.com
-DECENTRIQ_TOKEN=EEE
-JWT_SECRET=vCitcsPBwH4BMCwEqlO1aHJSIn--usrcyxPPRbeYdHM
-ADMINS=admin1@email.com,admin2@email.com
-```
-
-Deploy:
-
-```bash
-docker compose -f docker-compose.prod.yml up -d
-```
-
-## 🪄 Administration
-
-### ✨ Automatically generate variables metadata
-
-You can use the [`csvw-ontomap`](https://github.com/vemonet/csvw-ontomap) python package to automatically generate a CSV metadata file for your data file, with the format expected by iCARE4CVD. It will automatically fill the following columns: var name, var type, categorical, min, max
-
-Install the package:
-
-```bash
-pip install git+https://github.com/vemonet/csvw-ontomap.git
-```
-
-Run profiling, supports `.csv`, `.xlsx`, `.sav`:
-
-```bash
-csvw-ontomap data/COHORT_data.sav -o data/COHORT_datadictionary.csv
-```
+1. Generate a secret key used to encode/decode JWT token for a secure authentication system:
+
+   ```bash
+   python -c "import secrets ; print(secrets.token_urlsafe(32))"
+   ```
+
+2. Create a `.env` file with secret configuration:
+
+   ```bash
+   AUTH_ENDPOINT=https://auth0.com
+   CLIENT_ID=AAA
+   CLIENT_SECRET=BBB
+   DECENTRIQ_EMAIL=ccc@ddd.com
+   DECENTRIQ_TOKEN=EEE
+   JWT_SECRET=vCitcsPBwH4BMCwEqlO1aHJSIn--usrcyxPPRbeYdHM
+   ADMINS=admin1@email.com,admin2@email.com
+   ```
+
+3. Deploy the stack for production:
+
+   ```bash
+   docker compose -f docker-compose.prod.yml up -d
+   ```
+
+   We currently use [nginx-proxy](https://github.com/nginx-proxy/nginx-proxy) for routing through environment variables in the `docker-compose.yml` file; you can switch to a proxy of your liking.
+
+## 🪄 Administration

### 🗑️ Reset database
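The compose file itself is not part of this diff. As a rough illustration of the nginx-proxy note above: nginx-proxy routes requests based on environment variables declared on the proxied containers, e.g. with `docker run` (image name and domain below are placeholders):

```bash
# nginx-proxy watches running containers and routes requests for the domain
# declared in VIRTUAL_HOST to them (domain and image are placeholders)
docker run -d -e VIRTUAL_HOST=explorer.example.org cohort-explorer-frontend
```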
@@ -175,6 +161,15 @@ Reset the database by deleting the `data/db` folder:
rm -rf data/db
```

At the next restart of the application, the database will be re-populated using the data dictionary CSV files stored on the server.

+> [!WARNING]
+>
+> Reset the database only if really necessary, as it will cause the loss of:
+>
+> - All concept mappings added from the Cohort Explorer
+> - The Decentriq airlock data preview setting for cohorts that have been uploaded (it will default to false when the database is recreated; admins can restore it by downloading and re-uploading the cohorts with the right airlock setting)
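A hedged example of triggering that re-population after deleting the folder, assuming the compose service is named `backend` in `docker-compose.prod.yml`:

```bash
# Restart the backend so it re-loads the data dictionary CSV files on startup
# (service name "backend" is an assumption, check docker-compose.prod.yml)
docker compose -f docker-compose.prod.yml restart backend
```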
### 💾 Backup database
It can be convenient to dump the content of the triplestore database to create a backup.
@@ -218,3 +213,21 @@ docker compose exec backend curl -X POST -T /data/triplestore_dump_20240225.nq -
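The backup commands are mostly collapsed in this view; only the start of the restore command is visible above. A hypothetical sketch of the dump step, assuming an Oxigraph-style triplestore reachable from the backend container (the service name `db`, port `7878`, and `/store` Graph Store endpoint are all assumptions, not confirmed by this diff):

```bash
# Hypothetical sketch: dump all quads to an N-Quads file for backup
# (triplestore host, port, and endpoint are assumptions)
docker compose exec backend curl -X GET -H 'Accept: application/n-quads' \
  http://db:7878/store -o /data/triplestore_dump_$(date +%Y%m%d).nq
```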
### 🚚 Move the app
If you need to move the app to a different server, just copy the whole `data/` folder.
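For example, with `rsync` (host and destination path are placeholders):

```bash
# Copy the whole data/ folder (cohorts, triplestore files) to the new server
rsync -avz data/ user@new-server:/srv/cohort-explorer/data/
```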
+### ✨ Automatically generate variables metadata
+
+Experimental: you can use the [`csvw-ontomap`](https://github.com/vemonet/csvw-ontomap) python package to automatically generate a CSV metadata file for your data file, with the format expected by iCARE4CVD. It will automatically fill the following columns: var name, var type, categorical, min, max. But it does not properly extract datetime data types.
+
+Install the package:
+
+```bash
+pip install git+https://github.com/vemonet/csvw-ontomap.git
+```
+
+Run profiling, supports `.csv`, `.xlsx`, `.sav`:
+
+```bash
+csvw-ontomap data/COHORT_data.sav -o data/COHORT_datadictionary.csv
+```
+
+###
2 changes: 1 addition & 1 deletion backend/pyproject.toml
@@ -6,7 +6,7 @@ build-backend = "hatchling.build"
requires-python = ">=3.8"
version = "0.0.1"
name = "cohort-explorer-backend"
description = "Backend for the iCare4CVD Cohort Explorer."
description = "Backend for the iCARE4CVD Cohort Explorer."
license = "MIT"
authors = [
{ name = "Vincent Emonet", email = "vincent.emonet@gmail.com" },
5 changes: 3 additions & 2 deletions backend/src/decentriq.py
@@ -116,7 +116,6 @@ def pandas_script_merge_cohorts(merged_cohorts: dict[str, list[str]], all_cohort
)
async def create_compute_dcr(
cohorts_request: dict[str, Any],
-airlock: bool = True,
user: Any = Depends(get_current_user),
) -> dict[str, Any]:
"""Create a Data Clean Room for computing with the cohorts requested using Decentriq SDK"""
@@ -168,7 +167,9 @@ async def create_compute_dcr(
builder.add_node_definition(TableDataNodeDefinition(name=data_node_id, columns=get_cohort_schema(cohort), is_required=True))
data_nodes.append(data_node_id)

-if airlock:
+# TODO: made airlock always True for testing
+# if cohort.airlock:
+if True:
# Add airlock node to make it easy to access small part of the dataset
preview_node_id = f"preview-{data_node_id}"
builder.add_node_definition(PreviewComputeNodeDefinition(
1 change: 1 addition & 0 deletions backend/src/models.py
@@ -50,6 +50,7 @@ class Cohort:
study_population: Optional[str] = None
study_objective: Optional[str] = None
variables: Dict[str, CohortVariable] = field(default_factory=dict)
+airlock: bool = False
can_edit: bool = False

def dict(self):
13 changes: 9 additions & 4 deletions backend/src/upload.py
@@ -185,7 +185,7 @@ def to_camelcase(s: str) -> str:
s = sub(r"(_|-)+", " ", s).title().replace(" ", "")
return "".join([s[0].lower(), s[1:]])

-def load_cohort_dict_file(dict_path: str, cohort_id: str, user_email: str) -> Dataset:
+def load_cohort_dict_file(dict_path: str, cohort_id: str, airlock: bool) -> Dataset:
"""Parse the cohort dictionary uploaded as excel or CSV spreadsheet, and load it to the triplestore"""
# print(f"Loading dictionary {dict_path}")
# df = pd.read_csv(dict_path) if dict_path.endswith(".csv") else pd.read_excel(dict_path)
@@ -220,6 +220,7 @@ def load_cohort_dict_file(dict_path: str, cohort_id: str, user_email: str) -> Dataset:
g = init_graph()
g.add((cohort_uri, RDF.type, ICARE.Cohort, cohort_uri))
g.add((cohort_uri, DC.identifier, Literal(cohort_id), cohort_uri))
+g.add((cohort_uri, ICARE.previewEnabled, Literal(str(airlock).lower()), cohort_uri))

# Record all errors and raise them at the end
errors = []
@@ -310,6 +311,7 @@ async def upload_cohort(
cohort_id: str = Form(...),
cohort_dictionary: UploadFile = File(...),
cohort_data: UploadFile | None = None,
+airlock: bool = True,
) -> dict[str, Any]:
"""Upload a cohort metadata file to the server and add its variables to the triplestore."""
user_email = user["email"]
@@ -326,6 +328,8 @@
detail=f"User {user_email} cannot edit cohort {cohort_id}",
)

+cohort_info.airlock = airlock

# Create directory named after cohort_id
cohorts_folder = os.path.join(settings.data_folder, "cohorts", cohort_id)
os.makedirs(cohorts_folder, exist_ok=True)
@@ -354,7 +358,7 @@
shutil.copyfileobj(cohort_dictionary.file, buffer)

try:
-g = load_cohort_dict_file(metadata_path, cohort_id, user_email)
+g = load_cohort_dict_file(metadata_path, cohort_id, airlock)
# Delete previous graph for this file from triplestore
delete_existing_triples(get_cohort_uri(cohort_id))
publish_graph_to_endpoint(g)
@@ -463,8 +467,9 @@ def init_triplestore() -> None:
folder_path = os.path.join(settings.data_folder, "cohorts", folder)
if os.path.isdir(folder_path):
for file in glob.glob(os.path.join(folder_path, "*_datadictionary.*")):
-# TODO: currently when we reset all existing cohorts default to the main admin
-g = load_cohort_dict_file(file, folder, settings.decentriq_email)
+# NOTE: default airlock preview to false if we ever need to reset cohorts,
+# admins can easily download and reupload the cohorts with the correct airlock value
+g = load_cohort_dict_file(file, folder, False)
g.serialize(f"{settings.data_folder}/cohort_explorer_triplestore.trig", format="trig")
if publish_graph_to_endpoint(g):
print(f"💾 Triplestore initialization: added {len(g)} triples for cohorts {file}.")
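With this change, the airlock flag is stored at upload time. A hedged example of calling the upload endpoint (the path, port, and auth header below are assumptions; since `airlock: bool = True` is declared without `Form(...)`, FastAPI exposes it as a query parameter):

```bash
# Hypothetical sketch: upload a cohort dictionary with airlock preview disabled
# (endpoint path and auth header are assumptions, not confirmed by this diff)
curl -X POST 'http://localhost:8000/upload-cohort?airlock=false' \
  -H 'Authorization: Bearer <JWT>' \
  -F 'cohort_id=MYCOHORT' \
  -F 'cohort_dictionary=@data/MYCOHORT_datadictionary.csv'
```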
2 changes: 2 additions & 0 deletions backend/src/utils.py
@@ -55,6 +55,7 @@ def run_query(query: str) -> dict[str, Any]:
OPTIONAL { ?cohort icare:studyOngoing ?study_ongoing . }
OPTIONAL { ?cohort icare:studyPopulation ?study_population . }
OPTIONAL { ?cohort icare:studyObjective ?study_objective . }
+OPTIONAL { ?cohort icare:previewEnabled ?airlock . }
}
OPTIONAL {
@@ -139,6 +140,7 @@ def retrieve_cohorts_metadata(user_email: str) -> dict[str, Cohort]:
study_population=get_value("study_population", row),
study_objective=get_value("study_objective", row),
variables={}, # You might want to populate this separately, depending on your data structure
+airlock=get_value("airlock", row),
can_edit=user_email in [*settings.admins_list, get_value("cohortEmail", row)],
)
elif get_value("cohortEmail", row) not in target_dict[cohort_id].cohort_email:
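A hedged way to verify what was stored for each cohort, assuming the triplestore exposes a SPARQL query endpoint (the URL and the `icare:` namespace IRI below are placeholders, not confirmed by this diff):

```bash
# Hypothetical sketch: list cohorts with their stored airlock preview flag
# (endpoint URL and namespace IRI are placeholders)
curl -G 'http://localhost:7878/query' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=PREFIX icare: <https://w3id.org/icare4cvd/>
    SELECT ?cohort ?airlock WHERE { GRAPH ?g { ?cohort icare:previewEnabled ?airlock } }'
```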
4 changes: 2 additions & 2 deletions frontend/src/components/Nav.tsx
@@ -130,7 +130,7 @@ export function Nav() {

<div className="navbar-center">
<Link className="text-xl font-thin" href="/">
-iCare4CVD Cohort Explorer
+iCARE4CVD Cohort Explorer
</Link>
</div>

@@ -213,7 +213,7 @@ export function Nav() {
<div className="card-body bg-success mt-5 rounded-lg text-slate-900">
<p>
✅ Data Clean Room{' '}
-<a href={publishedDCR['dcr_url']} className="link">
+<a href={publishedDCR['dcr_url']} className="link" target="_blank">
<b>{publishedDCR['dcr_title']}</b>
</a>{' '}
published in Decentriq.
2 changes: 1 addition & 1 deletion frontend/src/pages/_app.tsx
@@ -12,7 +12,7 @@ export default function App({Component, pageProps}: AppProps) {
<meta charSet="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="icon" href="/icare4cvd_heart_logo.png" />
<meta name="description" content="Explore cohorts for the iCare4CVD project" />
<meta name="description" content="Explore cohorts for the iCARE4CVD project" />
<title>Cohort Explorer</title>
</Head>
<CohortsProvider>
4 changes: 4 additions & 0 deletions frontend/src/pages/cohorts/[cohortId].tsx
@@ -1,5 +1,9 @@
'use client';

+// NOTE: this page is not really used in the current version of the app
+// All cohorts are accessed from the cohorts.tsx page
+// We keep it here as a placeholder in case we want to add pages for each cohort

import React, {useState} from 'react';
import {useRouter} from 'next/router';
import {useCohorts} from '@/components/CohortsContext';
16 changes: 16 additions & 0 deletions frontend/src/pages/index.tsx
@@ -48,6 +48,22 @@ export default function Home() {
Explore and search data dictionaries of available Cohorts
</p>
</Link>

+<Link
+  href="https://github.com/MaastrichtU-IDS/cohort-explorer"
+  className="group rounded-lg border border-transparent px-5 py-4 transition-colors hover:border-gray-300 hover:bg-gray-100 hover:dark:border-neutral-700 hover:dark:bg-neutral-800/30"
+  target="_blank"
+>
+  <h2 className={`mb-3 text-2xl font-semibold`}>
+    Technical details{' '}
+    <span className="inline-block transition-transform group-hover:translate-x-1 motion-reduce:transform-none">
+      -&gt;
+    </span>
+  </h2>
+  <p className={`m-0 max-w-[30ch] text-sm opacity-50`}>
+    View the documentation and source code on GitHub
+  </p>
+</Link>
</div>
</main>
);
2 changes: 1 addition & 1 deletion frontend/src/pages/upload.tsx
@@ -223,7 +223,7 @@ export default function UploadPage() {
<div className="card-body bg-success mt-8 rounded-lg text-slate-900">
<p>
✅ Data Clean Room{' '}
-<a href={publishedDCR['dcr_url']} className="link">
+<a href={publishedDCR['dcr_url']} className="link" target="_blank">
<b>{publishedDCR['dcr_title']}</b>
</a>{' '}
published.
