
Is there a TSV format? #30

Open
LucaCappelletti94 opened this issue Apr 16, 2022 · 7 comments

@LucaCappelletti94

Hello and thank you for making this resource!

Is there a CSV or TSV version of the node list and edge list of this knowledge graph? How does one get it in any format?

Thank you!

@seanliu96
Collaborator

Hi, thanks for your interest. Currently, we only provide DB files in SQLite format. If you want to convert them into CSV or TSV files, please try this file: aser_table2tsv.py. However, this file is not compatible with ASER 2.0, so you need to change the column indices to match EVENTUALITY_COLUMNS and RELATION_COLUMNS.
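
For reference, the general shape of such a table dump is simple (a minimal sketch, not the actual aser_table2tsv.py; the column lists below are illustrative stand-ins and should be replaced with the real EVENTUALITY_COLUMNS and RELATION_COLUMNS from the ASER 2.0 codebase):

import csv
import sqlite3

# Illustrative stand-ins: replace with the actual EVENTUALITY_COLUMNS
# and RELATION_COLUMNS constants from the ASER 2.0 codebase.
EVENTUALITY_COLUMNS = ["_id", "frequency", "words"]
RELATION_COLUMNS = ["_id", "event1_id", "event2_id"]

def dump_table_to_tsv(db_path, table, columns, tsv_path):
    """Dump the selected columns of a SQLite table into a TSV file."""
    connection = sqlite3.connect(db_path)
    with open(tsv_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(columns)
        for row in connection.execute(
            "SELECT {} FROM {};".format(", ".join(columns), table)
        ):
            writer.writerow(row)
    connection.close()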

@LucaCappelletti94
Author · commented Apr 18, 2022

How much space do the SQLite files occupy, and where can I download them? I'd be happy to try the procedure and, if needed, help update the script.

I'm interested in this KG because I am the principal author of a scalable graph machine learning library with a Rust backend, and we are looking for large real-world graphs (billions of nodes and edges).

The biggest one we use is ClueWeb09 (followed by WikiData), but it is of little interest beyond simple benchmarks.

@seanliu96
Collaborator

You can download the files from our webpage. The merged core KG is around 30 GB on disk, but if you are only interested in the topology and graph structure, the converted CSV/TSV should occupy around 1-2 GB. However, this KG is not a dense graph, and the nodes are in fact text, so I personally don't think it is well suited for graph benchmarking.

@LucaCappelletti94
Author

Thank you for your answer!

Indeed, we already have enough large graphs for plain benchmarking, and we'd like something more real-world, such as this one. It is great that there is textual information for the nodes, as we use BERT models to create node features for models such as GCNs and the like.

I'll download the graph and start processing it into something we can use; if you are interested, I'll keep you posted on what we obtain.

Since you and your team are of course the experts on this graph, if we get something interesting out of it I'd be happy to discuss it and potentially collaborate on experimental designs.

@LucaCappelletti94
Author

I wrote a small Python script to convert the current database version into three TSVs: a node list, an edge type list, and an edge list. Since it worked nicely, I thought it could be helpful to share it with other interested parties; plus, you can tell me if there are any glaring errors in it.

In plain TSV format (without further compression), the resulting files weigh 583 MB, 4 KB, and 354 MB respectively.

The script first downloads the data from the URLs below and then processes it into the files mentioned above.

First of all, the download part (if you need to install the downloaders package, just run pip install downloaders):

from downloaders import BaseDownloader

# Do not auto-extract: the tar parts must be concatenated before extraction.
downloader = BaseDownloader(auto_extract=False)

downloader.download([
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.0",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.1",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.2",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.3",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.4",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.5",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.6",
    "https://data.dgl.ai/dataset/ASER/core/KG.tar.7",
])

Then, reassemble and extract the database file with cat downloads/KG.tar.* | tar xvf -.
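
If you prefer to stay in Python, the same concatenation and extraction can be done with the standard library (a sketch, assuming the parts were saved under downloads/ as above):

import glob
import shutil
import tarfile

# Reassemble the split archive in part order, then extract it in one pass.
with open("downloads/KG.tar", "wb") as merged:
    for part in sorted(glob.glob("downloads/KG.tar.[0-9]")):
        with open(part, "rb") as chunk:
            shutil.copyfileobj(chunk, merged)

with tarfile.open("downloads/KG.tar") as archive:
    archive.extractall()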

Finally, execute the actual processing step:

import sqlite3
import csv
import os
from tqdm.auto import tqdm
import pandas as pd

def convert_sqlite_to_tsv(
    node_list_tsv_path: str,
    edge_type_list_tsv_path: str,
    edge_list_tsv_path: str,
    db_path: str,
    threshold: float = 0.5
):
    """Writes out the provided database to the provided TSV files.
    
    Parameters
    --------------------------
    node_list_tsv_path: str
        Path where to write out the node list
    edge_type_list_tsv_path: str
        Path where to write out the edge type list
    edge_list_tsv_path: str
        Path where to write out the edge list
    db_path: str
        Path from where to load the database
    threshold: float = 0.5
        Minimum weight for an edge to be written out
    """
    # We create the directories of the provided paths if needed.
    for path in (
        node_list_tsv_path,
        edge_type_list_tsv_path,
        edge_list_tsv_path
    ):
        directory = os.path.dirname(path)
        # os.makedirs fails on the empty string, which os.path.dirname
        # returns for paths without a directory component.
        if directory:
            os.makedirs(directory, exist_ok=True)
    
    # We create a dictionary for the nodes so that we directly write
    # a numeric ID and not a long hash, which would weigh
    # significantly more than just the numeric ID.
    nodes_dictionary = {}
    
    # We will do a similar thing for the edge types below
    # once we load them into the header.
    
    with open(edge_list_tsv_path, 'w', newline='') as f:
        connection = sqlite3.connect(db_path)
        header = list(pd.read_sql_query(
            'SELECT * FROM RELATIONS LIMIT 1;',
            connection
        ).columns)
        
        # We create the dictionary of the edge types
        # which we will use to replace the string edge types
        # into a more succinct numeric edge type.
        edge_types_dictionary = {
            edge_type: edge_type_id
            for edge_type_id, edge_type in enumerate(header[3:-1])
        }
        
        # We exclude the hash ID of the edge and the last
        # column, which is the co-occurrence weight.
        header = header[1:-1]
        rows_number = next(connection.execute('SELECT COUNT(*) FROM RELATIONS;'))[0]
        query = connection.execute(
            'SELECT {} FROM RELATIONS;'.format(", ".join(header))
        )
        
        # We rename the first two columns into names that
        # are more intuitive in their meaning.
        header[0] = "subject"
        header[1] = "object"
        
        # We create the TSV writer
        tsv_writer = csv.writer(f, delimiter='\t')
        
        # And write the header of the TSV file.
        # Specifically, the 4 columns are: subject, category, object, weight.
        tsv_writer.writerow([header[0], "category", header[1], "weight"])
        
        # We start to iterate on the database
        for row in tqdm(
            (dict(zip(header, row)) for row in query),
            desc="Writing out ASER edge list",
            total=rows_number,
            leave=False,
            dynamic_ncols=True
        ):
            for edge_type, edge_type_id in edge_types_dictionary.items():
                # We retrieve the edge weight of the current edge type
                edge_weight = row[edge_type]
                # If the weight is higher than the provided threshold
                if edge_weight > threshold:
                    # We retrieve the numeric node IDs corresponding to the
                    # subject and object of this line
                    subject_node_id = nodes_dictionary.setdefault(row["subject"], len(nodes_dictionary))
                    object_node_id = nodes_dictionary.setdefault(row["object"], len(nodes_dictionary))
                    # Then we finally write out the line.
                    tsv_writer.writerow([
                        subject_node_id,
                        edge_type_id,
                        object_node_id,
                        edge_weight,
                    ])
        connection.close()
        
    # Now we write out the node list
    with open(node_list_tsv_path, 'w', newline='') as f:
        # We create the TSV writer for the node list
        tsv_writer = csv.writer(f, delimiter='\t')
        # And write the header of the node list TSV file.
        # Specifically, the 2 columns are: node_name, node_id.
        tsv_writer.writerow(["node_name", "node_id"])
        
        for node_name, node_id in tqdm(
            nodes_dictionary.items(),
            desc="Writing out ASER node list",
            total=len(nodes_dictionary),
            leave=False,
            dynamic_ncols=True
        ):
            tsv_writer.writerow([node_name, node_id])
            
    # Now we write out the edge type list
    with open(edge_type_list_tsv_path, 'w', newline='') as f:
        # We create the TSV writer for the edge type list
        tsv_writer = csv.writer(f, delimiter='\t')
        # And write the header of the edge type list TSV file.
        # Specifically, the 2 columns are: edge_type_name, edge_type_id.
        tsv_writer.writerow(["edge_type_name", "edge_type_id"])
        
        for edge_type_name, edge_type_id in tqdm(
            edge_types_dictionary.items(),
            desc="Writing out ASER edge type list",
            total=len(edge_types_dictionary),
            leave=False,
            dynamic_ncols=True
        ):
            tsv_writer.writerow([edge_type_name, edge_type_id])

You can run the function above like this; on a normal machine the processing should complete in under 10 minutes:

convert_sqlite_to_tsv(
    node_list_tsv_path="ASER/node_list.tsv",
    edge_type_list_tsv_path="ASER/edge_type_list.tsv",
    edge_list_tsv_path="ASER/edge_list.tsv",
    db_path="downloads/KG.db"
)
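
To sanity-check the result, you can peek at the first rows of each file (a quick check that reuses the pandas dependency the script already imports):

import pandas as pd

# Print the first three rows of each generated TSV.
for path in (
    "ASER/node_list.tsv",
    "ASER/edge_type_list.tsv",
    "ASER/edge_list.tsv",
):
    print(pd.read_csv(path, sep="\t", nrows=3))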

The first three lines of each resulting file look like this (columns are tab-separated):

node_list.tsv:

node_name    node_id
fad25b5938486f5163f8bdcfdc21d31dfabe9813    0
a50814e8a84dcc45f1839c33c09091d0869d5a14    1
f2544ac73c44d35981ce15a5aa241960edbbceca    2

edge_type_list.tsv:

edge_type_name    edge_type_id
Precedence    0
Succession    1
Synchronous    2

edge_list.tsv:

subject    category    object    weight
0    8    1    2
2    8    3    2
4    8    5    1

Some further questions: I was under the impression that there was textual data for the nodes. Where can I retrieve it? Is it in some other table in the database? Are there any other features for the edges, nodes, or edge types? For instance, do the nodes have categories? If so, where can I find them?

Again, thank you!

@LucaCappelletti94
Author

I've already seen that there is an Eventualities table in the database with additional metadata; I will incorporate it into the script and provide an updated version.
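
A quick way to inspect what that table actually contains, since I have not verified its schema yet (the PRAGMA below just lists the columns):

import sqlite3

connection = sqlite3.connect("downloads/KG.db")

# List the columns (cid, name, type, ...) of the Eventualities table.
for column in connection.execute("PRAGMA table_info(Eventualities);"):
    print(column)

connection.close()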

@leopoldwhite

> I've already seen that there is an Eventualities table in the database with additional metadata; I will incorporate it into the script and provide an updated version.

Hi, @LucaCappelletti94! I'm also looking for the TSV file (with the textual data for nodes and edges). Could you please provide the processed TSV file? I would be extremely grateful.
