Implement a file tree serialiser/de-serialiser #768

Closed
tegefaulkes opened this issue Jul 11, 2024 · 7 comments · Fixed by #774

@tegefaulkes
Contributor

Specification

With the more complex Unix commands being implemented in MatrixAI/Polykey-CLI#32, we need a utility to take a file tree and serialise it over a binary stream. This needs to follow the generator/parser pattern we've used in the past.

A file tree will be structured something like this:

// Order will be in [...DirectoryNode, ...FileNode, ...ContentNode].
// While not strict, should simplify serialisation.
type FileTree = Array<TreeNode>;
type TreeNode = DirectoryNode | FileNode | ContentNode;
type FilePath = string;
type Inode = number;
type Cnode = number;

type DirectoryNode = {
  type: 'directory',
  path: FilePath,
  iNode: Inode,
  parent: Inode,
  children: Array<Inode>,
  //relevant stats...
}

type FileNode = {
  type: 'file',
  path: FilePath,
  iNode: Inode,
  parent: Inode,
  cNode: Cnode,
  //relevant stats...
}

// Keeping this separate from `FileNode` so we can optionally not include it.
type ContentNode = {
  type: 'content',
  cNode: Cnode,
  contents: string,
}
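
As a rough illustration of the structure above, here is a small hypothetical tree and a check for the `[...DirectoryNode, ...FileNode, ...ContentNode]` ordering. The example values, the `isOrdered` helper, the use of `0` as the root's parent, and the uniform `iNode` spelling are assumptions made for this sketch only.

```typescript
type FilePath = string;
type Inode = number;
type Cnode = number;

type DirectoryNode = {
  type: 'directory';
  path: FilePath;
  iNode: Inode;
  parent: Inode;
  children: Array<Inode>;
};
type FileNode = {
  type: 'file';
  path: FilePath;
  iNode: Inode;
  parent: Inode;
  cNode: Cnode;
};
type ContentNode = {
  type: 'content';
  cNode: Cnode;
  contents: string;
};
type TreeNode = DirectoryNode | FileNode | ContentNode;
type FileTree = Array<TreeNode>;

// Hypothetical tree: a root directory holding a single file `a.txt`.
// Using 0 as the root's parent is an assumption of this sketch.
const exampleTree: FileTree = [
  { type: 'directory', path: '.', iNode: 1, parent: 0, children: [2] },
  { type: 'file', path: './a.txt', iNode: 2, parent: 1, cNode: 1 },
  { type: 'content', cNode: 1, contents: 'hello' },
];

// Directories sort before files, which sort before contents.
const rank = { directory: 0, file: 1, content: 2 } as const;

// Returns true when the nodes never step back to an earlier kind.
function isOrdered(tree: FileTree): boolean {
  let previous: number = 0;
  for (const node of tree) {
    const current = rank[node.type];
    if (current < previous) return false;
    previous = current;
  }
  return true;
}
```

Since the ordering is "not strict", a check like this would be an optional sanity test rather than a hard requirement.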

The serialiser needs to take this structure and generate a binary web stream (`ReadableStream`) that sends all of this information over. We need to determine a structure for the raw binary stream that encodes the data efficiently. Most of the data can be basic JSON, but the file contents need to be raw binary to reduce the amount of data sent.

The de-serialiser will use the parsing method we've used before. It must throw errors if we run into badly formatted data.
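
To make the generator/parser pairing concrete, here is a minimal sketch under an assumed framing: each JSON message is preceded by a 4-byte big-endian length. The framing and the names `generateMessage` and `parseMessage` are hypothetical and are not the format or helpers used in the project.

```typescript
const encoder = new TextEncoder();
// `fatal: true` makes the decoder throw on invalid UTF-8 instead of
// silently substituting replacement characters.
const decoder = new TextDecoder('utf-8', { fatal: true });

// Generator side: 4-byte big-endian length, then the UTF-8 JSON body.
function generateMessage(message: object): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const out = new Uint8Array(4 + body.byteLength);
  new DataView(out.buffer).setUint32(0, body.byteLength, false);
  out.set(body, 4);
  return out;
}

// Parser side: returns the parsed message plus the unconsumed remainder,
// and throws if the data is truncated or the body is not valid JSON.
function parseMessage(data: Uint8Array): {
  message: unknown;
  remainder: Uint8Array;
} {
  if (data.byteLength < 4) {
    throw new Error('Truncated length prefix');
  }
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const length = view.getUint32(0, false);
  if (data.byteLength < 4 + length) {
    throw new Error('Truncated message body');
  }
  let message: unknown;
  try {
    message = JSON.parse(decoder.decode(data.subarray(4, 4 + length)));
  } catch {
    throw new Error('Message body is not valid JSON');
  }
  return { message, remainder: data.subarray(4 + length) };
}
```

Returning the remainder lets the caller keep parsing message after message, and hand the leftover bytes to a different parser once the stream switches to raw file contents.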

I'll need to do some research and prototyping into how we want to format the binary stream. It should be possible to start with a JSON message stream and downgrade to a much simplified raw binary stream for sending all of the file contents.

Additional context

Related MatrixAI/Polykey-CLI#32

Tasks

  1. Define file tree structure and types.
  2. Prototype a serialised binary structure.
  3. Build a file tree serialiser that converts the file tree into a binary webstream.
  4. Build a deserialiser that converts the raw binary webstream back into a file tree structure.
@tegefaulkes tegefaulkes added the development Standard development label Jul 11, 2024

@tegefaulkes
Contributor Author

Thinking about this, there are 3 options for how to approach it.

  1. Fully JSON encoded message stream. The glob walker utility in #767 (Implement utility that generates a file tree from a path pattern) already outputs in this format. The only downside is that the files will be encoded as binary strings, which adds a little overhead to the data. But it is the most convenient form to use.
  2. We encode most of the metadata as JSON messages, but the file contents drop down to a raw stream encoding for sending the raw file data. This is the best of both worlds: the file tree data is in readable JSON format while the file contents are sent efficiently as raw binary. Between the metadata and the contents stage the stream will be downgraded to a raw stream. We've done this in the past.
  3. We fully encode the data as a raw stream. This is the most annoying of the three, since we'll require generators and parsers for each type of file tree node message.

For prototyping I'll start with option 1 just to get things working. Then, if needed, we can convert to option 2.
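
To get a feel for the overhead that option 1 trades for convenience, the sketch below compares the size of file contents embedded in a JSON message as a binary string against a small JSON header followed by the raw bytes. The message shapes, the `size` field, and the function names are assumptions made for illustration, not the project's wire format.

```typescript
const sizeEncoder = new TextEncoder();

// Option 1 style: contents embedded in the JSON message as a binary
// string (one character per byte), then encoded as UTF-8 for the wire.
function jsonEncodedSize(contents: Uint8Array): number {
  let binaryString = '';
  contents.forEach((byte) => {
    binaryString += String.fromCharCode(byte);
  });
  const message = JSON.stringify({
    type: 'content',
    cNode: 1,
    contents: binaryString,
  });
  return sizeEncoder.encode(message).byteLength;
}

// Option 2 style: a small JSON header followed by the untouched bytes.
function rawEncodedSize(contents: Uint8Array): number {
  const header = JSON.stringify({
    type: 'content',
    cNode: 1,
    size: contents.byteLength,
  });
  return sizeEncoder.encode(header).byteLength + contents.byteLength;
}

// Sample payload covering every possible byte value once.
const allByteValues = new Uint8Array(256);
allByteValues.forEach((_, index) => {
  allByteValues[index] = index;
});

console.log('JSON-embedded size:', jsonEncodedSize(allByteValues));
console.log('Header + raw size:', rawEncodedSize(allByteValues));
```

The gap comes from JSON escaping control characters and from UTF-8 using two bytes for every character above `0x7F`, so it depends heavily on the file contents: plain ASCII text pays very little, while arbitrary binary data pays much more.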

@CMCDragonkai
Member

  1. Is obviously better. You may want to use a tar archive format and allow easy compression.

@CMCDragonkai
Member

Hey, this was desynced from Linear. What's the status? Also, in relation to my question, I actually preferred 2.

@tegefaulkes
Contributor Author

Oh that's weird, I still need to find out what causes that.

This is closed now. I ended up implementing it as option 2, but we don't so much downgrade to a raw stream. It's just a raw stream that alternates between JSON messages for the tree structure and JSON + raw binary for the file contents. It also handles sending multiple trees consecutively this way.
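
A minimal sketch of what such an alternating stream could look like, assuming newline-terminated JSON messages, a `size` field announcing how many raw bytes follow a content header, and an `end` marker between trees. All of these details and names are hypothetical; the actual implementation is in the PR that fixed this issue.

```typescript
type Cnode = number;
type TreeMessage =
  | { type: 'directory'; path: string; iNode: number; parent: number; children: number[] }
  | { type: 'file'; path: string; iNode: number; parent: number; cNode: Cnode };
type TreeInput = {
  nodes: TreeMessage[];
  contents: Array<{ cNode: Cnode; bytes: Uint8Array }>;
};

const chunkEncoder = new TextEncoder();

// Encodes one JSON message as a newline-terminated UTF-8 chunk.
function jsonChunk(message: object): Uint8Array {
  return chunkEncoder.encode(JSON.stringify(message) + '\n');
}

// Yields the chunks for each tree in turn: JSON messages for the tree
// structure, then a JSON header plus raw bytes for each file's contents.
function* serialiseTrees(trees: TreeInput[]): Generator<Uint8Array> {
  for (const { nodes, contents } of trees) {
    for (const node of nodes) {
      yield jsonChunk(node);
    }
    for (const { cNode, bytes } of contents) {
      // The header tells the parser how many raw bytes to read next.
      yield jsonChunk({ type: 'content', cNode, size: bytes.byteLength });
      yield bytes;
    }
    // Hypothetical marker so a parser knows where the next tree starts.
    yield jsonChunk({ type: 'end' });
  }
}

const sampleTree: TreeInput = {
  nodes: [
    { type: 'directory', path: '.', iNode: 1, parent: 0, children: [2] },
    { type: 'file', path: './a.bin', iNode: 2, parent: 1, cNode: 1 },
  ],
  contents: [{ cNode: 1, bytes: new Uint8Array([0xff, 0x00]) }],
};

// Two consecutive trees share one stream of chunks.
const chunks = [...serialiseTrees([sampleTree, sampleTree])];
console.log('chunk count:', chunks.length);
```

Each yielded chunk could then be passed to `controller.enqueue()` inside a `ReadableStream` source to produce the binary web stream.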

I did not implement any compression of the binary data. It should be fairly easy to add, though, if we feel the need.

@CMCDragonkai
Member

Did you benchmark it?

@tegefaulkes
Contributor Author

No, I didn't see a need. It runs fast enough in Jest tests. Nothing in the algorithm would scale poorly.
