Chunking JOANNE for zarr #39

Open
Geet-George opened this issue Dec 8, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@Geet-George
Owner

@d70-t : "Maybe the timing isn't great, but I thought I'd post this here for reference. I looked through several documents on how the IPFS storage layer handles chunking (IPFS splits files into chunks which are transported independently). The thing is that zarr also creates chunks, and chunks made by zarr are most probably better than chunks made by the IPFS storage layer, as zarr is able to consider the multi-dimensional nature of the data. But if both systems create chunks, chances are high that the chunk sizes don't match up, which would result in more chunks and more unnecessarily transferred data. In particular, if a file contains only one chunk, there is only one item to transfer, but as soon as a second chunk is added, another item (the chunk index) is added to the system, which effectively results in 3 times the number of network transfers.

So for storing zarr on IPFS, what I think works best is if each zarr chunk corresponds to one IPFS chunk. Currently (and for quite some time now), IPFS uses a default chunk size of 256 KiB, or exactly 262144 bytes. This can be changed manually, but there is a hard limit of 1 MiB per transferred item. There is currently also a 14-byte header per data item (which is proposed to be removed in the future). Technically the 14 bytes are additional to the 256 KiB, but there are some ideas why it could be more performant to actually limit the total of chunk size + header to a power of 2.

The bottom line is, if you plan to put JOANNE via zarr on IPFS, I'd expect optimal performance if each of the zarr data chunk files is less than or equal to 262130 bytes in size. If that doesn't work out for some reason, then the chunk size should be way larger (i.e. at least 4 to 8 times that size), as for example 262145 bytes would already result in 3 times as many individual transfers with default IPFS settings."
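To make the arithmetic above concrete, here is a rough sketch in Python. It is not taken from IPFS or zarr: the helper names `ipfs_transfers` and `max_elements_per_chunk` are hypothetical, and the model assumes that any file larger than one IPFS chunk needs exactly one extra index item linking its data chunks. The 262130-byte budget is simply 262144 bytes minus the 14-byte header.

```python
import math

IPFS_CHUNK_SIZE = 262144   # default IPFS chunk size quoted above (256 KiB)
IPFS_HEADER_SIZE = 14      # per-item header quoted above
ZARR_CHUNK_BUDGET = IPFS_CHUNK_SIZE - IPFS_HEADER_SIZE  # 262130 bytes


def ipfs_transfers(file_size, chunk_size=IPFS_CHUNK_SIZE):
    """Estimate the number of items transferred for one file.

    Hypothetical helper. Assumption: a file that does not fit into a
    single chunk needs exactly one additional index item that links
    all of its data chunks.
    """
    if file_size <= chunk_size:
        return 1
    return math.ceil(file_size / chunk_size) + 1


def max_elements_per_chunk(itemsize, budget=ZARR_CHUNK_BUDGET):
    """Largest element count whose uncompressed size fits the budget."""
    return budget // itemsize


if __name__ == "__main__":
    # Compare a file just under the budget, one byte over the default
    # chunk size, and files of 4 and 8 times the default chunk size.
    for size in (262130, 262145, 4 * IPFS_CHUNK_SIZE, 8 * IPFS_CHUNK_SIZE):
        n = ipfs_transfers(size)
        print(f"{size:>8} bytes -> {n} transfers")

    # Upper bound on elements per zarr chunk for 4-byte and 8-byte dtypes.
    print(max_elements_per_chunk(4), max_elements_per_chunk(8))
```

Under this model, the relative cost of the extra index item shrinks as files get larger, which is why a chunk size of several times 256 KiB is the suggested fallback. The element bound is computed on uncompressed size; compressed chunk files are usually smaller, so it is conservative in most cases.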

Geet-George added the enhancement label Mar 8, 2021
@d70-t

d70-t commented Mar 8, 2021

This is still a bit rough, but I've collected a few scripts which should help with this issue at d70-t/ipfszarr. I've converted JOANNE v0.9.2 for testing purposes using nc2zarr.py -O2 and it could be worse :-)
