
HDF5-backed data structure #145

Open
oesteban opened this issue Mar 28, 2024 · 7 comments
oesteban commented Mar 28, 2024

Eddymotion uses large datasets and continuously needs to update them in memory, keep a large list of transforms, etc.

For that reason, we created the DWI object (which we are now generalizing so the same approach can be applied to PET or BOLD). The idea is to use HDF5 to help minimize the memory footprint without hurting I/O too much.

I guess we want some sort of "glorified" ArrayProxy (in nibabel's jargon) around all the facets of a DWI dataset: the data themselves, NIfTI headers readily available to generate NIfTI files for third-party tools, the gradient table, the brain mask, the head-motion realignment matrices, the fieldmap, etc.

Giving a common interface to all those aspects of the DWI dataset would definitely make development faster.
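To make that concrete, here is a minimal sketch of what such a wrapper could look like, assuming the HDF5 layout used elsewhere in this issue (a `/0` root group holding `dataobj`, `gradients`, etc.); the class and its names are hypothetical, not eddymotion's API:

```python
import h5py
import numpy as np


class HDF5Backed:
    """Hypothetical lazy proxy: arrays live on disk, not in memory."""

    def __init__(self, filepath):
        self._filepath = filepath

    def _read(self, name, index=None):
        # Open, slice, and close on every access: the memory footprint
        # stays small, at the cost of repeated (cheap) file opens.
        with h5py.File(self._filepath, "r") as f:
            dset = f["/0"][name]
            return np.asanyarray(dset if index is None else dset[..., index])

    @property
    def dataobj(self):
        return self._read("dataobj")

    @property
    def gradients(self):
        return self._read("gradients")
```

Attributes like the b=0 reference, the affine, or the realignment matrices could be exposed the same way, so user code only ever sees NumPy arrays.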

I would like to pick @effigies' and @arokem's brains and see how this would be implemented.

At the moment, what we have does not seem to achieve the flexibility and seamless operation that we want.

EDIT:

In #98, we are removing data splitting from the data structure object (which I think makes a lot of sense), and this problem has become apparent:

```python
from pathlib import Path

import h5py
import numpy as np


def logo_split(self, index, with_b0=False):
    """
    Produce one fold of LOGO (leave-one-gradient-out).

    Parameters
    ----------
    index : :obj:`int`
        Index of the DWI orientation to be left out in this fold.
    with_b0 : :obj:`bool`
        Insert the *b=0* reference at the beginning of the training dataset.

    Returns
    -------
    (train_data, train_gradients) : :obj:`tuple`
        Training DWI and corresponding gradients.
        Training data/gradients come **from the updated dataset**.
    (test_data, test_gradients) : :obj:`tuple`
        Test 3D map (one DWI orientation) and corresponding b-vector/value.
        The test data/gradient come **from the original dataset**.

    """
    if not Path(self._filepath).exists():
        self.to_filename(self._filepath)

    # read original DWI data & b-vector
    with h5py.File(self._filepath, "r") as in_file:
        root = in_file["/0"]
        dwframe = np.asanyarray(root["dataobj"][..., index])
        bframe = np.asanyarray(root["gradients"][..., index])

    # if the size of the mask does not match data, cache is stale
    mask = np.zeros(len(self), dtype=bool)
    mask[index] = True

    train_data = self.dataobj[..., ~mask]
    train_gradients = self.gradients[..., ~mask]
    if with_b0:
        train_data = np.concatenate(
            (np.asanyarray(self.bzero)[..., np.newaxis], train_data),
            axis=-1,
        )
        b0vec = np.zeros((4, 1))
        b0vec[0, 0] = 1
        train_gradients = np.concatenate(
            (b0vec, train_gradients),
            axis=-1,
        )

    return (
        (train_data, train_gradients),
        (dwframe, bframe),
    )
```

In particular, my hesitations emerge from having to do this bit:

```python
# read original DWI data & b-vector
with h5py.File(self._filepath, "r") as in_file:
    root = in_file["/0"]
    dwframe = np.asanyarray(root["dataobj"][..., index])
    bframe = np.asanyarray(root["gradients"][..., index])
```

every time from outside methods if we want to take advantage of the HDF5 backing.
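One way around this would be to fold that recurring pattern into a single accessor on the data structure itself, so outside methods never open the HDF5 file directly. A hypothetical sketch (the method name `get_frame` is mine, not eddymotion's API; the `/0` group and dataset names follow the snippet above):

```python
import h5py
import numpy as np


def get_frame(self, index):
    """Return the DWI volume and gradient column at ``index`` from the HDF5 store."""
    with h5py.File(self._filepath, "r") as in_file:
        root = in_file["/0"]
        return (
            np.asanyarray(root["dataobj"][..., index]),
            np.asanyarray(root["gradients"][..., index]),
        )


# Inside logo_split, the read would then collapse to:
#     dwframe, bframe = self.get_frame(index)
```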

EDIT 2:

In the long term, and if this vision works well, I can see this kind of "internal format" being propagated across nipreps and even fitlins.

WDYT?

@oesteban (Member Author)

@teresamg, you've wrestled quite a bit with the DWI object -- what is your experience?

@oesteban (Member Author)

@jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

@esavary (Member) commented Mar 28, 2024

> @jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

I don't have much experience and no strong opinion, but is there a reason not to encapsulate the HDF5 backing inside the DWI object?

@oesteban (Member Author)


> is there a reason to not encapsulate the HDF5 backing inside the DWI object?

Not really. This is why I mentioned in #98 that I was no longer sure how this should be implemented.

We can merge #98 and roll that bit back afterward when we have a plan for this issue. I'd like to know Chris' and Ariel's opinions on #98 before we merge it though.

@jhlegarreta (Collaborator)


> @jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

I see the memory footprint concern that you mention, and I do see the HDF5 data manipulation bit that you point to as being a burden for users.

I have used HDF5 files in the past inside some wrapper class to actually provide the data.

> In the long term, and if this vision works well, I can see this kind of "internal format" be propagated across nipreps and even fitlins.

My fear there is proposing yet another format that is difficult to maintain, difficult to use across all nipreps, a difficult fit in e.g. BIDS (maybe I'm mixing things up with this last one), etc. But maybe what you say about nibabel doing something similar is already a fact (I do not know the inner workings of nibabel's ArrayProxy well enough).

Also, I do not know enough about memmaps (memory-mapped arrays) as another possibility.

As for the PRs, I do not know the code well enough to weigh in on whether they should be merged or held now: whatever is easiest to make progress, so probably merging and leaving the HDF5 issue to be fixed whenever it really becomes an issue. If that time has come 🙃, then I'd need a better grasp of the whole picture to weigh in. Apologies for that.

@oesteban oesteban added this to the 0.1.6 milestone Apr 3, 2024
@oesteban (Member Author)


@effigies - we would like to pick your brain on this one.

The design idea behind this is described in the "nipreps book": https://www.nipreps.org/nipreps-book/tutorial/intro.html#step-1-identify-an-i-o-inputs-outputs-specification

We basically wanted to have a data structure that could hold all the relevant aspects of a DWI dataset in the context of the problem. That meant it should have easy ways of storing and accessing:

  • The actual DWIs (a 4D numpy array for b > 0)
  • The b=0 reference (a 3D numpy array; the user feeds it, so if the dataset has several b=0 volumes they can average and/or denoise them beforehand, etc.)
  • The NIfTI header of the original file (so that a NIfTI can be produced at any moment) -- the affine being the most important aspect of that
  • A fieldmap estimation (a la SDCFlows/nitransforms)
  • A list of affines where we store the realignment affines as they are calculated
  • The gradient table (b-vecs + b-vals) matching "the actual DWIs" above
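For illustration only, that inventory maps onto a small container along these lines (the class and field names are mine, not eddymotion's actual API):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class DWIDataset:
    """Hypothetical container mirroring the bullet list above."""

    dataobj: np.ndarray    # 4D DWI data, b > 0 volumes only
    bzero: np.ndarray      # 3D b=0 reference (averaged/denoised by the user)
    affine: np.ndarray     # voxel-to-world affine from the NIfTI header
    gradients: np.ndarray  # 4 x N gradient table (b-vecs + b-vals)
    fieldmap: "np.ndarray | None" = None  # SDCFlows-style fieldmap estimate
    motion_affines: list = field(default_factory=list)  # per-volume realignment

    def __len__(self):
        return self.dataobj.shape[-1]
```

The HDF5 backing would then sit behind this interface, materializing each field on access rather than keeping it resident.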

The rationale for using HDF5 was to allow Python to clean up hefty objects in memory while keeping an easy interface to the object. In my imagination, this should work like an overlay built into the data structure, so that the user does not need to bother about HDF5 at all (it is just used in the background).

Maybe there are better solutions, like the ArrayProxy @jhlegarreta mentions above, or maybe we should be thinking about Zarr instead. I'm happy to go in any direction that makes eddymotion work better.

Let us know what you think.

@esavary esavary self-assigned this Apr 30, 2024
@arokem (Collaborator) commented May 3, 2024

Just adding that I've recently had some positive experiences with Apache Arrow (see this for some examples/details). I'm not sure it's a great match for this use case, but thought I'd mention it here to add to the options we could consider.
