Some Write calls leads to data loss #113

DUOLabs333 · 2023-12-26T01:05:22Z

In some cases (eg, when cloning a moderately big repository, like this one), data written to a file can not be read back in a pread call (ie, will return 0 bytes).

The text was updated successfully, but these errors were encountered:

DUOLabs333 · 2023-12-26T01:46:19Z

What I find interesting is that for a given pread call, for some files, there's no corresponding Read call on the server.

DUOLabs333 · 2023-12-26T02:32:40Z

Checking the logs points to another "expected fileid... fileid changed" error. Specifically, I get a lot of lines with the following
expected fileid 0x1, got

willscott · 2023-12-26T12:59:44Z

Are you seeing file handles expiring? / being re-established? I wonder if fileid instability is downstream of handle instability here?

DUOLabs333 · 2023-12-26T13:32:29Z

Are fileid separate from handles (ie, are they handled by the implementation, not the server)?

willscott · 2023-12-26T13:44:18Z

It is filled in by the server if not provided by the implementation:

https://github.com/willscott/go-nfs/blob/master/file.go#L112-L122

If the underlying files are unix - eg https://github.com/willscott/go-nfs/blob/master/file/file_unix.go#L14-L24
then the Ino will be used.
Otherwise the fileid will be a hash of file path.

DUOLabs333 · 2023-12-26T14:00:40Z

It's technically possible to get a different inode than expected if you move a directory (since the path will be different), but I don't think that's what's happening here.

DUOLabs333 · 2023-12-26T16:18:21Z

What file does 0x1 map to? I'm trying to match the path by printing the MarshalBinary of the uuid in ToHandle, but it's not matching anything.

willscott · 2023-12-26T16:43:36Z

0x1 does not seem like a likely output you should get as the binary output of a UUID - where are you first seeing that as a handle?

DUOLabs333 · 2023-12-26T16:44:55Z

No, I'm trying to match the 0x1 from journalctl logs to the handles outputted by the server. Are these the same?

willscott · 2023-12-26T16:54:01Z

I don't know which part of the journalctl you're seeing that in.- journalctl may be showing either NFS handles, file system inodes, or something else like an error code.

DUOLabs333 · 2023-12-26T16:59:26Z

If you grep for NFS, you can get something like fsid 0:53: expected fileid 0x1, got 0x38c30b

DUOLabs333 · 2023-12-27T02:42:25Z

Ok, got it --- the fileids are referring to the implementation's fileid/inodes. In my case, 0x1 refers to /dev/pts for some reason.

DUOLabs333 · 2023-12-27T02:46:36Z

Interestingly, when it says expected fileid 0xfoo, got 0xbar, while bar is handed out by the implementation, foo is not (I can't find it in my now 75MB log).

willscott · 2023-12-27T08:46:59Z

foo ( the 0x1) is likely the internal identifier the client is using to identify a file, bar is perhaps the inode / fileid sent over the nfs protocol as the identifier provided by the server.

DUOLabs333 · 2023-12-27T14:24:13Z

So shouldn't it almost always be mismatched?

DUOLabs333 · 2023-12-27T16:38:16Z

Actually, I think the "expected fileid" problem is unrelated --- I think I solved the problem on my side (the error is no longer showing up in my logs), but the original error still remains.

willscott · 2023-12-27T17:49:22Z

To dive deeper into the original symptoms:

Do you see any failing writes?
Do you see read calls returning 0 bytes/erroring, or the reads not occurring at all?

DUOLabs333 · 2023-12-27T18:11:10Z

The writes don't seem to fail
That's the thing --- I don't seen any read to the relevant file(s).

DUOLabs333 · 2023-12-27T18:13:22Z

I was going to check if noac would fix it, but it just hangs the client.

DUOLabs333 · 2023-12-27T18:35:51Z

I sometimes get fatal: unable to get current working directory: No such file or directory before the clone could actually begin (I'm testing the server with a git clone), so it seems like a race condition, which shouldn't be possible.

Looking online points to some issue with getcwd, but the issue was fixed on the client side almost a decade ago.

DUOLabs333 · 2023-12-27T20:44:35Z

Ok, here's something interesting --- if I disable ReadDirPlus with nordirplus, try to clone the first time, fail, then try to clone the second time, the clone succeeds (well, almost --- the server hangs on nfs.Link until the kernel kills it due to OOM, but I guess that's better than before?).

DUOLabs333 · 2023-12-27T22:44:44Z

Similarly, for the fatal: unable to get current working directory: No such file or directory error, git checks for the .git directory, but never makes it.

DUOLabs333 · 2023-12-27T23:17:04Z

Apparently, the getcwd issue is common enough on FUSE filesystems to have a work around: cd .. && cd -. I'll look at what the other projects did to fix this.

This can be replicated by running a failing git clone, then:

#include <unistd.h>
#include <stdio.h>
#include <limits.h>

int main() {
  char cwd[PATH_MAX];
  if (getcwd(cwd, sizeof(cwd)) != NULL) {
    printf("%s\n", cwd);
  } else {
    perror("getcwd() error");
    return 1;
  }
  return 0;
}

DUOLabs333 · 2023-12-28T00:01:06Z

Interestingly, getcwd doesn't seem to make any calls to the server --- maybe it's cached?

DUOLabs333 · 2023-12-28T01:19:31Z

I see the problem --- ls -l /proc/$$/cwd is a symlink to /foo (deleted)

Of course, this doesn't explain why going out of the current directory, then going back fixes the problem --- for some reason, it "regenerates" the folder.

DUOLabs333 · 2023-12-28T02:34:06Z

I'm looking at bazil/fuse#250, and it looks like the main problem is that each Lookup returns a new handle, so previous ones get invalidated.

I made a quick POC that seems to fix the issue by just keeping a map between the path and a handle and using it if it already exists.

DUOLabs333 · 2023-12-28T02:37:12Z

Yeah, I just confirmed by making handles one-to-one (and not only unique), the problems (both the getcwd and failing reads) are fixed. Should I open a PR?

DUOLabs333 · 2023-12-28T02:42:53Z

The main problem is that since the keys are strings, you can only cache on path, not filesystem --- you could get around this by mapping to a list of lists, where the first element is the filesystem and the second is the list of bytes.

willscott · 2023-12-28T09:34:20Z

A PR if you want to open one would be helpful!

DUOLabs333 · 2023-12-28T22:58:39Z

Done in #116.

DUOLabs333 closed this as completed Jan 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some Write calls leads to data loss #113

Some Write calls leads to data loss #113

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

willscott commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

willscott commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023 •

edited

Loading

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023

DUOLabs333 commented Dec 28, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023

willscott commented Dec 28, 2023

DUOLabs333 commented Dec 28, 2023

Some Write calls leads to data loss #113

Some Write calls leads to data loss #113

Comments

DUOLabs333 commented Dec 26, 2023 • edited Loading

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 • edited Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 • edited Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023 • edited Loading

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

willscott commented Dec 26, 2023

DUOLabs333 commented Dec 26, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

willscott commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

willscott commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023 • edited Loading

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023

DUOLabs333 commented Dec 27, 2023 • edited Loading

DUOLabs333 commented Dec 28, 2023 • edited Loading

DUOLabs333 commented Dec 28, 2023 • edited Loading

DUOLabs333 commented Dec 28, 2023

DUOLabs333 commented Dec 28, 2023 • edited Loading

DUOLabs333 commented Dec 28, 2023

willscott commented Dec 28, 2023

DUOLabs333 commented Dec 28, 2023

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

DUOLabs333 commented Dec 26, 2023 •

edited

Loading

DUOLabs333 commented Dec 27, 2023 •

edited

Loading

DUOLabs333 commented Dec 27, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023 •

edited

Loading

DUOLabs333 commented Dec 28, 2023 •

edited

Loading