Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some Write calls leads to data loss #113

Closed
DUOLabs333 opened this issue Dec 26, 2023 · 30 comments
Closed

Some Write calls leads to data loss #113

DUOLabs333 opened this issue Dec 26, 2023 · 30 comments

Comments

@DUOLabs333
Copy link
Contributor

DUOLabs333 commented Dec 26, 2023

In some cases (eg, when cloning a moderately big repository, like this one), data written to a file can not be read back in a pread call (ie, will return 0 bytes).

@DUOLabs333
Copy link
Contributor Author

What I find interesting is that for a given pread call, for some files, there's no corresponding Read call on the server.

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 26, 2023

Checking the logs points to another "expected fileid... fileid changed" error. Specifically, I get a lot of lines with the following
expected fileid 0x1, got

@willscott
Copy link
Owner

Are you seeing file handles expiring? / being re-established? I wonder if fileid instability is downstream of handle instability here?

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 26, 2023

Are fileid separate from handles (ie, are they handled by the implementation, not the server)?

@willscott
Copy link
Owner

It is filled in by the server if not provided by the implementation:

https://github.com/willscott/go-nfs/blob/master/file.go#L112-L122

If the underlying files are unix - eg https://github.com/willscott/go-nfs/blob/master/file/file_unix.go#L14-L24
then the Ino will be used.
Otherwise the fileid will be a hash of file path.

@DUOLabs333
Copy link
Contributor Author

It's technically possible to get a different inode than expected if you move a directory (since the path will be different), but I don't think that's what's happening here.

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 26, 2023

What file does 0x1 map to? I'm trying to match the path by printing the MarshalBinary of the uuid in ToHandle, but it's not matching anything.

@willscott
Copy link
Owner

0x1 does not seem like a likely output you should get as the binary output of a UUID - where are you first seeing that as a handle?

@DUOLabs333
Copy link
Contributor Author

No, I'm trying to match the 0x1 from journalctl logs to the handles outputted by the server. Are these the same?

@willscott
Copy link
Owner

I don't know which part of the journalctl you're seeing that in.- journalctl may be showing either NFS handles, file system inodes, or something else like an error code.

@DUOLabs333
Copy link
Contributor Author

If you grep for NFS, you can get something like fsid 0:53: expected fileid 0x1, got 0x38c30b

@DUOLabs333
Copy link
Contributor Author

Ok, got it --- the fileids are referring to the implementation's fileid/inodes. In my case, 0x1 refers to /dev/pts for some reason.

@DUOLabs333
Copy link
Contributor Author

Interestingly, when it says expected fileid 0xfoo, got 0xbar, while bar is handed out by the implementation, foo is not (I can't find it in my now 75MB log).

@willscott
Copy link
Owner

foo ( the 0x1) is likely the internal identifier the client is using to identify a file, bar is perhaps the inode / fileid sent over the nfs protocol as the identifier provided by the server.

@DUOLabs333
Copy link
Contributor Author

So shouldn't it almost always be mismatched?

@DUOLabs333
Copy link
Contributor Author

Actually, I think the "expected fileid" problem is unrelated --- I think I solved the problem on my side (the error is no longer showing up in my logs), but the original error still remains.

@willscott
Copy link
Owner

To dive deeper into the original symptoms:

  • Do you see any failing writes?
  • Do you see read calls returning 0 bytes/erroring, or the reads not occurring at all?

@DUOLabs333
Copy link
Contributor Author

  1. The writes don't seem to fail
  2. That's the thing --- I don't seen any read to the relevant file(s).

@DUOLabs333
Copy link
Contributor Author

I was going to check if noac would fix it, but it just hangs the client.

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 27, 2023

I sometimes get fatal: unable to get current working directory: No such file or directory before the clone could actually begin (I'm testing the server with a git clone), so it seems like a race condition, which shouldn't be possible.

Looking online points to some issue with getcwd, but the issue was fixed on the client side almost a decade ago.

@DUOLabs333
Copy link
Contributor Author

Ok, here's something interesting --- if I disable ReadDirPlus with nordirplus, try to clone the first time, fail, then try to clone the second time, the clone succeeds (well, almost --- the server hangs on nfs.Link until the kernel kills it due to OOM, but I guess that's better than before?).

@DUOLabs333
Copy link
Contributor Author

Similarly, for the fatal: unable to get current working directory: No such file or directory error, git checks for the .git directory, but never makes it.

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 27, 2023

Apparently, the getcwd issue is common enough on FUSE filesystems to have a work around: cd .. && cd -. I'll look at what the other projects did to fix this.

This can be replicated by running a failing git clone, then:

#include <unistd.h>
#include <stdio.h>
#include <limits.h>

int main() {
  char cwd[PATH_MAX];
  if (getcwd(cwd, sizeof(cwd)) != NULL) {
    printf("%s\n", cwd);
  } else {
    perror("getcwd() error");
    return 1;
  }
  return 0;
}

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 28, 2023

Interestingly, getcwd doesn't seem to make any calls to the server --- maybe it's cached?

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 28, 2023

I see the problem --- ls -l /proc/$$/cwd is a symlink to /foo (deleted)

Of course, this doesn't explain why going out of the current directory, then going back fixes the problem --- for some reason, it "regenerates" the folder.

@DUOLabs333
Copy link
Contributor Author

I'm looking at bazil/fuse#250, and it looks like the main problem is that each Lookup returns a new handle, so previous ones get invalidated.

I made a quick POC that seems to fix the issue by just keeping a map between the path and a handle and using it if it already exists.

@DUOLabs333
Copy link
Contributor Author

DUOLabs333 commented Dec 28, 2023

Yeah, I just confirmed by making handles one-to-one (and not only unique), the problems (both the getcwd and failing reads) are fixed. Should I open a PR?

@DUOLabs333
Copy link
Contributor Author

The main problem is that since the keys are strings, you can only cache on path, not filesystem --- you could get around this by mapping to a list of lists, where the first element is the filesystem and the second is the list of bytes.

@willscott
Copy link
Owner

  • A PR if you want to open one would be helpful!

@DUOLabs333
Copy link
Contributor Author

Done in #116.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants