
Empty read from gitdb.OStream.read() before EOF #120

Open
lordmauve opened this issue Apr 16, 2025 · 6 comments

Comments

@lordmauve

lordmauve commented Apr 16, 2025

I have code that relies on reading an object from a gitdb stream.

To do this I used a standard .read() loop (as with io.RawIOBase):

stream = db.stream(bytes.fromhex(sha))
while chunk := stream.read(4096):
    yield chunk

The behaviour I expected (given the duck-typing with RawIOBase) is to get b'' only at EOF:

If 0 bytes are returned, and size was not 0, this indicates end of file.

However stream.read(4096) can return empty chunks even before the end of the stream, so the loop exits early.

For the file where I saw this first, it is sensitive to the size parameter - it apparently occurs for 0 < size <= 4096.
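As a possible workaround, the loop can track how many bytes are still expected instead of treating an empty chunk as EOF. This is only a sketch, assuming stream.size reports the uncompressed object size (as it does for gitdb streams) and that the empty reads are transient:

stream = db.stream(bytes.fromhex(sha))
remaining = stream.size
while remaining > 0:
    chunk = stream.read(min(4096, remaining))
    if not chunk:
        # premature empty read as described above; try again rather than stopping
        continue
    remaining -= len(chunk)
    yield chunk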

Looking at the code, there is a condition to repeat a read if insufficient bytes were returned:

gitdb/gitdb/stream.py

Lines 316 to 317 in f36c0cc

if dcompdat and (len(dcompdat) - len(dat)) < size and self._br < self._s:
    dcompdat += self.read(size - len(dcompdat))

However the leading if dcompdat and means that the condition doesn't apply if zero bytes were read. Removing this part of the condition addresses the issue (but I understand from the comment that this is in order to support compressed_bytes_read()).
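Concretely, the condition would then read as follows. This is only a sketch of the edit described above, mirroring the two lines quoted from stream.py, and it may well interfere with the compressed_bytes_read() bookkeeping that the guard appears to exist for:

if (len(dcompdat) - len(dat)) < size and self._br < self._s:
    dcompdat += self.read(size - len(dcompdat))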

@lordmauve
Author

If I print all chunk sizes with

stream = db.stream(bytes.fromhex(sha))
sz = 0
while sz < stream.size:
    chunk = stream.read(4096)
    print(len(chunk))
    sz += len(chunk)

there's a spread of sizes:

$ python bad.py | sort -n | uniq -c
      1 0
      1 533
      1 4071
      1 4073
      1 4075
      2 4080
      1 4081
      1 4082
      1 4086
      2 4087
      2 4089
      1 4090
      1 4092
      1 4093
      1 4095
  45840 4096

which seems to refute the idea expressed in this comment that it will recursively read() until the requested size is filled:

gitdb/gitdb/stream.py

Lines 310 to 312 in f36c0cc

# it can happen, depending on the compression, that we get less bytes
# than ordered as it needs the final portion of the data as well.
# Recursively resolve that.

Removing if dcompdat and:

$ python bad.py | sort -n | uniq -c
      1 347
  45856 4096

@lordmauve changed the title from "Empty read from gitdb.OSteam.read() before EOF" to "Empty read from gitdb.OStream.read() before EOF" on Apr 16, 2025
@Byron
Member

Byron commented Apr 16, 2025

Thanks for reporting!

I don't think, however, that the implementation can be trusted and it's better to use the git command wrappers provided in GitPython.

Getting a chunk of size 0 in the middle is certainly unexpected, but maybe if that's fixed it will be suitable for consumption nonetheless?

@lordmauve
Author

it's better to use the git command wrappers provided in GitPython.

GitCmdObjectDB? I've found previously that git cat-file --batch is 16 times slower than gitdb, which is substantial when our monorepo is 200GB. But I do agree that the implementation can't be trusted: I tried using a gitdb instance in threads (but with independent reads) and saw corrupted data. I've also found that the best interface to git is generally to wrap git commands. Direct object DB access is the one case where that isn't fast enough, and for that I've been using gitdb cautiously.
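For context, the backend is selectable via GitPython's odbt parameter. A minimal sketch from memory (the import paths and the odbt keyword are assumptions; check the GitPython docs for your version, and the repository path is a placeholder):

from git import Repo
from git.db import GitCmdObjectDB   # shells out to git cat-file --batch
from gitdb import GitDB             # pure-Python object database access

# Choose the object database backend when opening the repository.
repo_cmd = Repo("/path/to/monorepo", odbt=GitCmdObjectDB)
repo_gitdb = Repo("/path/to/monorepo", odbt=GitDB)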

For the current application I just need to re-hash previously unseen trees/blobs using a different hashing scheme to git's, and it has been working OK. Maybe I should re-run the git checksums as well, as a sanity check; it would probably still be faster than a git pipe, and then I could debug any issues I detect.
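Such a sanity check could recompute git's own object id while streaming and compare it with the stored one. A minimal sketch, assuming the gitdb OStream exposes binsha, type (e.g. b'blob') and size, and relying on git hashing "<type> <size>", a NUL byte, then the raw content with SHA-1:

import hashlib

def matches_git_sha(stream):
    # Header git hashes before the content: b"<type> <size>\0"
    h = hashlib.sha1()
    h.update(stream.type + b" " + str(stream.size).encode() + b"\0")
    remaining = stream.size
    while remaining > 0:
        chunk = stream.read(min(65536, remaining))
        if not chunk:
            # tolerate the premature empty reads described in this issue
            continue
        h.update(chunk)
        remaining -= len(chunk)
    return h.digest() == stream.binsha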

@Byron
Member

Byron commented Apr 18, 2025

I see. In this case I'd recommend using pygit2 instead if it must be python, or go straight to Rust and gitoxide (or git2).

@lordmauve
Author

Ah, maybe I should try pygit2. We also had a terrible time with libgit2 in a different application, so we swore off it. But that may have been more about the bindings (node-git).

I am happy using Rust in CLI tools but our internal auth stack is not available in Rust, and the two services where we use/could use gitdb would not be cost-effective to rewrite in Rust.

@Byron
Member

Byron commented Apr 18, 2025

I was thinking more along the lines of a little CLI that performs a specific task, which the main application shells out to. Alternatively, one could do the same but generate bindings.
Ultimately, if GitDB works (with some additional protections), then why not use it. But with a 200GB repository it seems like one would want to go native.
