Skip to content

Adds repo.is_valid_object() check. #1267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 5, 2021
Merged

Adds repo.is_valid_object() check. #1267

merged 5 commits into from
Jun 5, 2021

Conversation

bytefluxio
Copy link
Contributor

@bytefluxio bytefluxio commented Jun 5, 2021

As discussed in #1266.

Unfortunately git cat-file --batch-check doesn't allow for specific object limits like git cat-file commit --batch-check,
I did not know this at first and added cat_file_blob_header parallel to cat_file_header, to basically check a smaller subset of objects.

An improvement that could be made, if the check takes too long on big repos, would be to split the cat_file_header object collection into object collections per type. That way only a subset of objects would be queried after the initial query for the cat_file_header collection.

@bytefluxio
Copy link
Contributor Author

@Byron What do I need to do, for tox to ignore my /venv/ dir? I tried to fun the flask8 tests locally, but it takes very long and spams me with issues, from /venv/

@Byron
Copy link
Member

Byron commented Jun 5, 2021

An improvement that could be made, if the check takes too long on big repos, would be to split the cat_file_header object collection into object collections per type. That way only a subset of objects would be queried after the initial query for the cat_file_header collection.

Without profiling the code I don't think there should be any action.

To my mind, object type checks are free (even when done in python) compared to the cost of two syscalls per checked object.

GitPython/git/cmd.py

Lines 1186 to 1187 in 595181d

cmd.stdin.write(self._prepare_ref(ref))
cmd.stdin.flush()

Please share your numbers though, I am always interested. Because I couldn't resist, I implemented an object exists' in gitoxide`, here are the numbers:

➜  gitoxide git:(main) ✗ cargo run --package object-access --release  -- tests/fixtures/repos/rust.git main
    Finished release [optimized] target(s) in 0.15s
     Running `target/release/object-access tests/fixtures/repos/rust.git main`
gitoxide: 1373760 objects (collected in 14.97025ms
parallel gitoxide : confirmed 1373760 objects exists in 19.842166ms (69234376 objects/s)
parallel gitoxide (cache = 57MB): confirmed 17396370301 bytes in 13.540528833s (101455 objects/s)
parallel gitoxide (small static cache): confirmed 17396370301 bytes in 14.009627916s (98058 objects/s)
parallel gitoxide (uncached): confirmed 17396370301 bytes in 13.587337s (101106 objects/s)
parallel libgit2:  confirmed 17396370301 bytes in 29.285405208s (46909 objects/s)
gitoxide (cache = 57MB): confirmed 17396370301 bytes in 64.446299458s (21316 objects/s)
gitoxide (small static cache): confirmed 17396370301 bytes in 64.950786333s (21151 objects/s)
gitoxide (uncached): confirmed 17396370301 bytes in 62.4331105s (22004 objects/s)
libgit2:  confirmed 17396370301 bytes in 65.2992965s (21038 objects/s)

Please note that in order to get the type of the object, in most cases one will have to decode it which is considerably more effort than an existence check alone, hence the numbers going down from 70mio objects/s to 100k objects/s on an M1 on the Rust repository. And even though a custom implementation could probably gain a few percent compared to decoding the whole object I wonder what the usecase is.

The way I see it is that if an object is referenced in a repository, it ought to exist there or else it's an error. I would love to understand your usecase though.

@Byron Byron added this to the v3.1.18 - Bugfixes milestone Jun 5, 2021
@Byron Byron merged commit 617c09e into gitpython-developers:main Jun 5, 2021
@bytefluxio
Copy link
Contributor Author

I come from a brownfield C# project that had performance issues all the time and recently switched Jobs to simulation of air traffic for air traffic controller training.
All that has drilled me to try to code in a way that does not waste operations.

In this case I was thinking, that if someone just wanted to check if a Tag is valid, it is unnecessary to get the object heads for all objects in a repo, since most of them would be commits and not other objects.
For my use case I actually need to check if a hash is valid, so the current solution is fine, plus I'm not even sure what difference it would make in our repository.

The use case for the check is as follows:

When we develop bugfixes on main or a bugfix branch, it is also often cherry picked to other release branches.
To differentiate between what versions a fix was merged into, we create sub-issues we call "merge records" that can be in the states of:
"merged": Merged into a branch (branch is noted on that "merge record" aswell
"incorporated": Incorporated into a version - means that there's a version that, who's commit is a descendant of that merge
"verified": Initial testing verified the changes to be in that version

And some others like "rejected" etc..

I'm currently automating the creating of the merge records and the incorporation. For that I have to check the validity of the merge-hash that is stated in the merge-record.

But if there's ever an issue regarding "object validity check too slow" or something of the like, or "partial_to_complete_sha_hex", too slow, I'd happily tackle that. Preferably in October :-P

@Byron
Copy link
Member

Byron commented Jun 6, 2021

Thanks for shedding some light!

This indeed is an interesting application and from where I am standing I would be surprised if ever there is an object non-existing. If everything happens in the same repository then the objects exist in the moment of creation, no matter where they have been cherry picked to. Maybe there are different repositories though, like a local clone and a server copy, whereas the script runs on the server. That would be unusual though as I would expect git push to be used to update refs.

Doing checks to learn if a commit is a descendent of some ref would be something I could imagine the tool doing, and believe there are git commands to do that quickly, too.

Last but not least, of course GitPython is always interested in improvements and if you get to profiling parts of it I will be curious to learn about the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants