Adds repo.is_valid_object() check. #1267

bytefluxio · 2021-06-05T10:32:25Z

As discussed in #1266.

Unfortunately git cat-file --batch-check doesn't allow for specific object limits like git cat-file commit --batch-check,
I did not know this at first and added cat_file_blob_header parallel to cat_file_header, to basically check a smaller subset of objects.

An improvement that could be made, if the check takes too long on big repos, would be to split the cat_file_header object collection into object collections per type. That way only a subset of objects would be queried after the initial query for the cat_file_header collection.

bytefluxio · 2021-06-05T10:37:27Z

@Byron What do I need to do, for tox to ignore my /venv/ dir? I tried to fun the flask8 tests locally, but it takes very long and spams me with issues, from /venv/

Byron · 2021-06-05T14:03:35Z

An improvement that could be made, if the check takes too long on big repos, would be to split the cat_file_header object collection into object collections per type. That way only a subset of objects would be queried after the initial query for the cat_file_header collection.

Without profiling the code I don't think there should be any action.

To my mind, object type checks are free (even when done in python) compared to the cost of two syscalls per checked object.

GitPython/git/cmd.py

Lines 1186 to 1187 in 595181d

    
           cmd.stdin.write(self._prepare_ref(ref)) 
        
           cmd.stdin.flush()

Please share your numbers though, I am always interested. Because I couldn't resist, I implemented an object exists' in gitoxide`, here are the numbers:

➜  gitoxide git:(main) ✗ cargo run --package object-access --release  -- tests/fixtures/repos/rust.git main
    Finished release [optimized] target(s) in 0.15s
     Running `target/release/object-access tests/fixtures/repos/rust.git main`
gitoxide: 1373760 objects (collected in 14.97025ms
parallel gitoxide : confirmed 1373760 objects exists in 19.842166ms (69234376 objects/s)
parallel gitoxide (cache = 57MB): confirmed 17396370301 bytes in 13.540528833s (101455 objects/s)
parallel gitoxide (small static cache): confirmed 17396370301 bytes in 14.009627916s (98058 objects/s)
parallel gitoxide (uncached): confirmed 17396370301 bytes in 13.587337s (101106 objects/s)
parallel libgit2:  confirmed 17396370301 bytes in 29.285405208s (46909 objects/s)
gitoxide (cache = 57MB): confirmed 17396370301 bytes in 64.446299458s (21316 objects/s)
gitoxide (small static cache): confirmed 17396370301 bytes in 64.950786333s (21151 objects/s)
gitoxide (uncached): confirmed 17396370301 bytes in 62.4331105s (22004 objects/s)
libgit2:  confirmed 17396370301 bytes in 65.2992965s (21038 objects/s)

Please note that in order to get the type of the object, in most cases one will have to decode it which is considerably more effort than an existence check alone, hence the numbers going down from 70mio objects/s to 100k objects/s on an M1 on the Rust repository. And even though a custom implementation could probably gain a few percent compared to decoding the whole object I wonder what the usecase is.

The way I see it is that if an object is referenced in a repository, it ought to exist there or else it's an error. I would love to understand your usecase though.

bytefluxio · 2021-06-06T08:37:14Z

I come from a brownfield C# project that had performance issues all the time and recently switched Jobs to simulation of air traffic for air traffic controller training.
All that has drilled me to try to code in a way that does not waste operations.

In this case I was thinking, that if someone just wanted to check if a Tag is valid, it is unnecessary to get the object heads for all objects in a repo, since most of them would be commits and not other objects.
For my use case I actually need to check if a hash is valid, so the current solution is fine, plus I'm not even sure what difference it would make in our repository.

The use case for the check is as follows:

When we develop bugfixes on main or a bugfix branch, it is also often cherry picked to other release branches.
To differentiate between what versions a fix was merged into, we create sub-issues we call "merge records" that can be in the states of:
"merged": Merged into a branch (branch is noted on that "merge record" aswell
"incorporated": Incorporated into a version - means that there's a version that, who's commit is a descendant of that merge
"verified": Initial testing verified the changes to be in that version

And some others like "rejected" etc..

I'm currently automating the creating of the merge records and the incorporation. For that I have to check the validity of the merge-hash that is stated in the merge-record.

But if there's ever an issue regarding "object validity check too slow" or something of the like, or "partial_to_complete_sha_hex", too slow, I'd happily tackle that. Preferably in October :-P

Byron · 2021-06-06T23:51:40Z

Thanks for shedding some light!

This indeed is an interesting application and from where I am standing I would be surprised if ever there is an object non-existing. If everything happens in the same repository then the objects exist in the moment of creation, no matter where they have been cherry picked to. Maybe there are different repositories though, like a local clone and a server copy, whereas the script runs on the server. That would be unusual though as I would expect git push to be used to update refs.

Doing checks to learn if a commit is a descendent of some ref would be something I could imagine the tool doing, and believe there are git commands to do that quickly, too.

Last but not least, of course GitPython is always interested in improvements and if you get to profiling parts of it I will be curious to learn about the results.

bytefluxio added 4 commits June 5, 2021 12:15

Adds repo.is_valid_object check

385a8c6

Fixes type check for is_valid_object

ac4fe6e

Removes local variable 'e' that is assigned to but never used

4832aa6

Removes unused import

fb2461d

Removes f-string syntax for p35 compatibility

464504c

Byron added this to the v3.1.18 - Bugfixes milestone Jun 5, 2021

Byron merged commit 617c09e into gitpython-developers:main Jun 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds repo.is_valid_object() check. #1267

Adds repo.is_valid_object() check. #1267

bytefluxio commented Jun 5, 2021 •

edited

Loading

bytefluxio commented Jun 5, 2021

Byron commented Jun 5, 2021

bytefluxio commented Jun 6, 2021

Byron commented Jun 6, 2021

Adds repo.is_valid_object() check. #1267

Adds repo.is_valid_object() check. #1267

Conversation

bytefluxio commented Jun 5, 2021 • edited Loading

bytefluxio commented Jun 5, 2021

Byron commented Jun 5, 2021

bytefluxio commented Jun 6, 2021

Byron commented Jun 6, 2021

bytefluxio commented Jun 5, 2021 •

edited

Loading