Adds repo.is_valid_object() check. #1267
Conversation
@Byron What do I need to do for tox to ignore my /venv/ dir? I tried to run the flake8 tests locally, but it takes very long and spams me with issues from /venv/.
Without profiling the code I don't think any action is needed. To my mind, object type checks are free (even when done in Python) compared to the cost of two syscalls per checked object. Lines 1186 to 1187 in 595181d
Please share your numbers though, I am always interested. Because I couldn't resist, I implemented an
Please note that in order to get the type of the object, in most cases one will have to decode it, which is considerably more effort than an existence check alone, hence the numbers going down from 70 million objects/s to 100k objects/s on an M1 on the Rust repository. And even though a custom implementation could probably gain a few percent compared to decoding the whole object, I wonder what the use case is. The way I see it, if an object is referenced in a repository, it ought to exist there or else it's an error. I would love to understand your use case though.
I come from a brownfield C# project that had performance issues all the time, and I recently switched jobs to simulation of air traffic for air traffic controller training. In this case I was thinking that if someone just wanted to check whether a tag is valid, it is unnecessary to get the object headers for all objects in a repo, since most of them would be commits and not other objects. The use case for the check is as follows: when we develop bugfixes on main or a bugfix branch, they are also often cherry-picked to other release branches. And some others like "rejected" etc.. I'm currently automating the creation of the merge records and the incorporation. For that I have to check the validity of the merge hash that is stated in the merge record. But if there's ever an issue along the lines of "object validity check too slow", or "partial_to_complete_sha_hex too slow", I'd happily tackle that. Preferably in October :-P
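The merge-hash validation described above can be sketched in plain Python. This is not GitPython's API: looks_like_full_sha and object_exists are hypothetical helper names, and the sketch assumes git is on PATH and shells out to git cat-file -e, which exits 0 exactly when the object exists in the repository.

```python
import re
import subprocess

# Full 40-character lowercase hex SHA-1, as found in a merge record.
_FULL_SHA_RE = re.compile(r"\A[0-9a-f]{40}\Z")


def looks_like_full_sha(candidate: str) -> bool:
    """Cheap syntactic pre-check before asking git anything."""
    return bool(_FULL_SHA_RE.match(candidate))


def object_exists(repo_dir: str, sha: str) -> bool:
    """Check object existence; `git cat-file -e <sha>` exits 0 iff it exists."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "cat-file", "-e", sha],
        capture_output=True,
    )
    return result.returncode == 0


def merge_hash_is_valid(repo_dir: str, sha: str) -> bool:
    """Validate a merge hash from a merge record: shape first, then existence."""
    return looks_like_full_sha(sha) and object_exists(repo_dir, sha)
```

The syntactic pre-check avoids spawning a git process for obviously malformed input, which matters if many merge records are validated in a loop.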
Thanks for shedding some light! This indeed is an interesting application, and from where I am standing I would be surprised if an object were ever non-existing. If everything happens in the same repository then the objects exist at the moment of creation, no matter where they have been cherry-picked to. Maybe there are different repositories though, like a local clone and a server copy, where the script runs on the server. That would be unusual though as I would expect Doing checks to learn if a commit is a descendant of some ref would be something I could imagine the tool doing, and I believe there are git commands to do that quickly, too. Last but not least, GitPython is of course always interested in improvements, and if you get to profiling parts of it I will be curious to learn about the results.
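One of the git commands alluded to above is git merge-base --is-ancestor, which answers the descendant question directly. A minimal sketch, assuming git is on PATH; interpret_is_ancestor and is_ancestor are hypothetical names, not GitPython API. The command exits 0 when the first commit is an ancestor of the second, 1 when it is not, and other codes on error.

```python
import subprocess


def interpret_is_ancestor(returncode: int) -> bool:
    """Map `git merge-base --is-ancestor` exit codes: 0 = yes, 1 = no, else error."""
    if returncode == 0:
        return True
    if returncode == 1:
        return False
    raise RuntimeError(f"git merge-base failed with exit code {returncode}")


def is_ancestor(repo_dir: str, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` in the given repo."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", ancestor, descendant],
        capture_output=True,
    )
    return interpret_is_ancestor(result.returncode)
```

This is fast because git only walks the commit graph; no object contents are decoded, which fits the performance concern discussed earlier in the thread.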
As discussed in #1266.
Unfortunately git cat-file --batch-check doesn't allow limiting the check to a specific object type, like a hypothetical git cat-file commit --batch-check would. I did not know this at first and added cat_file_blob_header parallel to cat_file_header, to basically check a smaller subset of objects. An improvement that could be made, if the check takes too long on big repos, would be to split the cat_file_header object collection into object collections per type. That way only a subset of objects would be queried after the initial query for the cat_file_header collection.
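The per-type split proposed above could be built from the --batch-check output itself. A minimal sketch, assuming the default output format of git cat-file --batch-check ("<sha> <type> <size>" per line, or "<sha> missing" for unknown objects); group_batch_check_output is a hypothetical helper, not part of GitPython.

```python
from collections import defaultdict


def group_batch_check_output(lines):
    """Group `git cat-file --batch-check` output lines by object type.

    Each well-formed line is '<sha> <type> <size>'; objects that do not
    exist are reported as '<sha> missing'. Returns {type: [sha, ...]},
    with missing objects collected under the 'missing' key.
    """
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] == "missing":
            groups["missing"].append(parts[0])
        elif len(parts) == 3:
            sha, obj_type, _size = parts
            groups[obj_type].append(sha)
    return dict(groups)


# Example with output as it might come back from one batch-check run:
sample = [
    "aaaa0000aaaa0000aaaa0000aaaa0000aaaa0000 commit 233",
    "bbbb0000bbbb0000bbbb0000bbbb0000bbbb0000 blob 10",
    "cccc0000cccc0000cccc0000cccc0000cccc0000 missing",
]
by_type = group_batch_check_output(sample)
```

After one pass like this, later validity checks for a given type (e.g. only tags or only commits) could consult the per-type collection instead of re-querying every object in the repository.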