Description
I'm trying to get the latest commit for each item in a list of Tree or Blob objects (something like a repo browser). Currently I'm doing something like:
```python
import git

repo = git.Repo("test.git")
tree = repo.tree()
for obj in tree:
    print(obj, obj.path, next(repo.iter_commits(paths=obj.path, max_count=1)))
```
but this is incredibly slow. Another solution is to use `repo.iter_commits()` to fill a dict of path <-> commit:
```python
# paths is assumed to hold the tree-entry paths we want to resolve
latest_commits = {}
for commit in repo.iter_commits(paths=paths):
    for f in commit.stats.files.keys():
        p = f[:f.index('/')] if '/' in f else f
        if p in latest_commits:
            continue
        latest_commits[p] = commit
```
However, in the worst case this one iterates over all commits in the entire repository (which is obviously a very bad idea).
Is there a faster solution?
Activity
Byron commented on Jan 19, 2015
I don't think so. My general advice is to use a `GitCmdObjectDB` in your `Repo` instance - this might already improve performance.

The operation you are trying to perform is inherently expensive, and I believe there is no better way if caches cannot be used. The second approach seems best, as it inherently uses the git command to apply the path filter, and then Python to find the actual commit per path. However, `commit.stats` is implemented in pure Python, which could be a bottleneck.

Also, I believe you should profile your application to actually see where the time is spent - maybe more ideas arise from that.
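A minimal sketch of both suggestions, assuming the test.git path from the question (the `odbt` keyword is how GitPython selects the object-database backend; `cProfile` is just the standard-library profiler):

```python
import cProfile

import git

# Explicitly back the object database by the git command line,
# as suggested above.
repo = git.Repo("test.git", odbt=git.GitCmdObjectDB)

# Profile a representative slice of the workload to see where
# the time is actually spent.
cProfile.runctx(
    "list(repo.iter_commits(max_count=100))",
    globals(), locals(),
    sort="cumulative",
)
```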
I'd be interested to learn about your findings - please feel free to post them here.
Please also note that I will close this issue when 0.3.6 is due for release.
ercpe commented on Jan 20, 2015
Thanks for your reply. Here are some findings from my tests.
The repository is a bare clone of Torvalds' Linux sources - the biggest git repo I'm aware of.
test1.py and test2.py:
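The script bodies did not survive here. A plausible reconstruction of test1.py, assuming it simply times the tree-walking approach from the question against a local bare kernel clone at linux.git (test2.py presumably differed only in passing `odbt=git.GitCmdObjectDB`, mirroring the test3/test4 pair below):

```python
# Hedged reconstruction of test1.py, not the original script: time the
# "one iter_commits() call per tree entry" approach.
import time

import git

repo = git.Repo("linux.git")  # assumed path of the bare kernel clone

start = time.time()
for obj in repo.tree():
    latest = next(repo.iter_commits(paths=obj.path, max_count=1))
    print(obj.path, latest.hexsha)
print("took %.2fs" % (time.time() - start))
```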
test1.py and test2.py are consistently close across runs - ranging from 1.5 to 2.1 sec. On the first run, test2 was much faster (an fs cache effect?).

test3.py:
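test3.py's body is also missing; judging from the description below, it was the dict-filling approach from the question with a stop condition, roughly:

```python
# Hedged reconstruction of test3.py: walk history once and record the
# first (i.e. most recent) commit touching each top-level path.
import time

import git

repo = git.Repo("linux.git")  # same assumed clone as above
top_level = {obj.path for obj in repo.tree()}

start = time.time()
latest_commits = {}
for n, commit in enumerate(repo.iter_commits(), 1):
    for f in commit.stats.files:
        p = f[:f.index('/')] if '/' in f else f
        latest_commits.setdefault(p, commit)
    if top_level <= latest_commits.keys():
        break  # every top-level entry now has its latest commit
print("took %.2fs, looked at %d commits" % (time.time() - start, n))
```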
test3.py completes in just over a minute and looks at over 6000 commits. `commit.stats.files.keys()` can contain something like `{virt => arch/x86}/kvm/ioapic.h` when a rename happens, so my test code may be at fault. However, even before the loop hits that commit, it has already exceeded the time of test1 / test2.

Here is a call graph of test3:
If I read it correctly, `git.cmd.Git.execute` is called for every commit.

test4.py is the same as test3.py but with `odbt=git.GitCmdObjectDB`. I haven't managed to get it to finish. The process is probably still looking for revisions...

Byron commented on Jan 20, 2015
Thanks for sharing your results! I was quite surprised to see `git.cmd.Git.execute` being invoked once per commit.
It would of course be even more interesting to see how fast pygit2 (bindings to libgit2) can be - apparently they have the required `stats()` function as well.

ercpe commented on Jan 21, 2015
I've redone the test with test1 and test2 - test2 is on average only slightly faster (about 500 ms).
I have looked at pygit2, which feels faster but has a much worse interface. Unfortunately, `git_diff_get_stats` isn't implemented yet (libgit2/pygit2#406) :/

Byron commented on Jan 21, 2015
Ah, I was confused for a moment, but finally understood that pygit2 just doesn't bind to it yet. Interestingly, they implement these bindings manually! It's amazing that people still do this nowadays, as I believed binding generators were the standard way to approach this.

Maybe with a little bit of luck they will get to it soon, so you can give that one a shot.
graingert commented on May 18, 2016
@ercpe looks like it's implemented now
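For reference, a minimal sketch of reading diff stats through pygit2, assuming a local clone at linux.git (`Diff.stats` is the binding for libgit2's `git_diff_get_stats`):

```python
import pygit2

repo = pygit2.Repository("linux.git")
head = repo[repo.head.target]

for commit in repo.walk(head.id, pygit2.GIT_SORT_TOPOLOGICAL):
    if not commit.parents:
        continue  # root commits have nothing to diff against
    diff = repo.diff(commit.parents[0], commit)
    stats = diff.stats  # per-diff totals, as git_diff_get_stats provides
    print(commit.short_id, stats.files_changed, stats.insertions, stats.deletions)
    break  # demonstrate on the most recent non-root commit only
```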