Skip to content

subtree: Fix handling of complex history #493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: maint
Choose a base branch
from
Open
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 61 additions & 11 deletions contrib/subtree/git-subtree.sh
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ b,branch= create a new branch from the split subtree
ignore-joins ignore prior --rejoin commits
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> @@ -48,6 +49,7 @@ annotate=
>  squash=
>  message=
>  prefix=
> +clearcache=

It might be more consistent to call it `clear_cache` (i.e. with an
underscore), just like `ignore_joins`.

>
>  debug () {
>  	if test -n "$debug"
> @@ -131,6 +133,9 @@ do
>  	--no-rejoin)
>  		rejoin=
>  		;;
> +	--clear-cache)
> +		clearcache=1
> +		;;
>  	--ignore-joins)
>  		ignore_joins=1
>  		;;
> @@ -206,9 +211,13 @@ debug "opts: {$*}"
>  debug
>
>  cache_setup () {
> -	cachedir="$GIT_DIR/subtree-cache/$$"
> -	rm -rf "$cachedir" ||
> -		die "Can't delete old cachedir: $cachedir"
> +	cachedir="$GIT_DIR/subtree-cache/$prefix"

Excellent, the `prefix` should be "unique enough".

> +	if test -n "$clearcache"
> +	then
> +		debug "Clearing cache"
> +		rm -rf "$cachedir" ||
> +			die "Can't delete old cachedir: $cachedir"
> +	fi
>  	mkdir -p "$cachedir" ||
>  		die "Can't create new cachedir: $cachedir"
>  	mkdir -p "$cachedir/notree" ||
> @@ -266,6 +275,16 @@ cache_set () {
>  	echo "$newrev" >"$cachedir/$oldrev"
>  }
>
> +cache_set_if_unset () {
> +	oldrev="$1"
> +	newrev="$2"

`local`? ;-)

> +	if test -e "$cachedir/$oldrev"
> +	then
> +		return
> +	fi
> +	echo "$newrev" >"$cachedir/$oldrev"

So that directory contains commit mappings, a file for each mapped
revision.

Thinking back to patch 2/11, I am now no longer that sure that it makes
sense to fill it up with every commit in that commit range: performance
suffers when directories contain too many files.

For example, I had a case in the past where it took a minute just to
enumerate a directory, and even looking whether a file existed in that
directory was not exactly fun.

In any case, I would write it slightly shorter:

	test -e "$cachedir/$oldrev" ||
	echo "$newrev" >"$cachedir/$oldrev"

> +}
> +
>  rev_exists () {
>  	if git rev-parse "$1" >/dev/null 2>&1
>  	then
> @@ -375,13 +394,13 @@ find_existing_splits () {
>  			then
>  				# squash commits refer to a subtree
>  				debug "  Squash: $sq from $sub"
> -				cache_set "$sq" "$sub"
> +				cache_set_if_unset "$sq" "$sub"
>  			fi
>  			if test -n "$main" -a -n "$sub"
>  			then
>  				debug "  Prior: $main -> $sub"
> -				cache_set $main $sub
> -				cache_set $sub $sub
> +				cache_set_if_unset $main $sub
> +				cache_set_if_unset $sub $sub
>  				try_remove_previous "$main"
>  				try_remove_previous "$sub"
>  			fi
> @@ -688,6 +707,8 @@ process_split_commit () {
>  		if test -n "$newparents"
>  		then
>  			cache_set "$rev" "$rev"
> +		else
> +			cache_set "$rev" ""

Was this hunk intended to be snuck in here? I can understand the
s/cache_set/cache_set_if_unset/ changes, of course, but not this hunk.

>  		fi
>  		return
>  	fi
> @@ -785,7 +806,7 @@ cmd_split () {
>  			# the 'onto' history is already just the subdir, so
>  			# any parent we find there can be used verbatim
>  			debug "  cache: $rev"
> -			cache_set "$rev" "$rev"
> +			cache_set_if_unset "$rev" "$rev"
>  		done
>  	fi
>
> @@ -798,7 +819,7 @@ cmd_split () {
>  		git rev-list --topo-order --skip=1 $mainline |
>  		while read rev
>  		do
> -			cache_set "$rev" ""
> +			cache_set_if_unset "$rev" ""

Okay. A quite interesting question now would be: are there any callers of
`cache_set` left? If so, why?

Thanks,
Dscho

>  		done || exit $?
>  	fi
>
> --
> gitgitgadget
>
>

onto= try connecting new tree to an existing one
rejoin merge the new branch back into HEAD
clear-cache reset the subtree mapping cache
options for 'add', 'merge', and 'pull'
squash merge subtree changes as a single commit
"
Expand All @@ -48,6 +49,7 @@ annotate=
squash=
message=
prefix=
clearcache=

debug () {
if test -n "$debug"
Expand Down Expand Up @@ -131,6 +133,9 @@ do
--no-rejoin)
rejoin=
;;
--clear-cache)
clearcache=1
;;
--ignore-joins)
ignore_joins=1
;;
Expand Down Expand Up @@ -206,9 +211,13 @@ debug "opts: {$*}"
debug

cache_setup () {
cachedir="$GIT_DIR/subtree-cache/$$"
rm -rf "$cachedir" ||
die "Can't delete old cachedir: $cachedir"
cachedir="$GIT_DIR/subtree-cache/$prefix"
if test -n "$clearcache"
then
debug "Clearing cache"
rm -rf "$cachedir" ||
die "Can't delete old cachedir: $cachedir"
fi
mkdir -p "$cachedir" ||
die "Can't create new cachedir: $cachedir"
mkdir -p "$cachedir/notree" ||
Expand Down Expand Up @@ -238,13 +247,13 @@ cache_miss () {
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ed Maste wrote (reply to this):

On Tue, 6 Oct 2020 at 18:05, Tom Clarkson via GitGitGadget
<[email protected]> wrote:
>
> From: Tom Clarkson <[email protected]>
>
> Signed-off-by: Tom Clarkson <[email protected]>
Reviewed-by: Ed Maste <[email protected]>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <[email protected]>
>
> Include recursion depth in debug logs so we can see when the recursion is
> getting out of hand.
>
> Making the cache handle null mappings correctly and adding older commits
> to the cache allows the recursive algorithm to terminate at any point on
> mainline rather than needing to reach either the add point or the initial
> commit.

Makes sense.

> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index 9867718503..160bad95c1 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -244,7 +244,7 @@ check_parents () {
>  	do
>  		if ! test -r "$cachedir/notree/$miss"
>  		then
> -			debug "  incorrect order: $miss"
> +			debug "  unprocessed parent commit: $miss ($indent)"

Without any context, it is hard to understand what the `$indent` variable
is supposed to mean, so it is unclear why we need to print it here.

I _guess_ it is the degree removed from the first-parent lineage?

In any case, it does not hurt here, so I trust that it is good to include
it in the debug output.

>  			process_split_commit "$miss" "" "$indent"
>  		fi
>  	done
> @@ -392,6 +392,24 @@ find_existing_splits () {
>  	done
>  }
>
> +find_mainline_ref () {
> +	debug "Looking for first split..."
> +	dir="$1"
> +	revs="$2"

The `git-subtree` script seems to rely on the `local` construct, using it
in plenty of other circumstances. How about using it here, too?

> +
> +	git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
> +		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |

Since all you are interested in is the `git-subtree-mainline:` trailer,
wouldn't a format like `%(trailers:key=git-subtree-mainline)` instead of
`START %H%n%s%n%n%b%nEND%n`?

See
https://git-scm.com/docs/git-log#Documentation/git-log.txt-emtrailersoptionsem
for more information about pretty formats.

BTW I am super unfamiliar with `git subtree`'s inner workings, and
therefore it would help me incredibly if the commit message talked a bit
about the commit message layout (with a particular eye on
`git-subtree-dir` and `git-subtree-mainline` which I guess are trailers
added by `git subtree`?)...

> +	while read a b junk
> +	do
> +		case "$a" in
> +		git-subtree-mainline:)
> +			echo "$b"
> +			return
> +			;;
> +		esac
> +	done
> +}
> +
>  copy_commit () {
>  	# We're going to set some environment vars here, so
>  	# do it in a subshell to get rid of them safely later
> @@ -646,9 +664,9 @@ process_split_commit () {
>
>  	progress "$revcount/$revmax ($createcount) [$extracount]"
>
> -	debug "Processing commit: $rev"
> +	debug "Processing commit: $rev ($indent)"
>  	exists=$(cache_get "$rev")
> -	if test -n "$exists"
> +	if test -z "$(cache_miss "$rev")"
>  	then
>  		debug "  prior: $exists"

I do not see the `exists` variable being used other than for the debug
statement. Maybe better something like this?

	debug "  prior found for $rev"

>  		return
> @@ -773,6 +791,17 @@ cmd_split () {
>
>  	unrevs="$(find_existing_splits "$dir" "$revs")"
>
> +	mainline="$(find_mainline_ref "$dir" "$revs")"
> +	if test -n "$mainline"
> +	then
> +		debug "Mainline $mainline predates subtree add"
> +		git rev-list --topo-order --skip=1 $mainline |
> +		while read rev
> +		do
> +			cache_set "$rev" ""

Ah, so they are not really "null mappings", but mapped to an empty string.
Makes sense. Maybe adjust the commit message?

> +		done || exit $?
> +	fi
> +
>  	# We can't restrict rev-list to only $dir here, because some of our
>  	# parents have the $dir contents the root, and those won't match.
>  	# (and rev-list --follow doesn't seem to solve this)
> --
> gitgitgadget
>
>


check_parents () {
missed=$(cache_miss "$1")
missed=$(cache_miss $1)
local indent=$(($2 + 1))
for miss in $missed
do
if ! test -r "$cachedir/notree/$miss"
then
debug " incorrect order: $miss"
debug " unprocessed parent commit: $miss ($indent)"
process_split_commit "$miss" "" "$indent"
fi
done
Expand All @@ -266,6 +275,16 @@ cache_set () {
echo "$newrev" >"$cachedir/$oldrev"
}

cache_set_if_unset () {
oldrev="$1"
newrev="$2"
if test -e "$cachedir/$oldrev"
then
return
fi
echo "$newrev" >"$cachedir/$oldrev"
}

rev_exists () {
if git rev-parse "$1" >/dev/null 2>&1
then
Expand Down Expand Up @@ -375,13 +394,13 @@ find_existing_splits () {
then
# squash commits refer to a subtree
debug " Squash: $sq from $sub"
cache_set "$sq" "$sub"
cache_set_if_unset "$sq" "$sub"
fi
if test -n "$main" -a -n "$sub"
then
debug " Prior: $main -> $sub"
cache_set $main $sub
cache_set $sub $sub
cache_set_if_unset $main $sub
cache_set_if_unset $sub $sub
try_remove_previous "$main"
try_remove_previous "$sub"
fi
Expand All @@ -392,6 +411,24 @@ find_existing_splits () {
done
}

find_mainline_ref () {
debug "Looking for first split..."
dir="$1"
revs="$2"

git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |
while read a b junk
do
case "$a" in
git-subtree-mainline:)
echo "$b"
return
;;
esac
done
}

copy_commit () {
# We're going to set some environment vars here, so
# do it in a subshell to get rid of them safely later
Expand Down Expand Up @@ -646,9 +683,9 @@ process_split_commit () {

progress "$revcount/$revmax ($createcount) [$extracount]"

debug "Processing commit: $rev"
debug "Processing commit: $rev ($indent)"
exists=$(cache_get "$rev")
if test -n "$exists"
if test -z "$(cache_miss "$rev")"
then
debug " prior: $exists"
return
Expand All @@ -670,6 +707,8 @@ process_split_commit () {
if test -n "$newparents"
then
cache_set "$rev" "$rev"
else
cache_set "$rev" ""
fi
return
fi
Expand Down Expand Up @@ -767,12 +806,23 @@ cmd_split () {
# the 'onto' history is already just the subdir, so
# any parent we find there can be used verbatim
debug " cache: $rev"
cache_set "$rev" "$rev"
cache_set_if_unset "$rev" "$rev"
done
fi

unrevs="$(find_existing_splits "$dir" "$revs")"

mainline="$(find_mainline_ref "$dir" "$revs")"
if test -n "$mainline"
then
debug "Mainline $mainline predates subtree add"
git rev-list --topo-order --skip=1 $mainline |
while read rev
do
cache_set_if_unset "$rev" ""
done || exit $?
fi

# We can't restrict rev-list to only $dir here, because some of our
# parents have the $dir contents the root, and those won't match.
# (and rev-list --follow doesn't seem to solve this)
Expand Down