Skip to content

Build: replace GitPython with git commands #10594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 9, 2023
Merged

Build: replace GitPython with git commands #10594

merged 11 commits into from
Aug 9, 2023

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Aug 2, 2023

To unambiguously parse the submodules, we need to use the --null option, that results on each item being separated by a null character, all good, except that we are "sanitizing" the output of all commands, stripping those null characters. We are doing this to avoid problems when saving this in the DB, so instead of "sanitizing" the output after the command is run, I'm moving that logic to just when it's needed, this is when saving the command using the API.

Closes https://github.com/readthedocs/readthedocs-corporate/issues/1524
Closes #8990

@dataclass(slots=True)
class GitSubmodule:
path: str
url: str
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the path and URL of each submodule, since we validate that each URL has a valid/supported schema, we could remove this check if we want, it would simplify the code a little :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we would just need to call Git.submodules when the user is excluding submodules only, when including some or all, we can just call the git submodule update with an empty list or the list of given submodules.

Copy link
Member Author

@stsewd stsewd Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if the user provides submodules that aren't valid, well, git will just take care of showing the correct error message, right now we are just excluding invalid submodules from the list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a preference here?

I don't remember why we were checking if the URL has a supported scheme. Maybe that will help me to have a more informed opinion 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agjohnson may have some input here. From what I see, this was done, so users wouldn't use private submodules (SSH) .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, I seem to recall some issues with repositories defining bogus submodules, or having old and unused submodules defined. I don't think this was to limit the URL pattern to supported protocols.

It could have also been a security measure, as it was probably possible to define a submodule like /etc/ or something and clone from something outside the container (when we were performing submodule operations outside containers).

Any of this seem familiar or likely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stealing internal repos could have been a possibility before moving the execution to docker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds familiar to me the validation of the submodules, but this PR should fix that automatically because we trust git itself now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm thinking if any of the above are true, it was probably a throw back to when we weren't executing the submodule steps inside Docker. In any case that would seem this logic can be removed, but if anyone is hesitant about it, I'm not saying this is guaranteed to be safe either.

Copy link
Member

@humitos humitos Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I'd say this is up to you @stsewd since you are the security master here 😄 and the one that has been working on this chunk of code.

return bool(self.submodules)
# TODO: remove when all projects are required
# to have a config file.
return any(self.submodules)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in 2 calls to git config for projects that don't have a config file.

@stsewd stsewd marked this pull request as ready for review August 2, 2023 21:21
@stsewd stsewd requested a review from a team as a code owner August 2, 2023 21:21
@stsewd stsewd requested a review from humitos August 2, 2023 21:21
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good 👍🏼 . I have a few questions and suggestions to continue talking about.

However, the most important thing I want to ask here is "Have you tested this with GIT_CLONE_FETCH_CHECKOUT_PATTERN feature flag?". We have that enabled by default to all projects on production and we are not using the old pattern anymore.

Comment on lines +183 to +184
except (TypeError, AttributeError):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to do log.warning here, so we can easily realize if there is a problem from NR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a log here, but we will probably see a spike in logs, since we weren't logging these before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we have lot of command's outputs that are not possible to decode? That shouldn't be normal, I'd say.

valid, submodules = repo.validate_submodules(self.dummy_conf)
self.assertTrue(valid)
self.assertEqual(
list(submodules), ["foobar", "path with spaces", "another-submodule"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the foobar submodule comes from? I don't see it on the file written in disk.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's defined in the submodule branch of the test git repo.

@dataclass(slots=True)
class GitSubmodule:
path: str
url: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a preference here?

I don't remember why we were checking if the URL has a supported scheme. Maybe that will help me to have a more informed opinion 😄

@humitos
Copy link
Member

humitos commented Aug 3, 2023

Oh, one thing that I missed here. You should remove gitpython from the dependencies.

@stsewd
Copy link
Member Author

stsewd commented Aug 3, 2023

Oh, one thing that I missed here. You should remove gitpython from the dependencies.

We still use it in the code that gets tags and branches.

@stsewd
Copy link
Member Author

stsewd commented Aug 3, 2023

Have you tested this with GIT_CLONE_FETCH_CHECKOUT_PATTERN feature flag?".

yeah, I did.

@humitos
Copy link
Member

humitos commented Aug 3, 2023

We still use it in the code that gets tags and branches.

I understood this PR was meant to use git command to get these as well and be able to remove gitpython.

@stsewd
Copy link
Member Author

stsewd commented Aug 3, 2023

I understood this PR was meant to use git command to get these as well and be able to remove gitpython.

Looks like that's still under a feature flag/logic

use_lsremote = vcs_repository.supports_lsremote and (
self.data.project.has_feature(Feature.GIT_CLONE_FETCH_CHECKOUT_PATTERN)
or not vcs_repository.repo_exists()
)

Looks like #8968 is removing that logic.

@stsewd stsewd requested a review from humitos August 8, 2023 21:42
@stsewd
Copy link
Member Author

stsewd commented Aug 8, 2023

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two code comments. Other than that, we are ready to merge it I'd say. We can come back and refactor the other small pieces we are talking in the comments if we want to.

@humitos
Copy link
Member

humitos commented Aug 9, 2023

I'm merging this so I can move forward with other PRs that depend on this one.

@humitos humitos merged commit fd9559b into main Aug 9, 2023
@humitos humitos deleted the replace-gitpython branch August 9, 2023 14:09
stsewd added a commit that referenced this pull request Aug 15, 2023
We are always querying for submodules,
even when the user explicitly listed the submodules to include.

We also have the option to include all submodules,
instead of listing each one of them, we can just
omit them, git will automatically default to include all of them.

Ref #10594 (comment)
stsewd added a commit that referenced this pull request Aug 23, 2023
We are always querying for submodules,
even when the user explicitly listed the submodules to include.

We also have the option to include all submodules,
instead of listing each one of them, we can just
omit them, git will automatically default to include all of them.

Ref #10594 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Version syncing: repositories with too many branches/tags fail on syncing
3 participants