-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Cleanup project tags #5983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup project tags #5983
Conversation
- Adds a management command to clean up tags - Cleanup involves lowercasing and slugifying (canonicalizing) - Future tags will come in canonicalized
Diving into review now, but immediately:
Generally speaking, I think site admin is a better place for most management commands, especially those that don't require input. These commands are easier to find and keeps site admin role abilities encapsulated in the site admin, instead of requiring ops access to servers for some selective functions. These functions are generally lost. If that is not the correct answer here, that is fine. But I like to steer us clear of management commands where possible. All of our management commands suffer from rot and usually reimplement logic that admin functions give us. |
Cleaning tags will probably take on the order of 1-2 hours. Importing tags from github will take on the order of 1-3 days at least the first time. Incremental may be only 1 dayish. I'm not sure an admin function or a data migration is appropriate for that. We could implement it as an admin function that calls the same logic on a per project basis but I do want to run this for all projects.
This is true for the cleaning aspect but not for deleting unused tags. Deleting unused tags should be fairly quick incrementally. When tags are first cleaned, there will be a lot to delete. For example, all tags with uppercase will be deleted. However, when a bunch of spam projects get deleted in the future, they will leave behind a bunch of spam tags which this management command will clean up on future runs. Possibly reformatting tags and removing unused ones should be separated. |
Roger, thanks for the background 👍 I think my reservations are mostly around frequency, but the length of run is another point. Even a 1-2 hour migration is pretty annoying, so a management command is ok, though it seems redundant functionality. For the tag import, what is the frequency we want to run this? Just once for now to test this data? A management command works as a first pass for bulk import, but I'd want to move this out of a management command eventually if we're running this frequently or this needs to be a scheduled task that ops needs to take. If we're working towards running this more frequently, there are plenty of options to automate this via celery. I'm guessing this would be something we're running once for now, would it be helpful to run this selectively on projects in addition? This also sounds like later work. |
For the import from github, I'm looking mostly just to run this once although once a year doesn't seem too bad. Having the ability to do it on a per project basis would be nice though. At the same time, connecting to the github API requires an API token which is separate from an app token (I think). Otherwise, the rate limit is ~60/hr. Canonicalizing tags only ever needs to run once after this is merged as future tags will be canonicalized. Removing unused tags is something I would probably like to run every time we delete a bunch of spam projects. Once spam deletion is more automated, this could also be automated. Deleting unused tags should also be pretty fast overall (a few minutes maybe, I'm guesstimating). To make this more concrete, how about this proposal:
Edit: updated proposal since we can re-use our existing github credentials |
It looks like we can use our app credentials (client ID and client secret). It doesn't give any permissions but does increase the rate limit which is all we need for this. |
- Make importing tags for a project an admin function - Use the github app Oauth credentials - Make removing unused tags a reusable function so it can be reused in the future by celery
Codecov Report
@@ Coverage Diff @@
## master #5983 +/- ##
==========================================
+ Coverage 79.14% 79.15% +0.01%
==========================================
Files 175 176 +1
Lines 11127 11133 +6
Branches 1375 1377 +2
==========================================
+ Hits 8806 8812 +6
Misses 1964 1964
Partials 357 357
Continue to review full report at Codecov.
|
Yeah I agree, it sounds like management command works fine here. There's probably a pattern for admin action against all projects, short of selecting "all projects" in the UI selector. I think ultimately full automation would the way to go with some of these functoins, and leave fiddling with the admin out of it. I do see some potential to selectively choose Projects from the admin though. |
I added an admin function to get tags for a single project and used the same logic in the management command. Regardless, I think this is ready for a full review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. 👍 It would be useful to have some tests here eventually, but getting it shipped is more valuable to me.
help = __doc__ | ||
|
||
def handle(self, *args, **options): | ||
queryset = Project.objects.filter(tags=None).filter(repo__contains='github.com') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should live in a manager for github/gitlab/bitbucket and other providers we support. I've seen these types of queries scattered around a lot, and would be good to standardize on an approach. Not necessary in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree and I wouldn't mind expanding them to the other providers.
self.stdout.write('Set tags for {}: {}'.format(project.slug, tags)) | ||
|
||
# Sleeping half a second should keep us under 5k requests/hour | ||
time.sleep(0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever. 👍
Notes & Possible Enhancements
GitHub actually limits the length of the tag which might not be a bad idea. Basically all tags longer than ~30 characters are certainly spam.
Limiting the number of tags for a project to some reasonable number like 20 isn't a bad idea.
I would definitely like to autocomplete tags but I think that's a task best left until after other design work