Add Initial Modeling with Through Model and Data Migration for RemoteRepository Model #7536
Conversation
readthedocs/oauth/migrations/0013_data_migration_for_remote_relation_model.py
    relations_queryset = RemoteRelation.objects.all().select_related('remoterepository')
    remote_relations = remote_relations_generator(relations_queryset)
    batch_size = 5000
Not sure what the size of each batch should be, but if we do all of them at once it will create a huge SQL query.
Yeah, this is one of my concerns with these migrations. I don't have a good answer yet and we may need to ask somebody else for advice here.
The RemoteRepository table is incredibly huge. I already had to trigger the creation of an INDEX on this table manually without blocking things (concurrently) because it took hours.
In this case, we will need to find a way to not block the DB while deploying these changes.
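The snippet under discussion batches the queryset before writing. A minimal sketch of the chunking idea, in plain Python; the commented Django usage below it reuses names from this PR's snippet and is illustrative only (the exact fields passed to bulk_update are an assumption):

```python
import itertools


def chunks(iterable, size):
    """Lazily yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical use inside the RunPython forwards() of migration 0013
# (names taken from the PR's snippet; field list is an assumption):
#
#     queryset = RemoteRelation.objects.all().select_related('remoterepository')
#     for batch in chunks(remote_relations_generator(queryset.iterator()), 5000):
#         RemoteRelation.objects.bulk_update(batch, ['json'])
```

Keeping each batch bounded means each UPDATE touches at most `batch_size` rows, so no single statement holds locks on the whole table.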
readthedocs/oauth/migrations/0012_add_fields_to_remote_relation_model.py
This is going in a good direction IMO. The PR description was super useful to understand what you did 💯
I think there are probably no more big modeling changes here. The rest of the work is more about "how to deploy it without breaking things".
We will need a plan for:
- not killing the DB while running these migrations
  - how can we improve the migration?
  - is it enough just to lower the batch_size?
  - will it block the DB while running the migration?
- keeping compatibility with all our existing features that use these models
  - are we deploying the model changes and data migrations together with the Python code changes that use these models? (https://github.com/readthedocs/readthedocs.org/pull/7169/files#diff-12aacc84bda4107192b54a46c0922fb2R63-R79)
  - doing everything at once sounds risky to me (we need to update the .org & .com code all at once)
  - is it possible to find a way of doing incremental changes here (migrate feature by feature)?
Unfortunately, I don't have the answer to most of those questions yet, but it's definitely something we need to talk about and have a plan for.
I'm not sure if you already did a test locally and ran these queries, but it would be useful if you could connect your social accounts (GitHub, GitLab and Bitbucket) to your local instance and double check that the data migration and structure are all in place.
Thanks for helping us on this, really appreciated! ❤️
@@ -90,14 +91,7 @@ class RemoteRepository(models.Model):
        User,
        verbose_name=_('Users'),
        related_name='oauth_repositories',
    )
    account = models.ForeignKey(
We should probably add the remote_id field here as well and populate it with the id that is in remoterepository.json.
If we are not doing it in this PR, we should create an issue to keep track of it. We will need it to remove the usage of full_name in many places.
I created a project at https://github.com/orgs/readthedocs/projects/74?add_cards_query=is%3Aopen where we can track all of the related PRs and issues. Let me know if you have permissions.
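Populating remote_id boils down to pulling the provider's id out of the stored json payload. A hedged sketch of that extraction step (the key name `id` and the idea of tolerating bad payloads are assumptions based on the comment above, not the PR's code):

```python
import json


def extract_remote_id(raw_json):
    """Return the provider-side `id` from a RemoteRepository.json blob, or None.

    Accepts either a JSON string or an already-decoded dict, and returns
    None for malformed or missing payloads instead of raising.
    """
    try:
        data = json.loads(raw_json) if isinstance(raw_json, str) else raw_json
    except (TypeError, ValueError):
        return None
    return data.get('id') if isinstance(data, dict) else None
```

A data migration could then loop over repositories in batches and set `remote_id = extract_remote_id(repo.json)` before the json field is removed.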
Okay sure, maybe we can do the data migration for this and #7536 (comment) first.
This PR will remove the json field from the RemoteRepository model, so I think it's better to do this first in a different PR or make it the first migration in this PR.
@humitos What do you say?
I created a project at https://github.com/orgs/readthedocs/projects/74?add_cards_query=is%3Aopen where we can track all of the related PR and issues. Let me know if you have permissions.
@humitos I cannot see some of the cards; maybe they are from a private repo or something. Otherwise it's all good 👍
    migrations.RemoveField(
        model_name='remoterepository',
        name='account',
    ),
Now that we are moving the SocialAccount to the RemoteRelation, shouldn't we have a provider field or similar on the RemoteRepository to know which service this repository belongs to?
Yes! Great idea 💯 It will be better to add a provider field so we don't have to do an extra query to know the provider name, and if the user is deleted or disconnected we don't lose that data.
Some thoughts on the migration of this data here: #7536 (comment)
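One way a provider field could work: store a fixed choice set on the repository and normalize it from the connected allauth account. All names below are illustrative assumptions, not the PR's final code:

```python
# Hypothetical provider choices for RemoteRepository.provider, kept on the
# repository itself so the value survives account disconnection.
GITHUB, GITLAB, BITBUCKET = 'github', 'gitlab', 'bitbucket'
PROVIDER_CHOICES = (
    (GITHUB, 'GitHub'),
    (GITLAB, 'GitLab'),
    (BITBUCKET, 'Bitbucket'),
)


def provider_from_account(account_provider):
    """Normalize an allauth provider id to one of our stored choices, or None."""
    valid = {value for value, _ in PROVIDER_CHOICES}
    provider = (account_provider or '').lower()
    return provider if provider in valid else None
```

In a Django model this would back something like `provider = models.CharField(max_length=32, choices=PROVIDER_CHOICES)`, populated once from `socialaccount.provider` during the data migration.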
I have connected my GitHub account and imported all my repositories and ran the migrations. In addition to that I created ~53k
Updated the PR! Used
Learned something interesting: it seems https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-update
I got another idea that I wanted to share here after talking about it with @saadmk11 today, keeping in mind that we can't just run migrations on this table because it is so huge that it would take down the DB. In a few words, I'm thinking of:
To accomplish this, here is the plan we've got:
At this point, we could do the first deploy. Active users and new active users will have updated data immediately after they log in. From here, we can continue migrating not frequently used data (or stale) following these steps:
This will give us the best of both worlds: active/new users will have updated data, and old users will be migrated slowly without killing the DB. And as soon as one of those old users logs in, their data will be automatically re-synced from GitHub via the Django signal.
I left 7) for the end and on its own because I'm not sure if removing these fields will take the DB down as well, considering the amount of data we have in that table.
@humitos This makes sense to me. I think the main thing we need to worry about is GH rate limits, but there's no reason we couldn't ship the modeling a week in advance, then migrate the data over a longer period of time, then deploy the code to use it, right?
We can't really ship the new modeling and continue using the old one in the application Python code. The new modeling modifies relationships between models and renames a table, so the queries will need to change as well. In my proposal, we deploy the new modeling and the application Python code changes that use it at the same time, together with a script to re-sync active users in the same deploy. I'm still not 100% convinced that we have to migrate all the data at the same time, considering that we can re-fetch it (and up to date!) from GitHub when we need it. That's why I left step 6) outside the first deploy. It may not be required.
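The "re-synced via the Django signal" step mentioned above could be wired roughly like this. A sketch under assumptions: the use of Django's built-in login signal and the task name `sync_remote_repositories` are mine, not confirmed by this PR:

```python
# Illustrative only: refresh a returning user's remote data on login, so
# stale rows for old users are re-fetched from the provider lazily.
from django.contrib.auth.signals import user_logged_in
from django.dispatch import receiver


@receiver(user_logged_in)
def resync_remote_repositories(sender, request, user, **kwargs):
    # `sync_remote_repositories` is a hypothetical async task name here;
    # dispatching to a worker keeps the login request fast.
    from readthedocs.oauth.tasks import sync_remote_repositories
    sync_remote_repositories.delay(user.pk)
```

Because the re-sync is triggered per user, GitHub API calls are spread over time instead of hammering the rate limit during one bulk migration.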
Force-pushed from 863d329 to c3592eb
Changes look good to me. I'm surprised that tests keep passing after changing the RemoteRepository.users field. I'd have expected the code to fail because of this, considering that the field is not valid anymore and it has to be accessed via the remoterelation.
We need to do exactly the same migration for RemoteOrganization as well.
readthedocs/oauth/models.py
@@ -206,3 +210,30 @@ def matches(self, user):
            },
        ),
    } for project in projects]


class RemoteRelation(TimeStampedModel):
We may need to rename this model to something like RemoteRepositoryRelation, because we will also need a relation model for RemoteOrganization and we would have conflicting names.
Does it make sense to make the remoterepository field nullable and add a remoteorganization field to this RemoteRelation model, and use it for both? It seems it's exactly the same data, except that RemoteOrganization does not have an admin field.
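A hedged sketch of what such a shared relation model could look like, assuming nullable FKs so each row points at either a repository or an organization. Field names and the TimeStampedModel base (from django-extensions, as in the diff above) are illustrative, not the PR's final code:

```python
# Illustrative only: one through model shared by repositories and
# organizations; exactly one of the two FKs is expected to be set.
from django.conf import settings
from django.db import models
from django_extensions.db.models import TimeStampedModel


class RemoteRepositoryRelation(TimeStampedModel):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    remoterepository = models.ForeignKey(
        'oauth.RemoteRepository', null=True, blank=True, on_delete=models.CASCADE,
    )
    remoteorganization = models.ForeignKey(
        'oauth.RemoteOrganization', null=True, blank=True, on_delete=models.CASCADE,
    )
    # `admin` only applies to repositories; RemoteOrganization has no such flag.
    admin = models.BooleanField(default=False)
```

The trade-off is a sparser table (one FK is always NULL per row) versus two near-identical relation models with duplicated code.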
This is the initial modeling and data migration for the RemoteRepository model. This represents a simple implementation to start the discussion for the implementation of #7169.
The first migration (0011) decouples the RemoteRepository model with a through model, RemoteRelation, using SeparateDatabaseAndState, so that Django does not create a new table for this and uses the table that already exists for the ManyToMany relation.
The second migration (0012) adds the required fields for the RemoteRelation model (uses JSONField instead of models.TextField for the json field).
The third migration (0013) performs the data migration between RemoteRepository and RemoteRelation.
The fourth migration (0014) removes the transferred fields from the RemoteRepository model.
Ref: #2759 and #7169
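The SeparateDatabaseAndState step that the description mentions could look roughly like this. A sketch under assumptions: the predecessor migration name, the M2M table name (Django's default `<app>_<model>_<field>`), and the field list are mine, not the PR's exact file:

```python
# Illustrative migration 0011: create the through model in Django's state
# only, pointing it at the M2M table that already exists, so no schema
# change actually runs against the database.
from django.conf import settings
from django.db import migrations, models
import django.db.models.deletion


class Migration(migrations.Migration):

    dependencies = [
        ('oauth', '0010_previous_migration'),  # hypothetical predecessor
    ]

    state_operations = [
        migrations.CreateModel(
            name='RemoteRelation',
            fields=[
                ('id', models.AutoField(primary_key=True, serialize=False)),
                ('remoterepository', models.ForeignKey(
                    on_delete=django.db.models.deletion.CASCADE,
                    to='oauth.RemoteRepository',
                )),
                ('user', models.ForeignKey(
                    on_delete=django.db.models.deletion.CASCADE,
                    to=settings.AUTH_USER_MODEL,
                )),
            ],
            # Reuse the implicit M2M table rather than creating a new one.
            options={'db_table': 'oauth_remoterepository_users'},
        ),
        migrations.AlterField(
            model_name='remoterepository',
            name='users',
            field=models.ManyToManyField(
                related_name='oauth_repositories',
                through='oauth.RemoteRelation',
                to=settings.AUTH_USER_MODEL,
            ),
        ),
    ]

    operations = [
        # State-only: Django's model state changes, the DB schema does not.
        migrations.SeparateDatabaseAndState(state_operations=state_operations),
    ]
```

After this, migrations 0012-0014 can add fields to, populate, and clean up around RemoteRelation as ordinary schema and data migrations.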