-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Save config on build model #4749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer): | |||
version_slug = serializers.ReadOnlyField(source='version.slug') | |||
docs_url = serializers.ReadOnlyField(source='version.get_absolute_url') | |||
state_display = serializers.ReadOnlyField(source='get_state_display') | |||
config = serializers.JSONField(required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this because of rpkilby/jsonfield#188 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be a code comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far. We aren't saving the config anywhere yet, right?
readthedocs/builds/models.py
Outdated
@property | ||
def config(self): | ||
last_build = ( | ||
self.builds.filter(type='html', state='finished') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still use the type
field? It seems old and unnecessary at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I was thinking the same.
@@ -109,6 +129,15 @@ def __str__(self): | |||
pk=self.pk, | |||
)) | |||
|
|||
@property | |||
def config(self): | |||
last_build = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how long this query will take. Our build table is pretty large, but perhaps the filtering will make it a bit faster. Feels like something we might want to denormalize if we use it a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use this, for now, we'll see if we can cache this maybe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be used on the api and resolver, it'll be used on the second part of this PR
Not yet, I'm only need to write a test p: |
key: val | ||
for key, val in build.items() | ||
if key not in private_keys | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The blocks are the same, just refactored it to be more readable
config = self.config.as_dict() | ||
build_data['config'] = config | ||
api_v2.build(pk).put(build_data) | ||
self.build['config'] = config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we are going to use this here, but just in case, it also helps when testing this, write a test for this was hard 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, testing just about everything about our builds is tricky :/
This is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of saving this data into the DB, but I think we don't need to save it "per build" and store lots of megabytes into the DB. This will complicate things in the near future.
I want to "Request changes" here to discuss this a little more before merging.
readthedocs/builds/models.py
Outdated
@@ -448,6 +478,7 @@ class Build(models.Model): | |||
exit_code = models.IntegerField(_('Exit code'), null=True, blank=True) | |||
commit = models.CharField( | |||
_('Commit'), max_length=255, null=True, blank=True) | |||
config = JSONField(_('Configuration used in the build'), default={}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, but I think this will increase a lot the size of the db with not very useful data (because it will become old/inaccurate pretty soon). On the other hand, moving this to the cold storage I think that doesn't make sense since finally we won't be able to query these models and get the data we want.
Because of this problem, I was suggesting to save this config in the Version
object and always save the config of the latest build per version. We don't really care what was the config of a build 2 months ago. This way, we will have only one config JSON saved into our DB per (active) version and it will be always up to date.
Besides, answering the question: "how many project use mkdocs
?" with the current approach is complicated since we will need to count only the latest build per version. In the end, I think that in many cases what I'm suggesting is what we want here.
I want to make this stored data useful, help us to debug problems but also to make decisions when we need to know how users are using our own platform: "how much used is the build/latest
image?" and more questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the JSON example generated from the test and checking how many builds we do have per day, I can estimate that we will have an increase of 1.5Mb per day.
In [13]: Build.objects.filter(date__lt=datetime.date(2018, 10, 15), date__gt=datetime.date(2018, 10, 14)).count()
Out[13]: 3278
In [14]: text_json = '{"version": "1", "formats": ["pdf"], "python": {"version": 3.5, "requirements": "requirements.txt", "install_with_pip": false, "install_with_setup": false, "extra_requi
...: rements": [], "use_system_site_packages": false}, "build": {"image": "readthedocs/build:2.0"}, "conda": null, "sphinx": {"builder": "sphinx", "configuration": null, "fail_on_warning
...: ": false}, "mkdocs": {"configuration": null, "fail_on_warning": false}, "doctype": "sphinx", "submodules": {"include": 1, "exclude": [], "recursive": true}}'
...:
In [15]: len(text_json)
Out[15]: 505
In [16]: 505 * 3278 / 1024 ** 2
Out[16]: 1.5787029266357422
(please, verify my math and logic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.5MB per day feels like not that much storage, to be honest. It's less about 500MB a year, give or take. Of course, doing this kind of stuff is how we keep filling up our DB, so probably best not to take the naive approach.
I do see value in having recent changes of the config files stored. I wonder if there is a good way to store it on the Version, but keep a historical copy of all changes. We don't want to keep 500 copies of the same exact data structure, but being able to diff them over time feels very useful, especially when a build goes from passing to failing, or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be overkill, perhaps we could just only save the config when it differs from the previously found one on the Build
for that version? Something that limits data so we're not just storing the same data over and over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is an easy way, I'd like to have this to show to users the configuration that was used on each build, if we don't save the full config we need to query the previous one before saving (this is easy) to see if they differ. If we want to show the config for a specific build, we need to know the previous one. So, maybe another model? Or maybe a foreign key to the previous Build
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally prefer to take an action about this before merging.
@ericholscher eric I like your proposal. I think it's "simple enough" to implement and shouldn't complicate things too much if we save "the whole config" if "we detect that differs", instead of saving a diff that later will be more complicated to generate the real config JSON text.
This way, we can query "what's the latest build of this project where the config
attribute is not None
" and that will be the config
used for the current Build (and we can show it to the user as @stsewd suggested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, Build
has a date field, we can use that to query the previous build with a valid config
@@ -841,6 +841,60 @@ def test_config_filenames_regex(correct_config_filename): | |||
assert re.match(CONFIG_FILENAME_REGEX, correct_config_filename) | |||
|
|||
|
|||
def test_as_dict(tmpdir): | |||
apply_fs(tmpdir, {'requirements.txt': ''}) | |||
build = get_build_config( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the variable is tricky. This is not a build object, but a config object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a BuildConfig
object, for historical reasons it was left like that p: I mean, we are using that variable name in all the tests of the config module, not sure if we want to rename all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Feel free to do whatever you prefer. In case you think this should be renamed, open an issue in the Cleanup milestone for the "future".
readthedocs/builds/models.py
Outdated
@@ -448,6 +478,7 @@ class Build(models.Model): | |||
exit_code = models.IntegerField(_('Exit code'), null=True, blank=True) | |||
commit = models.CharField( | |||
_('Commit'), max_length=255, null=True, blank=True) | |||
config = JSONField(_('Configuration used in the build'), default={}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a mutable object as default it's a problem. You need to change it to dict
function.
Reference: https://docs.djangoproject.com/es/2.1/ref/models/fields/#django.db.models.Field.default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this rpkilby/jsonfield#125, but we can change it just to be 100% sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Change it, please. That's not the default behavior..
Codecov Report
@@ Coverage Diff @@
## master #4749 +/- ##
==========================================
+ Coverage 76.47% 76.72% +0.25%
==========================================
Files 158 158
Lines 9997 10045 +48
Branches 1262 1268 +6
==========================================
+ Hits 7645 7707 +62
+ Misses 2010 1998 -12
+ Partials 342 340 -2
|
I have added a way to save the config only if it has changed from the previous build, I'm saving the pk to the original config on the config of the rno emptyepeated build (that way we have a faster access to the config), and we can detect if the previous config is valid (no empty). From my perspective, I haven't added too much complexity to the code, but we still need to be careful on how to access to the config field (using I haven't touched the api code, I'm not sure where/how (or if really) we are going to access to the config from there, but we can just add a readonly field to the serializer later. @humitos @ericholscher let me know if my solution is fine or if there is a better way to implement this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like most of the code, but I have some suggestions.
I'd like to be more explicit in some docstrings since I'm sure that all this magic will confuse us.
Besides, I'd like you to double check the logic to compare if the config is equal to the previous one because the tests are using only one value and I'm not sure how deep the comparison will go.
Finally, there are some names (varibles, keys) that I'd like to change before merging.
readthedocs/builds/models.py
Outdated
@@ -497,6 +498,47 @@ class Meta(object): | |||
get_latest_by = 'date' | |||
index_together = [['version', 'state', 'type']] | |||
|
|||
@property | |||
def previous(self): | |||
"""Returns the previous build to this one.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to be more explicit here,
Return the previous build to the current Build (matching the project and version)
readthedocs/builds/models.py
Outdated
return None | ||
|
||
def get_config(self): | ||
"""Get the config from correct object.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are not going to access the config
from this object directly, I suggest to use _config
for the model field, and make this method a @property
called just config
. This way, we will avoid potential future problems and mistakes.
The docstring could be more explicit.
Get the config used for this build.
Since we are saving the config into the JSON field only when it differs from the previous one,
this helper returns the correct JSON used for this particular Build object
(it could be stored in this object or one of the previous ones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using _config
, also I'd add a @property.set
, to set _config
through just config
. The weird part, would be on the API, but I think we can handle that 🤔
readthedocs/builds/models.py
Outdated
from the previous one. | ||
|
||
If the config is the same, we save the pk of the object | ||
that has the **real** config under the ``__config`` key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The __config
field could be more explicit also, like __config_used__build_pk
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it would be good if this is a CONSTANT defined in the class, instead something that it's written many times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not convinced with the name either, not sure about __config_used__build_pk
. I don't have better ideas for the name though :/
readthedocs/builds/models.py
Outdated
that has the **real** config under the ``__config`` key. | ||
""" | ||
previous = self.previous | ||
if previous is not None and self.config == previous.get_config(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know how deep the ==
operator works with dicts?
Also, I think this comparison needs to be a method because it will be more explicit. Like, has_config_changed
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's recursive https://stackoverflow.com/a/17217255/5689214
readthedocs/builds/models.py
Outdated
@@ -109,6 +130,16 @@ def __str__(self): | |||
pk=self.pk, | |||
)) | |||
|
|||
@property | |||
def config(self): | |||
"""Returns the configuration used in the last successful build.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to explain a little more what this returns and what's its type. Probably, it's good to use :returns:
and :rtype:
in the docstring. We do have many different types of Config objects and this will confuse us in the near future. In fact, I think this is not returning a Config object, but a simple dict 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it's just a dict
@@ -109,6 +129,15 @@ def __str__(self): | |||
pk=self.pk, | |||
)) | |||
|
|||
@property | |||
def config(self): | |||
last_build = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?
I have changed the approach, now we have a Also, I was worried about the serializer and the api, but looks like it just works! we have access to the real config trough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few performance questions. I do worry about adding a ton of queries to our biggest table (~7M builds)
readthedocs/projects/tasks.py
Outdated
def save_build_config(self): | ||
"""Save config in the build object.""" | ||
pk = self.build['id'] | ||
build_data = api_v2.build(pk).get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit overkill. Can we not use the patch
method to just update the config, without having to get()
it previously? Similar to: https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/projects/tasks.py#L608
if previous is not None and self._config == previous.config: | ||
previous_pk = previous._config.get(self.CONFIG_KEY, previous.pk) | ||
self._config = {self.CONFIG_KEY: previous_pk} | ||
super(Build, self).save(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried that we're running this on each Build
save. We do this a number of times during a documentation build, and this will be adding a decent number of queries (eg. self.previous
is a query, as is previous.config
).
Is there a way to set the _config
just once when we have changed it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is an is_changed
method on django, but I think we can do this using an internal variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I'm just trying to think if there is a way to run this logic just when we are trying to set a config
. Perhaps it belongs somewhere else?
|
Forget about #4749 (comment), looks like it's working, I'll write a test for patch anyway. |
Something weird is happening on local dev, sometimes the response from the API is in str instead of a dict, I'll investigate more, if that doesn't work, I'll try to test with postgres, or maybe I mess up my db. |
So, it happens that sometimes the build was saved as expected and others not, the bug was that as we have |
From the latest comments I understand that this is ready to be merged. Now, you are saving and running all the checks only when the All your tests are passing and also your QA. I took a very quick look and seems good to get merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Will be interesting to get this data saved over time. 💯
@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer): | |||
version_slug = serializers.ReadOnlyField(source='version.slug') | |||
docs_url = serializers.ReadOnlyField(source='version.get_absolute_url') | |||
state_display = serializers.ReadOnlyField(source='get_state_display') | |||
config = serializers.JSONField(required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be a code comment
config = self.config.as_dict() | ||
build_data['config'] = config | ||
api_v2.build(pk).put(build_data) | ||
self.build['config'] = config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, testing just about everything about our builds is tricky :/
readthedocs/config/config.py
Outdated
@@ -248,6 +248,21 @@ def python_full_version(self): | |||
) | |||
return ver | |||
|
|||
def as_dict(self): | |||
attributes = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this logic should live somewhere else, as a PUBLIC_ATTRIBUTES
or something that is more likely to be updated when changed/added?
@ericholscher done! |
🎉 |
Not sure how big this is going to be, maybe I will make 2 PRs (one to just save the config information, and another to replate
Project.documentation_type
).Related #4638