Skip to content

Save config on build model #4749

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Nov 7, 2018

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Oct 11, 2018

Not sure how big this is going to be, maybe I will make 2 PRs (one to just save the config information, and another to replate Project.documentation_type).

Related #4638

@stsewd stsewd added the PR: work in progress Pull request is not ready for full review label Oct 11, 2018
@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer):
version_slug = serializers.ReadOnlyField(source='version.slug')
docs_url = serializers.ReadOnlyField(source='version.get_absolute_url')
state_display = serializers.ReadOnlyField(source='get_state_display')
config = serializers.JSONField(required=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this because of rpkilby/jsonfield#188 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be a code comment

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far. We aren't saving the config anywhere yet, right?

@property
def config(self):
last_build = (
self.builds.filter(type='html', state='finished')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still use the type field? It seems old and unnecessary at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I was thinking the same.

@@ -109,6 +129,15 @@ def __str__(self):
pk=self.pk,
))

@property
def config(self):
last_build = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how long this query will take. Our build table is pretty large, but perhaps the filtering will make it a bit faster. Feels like something we might want to denormalize if we use it a lot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use this, for now, we'll see if we can cache this maybe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be used on the api and resolver, it'll be used on the second part of this PR

@stsewd
Copy link
Member Author

stsewd commented Oct 12, 2018

We aren't saving the config anywhere yet, right?

Not yet, I'm only need to write a test p:

key: val
for key, val in build.items()
if key not in private_keys
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocks are the same, just refactored it to be more readable

config = self.config.as_dict()
build_data['config'] = config
api_v2.build(pk).put(build_data)
self.build['config'] = config
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are going to use this here, but just in case, it also helps when testing this, write a test for this was hard 😞

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, testing just about everything about our builds is tricky :/

@stsewd stsewd removed the PR: work in progress Pull request is not ready for full review label Oct 12, 2018
@stsewd
Copy link
Member Author

stsewd commented Oct 12, 2018

This is ready for review

@humitos humitos requested a review from a team October 15, 2018 09:27
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of saving this data into the DB, but I think we don't need to save it "per build" and store lots of megabytes into the DB. This will complicate things in the near future.

I want to "Request changes" here to discuss this a little more before merging.

@@ -448,6 +478,7 @@ class Build(models.Model):
exit_code = models.IntegerField(_('Exit code'), null=True, blank=True)
commit = models.CharField(
_('Commit'), max_length=255, null=True, blank=True)
config = JSONField(_('Configuration used in the build'), default={})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, but I think this will increase a lot the size of the db with not very useful data (because it will become old/inaccurate pretty soon). On the other hand, moving this to the cold storage I think that doesn't make sense since finally we won't be able to query these models and get the data we want.

Because of this problem, I was suggesting to save this config in the Version object and always save the config of the latest build per version. We don't really care what was the config of a build 2 months ago. This way, we will have only one config JSON saved into our DB per (active) version and it will be always up to date.

Besides, answering the question: "how many project use mkdocs?" with the current approach is complicated since we will need to count only the latest build per version. In the end, I think that in many cases what I'm suggesting is what we want here.

I want to make this stored data useful, help us to debug problems but also to make decisions when we need to know how users are using our own platform: "how much used is the build/latest image?" and more questions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the JSON example generated from the test and checking how many builds we do have per day, I can estimate that we will have an increase of 1.5Mb per day.

In [13]: Build.objects.filter(date__lt=datetime.date(2018, 10, 15), date__gt=datetime.date(2018, 10, 14)).count()
Out[13]: 3278

In [14]: text_json = '{"version": "1", "formats": ["pdf"], "python": {"version": 3.5, "requirements": "requirements.txt", "install_with_pip": false, "install_with_setup": false, "extra_requi
    ...: rements": [], "use_system_site_packages": false}, "build": {"image": "readthedocs/build:2.0"}, "conda": null, "sphinx": {"builder": "sphinx", "configuration": null, "fail_on_warning
    ...: ": false}, "mkdocs": {"configuration": null, "fail_on_warning": false}, "doctype": "sphinx", "submodules": {"include": 1, "exclude": [], "recursive": true}}'
    ...: 

In [15]: len(text_json)
Out[15]: 505

In [16]: 505 * 3278 / 1024 ** 2
Out[16]: 1.5787029266357422

(please, verify my math and logic)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.5MB per day feels like not that much storage, to be honest. It's less about 500MB a year, give or take. Of course, doing this kind of stuff is how we keep filling up our DB, so probably best not to take the naive approach.

I do see value in having recent changes of the config files stored. I wonder if there is a good way to store it on the Version, but keep a historical copy of all changes. We don't want to keep 500 copies of the same exact data structure, but being able to diff them over time feels very useful, especially when a build goes from passing to failing, or similar.

Copy link
Member

@ericholscher ericholscher Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be overkill, perhaps we could just only save the config when it differs from the previously found one on the Build for that version? Something that limits data so we're not just storing the same data over and over.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is an easy way, I'd like to have this to show to users the configuration that was used on each build, if we don't save the full config we need to query the previous one before saving (this is easy) to see if they differ. If we want to show the config for a specific build, we need to know the previous one. So, maybe another model? Or maybe a foreign key to the previous Build?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer to take an action about this before merging.

@ericholscher eric I like your proposal. I think it's "simple enough" to implement and shouldn't complicate things too much if we save "the whole config" if "we detect that differs", instead of saving a diff that later will be more complicated to generate the real config JSON text.

This way, we can query "what's the latest build of this project where the config attribute is not None" and that will be the config used for the current Build (and we can show it to the user as @stsewd suggested).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, Build has a date field, we can use that to query the previous build with a valid config

@@ -841,6 +841,60 @@ def test_config_filenames_regex(correct_config_filename):
assert re.match(CONFIG_FILENAME_REGEX, correct_config_filename)


def test_as_dict(tmpdir):
apply_fs(tmpdir, {'requirements.txt': ''})
build = get_build_config(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the variable is tricky. This is not a build object, but a config object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a BuildConfig object, for historical reasons it was left like that p: I mean, we are using that variable name in all the tests of the config module, not sure if we want to rename all of them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Feel free to do whatever you prefer. In case you think this should be renamed, open an issue in the Cleanup milestone for the "future".

@@ -448,6 +478,7 @@ class Build(models.Model):
exit_code = models.IntegerField(_('Exit code'), null=True, blank=True)
commit = models.CharField(
_('Commit'), max_length=255, null=True, blank=True)
config = JSONField(_('Configuration used in the build'), default={})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a mutable object as default it's a problem. You need to change it to dict function.

Reference: https://docs.djangoproject.com/es/2.1/ref/models/fields/#django.db.models.Field.default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this rpkilby/jsonfield#125, but we can change it just to be 100% sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Change it, please. That's not the default behavior..

@codecov
Copy link

codecov bot commented Oct 23, 2018

Codecov Report

Merging #4749 into master will increase coverage by 0.25%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4749      +/-   ##
==========================================
+ Coverage   76.47%   76.72%   +0.25%     
==========================================
  Files         158      158              
  Lines        9997    10045      +48     
  Branches     1262     1268       +6     
==========================================
+ Hits         7645     7707      +62     
+ Misses       2010     1998      -12     
+ Partials      342      340       -2
Impacted Files Coverage Δ
readthedocs/projects/tasks.py 72.07% <100%> (+2.69%) ⬆️
readthedocs/config/config.py 98.65% <100%> (+0.61%) ⬆️
readthedocs/restapi/serializers.py 97.53% <100%> (+0.03%) ⬆️
readthedocs/builds/models.py 78.41% <100%> (+2.61%) ⬆️
readthedocs/projects/admin.py 91.08% <0%> (+0.08%) ⬆️

@stsewd
Copy link
Member Author

stsewd commented Oct 23, 2018

I have added a way to save the config only if it has changed from the previous build, I'm saving the pk to the original config on the config of the rno emptyepeated build (that way we have a faster access to the config), and we can detect if the previous config is valid (no empty).

From my perspective, I haven't added too much complexity to the code, but we still need to be careful on how to access to the config field (using get_config), hope with the tests we are covered.

I haven't touched the api code, I'm not sure where/how (or if really) we are going to access to the config from there, but we can just add a readonly field to the serializer later.

@humitos @ericholscher let me know if my solution is fine or if there is a better way to implement this

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like most of the code, but I have some suggestions.

I'd like to be more explicit in some docstrings since I'm sure that all this magic will confuse us.

Besides, I'd like you to double check the logic to compare if the config is equal to the previous one because the tests are using only one value and I'm not sure how deep the comparison will go.

Finally, there are some names (varibles, keys) that I'd like to change before merging.

@@ -497,6 +498,47 @@ class Meta(object):
get_latest_by = 'date'
index_together = [['version', 'state', 'type']]

@property
def previous(self):
"""Returns the previous build to this one."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to be more explicit here,

Return the previous build to the current Build (matching the project and version)

return None

def get_config(self):
"""Get the config from correct object."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not going to access the config from this object directly, I suggest to use _config for the model field, and make this method a @property called just config. This way, we will avoid potential future problems and mistakes.

The docstring could be more explicit.

Get the config used for this build.

Since we are saving the config into the JSON field only when it differs from the previous one,
this helper returns the correct JSON used for this particular Build object
(it could be stored in this object or one of the previous ones)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using _config, also I'd add a @property.set, to set _config through just config. The weird part, would be on the API, but I think we can handle that 🤔

from the previous one.

If the config is the same, we save the pk of the object
that has the **real** config under the ``__config`` key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __config field could be more explicit also, like __config_used__build_pk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good if this is a CONSTANT defined in the class, instead something that it's written many times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not convinced with the name either, not sure about __config_used__build_pk. I don't have better ideas for the name though :/

that has the **real** config under the ``__config`` key.
"""
previous = self.previous
if previous is not None and self.config == previous.get_config():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how deep the == operator works with dicts?

Also, I think this comparison needs to be a method because it will be more explicit. Like, has_config_changed or similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -109,6 +130,16 @@ def __str__(self):
pk=self.pk,
))

@property
def config(self):
"""Returns the configuration used in the last successful build."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to explain a little more what this returns and what's its type. Probably, it's good to use :returns: and :rtype: in the docstring. We do have many different types of Config objects and this will confuse us in the near future. In fact, I think this is not returning a Config object, but a simple dict 😕

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's just a dict

@@ -109,6 +129,15 @@ def __str__(self):
pk=self.pk,
))

@property
def config(self):
last_build = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?

@stsewd
Copy link
Member Author

stsewd commented Oct 23, 2018

I have changed the approach, now we have a _config field, but the model has a config property which works as a proxy to the real config.

Also, I was worried about the serializer and the api, but looks like it just works! we have access to the real config trough configfrom the api, and saving to config works too!

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few performance questions. I do worry about adding a ton of queries to our biggest table (~7M builds)

def save_build_config(self):
"""Save config in the build object."""
pk = self.build['id']
build_data = api_v2.build(pk).get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit overkill. Can we not use the patch method to just update the config, without having to get() it previously? Similar to: https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/projects/tasks.py#L608

if previous is not None and self._config == previous.config:
previous_pk = previous._config.get(self.CONFIG_KEY, previous.pk)
self._config = {self.CONFIG_KEY: previous_pk}
super(Build, self).save(*args, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried that we're running this on each Build save. We do this a number of times during a documentation build, and this will be adding a decent number of queries (eg. self.previous is a query, as is previous.config).

Is there a way to set the _config just once when we have changed it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is an is_changed method on django, but I think we can do this using an internal variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'm just trying to think if there is a way to run this logic just when we are trying to set a config. Perhaps it belongs somewhere else?

@stsewd
Copy link
Member Author

stsewd commented Nov 5, 2018

patch isn't saving the reference but the whole config, I need to write a test and investigate that a little more

@stsewd
Copy link
Member Author

stsewd commented Nov 5, 2018

Forget about #4749 (comment), looks like it's working, I'll write a test for patch anyway.

@stsewd
Copy link
Member Author

stsewd commented Nov 5, 2018

Something weird is happening on local dev, sometimes the response from the API is in str instead of a dict, I'll investigate more, if that doesn't work, I'll try to test with postgres, or maybe I mess up my db.

@stsewd
Copy link
Member Author

stsewd commented Nov 6, 2018

So, it happens that sometimes the build was saved as expected and others not, the bug was that as we have _config and config exposed on the api, they were overriding each other (actually _config was overriding to config). As _config didn't have the json serializer, it was saving the value as plain text (jsonfield doesn't do any validation apparently). Anyway, the fix? Just ignore _config on the API :) the tests didn't catch this, so, I wrote a test that checks that _config isn't in the response.

@humitos
Copy link
Member

humitos commented Nov 6, 2018

From the latest comments I understand that this is ready to be merged. Now, you are saving and running all the checks only when the _config has changed on save which was one of the worries of @ericholscher. The other one was the .patch that you already made that change.

All your tests are passing and also your QA.

I took a very quick look and seems good to get merged.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Will be interesting to get this data saved over time. 💯

@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer):
version_slug = serializers.ReadOnlyField(source='version.slug')
docs_url = serializers.ReadOnlyField(source='version.get_absolute_url')
state_display = serializers.ReadOnlyField(source='get_state_display')
config = serializers.JSONField(required=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be a code comment

config = self.config.as_dict()
build_data['config'] = config
api_v2.build(pk).put(build_data)
self.build['config'] = config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, testing just about everything about our builds is tricky :/

@@ -248,6 +248,21 @@ def python_full_version(self):
)
return ver

def as_dict(self):
attributes = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this logic should live somewhere else, as a PUBLIC_ATTRIBUTES or something that is more likely to be updated when changed/added?

@stsewd
Copy link
Member Author

stsewd commented Nov 7, 2018

@ericholscher done!

@ericholscher ericholscher merged commit 1529e2d into readthedocs:master Nov 7, 2018
@ericholscher
Copy link
Member

🎉

@stsewd stsewd deleted the save-config-on-build branch November 7, 2018 21:42
@stsewd stsewd mentioned this pull request Nov 14, 2018
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants