Save config on build model #4749

stsewd · 2018-10-11T22:23:54Z

Not sure how big this is going to be, maybe I will make 2 PRs (one to just save the config information, and another to replate Project.documentation_type).

Related #4638

stsewd · 2018-10-11T22:25:21Z

readthedocs/restapi/serializers.py

@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer):
    version_slug = serializers.ReadOnlyField(source='version.slug')
    docs_url = serializers.ReadOnlyField(source='version.get_absolute_url')
    state_display = serializers.ReadOnlyField(source='get_state_display')
+    config = serializers.JSONField(required=False)


We need this because of rpkilby/jsonfield#188 (comment)

Probably should be a code comment

ericholscher

Looks good so far. We aren't saving the config anywhere yet, right?

ericholscher · 2018-10-12T10:06:20Z

readthedocs/builds/models.py

+    @property
+    def config(self):
+        last_build = (
+            self.builds.filter(type='html', state='finished')


Do we still use the type field? It seems old and unnecessary at this point.

Yup, I was thinking the same.

ericholscher · 2018-10-12T10:07:23Z

readthedocs/builds/models.py

@@ -109,6 +129,15 @@ def __str__(self):
                pk=self.pk,
            ))

+    @property
+    def config(self):
+        last_build = (


I wonder how long this query will take. Our build table is pretty large, but perhaps the filtering will make it a bit faster. Feels like something we might want to denormalize if we use it a lot.

We don't use this, for now, we'll see if we can cache this maybe

This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?

This will be used on the api and resolver, it'll be used on the second part of this PR

stsewd · 2018-10-12T15:02:27Z

We aren't saving the config anywhere yet, right?

Not yet, I'm only need to write a test p:

stsewd · 2018-10-12T17:26:30Z

readthedocs/projects/tasks.py

+            key: val
+            for key, val in build.items()
+            if key not in private_keys
+        }


The blocks are the same, just refactored it to be more readable

stsewd · 2018-10-12T17:28:02Z

readthedocs/projects/tasks.py

+        config = self.config.as_dict()
+        build_data['config'] = config
+        api_v2.build(pk).put(build_data)
+        self.build['config'] = config


I don't think we are going to use this here, but just in case, it also helps when testing this, write a test for this was hard 😞

Yea, testing just about everything about our builds is tricky :/

readthedocs/rtd_tests/tests/test_version_config.py

readthedocs/builds/models.py

stsewd · 2018-10-12T17:57:16Z

This is ready for review

humitos

I like the idea of saving this data into the DB, but I think we don't need to save it "per build" and store lots of megabytes into the DB. This will complicate things in the near future.

I want to "Request changes" here to discuss this a little more before merging.

humitos · 2018-10-15T10:32:37Z

readthedocs/builds/models.py

@@ -448,6 +478,7 @@ class Build(models.Model):
    exit_code = models.IntegerField(_('Exit code'), null=True, blank=True)
    commit = models.CharField(
        _('Commit'), max_length=255, null=True, blank=True)
+    config = JSONField(_('Configuration used in the build'), default={})


I like this, but I think this will increase a lot the size of the db with not very useful data (because it will become old/inaccurate pretty soon). On the other hand, moving this to the cold storage I think that doesn't make sense since finally we won't be able to query these models and get the data we want.

Because of this problem, I was suggesting to save this config in the Version object and always save the config of the latest build per version. We don't really care what was the config of a build 2 months ago. This way, we will have only one config JSON saved into our DB per (active) version and it will be always up to date.

Besides, answering the question: "how many project use mkdocs?" with the current approach is complicated since we will need to count only the latest build per version. In the end, I think that in many cases what I'm suggesting is what we want here.

I want to make this stored data useful, help us to debug problems but also to make decisions when we need to know how users are using our own platform: "how much used is the build/latest image?" and more questions.

Using the JSON example generated from the test and checking how many builds we do have per day, I can estimate that we will have an increase of 1.5Mb per day.

In [13]: Build.objects.filter(date__lt=datetime.date(2018, 10, 15), date__gt=datetime.date(2018, 10, 14)).count() Out[13]: 3278 In [14]: text_json = '{"version": "1", "formats": ["pdf"], "python": {"version": 3.5, "requirements": "requirements.txt", "install_with_pip": false, "install_with_setup": false, "extra_requi ...: rements": [], "use_system_site_packages": false}, "build": {"image": "readthedocs/build:2.0"}, "conda": null, "sphinx": {"builder": "sphinx", "configuration": null, "fail_on_warning ...: ": false}, "mkdocs": {"configuration": null, "fail_on_warning": false}, "doctype": "sphinx", "submodules": {"include": 1, "exclude": [], "recursive": true}}' ...: In [15]: len(text_json) Out[15]: 505 In [16]: 505 * 3278 / 1024 ** 2 Out[16]: 1.5787029266357422

(please, verify my math and logic)

1.5MB per day feels like not that much storage, to be honest. It's less about 500MB a year, give or take. Of course, doing this kind of stuff is how we keep filling up our DB, so probably best not to take the naive approach.

I do see value in having recent changes of the config files stored. I wonder if there is a good way to store it on the Version, but keep a historical copy of all changes. We don't want to keep 500 copies of the same exact data structure, but being able to diff them over time feels very useful, especially when a build goes from passing to failing, or similar.

That might be overkill, perhaps we could just only save the config when it differs from the previously found one on the Build for that version? Something that limits data so we're not just storing the same data over and over.

I'm not sure if there is an easy way, I'd like to have this to show to users the configuration that was used on each build, if we don't save the full config we need to query the previous one before saving (this is easy) to see if they differ. If we want to show the config for a specific build, we need to know the previous one. So, maybe another model? Or maybe a foreign key to the previous Build?

I personally prefer to take an action about this before merging.

@ericholscher eric I like your proposal. I think it's "simple enough" to implement and shouldn't complicate things too much if we save "the whole config" if "we detect that differs", instead of saving a diff that later will be more complicated to generate the real config JSON text.

This way, we can query "what's the latest build of this project where the config attribute is not None" and that will be the config used for the current Build (and we can show it to the user as @stsewd suggested).

So, Build has a date field, we can use that to query the previous build with a valid config

humitos · 2018-10-15T10:42:00Z

readthedocs/config/tests/test_config.py

@@ -841,6 +841,60 @@ def test_config_filenames_regex(correct_config_filename):
    assert re.match(CONFIG_FILENAME_REGEX, correct_config_filename)


+def test_as_dict(tmpdir):
+    apply_fs(tmpdir, {'requirements.txt': ''})
+    build = get_build_config(


The name of the variable is tricky. This is not a build object, but a config object.

It is a BuildConfig object, for historical reasons it was left like that p: I mean, we are using that variable name in all the tests of the config module, not sure if we want to rename all of them.

OK.

Feel free to do whatever you prefer. In case you think this should be renamed, open an issue in the Cleanup milestone for the "future".

humitos · 2018-10-15T10:46:00Z

readthedocs/builds/models.py

@@ -448,6 +478,7 @@ class Build(models.Model):
    exit_code = models.IntegerField(_('Exit code'), null=True, blank=True)
    commit = models.CharField(
        _('Commit'), max_length=255, null=True, blank=True)
+    config = JSONField(_('Configuration used in the build'), default={})


Using a mutable object as default it's a problem. You need to change it to dict function.

Reference: https://docs.djangoproject.com/es/2.1/ref/models/fields/#django.db.models.Field.default

I read this rpkilby/jsonfield#125, but we can change it just to be 100% sure

Yes. Change it, please. That's not the default behavior..

codecov · 2018-10-23T03:57:18Z

Codecov Report

Merging #4749 into master will increase coverage by 0.25%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4749      +/-   ##
==========================================
+ Coverage   76.47%   76.72%   +0.25%     
==========================================
  Files         158      158              
  Lines        9997    10045      +48     
  Branches     1262     1268       +6     
==========================================
+ Hits         7645     7707      +62     
+ Misses       2010     1998      -12     
+ Partials      342      340       -2

Impacted Files	Coverage Δ
readthedocs/projects/tasks.py	`72.07% <100%> (+2.69%)`	⬆️
readthedocs/config/config.py	`98.65% <100%> (+0.61%)`	⬆️
readthedocs/restapi/serializers.py	`97.53% <100%> (+0.03%)`	⬆️
readthedocs/builds/models.py	`78.41% <100%> (+2.61%)`	⬆️
readthedocs/projects/admin.py	`91.08% <0%> (+0.08%)`	⬆️

stsewd · 2018-10-23T04:16:49Z

I have added a way to save the config only if it has changed from the previous build, I'm saving the pk to the original config on the config of the rno emptyepeated build (that way we have a faster access to the config), and we can detect if the previous config is valid (no empty).

From my perspective, I haven't added too much complexity to the code, but we still need to be careful on how to access to the config field (using get_config), hope with the tests we are covered.

I haven't touched the api code, I'm not sure where/how (or if really) we are going to access to the config from there, but we can just add a readonly field to the serializer later.

@humitos @ericholscher let me know if my solution is fine or if there is a better way to implement this

humitos

I like most of the code, but I have some suggestions.

I'd like to be more explicit in some docstrings since I'm sure that all this magic will confuse us.

Besides, I'd like you to double check the logic to compare if the config is equal to the previous one because the tests are using only one value and I'm not sure how deep the comparison will go.

Finally, there are some names (varibles, keys) that I'd like to change before merging.

humitos · 2018-10-23T10:36:53Z

readthedocs/builds/models.py

@@ -497,6 +498,47 @@ class Meta(object):
        get_latest_by = 'date'
        index_together = [['version', 'state', 'type']]

+    @property
+    def previous(self):
+        """Returns the previous build to this one."""


I'm trying to be more explicit here,

Return the previous build to the current Build (matching the project and version)

humitos · 2018-10-23T10:42:16Z

readthedocs/builds/models.py

+        return None
+
+    def get_config(self):
+        """Get the config from correct object."""


If we are not going to access the config from this object directly, I suggest to use _config for the model field, and make this method a @property called just config. This way, we will avoid potential future problems and mistakes.

The docstring could be more explicit.

Get the config used for this build. Since we are saving the config into the JSON field only when it differs from the previous one, this helper returns the correct JSON used for this particular Build object (it could be stored in this object or one of the previous ones)

I like the idea of using _config, also I'd add a @property.set, to set _config through just config. The weird part, would be on the API, but I think we can handle that 🤔

humitos · 2018-10-23T10:43:34Z

readthedocs/builds/models.py

+        from the previous one.
+
+        If the config is the same, we save the pk of the object
+        that has the **real** config under the ``__config`` key.


The __config field could be more explicit also, like __config_used__build_pk.

Also, it would be good if this is a CONSTANT defined in the class, instead something that it's written many times.

Yeah, I'm not convinced with the name either, not sure about __config_used__build_pk. I don't have better ideas for the name though :/

humitos · 2018-10-23T10:50:14Z

readthedocs/builds/models.py

+        that has the **real** config under the ``__config`` key.
+        """
+        previous = self.previous
+        if previous is not None and self.config == previous.get_config():


Do you know how deep the == operator works with dicts?

Also, I think this comparison needs to be a method because it will be more explicit. Like, has_config_changed or similar.

It's recursive https://stackoverflow.com/a/17217255/5689214

readthedocs/rtd_tests/tests/test_builds.py

humitos · 2018-10-23T11:03:48Z

readthedocs/builds/models.py

@@ -109,6 +130,16 @@ def __str__(self):
                pk=self.pk,
            ))

+    @property
+    def config(self):
+        """Returns the configuration used in the last successful build."""


It would be good to explain a little more what this returns and what's its type. Probably, it's good to use :returns: and :rtype: in the docstring. We do have many different types of Config objects and this will confuse us in the near future. In fact, I think this is not returning a Config object, but a simple dict 😕

Yep, it's just a dict

humitos · 2018-10-23T11:06:49Z

readthedocs/builds/models.py

@@ -109,6 +129,15 @@ def __str__(self):
                pk=self.pk,
            ))

+    @property
+    def config(self):
+        last_build = (


This method is only used in the tests. Do we need this at all? Did you add it as a helper for debugging or so? Does it worth to keep it here? Where do you think it will be used?

stsewd · 2018-10-23T20:24:32Z

I have changed the approach, now we have a _config field, but the model has a config property which works as a proxy to the real config.

Also, I was worried about the serializer and the api, but looks like it just works! we have access to the real config trough configfrom the api, and saving to config works too!

ericholscher

Just a few performance questions. I do worry about adding a ton of queries to our biggest table (~7M builds)

ericholscher · 2018-11-01T20:04:09Z

readthedocs/projects/tasks.py

+    def save_build_config(self):
+        """Save config in the build object."""
+        pk = self.build['id']
+        build_data = api_v2.build(pk).get()


This feels a bit overkill. Can we not use the patch method to just update the config, without having to get() it previously? Similar to: https://github.com/rtfd/readthedocs.org/blob/master/readthedocs/projects/tasks.py#L608

ericholscher · 2018-11-01T20:07:33Z

readthedocs/builds/models.py

+        if previous is not None and self._config == previous.config:
+            previous_pk = previous._config.get(self.CONFIG_KEY, previous.pk)
+            self._config = {self.CONFIG_KEY: previous_pk}
+        super(Build, self).save(*args, **kwargs)


I'm a little worried that we're running this on each Build save. We do this a number of times during a documentation build, and this will be adding a decent number of queries (eg. self.previous is a query, as is previous.config).

Is there a way to set the _config just once when we have changed it?

I don't think there is an is_changed method on django, but I think we can do this using an internal variable.

Yea, I'm just trying to think if there is a way to run this logic just when we are trying to set a config. Perhaps it belongs somewhere else?

readthedocs/builds/models.py

stsewd · 2018-11-05T19:04:27Z

patch isn't saving the reference but the whole config, I need to write a test and investigate that a little more

stsewd · 2018-11-05T19:07:45Z

Forget about #4749 (comment), looks like it's working, I'll write a test for patch anyway.

stsewd · 2018-11-05T21:40:18Z

Something weird is happening on local dev, sometimes the response from the API is in str instead of a dict, I'll investigate more, if that doesn't work, I'll try to test with postgres, or maybe I mess up my db.

stsewd · 2018-11-06T03:43:43Z

So, it happens that sometimes the build was saved as expected and others not, the bug was that as we have _config and config exposed on the api, they were overriding each other (actually _config was overriding to config). As _config didn't have the json serializer, it was saving the value as plain text (jsonfield doesn't do any validation apparently). Anyway, the fix? Just ignore _config on the API :) the tests didn't catch this, so, I wrote a test that checks that _config isn't in the response.

humitos · 2018-11-06T09:31:43Z

From the latest comments I understand that this is ready to be merged. Now, you are saving and running all the checks only when the _config has changed on save which was one of the worries of @ericholscher. The other one was the .patch that you already made that change.

All your tests are passing and also your QA.

I took a very quick look and seems good to get merged.

ericholscher

Looks good. Will be interesting to get this data saved over time. 💯

ericholscher · 2018-11-07T17:24:57Z

readthedocs/restapi/serializers.py

@@ -112,6 +112,7 @@ class BuildSerializer(serializers.ModelSerializer):
    version_slug = serializers.ReadOnlyField(source='version.slug')
    docs_url = serializers.ReadOnlyField(source='version.get_absolute_url')
    state_display = serializers.ReadOnlyField(source='get_state_display')
+    config = serializers.JSONField(required=False)


Probably should be a code comment

ericholscher · 2018-11-07T17:25:55Z

readthedocs/projects/tasks.py

+        config = self.config.as_dict()
+        build_data['config'] = config
+        api_v2.build(pk).put(build_data)
+        self.build['config'] = config


Yea, testing just about everything about our builds is tricky :/

ericholscher · 2018-11-07T17:26:32Z

readthedocs/config/config.py

@@ -248,6 +248,21 @@ def python_full_version(self):
            )
        return ver

+    def as_dict(self):
+        attributes = [


I wonder if this logic should live somewhere else, as a PUBLIC_ATTRIBUTES or something that is more likely to be updated when changed/added?

stsewd · 2018-11-07T21:13:22Z

@ericholscher done!

ericholscher · 2018-11-07T21:28:13Z

🎉

stsewd added 3 commits October 11, 2018 16:21

New field

6891db8

Change API

b1b3620

Implement as_dict to config

27d8251

stsewd added the PR: work in progress Pull request is not ready for full review label Oct 11, 2018

stsewd commented Oct 11, 2018

View reviewed changes

ericholscher reviewed Oct 12, 2018

View reviewed changes

stsewd added 4 commits October 12, 2018 12:05

One more test for the api

ab5ea04

Save config on build

5f4fb8c

Simple query

9d70812

Test for version.config property

13abeda

stsewd commented Oct 12, 2018

View reviewed changes

readthedocs/rtd_tests/tests/test_version_config.py Show resolved Hide resolved

stsewd commented Oct 12, 2018

View reviewed changes

readthedocs/builds/models.py Show resolved Hide resolved

stsewd removed the PR: work in progress Pull request is not ready for full review label Oct 12, 2018

humitos requested a review from a team October 15, 2018 09:27

humitos requested changes Oct 15, 2018

View reviewed changes

stsewd added 6 commits October 15, 2018 10:44

Use callable as default

98d26a4

Update migration

660c03a

Save on model only if there is a difference

d29b6d9

Merge branch 'master' into save-config-on-build

7fc1e2b

More tests

8f62f03

Linter

76929ac

One more test

7ade872

humitos reviewed Oct 23, 2018

View reviewed changes

stsewd added 2 commits October 23, 2018 14:58

Better approach for save config

6f8becc

Tests

07efa1b

ericholscher reviewed Nov 1, 2018

View reviewed changes

stsewd added 4 commits November 2, 2018 13:19

Check for previous only on first creation or if the object hast changed

822b793

Merge branch 'master' into save-config-on-build

8c60760

Rebase migration

7ed110d

Use just patch

2ffdbf6

stsewd commented Nov 2, 2018

View reviewed changes

readthedocs/builds/models.py Show resolved Hide resolved

stsewd added 3 commits November 5, 2018 11:35

Docstring

5cce5d8

Don't reference to empty builds

9d8266f

Fix varname

28cd935

stsewd added 2 commits November 5, 2018 14:31

Add test

433b709

Docstring

b88837d

stsewd added 2 commits November 5, 2018 16:42

Merge branch 'master' into save-config-on-build

4723705

Don't expose _config on the API

043f4e6

humitos approved these changes Nov 6, 2018

View reviewed changes

ericholscher approved these changes Nov 7, 2018

View reviewed changes

stsewd added 2 commits November 7, 2018 15:32

Refactor as feedback

79fbb8e

Merge branch 'master' into save-config-on-build

f33d54a

ericholscher merged commit 1529e2d into readthedocs:master Nov 7, 2018

stsewd deleted the save-config-on-build branch November 7, 2018 21:42

stsewd mentioned this pull request Nov 14, 2018

Implement extended install option #4740

Merged

7 tasks

stsewd mentioned this pull request Dec 12, 2018

Add build metadata collection #3751

Closed

stsewd mentioned this pull request May 20, 2020

Show configuration used in each build readthedocs/ext-theme#8

Closed

Save config on build model #4749

Save config on build model #4749

Conversation

stsewd commented Oct 11, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Oct 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Oct 12, 2018

humitos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher Oct 15, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Oct 23, 2018 • edited Loading

Codecov Report

stsewd commented Oct 23, 2018

humitos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Oct 23, 2018

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Nov 5, 2018

stsewd commented Nov 5, 2018

stsewd commented Nov 5, 2018

stsewd commented Nov 6, 2018

humitos commented Nov 6, 2018

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Nov 7, 2018

ericholscher commented Nov 7, 2018

stsewd commented Oct 11, 2018 •

edited

Loading

ericholscher Oct 15, 2018 •

edited

Loading

codecov bot commented Oct 23, 2018 •

edited

Loading