Skip to content

Collect build data #9113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
May 18, 2022
Merged

Collect build data #9113

merged 20 commits into from
May 18, 2022

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Apr 14, 2022

WIP, implementation of #8124

Some notes:

  • We need to collect the data before the container is killed

  • We need to upload the data after the build has finished, otherwise some fields may be out of date or unset (build.length especially)

  • Since this data is collected from the builders, we need to pass the data over our API (no db access)

  • I'm passing only the data that we won't have after the container is killed, and complementing the other data before saving the object (project/org/build etc).

  • We need to get the original config from the user, we modify the "original copy", so we need to copy it twice.

  • Check if json fields play nice with dates (we are saving the date in isoformat for now)

    This doesn't work, not sure if it's going to work on metabase as well.

    from django.db.models import DateTimeField
    from django.db.models.functions import Cast
    BuildData.objects.annotate(start_date=Cast('data__build__start', output_field=DateTimeField()))
  • The output of all apt packages seems to be the biggest one.

  • Are we going to use a separate DB to store this information?

  • Check https://docs.djangoproject.com/en/4.0/topics/serialization/#djangojsonencoder doesn't look useful for our case

  • We probably need to fix// lsremote fails if the output is bigger than DATA_UPLOAD_MAX_MEMORY_SIZE #9112 too, since it could pollute our data

@stsewd stsewd force-pushed the colect-build-data branch 11 times, most recently from 866c822 to ce54523 Compare April 19, 2022 17:49
@stsewd stsewd force-pushed the colect-build-data branch from ce54523 to f0b0fc3 Compare April 19, 2022 18:22
@stsewd stsewd marked this pull request as ready for review April 19, 2022 19:39
@stsewd stsewd requested a review from a team as a code owner April 19, 2022 19:39
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a good initial approach! I left some comments to consider.

Since this data is collected from the builders, we need to pass the data over our API (no db access)

I mentioned this in one of the comments as well. I think we can avoid sending the data via the API and use a Celery task instead. This way, we will be passing the data via Redis without congesting our web servers running the API and reducing the "extra time" we are introducing to each build from a user's perspective.

Are we going to use a separate DB to store this information?

We already talked about this in the design doc and I understood that we decided to use a different database already (#8124 (comment))

This will allow to not congest the production database and chunk its data or drop it completely if neeed without thinking that it may break something.

Check if json fields play nice with dates (we are saving the date in isoformat for now)

In case this is required, we can write a small Func to cast the field (see https://docs.djangoproject.com/en/4.0/ref/models/expressions/#func-expressions-1) or even just Extract (https://docs.djangoproject.com/en/4.0/ref/models/database-functions/#extract). However, I don't think we will need this immediately, at least.

@stsewd
Copy link
Member Author

stsewd commented Apr 27, 2022

Updated to use a dedicated db. We need to change our deploy scripts, because django migrate operates on one db at the time

https://docs.djangoproject.com/en/4.0/topics/db/multi-db/#synchronizing-your-databases

basically

django-admin migrate
django-admin migrate --database telemetry

I have tested this locally and it works as expected, well almost... When running showmigrations and migrate you'll see the other migrations being listed/executed, but if you check the db, you'll see that everything is correct (django fakes the other migrations). Even migrate --plan will show that all migrations will be applied, so there isn't a good way to tell if the router will work as expected other than checking that the setting DATABASE_ROUTERS is set.

And we also need to create the db on production, obviously.

I'll be opening another PR to support running multiple dbs in our docker setup.

stsewd added a commit to readthedocs/common that referenced this pull request Apr 27, 2022
@stsewd stsewd requested review from humitos and a team April 27, 2022 22:47
Comment on lines 641 to 642
except Exception:
log.exception("Error while uploading build data")
log.exception("Error while saving build data")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a try/except here? Why the .delay would fail? It made more sense when using an API call, but I think we can remove this block now that it's using a Celery task.

Copy link
Member Author

@stsewd stsewd May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just being defensive here. Other reasons it could fail would be if self.data or its attributes aren't defined.

@stsewd stsewd requested a review from humitos May 5, 2022 22:51
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this locally, but the code looks good to me! It seems there are some extra steps that are required when deploying, so it would be good to note them in the release project.

@humitos
Copy link
Member

humitos commented May 13, 2022

BuildData.objects.annotate(start_date=Cast('data__build__start', output_field=DateTimeField()))

Considering that this does not work and we are saving the results via Celery, which could back up the queue and run later, wouldn't be good to use build__start date (or build__date) for created_at and modified_at field? That way we will be re-using one of the build's date instead of adding another one.

@stsewd
Copy link
Member Author

stsewd commented May 16, 2022

for created_at and modified_at field? That way we will be re-using one of the build's date instead of adding another one.

Do you mean for the BuildData model? I think it's useful to have the actual dates for that model (for debugging and for auditing), if we need real date fields for the build itself, I'm fine adding new fields

@humitos
Copy link
Member

humitos commented May 17, 2022

No, that's fine. No need to add these fields at this point. It was just an idea to reuse the default ones, but I think what you said makes sense.

@stsewd stsewd merged commit c3ace24 into main May 18, 2022
@stsewd stsewd deleted the colect-build-data branch May 18, 2022 03:29
stsewd added a commit to readthedocs/common that referenced this pull request May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants