Collect build data #9113
Conversation
Force-pushed from 866c822 to ce54523
Force-pushed from ce54523 to f0b0fc3
It looks like a good initial approach! I left some comments to consider.
> Since this data is collected from the builders, we need to pass the data over our API (no db access)
I mentioned this in one of the comments as well. I think we can avoid sending the data via the API and use a Celery task instead. This way we will pass the data via Redis, without congesting the web servers that run the API, and also reduce the "extra time" we are introducing into each build from the user's perspective.
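As an illustration only, a Celery task for this could look roughly like the sketch below; the task name, queue, model, and import path are assumptions, not the PR's actual code:

```python
from celery import shared_task


@shared_task(queue="web")
def save_build_data(build_id, data):
    # Runs on a web worker that has database access, so the builder only
    # needs to enqueue the payload through Redis instead of hitting the API.
    from readthedocs.telemetry.models import BuildData  # hypothetical import path

    BuildData.objects.create(build_id=build_id, data=data)
```

The builder would then enqueue it with something like `save_build_data.delay(build_id, collected_data)` once the build finishes.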
> Are we going to use a separate DB to store this information?
We already talked about this in the design doc, and I understood that we had decided to use a different database (#8124 (comment)).
This will allow us to avoid congesting the production database, and to chunk its data or drop it completely if needed without worrying that it may break something.
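To make the "separate DB" idea concrete, here is a hedged sketch of a second database alias plus a router that pins a hypothetical telemetry app to it; the alias, app label, and module paths are assumptions:

```python
# settings.py (sketch)
DATABASES = {
    "default": {"ENGINE": "django.db.backends.postgresql", "NAME": "readthedocs"},
    "telemetry": {"ENGINE": "django.db.backends.postgresql", "NAME": "readthedocs_telemetry"},
}
DATABASE_ROUTERS = ["readthedocs.core.routers.TelemetryRouter"]  # hypothetical module


# readthedocs/core/routers.py (hypothetical)
class TelemetryRouter:
    """Route the telemetry app to its own database so its data can be pruned freely."""

    def db_for_read(self, model, **hints):
        return "telemetry" if model._meta.app_label == "telemetry" else None

    def db_for_write(self, model, **hints):
        return "telemetry" if model._meta.app_label == "telemetry" else None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == "telemetry":
            return db == "telemetry"
        return db == "default"
```

With a router like this, truncating or dropping the telemetry database cannot touch anything stored in `default`.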
> Check if json fields play nice with dates (we are saving the date in isoformat for now)
In case this is required, we can write a small Func to cast the field (see https://docs.djangoproject.com/en/4.0/ref/models/expressions/#func-expressions-1) or even just use Extract (https://docs.djangoproject.com/en/4.0/ref/models/database-functions/#extract). However, I don't think we will need this immediately, at least.
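If it does become necessary later, a query along these lines could cast the isoformat string back to a datetime at the database level; the model and key names (BuildData, data, start) are assumptions:

```python
from django.db.models import DateTimeField
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast, ExtractYear

from readthedocs.telemetry.models import BuildData  # hypothetical model/path

# Cast the ISO-8601 string stored at data["start"] to a real datetime,
# then extract the year from it at the database level.
builds_by_year = BuildData.objects.annotate(
    started=Cast(KeyTextTransform("start", "data"), DateTimeField())
).annotate(year=ExtractYear("started"))
```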
Updated to use a dedicated db. We need to change our deploy scripts because of https://docs.djangoproject.com/en/4.0/topics/db/multi-db/#synchronizing-your-databases, basically.
I have tested this locally and it works as expected, well almost... When running [...]. We also need to create the db on production, obviously. I'll be opening another PR to support running multiple dbs in our docker setup.
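As the multi-db docs linked above note, `migrate` applies to one database at a time, so the deploy scripts need an extra pass per alias. A minimal sketch using `call_command`, where the "telemetry" alias is an assumption:

```python
from django.core.management import call_command

# Equivalent to running `python manage.py migrate` and then
# `python manage.py migrate --database=telemetry` from the deploy scripts.
call_command("migrate")                        # default database
call_command("migrate", database="telemetry")  # dedicated build-data database
```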
readthedocs/projects/tasks/builds.py (Outdated)

     except Exception:
    -    log.exception("Error while uploading build data")
    +    log.exception("Error while saving build data")
Do we need a try/except here? Why would the .delay fail? It made more sense when using an API call, but I think we can remove this block now that it's using a Celery task.
Just being defensive here. It could also fail if self.data or its attributes aren't defined.
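For context, the pattern being discussed looks roughly like this; the names are taken from the surrounding conversation rather than copied from the PR:

```python
try:
    # .delay() only enqueues the task, but building the payload can still blow
    # up if self.data or the attributes read from it were never set (e.g. the
    # build failed early), hence the defensive except around the dispatch.
    save_build_data.delay(self.data.build["id"], collected_data)
except Exception:
    log.exception("Error while saving build data")
```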
I haven't tested this locally, but the code looks good to me! It seems there are some extra steps that are required when deploying, so it would be good to note them in the release project.
Considering that this does not work, and that we are saving the results via Celery, which could back up the queue and run later, wouldn't it be good to use ...
Do you mean for the BuildData model? I think it's useful to have the actual dates for that model (for debugging and for auditing). If we need real date fields for the build itself, I'm fine with adding new fields.
No, that's fine. No need to add these fields at this point. It was just an idea to reuse the default ones, but I think what you said makes sense.
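For reference, a minimal sketch of the model shape this thread is about, with an auto-populated created timestamp next to the raw payload; the field names are assumptions rather than the final schema:

```python
from django.db import models


class BuildData(models.Model):
    # When the row was written; this can lag the real build dates if the
    # Celery queue backs up, which is why the payload keeps its own timestamps.
    created = models.DateTimeField(auto_now_add=True)
    # Raw telemetry collected on the builder; dates inside are isoformat strings.
    data = models.JSONField()
```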
WIP, implementation of #8124
Some notes:
We need to collect the data before the container is killed
We need to upload the data after the build has finished, otherwise some fields may be out of date or unset (build.length especially)
Since this data is collected from the builders, we need to pass the data over our API (no db access)
I'm passing only the data that we won't have after the container is killed, and complementing the other data before saving the object (project/org/build etc).
We need to keep the original config from the user; since we modify the "original copy", we need to copy it twice (see the sketch after these notes).
Check if json fields play nice with dates (we are saving the date in isoformat for now)
This doesn't work, and I'm not sure whether it's going to work on Metabase either.
The output of all apt packages seems to be the biggest one.
Are we going to use a separate DB to store this information?
Check https://docs.djangoproject.com/en/4.0/topics/serialization/#djangojsonencoder (doesn't look useful for our case).
We probably need to fix "lsremote fails if the output is bigger than DATA_UPLOAD_MAX_MEMORY_SIZE" (#9112) too, since it could pollute our data.
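As a rough illustration of the collection step mentioned in the notes above, something along these lines gathers the container-bound data, copies the user's config, and serializes dates as isoformat strings; every name here is illustrative, not the PR's actual code:

```python
import datetime


def collect_build_data(user_config, apt_packages):
    """Gather everything that disappears once the build container is killed."""
    return {
        # JSON has no native date type, so timestamps go in as ISO-8601 strings.
        "start": datetime.datetime.utcnow().isoformat(),
        # Keep a copy of the user's original config, since the build mutates
        # the "original copy" it is handed.
        "config": {"user": dict(user_config)},
        # The list of installed apt packages tends to be the largest chunk.
        "packages": {"apt": list(apt_packages)},
    }
```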