Design doc: collect data about builds #8124
Conversation
@astrojuanlu you may want to chime in, 'cause data science p:
I left a few comments. I am +1000 on making data-driven decisions. Some high level remarks:
- It's not clear to me, by reading this document, what data we are already storing. My understanding is that build logs are kept somewhere, but I have no idea where. I think that should be clarified here.
- Related to the previous point: if we are already storing the build logs somewhere, perhaps this proposal should focus on proposing new structured data that we can store. And if so, my remarks about storing the full `requirements.txt` or `environment.yml` might not apply.
- And finally, related to the previous points: whether this data will be structured (JSON) or not (blob of text) affects the decision of how to store it, I think. I am not an expert but I think blob storage might not be the best choice for analytics (long, sequential, "columnar" reads). For structured data, a PostgreSQL database will probably do the job (but I have no idea how "big" our "data" is) and for unstructured data, ElasticSearch is probably the way to go (given that we are already familiar with it).
> We can get a json with all dependencies with ``conda list --json``.
> That command gets all the root dependencies and their dependencies,
> so we may be collecting some noise, but we can use ``pip list`` as a secondary source.
Similarly with pip: yes, this gives us all installed dependencies without distinction of whether they are transitive dependencies or not (too bad), but probably we should also try to store the `environment.yml` somehow, just an idea.
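As a rough illustration of the collection step in the quoted text, a sketch along these lines might work; the helper name and the `use_conda` flag are made up for the example, not part of the proposal:

```python
# Ask the environment's own tooling for JSON output.
import json
import subprocess


def collect_dependencies(use_conda=False):
    if use_conda:
        # ``conda list --json`` lists every installed package, including
        # transitive dependencies (hence the "noise" mentioned above).
        output = subprocess.check_output(["conda", "list", "--json"])
    else:
        # ``pip list`` has a JSON output mode as a secondary source.
        output = subprocess.check_output(["pip", "list", "--format=json"])
    return json.loads(output)
```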
> Storage
> -------
>
> We can save all this information in json files in cloud storage,
I'm not an expert, but I believe ElasticSearch is a great fit for this kind of data, and we are already using it for search.
The main issue is that we quickly run out of memory storing this in anything that is "live". If we don't want to query this data in any kind of real-time manner, something closer to a map-reduce type workflow is probably best. JSON in cloud storage probably isn't ideal for this, but I'm pretty confident we don't want it stored "hot" unless we're only storing a short timeframe.
A couple options here:
- Only store the last 5 builds for any version
- Only store the last 3 months of builds
There are tradeoffs here, but if we scale the amount of data down, it gives us more flexibility.
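For the "last 3 months" option above, a pruning job could be as small as the sketch below; the `BuildData` model and its `created` field are hypothetical, not names from the design doc:

```python
# Illustrative pruning job for the "only store the last 3 months" option.
from datetime import timedelta

from django.utils import timezone

from telemetry.models import BuildData  # hypothetical app/model


def prune_old_build_data():
    """Delete collected build data older than 90 days."""
    cutoff = timezone.now() - timedelta(days=90)
    BuildData.objects.filter(created__lt=cutoff).delete()
```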
I'd also consider storing only 1 build per active version. I don't think we want historical data for this, just the current data for the active versions.
Another +1 for avoiding long-term hot storage of metrics. S3/cold storage isn't great to work with when doing analysis of builds/build commands, but we can address this with some scripts to automate the process. 1 month of build data is about 7G of cloned files, but luckily we're not going to inspect this data frequently.
This is a good start. In particular getting the dependency data is a super interesting data set. I think we have a tradeoff between data size & queryability, so we need to think through what kind of analysis we want to do.
I see three major buckets that data could fall into:
- Things we want to continuously monitor and graph via cloudwatch. This is more classic telemetry, and is super interesting and useful for making decisions without having to do any queries
- Medium-term data, which should be an aggregated but recent subset of data stored in a live-queryable system (ES, SQL, or something more purpose-built on AWS?)
- Long-term data, which should be everything stored somewhere not in memory. This is for longer-term data analysis and understanding changes over time. It can also store more data.
We should think about which ones we want to prioritize, and what kind of queries we care most about to think about the "v1" of this.
> .. Since this information isn't sensitive,
>    I think we are fine with this structure
>    (we can't do bulk deletes of all info about a project if we follow this structure).
I think storing it in "cold" storage is right for the long-term large-scale analysis. I'd love to have a subset of important things look more like our monitoring data, and reported/graphed live via cloudwatch. We could pick the top 10 most important/interesting things (eg. theme, sphinx version, etc.)
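Purely as a sketch of that "classic telemetry" idea, reporting one of the top-10 facts could look like the snippet below; the namespace, metric, and dimension names are invented for the example:

```python
# Report one build fact as a CloudWatch metric, assuming boto3 is
# configured with credentials for our AWS account.
import boto3

cloudwatch = boto3.client("cloudwatch")


def report_build_fact(dimension, value):
    """Count one build per (dimension, value) pair, e.g. ("Theme", "alabaster")."""
    cloudwatch.put_metric_data(
        Namespace="ReadTheDocs/Builds",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "BuildCount",
                "Dimensions": [{"Name": dimension, "Value": value}],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    )
```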
> Analyzing the data
> ------------------
I think we actually probably need to have this section first. Knowing what kind of analysis we want to run on the data will tell us how to store & query it.
Currently we have defined a way of storing data that we don't really look at. I think it's probably more valuable to start to have some kind of short-term storage of data in a query engine that we can query dynamically (ES or Postgres, as @astrojuanlu said). We'd have to pretty strongly truncate this data (eg. <4GB in memory) -- but I think we can definitely get some useful subsets of data into that format given that it isn't huge.
In particular, just getting this data into some kind of queryable format:
- .readthedocs.yml
- conf.py dumped as JSON
- pip & conda files dumped
Would get us a lot of value across a short timeframe (eg. 1 month).
But, what are we going to do when we have this data to query? We should think about what kind of queries we want to run and what analysis would be valuable to make sure we're building the system properly.
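For the "conf.py dumped as JSON" item in the list above, a hedged sketch: keep only the values that survive JSON serialization. How the conf.py namespace is obtained is out of scope here; `conf_globals` is assumed to be a plain dict and the helper name is made up:

```python
import json


def conf_to_json(conf_globals):
    data = {}
    for key, value in conf_globals.items():
        if key.startswith("_"):
            continue  # skip private/internal names
        try:
            json.dumps(value)  # probe: is this value JSON-serializable?
        except (TypeError, ValueError):
            continue  # drop modules, functions, classes, etc.
        data[key] = value
    return json.dumps(data)
```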
Maybe it does make sense to consider using a different db, separated from production, to store/query this data. This way we can use it without too much worry about taking down prod.
I'll check how we can use the ES stack for this. I'm not sure about storing the requirements/environments as they may contain secrets (especially in .com). And the same goes for storing the conf.py: it may have secrets, and it would require hooking our extension; I think we want to stop relying on our extension as much as we can.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In light of recent improvements to add structured logging to our application and send it to NewRelic (see #8689 and related), would it be possible to re-evaluate the status of this document?

Some open questions that were addressed:
- the storage backend ✔️ We are sending structured logs to a place separated from our production database, and it looks like NewRelic could also host the build data proposed here.

Some open questions that were not addressed: ❓
Some ideas (credit to @humitos for writing these down in the first place):
I think these are all superb questions and I don't really have much to add. The "difficult" part, in a way, was already done, and in a very graceful way I believe: with some minimal changes to our logging code we are storing data that we didn't have before. It would be cool to find a way to move forward with this that doesn't require us making any difficult decisions, and at the same time make it extendable so we can easily add more data to the mix when we need it.
I'm happy with the result so far even without doing a full exploration of the NR Logs UI or documentation yet, but I think we will get a lot of insights from there just by creating some rules/dashboards without too much effort.
I'd definitely do a test by sending all the build data to NR to see how much data it is, whether the service can afford it, and also whether we can query JSON fields as easily as we can query key/value fields. One good test for this would be to send

Note that sending all this data per-build would be a ton of data, so we probably want to send it to NR but not to our centralized rsyslog on util01.
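Purely for illustration: with the structured logging from #8689 in place, that test could be a single log call per build. The event name and fields below are invented for the example:

```python
import structlog

log = structlog.get_logger(__name__)


def report_build_telemetry(build):
    # Emit one structured event per build; the log pipeline ships it to NR.
    log.info(
        "build.telemetry",  # hypothetical event name
        project_slug=build.project.slug,
        version_slug=build.version.slug,
        config=build.config,  # the rendered config, as a dict
    )
```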
We've done some related work already and we are plotting some nice plots in Metabase and New Relic with the data we can get from the database and logs respectively. However, there is some data we are not collecting yet, and I think we should continue the conversation around that particular topic, which is not yet defined. Note that we are now fully migrated to Django 3.2 and soon we will have a real JSON field in the database that we will be able to query as simply as

Once we have that data in our database we will be able to query it from Metabase (note that Metabase does not natively support PostgreSQL JSONb fields, but it allows us to write the SQL query ourselves) and do some quick tests to plot the usage of some features we care about. After that, I'd propose to continue with the plan I mentioned when working on KPI's for Embed API (https://hackmd.io/eQViVXSeTFuxqSrrNO_8Pw#Plan-to-action):
Using PostgreSQL JSON fields allows us to expand the data we are dumping in the future without any modification required in the database, which is great! Besides, it does not introduce a different technology that we have to learn, nor other overhead/barriers. We would be re-using all the Django, Celery, and SQL knowledge we already have and feel comfortable with.
I think this plan makes sense, and I agree that we should start advancing this work again around KPI's. I do think having a more specific idea of what kind of queries we want to run (#8124 (comment)) will allow us to ensure we're building this schema correctly, but I'm 💯 on Metabase/Postgres being the right place to visualize and report on this. We can even make dashboards on Metabase public, if we wanted to share this info with the community.
Another +1 on using postgres here. Elastic is a really good technical solution, but we have better team strength in postgres right now. A separate database, as Manuel mentioned, is probably a good idea, though I could see us eventually duplicating a lot of model attributes to make things more queryable, and that could be a point against splitting the database. For queries that we want to run, @humitos and I discussed this topic last week. It would be most helpful to implement queries that we do not have at all currently. There is a lot of value in a query like project pip dependencies, and not a significant amount of value in a new implementation of existing queries or data we already have access to -- like the build config, since we're already saving the rendered config on each build. I'd like to see this move forward. Are we all in agreement on the technical implementation? Do we agree that starting with a new query like project requirement tracking is a good place to start? If we're mostly in agreement, let's plop a quick note on the design doc that we're going towards postgres instead and get on with implementation.
I have updated the document with what I have understood after reading the other document and comments:
Things to decide:

**Using another db**: We will have to duplicate some data and have some additional code, but this idea makes sense to me.

**What data are we willing to duplicate?** The main one is the final config used on the build; we already save this on the build object, but it doesn't always store the config itself currently (if it's using the same config, we save a reference to the previous one that was used; we can also migrate this to a foreign key). Duplicating the data will make it easier to query the information, as it won't require fetching that information from the other db/aggregating results (and that's probably complicated to do on Metabase?), but some data can already be queried from our main db (final config, build length).

**How much to store?** Should we store this for every build, or something like the latest build of each version of a project? We could also put a limit on the number of builds to store per project, or only keep builds that aren't older than x days.
"pip": [{ | ||
"name": "sphinx", | ||
"version": "3.4.5" | ||
}], |
I'd like to know if, with this "list of dictionaries", we can query:
- projects using sphinx==3.4.5
- projects using sphinx>=2.0.0
I guess we can, but it may be good to double-check it.
We could do basic things like "starts with 2.", not sure about something more complex (we could use a > operator, but that will depend on the ASCII ordering).
I'm fine duplicating data for now because it will keep things simpler to start. Also, it seems that the main duplicated data will be the config file (it will be in

I'd start by storing all the builds. I prefer to store more data and then delete it than the other way around. As we don't care too much about "historical data" (e.g. builds from 3 years ago) to make decisions, I think we can limit the data to "N months/years ago" because it's simpler to implement than "M builds per project, no matter how old they are". This will also help us to make decisions to support "N months/years of backward compatibility" for all the projects.
Currently, this is accurate. However, I have proposed adding metadata in the build detail page from the rendered config file. Things like build tools versions, ubuntu image, etc. might be nice additions. This is a new feature though. I am currently surfacing the build config as one of the debug outputs available to admins in the new UI though. We talked about this as a way to get rid of no-op commands like

Anyways, not a big deal, it's not a huge amount of duplicated data.

👍 on doing things inefficiently to start. I usually do find myself just wanting the most recent builds of versions when I'm getting data out though.
If the config file is strictly required to do this, we should keep it in the main db. However, I think it would be better to implement this by attaching known tags to the Build object instead and query them by simply doing
Adding a BuildTag sounds more complicated than just having the Build.config, I feel. I don't think it's complexity that we're removing by not storing the Build.config, but it does reduce data duplication. For now, I'd say do not move the build config object off of Build, and instead duplicate what we need to the telemetry database. The cost here is low and leaves the door open to actually using the Build.config data. Later, we can decide to move it away from Build completely. For example, we might just decide that showing the rendered config to users (not just admin debug) is a helpful UI feature. We'll need a Build.config of course for that.
Another decision to make: should the model have several fields with some of them being json fields (config), or just one big json field? #8124 (comment)
It seems the idea with this database is that the data is temporary and mutable. I don't have opinions on whether or not a singular field works, but I do agree with @humitos that it's easy enough to switch later if we do hit issues with query performance, storage, etc. Worth mentioning that JSON fields don't necessarily mitigate the need for migrations. As with any document store database, a la Mongo, eventually you need to do a "migration". At some point our schema will change and we'll need to rewrite data or suffer through handling mixed data schemas. JSON is not a panacea here, but having a well thought out schema up front goes a long way.
Yeah, I'm talking about schema migrations. Those are the ones I want to avoid here, since if they happen we are forced to do a data migration as well. However, if we use only one JSON field, we can have two different schemas inside the JSON and:

So, things will keep working even if we change the "schema inside the json", but it would require a schema migration at the db level if we used regular fields. I see this as simpler, more flexible, and easier as a starting point.
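A minimal sketch of that single-field approach: one JSON blob per build that carries its own schema version, so the "schema inside the json" can evolve without database migrations. All names below are hypothetical, not the final design:

```python
from django.db import models


class BuildData(models.Model):
    created = models.DateTimeField(auto_now_add=True)
    data = models.JSONField()


def save_build_data(build, pip_packages):
    BuildData.objects.create(
        data={
            "schema_version": 1,  # bump when the inner schema changes
            "project": build.project.slug,
            "version": build.version.slug,
            "config": build.config,  # the rendered config
            "pip": pip_packages,  # e.g. output of ``pip list --format=json``
        }
    )
```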
This is maybe getting a bit ahead of the implementation. If we find ourselves wanting to hack up JSON fields to do something that multiple fields and/or data/schema migrations already give us, I would lean towards the simpler implementation. Let's start at the easiest place possible and build from there. It sounds like we can ignore optimizations for redundant data on the Build model for now, and I agree it seems that a singular field is a good place to start. If we hate this design, we'll switch to multiple fields in some fashion. Anything else holding this design back right now?
👍🏼
Nope. IMO, this was already ready to start with the implementation. |
I'd say let's merge this, since we're now working on implementation. 👍
I'm fine merging this and continuing the discussion about the specifics in the PR for the implementation 👍🏼
This doc describes what to collect and how to collect it.
To be decided
I did a PoC with what is needed to collect packages from pip/conda in #8123
Read it here https://docs--8124.org.readthedocs.build/en/8124/development/design/telemetry.html