
Design doc: collect data about builds #8124


Merged · 12 commits merged into main · Apr 26, 2022

Conversation

@stsewd (Member) commented Apr 20, 2021

This doc describes what to collect and how to collect it.

To be decided

  • How/where to store this information? I propose a json file in storage, but we can use any other fancy tool specialized for this.
  • Should we make this public (for .org at least)?

I did a PoC with what is needed to collect packages from pip/conda in #8123

Read it here https://docs--8124.org.readthedocs.build/en/8124/development/design/telemetry.html

@stsewd stsewd force-pushed the build-telemetry-design-doc branch 3 times, most recently from 2ef98a9 to eed1ff3 Compare April 21, 2021 23:20
@stsewd stsewd marked this pull request as ready for review April 21, 2021 23:23
@stsewd stsewd requested a review from a team April 21, 2021 23:23
@stsewd (Member, Author) commented Apr 21, 2021

@astrojuanlu you may want to chime in, 'cause data science p:

@stsewd stsewd force-pushed the build-telemetry-design-doc branch from eed1ff3 to 2046cb3 Compare April 21, 2021 23:39
@astrojuanlu (Contributor) left a comment:

I left a few comments. I am +1000 on making data-driven decisions. Some high level remarks:

  • It's not clear to me, by reading this document, what data we are already storing. My understanding is that build logs are kept somewhere, but I have no idea where. I think that should be clarified here.
  • Related to the previous point: if we are already storing the build logs somewhere, perhaps this proposal should focus on proposing new structured data that we can store. And if so, my remarks about storing the full requirements.txt or environment.yml might not apply.
  • And finally, related to the previous points: whether this data will be structured (JSON) or not (blob of text) affects the decision of how to store it, I think. I am not an expert but I think blob storage might not be the best choice for analytics (long, sequential, "columnar" reads). For structured data, a PostgreSQL database will probably do the job (but I have no idea how "big" is our "data") and for unstructured data, ElasticSearch is probably the way to go (given that we are already familiar with it).


We can get a json with all dependencies with ``conda list --json``.
That command gets all the root dependencies and their dependencies,
so we may be collecting some noise, but we can use ``pip list`` as a secondary source.
@astrojuanlu (Contributor) commented on this snippet:

Similarly with pip, yes this gives us all installed dependencies without distinction of whether they are transitive dependencies or not (too bad), but probably we should also try to store the environment.yml somehow, just an idea.
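
As an illustration of this collection step, here is a minimal sketch using the CLI flags mentioned above; the helper name and return shape are assumptions, not the actual PoC from #8123:

```python
import json
import subprocess


def collect_installed_packages(has_conda: bool) -> dict:
    """Collect installed packages as structured data.

    ``conda list --json`` returns every package in the environment
    (root and transitive dependencies alike), so ``pip list`` is kept
    as a secondary source for pip-installed packages.
    """
    packages = {}
    if has_conda:
        result = subprocess.run(
            ["conda", "list", "--json"],
            capture_output=True, text=True, check=True,
        )
        packages["conda"] = json.loads(result.stdout)
    result = subprocess.run(
        ["pip", "list", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    # A list of {"name": ..., "version": ...} dictionaries.
    packages["pip"] = json.loads(result.stdout)
    return packages
```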

Storage
-------

We can save all this information in json files in cloud storage,
@astrojuanlu (Contributor) commented on this snippet:

I'm not an expert, but I believe ElasticSearch is a great fit for this kind of data, and we are already using it for search.

@ericholscher (Member) commented Apr 23, 2021:

The main issue is that we quickly run out of memory storing this in anything that is "live". I don't think we want to query this data in any kind of real-time manner, so something closer to a map-reduce type workflow is probably best. JSON in cloud storage probably isn't ideal for this, but I'm pretty confident we don't want it stored "hot" unless we're only storing a short timeframe.

A couple options here:

  • Only store the last 5 builds for any version
  • Only store the last 3 months of builds

There are tradeoffs here, but if we scale the amount of data down, it gives us more flexibility.
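
As a rough sketch of what either retention option could look like, assuming the data eventually lands in a database (the `BuildData` model and its `created`/`version_id` fields are hypothetical):

```python
from datetime import timedelta

from django.utils import timezone

from telemetry.models import BuildData  # hypothetical model


def prune_build_data():
    # Option: only store the last 3 months of builds.
    cutoff = timezone.now() - timedelta(days=90)
    BuildData.objects.filter(created__lt=cutoff).delete()

    # Option: only store the last 5 builds for any version.
    version_ids = BuildData.objects.values_list("version_id", flat=True).distinct()
    for version_id in version_ids:
        keep = list(
            BuildData.objects.filter(version_id=version_id)
            .order_by("-created")
            .values_list("pk", flat=True)[:5]
        )
        BuildData.objects.filter(version_id=version_id).exclude(pk__in=keep).delete()
```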

@humitos (Member) commented Apr 26, 2021:

I'd also consider storing only 1 build per active version. I don't think we want historical data for this, just the latest build for each currently active version.

A contributor replied:

Another +1 for avoiding long term hot storage of metrics. S3/cold storage isn't great to work with when doing analysis of build/build commands, but we can address this with some scripts to automate the process. 1 month of build data is about 7G of cloned files, but luckily we're not going to inspect this data frequently.

@ericholscher (Member) left a comment:

This is a good start. In particular getting the dependency data is a super interesting data set. I think we have a tradeoff between data size & queryability, so we need to think through what kind of analysis we want to do.

I see three major buckets that data could fall into:

  • Things we want to continuously monitor and graph via cloudwatch. This is more classic telemetry, and is super interesting and useful for making decisions without having to do any queries
  • Medium-term data, which should be an aggregated but recent subset of data stored in a live-queryable system (ES, SQL, or something more purpose-built on AWS?)
  • Long-term data, which should be everything stored somewhere not in memory. This is for longer-term data analysis and understanding changes over time. It can also store more data.

We should think about which ones we want to prioritize, and what kind of queries we care most about to think about the "v1" of this.


.. Since this information isn't sensitive,
I think we are fine with this structure
(we can't do bulk deletes of all info about a project if we follow this structure).
@ericholscher (Member) commented on this snippet:

I think storing it in "cold" storage is right for the long-term large-scale analysis. I'd love to have a subset of important things look more like our monitoring data, and reported/graphed live via cloudwatch. We could pick the top 10 most important/interesting things (eg. theme, sphinx version, etc.)


Analyzing the data
------------------

@ericholscher (Member) commented on this section:

I think we actually probably need to have this section first. Knowing what kind of analysis we want to run on the data will tell us how to store & query it.

Currently we have defined a way of storing data that we don't really look at. I think it's probably more valuable to start to have some kind of short-term storage of data in a query engine that we can query dynamically (ES or Postgres, as @astrojuanlu said). We'd have to pretty strongly truncate this data (eg. <4GB in memory) -- but I think we can definitely get some useful subsets of data into that format given that it isn't huge.

In particular, just getting this data into some kind of queryable format:

  • .readthedocs.yml
  • conf.py dumped as JSON
  • pip & conda files dumped

Would get us a lot of value across a short timeframe (eg. 1 month).

But, what are we going to do when we have this data to query? We should think about what kind of queries we want to run and what analysis would be valuable to make sure we're building the system properly.

@humitos (Member) commented Apr 26, 2021

For structured data, a PostgreSQL database will probably do the job (but I have no idea how "big" is our "data")

Maybe it does make sense to consider using a different db, separate from production, to store/query this data. This way we can use it without worrying too much about taking down prod.

@stsewd (Member, Author) commented Apr 26, 2021

I'll check how we can use the ES stack for this. I'm not sure about storing the requirements/environment files, as they may contain secrets (especially on .com). The same goes for storing conf.py: it may contain secrets, and it would require hooking into our extension; I think we want to stop relying on our extension as much as we can.

@stsewd stsewd requested a review from a team April 27, 2021 21:22

stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Jun 16, 2021
@stsewd stsewd removed the Status: stale Issue will be considered inactive soon label Jun 16, 2021
@ericholscher ericholscher added the Accepted Accepted issue on our roadmap label Jun 17, 2021
@astrojuanlu (Contributor) commented:

In light of recent improvements to add structured logging to our application and send it to NewRelic (see #8689 and related), would it be possible to re-evaluate the status of this document?

Some open questions that were addressed: the storage backend ✔️ We are sending structured logs to a place separated from our production database, and it looks like NewRelic could also host the build data proposed here.

Some open questions that were not addressed: ❓

But, what are we going to do when we have this data to query? We should think about what kind of queries we want to run and what analysis would be valuable to make sure we're building the system properly.

Some ideas (credit to @humitos for writing these down in the first place):

  • How much build time is a particular project using in a date range?
  • How much build time is a particular organization using in a date range?
  • How many projects are installing sphinx-hoverxref, and which of them are using a specific version?
  • What projects are using sphinx<=1.8.x?
  • How many builds using sphinx<=1.8.x have failed in the last 2 weeks?
  • How many projects are using the new build.tools config?
  • What projects would require the USE_LATEST_SPHINX feature flag because their builds are failing due to the docutils==0.18 release?
  • How many projects have tuned configuration for search (eg. search.ranking)?
  • How many projects have enabled PDF output?
  • How many projects use conda compared to pip?

I think these are all superb questions and I don't really have much to add.

The "difficult" part, in a way, was already done, and in a very graceful way I believe: with some minimal changes to our logging code we are storing data that we didn't have before. It would be cool to find a way to move forward with this that doesn't require us making any difficult decisions, and at the same time make it extendable so we can easily add more data to the mix when we need it.

@humitos (Member) commented Nov 26, 2021

@astrojuanlu

We are sending structured logs to a place separated from our production database, and it looks like NewRelic could also host the build data proposed here.

I'm happy with the result so far, even without having done a full exploration of the NR Logs UI or documentation yet; I think we will get a lot of insights there just by creating some rules/dashboards without too much effort.

The "difficult" part, in a way, was already done, and in a very graceful way I believe: with some minimal changes to our logging code we are storing data that we didn't have before.

I'd definitely do a test by sending all the build data to NR to see how much data it is, whether the service can handle it, and whether we can query JSON fields as easily as we can query key/value fields. One good test for this would be to send the packages.user.pip field as JSON and check if we can query it as "sphinx-hoverxref" in packages.user.pip.keys() or similar using NR. If sending a JSON field works, it will be easy to send one log line per build (Canonical Log Lines) with all this data to NR.

Note that sending all this data per-build would be a ton of data, so we probably want to send it to NR but not to our centralized rsyslog on util01.
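
A sketch of that test, assuming structlog (the library behind the structured logging work referenced above); the event name and helper are hypothetical, but the nesting mirrors the packages.user.pip example:

```python
import structlog

log = structlog.get_logger(__name__)


def emit_canonical_build_log(build_id: int, project_slug: str, pip_packages: list):
    """Send one log line per build with all the collected data attached."""
    log.info(
        "build.telemetry",
        build_id=build_id,
        project_slug=project_slug,
        # If NR indexes nested JSON, queries along the lines of
        # '"sphinx-hoverxref" in packages.user.pip' become possible.
        packages={"user": {"pip": pip_packages}},
    )
```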

@humitos (Member) commented Feb 1, 2022

We've done some related work already and we are plotting some nice graphs in Metabase and New Relic with the data we can get from the database and logs, respectively. However, there is some data we are not collecting yet, and I think we should continue the conversation around that particular topic, which is not yet defined.

Note that we are now fully migrated to Django 3.2, and soon we will have a real JSON field in the database that we will be able to query as simply as Build.objects.filter(_config__build__os='ubuntu-20.04'). See #8868 and #8869.

Once we have that data in our database, we will be able to query it from Metabase (note that Metabase does not natively support PostgreSQL JSONB fields, but it allows us to write the SQL query ourselves) and do some quick tests to plot the usage of some features we care about.

After that, I'd propose continuing with the plan I mentioned when working on KPIs for the Embed API (https://hackmd.io/eQViVXSeTFuxqSrrNO_8Pw#Plan-to-action):

  1. create a new telemetry db, separate from our production database (we can decide how much data we want to save here and chunk it over time to keep only the last 6/12 months or similar if we think it's a lot of data)
  2. use the after_return Celery handler (from the "Build process: use Celery handlers" #8815 refactor) to trigger a new task with all the build data collected from the current build that we want to dump into the telemetry db. This task will be executed on the web queue and will save all this data into the telemetry database (my proposal for the data to dump: https://hackmd.io/eQViVXSeTFuxqSrrNO_8Pw#Structure1)
  3. connect the telemetry database to Metabase and plot the data we are interested in

Using PostgreSQL JSON fields allows us to expand the data we are dumping in the future without any modification required in the database, which is great! Besides, it does not introduce a different technology that we have to learn, nor other overhead/barriers. We would all be re-using the Django, Celery, and SQL knowledge we already have and feel comfortable with.
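
A minimal sketch of steps 1 and 2, with hypothetical model/task names (the actual after_return wiring comes from the #8815 refactor):

```python
from celery import shared_task
from django.db import models


class BuildData(models.Model):
    """One row per build, stored in the separate ``telemetry`` database."""

    created = models.DateTimeField(auto_now_add=True)
    # Config, OS, tools, installed packages, etc., all in one JSON blob.
    data = models.JSONField()


@shared_task(queue="web")
def save_build_data(data):
    """Triggered from the build task's ``after_return`` handler."""
    BuildData.objects.using("telemetry").create(data=data)
```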

@ericholscher (Member) commented:

I think this plan makes sense, and I agree that we should start advancing this work again around KPI's. I do think having a more specific idea of what kind of queries we want to run (#8124 (comment)) will allow us to ensure we're building this schema correctly, but I'm 💯 on Metabase/Postgres being the right place to visualize and report on this. We can even make dashboards on Metabase public, if we wanted to share this info with the community.

@agjohnson (Contributor) commented:

Another +1 on using postgres here. Elastic is a really good technical solution, but we have better team strength in postgres right now. A separate database as Manuel mentioned is probably a good idea, though I could see us eventually duplicating a lot of model attributes to make things more queryable and that could be a point against splitting the database.

For queries that we want to run, @humitos and I discussed this topic last week. It would be most helpful to implement queries that we do not have at all currently. There is a lot of value in a query like project pip dependencies, and not a significant amount of value in new implementations of existing queries or data we already have access to -- like the build config, where we're already saving the rendered config on each build.

I'd like to see this move forward. Are we all in agreement on the technical implementation? Do we agree that starting with a new query like project requirement tracking is a good place to start?

If we're mostly in agreement, let's plop a quick note on the design doc that we're going towards postgres instead and get on with implementation.

@stsewd stsewd requested a review from a team as a code owner March 8, 2022 17:21
@stsewd (Member, Author) commented Mar 8, 2022

I have updated the document with what I have understood after reading the other document and comments:

  • We are going to use Metabase + Postgres.
  • We will use Django models to define the data.

Things to decide:

Using another db

We will have to duplicate some data and have some additional code, but this idea makes sense to me.
Useful reading: https://docs.djangoproject.com/en/4.0/topics/db/multi-db/
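
Following the Django docs linked above, the setup boils down to a second DATABASES entry plus a router along these lines (a sketch; the `telemetry` app label is an assumption):

```python
class TelemetryRouter:
    """Route telemetry models to their own database, everything else to default."""

    route_app_labels = {"telemetry"}

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return "telemetry"
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return "telemetry"
        return None

    def allow_migrate(self, db, app_label, **hints):
        if app_label in self.route_app_labels:
            return db == "telemetry"
        return None
```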

What data are we willing to duplicate?

The main one is the final config used in the build. We already save this on the build object, but it doesn't always store the config itself currently (if a build uses the same config as the previous one, we save a reference to the build it was first used in; we could also migrate this to a foreign key).

Duplicating the data will make it easier to query the information, as it won't require fetching that information from the other db and aggregating results (which would probably be complicated to do in Metabase?), but some data can already be queried from our main db (final config, build length).

How much to store

Should we store this for every build, or something like the latest build of each version of a project? We could also put a limit on the number of builds to store per project, or only keep builds that aren't older than x days.

Comment on lines +227 to +230
"pip": [{
"name": "sphinx",
"version": "3.4.5"
}],
A member commented:

I'd like to know if, with this "list of dictionaries", we can query:

  • projects using sphinx==3.4.5
  • projects using sphinx>=2.0.0

I guess we can, but it may be good to double-check it.

@stsewd (Member, Author) replied:

We could do basic things like "starts with 2.", but I'm not sure about something more complex (we could use a > operator, but that would depend on the ASCII ordering).
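
For the exact-match case, PostgreSQL JSONB containment (Django's __contains lookup, which compiles to the @> operator) should cover it; version ranges are the hard part, as noted above. A sketch with hypothetical model/field names:

```python
from telemetry.models import BuildData  # hypothetical model

# Projects using sphinx==3.4.5: containment matches partial dicts
# inside the list, so this finds any entry with both keys set.
exact = BuildData.objects.filter(
    data__packages__pip__contains=[{"name": "sphinx", "version": "3.4.5"}]
)

# Projects using any version of sphinx: omit the version key.
any_version = BuildData.objects.filter(
    data__packages__pip__contains=[{"name": "sphinx"}]
)
```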

@humitos (Member) commented Mar 9, 2022

@stsewd

We will have to duplicate some data

I'm fine duplicating data for now because it will keep things simpler to start. Also, it seems that the main duplicated data will be the config file (it will be in Build objects and in the telemetry database). However, I think storing the config in our main database won't be needed after we have it in the telemetry database. We are not using Build.config from the main app other than at build time, and we could delete it from there once it's in the telemetry database. I'm 👍🏼 on that.

Should we store this for every build, or something like the latest build of each version of a project? We could also put a limit of the number of builds to store per project or builds that aren't older than x days.

I'd start by storing all the builds. I prefer to store more data and then delete it than the other way.

As we don't care too much about "historical data" (e.g. builds from 3 years ago) to make decisions, I think we can limit the data to "the last N months/years" because it's simpler to implement than "M builds per project, no matter how old they are". This will also help us make decisions about supporting "N months/years of backward compatibility" for all projects.

@agjohnson (Contributor) commented:

However, I think storing the config in our main database won't be needed after we have it in the telemetry database

Currently, this is accurate. However, I have proposed adding metadata to the build detail page from the rendered config file. Things like build tools versions, the Ubuntu image, etc. might be nice additions. This would be a new feature though. I am currently surfacing the build config as one of the debug outputs available to admins in the new UI. We talked about this as a way to get rid of no-op commands like cat conf.py.

Anyways, not a big deal, it's not a huge amount of duplicated data.

I prefer to store more data and then delete it than the other way.

👍 on doing things inefficiently to start.

I usually do find myself just wanting the most recent builds of each version when I'm getting data out, though.

@humitos (Member) commented Mar 10, 2022

@agjohnson

I have proposed adding metadata in the build detail page from the rendered config file. Things like build tools versions, ubuntu image, etc might be nice additions. This is a new feature though. I am currently surfacing the build config as one of the debug outputs available to admins in the new UI though.

If the config file is strictly required to do this, we should keep it in the main db. However, I think it would be better to implement this by attaching known tags to the Build object instead and querying them by simply doing build.tags.all(). These BuildTag objects could have more metadata like the color to render, icon, etc. Anyway, this is outside the conversation we are having in this thread, but I just wanted to mention that I don't see Build._config as strictly required to be in our main db currently, and I'd be happy to remove it from there and reduce the complexity of our codebase.

@agjohnson (Contributor) commented Mar 10, 2022

Adding a BuildTag sounds more complicated than just having Build.config, I feel. I don't think it's complexity that we're removing by not storing Build.config, but it does reduce data duplication. For now, I'd say do not move the build config object off of Build, and instead duplicate what we need to the telemetry database. The cost here is low and leaves the door open to actually using the Build.config data. Later, we can decide to move it away from Build completely.

For example, we might just decide that showing the rendered config to users (not just admin debug) is a helpful UI feature. We'll need a Build.config of course for that.

@stsewd (Member, Author) commented Mar 10, 2022

Another decision to make: should the model have several fields, with some of them being JSON fields (config), or just one big JSON field? #8124 (comment)

@agjohnson (Contributor) commented:

It seems the idea with this database is that the data is temporary and mutable. I don't have opinions on whether or not a singular field works, but I do agree with @humitos that it's easy enough to switch later if we do hit issues with query performance, storage, etc.

Worth mentioning that JSON fields don't necessarily mitigate the need for migrations. As with any document-store database, a la Mongo, eventually you need to do a "migration". At some point our schema will change, and we'll need to rewrite data or suffer through handling mixed data schemas. JSON is not a panacea here, but having a well-thought-out schema up front goes a long way.
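
For example, when the schema inside the JSON changes, the "migration" becomes a data migration that rewrites rows in place; a sketch with hypothetical key names:

```python
from django.db import migrations


def move_pip_packages(apps, schema_editor):
    """Rewrite rows stored under the old schema into the new shape."""
    BuildData = apps.get_model("telemetry", "BuildData")
    db = schema_editor.connection.alias
    for row in BuildData.objects.using(db).iterator():
        if "pip_packages" in row.data:  # old schema
            row.data["packages"] = {"pip": row.data.pop("pip_packages")}
            row.save(update_fields=["data"])


class Migration(migrations.Migration):
    dependencies = [("telemetry", "0001_initial")]
    operations = [migrations.RunPython(move_pip_packages, migrations.RunPython.noop)]
```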

@humitos (Member) commented Apr 4, 2022

@agjohnson

Worth mentioning that JSON fields don't necessarily mitigate the need for migrations. As with any document store database, a la Mongo, eventually you need to do a "migration".

Yeah, I'm talking about schema migrations. Those are the ones I want to avoid here, since if they happen we are forced to do a data migration as well.

However, if we use only one JSON field, we can have two different schemas inside the JSON and:

  • make two different queries if it's required: one for the new schema and one for the old one
  • just wait until the old data gets deleted and always use the query with the new schema on it

So, things will keep working even if we change the "schema inside the JSON", whereas using regular fields would require a schema migration at the db level. I see this as simpler, more flexible, and easier as a starting point.

@agjohnson (Contributor) commented:

However, if we use only one JSON field, we can have two different schemas inside the JSON

This is maybe getting a bit ahead of the implementation. If we find ourselves wanting to hack up JSON fields to do something that multiple fields and/or data/schema migrations already give us, I would lean towards the simpler implementation.

Let's start at the easiest place possible and build from there. It sounds like we can ignore optimizations for redundant data on the Build model for now, and I agree it seems that a singular field is a good place to start.

If we hate this design, we'll switch to multiple fields in some fashion.

Anything else holding this design back right now?

@humitos (Member) commented Apr 14, 2022

and I agree it seems that a singular field is a good place to start.
If we hate this design, we'll switch to multiple fields in some fashion.

👍🏼

Anything else holding this design back right now?

Nope. IMO, this was already ready to move into implementation.

@stsewd stsewd mentioned this pull request Apr 14, 2022
@ericholscher (Member) left a comment:

I'd say let's merge this, since we're now working on the implementation. 👍

@humitos (Member) commented Apr 26, 2022

I'm fine merging this and continuing the discussion about the specifics in the PR for the implementation 👍🏼

@stsewd stsewd merged commit 62da193 into main Apr 26, 2022
@stsewd stsewd deleted the build-telemetry-design-doc branch April 26, 2022 14:52