Collect Data About Builds
=========================

We may want to make some decisions in the future about deprecations and supported versions.
Right now we don't have data about the usage of packages and their versions on Read the Docs
that would let us make an informed decision.

.. contents::
:local:
:depth: 3

Tools
-----

Kibana
   - https://www.elastic.co/kibana
   - We can import data from ElasticSearch.
   - Cloud service provided by Elastic.

Superset
   - https://superset.apache.org/
   - We can import data from several databases (including Postgres and ElasticSearch).
   - Easy to set up locally, but there doesn't seem to be a managed cloud offering for it.

Metabase
   - https://www.metabase.com/
   - We can import data from several databases (including Postgres).
   - Cloud service provided by Metabase.

Summary: we have several tools that can inspect data from a Postgres database,
and we also have ``Kibana``, which works *only* with ElasticSearch.
The data to be collected can be saved in a Postgres or ElasticSearch database;
if we use Postgres, we would need to use *real* JSON fields.

Data to be collected
--------------------

The following data can be collected after installing all dependencies.

Configuration file
~~~~~~~~~~~~~~~~~~

We are already saving the config file in our database,
but to save some space we only save it if it's different from the one used in the previous build
(if it's the same, we save a reference to it).

The config file being saved isn't the original one written by the user,
but the result of merging it with its default values.
It's saved using a *fake* ``JSONField``
(a charfield that is transformed into JSON when the model object is created).
For these reasons we can't query or export these config files in bulk without iterating over all objects.

We may also want to have the original config file,
so we know which settings users are using.
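
For example, if we stored both config files in *real* JSON fields, bulk queries could be done directly in the database.
A minimal sketch, with hypothetical model and field names (this is not the current implementation):

.. code-block:: python

    from django.db import models


    class BuildConfigData(models.Model):

        """Hypothetical model: one row per build, queryable in bulk."""

        # The config file exactly as the user wrote it.
        user_config = models.JSONField(null=True, blank=True)
        # The config file after merging in the default values.
        final_config = models.JSONField(null=True, blank=True)
        created = models.DateTimeField(auto_now_add=True)

With ``jsonb`` on Postgres, a question such as how many builds enable a given option
becomes a single ``.filter()`` call instead of iterating over every object.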

PIP packages
~~~~~~~~~~~~

We can get a JSON document with all dependencies, or only the root dependencies, using ``pip list``.
This gives us the names of the packages and the versions used in the build.

.. code-block::

$ pip list --pre --local --format json | jq
# and
$ pip list --pre --not-required --local --format json | jq
[
{
"name": "requests-mock",
"version": "1.8.0"
},
{
"name": "requests-toolbelt",
"version": "0.9.1"
},
{
"name": "rstcheck",
"version": "3.3.1"
},
{
"name": "selectolax",
"version": "0.2.10"
},
{
"name": "slumber",
"version": "0.7.1"
},
{
"name": "sphinx-autobuild",
"version": "2020.9.1"
},
{
"name": "sphinx-hoverxref",
"version": "0.5b1"
},
]

With the ``--not-required`` option, pip will list only the root dependencies.
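
A rough sketch of how this could be collected from inside the build environment;
the function name is hypothetical and error handling is omitted:

.. code-block:: python

    import json
    import subprocess


    def collect_pip_packages(only_root=False):
        """Return a list of ``{"name": ..., "version": ...}`` dicts from pip."""
        cmd = ["pip", "list", "--pre", "--local", "--format", "json"]
        if only_root:
            # Only the root dependencies (what the user asked for).
            cmd.append("--not-required")
        output = subprocess.check_output(cmd)
        return [
            {"name": package["name"], "version": package["version"]}
            for package in json.loads(output)
        ]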

Conda packages
~~~~~~~~~~~~~~

We can get a JSON document with all dependencies using ``conda list --json``.
That command lists the root dependencies together with their own dependencies,
so we may be collecting some noise, but we can use ``pip list`` as a secondary source.

Reviewer comment (Contributor): as with pip, this gives us all installed dependencies without distinguishing transitive dependencies (too bad), but we should probably also try to store the ``environment.yml`` somehow; just an idea.


.. code-block::

$ conda list --json --name conda-env

[
{
"base_url": "https://conda.anaconda.org/conda-forge",
"build_number": 0,
"build_string": "py_0",
"channel": "conda-forge",
"dist_name": "alabaster-0.7.12-py_0",
"name": "alabaster",
"platform": "noarch",
"version": "0.7.12"
},
{
"base_url": "https://conda.anaconda.org/conda-forge",
"build_number": 0,
"build_string": "pyh9f0ad1d_0",
"channel": "conda-forge",
"dist_name": "asn1crypto-1.4.0-pyh9f0ad1d_0",
"name": "asn1crypto",
"platform": "noarch",
"version": "1.4.0"
},
{
"base_url": "https://conda.anaconda.org/conda-forge",
"build_number": 3,
"build_string": "3",
"channel": "conda-forge",
"dist_name": "python-3.5.4-3",
"name": "python",
"platform": "linux-64",
"version": "3.5.4"
}
]
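
A similar sketch for Conda, keeping only the fields we plan to store (name, channel, and version);
the helper and environment names are hypothetical:

.. code-block:: python

    import json
    import subprocess


    def collect_conda_packages(environment="conda-env"):
        """Return name/channel/version for each package in a conda environment."""
        output = subprocess.check_output(
            ["conda", "list", "--json", "--name", environment]
        )
        return [
            {
                "name": package["name"],
                "channel": package["channel"],
                "version": package["version"],
            }
            for package in json.loads(output)
        ]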

APT packages
~~~~~~~~~~~~

This isn't implemented yet, but once it is,
we can get the list of packages from the config file,
or we can list the installed packages with ``dpkg --get-selections``.
That command also lists all pre-installed packages, so we may be getting some noise.

.. code-block::

$ dpkg --get-selections

adduser install
apt install
base-files install
base-passwd install
bash install
binutils install
binutils-common:amd64 install
binutils-x86-64-linux-gnu install
bsdutils install
build-essential install
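
A sketch of turning that output into a plain list of package names
(the pre-installed packages from the base image would still need to be filtered out; that part isn't shown):

.. code-block:: python

    import subprocess


    def collect_apt_packages():
        """Return the names of the packages dpkg reports as installed."""
        output = subprocess.check_output(
            ["dpkg", "--get-selections"], text=True
        )
        packages = []
        for line in output.splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[1] == "install":
                # Drop the architecture suffix (e.g. "binutils-common:amd64").
                packages.append(parts[0].split(":")[0])
        return packages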

Python
~~~~~~

We can get the Python version from the config file when using a Python environment,
and from the ``conda list`` output when using a Conda environment.
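
For the Conda case, the version can be read from the same ``conda list`` output collected above;
a small sketch reusing the hypothetical helper from the Conda section:

.. code-block:: python

    def get_python_version(conda_packages):
        """Extract the Python version from the parsed ``conda list`` output."""
        for package in conda_packages:
            if package["name"] == "python":
                return package["version"]
        return None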

OS
~~

We can infer the OS version from the build image used in the config file,
but since the image's contents change over time, it's more reliable to get it from the OS itself:

.. code-block::

$ lsb_release --description
Description: Ubuntu 18.04.5 LTS
# or
$ cat /etc/issue
Ubuntu 18.04.5 LTS \n \l
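
A sketch of collecting this at build time, assuming ``lsb_release`` is available in the build image:

.. code-block:: python

    import subprocess


    def get_os_description():
        """Return a string like "Ubuntu 18.04.5 LTS" from ``lsb_release``."""
        output = subprocess.check_output(
            ["lsb_release", "--description"], text=True
        )
        # The output looks like "Description:\tUbuntu 18.04.5 LTS".
        return output.partition(":")[2].strip()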

Format
~~~~~~

The final file to be saved would have the following information:

- project: the project slug
- version: the version slug
- build: the build id (which may stop existing if the project is deleted)
- date: full date in ISO format or POSIX timestamp
- user_config: the original user config file
- final_config: the final configuration used (merged with defaults)
- packages.pip: list of root pip packages with name and version
- packages.pip_all: list of all pip packages (including transitive dependencies) with name and version
- packages.conda: list of conda packages with name, channel, and version
- packages.apt: list of apt packages
- python: the Python version used
- os: the operating system used

.. code-block:: json

{
"project": "docs",
"version": "latest",
"build": 12,
"date": "2021-04-20-...",
"user_config": {},
"final_config": {},
"packages": {
"pip": [{
"name": "sphinx",
"version": "3.4.5"
}],
"pip_all": [
{
"name": "sphinx",
"version": "3.4.5"
},
{
"name": "docutils",
"version": "0.16.0"
}
],
"conda": [{
"name": "sphinx",
"channel": "conda-forge",
"version": "0.1"
}],
"apt": [
"python3-dev",
"cmatrix"
]
},
"python": "3.7",
"os": {
"name": "ubuntu",
"version": "18.04.5"
}
}

Storage
-------

Since this information isn't sensitive,
we should be fine keeping this data even if the project/version is deleted.
As we don't care about historical data,
we can save the information per version, from its latest build only.

Reviewer comment (Member): I think we probably need to have this section first. Knowing what kind of analysis we want to run on the data will tell us how to store and query it.

Currently we have defined a way of storing data that we don't really look at. I think it's probably more valuable to start with some kind of short-term storage of the data in a query engine that we can query dynamically (ES or Postgres, as @astrojuanlu said). We'd have to truncate this data pretty strongly (e.g. keep it under 4GB in memory), but I think we can definitely get some useful subsets of data into that format given that it isn't huge.

In particular, just getting this data into some kind of queryable format:

- ``.readthedocs.yml``
- ``conf.py`` dumped as JSON
- pip and conda files dumped

would get us a lot of value across a short timeframe (e.g. one month).

But what are we going to do when we have this data to query? We should think about what kind of queries we want to run and what analysis would be valuable, to make sure we're building the system properly.

We can collect data for one year,
export it to cloud storage after it has been analyzed (and maybe share this data publicly),
and remove it from our database if it takes up too much space.
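
As an example of the kind of analysis this enables, assuming the document above is stored in a Postgres ``JSONField``
(here a hypothetical ``data`` field on a hypothetical ``BuildData`` model),
package and version usage becomes a matter of simple queries:

.. code-block:: python

    from django.db.models import Count

    # BuildData is the hypothetical model holding the JSON document above.
    # How many versions (latest build only) install Sphinx via pip?
    BuildData.objects.filter(
        data__packages__pip__contains=[{"name": "sphinx"}],
    ).count()

    # Distribution of Python versions across all collected versions.
    (
        BuildData.objects
        .values("data__python")
        .annotate(total=Count("id"))
        .order_by("-total")
    )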