From 2046cb388467b7c3806bab8bcd4812e5eac3736d Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 20 Apr 2021 18:54:17 -0500 Subject: [PATCH 01/11] Design doc: collect data about builds --- docs/development/design/telemetry.rst | 232 ++++++++++++++++++++++++++ 1 file changed, 232 insertions(+) create mode 100644 docs/development/design/telemetry.rst diff --git a/docs/development/design/telemetry.rst b/docs/development/design/telemetry.rst new file mode 100644 index 00000000000..fd979505e11 --- /dev/null +++ b/docs/development/design/telemetry.rst @@ -0,0 +1,232 @@ +Collect Data About Builds +========================= + +We may want to take some decisions in the future about deprecations and supported versions. +Right now we don't have data about the usage of packages and their versions on Read the Docs +to be able to make a good decision. + +.. contents:: + :local: + :depth: 3 + +Data to be collected +-------------------- + +The following data can be collected after installing all dependencies. + +Configuration file +~~~~~~~~~~~~~~~~~~ + +We are saving the config file in our database, +but to save some space we are saving it only if it's different than the one from a previous build +(if it's the same we save a reference to it). + +The config file being saved isn't the original one used by the user, +but the result of merging it with its default values. +It's saved using a _fake_ ``JSONField`` +(charfield that is transformed to json when creating the model object). +For these reasons we can't query or download them in bulk without iterating over all objects. + +We may also want to have the original config file, +so we know which settings users are using. + +PIP packages +~~~~~~~~~~~~ + +We can get a json with all root dependencies with ``pip list``. +This will allow us to have the name of the packages and their versions used in the build. + +.. 
code-block:: + + $ pip list --pre --not-required --local --format json | jq + [ + { + "name": "requests-mock", + "version": "1.8.0" + }, + { + "name": "requests-toolbelt", + "version": "0.9.1" + }, + { + "name": "rstcheck", + "version": "3.3.1" + }, + { + "name": "selectolax", + "version": "0.2.10" + }, + { + "name": "slumber", + "version": "0.7.1" + }, + { + "name": "sphinx-autobuild", + "version": "2020.9.1" + }, + { + "name": "sphinx-hoverxref", + "version": "0.5b1" + }, + ] + +Conda packages +~~~~~~~~~~~~~~ + +We can get a json with all dependencies with ``conda list --json``. +That command gets all the root dependencies and their dependencies, +so we may be collecting some noise, but we can use ``pip list`` as a secondary source. + +.. code-block:: + + $ conda list --json --name conda-env + + [ + { + "base_url": "https://conda.anaconda.org/conda-forge", + "build_number": 0, + "build_string": "py_0", + "channel": "conda-forge", + "dist_name": "alabaster-0.7.12-py_0", + "name": "alabaster", + "platform": "noarch", + "version": "0.7.12" + }, + { + "base_url": "https://conda.anaconda.org/conda-forge", + "build_number": 0, + "build_string": "pyh9f0ad1d_0", + "channel": "conda-forge", + "dist_name": "asn1crypto-1.4.0-pyh9f0ad1d_0", + "name": "asn1crypto", + "platform": "noarch", + "version": "1.4.0" + }, + { + "base_url": "https://conda.anaconda.org/conda-forge", + "build_number": 3, + "build_string": "3", + "channel": "conda-forge", + "dist_name": "python-3.5.4-3", + "name": "python", + "platform": "linux-64", + "version": "3.5.4" + } + ] + +APT packages +~~~~~~~~~~~~ + +This isn't implemented yet, but when it is, +we can get the list from the config file, +or we can list the packages installed with ``dpkg --get-selections``. +That command would list all pre-installed packages as well, so we may be getting some noise. + +.. 
code-block:: + + $ dpkg --get-selections + + adduser install + apt install + base-files install + base-passwd install + bash install + binutils install + binutils-common:amd64 install + binutils-x86-64-linux-gnu install + bsdutils install + build-essential install + +Python +~~~~~~ + +We can get the Python version from the config file when using a Python environment, +and from the ``conda list`` output when using a Conda environment. + +OS +~~ + +We can infer the OS version from the build image used in the config file, +but since it changes with time, we can get it from the OS itself: + +.. code-block:: + + $ lsb_release --description + Description: Ubuntu 18.04.5 LTS + # or + $ cat /etc/issue + Ubuntu 18.04.5 LTS \n \l + +Storage +------- + +We can save all this information in json files in cloud storage, +then we could use a tool to import all this data into. +Or we can decide for a tool or service where to fed all this data directly into. + +If we decide to save the files in cloud storage, +we can try to calculate a hash of the file so we don't upload duplicates that happen on the same day/month. +We can aggregate this data per year/month saving them in following structure: +``telemetry/builds/{year}/{month}/{year}-{month}-{day}-{timestamp-pk|pk}.json``, +that way is easy to download, all data per year/month without iterating over all files. + +.. Since this information isn't sensitive, + I think we are fine with this structure + (we can't do bulk deletes of all info about a project if we follow this structure). 
+ +Format +~~~~~~ + +The final file to be saved would have the following information: + +- project: the project slug +- version: the version slug +- build: the build id (which may stop existing if the project is deleted) +- date: full date in isoformat or timestamp (POSIX) +- user_config: Original user config file +- final_config: Final configuration used (merged with defaults) +- packages.pip: List of pip packages with name and version +- packages.conda: List of conda packages with name, channel, and version +- packages.apt: List of apt packages +- python: Python version used +- os: Operating system used + +.. code-block:: json + + { + "project": "docs", + "version": "latest", + "build": 12, + "date": "2021-04-20-...", + "user_config": {}, + "final_config": {}, + "packages": { + "pip": [{ + "name": "sphinx", + "version": "3.4.5" + }], + "conda": [{ + "name": "sphinx", + "channel": "conda-forge", + "version": "0.1" + }], + "apt": [ + "python3-dev", + "cmatrix" + ] + }, + "python": "3.7", + "os": { + "name": "ubuntu", + "version": "18.04.5" + } + } + +Analyzing the data +------------------ + +.. How we would analyze this data? If we decide for a tool to fed the information into + this wouldn't be a problem, but if we decide to go for storing the files for ourselves + we can pick a tool later. + Should we make this data public so other people can analyze it? + Make it public after being analyzed and curated by us? 
From d6cc0c676f2a939eff3be83e89a5ab06efd0e17c Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Mon, 26 Apr 2021 13:06:59 -0500 Subject: [PATCH 02/11] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Juan Luis Cano Rodríguez --- docs/development/design/telemetry.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/development/design/telemetry.rst b/docs/development/design/telemetry.rst index fd979505e11..2d0d1d190d3 100644 --- a/docs/development/design/telemetry.rst +++ b/docs/development/design/telemetry.rst @@ -3,7 +3,7 @@ Collect Data About Builds We may want to take some decisions in the future about deprecations and supported versions. Right now we don't have data about the usage of packages and their versions on Read the Docs -to be able to make a good decision. +to be able to make an informed decision. .. contents:: :local: From b10d4376e7edf459f4afdda34212c1f3cee8526f Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 27 Apr 2021 16:12:57 -0500 Subject: [PATCH 03/11] Updates --- docs/development/design/telemetry.rst | 66 +++++++++++++++++---------- 1 file changed, 41 insertions(+), 25 deletions(-) diff --git a/docs/development/design/telemetry.rst b/docs/development/design/telemetry.rst index 2d0d1d190d3..443def9e78e 100644 --- a/docs/development/design/telemetry.rst +++ b/docs/development/design/telemetry.rst @@ -9,6 +9,26 @@ to be able to make an informed decision. :local: :depth: 3 +Tools +----- + +Kibana: + - https://www.elastic.co/kibana + - We can import data from ES. + - Cloud service provided by Elastic. +Superset: + - https://superset.apache.org/ + - We can import data from several DBs (including postgres and ES). + - Easy to setup locally, but doesn't look like there is cloud provider for it. +Metabase: + - https://www.metabase.com/ + - We can import data from several DBs (including postgres). 
+ - Cloud service provided by Metabase. + +Summary: We have several tools that can inspect data form a postgres DB, +and we also have ``Kibana`` that works *only* with ElasticSearch. +The data to be collected can be saved in a postgres or ES database. + Data to be collected -------------------- @@ -33,11 +53,13 @@ so we know which settings users are using. PIP packages ~~~~~~~~~~~~ -We can get a json with all root dependencies with ``pip list``. +We can get a json with all and root dependencies with ``pip list``. This will allow us to have the name of the packages and their versions used in the build. .. code-block:: + $ pip list --pre --local --format json | jq + # and $ pip list --pre --not-required --local --format json | jq [ { @@ -70,6 +92,8 @@ This will allow us to have the name of the packages and their versions used in t }, ] +With the ``--not-required`` option, pip will list only the root dependencies. + Conda packages ~~~~~~~~~~~~~~ @@ -157,23 +181,6 @@ but since it changes with time, we can get it from the OS itself: $ cat /etc/issue Ubuntu 18.04.5 LTS \n \l -Storage -------- - -We can save all this information in json files in cloud storage, -then we could use a tool to import all this data into. -Or we can decide for a tool or service where to fed all this data directly into. - -If we decide to save the files in cloud storage, -we can try to calculate a hash of the file so we don't upload duplicates that happen on the same day/month. -We can aggregate this data per year/month saving them in following structure: -``telemetry/builds/{year}/{month}/{year}-{month}-{day}-{timestamp-pk|pk}.json``, -that way is easy to download, all data per year/month without iterating over all files. - -.. Since this information isn't sensitive, - I think we are fine with this structure - (we can't do bulk deletes of all info about a project if we follow this structure). 
- Format ~~~~~~ @@ -205,6 +212,16 @@ The final file to be saved would have the following information: "name": "sphinx", "version": "3.4.5" }], + "pip_all": [ + { + "name": "sphinx", + "version": "3.4.5" + }, + { + "name": "docutils", + "version": "0.16.0" + } + ], "conda": [{ "name": "sphinx", "channel": "conda-forge", @@ -222,11 +239,10 @@ The final file to be saved would have the following information: } } -Analyzing the data ------------------- +Storage +------- -.. How we would analyze this data? If we decide for a tool to fed the information into - this wouldn't be a problem, but if we decide to go for storing the files for ourselves - we can pick a tool later. - Should we make this data public so other people can analyze it? - Make it public after being analyzed and curated by us? +Since this information isn't sensitive, +we should be fine saving this data even if the project/version is deleted. +As we don't care about historical data, +we can save the information per-version and from their latest build only. From b21a41f37e5739dfda899e974477b35a5b6cdc31 Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 27 Apr 2021 16:14:58 -0500 Subject: [PATCH 04/11] Mention json fields --- docs/development/design/telemetry.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/development/design/telemetry.rst b/docs/development/design/telemetry.rst index 443def9e78e..72532849a4e 100644 --- a/docs/development/design/telemetry.rst +++ b/docs/development/design/telemetry.rst @@ -27,7 +27,8 @@ Metabase: Summary: We have several tools that can inspect data form a postgres DB, and we also have ``Kibana`` that works *only* with ElasticSearch. -The data to be collected can be saved in a postgres or ES database. +The data to be collected can be saved in a postgres or ES database, +if we use postgres, we would need to use *real* json fields. 
Data to be collected -------------------- From 8c6810fe9bb891cbbf1c8b6f2cfeb3677b6f440b Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 27 Apr 2021 16:19:54 -0500 Subject: [PATCH 05/11] Mention to only save data for one year inside the db --- docs/development/design/telemetry.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/development/design/telemetry.rst b/docs/development/design/telemetry.rst index 72532849a4e..d9629f57792 100644 --- a/docs/development/design/telemetry.rst +++ b/docs/development/design/telemetry.rst @@ -247,3 +247,7 @@ Since this information isn't sensitive, we should be fine saving this data even if the project/version is deleted. As we don't care about historical data, we can save the information per-version and from their latest build only. + +We can collect data for one year, +export it to cloud storage after being analyzed (maybe share this data publicity), +and remove it from our database if it takes too much space. From ba20364a12b2a52f0212a7478e389db8592c91ed Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 8 Mar 2022 13:14:36 -0500 Subject: [PATCH 06/11] We have json fields now --- docs/dev/design/telemetry.rst | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index d9629f57792..708a27df663 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -27,8 +27,7 @@ Metabase: Summary: We have several tools that can inspect data form a postgres DB, and we also have ``Kibana`` that works *only* with ElasticSearch. -The data to be collected can be saved in a postgres or ES database, -if we use postgres, we would need to use *real* json fields. +The data to be collected can be saved in a postgres or ES database. 
Data to be collected -------------------- @@ -44,9 +43,6 @@ but to save some space we are saving it only if it's different than the one from The config file being saved isn't the original one used by the user, but the result of merging it with its default values. -It's saved using a _fake_ ``JSONField`` -(charfield that is transformed to json when creating the model object). -For these reasons we can't query or download them in bulk without iterating over all objects. We may also want to have the original config file, so we know which settings users are using. @@ -142,8 +138,7 @@ so we may be collecting some noise, but we can use ``pip list`` as a secondary s APT packages ~~~~~~~~~~~~ -This isn't implemented yet, but when it is, -we can get the list from the config file, +We can get the list from the config file, or we can list the packages installed with ``dpkg --get-selections``. That command would list all pre-installed packages as well, so we may be getting some noise. @@ -185,9 +180,9 @@ but since it changes with time, we can get it from the OS itself: Format ~~~~~~ -The final file to be saved would have the following information: +The final information to be saved would have the following information: -- project: the project slug +- project: the project id/slug - version: the version slug - build: the build id (which may stop existing if the project is deleted) - date: full date in isoformat or timestamp (POSIX) From bc7d49f4a0111713f8322f6635d4ed5220ec4d69 Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 8 Mar 2022 14:42:34 -0500 Subject: [PATCH 07/11] More updates --- docs/dev/design/telemetry.rst | 43 ++++++++++++++++++++++++----------- 1 file changed, 30 insertions(+), 13 deletions(-) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index 708a27df663..373b3fbfa1a 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -28,6 +28,8 @@ Metabase: Summary: We have several tools that can inspect data 
form a postgres DB, and we also have ``Kibana`` that works *only* with ElasticSearch. The data to be collected can be saved in a postgres or ES database. +Currently, we are making use of Metabase to get other information, +so it's probably the right choice for this task. Data to be collected -------------------- @@ -197,12 +199,26 @@ The final information to be saved would have the following information: .. code-block:: json { - "project": "docs", - "version": "latest", - "build": 12, - "date": "2021-04-20-...", - "user_config": {}, - "final_config": {}, + "project": { + "id": 2, + "slug": "docs" + }, + "version": { + "id": 1, + "slug": "latest" + }, + "build": { + "id": 3, + "date/start": "2021-04-20-...", + "length": "00:06:34", + "status": "normal", + "success": true, + "commit": "abcd1234" + }, + "config": { + "user": {}, + "final": {} + }, "packages": { "pip": [{ "name": "sphinx", @@ -229,20 +245,21 @@ The final information to be saved would have the following information: ] }, "python": "3.7", - "os": { - "name": "ubuntu", - "version": "18.04.5" - } + "os": "ubuntu-18.04.5" } Storage ------- +We can store this information in a dedicated database (telemetry), +using Django's models. + Since this information isn't sensitive, we should be fine saving this data even if the project/version is deleted. As we don't care about historical data, we can save the information per-version and from their latest build only. +And delete old data if it grows too much. -We can collect data for one year, -export it to cloud storage after being analyzed (maybe share this data publicity), -and remove it from our database if it takes too much space. +Should we make heavy use of JSON fields? +Or try to avoid nesting structures as possible? +Like config.user/config.final vs user_config/final_config. 
From 8cfb25499cdfaa3b6ba0780a2137ed50db60aa7d Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 8 Mar 2022 14:45:03 -0500 Subject: [PATCH 08/11] Another update --- docs/dev/design/telemetry.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index 373b3fbfa1a..f1c6cd06035 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -251,8 +251,8 @@ The final information to be saved would have the following information: Storage ------- -We can store this information in a dedicated database (telemetry), -using Django's models. +All this information can be collected after the build has finished, +and we can store it in a dedicated database (telemetry), using Django's models. Since this information isn't sensitive, we should be fine saving this data even if the project/version is deleted. From ae248bb8fd121b19141d41dba341d5100622cffa Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 8 Mar 2022 14:48:24 -0500 Subject: [PATCH 09/11] Update --- docs/dev/design/telemetry.rst | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index f1c6cd06035..777cd89a0c1 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -182,12 +182,11 @@ but since it changes with time, we can get it from the OS itself: Format ~~~~~~ -The final information to be saved would have the following information: +The final information to be saved would consist of: - project: the project id/slug -- version: the version slug -- build: the build id (which may stop existing if the project is deleted) -- date: full date in isoformat or timestamp (POSIX) +- version: the version id/slug +- build: the build id, date, length, status. 
- user_config: Original user config file - final_config: Final configuration used (merged with defaults) - packages.pip: List of pip packages with name and version From 6752de13bd7f8ebcf393e66c4fdb738e17171046 Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Tue, 8 Mar 2022 15:17:47 -0500 Subject: [PATCH 10/11] organization was missing --- docs/dev/design/telemetry.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index 777cd89a0c1..2f37ff6da92 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -184,6 +184,7 @@ Format The final information to be saved would consist of: +- organization: the organization id/slug - project: the project id/slug - version: the version id/slug - build: the build id, date, length, status. @@ -198,6 +199,10 @@ The final information to be saved would consist of: .. code-block:: json { + "organization": { + "id": 1, + "slug": "org" + }, "project": { "id": 2, "slug": "docs" From a58ba223fb2c2762e218022ca546ba2e8d7cef58 Mon Sep 17 00:00:00 2001 From: Santos Gallegos Date: Thu, 10 Mar 2022 13:43:07 -0500 Subject: [PATCH 11/11] Updates --- docs/dev/design/telemetry.rst | 58 +++++++++++++++++++++++++++++++---- 1 file changed, 52 insertions(+), 6 deletions(-) diff --git a/docs/dev/design/telemetry.rst b/docs/dev/design/telemetry.rst index 2f37ff6da92..3649b3279d1 100644 --- a/docs/dev/design/telemetry.rst +++ b/docs/dev/design/telemetry.rst @@ -97,7 +97,8 @@ Conda packages ~~~~~~~~~~~~~~ We can get a json with all dependencies with ``conda list --json``. -That command gets all the root dependencies and their dependencies, +That command gets all the root dependencies and their dependencies +(there is no way to list only the root dependencies), so we may be collecting some noise, but we can use ``pip list`` as a secondary source. .. 
code-block:: @@ -144,7 +145,7 @@ We can get the list from the config file, or we can list the packages installed with ``dpkg --get-selections``. That command would list all pre-installed packages as well, so we may be getting some noise. -.. code-block:: +.. code-block:: console $ dpkg --get-selections @@ -159,6 +160,50 @@ That command would list all pre-installed packages as well, so we may be getting bsdutils install build-essential install +We can get the installed version with: + +.. code-block:: console + + $ dpkg --status python3 + + Package: python3 + Status: install ok installed + Priority: optional + Section: python + Installed-Size: 189 + Maintainer: Ubuntu Developers + Architecture: amd64 + Multi-Arch: allowed + Source: python3-defaults + Version: 3.8.2-0ubuntu2 + Replaces: python3-minimal (<< 3.1.2-2) + Provides: python3-profiler + Depends: python3.8 (>= 3.8.2-1~), libpython3-stdlib (= 3.8.2-0ubuntu2) + Pre-Depends: python3-minimal (= 3.8.2-0ubuntu2) + Suggests: python3-doc (>= 3.8.2-0ubuntu2), python3-tk (>= 3.8.2-1~), python3-venv (>= 3.8.2-0ubuntu2) + Description: interactive high-level object-oriented language (default python3 version) + Python, the high-level, interactive object oriented language, + includes an extensive class library with lots of goodies for + network programming, system administration, sounds and graphics. + . + This package is a dependency package, which depends on Debian's default + Python 3 version (currently v3.8). + Homepage: https://www.python.org/ + Original-Maintainer: Matthias Klose + +Or with + +.. 
code-block:: console + + $ apt-cache policy python3 + + Installed: 3.8.2-0ubuntu2 + Candidate: 3.8.2-0ubuntu2 + Version table: + *** 3.8.2-0ubuntu2 500 + 500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages + 100 /var/lib/dpkg/status + Python ~~~~~~ @@ -243,10 +288,10 @@ The final information to be saved would consist of: "channel": "conda-forge", "version": "0.1" }], - "apt": [ - "python3-dev", - "cmatrix" - ] + "apt": [{ + "name": "python3-dev", + "version": "3.8.2-0ubuntu2" + }], }, "python": "3.7", "os": "ubuntu-18.04.5" @@ -267,3 +312,4 @@ And delete old data if it grows too much. Should we make heavy use of JSON fields? Or try to avoid nesting structures as possible? Like config.user/config.final vs user_config/final_config. +Or having several fields in our model instead of just one big json field?
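A minimal sketch, not part of the patch series, of how the ``pip`` and ``pip_all`` entries of the proposed payload could be assembled from the two ``pip list --format json`` invocations described in the design doc. The function names ``parse_pip_list`` and ``build_packages_payload`` are hypothetical, and the sample strings stand in for real command output; only the ``name``/``version`` keys and the ``pip``/``pip_all`` payload keys come from the document.

```python
import json


def parse_pip_list(raw_json):
    """Parse the output of ``pip list --format json`` into name/version pairs."""
    return [
        {"name": pkg["name"], "version": pkg["version"]}
        for pkg in json.loads(raw_json)
    ]


def build_packages_payload(pip_root_json, pip_all_json):
    """Build the pip part of the ``packages`` payload (hypothetical helper).

    ``pip_root_json`` is assumed to come from ``pip list --not-required``
    (root dependencies only) and ``pip_all_json`` from plain ``pip list``.
    """
    return {
        "pip": parse_pip_list(pip_root_json),
        "pip_all": parse_pip_list(pip_all_json),
    }


# Sample data shaped like the ``pip list`` output shown in the design doc.
root = '[{"name": "sphinx", "version": "3.4.5"}]'
everything = (
    '[{"name": "sphinx", "version": "3.4.5"},'
    ' {"name": "docutils", "version": "0.16.0"}]'
)
payload = build_packages_payload(root, everything)
print(json.dumps(payload, indent=2))
```

Keeping the parsing separate from the command execution means the same helper could be reused whether the JSON comes from a build container or from stored output; whether the result ends up in one JSON field or several model fields is left open, as in the design doc.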