Commit 2046cb3

Design doc: collect data about builds
1 parent adf864b commit 2046cb3

1 file changed, with 232 additions and 0 deletions.

docs/development/design/telemetry.rst

@@ -0,0 +1,232 @@
Collect Data About Builds
=========================

We may want to make decisions about deprecations and supported versions in the future.
Right now we don't have data about the usage of packages and their versions on Read the Docs
to be able to make a good decision.

.. contents::
   :local:
   :depth: 3

Data to be collected
--------------------

The following data can be collected after installing all dependencies.

Configuration file
~~~~~~~~~~~~~~~~~~

We are saving the config file in our database,
but to save some space we are saving it only if it's different from the one used in the previous build
(if it's the same, we save a reference to it).

The config file being saved isn't the original one used by the user,
but the result of merging it with its default values.
It's saved using a *fake* ``JSONField``
(a char field that is transformed to JSON when creating the model object).
For these reasons we can't query or download them in bulk without iterating over all objects.

We may also want to have the original config file,
so we know which settings users are using.

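The merge with defaults and the save-or-reference behaviour described above can be sketched as follows.
This is only an illustration under stated assumptions:
``merge_with_defaults``, ``config_to_store``, and the ``same_as_build`` key are hypothetical names,
not the actual Read the Docs implementation.

.. code-block:: python

   import copy


   def merge_with_defaults(user_config, defaults):
       """Return the defaults overridden by the user's values (nested dicts are merged)."""
       merged = copy.deepcopy(defaults)
       for key, value in user_config.items():
           if isinstance(value, dict) and isinstance(merged.get(key), dict):
               merged[key] = merge_with_defaults(value, merged[key])
           else:
               merged[key] = copy.deepcopy(value)
       return merged


   def config_to_store(final_config, previous_build_id, previous_config):
       """Store the full config only when it differs from the previous build's config."""
       if previous_build_id is not None and final_config == previous_config:
           # Hypothetical reference format: point at the build holding the same config.
           return {"same_as_build": previous_build_id}
       return {"config": final_config}
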
PIP packages
~~~~~~~~~~~~

We can get a JSON list of all root dependencies
(packages that no other installed package requires) with ``pip list``.
This gives us the names and versions of the packages used in the build.

.. code-block::

   $ pip list --pre --not-required --local --format json | jq
   [
     {
       "name": "requests-mock",
       "version": "1.8.0"
     },
     {
       "name": "requests-toolbelt",
       "version": "0.9.1"
     },
     {
       "name": "rstcheck",
       "version": "3.3.1"
     },
     {
       "name": "selectolax",
       "version": "0.2.10"
     },
     {
       "name": "slumber",
       "version": "0.7.1"
     },
     {
       "name": "sphinx-autobuild",
       "version": "2020.9.1"
     },
     {
       "name": "sphinx-hoverxref",
       "version": "0.5b1"
     }
   ]

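A minimal sketch of how the command above could be run and parsed from Python.
The function names are hypothetical, and running pip through the current interpreter
(``sys.executable``) is an assumption; a real build would target the project's environment.
Only ``name`` and ``version`` are kept, in case pip reports extra keys.

.. code-block:: python

   import json
   import subprocess
   import sys


   def parse_pip_list(raw):
       """Keep only the name and version of each package reported by pip."""
       return [
           {"name": package["name"], "version": package["version"]}
           for package in json.loads(raw)
       ]


   def collect_pip_packages(python=sys.executable):
       """Run ``pip list`` with the flags shown above and parse its JSON output."""
       result = subprocess.run(
           [python, "-m", "pip", "list",
            "--pre", "--not-required", "--local", "--format", "json"],
           capture_output=True,
           text=True,
           check=True,
       )
       return parse_pip_list(result.stdout)
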
Conda packages
~~~~~~~~~~~~~~

We can get a JSON list of all dependencies with ``conda list --json``.
That command lists the root dependencies together with their own dependencies,
so we may be collecting some noise, but we can use ``pip list`` as a secondary source.

.. code-block::

   $ conda list --json --name conda-env

   [
     {
       "base_url": "https://conda.anaconda.org/conda-forge",
       "build_number": 0,
       "build_string": "py_0",
       "channel": "conda-forge",
       "dist_name": "alabaster-0.7.12-py_0",
       "name": "alabaster",
       "platform": "noarch",
       "version": "0.7.12"
     },
     {
       "base_url": "https://conda.anaconda.org/conda-forge",
       "build_number": 0,
       "build_string": "pyh9f0ad1d_0",
       "channel": "conda-forge",
       "dist_name": "asn1crypto-1.4.0-pyh9f0ad1d_0",
       "name": "asn1crypto",
       "platform": "noarch",
       "version": "1.4.0"
     },
     {
       "base_url": "https://conda.anaconda.org/conda-forge",
       "build_number": 3,
       "build_string": "3",
       "channel": "conda-forge",
       "dist_name": "python-3.5.4-3",
       "name": "python",
       "platform": "linux-64",
       "version": "3.5.4"
     }
   ]

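Since each entry carries more fields than we need, the output could be reduced before saving it.
The sketch below keeps only the name, channel, and version of each package;
``parse_conda_list`` is a hypothetical helper, not existing code.

.. code-block:: python

   import json


   def parse_conda_list(raw):
       """Reduce each ``conda list --json`` entry to name, channel, and version."""
       return [
           {
               "name": package["name"],
               "channel": package["channel"],
               "version": package["version"],
           }
           for package in json.loads(raw)
       ]
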
APT packages
~~~~~~~~~~~~

Support for APT packages isn't implemented yet, but when it is,
we can get the list from the config file,
or we can list the installed packages with ``dpkg --get-selections``.
That command also lists all pre-installed packages, so we may be getting some noise.

.. code-block::

   $ dpkg --get-selections

   adduser install
   apt install
   base-files install
   base-passwd install
   bash install
   binutils install
   binutils-common:amd64 install
   binutils-x86-64-linux-gnu install
   bsdutils install
   build-essential install

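The output above is a list of whitespace-separated ``package state`` pairs, so parsing it is straightforward.
In this sketch, ``parse_dpkg_selections`` is a hypothetical helper;
keeping only packages in the ``install`` state and dropping the architecture qualifier
(``:amd64``) are assumptions about what we would want to store.

.. code-block:: python

   def parse_dpkg_selections(raw):
       """Return the names of packages whose selection state is ``install``."""
       packages = []
       for line in raw.splitlines():
           fields = line.split()
           if len(fields) == 2 and fields[1] == "install":
               # Drop the architecture qualifier, e.g. ``binutils-common:amd64``.
               packages.append(fields[0].split(":")[0])
       return packages
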
Python
~~~~~~

We can get the Python version from the config file when using a Python environment,
and from the ``conda list`` output when using a Conda environment.

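A sketch of that choice, assuming the merged config stores the version under ``python.version``
and that the Conda packages have already been parsed into a list of dictionaries.
Both the function name and the config layout are assumptions made for illustration.

.. code-block:: python

   def python_version(final_config, conda_packages=None):
       """Pick the Python version from ``conda list`` if given, else from the config."""
       if conda_packages is not None:
           for package in conda_packages:
               if package["name"] == "python":
                   return package["version"]
           return None
       # Assumed layout: {"python": {"version": ...}} in the merged config.
       version = final_config.get("python", {}).get("version")
       return None if version is None else str(version)
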
OS
~~

We can infer the OS version from the build image used in the config file,
but since the OS in an image changes over time, it's more reliable to get it from the OS itself:

.. code-block::

   $ lsb_release --description
   Description:    Ubuntu 18.04.5 LTS
   # or
   $ cat /etc/issue
   Ubuntu 18.04.5 LTS \n \l

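The description line can be split into a name and a version before saving it.
This is a rough sketch: ``parse_lsb_description`` is a hypothetical helper,
and the regular expression only assumes a name followed by a dotted version number.

.. code-block:: python

   import re


   def parse_lsb_description(raw):
       """Split ``Description:  Ubuntu 18.04.5 LTS`` into a name and a version."""
       # Drop the ``Description:`` label if it is present.
       description = raw.split(":", 1)[-1].strip()
       match = re.match(r"(?P<name>\D+?)\s+(?P<version>\d+(?:\.\d+)*)", description)
       if match is None:
           # No version number found: keep the whole description as the name.
           return {"name": description.lower(), "version": ""}
       return {
           "name": match.group("name").lower(),
           "version": match.group("version"),
       }
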
Storage
-------

We can save all this information in JSON files in cloud storage,
and later use a tool to import all this data.
Or we can choose a tool or service to feed all this data into directly.

If we decide to save the files in cloud storage,
we can calculate a hash of each file so we don't upload duplicates that happen on the same day/month.
We can aggregate this data per year/month by saving the files in the following structure:
``telemetry/builds/{year}/{month}/{year}-{month}-{day}-{timestamp-pk|pk}.json``.
That way it's easy to download all data per year/month without iterating over all files.

.. Since this information isn't sensitive,
   I think we are fine with this structure
   (we can't do bulk deletes of all info about a project if we follow this structure).

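The path layout and the duplicate check could look roughly like this.
The sketch uses the ``{timestamp}-{pk}`` variant of the file name,
and it assumes that the ``build`` and ``date`` fields are left out of the hash,
so that two builds with the same environment produce the same digest.
Both function names are hypothetical.

.. code-block:: python

   import hashlib
   import json
   from datetime import datetime, timezone


   def storage_path(build_pk, when):
       """Build the object path using the ``{timestamp}-{pk}`` file name variant."""
       timestamp = int(when.timestamp())
       return (
           f"telemetry/builds/{when:%Y}/{when:%m}/"
           f"{when:%Y}-{when:%m}-{when:%d}-{timestamp}-{build_pk}.json"
       )


   def content_hash(payload):
       """Hash everything except the per-build fields, so equal environments match."""
       stable = {
           key: value
           for key, value in payload.items()
           if key not in ("build", "date")
       }
       # Sorted keys and fixed separators make the serialization deterministic.
       canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
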
Format
~~~~~~

The final file to be saved would have the following information:

- project: the project slug
- version: the version slug
- build: the build id (which may stop existing if the project is deleted)
- date: full date in ISO format or as a POSIX timestamp
- user_config: original user config file
- final_config: final configuration used (merged with defaults)
- packages.pip: list of pip packages with name and version
- packages.conda: list of conda packages with name, channel, and version
- packages.apt: list of apt packages
- python: Python version used
- os: operating system used

.. code-block:: json

   {
     "project": "docs",
     "version": "latest",
     "build": 12,
     "date": "2021-04-20-...",
     "user_config": {},
     "final_config": {},
     "packages": {
       "pip": [{
         "name": "sphinx",
         "version": "3.4.5"
       }],
       "conda": [{
         "name": "sphinx",
         "channel": "conda-forge",
         "version": "0.1"
       }],
       "apt": [
         "python3-dev",
         "cmatrix"
       ]
     },
     "python": "3.7",
     "os": {
       "name": "ubuntu",
       "version": "18.04.5"
     }
   }

Analyzing the data
226+
------------------
227+
228+
.. How we would analyze this data? If we decide for a tool to fed the information into
229+
this wouldn't be a problem, but if we decide to go for storing the files for ourselves
230+
we can pick a tool later.
231+
Should we make this data public so other people can analyze it?
232+
Make it public after being analyzed and curated by us?
