Commit d59ccaf

Version file tree diff: design doc (#11507)
* Version file tree diff: design doc ref #11319
* format
* This is just plain text
* After thoughts
* Updates from review
* Update doc
* Linter
* Updates from comments
* Updates from recent conversations
* More updates
1 parent 173d044 commit d59ccaf

1 file changed: +359 -0 lines changed

docs/dev/design/file-tree-diff.rst

+359
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,359 @@
Version file tree diff
======================

Goals
-----

- Compare files from two versions to identify the files that have been added, removed, or modified.
- Provide an API for this feature.
- Integrate this feature to suggest redirects on files that were removed.
- Integrate this feature to list the files that changed in a pull request.

Non-goals
---------

- Replace the `docdiff <https://github.com/readthedocs/addons?tab=readme-ov-file#docdiff>`__ feature from addons.
  That works on the client side, and it's good for comparing the content of files.

Current problems
----------------

Currently, when a user opens a PR, they need to manually search for the files of interest (new and modified files).
We have a GitHub action that links to the root of the documentation preview, which helps a little, but it's not enough.

When files are removed or renamed, users may not be aware that a redirect may be needed.
We track 404s in our traffic analytics, but the analytics don't keep track of the version,
and it may be too late to add a redirect when users are already seeing a 404.

In the past, we haven't implemented those features, because it's hard to map the source files to the generated files,
since that depends on the build tool and configuration used by the project.

Git providers may already offer a way to compare file trees, but again,
they work on the source files, and not on the generated files.

All hope was lost for having nice features like this, until now.

Proposed solution
-----------------

Since redirects and files of interest are related to the generated files,
instead of working over the source files, we will work over the generated files, which we have access to.

The key points of this feature are:

- Get the diff of the file tree between two versions.
- Expose that as an API.
- Integrate that in PR previews.

Diff between two versions
-------------------------

Using a manifest
~~~~~~~~~~~~~~~~

We can create a manifest that contains the hashes and other important metadata of the files,
and save this manifest in storage or in the DB.

When a build finishes, we generate this manifest for all HTML files, and store it.
When we need to compare two versions, we can just compare the manifests.

This doesn't require downloading the files, but it requires building a version to generate the manifest.

The manifest will be a JSON object with the following structure:

.. code:: json

   {
     "build": {
       "id": 1
     },
     "files": {
       "index.html": {
         "hash": "1234567890"
       },
       "path/to/file.html": {
         "hash": "1234567890"
       }
     }
   }
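
As a rough illustration of how two manifests with the structure above could be compared, here is a minimal sketch.
The function name ``compare_manifests`` is hypothetical, and we assume that a file present only in the first manifest counts as "added"
(the same convention as the ``rclone check`` example in this document); the final implementation may differ.

```python
def compare_manifests(manifest_a, manifest_b):
    """Return the file tree diff between two manifests (hypothetical helper).

    Assumption: files only in ``manifest_a`` are "added",
    files only in ``manifest_b`` are "removed".
    """
    files_a = manifest_a["files"]
    files_b = manifest_b["files"]

    # Dictionary key views support set operations.
    added = sorted(files_a.keys() - files_b.keys())
    removed = sorted(files_b.keys() - files_a.keys())
    # A file present in both manifests is modified when its hash differs.
    modified = sorted(
        path
        for path in files_a.keys() & files_b.keys()
        if files_a[path]["hash"] != files_b[path]["hash"]
    )

    return {
        "added": [{"file": path} for path in added],
        "removed": [{"file": path} for path in removed],
        "modified": [{"file": path} for path in modified],
    }
```

Only the manifests are read here, so no generated file needs to be downloaded at comparison time.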

Using rclone
~~~~~~~~~~~~

.. note::

   This solution won't be used in the final implementation; it's kept here for reference.

We are already using ``rclone`` to speed up uploads to S3,
and ``rclone`` has a command (``rclone check``) that returns the diff between two directories.
For this, it uses the metadata of the files, like size and hash
(it doesn't download the files).

.. code:: console

   $ ls a
   changed.txt  new.txt  unchanged.txt
   $ ls b
   changed.txt  deleted.txt  unchanged.txt
   $ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
   + new.txt
   - deleted.txt
   = unchanged.txt
   * changed.txt

The result is a list of files with a mark indicating whether they were added, removed, modified, or unchanged.
The result is easy to parse.
There is no option to exclude the files that were unchanged when using ``--combined``;
an alternative is to output each type of change to a different file (``--missing-on-dst``, ``--missing-on-src``, ``--differ``).

To start, we will only consider HTML files (``--include=*.html``).
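
To show how easy the ``--combined`` output is to parse, here is a small sketch kept for reference only, like the rest of this section.
The function name ``parse_rclone_combined`` and the returned structure are assumptions made for illustration;
only the ``+``, ``-``, ``=`` and ``*`` markers come from the output above.

```python
def parse_rclone_combined(output):
    """Group the lines of ``rclone check --combined=-`` by type of change."""
    symbols = {"+": "added", "-": "removed", "*": "modified", "=": "unchanged"}
    result = {name: [] for name in symbols.values()}

    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<symbol> <path>"; the path itself may contain spaces.
        symbol, _, path = line.partition(" ")
        key = symbols.get(symbol)
        if key is None:
            # Ignore any other marker (for example, error lines).
            continue
        result[key].append(path)

    return result


output = "+ new.txt\n- deleted.txt\n= unchanged.txt\n* changed.txt\n"
changes = parse_rclone_combined(output)
```

The unchanged files can simply be dropped from the result, which works around the lack of an option to exclude them.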

Changed files
-------------

Listing the files that were added or deleted is straightforward,
but when listing the files that were modified, we want to list files that had relevant changes only.

For example, if the build injects some content that changes on every build (like a timestamp or commit),
we don't want to list all files as modified.

We have a couple of options to improve this list.

Hashing the main content
~~~~~~~~~~~~~~~~~~~~~~~~

Timestamps and other metadata are usually added in the footer of the files, outside the main content.
Instead of hashing the whole file, we can hash only the main content of the file,
and use that hash to compare the files.

This will allow us to better detect files that were modified in a meaningful way.

We don't need a secure hash, so we can use MD5, which is available in Python's standard library.
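
The following is a minimal sketch of the idea, not the final implementation.
It assumes that the main content lives inside a ``<main>`` element and that hashing only its text is enough;
both are assumptions for illustration, as are the names ``MainContentExtractor`` and ``hash_main_content``.

```python
import hashlib
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collect the text found inside ``<main>`` elements (assumed main content)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text while we are inside a <main> element.
        if self.depth:
            self.chunks.append(data)


def hash_main_content(html):
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    # Normalize whitespace so formatting-only changes don't alter the hash.
    text = " ".join("".join(parser.chunks).split())
    # MD5 is used for change detection only, not for security (Python 3.9+ flag).
    return hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()


page = "<html><body><main><p>{body}</p></main><footer>Built at {time}</footer></body></html>"
hash_1 = hash_main_content(page.format(body="Hello", time="10:00"))
hash_2 = hash_main_content(page.format(body="Hello", time="11:00"))
hash_3 = hash_main_content(page.format(body="Goodbye", time="11:00"))
```

With this approach, ``hash_1`` and ``hash_2`` are equal because only the footer differs, while ``hash_3`` differs because the main content changed.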

Lines changed between two files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This solution won't be used in the final implementation; it's kept here for reference.

In order to provide more useful information, we can sort the files by some metrics,
like the number of lines that changed.

Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.
This is useful to link to the most relevant files that changed in a PR.

.. code:: console

   $ cat a.txt
   One
   Two
   Three
   Four
   Five
   $ cat b.txt
   Ore
   Three
   Four
   Five
   Six
   $ diff --side-by-side --suppress-common-lines a.txt b.txt
   One                                 | Ore
   Two                                 <
                                       > Six

.. note::

   Taken from https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only.

The command will return only the lines that changed between the two files.
We can just count the lines, or maybe even parse each symbol to check if the line was added or removed.

Another alternative is to use the `difflib <https://docs.python.org/3/library/difflib.html>`__ module.
The only downside is that it doesn't distinguish lines that were changed from lines that were added or removed.
But maybe that's ok? Do we really need to know if a line was changed instead of added or removed?

.. code:: python

   import difflib

   diff = difflib.ndiff(["one", "two", "three", "four"], ["ore", "three", "four", "five"])
   print(list(diff))
   # ['+ ore', '- one', '- two', '  three', '  four', '+ five']

An advantage of using Python is that we don't need to write the files to disk,
and the result is easier to parse.
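
Building on the ``difflib`` example, counting the added and removed lines could look like the sketch below.
The helper name ``count_changed_lines`` is hypothetical; the returned keys mirror the ``lines`` object used in the diff structure proposed in this document.

```python
import difflib


def count_changed_lines(old_lines, new_lines):
    """Count the lines that ``difflib.ndiff`` reports as added or removed."""
    added = removed = 0
    for line in difflib.ndiff(old_lines, new_lines):
        # ndiff prefixes each line with "+ ", "- ", "  " or "? ".
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
    return {"added": added, "removed": removed}


counts = count_changed_lines(
    ["one", "two", "three", "four"],
    ["ore", "three", "four", "five"],
)
```

A changed line is reported as one removal plus one addition, which matches the limitation described above.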

Alternative metrics
+++++++++++++++++++

.. note::

   This solution won't be used in the final implementation; it's kept here for reference.

Checking the number of lines that changed is a good metric, but it requires downloading the files.
Another metric we could use is the size of the files, which can be obtained from the metadata (no need to download the files).
The more a file's size has changed, the more lines have likely been added or removed.
This still treats changes that keep the same number of characters as irrelevant in the listing.

Storing results
---------------

Doing a diff between two versions can be expensive, so we need to store the results.

We can store the results in the DB (``VersionDiff``).
The stored record would contain the versions compared, the builds, and the diff itself.

.. code:: python

   from django.db import models

   from readthedocs.builds.models import Build, Version


   class VersionDiff(models.Model):
       # The two versions being compared.
       version_a = models.ForeignKey(
           Version, on_delete=models.CASCADE, related_name="diff_a"
       )
       version_b = models.ForeignKey(
           Version, on_delete=models.CASCADE, related_name="diff_b"
       )
       # The builds the diff was computed from.
       build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_a")
       build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_b")
       # Files that were added, removed, or modified.
       diff = models.JSONField()

The diff will be a JSON object with the files that were added, removed, or modified,
with a structure like this:

.. code:: json

   {
     "added": [{"file": "new.txt"}],
     "removed": [{"file": "deleted.txt"}],
     "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
   }

The information is stored in a form similar to what the API will return.
Important things to note:

- We need to take into consideration the diff of the latest successful builds only.
  If any of the builds from the stored diff doesn't match the latest successful build of its version,
  we need to compute the diff again.
- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
  We can store that information as well, or just calculate it on the fly.
- The entries in the lists of files are objects, so we can store additional information in the future.
- When a file has been modified, we also store the number of lines that changed.
  We could also show this for files that were added or removed.
- If a project or version is deleted (or deactivated), we should delete the diff as well.
- Using the DB to save this information will serve as the lock for the API,
  so we don't calculate the diff multiple times for the same versions.

We could store the changed files sorted by the number of changes, or make that an option in the API,
or just let the client sort the files as they see fit.
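
As noted in the list above, the diff between ``B`` and ``A`` can be inferred from the diff between ``A`` and ``B``.
A minimal sketch of calculating it on the fly, assuming the JSON structure shown earlier (``invert_diff`` is a hypothetical name):

```python
def invert_diff(diff):
    """Return the diff of B against A, given the diff of A against B."""

    def swap_lines(entry):
        # Copy the entry so the stored diff is not mutated.
        entry = dict(entry)
        if "lines" in entry:
            lines = entry["lines"]
            entry["lines"] = {"added": lines["removed"], "removed": lines["added"]}
        return entry

    return {
        # Files added in one direction are removed in the other.
        "added": [dict(entry) for entry in diff["removed"]],
        "removed": [dict(entry) for entry in diff["added"]],
        "modified": [swap_lines(entry) for entry in diff["modified"]],
    }


diff_a_b = {
    "added": [{"file": "new.txt"}],
    "removed": [{"file": "deleted.txt"}],
    "modified": [{"file": "changed.txt", "lines": {"added": 2, "removed": 1}}],
}
diff_b_a = invert_diff(diff_a_b)
```

Since the inversion is cheap, storing only one direction and inverting on demand avoids keeping two rows in sync.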

API
---

The initial diff operation can be expensive, so we may consider not exposing this feature to unauthenticated users.
A diff can only be done between versions of the same project that the user has access to.

The endpoint will be:

   GET /api/v3/projects/{project_slug}/diff/?version_a={version_a}&version_b={version_b}

And the response will be:

.. code:: json

   {
     "version_a": {"id": 1, "build": {"id": 1}},
     "version_b": {"id": 2, "build": {"id": 2}},
     "diff": {
       "added": [{"file": "new.txt"}],
       "removed": [{"file": "deleted.txt"}],
       "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
     }
   }

The version and build can be the full objects, or just the IDs and slugs.

We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
We can reply with a ``202 Accepted`` if the diff is being calculated in another request.
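
From a client's point of view, handling the ``202 Accepted`` response could look like the sketch below.
The endpoint is the one proposed above and doesn't exist yet; ``diff_url``, ``wait_for_diff`` and the injected ``fetch`` callable
(which returns a ``(status, body)`` tuple) are hypothetical names used only for illustration.

```python
import time
from urllib.parse import quote, urlencode


def diff_url(base_url, project_slug, version_a, version_b):
    """Build the URL of the proposed diff endpoint."""
    query = urlencode({"version_a": version_a, "version_b": version_b})
    return f"{base_url}/api/v3/projects/{quote(project_slug)}/diff/?{query}"


def wait_for_diff(fetch, url, attempts=5, delay=1.0):
    """Call ``fetch(url)`` until the diff is ready or we run out of attempts.

    Returns the response body on ``200``, or ``None`` if the diff
    is still being calculated after the last attempt.
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"Unexpected status: {status}")
        # 202: another request is calculating the diff, wait and retry.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Passing ``fetch`` as an argument keeps the sketch independent of any HTTP library.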

Integrations
------------

You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!

Doing the API call is easy, but knowing *when* to call it is hard.
We need to call the API after the build has finished successfully,
or we will be comparing the files of an incomplete or stale build.

Luckily, we have a webhook that tells us when a build has finished successfully.
But we don't want users to have to implement the integration by themselves.

We could:

- Use this as an opportunity to explore using GitHub Apps.
- Request additional permissions in our existing OAuth2 integration (``project`` scope). Probably not a good idea.
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
  Maybe don't even expose the API to the public, just use it internally.
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
  to trigger the action from our webhook. This requires the user to do some additional setup,
  and for our webhooks to support custom headers.
- Hit the API repeatedly from the GitHub action until the diff is ready.
  This is not ideal: some builds may take a long time, and the action may time out.
- Expose this feature in the addons API only, which will hit the service when a user views the PR preview.

Initial implementation
----------------------

For the initial implementation, we will:

- Generate a manifest of all HTML files from the versions that we want to compare.
  This will be done at the end of the build.
- Generate the hash based on the main content of the file,
  not the whole file.
- MD5 will be the hashing algorithm used.
- Only expose the files that were added, removed, or modified (HTML files only).
  The number of lines that changed won't be exposed.
- Don't store the results in the DB,
  we can store the results in a later iteration.
- Expose this feature only via the addons feature.
- Only allow diffing an external version against the version that points to the default branch/tag of the project.
- Use a feature flag to enable this feature on projects.

Other features that are not mentioned here, like exposing the number of lines that changed,
or a public API, will not be implemented in the initial version,
and may be considered in the future (and their implementation is subject to change).

Possible issues
---------------

In the case that we use a manifest,
hashing the contents of the files may add some overhead to the build.

In the case that we use ``rclone``,
even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls.
But since we are doing this on demand, and we can cache the results, we can minimize the costs
(which may not be that high anyway).

``rclone check`` returns only the list of files that changed.
If we want to make additional checks over those files, we will need to make additional calls to S3.

We should also limit the number of files we check, since we don't want to run a diff of thousands of files,
and set a limit on the size of the files.

Future improvements and ideas
-----------------------------

- Detect moved files.
  This implies checking the hashes of deleted and added files:
  if the hash of a file that was deleted matches the hash of a file that was added,
  we have a move.
  In case we use rclone, since we don't have access to those hashes after rclone is run,
  we would need to re-fetch that metadata from S3.
  Could be a feature request for rclone.
- Detect changes in sections of HTML files.
  We could re-use the code we have for search indexing.
- Expand to other file types.
- Allow doing a diff between versions of different projects.
- Allow configuring how the main content of the file is detected
  (like a CSS selector).
- Allow configuring content that should be ignored when hashing the file
  (like a CSS selector).
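
The moved-files idea from the list above could be sketched on top of the manifests as follows.
This is only an illustration: ``detect_moves`` is a hypothetical name, it assumes the manifest structure described in this document,
and it treats files present only in the first manifest as added.

```python
def detect_moves(manifest_a, manifest_b):
    """Pair removed and added files that share the same content hash."""
    files_a = manifest_a["files"]
    files_b = manifest_b["files"]
    added = files_a.keys() - files_b.keys()
    removed = files_b.keys() - files_a.keys()

    # Index the added files by hash, several files may share a hash.
    added_by_hash = {}
    for path in sorted(added):
        added_by_hash.setdefault(files_a[path]["hash"], []).append(path)

    moves = []
    for old_path in sorted(removed):
        candidates = added_by_hash.get(files_b[old_path]["hash"])
        if candidates:
            # Each added file is matched with at most one removed file.
            moves.append({"from": old_path, "to": candidates.pop(0)})
    return moves


new = {"files": {"guide/install.html": {"hash": "abc"}, "index.html": {"hash": "111"}}}
old = {"files": {"install.html": {"hash": "abc"}, "index.html": {"hash": "111"}}}
moves = detect_moves(new, old)
```

A detected move is a natural candidate for a redirect suggestion from the old path to the new one.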
