|
| 1 | +Version file tree diff |
| 2 | +====================== |
| 3 | + |
| 4 | +Goals |
| 5 | +----- |
| 6 | + |
| 7 | +- Compare files from two versions to identify the files that have been added, removed, or modified. |
| 8 | +- Provide an API for this feature. |
| 9 | +- Integrate this feature to suggest redirects on files that were removed. |
| 10 | +- Integrate this feature to list the files that changed in a pull request. |
| 11 | + |
| 12 | +Non-goals |
| 13 | +--------- |
| 14 | + |
| 15 | +- Replace the `docdiff <https://github.com/readthedocs/addons?tab=readme-ov-file#docdiff>`__ feature from addons. |
| 16 | + That works on the client side, and it's good for comparing the content of files. |
| 17 | + |
| 18 | +Current problems |
| 19 | +---------------- |
| 20 | + |
| 21 | +Currently, when a user opens a PR, they need to manually search for the files of interest (new and modified files). |
| 22 | +We have a GitHub action that links to the root of the documentation preview, which helps a little, but it's not enough. |
| 23 | + |
| 24 | +When files are removed or renamed, users may not be aware that a redirect may be needed. |
| 25 | +We track 404s in our traffic analytics, but they don't keep track of the version, |
| 26 | +and it may be too late to add a redirect when users are already seeing a 404. |
| 27 | + |
| 28 | +In the past, we haven't implemented those features, because it's hard to map the source files to the generated files, |
| 29 | +since that depends on the build tool and configuration used by the project. |
| 30 | + |
| 31 | +Git providers may already offer a way to compare file trees, but again, |
| 32 | +they work on the source files, and not on the generated files. |
| 33 | + |
| 34 | +All hope was lost for having nice features like this, until now. |
| 35 | + |
| 36 | +Proposed solution |
| 37 | +----------------- |
| 38 | + |
| 39 | +Since redirects and files of interest are related to the generated files, |
| 40 | +instead of working over the source files, we will work over the generated files, which we have access to. |
| 41 | + |
| 42 | +The key points of this feature are: |
| 43 | + |
| 44 | +- Get the diff of the file tree between two versions. |
| 45 | +- Expose that as an API. |
| 46 | +- Integrate that in PR previews. |
| 47 | + |
| 48 | +Diff between two versions |
| 49 | +------------------------- |
| 50 | + |
| 51 | +Using a manifest |
| 52 | +~~~~~~~~~~~~~~~~ |
| 53 | + |
| 54 | +We can create a manifest that contains the hashes and other important metadata of the files, |
| 55 | +and save this manifest in storage or in the DB. |
| 56 | + |
| 57 | +When a build finishes, we generate this manifest for all HTML files, and store it. |
| 58 | +When we need to compare two versions, we can just compare the manifests. |
| 59 | + |
| 60 | +This doesn't require downloading the files, but it requires building a version to generate the manifest. |
| 61 | + |
| 62 | +The manifest will be a JSON object with the following structure: |
| 63 | + |
| 64 | +.. code:: json |
| 65 | +
|
| 66 | + { |
| 67 | + "build": { |
| 68 | + "id": 1 |
| 69 | + }, |
| 70 | + "files": { |
| 71 | + "index.html": { |
| 72 | + "hash": "1234567890" |
| 73 | + }, |
| 74 | + "path/to/file.html": { |
| 75 | + "hash": "1234567890" |
| 76 | + } |
| 77 | + } |
| 78 | + } |
| 79 | +
|
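As a rough illustration of how two manifests with the structure above could be compared, here is a minimal sketch. The function name ``diff_manifests`` and the sample hashes are hypothetical, and the sketch assumes both manifests are already loaded as Python dictionaries.

```python
def diff_manifests(base, current):
    """Compare two manifests and return the added, removed, and modified paths.

    ``base`` and ``current`` follow the manifest structure shown above:
    a ``files`` mapping from path to an object with a ``hash`` key.
    """
    base_files = base["files"]
    current_files = current["files"]

    # Paths present only in the current manifest were added.
    added = sorted(current_files.keys() - base_files.keys())
    # Paths present only in the base manifest were removed.
    removed = sorted(base_files.keys() - current_files.keys())
    # Paths present in both are modified when their hashes differ.
    modified = sorted(
        path
        for path in base_files.keys() & current_files.keys()
        if base_files[path]["hash"] != current_files[path]["hash"]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical sample data, for illustration only.
base = {
    "build": {"id": 1},
    "files": {
        "index.html": {"hash": "aaa"},
        "changed.html": {"hash": "bbb"},
        "deleted.html": {"hash": "ccc"},
    },
}
current = {
    "build": {"id": 2},
    "files": {
        "index.html": {"hash": "aaa"},
        "changed.html": {"hash": "zzz"},
        "new.html": {"hash": "ddd"},
    },
}
result = diff_manifests(base, current)
```

Because only the two manifests are read, no generated file has to be downloaded for the comparison.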
| 80 | +Using rclone |
| 81 | +~~~~~~~~~~~~ |
| 82 | + |
| 83 | +.. note:: |
| 84 | + |
| 85 | + This solution won't be used in the final implementation; it's kept here for reference. |
| 86 | + |
| 87 | +We are already using ``rclone`` to speed up uploads to S3, |
| 88 | +and ``rclone`` has a command (``rclone check``) that returns the diff between two directories. |
| 89 | +For this, it uses the metadata of the files, like size and hash |
| 90 | +(it doesn't download the files). |
| 91 | + |
| 92 | +.. code:: console |
| 93 | +
|
| 94 | + $ ls a |
| 95 | + changed.txt new.txt unchanged.txt |
| 96 | + $ ls b |
| 97 | + changed.txt deleted.txt unchanged.txt |
| 98 | + $ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b |
| 99 | + + new.txt |
| 100 | + - deleted.txt |
| 101 | + = unchanged.txt |
| 102 | + * changed.txt |
| 103 | +
|
| 104 | +The result is a list of files, each with a mark indicating whether it was added, removed, modified, or unchanged. |
| 105 | +The result is easy to parse. |
| 106 | +There is no option to exclude the files that were unchanged when using ``--combined``. |
| 107 | +An alternative is to output each type of change to a different file (``--missing-on-dst``, ``--missing-on-src``, ``--differ``). |
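To show how simple the parsing can be, here is a minimal sketch that turns the ``--combined`` output from the example above into lists of paths. The function name ``parse_combined`` is hypothetical, and the mapping of markers to change types follows the example output in this document. Lines with any other marker are skipped in this sketch.

```python
# Marker-to-change-type mapping, as used in the example output above.
MARKERS = {"+": "added", "-": "removed", "*": "modified", "=": "unchanged"}


def parse_combined(output):
    """Group the paths from ``rclone check --combined=-`` output by change type."""
    result = {"added": [], "removed": [], "modified": [], "unchanged": []}
    for line in output.splitlines():
        if not line:
            continue
        # Each line is "<marker> <path>"; split on the first space only,
        # so paths that contain spaces are kept intact.
        marker, _, path = line.partition(" ")
        change = MARKERS.get(marker)
        if change is None:
            # Unknown markers are ignored in this sketch.
            continue
        result[change].append(path)
    return result


output = "+ new.txt\n- deleted.txt\n= unchanged.txt\n* changed.txt\n"
parsed = parse_combined(output)
```

Unchanged files can simply be dropped from the parsed result, which works around the lack of an option to exclude them.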
| 108 | + |
| 109 | +To start, we will only consider HTML files (``--include=*.html``). |
| 110 | + |
| 111 | +Changed files |
| 112 | +------------- |
| 113 | + |
| 114 | +Listing the files that were added or deleted is straightforward, |
| 115 | +but when listing the files that were modified, we want to list files that had relevant changes only. |
| 116 | + |
| 117 | +For example, if the build injects some content that changes on every build (like a timestamp or commit), |
| 118 | +we don't want to list all files as modified. |
| 119 | + |
| 120 | +We have a couple of options to improve this list. |
| 121 | + |
| 122 | +Hashing the main content |
| 123 | +~~~~~~~~~~~~~~~~~~~~~~~~ |
| 124 | + |
| 125 | +Timestamps and other metadata are usually added in the footer of the files, outside the main content. |
| 126 | +Instead of hashing the whole file, we can hash only the main content of the file, |
| 127 | +and use that hash to compare the files. |
| 128 | + |
| 129 | +This will allow us to better detect files that were modified in a meaningful way. |
| 130 | + |
| 131 | +We don't need a secure hash, so we can use MD5, which is available in Python's standard library (``hashlib``). |
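The idea can be sketched as follows. This is only an illustration under a strong assumption: that the main content lives inside a ``<main>`` element, which is not true for every theme or build tool. The names ``MainContentParser`` and ``main_content_hash`` are hypothetical, and the sketch hashes only the text nodes, so changes limited to markup or attributes would not be detected.

```python
import hashlib
from html.parser import HTMLParser


class MainContentParser(HTMLParser):
    """Collect the text found inside ``<main>`` elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # How many <main> elements we are currently inside.
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the main content.
        if self.depth:
            self.chunks.append(data)


def main_content_hash(html):
    """Return the MD5 hex digest of the text inside the main content."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


page = "<html><body><main><p>Hello</p></main><footer>Built at {}</footer></body></html>"
hash_a = main_content_hash(page.format("10:00"))
hash_b = main_content_hash(page.format("11:00"))
hash_c = main_content_hash(page.replace("Hello", "Goodbye").format("10:00"))
```

In this example the two builds that differ only in the footer timestamp get the same hash, while the page whose main content changed gets a different one.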
| 132 | + |
| 133 | +Lines changed between two files |
| 134 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 135 | + |
| 136 | +.. note:: |
| 137 | + |
| 138 | + This solution won't be used in the final implementation; it's kept here for reference. |
| 139 | + |
| 140 | +In order to provide more useful information, we can sort the files by some metrics, |
| 141 | +like the number of lines that changed. |
| 142 | + |
| 143 | +Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed. |
| 144 | +This is useful to link to the most relevant files that changed in a PR. |
| 145 | + |
| 146 | +.. code:: console |
| 147 | +
|
| 148 | + $ cat a.txt |
| 149 | + One |
| 150 | + Two |
| 151 | + Three |
| 152 | + Four |
| 153 | + Five |
| 154 | + $ cat b.txt |
| 155 | + Ore |
| 156 | + Three |
| 157 | + Four |
| 158 | + Five |
| 159 | + Six |
| 160 | + $ diff --side-by-side --suppress-common-lines a.txt b.txt |
| 161 | + One | Ore |
| 162 | + Two < |
| 163 | + > Six |
| 164 | +
|
| 165 | +.. note:: |
| 166 | + |
| 167 | + Taken from https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only. |
| 168 | + |
| 169 | +The command will return only the lines that changed between the two files. |
| 170 | +We can just count the lines, or maybe even parse each symbol to check if the line was added or removed. |
| 171 | + |
| 172 | +Another alternative is to use the `difflib <https://docs.python.org/3/library/difflib.html>`__ module. |
| 173 | +The only downside is that it doesn't distinguish lines that were changed from lines that were added or removed. |
| 174 | +But maybe that's ok? Do we really need to know if a line was changed instead of added or removed? |
| 175 | + |
| 176 | +.. code:: python |
| 177 | +
|
| 178 | + import difflib |
| 179 | +
|
| 180 | + diff = difflib.ndiff(["one", "two", "three", "four"], ["ore", "three", "four", "five"]) |
| 181 | + print(list(diff)) |
| 182 | + # ['+ ore', '- one', '- two', '  three', '  four', '+ five'] |
| 183 | +
|
| 184 | +An advantage of using Python is that we don't need to write the files to disk, |
| 185 | +and the result is easier to parse. |
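For instance, counting the added and removed lines from the ``difflib.ndiff`` output could look like the sketch below. The function name ``count_changed_lines`` is hypothetical. A changed line is counted as one removal plus one addition, which matches the limitation described above.

```python
import difflib


def count_changed_lines(old_lines, new_lines):
    """Count the lines added and removed between two lists of lines."""
    added = removed = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
        # Lines starting with "  " are unchanged, and lines starting
        # with "? " are hints for humans; both are ignored here.
    return {"added": added, "removed": removed}


counts = count_changed_lines(
    ["one", "two", "three", "four"],
    ["ore", "three", "four", "five"],
)
```

The returned dictionary has the same shape as the ``lines`` object used for modified files elsewhere in this document.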
| 186 | + |
| 187 | +Alternative metrics |
| 188 | ++++++++++++++++++++ |
| 189 | + |
| 190 | +.. note:: |
| 191 | + |
| 192 | + This solution won't be used in the final implementation; it's kept here for reference. |
| 193 | + |
| 194 | +Checking the number of lines that changed is a good metric, but it requires downloading the files. |
| 195 | +Another metric we could use is the size of the files, which can be obtained from the metadata (no need to download the files). |
| 196 | +The more a file's size has changed, the more lines have likely been added or removed. |
| 197 | +This still ranks as irrelevant any file whose changes kept the same number of characters. |
| 198 | + |
| 199 | +Storing results |
| 200 | +--------------- |
| 201 | + |
| 202 | +Doing a diff between two versions can be expensive, so we need to store the results. |
| 203 | + |
| 204 | +We can store the results in the DB (``VersionDiff``). |
| 205 | +The information to store would contain some information about the versions compared, the builds, and the diff itself. |
| 206 | + |
| 207 | +.. code:: python |
| 208 | +
|
| 209 | + class VersionDiff(models.Model): |
| 210 | + version_a = models.ForeignKey( |
| 211 | + Version, on_delete=models.CASCADE, related_name="diff_a" |
| 212 | + ) |
| 213 | + version_b = models.ForeignKey( |
| 214 | + Version, on_delete=models.CASCADE, related_name="diff_b" |
| 215 | + ) |
| 216 | + build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_a") |
| 217 | + build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_b") |
| 218 | + diff = JSONField() |
| 219 | +
|
| 220 | +The diff will be a JSON object with the files that were added, removed, or modified, |
| 221 | +with a structure like this: |
| 222 | + |
| 223 | +.. code:: json |
| 224 | +
|
| 225 | + { |
| 226 | + "added": [{"file": "new.txt"}], |
| 227 | + "removed": [{"file": "deleted.txt"}], |
| 228 | + "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}] |
| 229 | + } |
| 230 | +
|
| 231 | +The information is stored in a form similar to the one that will be returned by the API. |
| 232 | +Important things to note: |
| 233 | + |
| 234 | +- We need to take into consideration the diff of the latest successful builds only. |
| 235 | + If any of the builds from the stored diff doesn't match the latest successful build of its version, |
| 236 | + we need to compute the diff again. |
| 237 | +- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``. |
| 238 | + We can store that information as well, or just calculate it on the fly. |
| 239 | +- The entries in each list of files are objects, so we can store additional information in the future. |
| 240 | +- When a file has been modified, we also store the number of lines that changed. |
| 241 | + We could also show this for files that were added or removed. |
| 242 | +- If a project or version is deleted (or deactivated), we should delete the diff as well. |
| 243 | +- Using the DB to save this information will serve as the lock for the API, |
| 244 | + so we don't calculate the diff multiple times for the same versions. |
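As an illustration of the point above about inferring the diff between ``B`` and ``A`` from the diff between ``A`` and ``B``, here is a minimal sketch. The function name ``invert_diff`` is hypothetical, and it assumes the diff uses the JSON structure shown earlier in this section.

```python
def invert_diff(diff):
    """Return the diff in the opposite direction (B against A)."""

    def swap_lines(entry):
        # Copy the entry so the stored diff is left untouched.
        inverted = dict(entry)
        if "lines" in entry:
            lines = entry["lines"]
            # Lines added in one direction are removed in the other.
            inverted["lines"] = {
                "added": lines["removed"],
                "removed": lines["added"],
            }
        return inverted

    return {
        # Files added in one direction are removed in the other.
        "added": [dict(entry) for entry in diff["removed"]],
        "removed": [dict(entry) for entry in diff["added"]],
        "modified": [swap_lines(entry) for entry in diff["modified"]],
    }


diff_a_b = {
    "added": [{"file": "new.txt"}],
    "removed": [{"file": "deleted.txt"}],
    "modified": [{"file": "changed.txt", "lines": {"added": 2, "removed": 1}}],
}
diff_b_a = invert_diff(diff_a_b)
```

Since the inversion is cheap, calculating it on the fly avoids storing a second row per pair of versions.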
| 245 | + |
| 246 | +We could store the changed files sorted by the number of changes, or make that an option in the API, |
| 247 | +or just let the client sort the files as they see fit. |
| 248 | + |
| 249 | +API |
| 250 | +--- |
| 251 | + |
| 252 | +The initial diff operation can be expensive, so we may consider not exposing this feature to unauthenticated users. |
| 253 | +And a diff can only be done between versions of the same project that the user has access to. |
| 254 | + |
| 255 | +The endpoint will be: |
| 256 | + |
| 257 | + GET /api/v3/projects/{project_slug}/diff/?version_a={version_a}&version_b={version_b} |
| 258 | + |
| 259 | +And the response will be: |
| 260 | + |
| 261 | +.. code:: json |
| 262 | +
|
| 263 | + { |
| 264 | + "version_a": {"id": 1, "build": {"id": 1}}, |
| 265 | + "version_b": {"id": 2, "build": {"id": 2}}, |
| 266 | + "diff": { |
| 267 | + "added": [{"file": "new.txt"}], |
| 268 | + "removed": [{"file": "deleted.txt"}], |
| 269 | + "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}] |
| 270 | + } |
| 271 | + } |
| 272 | +
|
| 273 | +The version and build can be the full objects, or just the IDs and slugs. |
| 274 | + |
| 275 | +We will generate a lock on this request, to avoid multiple calls to the API for the same versions. |
| 276 | +We can reply with a ``202 Accepted`` if the diff is being calculated in another request. |
| 277 | + |
| 278 | +Integrations |
| 279 | +------------ |
| 280 | + |
| 281 | +You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong! |
| 282 | + |
| 283 | +Doing the API call is easy, but knowing *when* to call it is hard. |
| 284 | +We need to call the API after the build has finished successfully, |
| 285 | +or we will be comparing the files of an incomplete or stale build. |
| 286 | + |
| 287 | +Luckily, we have a webhook that tells us when a build has finished successfully. |
| 288 | +But, we don't want users to have to implement the integration by themselves. |
| 289 | + |
| 290 | +We could: |
| 291 | + |
| 292 | +- Use this as an opportunity to explore using GitHub Apps. |
| 293 | +- Request additional permissions in our existing OAuth2 integration (``project`` scope). Probably not a good idea. |
| 294 | +- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard. |
| 295 | + Maybe don't even expose the API to the public, just use it internally. |
| 296 | +- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__ |
| 297 | + to trigger the action from our webhook. This requires the user to do some additional setup, |
| 298 | + and for our webhooks to support custom headers. |
| 299 | +- Hit the API repeatedly from the GitHub action until the diff is ready. |
| 300 | + This is not ideal: some builds may take a long time, and the action may time out. |
| 301 | +- Expose this feature in the addons API only, which will hit the service when a user views the PR preview. |
| 302 | + |
| 303 | +Initial implementation |
| 304 | +---------------------- |
| 305 | + |
| 306 | +For the initial implementation, we will: |
| 307 | + |
| 308 | +- Generate a manifest of all HTML files from the versions that we want to compare. |
| 309 | + This will be done at the end of the build. |
| 310 | +- Generate the hash based on the main content of the file, |
| 311 | + not the whole file. |
| 312 | +- MD5 will be the hashing algorithm used. |
| 313 | +- Only expose the files that were added, removed, or modified (HTML files only). |
| 314 | + The number of lines that changed won't be exposed. |
| 315 | +- Don't store the results in the DB; |
| 316 | + we can store the results in a later iteration. |
| 317 | +- Expose this feature only via the addons feature. |
| 318 | +- Allow diffing an external version only against the version that points to the default branch/tag of the project. |
| 319 | +- Use a feature flag to enable this feature on projects. |
| 320 | + |
| 321 | +Other features that are not mentioned here, like exposing the number of lines that changed, |
| 322 | +or a public API, will not be implemented in the initial version, |
| 323 | +and may be considered in the future (and their implementation is subject to change). |
| 324 | + |
| 325 | +Possible issues |
| 326 | +--------------- |
| 327 | + |
| 328 | +In the case that we use a manifest, |
| 329 | +hashing the contents of the files may add some overhead to the build. |
| 330 | + |
| 331 | +In the case that we use ``rclone``, |
| 332 | +even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls. |
| 333 | +But since we are doing this on demand, and we can cache the results, we can minimize the costs |
| 334 | +(which may not be that much anyway). |
| 335 | + |
| 336 | +``rclone check`` returns only the list of files that changed. |
| 337 | +If we want to make additional checks over those files, we will need to make additional calls to S3. |
| 338 | + |
| 339 | +We should also check only a limited number of files, since we don't want to run a diff over thousands of files, |
| 340 | +and set a limit on the size of the files. |
| 341 | + |
| 342 | +Future improvements and ideas |
| 343 | +----------------------------- |
| 344 | + |
| 345 | +- Detect moved files. |
| 346 | + This will imply checking the hashes of deleted and added files: |
| 347 | + if the hash of a file that was deleted matches the hash of a file that was added, |
| 348 | + we have a move. |
| 349 | + In case we use rclone, since we don't have access to those hashes after rclone is run, |
| 350 | + we would need to re-fetch that metadata from S3. |
| 351 | + Could be a feature request for rclone. |
| 352 | +- Detect changes in sections of HTML files. |
| 353 | + We could re-use the code we have for search indexing. |
| 354 | +- Expand to other file types. |
| 355 | +- Allow doing a diff between versions of different projects. |
| 356 | +- Allow configuring how the main content of the file is detected |
| 357 | + (like a CSS selector). |
| 358 | +- Allow configuring content that should be ignored when hashing the file |
| 359 | + (like a CSS selector). |
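The moved-files idea from the list above could be sketched like this, assuming the manifest approach is used so the hashes of both versions are available. The function name ``detect_moves`` and the sample paths are hypothetical. When several files share the same hash, this sketch pairs them in sorted order, which is an arbitrary choice.

```python
def detect_moves(base_files, current_files):
    """Pair removed and added paths that share the same content hash.

    Both arguments are ``files`` mappings from a manifest:
    path -> {"hash": ...}. Returns a list of (old_path, new_path) tuples.
    """
    removed = base_files.keys() - current_files.keys()
    added = current_files.keys() - base_files.keys()

    # Index the added paths by hash, so each removed file is a quick lookup.
    added_by_hash = {}
    for path in sorted(added):
        added_by_hash.setdefault(current_files[path]["hash"], []).append(path)

    moves = []
    for path in sorted(removed):
        candidates = added_by_hash.get(base_files[path]["hash"])
        if candidates:
            # Use each added file at most once.
            moves.append((path, candidates.pop(0)))
    return moves


base_files = {
    "index.html": {"hash": "ccc"},
    "guide/install.html": {"hash": "aaa"},
    "old.html": {"hash": "bbb"},
}
current_files = {
    "index.html": {"hash": "ccc"},
    "setup/install.html": {"hash": "aaa"},
    "new.html": {"hash": "ddd"},
}
moves = detect_moves(base_files, current_files)
```

Each detected pair is a natural candidate for a suggested redirect from the old path to the new one.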