|
| 1 | +Version file tree diff |
| 2 | +====================== |
| 3 | + |
| 4 | +Goals |
| 5 | +----- |
| 6 | + |
| 7 | +- Compare files from two versions to identify the files that have been added, removed, or modified. |
| 8 | +- Provide an API for this feature. |
| 9 | +- Integrate this feature to suggest redirects on files that were removed. |
| 10 | +- Integrate this feature to list the files that changed in a pull request. |
| 11 | + |
| 12 | +Non-goals |
| 13 | +--------- |
| 14 | + |
| 15 | +- Replace the `docdiff <https://github.com/readthedocs/addons?tab=readme-ov-file#docdiff>`__ feature from addons |
| 16 | + |
| 17 | +Current problems |
| 18 | +---------------- |
| 19 | + |
| 20 | +Currently, when a user opens a PRs, they need to manually search for the files of interest (new and modified files). |
| 21 | +We have a GitHub action that links to the root of the documentation preview, that helps a little, but it's not enough. |
| 22 | + |
| 23 | +When files are removed or renamed, users may not be aware that a redirect may be needed. |
| 24 | +We track 404s in our traffic analytics, but they don't keep track of the version, |
| 25 | +and it may be too late to add a redirect when users are already seeing a 404. |
| 26 | + |
| 27 | +In the past, we haven't implemented those features, because it's hard to map the source files to the generated files, |
| 28 | +since that depends on the build tool and configuration used by the project. |
| 29 | + |
| 30 | +Git providers may already offer a way to compare file trees, but again, |
| 31 | +they work on the source files, and not on the generated files. |
| 32 | + |
| 33 | +All hope was lost for having nice features like this, until now. |
| 34 | + |
| 35 | +Proposed solution |
| 36 | +----------------- |
| 37 | + |
| 38 | +Since redirects and files of interest are related to the generated files. |
| 39 | +Instead of working over the source files, we will work over the generated files, which we have access to. |
| 40 | + |
| 41 | +The key points of this feature are: |
| 42 | + |
| 43 | +- Get the diff of the file tree between two versions. |
| 44 | +- Expose that as an API. |
| 45 | +- Integrate that in PR previews. |
| 46 | + |
| 47 | +Diff between two S3 file trees |
| 48 | +------------------------------ |
| 49 | + |
| 50 | +We are already using ``rclone`` to speed up uploads to S3, |
| 51 | +``rclone`` has a command (``rclone check``) to return the diff between two sources. |
| 52 | +For this, it uses the metadata of the files, like size and hash |
| 53 | +(it doesn't download the files). |
| 54 | + |
| 55 | +.. code:: console |
| 56 | +
|
| 57 | + $ ls a |
| 58 | + changed.txt new.txt unchanged.txt |
| 59 | + $ ls b |
| 60 | + changed.txt deleted.txt unchanged.txt |
| 61 | + $ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b |
| 62 | + + new.txt |
| 63 | + - deleted.txt |
| 64 | + = unchanged.txt |
| 65 | + * changed.txt |
| 66 | +
|
| 67 | +The result is a list of files with a mark indicating if they were added, removed, or modified. |
| 68 | +The result is easy to parse. |
| 69 | + |
| 70 | +To start, we will only consider HTML files (``--include=*.html``). |
| 71 | + |
| 72 | +Lines changed between two files |
| 73 | +------------------------------- |
| 74 | + |
| 75 | +Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed. |
| 76 | +This is useful to link to the most relevant files that changed in a PR. |
| 77 | + |
| 78 | +.. code:: console |
| 79 | +
|
| 80 | + $ cat a.txt |
| 81 | + One |
| 82 | + Two |
| 83 | + Three |
| 84 | + Four |
| 85 | + Five |
| 86 | + $ cat b.txt |
| 87 | + Ore |
| 88 | + Three |
| 89 | + Four |
| 90 | + Five |
| 91 | + Six |
| 92 | + $ diff --side-by-side --suppress-common-lines a.txt b.txt |
| 93 | + One | Ore |
| 94 | + Two < |
| 95 | + > Six |
| 96 | +
|
| 97 | +.. note:: |
| 98 | + |
| 99 | + Taken from https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only. |
| 100 | + |
| 101 | +The command will return only the lines that changed between the two files. |
| 102 | +We can just count the lines, or maybe even parse each symbol to check if the line was added or removed. |
| 103 | + |
| 104 | +.. note:: |
| 105 | + |
| 106 | + Python also has a `difflib <https://docs.python.org/3/library/difflib.html>`__ module that can be used to get the diff between two files. |
| 107 | + But doesn't look like it can distinguish lines that were changed from lines that were added or removed. |
| 108 | + But maybe that's ok? Do we really need to know if a line was changed instead of added or removed? |
| 109 | + |
| 110 | +Storing results |
| 111 | +--------------- |
| 112 | + |
| 113 | +Doing a diff between two versions can be expensive, so we need to store the results. |
| 114 | + |
| 115 | +We can store the results in the DB (``VersionDiff``), or in S3 (``diff/project/a--b.json``). |
| 116 | +The information to store would contain the versions compared, and the diff. |
| 117 | + |
| 118 | +.. code:: json |
| 119 | +
|
| 120 | + { |
| 121 | + "version_a": {"id": 1, "build": {"id": 1}}, |
| 122 | + "version_b": {"id": 2, "build": {"id": 2}}, |
| 123 | + "diff": { |
| 124 | + "added": [{"file": "new.txt"}], |
| 125 | + "removed": [{"file": "deleted.txt"}], |
| 126 | + "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}] |
| 127 | + } |
| 128 | + } |
| 129 | +
|
| 130 | +The information is stored in a similar way that it will be returned by the API. |
| 131 | +Things important to note: |
| 132 | + |
| 133 | +- We need to take into consideration the diff of the latest successful builds only. |
| 134 | + If any of the builds that we have stored don't match the latest successful build of any of the versions, |
| 135 | + we need to the diff again. |
| 136 | +- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``. |
| 137 | + We can store that information as well, or just calculate it on the fly. |
| 138 | +- The files are objects, so we can store additional information in the future. |
| 139 | +- When a file has been modified, we also store the number of lines that changed. |
| 140 | + To start we can just store the number of lines added and removed combined ``{"lines": {"changed": 99}}``. |
| 141 | +- Instead of using the slugs on storage (if we decide to use S3), we could use the IDs, that will prevent the need to update the data if the slugs change. |
| 142 | + But if the slugs change, the builds will probably be different, so we will need to update the data anyway. |
| 143 | + |
| 144 | +We could store the changed files sorted by the number of changes, or make that an option in the API, |
| 145 | +or just let the client sort the files as they see fit. |
| 146 | + |
| 147 | +API |
| 148 | +--- |
| 149 | + |
| 150 | +This operation is expensive, so it won't be available to unauthenticated users. |
| 151 | +And a diff can only be done between versions of the same project that the user has access to. |
| 152 | + |
| 153 | +The endpoint will be: |
| 154 | + |
| 155 | +.. code:: http |
| 156 | +
|
| 157 | + GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/ |
| 158 | +
|
| 159 | +And the response will be: |
| 160 | + |
| 161 | +.. code:: json |
| 162 | +
|
| 163 | + { |
| 164 | + "version_a": {"id": 1, "build": {"id": 1}}, |
| 165 | + "version_b": {"id": 2, "build": {"id": 2}}, |
| 166 | + "diff": { |
| 167 | + "added": [{"file": "new.txt"}], |
| 168 | + "removed": [{"file": "deleted.txt"}], |
| 169 | + "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}] |
| 170 | + } |
| 171 | + } |
| 172 | +
|
| 173 | +The version and build can be the full objects, or just the IDs and slugs. |
| 174 | + |
| 175 | +We will generate a lock on this request, to avoid multiple calls to the API for the same versions. |
| 176 | +We can reply with a ``202 Accepted`` if the diff is being calculated in another request. |
| 177 | + |
| 178 | +Integrations |
| 179 | +------------ |
| 180 | + |
| 181 | +You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong! |
| 182 | + |
| 183 | +Doing the API call is easy, but knowing *when* to call it is hard. |
| 184 | +We need to call the API after the build has finished successfully. |
| 185 | + |
| 186 | +Luckily, we have a webhook that tells us when a build has finished successfully. |
| 187 | +But, we don't want users to have to implement the integration themselves. |
| 188 | + |
| 189 | +We could: |
| 190 | + |
| 191 | +- Use this as an opportunity to explore using GitHub Apps. |
| 192 | +- Request additional permissions in our existing OAuth2 integration (``project`` scope). |
| 193 | +- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard. |
| 194 | + Maybe don't even expose the API to the public, just use it internally. |
| 195 | +- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__ |
| 196 | + to trigger the action from our webhook. This requires the user to do some additional setup, |
| 197 | + and for our webhooks to support custom headers. |
| 198 | + |
| 199 | +Possible issues |
| 200 | +--------------- |
| 201 | + |
| 202 | +Even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls. |
| 203 | +But since we are doing this on demand, and we can cache the results, we can minimize the costs. |
| 204 | + |
| 205 | +``rclone check`` returns only the list of files that changed, |
| 206 | +if we want to make additional checks over those files, we will need to make additional calls to S3. |
| 207 | + |
| 208 | +Future improvements and ideas |
| 209 | +----------------------------- |
| 210 | + |
| 211 | +- Detect moved files. |
| 212 | +- Detect changes in sections of HTML files. |
| 213 | +- Expand to other file types. Maybe images? |
| 214 | +- Integrate with addons. |
| 215 | +- Diff between versions of different projects? |
0 commit comments