-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Version file tree diff: design doc #11507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
2ba73fa
07e2dba
8fb3941
1f2b58a
299e000
71f5b18
32ce43d
e41c720
173c56d
de82f28
301c492
623ea93
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -45,8 +45,19 @@ The key points of this feature are: | |
- Expose that as an API. | ||
- Integrate that in PR previews. | ||
|
||
Diff between two S3 directories | ||
------------------------------- | ||
Diff between two versions | ||
------------------------- | ||
|
||
Using a manifest | ||
~~~~~~~~~~~~~~~~ | ||
|
||
We can create a manifest that contains the hashes and other important metadata of the files, | ||
we can save this manifest in storage or in the DB. | ||
|
||
When a build finishes, we generate this manifest for all HTML files, and store it. | ||
When we need to compare two versions, we can just compare the manifests. | ||
|
||
This doesn't require downloading the files, but it requires building a version to generate the manifest. | ||
|
||
Using rclone | ||
stsewd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
~~~~~~~~~~~~ | ||
|
@@ -75,20 +86,29 @@ another option can be to output each type of change to a different file (``--mis | |
|
||
To start, we will only consider HTML files (``--include=*.html``). | ||
|
||
Using a manifest | ||
~~~~~~~~~~~~~~~~ | ||
Changed files | ||
------------- | ||
|
||
Another option is to create a manifest that contains the hashes and other important metadata of the files, | ||
we can save this manifest in storage or in the DB. | ||
Listing the files that were added or deleted is straightforward, | ||
but when listing the files that were modified, we want to list files that had relevant changes only. | ||
|
||
When a build finishes, we can generate this manifest and store it. | ||
When we need to compare two versions, we can just compare the manifests. | ||
For example, if the build injects some content that changes on every build (like a timestamp or commit), | ||
we don't want to list all files as modified. | ||
|
||
We have a couple of options to improve this list. | ||
|
||
Hashing the main content | ||
~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Timestamps and other metadata is usually added in the footer of the files, outside the main content. | ||
Instead of hashing the whole file, we can hash only the main content of the file, | ||
and use that hash to compare the files. | ||
|
||
This will allow us to better detect files that were modified in a meaningful way. | ||
|
||
Lines changed between two files | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version. I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience. |
||
------------------------------- | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Having a list of files that changed is good, but if the builds inject some content | ||
that changes on every build (like a timestamp), all files will always be marked as changed. | ||
In order to provide more useful information, we can sort the files by some metrics, | ||
like the number of lines that changed. | ||
|
||
|
@@ -137,16 +157,13 @@ A good thing of using Python is that we don't need to write the files to disk, | |
and the result is easier to parse. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's probably slower as well, but I think I'm liking the idea of doing the diff in Python. |
||
|
||
Alternative metrics | ||
~~~~~~~~~~~~~~~~~~~ | ||
+++++++++++++++++++ | ||
|
||
Checking the number of lines that changed is a good metric, but it requires downloading the files. | ||
Another metric we could use is the size of the files, that can be obtained from the metadata (no need of downloading the files), | ||
The most a file size has changed, the most lines have likely been added or removed, | ||
this still leaves lines that changed with the same amount of characters as irrelevant in the listing. | ||
|
||
Another way could be to check for lines changed in the main content of the file, | ||
we can re-use the code we have for search indexing. | ||
|
||
Storing results | ||
--------------- | ||
|
||
|
@@ -256,14 +273,17 @@ Initial implementation | |
|
||
For the initial implementation, we will: | ||
|
||
- Use the ``rclone check`` command to get the diff between two versions. | ||
- Generate a manifest of all HTML files from the versions that we want to compare. | ||
This will be done at the end of the build. | ||
- Generate the hash based on the main content of the file, | ||
not the whole file. | ||
Comment on lines
+308
to
+311
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we going to create two different manifest files? 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No. Why did you think we need two? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not saying we need/want two manifest files, but I'm confused about if the manifest will have the hash for the main content and another hash for the full file. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just the main content, don't think we need the hash from the whole content for anything. |
||
- Only expose the files that were added, removed, or modified (HTML files only). | ||
The number of lines that changed wont be exposed. | ||
- Store the results in the DB. | ||
- Don't store the results in the DB, | ||
we can store the results in a next iteration. | ||
Comment on lines
+315
to
+316
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What results are you referencing here? Didn't we say to store the hash in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The results from the diff, hashes will be stored in storage in a JSON file. But we may want to store the results from the diff for faster responses. We don't need to use |
||
- Expose this feature only via the addons feature. | ||
- Allow to diff an external version against the version that points to the default branch/tag of the project only. | ||
- Use a feature flag to enable this feature on projects. | ||
- Run the diff while we have the files on disk (end of the build), if possible. | ||
|
||
Other features that are not mentioned here, like exposing the number of lines that changed, | ||
or a public API, will not be implemented in the initial version, | ||
|
@@ -272,7 +292,11 @@ and may be considered in the future (and thier implementation is subject to chan | |
Possible issues | ||
--------------- | ||
|
||
Even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls. | ||
In the case that we use a manifest, | ||
hashing the contents of the files may add some overhead to the build. | ||
|
||
In the case that we use ``rclone``, | ||
even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls. | ||
But since we are doing this on demand, and we can cache the results, we can minimize the costs | ||
(maybe is not that much). | ||
|
||
|
@@ -289,10 +313,14 @@ Future improvements and ideas | |
This will imply checking the hashes of deleted and added files, | ||
if that same hash of a file that was deleted matches one from a file that was added, | ||
we have a move. | ||
But since we don't have access to those hashes after rclone is run, | ||
In case we use rclone, since we don't have access to those hashes after rclone is run, | ||
we would need to re-fetch that metadata from S3. | ||
Could be a feature request for rclone. | ||
- Detect changes in sections of HTML files. | ||
stsewd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
We could re-use the code we have for search indexing. | ||
- Expand to other file types | ||
- Allow doing a diff between versions of different projects | ||
- Allow to configure how the main content of the file is detected | ||
(like a CSS selector). | ||
- Allow to configure content that should be ignored when hashing the file | ||
(like a CSS selector). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand these manifest will hash only the "main content" of the document. We should mention that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mentioned in the "Changed files" and "Initial implementation" section, this section doesn't touch on how the hash is calculated.