File tree diff API #11319
Comments
All of this sounds like a really good idea to me. I see how we can implement some nice features on top of this data 👍🏼
We may need to do this at S3 if that's possible. Otherwise, we will need to |
We talked about using the file hash that S3 can return via the API (https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObjectAttributes.html) and comparing it with the hash we can calculate locally.
The next step here is to do some research and figure out what the best approach will be. If it looks pretty straightforward, we can move forward with implementation after that.
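To make the hash comparison concrete, here is a rough sketch. It assumes the remote hash is an MD5-based ETag, which S3 only guarantees for single-part uploads that aren't encrypted with SSE-KMS or SSE-C; `local_md5` and `matches_remote` are hypothetical helper names, and fetching the ETag from S3 is left out.

```python
import hashlib
from pathlib import Path


def local_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_remote(path: Path, remote_etag: str) -> bool:
    """Compare a local file against a remote ETag.

    Assumption: the ETag is a plain MD5 hex digest (single-part upload).
    S3 returns the ETag wrapped in double quotes, so strip them first.
    """
    return local_md5(path) == remote_etag.strip('"').lower()
```

For multipart uploads the ETag is not a plain MD5, so a different checksum would be needed there.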
rclone provides a way to get the diff between two sources, remote or local. https://rclone.org/commands/rclone_check/
This allows us to compare two versions that are already in S3, or to compare files that are about to be uploaded with a version that's in S3. This can be useful for PR builds, where we know that a useful diff would be between the PR preview and the latest version. That should save a round trip to S3. There is also the
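As a sketch of how the output could be consumed: `rclone check SOURCE DEST --combined -` writes one line per path to stdout, a symbol followed by a space and the path (`=` identical, `+` only in source, `-` only in destination, `*` different, `!` error). The parser below assumes the source is the new version and the destination is the current one; the category names and the `parse_combined_report` function are just illustrative.

```python
# Map rclone's --combined symbols to diff categories, assuming
# SOURCE is the new version and DEST is the current version.
SYMBOLS = {
    "+": "added",      # only in source (new version)
    "-": "deleted",    # only in destination (current version)
    "*": "modified",   # present in both, but different
    "=": "unchanged",  # present in both and identical
    "!": "error",      # rclone could not read or hash the file
}


def parse_combined_report(report: str) -> dict[str, list[str]]:
    """Group the paths of a `--combined` report by category."""
    result: dict[str, list[str]] = {name: [] for name in SYMBOLS.values()}
    for line in report.splitlines():
        # Each valid line is "<symbol> <path>"; skip anything else.
        if len(line) < 3 or line[1] != " ":
            continue
        symbol, path = line[0], line[2:]
        if symbol in SYMBOLS:
            result[SYMBOLS[symbol]].append(path)
    return result
```

Note that this only tells us which files were added, deleted, or modified, not how many lines changed.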
Nice, that seems like a great approach to start. I assume that it's still hitting a bunch of S3 calls, but probably fine since builds happen way less frequently than doc serving, so it'll probably be a small number of total queries compared to doc serving 👍
Yeah, there is no way around that. Results will be cached, obviously (and diff will be generated on demand). If we are good with that, I can start writing a small design doc with the interface of the API and how it could integrate with PR builds.
I think it's a great place to start. I think we can figure out the architecture and workflow, and if we hit scaling issues, we can always reevaluate that part, but I think the work to integrate it is definitely worth pushing forward on.
@stsewd have you checked this idea? Any thoughts, pros/cons? Would this give us the number of lines changed in each file, or only which files were added/modified/deleted?
rclone doesn't download the files to do the checks.
No, to get the lines that changed, we need to download the files.
I've put this on the discussion list for next week. The design doc was really helpful, and we've had a lot of discussion from that doc.
* Version file tree diff: design doc ref #11319
* format
* This is just plain text
* After thoughts
* Updates from review
* Update doc
* Linter
* Updates from comments
* Updates from recent conversations
* More updates
What's the problem this feature will solve?
Currently, we can't link users to a specific page that has changed in a PR preview, and we can't suggest redirects for files that were renamed or deleted.
Describe the solution you'd like
File tree diff (FTD) is a feature that allows users to see the differences between the file trees of the generated documentation from two versions. This allows users to see which files were deleted or added. It can also list the files that were changed, sorted by the number of lines changed in each file.
I haven't done much research on this, but diffing two file trees should be a solved problem. In the worst case, we can do a manual diff using a set for the file tree and the Unix diff command for the files. We could expose this as an API.
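A minimal sketch of the set-based fallback, assuming both versions' generated files are available as local directories. The function names, the `.html` filter, and the `limit` default are illustrative, not a proposed interface.

```python
from pathlib import Path


def list_files(root: Path, suffix: str = ".html") -> set[str]:
    """Collect the relative paths of all files under root with the given suffix."""
    return {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix == suffix
    }


def diff_file_trees(current_root: Path, new_root: Path, limit: int = 100) -> dict[str, list[str]]:
    """Return the files added to and deleted from the new tree, capped at `limit`."""
    current = list_files(current_root)
    new = list_files(new_root)
    return {
        "added": sorted(new - current)[:limit],
        "deleted": sorted(current - new)[:limit],
    }
```

Files present in both trees (`current & new`) are the candidates for a content diff.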
Note that this is a different product from the diff feature for pull requests. That one focuses on diffing the HTML content itself; here we only care about which files changed and how many lines changed.
A basic example would be:
Current content:
New content:
Our API will list the files that were removed and the ones that were added. We can scope it to just HTML files to start, and maybe limit the number of files returned.
There may be some tools that change all the pages on each build (like updating the commit on each file); that's where sorting by the number of lines changed comes into play.
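A rough sketch of that sorting, using Python's `difflib` in place of the Unix diff command. It assumes the file contents are already loaded as strings keyed by path; `lines_changed` and `rank_by_changes` are hypothetical names.

```python
import difflib


def lines_changed(old: str, new: str) -> int:
    """Count the lines removed from `old` plus the lines added in `new`."""
    matcher = difflib.SequenceMatcher(
        a=old.splitlines(), b=new.splitlines(), autojunk=False
    )
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def rank_by_changes(current: dict[str, str], new: dict[str, str]) -> list[tuple[str, int]]:
    """Rank files present in both versions by number of lines changed, largest first."""
    counts = {
        path: lines_changed(current[path], new[path])
        for path in current.keys() & new.keys()
    }
    # Drop unchanged files; break ties by path so the order is stable.
    changed = [(path, n) for path, n in counts.items() if n]
    return sorted(changed, key=lambda item: (-item[1], item[0]))
```

With this ordering, a page where only the commit line changed ends up at the bottom of the list, below pages with real content changes.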
Some feature ideas that we can build on top:
#foo
->#bar
Alternative solutions
We have discussed other solutions for this in the past, but they rely on the source files, not on the generated files. That's a problem, since our serving and redirects work over the generated HTML files.
Alternative names
Additional context