File tree diff #11646

Merged: 22 commits merged into main from file-tree-diff on Oct 22, 2024

Conversation

@stsewd (Member) commented Oct 3, 2024

This is on top of #11643.

Closes #11319

Ref #11507

stsewd added 2 commits October 2, 2024 17:07
Currently, we walk the entire project directory to apply two operations: indexing files in ES, and keeping track of index/404 files. These two operations are independent, but in our code they are somewhat mixed together in order to avoid walking the project directory twice.

I have abstracted the processing of the files with an "Indexer" class, which is responsible for performing an operation on each file and, at the end, collecting the results.
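
To make the idea concrete, here is a rough sketch of what such an interface could look like. Apart from the process() signature and the hash-tracking line, which appear in this PR's diff, the class and function names below are illustrative assumptions, not the actual implementation:

```python
class Indexer:
    """Performs one operation per file; results are collected at the end."""

    def process(self, html_file, sync_id):
        raise NotImplementedError

    def collect_results(self):
        raise NotImplementedError


class FileManifestIndexer(Indexer):
    """Hypothetical indexer that records a content hash per HTML file."""

    def __init__(self):
        self._hashes = {}

    def process(self, html_file, sync_id):
        self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]

    def collect_results(self):
        return self._hashes


def walk_project_files(html_files, indexers, sync_id):
    # Walk the project directory once, letting every indexer process each file.
    for html_file in html_files:
        for indexer in indexers:
            indexer.process(html_file, sync_id)
    return {type(indexer).__name__: indexer.collect_results() for indexer in indexers}
```
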
@stsewd (Member, Author) commented Oct 3, 2024

This is still missing tests and docstrings, but it would be great to have a review to confirm we are good with this direction.

@ericholscher (Member) left a comment

This looks like a really nice, simple implementation with the Indexer refactor. Seems like a great place to start, and a good way to start seeing the edge cases as we hit them :)

        self._hashes = {}

    def process(self, html_file: HTMLFile, sync_id: int):
        self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
@ericholscher (Member) commented Oct 7, 2024

I do wonder if we want to store the hash on the HTMLFile as well. With this architecture, querying the hash of a file is going to be pretty complex and will require pulling from S3. I'm not sure we have an obvious use case for getting the hash of a specific file, so maybe something to note for the future? Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔

Suggested change
self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
# TODO: Possibly store the hash on the html_file object or in cache,
# since this will allow us to query that from the DB directly.

Member

Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔

Good point. Most projects using the PR builder will access the PR at least once immediately after the build finishes. We could consider optimizing this somehow (e.g. saving the whole JSON in the cache for 30 minutes, so we don't have to hit S3), but I'm not too worried about this just yet. Dealing with cache is always hard due to invalidation 😄

It's a good idea to leave a comment here with these ideas, though 👍🏼

@stsewd (Member, Author)

We no longer save each HTML file in the DB, and I'm not sure we want to go back to storing that information there. What I can see us doing is caching the result of the diff operation.

@stsewd linked an issue Oct 9, 2024 that may be closed by this pull request
@humitos (Member) left a comment

I like this initial implementation.

However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this together with search indexing inside the Celery task. I'd like this to live outside of search, generating the manifest.json file inside the build process while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.


@stsewd (Member, Author) commented Oct 10, 2024

However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this together with search indexing inside the Celery task. I'd like this to live outside of search, generating the manifest.json file inside the build process while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.

Doing this in the build process itself has some downsides:

  • Build time may increase for users
  • To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).
  • We already walk the file tree in the Celery task; there is no need for additional code or test setup.

We can still have the "create your own manifest" feature; this implementation doesn't exclude it.

Base automatically changed from refactor-search-index-process to main October 10, 2024 15:08
@humitos (Member) commented Oct 10, 2024

Build time may increase for users

This is fine with me. I think it's good to consider the generation of the manifest as part of the build time; that gives users (and us) a more realistic measure of the build process. Projects with heavy CPU consumption should pay us more for bigger resources, instead of us hiding this behind a Celery task and paying the cost ourselves, without even knowing which customers consume that time in those tasks (we are not tracking this for these Celery tasks).

I think in most cases it won't matter much, but I'm happy to see some performance data here if that helps.

To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).

I'm not seeing an issue here, since generating the manifest for old versions isn't very useful for now. We are only considering pull request and latest versions, which are being built all the time.

Even if we were able to trigger a task to generate the manifest for any version, it would only be possible using our own algorithm, since we don't know where and how the user might be generating this file. So exposing such a feature could be confusing, or could even overwrite a manifest already generated by a custom process.

I'm fine with focusing on new versions for now and keeping the implementation simple.

We already walk the file tree in the Celery task; there is no need for additional code or test setup.

Async things are always more complex to handle, maintain, debug, etc. Besides, it could even end up being a worse UX, since the build could already be finished while the manifest generation isn't, making the PR preview not show the file tree diff (because it isn't ready yet). We would need to implement polling in the client and add more UI/UX around it to handle these cases, complicating things on both sides: backend and frontend.

On the other hand, walking a tree of files on disk is a pretty fast operation. I don't see a lot of benefit in doing it only once, compared with the complexity added by living inside a Celery task and being entangled with the search code.


I'd start as simple as possible, using a sync approach that generates the build manifest inside the build process, as we all previously discussed.

@ericholscher (Member) commented Oct 12, 2024

It sounds like maybe we need to discuss when this code runs a bit more in a call this week? I don't think I was thinking too deeply about it in the design process, tbh. In general this approach mostly makes sense to me, but the UX around the delay in the indexing is a reasonable concern.

I do think slower builds are a bad UX, and we can count this time as part of the build process in some other way, but making builds slower isn't a good thing to design in if we can avoid it.

Overall this architecture makes a ton of sense to me, since we're already doing all the parsing of the primary content in the search code. Perhaps a better question is: why don't we also migrate the search indexing to run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.

I assume we need tasks to run this logic in an on-demand fashion (e.g. search reindexing, or build manifest creation outside of a build), but it seems like it wouldn't be hard to support both: by default running while the files are on the build server, but also being able to run against S3 if needed?

@humitos (Member) commented Oct 14, 2024

Perhaps a better question is: why don't we also migrate the search indexing to run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.

I don't know what the original reasons were for separating this process from the build itself. All my opinions about hash generation apply here as well. However, I think search indexing makes use of the DB (e.g. ImportedFile), and that was probably the reason to run it in a Celery task on the web queue.

Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.

I assume we need tasks to run this logic in an on-demand fashion (e.g. search reindexing, or build manifest creation outside of a build),

We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules, as we discussed. This is because the commands will be executed inside the build process itself.

@stsewd (Member, Author) commented Oct 14, 2024

We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules

We can still have that feature; the user-generated manifest will take precedence over our process.

Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.

We would just be moving that complexity into the build itself; we can't escape it. Executing things outside the build also keeps us moving in the direction of isolating builds from our application.

@ericholscher (Member)

I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code. I think the ideas around the UX and manifest creation are gonna take some time to get right, so I'm fine doing the indexing after the build while we work on it behind a feature flag.

There's definitely a larger philosophical discussion here, and I don't want to block our initial testing on that discussion if we can avoid it, and changing where this code runs is easy to do after we have an initial version out and tested.

@agjohnson (Contributor) commented Oct 14, 2024

I agree with the points @humitos shared above. I think we should aim to surface as much of our build process to users as possible, especially any part where we are giving or plan to give control to the user.

It does feel like a bit of a miss to have the structure of build.jobs and not utilize it. Using build.jobs as the place to run and override both search indexing and hash manipulation would fit very well with the design of the rest of the build process. It also feels like a stronger delineation between the user build and our application, as it is explicit rather than hidden magic.

I think concerns over build time are premature, though certainly not invalid. We can guess that large projects might struggle with this, but we have no timing figures yet, and part of our plan should be to gather them. Are we talking 1-10s for most projects? That seems perfectly fine, and I'm happy to avoid the complexity added by more async tasks.

While we can find some way to give users control of hashing in the current model, I think the overall UX of this pattern is moving away from what we've been working towards with build.jobs.

I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code.

Having said all that, I'm also 👍 on this, but with the caveat that it works only as long as we maintain a neutral stance on this implementation and don't invest more in this async pattern while we are testing.

@ericholscher (Member)

From the call, it sounds like we're 👍 on fixing this up for merge and starting to test the indexing and API internally.

@stsewd marked this pull request as ready for review October 21, 2024 19:48
@stsewd requested a review from a team as a code owner October 21, 2024 19:48
@stsewd requested a review from ericholscher October 21, 2024 19:48
@ericholscher (Member) left a comment

This looks great. I'd love to get it merged today if possible, so we can start testing it after tomorrow's deploy.

@@ -299,6 +299,17 @@ def last_build(self):
    def latest_build(self):
        return self.builds.order_by("-date").first()

    @property
    def latest_successful_build(self):
Member

Should we try to cache this somehow? I know we've been hitting issues with calling stuff like this repeatedly... Maybe not important in this PR, but I'd love to have a general solution for this.
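
For reference, here is a hedged sketch of what a property like this could look like, plus one way to avoid repeated queries. The filter fields (state, success) are borrowed from the test snippet further down in this thread; the use of cached_property and the model/import names are assumptions, not the code in this PR:

```python
from django.db import models
from django.utils.functional import cached_property

from readthedocs.builds.constants import BUILD_STATE_FINISHED  # assumed constant location


class Version(models.Model):  # hypothetical: whichever model owns the `builds` relation
    @cached_property
    def latest_successful_build(self):
        # cached_property memoizes the result per instance, so repeated access
        # within the same request only hits the database once.
        return (
            self.builds
            .filter(state=BUILD_STATE_FINISHED, success=True)
            .order_by("-date")
            .first()
        )
```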

state=BUILD_STATE_FINISHED,
success=True,
)
get_manifest.side_effect = [
Member

Do we want side_effect here instead of a return value? I haven't used side_effect like this before.

@stsewd (Member, Author)

side_effect allows you to return a different value for each call (one per item in the list), while return_value always returns the same value.
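
A quick illustration of the difference, using a throwaway mock (the manifest shapes here are made up, not the real manifest format):

```python
from unittest import mock

# return_value: every call returns the same object.
get_manifest = mock.MagicMock(return_value={"files": {}})
assert get_manifest() == {"files": {}}
assert get_manifest() == {"files": {}}

# side_effect with a list: each call consumes the next item. Handy when the
# code under test fetches one manifest per version being compared.
get_manifest = mock.MagicMock(
    side_effect=[
        {"files": {"index.html": "hash-a"}},  # first call, e.g. the current version
        {"files": {"index.html": "hash-b"}},  # second call, e.g. the base version
    ]
)
assert get_manifest() == {"files": {"index.html": "hash-a"}}
assert get_manifest() == {"files": {"index.html": "hash-b"}}
```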

Comment on lines +523 to +525
"filetreediff": {
"enabled": False,
},
Member

I'm a little confused where we're setting enabled: False in the code. Is it just the default and then we're overriding it if we're enabling it?

@stsewd (Member, Author)

Yeah, we default to false, and then check if the project supports it, or if the version allows it, and if the user has permissions.
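
A rough sketch of that kind of check chain, just to illustrate the flow; the helper names, attributes, and permission string here are hypothetical, not this project's actual API:

```python
def filetreediff_enabled(project, version, user) -> bool:
    """Default to False and only flip to True when every check passes."""
    if not getattr(project, "supports_filetreediff", False):  # hypothetical project-level flag
        return False
    if not getattr(version, "is_external", False):  # hypothetical: e.g. only PR/external versions
        return False
    if not user.has_perm("projects.view_version", version):  # hypothetical permission check
        return False
    return True
```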

@stsewd merged commit 5c8d47e into main Oct 22, 2024
8 checks passed
@stsewd deleted the file-tree-diff branch October 22, 2024 14:54
@stsewd mentioned this pull request Oct 22, 2024

Successfully merging this pull request may close these issues.

File tree diff API