File tree diff #11646

Merged: 22 commits merged into main from file-tree-diff on Oct 22, 2024

Conversation

@stsewd (Member) commented Oct 3, 2024

This is on top of #11643.

Closes #11319

Ref #11507

stsewd added 2 commits October 2, 2024 17:07
Currently, we walk the entire project directory to apply two operations: indexing files in ES, and keeping track of index/404 files. These two operations are independent, but in our code they are somewhat mixed together in order to avoid walking the project directory twice.

I have abstracted the processing of the files with an "Indexer" class, which is responsible for performing an operation on each file and, at the end, collecting the results.
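
To make the idea concrete, here is a rough sketch of what such an interface could look like. Apart from the process() signature and the hash-tracking line, which appear in this PR's diff, the class and function names below are illustrative assumptions, not the actual implementation:

```python
class Indexer:
    """Performs one operation per file; results are collected at the end."""

    def process(self, html_file, sync_id):
        raise NotImplementedError

    def collect_results(self):
        raise NotImplementedError


class FileManifestIndexer(Indexer):
    """Hypothetical indexer that records a content hash per HTML file."""

    def __init__(self):
        self._hashes = {}

    def process(self, html_file, sync_id):
        self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]

    def collect_results(self):
        return self._hashes


def walk_project_files(html_files, indexers, sync_id):
    # Walk the project directory once, letting every indexer process each file.
    for html_file in html_files:
        for indexer in indexers:
            indexer.process(html_file, sync_id)
    return {type(indexer).__name__: indexer.collect_results() for indexer in indexers}
```
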
@stsewd (Member, Author) commented Oct 3, 2024

This is still missing tests and docstrings, but it would be great to have a review to confirm we are good with this direction.

@ericholscher (Member) left a comment

This looks like a really nice, simple implementation with the Indexer refactor. Seems like a great place to start, and a good way to start seeing the edge cases as we hit them :)

        self._hashes = {}

    def process(self, html_file: HTMLFile, sync_id: int):
        self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
@ericholscher (Member) commented Oct 7, 2024

I do wonder if we want to store the hash on the HTMLFile as well. With this architecture, querying the hash of a file is going to be pretty complex and will require pulling from S3. I'm not sure we have an obvious use case for getting the hash of a specific file, so maybe something to note for the future? Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔

Suggested change
self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
# TODO: Possibly store the hash on the html_file object or in cache,
# since this will allow us to query that from the DB directly.

Member

Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔

Good point. Most projects using the PR builder will access the PR at least once immediately after the build finishes. We could consider optimizing this somehow (e.g. saving the whole JSON in the cache for 30 minutes, so we don't have to hit S3), but I'm not too worried about this just yet. Dealing with cache is always hard due to invalidation 😄

It's a good idea to leave a comment here with these ideas, though 👍🏼

@stsewd (Member, Author)

We no longer save each HTML file in the DB, and I'm not sure we want to go back to storing that information there. What I can see us doing is caching the result of the diff operation.

@stsewd linked an issue Oct 9, 2024 that may be closed by this pull request
@humitos (Member) left a comment

I like this initial implementation.

However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this together with search indexing inside the Celery task. I'd like this to live outside of search, generating the manifest.json file inside the build process while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.


@stsewd (Member, Author) commented Oct 10, 2024

However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this together with search indexing inside the Celery task. I'd like this to live outside of search, generating the manifest.json file inside the build process while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.

Doing this in the build process itself has some downsides:

  • Build time may increase for users
  • To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).
  • We already walk the file tree in the Celery task; there is no need for additional code or test setup.

We can still have the "create your own manifest" feature; this implementation doesn't exclude it.

Base automatically changed from refactor-search-index-process to main October 10, 2024 15:08
@humitos (Member) commented Oct 10, 2024

Build time may increase for users

This is fine with me. I think it's good to consider the generation of the manifest as part of the build time; that gives users (and us) a more realistic measure of the build process. Projects with heavy CPU consumption should pay us more for bigger resources, instead of us hiding this behind a Celery task and paying the cost ourselves, without even knowing which customers consume that time in those tasks (we are not tracking this for these Celery tasks).

I think in most cases it won't matter much, but I'm happy to see some performance data here if that helps.

To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).

I'm not seeing an issue here, since generating the manifest for old versions isn't very useful for now. We are only considering pull request and latest versions, which are being built all the time.

Even if we were able to trigger a task to generate the manifest for any version, it would only be possible using our own algorithm, since we don't know where and how the user might be generating this file. So exposing such a feature could be confusing, or could even overwrite a manifest already generated by a custom process.

I'm fine with focusing on new versions for now and keeping the implementation simple.

We already walk the file tree in the Celery task; there is no need for additional code or test setup.

Async things are always more complex to handle, maintain, debug, etc. Besides, it could even end up being a worse UX, since the build could already be finished while the manifest generation isn't, making the PR preview not show the file tree diff (because it isn't ready yet). We would need to implement polling in the client and add more UI/UX around it to handle these cases, complicating things on both sides: backend and frontend.

On the other hand, walking a tree of files on disk is a pretty fast operation. I don't see a lot of benefit in doing it only once, compared with the complexity added by living inside a Celery task and being entangled with the search code.


I'd start as simple as possible, using a sync approach that generates the build manifest inside the build process, as we all previously discussed.

@ericholscher (Member) commented Oct 12, 2024

It sounds like maybe we need to discuss when this code runs a bit more in a call this week? I don't think I was thinking too deeply about it in the design process, tbh. In general this approach mostly makes sense to me, but the UX around the delay in the indexing is a reasonable concern.

I do think slower builds are a bad UX, and we can count this time as part of the build process in some other way, but making builds slower isn't a good thing to design in if we can avoid it.

Overall this architecture makes a ton of sense to me, since we're already doing all the parsing of the primary content in the search code. Perhaps a better question is: why don't we also migrate the search indexing to run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.

I assume we need tasks to run this logic in an on-demand fashion (e.g. search reindexing, or build manifest creation outside of a build), but it seems like it wouldn't be hard to support both: by default running while the files are on the build server, but also being able to run against S3 if needed?

@humitos (Member) commented Oct 14, 2024

Perhaps a better question is: why don't we also migrate the search indexing to run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.

I don't know what the original reasons were for separating this process from the build itself. All my opinions about hash generation apply here as well. However, I think search indexing makes use of the DB (e.g. ImportedFile), and that was probably the reason to run it in a Celery task on the web queue.

Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.

I assume we need tasks to run this logic in an on-demand fashion (e.g. search reindexing, or build manifest creation outside of a build),

We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules, as we discussed. This is because the commands will be executed inside the build process itself.

@stsewd (Member, Author) commented Oct 14, 2024

We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules

We can still have that feature; the user-generated manifest will take precedence over our process.

Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.

We would just be moving that complexity into the build itself; we can't escape it. Executing things outside the build also keeps us moving in the direction of isolating builds from our application.

@ericholscher (Member)

I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code. I think the ideas around the UX and manifest creation are gonna take some time to get right, so I'm fine doing the indexing after the build while we work on it behind a feature flag.

There's definitely a larger philosophical discussion here, and I don't want to block our initial testing on that discussion if we can avoid it, and changing where this code runs is easy to do after we have an initial version out and tested.

@agjohnson (Contributor) commented Oct 14, 2024

I agree with the points @humitos shared above. I think we should aim to surface as much of our build process to users as possible, especially any part where we are giving or plan to give control to the user.

It does feel like a bit of a miss to have the structure of build.jobs and not utilize it. Using build.jobs as the place to run and override both search indexing and hash manipulation would fit very well with the design of the rest of the build process. It also feels like a stronger delineation between the user build and our application, as it is explicit rather than hidden magic.

I think concerns over build time are premature, though certainly not invalid. We can guess that large projects might struggle with this, but we have no timing figures yet, and part of our plan should be to gather them. Are we talking 1-10s for most projects? That seems perfectly fine, and I'm happy to avoid the complexity added by more async tasks.

While we can find some way to give users control of hashing in the current model, I think the overall UX of this pattern is moving away from what we've been working towards with build.jobs.

I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code.

Having said all that, I'm also 👍 on this, but with the caveat that it works only as long as we maintain a neutral stance on this implementation and don't invest more in this async pattern while we are testing.

@ericholscher (Member)

From the call, it sounds like we're 👍 on fixing this up for merge and starting to test the indexing and API internally.

@stsewd marked this pull request as ready for review October 21, 2024 19:48
@stsewd requested a review from a team as a code owner October 21, 2024 19:48
@stsewd requested a review from ericholscher October 21, 2024 19:48
@ericholscher (Member) left a comment

This looks great. I'd love to get it merged today if possible, so we can start testing it after tomorrow's deploy.

@@ -299,6 +299,17 @@ def last_build(self):
    def latest_build(self):
        return self.builds.order_by("-date").first()

    @property
    def latest_successful_build(self):
Member

Should we try to cache this somehow? I know we've been hitting issues with calling stuff like this repeatedly... Maybe not important in this PR, but I'd love to have a general solution for this.
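
For reference, here is a hedged sketch of what a property like this could look like, plus one way to avoid repeated queries. The filter fields (state, success) are borrowed from the test snippet further down in this thread; the use of cached_property and the model/import names are assumptions, not the code in this PR:

```python
from django.db import models
from django.utils.functional import cached_property

from readthedocs.builds.constants import BUILD_STATE_FINISHED  # assumed constant location


class Version(models.Model):  # hypothetical: whichever model owns the `builds` relation
    @cached_property
    def latest_successful_build(self):
        # cached_property memoizes the result per instance, so repeated access
        # within the same request only hits the database once.
        return (
            self.builds
            .filter(state=BUILD_STATE_FINISHED, success=True)
            .order_by("-date")
            .first()
        )
```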

state=BUILD_STATE_FINISHED,
success=True,
)
get_manifest.side_effect = [
Member

Do we want side_effect here instead of a return value? I haven't used side_effect like this before.

@stsewd (Member, Author)

side_effect allows you to return a different value for each call (one per item in the list), while return_value always returns the same value.
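
A quick illustration of the difference, using a throwaway mock (the manifest shapes here are made up, not the real manifest format):

```python
from unittest import mock

# return_value: every call returns the same object.
get_manifest = mock.MagicMock(return_value={"files": {}})
assert get_manifest() == {"files": {}}
assert get_manifest() == {"files": {}}

# side_effect with a list: each call consumes the next item. Handy when the
# code under test fetches one manifest per version being compared.
get_manifest = mock.MagicMock(
    side_effect=[
        {"files": {"index.html": "hash-a"}},  # first call, e.g. the current version
        {"files": {"index.html": "hash-b"}},  # second call, e.g. the base version
    ]
)
assert get_manifest() == {"files": {"index.html": "hash-a"}}
assert get_manifest() == {"files": {"index.html": "hash-b"}}
```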

Comment on lines +523 to +525
"filetreediff": {
"enabled": False,
},
Member

I'm a little confused where we're setting enabled: False in the code. Is it just the default and then we're overriding it if we're enabling it?

@stsewd (Member, Author)

Yeah, we default to false, and then check if the project supports it, or if the version allows it, and if the user has permissions.
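
A rough sketch of that kind of check chain, just to illustrate the flow; the helper names, attributes, and permission string here are hypothetical, not this project's actual API:

```python
def filetreediff_enabled(project, version, user) -> bool:
    """Default to False and only flip to True when every check passes."""
    if not getattr(project, "supports_filetreediff", False):  # hypothetical project-level flag
        return False
    if not getattr(version, "is_external", False):  # hypothetical: e.g. only PR/external versions
        return False
    if not user.has_perm("projects.view_version", version):  # hypothetical permission check
        return False
    return True
```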

@stsewd merged commit 5c8d47e into main Oct 22, 2024
8 checks passed
@stsewd deleted the file-tree-diff branch October 22, 2024 14:54
@stsewd mentioned this pull request Oct 22, 2024

Successfully merging this pull request may close these issues.

File tree diff API