File tree diff #11646
Conversation
Currently, we walk the entire project directory to apply two operations: indexing files in ES, and keeping track of index/404 files. These two operations are independent, but in our code they are mixed together in order to avoid walking the project directory twice. I have abstracted the processing of the files with an "Indexer" class, which is responsible for performing an operation on a file and, at the end, collecting the results.
This is still missing tests and docstrings, but it would be great to have a review to confirm we are good with this direction.
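For illustration, a minimal sketch of the direction described above. The names here are hypothetical, not the PR's actual code; only the process(html_file, sync_id) signature matches the diff further down:

# Hypothetical sketch of the "Indexer" abstraction described above.
class Indexer:
    def process(self, html_file, sync_id: int):
        """Apply this indexer's operation to a single file."""
        raise NotImplementedError

    def collect(self, sync_id: int):
        """Collect this indexer's results once all files are processed."""
        raise NotImplementedError


def walk_project(html_files, indexers, sync_id: int):
    # Walk the project directory once, feeding each file to every indexer,
    # so independent operations don't each re-walk the tree.
    for html_file in html_files:
        for indexer in indexers:
            indexer.process(html_file, sync_id)
    return [indexer.collect(sync_id) for indexer in indexers]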
This looks like a really nice, simple implementation with the Indexer refactor. Seems like a great place to start, and we can deal with the edge cases as we hit them :)
self._hashes = {}

def process(self, html_file: HTMLFile, sync_id: int):
    self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
I do wonder if we want to be storing the hash on the HTMLFile as well. With this architecture, querying the hash of a file is going to be pretty complex and require pulling from S3. I'm not sure we have an obvious use case for getting the hash of a specific file, so maybe something to note for the future? Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔
Suggested change:

self._hashes[html_file.path] = html_file.processed_json["main_content_hash"]
# TODO: Possibly store the hash on the html_file object or in cache,
# since this will allow us to query that from the DB directly.
Alternatively, we could store it in redis/cache, since the query is likely to happen pretty soon after creation 🤔
Good point. Most projects using the PR builder will access the PR at least once immediately after the build finishes. We could consider optimizing this somehow (e.g. saving the whole JSON in the cache for 30m, so we don't have to hit S3), but I'm not too worried about this just yet. Dealing with cache is always hard due to invalidation 😄
It's a good idea to leave a comment here with these ideas, tho 👍🏼
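A minimal sketch of that caching idea, assuming Django's cache framework; the key format and helper name are hypothetical, not the project's actual code:

from django.core.cache import cache

CACHE_TIMEOUT = 30 * 60  # 30 minutes, as suggested above

def get_processed_json(html_file):
    # Hypothetical helper: serve the processed JSON from cache when we can,
    # falling back to S3 (html_file.processed_json) on a miss.
    key = f"processed_json:{html_file.project.slug}:{html_file.path}"
    data = cache.get(key)
    if data is None:
        data = html_file.processed_json
        cache.set(key, data, timeout=CACHE_TIMEOUT)
    return data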
We no longer save each HTML file in the DB, and I'm not sure we want to go back to storing that information there. What I can see us doing is caching the result of the diff operation.
I like this initial implementation.
However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this inside the Celery task for search indexing. I'd like this to live outside the search code and generate the manifest.json file inside the build process, while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.
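A minimal sketch of that build-time direction, under stated assumptions: the manifest format is illustrative, and it hashes raw file bytes for simplicity, whereas the PR hashes only the parsed main content:

import hashlib
import json
from pathlib import Path

def generate_manifest(output_dir: str) -> None:
    # Hypothetical build-time step: walk the built HTML while it is still
    # on disk and record a content hash per file.
    root = Path(output_dir)
    manifest = {"files": {}}
    for html_file in sorted(root.rglob("*.html")):
        relative_path = str(html_file.relative_to(root))
        digest = hashlib.sha256(html_file.read_bytes()).hexdigest()
        manifest["files"][relative_path] = {"main_content_hash": digest}
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))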
Doing this in the build process itself has some downsides:
We can still have the "create your own manifest" feature; this implementation doesn't exclude it.
This is fine by me. I think it's good to consider the generation of the manifest as part of the build time; that gives users (and us) a more realistic measure of the build process. Projects with heavy CPU consumption should pay us more for bigger resources, instead of us hiding this behind a Celery task and paying the cost ourselves, without even knowing which customers consume this time on those tasks (we are not tracking this for these Celery tasks). I think in most cases it won't matter much, but I'm happy to see some performance test data here if that helps.
I'm not seeing an issue here, since generating manifests for old versions isn't super useful for now. We are only considering pull request and latest versions, which are being built all the time. Even if we were able to trigger a task to generate the manifest for any version, it would only be possible using our own algorithm, since we don't know where and how the user could be generating this file. So, exposing such a feature could be confusing, or even destructive of a manifest already generated by a custom process. I'm fine with focusing on new versions for now and keeping the implementation simple.
Async things are always more complex to handle, maintain, debug, etc. Besides, it could even end up being a worse UX for users, since the build could have finished while the generation of the manifest hasn't, making the PR preview not show the file tree diff (because it's not ready yet). We would need to implement polling in the client and add more UI/UX around it to handle these cases, complicating things on both sides: backend and frontend. On the other hand, walking a tree of files on disk is a pretty fast operation. I'm not seeing a lot of benefit in doing it only once compared with the complexity added by being inside a Celery task and entangled with the search code. I'd start as simple as possible, using a sync approach that generates the build manifest inside the build process, as we all previously discussed.
It sounds like maybe we need to discuss when this code runs a bit more in a call this week? I don't think I was thinking too deeply about it in the design process, tbh. In general this approach mostly makes sense to me, but the UX around the delay in the indexing is a reasonable concern. I do think slower builds are a bad UX too, and while we can count this time against the build process in some other way, making builds slower to upload isn't something we should design for if we can avoid it.

Overall this architecture makes a ton of sense to me, since we're already doing all the parsing of the primary content in the search code. Perhaps a better question is: why don't we also migrate the search indexing to run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together. I assume we need tasks to run this logic in an on-demand fashion (e.g. search reindexing, or build manifest creation outside of a build), but it seems like it wouldn't be hard to support both: by default running while the files are on the build server, but also possible to run against S3 if needed?
I don't know the original reasons why this process was separated from the build itself. All my opinions about the hash generation apply here as well. However, I think the search indexing makes use of the DB (e.g. ...). Removing async and Celery queue complexity is a win to me. If that's not possible, at least I would try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.
We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules, as we discussed, because the commands will be executed inside the build process itself.
We can still have that feature; the user-generated manifest will take precedence over our process.
We would be moving that complexity into the build itself; we can't escape it. Executing things outside the build also keeps us moving in the direction of isolating builds from our application.
I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code. I think the ideas around the UX and manifest creation are gonna take some time to get right, so I'm fine doing the indexing after the build while we work on that behind a feature flag. There's definitely a larger philosophical discussion here, and I don't want to block our initial testing on it if we can avoid it; changing where this code runs is easy to do after we have an initial version out and tested.
I agree with the points @humitos shared above. I think we should aim to surface as much of our build process to users as possible, especially any part where we are giving, or plan to give, control to the user. It does feel like a little bit of a miss to have the structure of ...

I think concerns over build time are premature, though certainly not invalid. We can guess that large projects might struggle with this, but we have no timing figures yet; part of our plan should be to gather them. Are we talking 1-10s for most projects? That seems perfectly fine, and I'm happy to avoid the complexity added by more async tasks.

While we can find some way to give users control of hashing in this current model, I think the overall UX of this pattern is moving away from what we've been trying to move towards with ...
Having said all that, I'm also 👍 on this, but with the caveat that this works as long as we maintain a neutral stance on this implementation and aren't investing more in this async pattern as we are testing.
From the call, sounds like we're 👍 on fixing this up for merge and starting to test the indexing and API internally.
This looks great. I'd love to get it merged up today if possible so we can start testing it after deploy tomorrow.
@@ -299,6 +299,17 @@ def last_build(self):
    def latest_build(self):
        return self.builds.order_by("-date").first()

    @property
    def latest_successful_build(self):
Should we try to cache this somehow? I know we've been hitting issues with calling stuff like this repeatedly... Maybe not important in this PR, but I'd love to have a general solution for this.
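One possible general approach, sketched with functools.cached_property so repeated accesses on the same instance don't re-query the DB. The exact filter is an assumption inferred from the test snippet below; a cross-request cache would need explicit invalidation when builds finish:

from functools import cached_property

@cached_property
def latest_successful_build(self):
    # Memoized per model instance for the lifetime of the object;
    # replaces @property, so repeated calls reuse the first query result.
    return (
        self.builds
        .filter(state=BUILD_STATE_FINISHED, success=True)
        .order_by("-date")
        .first()
    )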
state=BUILD_STATE_FINISHED,
success=True,
)
get_manifest.side_effect = [
Do we want side_effect here instead of a return value? I haven't used side_effect like this before.
side_effect allows you to return multiple values (one for each call), while return_value always returns the same value.
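A quick illustration of the difference with unittest.mock:

from unittest import mock

m = mock.Mock(return_value="same")
assert m() == "same"
assert m() == "same"  # return_value: identical result on every call

m = mock.Mock(side_effect=["first", "second"])
assert m() == "first"
assert m() == "second"  # side_effect: one value per call, in order
# A third call would raise StopIteration once the iterable is exhausted.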
"filetreediff": { | ||
"enabled": False, | ||
}, |
I'm a little confused about where we're setting enabled: False in the code. Is it just the default, and then we're overriding it if we're enabling it?
Yeah, we default to false, and then check whether the project supports it, whether the version allows it, and whether the user has permissions.
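A hypothetical sketch of that resolution order; the attribute names and checks are illustrative stand-ins, not the actual code:

def filetreediff_enabled(project, version, user) -> bool:
    # Default to False; enable only when every check passes.
    if not getattr(project, "filetreediff_supported", False):
        return False
    if not getattr(version, "is_external", False):
        return False  # assumption: e.g. only pull request versions allow it
    return user.has_perm("projects.view_project", project)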
This is on top of #11643.
Closes #11319
Ref #11507