
Async github/bitbucket repository syncing. #1417


Merged
merged 27 commits into master from async-github-repo-syncing on Aug 6, 2015

Conversation

gregmuellegger
Contributor

Will eventually fix #1370.

I have two strategies in mind for doing this and would like to discuss them here.

Synchronous call via AJAX to /import/github/sync/

Probably the simplest solution to avoid users seeing a timeout is to call the /import/github/sync/ page with an AJAX request from /import/github/. The syncing then still takes place in one long synchronous request, but since that request runs in an AJAX call, it is hidden from the user. After the AJAX call finishes, we update the page with the new content.

To make this work we would only need to increase the max timeout for the sync page to a very high number (~10 minutes?); otherwise no big code changes are required in the backend. On the frontend we add the necessary JavaScript and remove the direct link to the sync page so that users don't hit it manually.

Async repository syncing using Celery + monitoring

The more advanced solution is to have a Celery task that syncs the repositories. When the user hits /import/github/, we fire the task to update the repos. The user still sees the page immediately, since the syncing takes place asynchronously.

We then need a way to monitor the progress of the Celery task in the frontend. For that we need some kind of API that we can poll via AJAX for state changes in the Celery task. Once the task is completed, we update the page with the new content.
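For concreteness, here is a minimal sketch of what option 2 could look like on the backend. The names (sync_remote_repositories, GithubImportView) are placeholders for illustration, not a final implementation:

```python
# Sketch only: fire the sync as a Celery task and hand the task id back to
# the page so the frontend can poll for completion.
from celery import shared_task
from django.http import JsonResponse
from django.views.generic import View


@shared_task
def sync_remote_repositories(user_id):
    # The long-running GitHub/Bitbucket sync happens here, off the web
    # worker process.
    pass


class GithubImportView(View):
    def post(self, request):
        result = sync_remote_repositories.delay(request.user.id)
        # The import page responds immediately; the task id is what the
        # frontend polls against.
        return JsonResponse({'task_id': result.id})
```

Returning the task id right away is what keeps the web worker free; everything slow runs in the Celery worker.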


So I would like to discuss the two approaches and decide which one we should go with here.

We should also look at these tickets while developing it:

@gregmuellegger gregmuellegger added the PR: work in progress Pull request is not ready for full review label Jul 9, 2015
@ericholscher
Member

I think we need to go with option 2. One of the main issues with today's site slowness was running out of web worker processes. Doing things synchronously in the web process is definitely not a scalable way of handling this. I think it needs to feed into a Celery task, with monitoring, as you said.

@d0ugal
Contributor

d0ugal commented Jul 9, 2015

+1 to option 2. I'd say that is the correct approach.

@ericholscher
Member

It should be easy enough to get the Celery job ID in the response when we trigger the job, and then do AJAX polling (or websockets, or other fanciness) to check the status at /api/v2/github/import/?job={id} or something.
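Something along these lines would be enough for the client to poll. This is a sketch based on the suggestion above; the URL, view name, and response shape are assumptions, not the final API:

```python
# Sketch of a status endpoint for /api/v2/github/import/?job={id}.
from celery.result import AsyncResult
from django.http import JsonResponse


def import_job_status(request):
    job_id = request.GET.get('job')
    result = AsyncResult(job_id)
    data = {'state': result.state}
    if result.ready():
        # Only expose the payload for successful tasks; a failed task's
        # result is an exception object and not JSON-serializable as-is.
        data['result'] = result.result if result.successful() else None
    return JsonResponse(data)
```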

@gregmuellegger gregmuellegger force-pushed the async-github-repo-syncing branch from ea0d2d5 to 1ccab53 on July 17, 2015 12:35
@gregmuellegger
Contributor Author

The PR now implements a framework for monitoring Celery tasks via the API. It integrates permission checks so that only authorized people can view the progress.

We now use that framework for the GitHub/Bitbucket async repo syncing. I had to add some messages for progress and error states in the UI. Maybe that needs some love, but I'll let the reviewer decide :)

In the long run we could also use the newly created monitorable PublicTask to make the doc build status available in the API.

The PR is feature-complete in my view and is therefore ready for an architectural review. However, it does not include any tests so far; I'll add those next week.
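To illustrate how a monitorable task can expose the progress and error states mentioned above, here is a rough sketch using Celery's update_state(); it is not the actual PublicTask implementation from this PR:

```python
from celery import shared_task


@shared_task(bind=True)
def sync_remote_repositories(self, user_id):
    # Intermediate states written to the result backend are what the status
    # API reads; the frontend turns them into progress or error messages.
    self.update_state(state='PROGRESS',
                      meta={'message': 'Syncing GitHub repositories...'})
    # ... call the GitHub API and store the repositories here ...
    self.update_state(state='PROGRESS',
                      meta={'message': 'Syncing Bitbucket repositories...'})
    # ... call the Bitbucket API here ...
    return {'message': 'Repositories synced.'}
```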

@agjohnson agjohnson self-assigned this Jul 17, 2015
@agjohnson
Contributor

Awesome, I'll take a quick first pass at this today. I'll have some feedback on the JavaScript, but I should get to writing documentation on our front-end toolchain first.

}
});
}, 2000);
}
Contributor


This should include a timeout to display a "can't sync" error. My install executes the task fine, but it doesn't seem to update the client correctly, which keeps me waiting forever for a sync to finish.

@agjohnson
Contributor

Some high-level feedback:

  • Is there any reason readthedocs/rtd/* shouldn't be in readthedocs/core? Historically, this was supposed to be the application path for shared-only resources, though it has also accumulated some misplaced resources.
  • I'd like to be moving new javascript off of the media/ path, and into per-application static files. I've outlined some of the process behind bundling up files in static-src, though until we settle on that, it would be fine to just move your libraries to static paths.
  • I like the public/private task api -- this should translate well to handling build status updates as well.
  • Does this need a degradation if I am running a development server with celery eager task resolution? The task doesn't seem to resolve in this case.

I'll have some time to delve into the lower-level details tomorrow afternoon, including some UX bits. This is looking good so far, though.

@gregmuellegger
Contributor Author

Is there any reason readthedocs/rtd/* shouldn't be in readthedocs/core? Historically, this was supposed to be the application path for shared-only resources, though it has also accumulated some misplaced resources.

No, I wasn't sure where to place it, and without thinking much about it I created the rtd
directory. I have now moved it into core.utils.tasks.

I'd like to be moving new javascript off of the media/ path, and into per-application static files. I've outlined some of the process behind bundling up files in static-src, though until we settle on that, it would be fine to just move your libraries to static paths.

Awesome, I looked for existing JS on the import page and media/ was where I
found it. I moved it into core/static-src/core/js/ where it makes more sense.
Thanks for clarifying.

I like the public/private task api -- this should translate well to handling build status updates as well.
Does this need a degradation if I am running a development server with celery eager task resolution? The task doesn't seem to resolve in this case.

Previously, running the task failed because a connection to the Celery task backend
(Redis) was made before actually starting the task. I changed that, so no Redis
call is made when triggering the task while CELERY_ALWAYS_EAGER=True. However, the
task monitoring will still fail, as Redis is not reachable. I don't intend to fix
that, since mocking this for local development is not a good idea IMO. If we really
want to support local development of these features without having Celery running
properly with Redis, we could use the DB broker for Celery, which could be
preconfigured in the SQLite settings.
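As a rough illustration of the guard described here (a sketch under my own assumptions, not the code from the PR; store_result is Celery's result-backend API):

```python
from celery import current_app
from django.conf import settings


def trigger_task(task, *args, **kwargs):
    """Fire ``task`` without touching Redis when tasks resolve eagerly."""
    result = task.delay(*args, **kwargs)
    if not getattr(settings, 'CELERY_ALWAYS_EAGER', False):
        # Recording an initial state requires the result backend (Redis),
        # so skip it in eager mode, where the task has already run inline.
        current_app.backend.store_result(result.id, None, 'STARTED')
    return result
```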

I also changed the behaviour so that an error from the API won't keep the client
retrying the task status request; it will show the error message instead. I thought
it might be good practice to keep retrying in case the client has connectivity
problems, but as you said, the user would then never see any error message. That's
probably worse than an error message that appears too early.

__all__ = ('user_id_matches',)


def user_id_matches(request, state, context):
    # Body filled in for illustration: grant access only when the task
    # context was created for the requesting user.
    user_id = context.get('user_id', None)
    return user_id is not None and user_id == request.user.id
Member


We have some logic around this kind of stuff in the privacy app as well -- so it might make sense to put it there.

@agjohnson
Contributor

Testing this out in my environment, it does sync, but after the task completes and I'm notified of the completion, the list of projects isn't automatically updated. On reload, the list is there, of course. A good first pass at this would be to blow away the current list when you hit 'sync' -- or blow it away when we have results -- and repopulate the list in its entirety.

@gregmuellegger
Contributor Author

Indeed, the updates were not working. Sorry about that, it was a small typo. Fixed now; it should work as expected. If we want to indicate syncing in the actual repository list, I would overlay the repo list with a spinner. Shall I work on this?

Also, I had problems integrating the current master, with its changes to the gulp build system, into this branch. I documented them in #1490.

@gregmuellegger gregmuellegger force-pushed the async-github-repo-syncing branch from 6c88a96 to 5790f5e on August 3, 2015 10:48
@ericholscher
Member

What's the status here? Would love to see this get ready for merge.

@gregmuellegger gregmuellegger force-pushed the async-github-repo-syncing branch from 5790f5e to 0b9b520 on August 6, 2015 11:44
@gregmuellegger
Contributor Author

I rebased the work onto master so that it applies cleanly. I consider the PR feature-complete. @agjohnson proposed adding a loading animation to the repository list, but I wasn't sure whether I should add a spinner there.

Clarifying that is the only blocking issue.

gregmuellegger and others added 5 commits August 6, 2015 15:01
We therefore document in the docstring that the method must be implemented by
subclasses. We do that because pylint complains that subclasses have a differing
argument list.
@ericholscher ericholscher merged commit 417b1d5 into master Aug 6, 2015
@gregmuellegger gregmuellegger deleted the async-github-repo-syncing branch September 17, 2015 14:13
Labels
PR: work in progress Pull request is not ready for full review
Development

Successfully merging this pull request may close these issues.

Do GitHub Syncing in an Async way
4 participants