Decouple stats polling from results fetching #63

Open
matthewwardrop opened this issue Jul 12, 2018 · 4 comments

@matthewwardrop

Greetings all!

I'm looking to transition a project I curate (omniduct, a library to simplify data acquisition, especially for data scientists) from pyhive to prestodb, but currently I would lose the ability to poll for query progress before actually attempting to retrieve results.

That is, cursor.fetchone() is used both to collect results and to update stats, which means that I cannot show the progress of the query's actual execution, only progress through the collection of the results.

Would you welcome a patch to add support for this polling? Or are you planning to add it yourselves? Or are you opposed to adding this feature?

@matthewwardrop matthewwardrop changed the title from "Make stats polling asynchronous" to "Decouple stats polling from results fetching" on Jul 12, 2018
@ggreg
Contributor

ggreg commented Feb 15, 2019

@matthewwardrop yes, we would welcome a patch to add support for polling stats independently of fetching results. We're not currently working on it, so your contribution would be greatly appreciated :).

  1. What stats are you the most interested in?
  2. What options are you considering to gather stats?

Regarding (2.), the client could send an HTTP GET request to a /v1/query/{query_id} endpoint.
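
A minimal sketch of what such a request might look like, using only the standard library. The path is assumed here to be `/v1/query/{query_id}`; that path, the `X-Presto-User` header, and the `state` / `queryStats` keys in the response are assumptions about the coordinator's REST API, not something confirmed in this thread. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def query_info_url(host, port, query_id, scheme="http"):
    """Build the (assumed) coordinator URL for a single query's info."""
    return f"{scheme}://{host}:{port}/v1/query/{quote(query_id, safe='')}"


def extract_stats(payload):
    """Pick the state and stats out of a decoded query-info document.

    The key names are assumptions about the response shape.
    """
    return {
        "state": payload.get("state"),
        "stats": payload.get("queryStats", {}),
    }


def poll_query_info(host, port, query_id, user, timeout=10.0):
    """Issue one GET request and return the extracted stats."""
    request = urllib.request.Request(
        query_info_url(host, port, query_id),
        headers={"X-Presto-User": user},  # assumed header name
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_stats(json.load(response))
```

Because this goes to a separate endpoint, it would not consume any result pages, though it does cost an extra round trip per poll.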

At some point, we'll need to consider using asyncio (or a concurrent.futures executor in Python 2.7) to perform some HTTP requests asynchronously, as interleaving the process of getting results and stats could lead to unexpected behaviors, such as queries failing with an ABANDONED error if the client takes too long to poll the status of a query.
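
As a rough illustration of the executor idea, the sketch below runs a stats poller on a worker thread while the calling thread fetches results. `fetch_all`, `poll_once`, and `on_stats` are hypothetical callables supplied by the caller; nothing here is part of the library's API, and whether the underlying HTTP session can safely be shared between threads is not addressed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def watch_stats(poll_once, on_stats, stop, interval=1.0):
    """Poll until `stop` is set; return the number of polls made."""
    polls = 0
    while not stop.is_set():
        on_stats(poll_once())
        polls += 1
        stop.wait(interval)  # sleeps, but wakes early once stop is set
    return polls


def fetch_with_progress(fetch_all, poll_once, on_stats, interval=1.0):
    """Run fetch_all() in the calling thread while a worker polls stats."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as executor:
        watcher = executor.submit(watch_stats, poll_once, on_stats, stop, interval)
        try:
            rows = fetch_all()
        finally:
            stop.set()  # always stop the poller, even if fetching failed
        watcher.result()  # re-raise any error raised while polling
    return rows
```

Keeping the poller on its own thread means a slow consumer of results does not delay status requests, which is the failure mode described above.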

@matthewwardrop
Author

matthewwardrop commented Feb 15, 2019

Hi @ggreg,

Thanks for responding to this.

I'm interested in all of the stats that are returned by the standard endpoints, but most especially the 'progress' field. In terms of methodology, I am imagining polling the same endpoints currently used by PrestoQuery.fetch and returning stats. At some point, this call will return data and/or the query will enter a finished state; any returned data will be cached on some internal instance attribute, and further status polling will simply return the state as of that time. The user can then use the fetch methods as before, which will collect the data set aside in the local cache and then append to it any data returned by subsequent endpoint calls until the data is fully collected locally, as is the current behaviour.
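
The approach described above could be sketched roughly as follows. `PollableQuery` is a hypothetical class, not the library's `PrestoQuery`; `fetch_page` stands in for whatever performs one endpoint call, and the `stats`, `data`, and `next_uri` keys are assumed names for the pieces of each response.

```python
from collections import deque


class PollableQuery:
    """Hypothetical wrapper that separates stats polling from row fetching.

    `fetch_page` is any callable returning a dict with optional keys
    "stats", "data", and "next_uri" (assumed names), one page per call.
    """

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page
        self._buffer = deque()  # rows received but not yet handed to the user
        self._stats = {}
        self._finished = False
        self._frozen = False    # True once polling has seen data or completion

    def _advance(self):
        """Request one page, recording its stats and buffering its rows."""
        page = self._fetch_page()
        self._stats = page.get("stats", self._stats)
        self._buffer.extend(page.get("data") or [])
        self._finished = page.get("next_uri") is None

    def poll(self):
        """Return the latest stats; stop requesting once data has arrived."""
        if not self._frozen:
            self._advance()
            self._frozen = bool(self._buffer) or self._finished
        return self._stats

    def fetchall(self):
        """Return buffered rows plus all rows from the remaining pages."""
        while not self._finished:
            self._advance()
        rows = list(self._buffer)
        self._buffer.clear()
        return rows
```

Once `poll()` has seen data it stops issuing requests, so the rows already received stay in the buffer until `fetchall()` collects them and continues from the next page.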

This should not suffer any abandonment issues unless the user does not move on to using the fetch method within some sensible window of time after the polling indicates that the query has successfully run its course.

Perhaps the asyncio/futures approach might belong instead in a wrapping library, such as omniduct, unless you are planning to support multiplexing of queries within this library itself.

I'll put out a PR soon.

@akhandev

Hi @matthewwardrop,
Did you get a chance to add this functionality?

@matthewwardrop
Author

Not yet, @akhandev. I'll try to put out a PR this week. :)
