
Add ability to set the allowLargeResults option in BigQuery #10474 #11209


Closed
wants to merge 1 commit from the bq-allow-large-results branch

Conversation

@parthea (Contributor) commented Sep 30, 2015

  • Modify read_gbq() to allow users to redirect the query results to a destination table via the destination_table parameter
  • Modify read_gbq() to allow users to set the 'allowLargeResults' option in the BigQuery job configuration via the allow_large_results parameter (see the usage sketch below)

cc @aaront
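
For reference, a minimal sketch of how the two new parameters might be used. The query, project id, and table names below are hypothetical; only the destination_table and allow_large_results parameters come from this PR:

```python
# Usage sketch for the parameters proposed in this PR. The query, project id,
# and table names are hypothetical placeholders.
from pandas.io import gbq

df = gbq.read_gbq(
    "SELECT name, COUNT(*) AS n FROM [my_dataset.events] GROUP BY name",
    project_id="my-project",
    destination_table="my_dataset.event_counts",  # redirect results to this table
    allow_large_results=True,  # sets allowLargeResults in the BigQuery job config
)
```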

@parthea force-pushed the bq-allow-large-results branch from ac3cd4a to 19f910f on October 1, 2015 13:51
@parthea changed the title from "Add ability to set the 'allowLargeResults' option in Google BigQu…" to "Add ability to set the allowLargeResults option in BigQuery #10474" on Oct 1, 2015
@parthea force-pushed the bq-allow-large-results branch from 19f910f to ca84279 on October 2, 2015 10:21
@jreback (Contributor) commented Oct 2, 2015

this would be confusing to a user, as read_gbq should return a frame. Does it here?

@parthea (Contributor, Author) commented Oct 2, 2015

Yes, read_gbq() still returns a DataFrame with the query results regardless of whether these additional parameters are set. I've just tried the following scenarios:

  • In the first test, I set destination_table and confirmed that a destination table was created and a DataFrame was returned.
  • In the second test, I set destination_table and allow_large_results and confirmed that a destination table was created and a DataFrame was returned.

I will add unit tests now for the above-mentioned scenarios (I missed them the first time around).

All tests pass locally. Could this make it into the 0.17.0 release? I think it is a very useful feature.

```
tony@tonypc:~/pandas-parthea/pandas/io/tests$ nosetests test_gbq.py -v
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_bad_project_id (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_bad_table_name (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order_plus_index (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_download_dataset_larger_than_200k_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_index_column (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_malformed_query (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_redirect_query_results_to_destination_table_dataset_does_not_exist (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_redirect_query_results_to_destination_table_default (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_redirect_query_results_to_destination_table_if_table_exists_append (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_redirect_query_results_to_destination_table_if_table_exists_fail (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_redirect_query_results_to_destination_table_if_table_exists_replace (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_arbitrary_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_empty_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_false_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_timestamp_unix_epoch (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_true_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_unicode_string_conversion_and_normalization (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_zero_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_read_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_booleans_as_python_booleans (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_floats_as_python_floats (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_integers_as_python_floats (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_strings_as_python_strings (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_timestamps_as_numpy_datetime (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_that_parse_data_works_properly (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_to_gbq_should_fail_if_invalid_table_name_passed (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_to_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_create_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_create_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_exists (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_generate_schema (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_google_upload_errors_should_raise_exception (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table_zero_results (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_table_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_append (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_fail (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_replace (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
pandas.io.tests.test_gbq.test_requirements ... ok
pandas.io.tests.test_gbq.test_generate_bq_schema_deprecated ... ok

----------------------------------------------------------------------
Ran 59 tests in 379.762s

OK
```

@jreback (Contributor) commented Oct 2, 2015

this is bloating the API

if you are returning a frame, then simply use to_gbq and push it back up
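
A sketch of the round trip being suggested here: pull the frame down with read_gbq, then push it back up with to_gbq. The query and table names are hypothetical:

```python
# Round trip via the existing API: download the query results, then
# upload the frame to a BigQuery table. Names are hypothetical placeholders.
from pandas.io import gbq

df = gbq.read_gbq("SELECT * FROM [my_dataset.events]", project_id="my-project")
gbq.to_gbq(df, "my_dataset.events_copy", project_id="my-project", if_exists="fail")
```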

@parthea (Contributor, Author) commented Oct 2, 2015

I agree that it doesn't make sense to return the data in a DataFrame when a destination table is specified (since you could use to_gbq to push it back up). My preference would be to return an empty DataFrame when a destination table is specified, in order to avoid an unnecessary download and re-upload of data when users want to create smaller datasets from larger ones. The ability to run a query and send the results directly to a table (in an efficient manner) could be useful.

Regarding the allow_large_results parameter: per https://cloud.google.com/bigquery/quota-policy#queries, query results larger than 128 MB compressed require the 'allowLargeResults' option to be set in the job configuration, and one of the requirements for allowing large results is that you must specify a destination table (see the job configuration sketch below).
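
For reference, a hedged sketch of the query job configuration this corresponds to; the field names follow the BigQuery v2 REST API, while the project, dataset, and table values are hypothetical:

```python
# BigQuery v2 query job configuration with large results enabled.
# The table reference values are hypothetical placeholders.
job_data = {
    "configuration": {
        "query": {
            "query": "SELECT ...",
            "allowLargeResults": True,  # required for results > 128 MB compressed
            "destinationTable": {       # must be set when allowLargeResults is used
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "query_results",
            },
        }
    }
}
```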

@parthea (Contributor, Author) commented Oct 2, 2015

Another potential solution is to create a new function, gbq.query_to_table(), which does not return a DataFrame. gbq.query_to_table() would require a destination table to be specified and would support an allow_large_results parameter; a possible signature is sketched below.
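
A hypothetical sketch of what that signature could look like (this function does not exist in pandas; it only illustrates the proposal):

```python
# Hypothetical API sketch -- not an existing pandas function.
def query_to_table(query, project_id, destination_table,
                   allow_large_results=False, if_exists='fail'):
    """Run `query` and write its results directly to `destination_table`
    in BigQuery, without downloading them into a DataFrame."""
    ...
```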

@jreback (Contributor) commented Oct 2, 2015

@parthea I am not averse to these changes, but I would like 0.17.0 to be released and settle before considering API changes.

@jreback (Contributor) commented Oct 2, 2015

Further, crafting a nice, useful, non-duplicative API is actually tricky. You want to have a limited set of things that one could 'do' in an intuitive way. So one of the big issues is how to pass in options (e.g. allow_large_results, which is really a 'user' option).

@parthea (Contributor, Author) commented Oct 2, 2015

Do you think it would be better to close this pull request and instead request that this feature be supported in the odo project (assuming that odo will support gbq), since the odo project is aimed at data migration?

The functionality in this pull request could be similar to the following pull request in odo, which adds the ability to append query results to a table: blaze/odo#37
