Gbq service account #11881

Closed · tworec wants to merge 2 commits from the gbq_service_account branch

Conversation

tworec commented Dec 21, 2015

This adds service account authentication while still supporting the standard web auth method.
It also adds some useful stdout messages: progress with elapsed time and percentage, plus a price calculation.

At RTBHouse we have been using this service account auth since May. It works perfectly with a remote Jupyter server (IPython notebooks).

fixes #8489
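
For context, usage of the new parameter looks roughly like the following sketch; the project id and key path are placeholders, not values from this PR:

import pandas as pd

# Service account authentication: pass the path to a JSON key file
# (later revisions of this PR also accept the JSON contents as a string).
df = pd.read_gbq("SELECT 1 AS x",
                 project_id="my-project",          # placeholder
                 private_key="/path/to/key.json")  # placeholder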

from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage
from oauth2client.tools import run_flow, argparser

_check_google_client_version()

flow = OAuth2WebServerFlow(client_id='495642085510-k0tmvj2m941jhre2nbqka17vqpjfddtd.apps.googleusercontent.com',
                           client_secret='kOc9wMptUtxkcIFbtZCcrEAc',
                           scope='https://www.googleapis.com/auth/bigquery',

tworec (author) commented on this diff:

TODO: use self.scope here
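
For illustration, the TODO amounts to building the flow from the connector's own scope attribute instead of the hard-coded string; a hypothetical sketch (the class and method names here are made up):

from oauth2client.client import OAuth2WebServerFlow

class ConnectorSketch(object):
    # The connector already stores the BigQuery scope on the instance.
    scope = 'https://www.googleapis.com/auth/bigquery'

    def make_flow(self, client_id, client_secret):
        # Use self.scope rather than repeating the literal scope string.
        return OAuth2WebServerFlow(client_id=client_id,
                                   client_secret=client_secret,
                                   scope=self.scope)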

jreback commented Dec 27, 2015

the private_key should be allowed to be a sequence of bytes as well. (and rename)

tworec commented Dec 28, 2015

OK, I've added support for passing the private_key contents directly. I agree that is the better design decision.
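
For example, passing the key contents rather than a path could look like this (purely illustrative; the key dict below is a dummy and the project id is a placeholder):

import json
import pandas as pd

# Handy when the key lives in an environment variable or a secrets store
# instead of a file on disk.
key_contents = json.dumps({
    'client_email': 'service-account@my-project.iam.gserviceaccount.com',  # dummy
    'private_key': '-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n',  # dummy
})
df = pd.read_gbq('SELECT 1 AS x',
                 project_id='my-project',  # placeholder
                 private_key=key_contents)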

@tworec tworec force-pushed the gbq_service_account branch from 6b67450 to 8fa76e7 Compare December 28, 2015 15:54
@@ -37,6 +39,10 @@ def _check_google_client_version():
logger = logging.getLogger('pandas.io.gbq')
logger.setLevel(logging.ERROR)

def _print(msg, end='\n'):
    sys.stdout.write(msg + end)

A reviewer commented on this diff:

you should pass verbose to this (and do the if inside here). That way you can simply:

_print(msg, verbose=verbose) in the code itself.
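
A minimal sketch of what that suggestion implies (the verbose keyword here follows the reviewer's proposal; the final signature in the PR may differ):

import sys

def _print(msg, end='\n', verbose=True):
    # Only emit output when verbose is requested, so callers can write
    # _print(msg, verbose=verbose) without guarding each call site.
    if verbose:
        sys.stdout.write(msg + end)
        sys.stdout.flush()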

tworec (author) replied:

thanks, good point

jreback commented Dec 29, 2015

pls add a whatsnew entry (enhancements).
can you post a sample session with verbose=True?

jreback commented Dec 29, 2015

cc @parthea
cc @andrewryno
cc @andrioni
cc @jacobschaer

pls have a look

@jreback jreback added this to the 0.18.0 milestone Dec 29, 2015
@tworec tworec force-pushed the gbq_service_account branch from 8914df3 to cd45e22 Compare January 4, 2016 15:29
jreback commented Jan 6, 2016

pls rebase / squash

@tworec tworec force-pushed the gbq_service_account branch from cd45e22 to 2d12d6d Compare January 7, 2016 08:55
tworec commented Jan 7, 2016

OK, rebased and squashed.

jreback commented Jan 11, 2016

cc @parthea
cc @andrewryno
cc @andrioni
cc @jacobschaer

any comments?

@tworec tworec force-pushed the gbq_service_account branch from 2d12d6d to 4ba4401 Compare January 13, 2016 11:17
parthea commented Jan 13, 2016

I'm receiving the following exception related to the time.monotonic() function when running the gbq integration tests.

AttributeError: 'module' object has no attribute 'monotonic'

I believe Travis will also have this issue. Travis is currently skipping integration tests that require a BigQuery project id, so the issue is not reported by Travis.

Should we add a dependency for monotonic?
https://pypi.python.org/pypi/monotonic/0.5

======================================================================
ERROR: test_should_properly_handle_valid_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tony/pandas-gbq_service_account/pandas/io/tests/test_gbq.py", line 322, in test_should_properly_handle_valid_integers
    df = gbq.read_gbq(query, project_id=PROJECT_ID)
  File "/home/tony/pandas-gbq_service_account/pandas/io/gbq.py", line 503, in read_gbq
    schema, pages = connector.run_query(query)
  File "/home/tony/pandas-gbq_service_account/pandas/io/gbq.py", line 269, in run_query
    self._start_timer()
  File "/home/tony/pandas-gbq_service_account/pandas/io/gbq.py", line 185, in _start_timer
    self.start = time.monotonic()
AttributeError: 'module' object has no attribute 'monotonic'

jreback commented Jan 13, 2016

we don't need any more deps

use time.time()
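
A rough sketch of the stdlib-only timing this implies (the class and method names below are illustrative, not necessarily those in gbq.py):

import time

class ElapsedTimer(object):
    def start_timer(self):
        # time.time() exists on both Python 2 and 3; time.monotonic() is 3.3+ only.
        self.start = time.time()

    def elapsed_seconds(self):
        return round(time.time() - self.start, 2)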

@tworec tworec force-pushed the gbq_service_account branch from 4ba4401 to 0d036cc Compare January 13, 2016 17:08
tworec commented Jan 15, 2016

Re FileNotFoundError: I'm running tests on Python 3 only, where this exception is built-in, hence I missed this. After adding support for handling JSON contents, this exception is no longer needed. See line 168 in gbq.py, where I check that the file exists and is a regular file, so a file-not-found exception cannot happen. I'm removing it from the except clause.

While doing this I also rethought, rewrote, and tested the invalid-auth scenarios.

Re json.load: it works for me; I can't reproduce it under Python 3 :(
I don't understand the json magic in the pandas.json module, but I think that is what gets imported here.
What I could do is invoke json.loads, which was used here before my changes, but that seems pointless, because Python's built-in json module can read files directly.
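
For readers following along, the path-or-contents handling described above amounts to something like this sketch (the helper name is invented; only InvalidPrivateKeyFormat and the two required JSON keys come from the PR):

import json
import os

class InvalidPrivateKeyFormat(ValueError):
    pass

def parse_private_key(private_key):
    # Accept either a path to a JSON key file or the JSON contents themselves,
    # and return the (client_email, private_key) pair needed for auth.
    try:
        if os.path.isfile(private_key):
            with open(private_key) as f:
                json_key = json.load(f)
        else:
            json_key = json.loads(private_key)
        return json_key['client_email'], json_key['private_key']
    except (KeyError, ValueError, TypeError, AttributeError):
        raise InvalidPrivateKeyFormat(
            "Service account private key should be valid JSON (file path or "
            "string contents) with at least two keys: 'client_email' and "
            "'private_key'.")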

tworec commented Jan 15, 2016

nosetests test_gbq.py -v 
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_bad_project_id (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_bad_table_name (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order_plus_index (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_download_dataset_larger_than_200k_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_index_column (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_malformed_query (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_arbitrary_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_empty_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_false_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_timestamp_unix_epoch (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_true_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_read_as_service_account_with_key_contents (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_read_as_service_account_with_key_path (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_unicode_string_conversion_and_normalization (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_zero_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_read_gbq_when_private_key_json_values_has_wrong_types_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_read_gbq_with_corrupted_private_key_json_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_read_gbq_with_empty_private_key_file_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_read_gbq_with_empty_private_key_json_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_read_gbq_with_invalid_private_key_json_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_read_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_booleans_as_python_booleans (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_floats_as_python_floats (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_integers_as_python_floats (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_strings_as_python_strings (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_should_return_bigquery_timestamps_as_numpy_datetime (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_that_parse_data_works_properly (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_to_gbq_should_fail_if_invalid_table_name_passed (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_to_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.TestReadGBQUnitTests) ... ok
test_create_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_create_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_exists (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_generate_schema (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_google_upload_errors_should_raise_exception (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table_zero_results (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_table_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_append (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_fail (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_replace (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_as_service_account_with_key_contents (pandas.io.tests.test_gbq.TestToGBQIntegrationServiceAccountKeyContents) ... ok
test_upload_data_as_service_account_with_key_path (pandas.io.tests.test_gbq.TestToGBQIntegrationServiceAccountKeyPath) ... ok
pandas.io.tests.test_gbq.test_requirements ... ok
pandas.io.tests.test_gbq.test_generate_bq_schema_deprecated ... ok

----------------------------------------------------------------------
Ran 73 tests in 330.985s

OK

tworec commented Jan 15, 2016

Sample gbq.read_gbq session with verbose output:

Requesting query... ok.
Query running...
  Elapsed 11.55 s. Waiting...
Query done.
Processed: 37.5 Mb

Retrieving results...
  Got page: 1; 10% done. Elapsed 17.25 s.
  Got page: 2; 19% done. Elapsed 20.32 s.
  Got page: 3; 29% done. Elapsed 24.03 s.
  Got page: 4; 39% done. Elapsed 28.5 s.
  Got page: 5; 48% done. Elapsed 33.15 s.
  Got page: 6; 58% done. Elapsed 37.19 s.
  Got page: 7; 67% done. Elapsed 40.6 s.
  Got page: 8; 77% done. Elapsed 43.95 s.
  Got page: 9; 87% done. Elapsed 48.27 s.
  Got page: 10; 96% done. Elapsed 185.45 s.
  Got page: 11; 100% done. Elapsed 187.93 s.
Got 1038579 rows.

Total time taken 191.67 s.
Finished at 2016-01-15 16:16:07.


- def __init__(self, project_id, reauth=False):
+ def __init__(self, project_id, reauth=False, verbose=True, private_key=None):

A reviewer commented on this diff:

Let's default to verbose=False. You could have an option to do this, I suppose, e.g. pd.options.gbq.verbose.

jreback commented Jan 15, 2016

pls git diff master | flake8 --diff and fix the issues

jreback commented Jan 21, 2016

needs a rebase; most of the codebase underwent a PEP cleanup very recently.


You will be authenticated to the specified BigQuery account via Google's Oauth2 mechanism.

Primary auth method is as simple as following the

A reviewer commented on this diff:

Suggested wording: "The primary authentication"

parthea commented Jan 21, 2016

  • We need to update the to_gbq() definition in frame.py to also support the 'private_key' parameter (see the sketch below). Currently it fails:
data_frame.to_gbq(DESTINATION_TABLE, project_id=PROJECT_ID, if_exists='append', private_key=PRIVATE_KEY)

TypeError: to_gbq() got an unexpected keyword argument 'private_key'
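
The fix is essentially to add the keyword to DataFrame.to_gbq and forward it to pandas.io.gbq.to_gbq; a hedged sketch of the pass-through (the default values here are assumptions, not necessarily the merged signature):

def to_gbq(self, destination_table, project_id=None, chunksize=10000,
           verbose=True, reauth=False, if_exists='fail', private_key=None):
    # Forward every argument, including the new private_key, to the gbq module.
    from pandas.io import gbq
    return gbq.to_gbq(self, destination_table, project_id=project_id,
                      chunksize=chunksize, verbose=verbose, reauth=reauth,
                      if_exists=if_exists, private_key=private_key)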

    )
except (KeyError, ValueError, TypeError, AttributeError):
    raise InvalidPrivateKeyFormat("Service account private key should be valid JSON (file path or string contents) "
                                  "with at least two keys: 'client_email' and 'private_key'. Can be obtained from google developers console. ")

A reviewer commented on this diff:

Can we make a small change in the error message to indicate that the file may be missing?

raise InvalidPrivateKeyFormat("Private key is missing or invalid. Service account private key should be valid JSON (file path or string contents) with at least two keys: 'client_email' and 'private_key'. Can be obtained from google developers console. ")

@tworec tworec force-pushed the gbq_service_account branch from fefd245 to 2e5b404 Compare January 22, 2016 11:49
@tworec tworec force-pushed the gbq_service_account branch from 1917703 to e035747 Compare January 27, 2016 09:32

.. note::

The `'private_key'` parameter can be set to either the file path of the service account key in JSON format, or

A reviewer commented on this diff:

needs indenting under the note (same below)

jreback commented Jan 30, 2016

@tworec just a couple of more stylistic changes to conform to the rest of the code base.

almost there!

tworec commented Feb 1, 2016

Great! All style fixes are done. I'm running local tests and will push. Please review.

tworec commented Feb 1, 2016

All 73 tests passed on both Python 3.5 and Python 2.7.

@jreback jreback closed this in 6a32f10 Feb 1, 2016
jreback commented Feb 1, 2016

@tworec
@parthea

thanks!

pls review the built docs and if necessary issue a followup for fixes

http://pandas-docs.github.io/pandas-docs-travis/ (prob will take a few hours, lots of things in the queue :)

tworec commented Feb 2, 2016

This was my first PR. Thanks for the feedback.

@tworec tworec deleted the gbq_service_account branch February 16, 2016 12:09

Successfully merging this pull request may close these issues.

BigQuery authentication on remote servers
3 participants