ENH: #8325 Add ability to create tables using the gbq module. #10857


Merged
3 commits merged into pandas-dev:master on Sep 13, 2015

Conversation

ghost

@ghost ghost commented Aug 19, 2015

closes #8325
closes #10652

  • Added the ability for gbq.to_gbq to automatically create the destination table if it does not exist (see the usage sketch below)
  • Added a gbq.gbq_table_exists() function to the gbq module
  • Added a gbq.create_gbq_table() function to the gbq module
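For illustration, a minimal usage sketch of the behavior described above (the project id, dataset, and table names are hypothetical; the helper names are as originally proposed and get renamed later in this review):

    import pandas as pd
    from pandas.io import gbq

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # With this change, to_gbq creates 'test_dataset.new_table' on the
    # fly if it does not already exist, instead of failing.
    gbq.to_gbq(df, 'test_dataset.new_table', 'my-project-id')

    # The new existence check can also be called directly
    # (signature as shown in the diff below).
    gbq.gbq_table_exists('test_dataset.new_table', project_id='my-project-id')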

@ghost ghost closed this Aug 20, 2015
@ghost ghost force-pushed the allow-creation-of-gbq-tables branch from 9c27c08 to 3e35c84 Compare August 20, 2015 03:48
@ghost ghost reopened this Aug 20, 2015
@ghost ghost force-pushed the allow-creation-of-gbq-tables branch from 3474708 to a0feef4 Compare August 20, 2015 04:10
@@ -436,3 +498,57 @@ def generate_bq_schema(df, default_type='STRING'):
'type': type_mapping.get(dtype.kind, default_type)})

return {'fields': fields}

def gbq_table_exists(table, project_id=None):
Contributor

name this table_exists for compat with other pandas names
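For reference, a sketch of how such an existence check is commonly written against the BigQuery v2 API with google-api-python-client (an authenticated service object is passed in explicitly here for self-containment; the actual implementation in this PR may differ):

    from apiclient.errors import HttpError

    def table_exists(service, table, project_id):
        # 'table' is of the form 'dataset.tablename'
        dataset_id, table_id = table.rsplit('.', 1)
        try:
            service.tables().get(projectId=project_id,
                                 datasetId=dataset_id,
                                 tableId=table_id).execute()
            return True
        except HttpError as ex:
            if ex.resp.status == 404:  # not found: the table does not exist
                return False
            raise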

@ghost ghost force-pushed the allow-creation-of-gbq-tables branch 5 times, most recently from f6977cb to 60fdf42 Compare August 22, 2015 14:30
@ghost
Author

ghost commented Aug 22, 2015

  • Renamed gbq_table_exists to table_exists
  • Renamed create_gbq_table to create_table
  • Made project_id a required parameter in create_table
  • Reformatted descriptor text in the to_gbq function into bullet points
  • Removed redundant checks in test_gbq.py, e.g. if PROJECT_ID and not missing_bq()
  • Modified read_gbq to support the destination_table argument, which allows users to redirect the query results directly to a BigQuery table instead of a pandas DataFrame
  • Added an if_exists argument to read_gbq and to_gbq to support appending to or replacing a table in Google BigQuery, based on the argument provided (see the sketch after this list):
    • If the destination table has a different schema and if_exists is set to append, raise InvalidSchemaError
    • If the destination table has a different schema and if_exists is set to replace, delete and re-create the destination table, then wait 120 seconds before inserting data (https://code.google.com/p/google-bigquery/issues/detail?id=191)
  • Cleaned up code that does not adhere to PEP8
  • Fixed issues with setUp and tearDown in test_gbq.py
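A sketch of the if_exists behavior described above (the project id and table names are hypothetical):

    import pandas as pd
    from pandas.io import gbq

    df = pd.DataFrame({'a': [1, 2, 3]})

    # append: insert into the existing table; raises InvalidSchemaError
    # if the DataFrame schema differs from the destination table's schema
    gbq.to_gbq(df, 'test_dataset.my_table', 'my-project-id', if_exists='append')

    # replace: drop and re-create the destination table, then wait
    # 120 seconds before inserting (see the linked BigQuery issue)
    gbq.to_gbq(df, 'test_dataset.my_table', 'my-project-id', if_exists='replace')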

Ready for review. Ran unit tests in TestReadGBQIntegration & TestToGBQIntegration. All tests passed.


data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                         index_col='index_column_name',
                         col_order=['col1', 'col2', 'col3'], project_id=projectid,
Contributor

this should certainly not be an argument to pd.read_gbq, which by definition returns a DataFrame. pls remove this. Not against having this functionality, but it should clearly be defined and not added on to something else.

Author

I will restore pd.read_gbq to its original functionality. I like the idea of creating a new function, pd.to_gbq_large(), which is intended for larger queries. The reason for adding the word large is that the BigQuery allowLargeResults parameter requires a destination table to be specified. Should I create a new branch/PR for this feature? Please confirm the following looks good (a hypothetical call is sketched after the signature).

function name: pd.to_gbq_large()

    Parameters
    ----------
    query : str
        SQL-Like Query to return data values
    project_id : str
        Google BigQuery Account project ID.
    reauth : boolean (default False)
        Force Google BigQuery to reauthenticate the user. This is useful
        if multiple accounts are used.
    destination_table : str
        Name of table to be written, in the form 'dataset.tablename'.
        The results will be sent directly to the provided destination table.
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.

    Returns
    -------
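A hypothetical call matching the proposed signature (pd.to_gbq_large was never merged under this name; shown only to make the proposal concrete):

    import pandas as pd

    pd.to_gbq_large('SELECT col1, col2 FROM test_dataset.big_table',
                    project_id='my-project-id',
                    destination_table='test_dataset.query_results',
                    if_exists='replace')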

Contributor

for sure create a new pr
but not sure why you actually need a new function; why wouldn't this just be part of to_gbq?

Author

I have prepared 2 proposals to allow the user to choose between uploading a pandas DataFrame to a BigQuery table or running a SQL-like query and having the results go to a BigQuery table. I will create a new PR once the decision is made.

###### Proposal 1

Modify pd.to_gbq() to accept both dataframe and query as arguments. Both will be optional, but at least one is required. In the new revision the dataframe parameter becomes optional so that users can either upload a pandas DataFrame or run a SQL-like query and have the results sent directly to a BigQuery table (both call styles are sketched after the validation snippet below).

function name: pd.to_gbq()

    Parameters
    ----------
    dataframe : DataFrame (optional)
        DataFrame to be written. Note: one of dataframe or query is required.
    query : str (optional)
        SQL-Like Query to return data values. Note: one of dataframe or query is required.
    destination_table : str
        Name of table to be written, in the form 'dataset.tablename'
    project_id : str
        Google BigQuery Account project ID.
    chunksize : int (default 10000)
        Number of rows to be inserted in each chunk from the dataframe.
    verbose : boolean (default True)
        Show percentage complete
    reauth : boolean (default False)
        Force Google BigQuery to reauthenticate the user. This is useful
        if multiple accounts are used.
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.
    """

    # only one of dataframe/query may be provided; note that a bare
    # `if dataframe and query` would raise, since a DataFrame has no
    # unambiguous truth value
    if dataframe is not None and query is not None:
        raise AssertionError('Only one of dataframe and query can be provided')

    # at least one of dataframe or query is required
    if dataframe is None and query is None:
        raise AssertionError('At least one of dataframe or query must be provided')
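The two call styles this proposal would allow (illustrative only; the names and values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

    # upload a DataFrame (query omitted)
    pd.to_gbq(dataframe=df, destination_table='test_dataset.t1',
              project_id='my-project-id')

    # materialize a query result into a table (dataframe omitted)
    pd.to_gbq(query='SELECT * FROM test_dataset.t1',
              destination_table='test_dataset.t2',
              project_id='my-project-id')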

###### Modified Proposal 2

  • Create a new pd function to support running a SQL-like query and having the results go directly to a BigQuery destination table. This can be used for running large queries which require the BigQuery allowLargeResults parameter.
    Proposed function name: pd.read_gbq_large

My understanding is that the allowLargeResults parameter only applies to SQL-like queries which return large results, not to uploading data (e.g. a pandas DataFrame). Given that, my feeling is this feature is more similar to pd.read_gbq than pd.to_gbq.

Thank you for taking the time to help me iron out this one. I think it will be a useful feature.

@ghost ghost closed this Aug 23, 2015
@jreback
Contributor

jreback commented Sep 13, 2015

ok gr8!

having tests that work on Travis will be a big help

even better would be a test project id (iirc that is hard to get) - though I can prob set some up

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch from ec26f4b to 1484ef9 Compare September 13, 2015 00:52
@parthea
Contributor

parthea commented Sep 13, 2015

Getting all the tests working with a test project id could be done using the following steps:

  • Create a Google account
  • Register for BigQuery and create a project id
  • Set the project id in test_gbq.py. Run the unit tests and go through the authentication process.
  • Copy the file that is generated: pandas/io/tests/bigquery_credentials.dat
  • Place bigquery_credentials.dat into pandas/io/tests/ at the start of each build.

I have seen an advertisement from Google with a 2-month (or $300) free trial.
https://cloud.google.com/free-trial/
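For context, the guard in test_gbq.py looks roughly like this (paraphrased sketch; the skip message is quoted verbatim later in this thread):

    import nose

    PROJECT_ID = None  # set to a real BigQuery project id to enable the tests

    def _skip_if_no_project_id():
        if not PROJECT_ID:
            raise nose.SkipTest(
                "Cannot run integration tests without a project id")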

@jacobschaer
Contributor

We've tried several times to work something out with Google for this and haven't gotten anywhere. Ideally they would provide an account just for integration testing for this project - perhaps limiting access to Travis IPs, etc. @azbones - have we heard anything back from them?

@parthea - that's basically what we did.

  1. Created BigQuery account
  2. Ran gbq once to generate the credentials file
  3. Put the credentials file on a server with Jenkins
  4. Jenkins runs on commit, moving the credentials file to the correct place and doing a string substitution with the project ID (in the test files).

The catch was that Jenkins needed to run at least often enough to keep the token valid, but that turned out to not be too big of an issue.
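A sketch of that restore-and-substitute step in Python (the credentials path and the GBQ_PROJECT_ID environment variable are assumptions about the Jenkins setup):

    import os
    import re
    import shutil

    # restore the cached OAuth token so gbq does not prompt interactively
    shutil.copy('/var/lib/jenkins/secrets/bigquery_credentials.dat',
                'pandas/io/tests/bigquery_credentials.dat')

    # inject the real project id into the test file
    path = 'pandas/io/tests/test_gbq.py'
    with open(path) as f:
        src = f.read()
    src = re.sub(r"^PROJECT_ID = .*$",
                 "PROJECT_ID = %r" % os.environ['GBQ_PROJECT_ID'],
                 src, flags=re.M)
    with open(path, 'w') as f:
        f.write(src)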

@parthea
Contributor

parthea commented Sep 13, 2015

@jacobschaer
That's great! So everything is in place except a free account for integration testing, which may take some time. Is it possible to use a BigQuery trial in the meantime? That should buy us 2 months. At the same time, we can find out what the cost would be for a paid account (you can see how long the $300 lasts in the trial), and it will also give us a chance to tell Google what we're using it for, in hopes that we can get an account for integration testing. It may help open the communications with Google.

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch 2 times, most recently from 6469e52 to 6bd4562 Compare September 13, 2015 14:00
@@ -1 +1,3 @@
blosc
httplib2
google-api-python-client == 1.2
Contributor

you don't need to have this work in EVERY version. In fact let's just care about 2.7 (where you specify exact versions, e.g. the 1.2 client) and 3.4 (where you don't, e.g. the latest)

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 let's open a new issue about proper testing of this on Travis. I am working on getting some credentials.

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 update the whatsnew to show closing #10656 as well

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch from 9e1cca6 to 9ffd40d Compare September 13, 2015 15:41
@parthea
Contributor

parthea commented Sep 13, 2015

This should close #10652

@jreback
Contributor

jreback commented Sep 13, 2015

ok that is the correct issue

@azbones

azbones commented Sep 13, 2015

@jacobschaer I tried reaching out to Google's developer relations team and the BigQuery engineering manager several times, but they never came up with a solution for how to conduct testing with an account that would be public. It is unfortunate as I gave them a few different ideas about how to approach it (delete the data sets periodically, limit total data set size, etc.), but got no response. You would think they would want to encourage this kind of work...

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 ok, lgtm. ping on green!

@parthea
Contributor

parthea commented Sep 13, 2015

Thanks!

Travis is green. The only message in the skipped tests for the gbq module is 'Cannot run integration tests without a project id'.

jreback added a commit that referenced this pull request Sep 13, 2015
ENH: #8325 Add ability to create tables using the gbq module.
@jreback jreback merged commit ae28073 into pandas-dev:master Sep 13, 2015
@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 big thanks! this was a really nice fixup!

@jreback
Contributor

jreback commented Sep 13, 2015

for your next trick...

blaze/odo#285

@cpcloud manages this project and it would be straightforward to add the routines needed to support BQ (a further and more complicated project would be to actually use a Blaze expression as the query, but as a first step io to/from are quite useful)

@parthea parthea deleted the allow-creation-of-gbq-tables branch September 13, 2015 20:45
@parthea
Contributor

parthea commented Sep 14, 2015

I would definitely be up for working on the odo project as I enjoy a great challenge. I have a few work/personal commitments in the next few weeks. I will begin looking into the odo project in early October.
