ENH: #8325 Add ability to create tables using the gbq module. #10857


Merged
3 commits merged into pandas-dev:master on Sep 13, 2015

Conversation

ghost

@ghost ghost commented Aug 19, 2015

closes #8325
closes #10652

  • Added the ability for gbq.to_gbq to automatically create the destination table if it does not exist (see the usage sketch below)
  • Added a gbq.gbq_table_exists() function to the gbq module
  • Added a gbq.create_gbq_table() function to the gbq module
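For illustration, a minimal usage sketch of the behavior described above (the project id, dataset, and table names are hypothetical; the helper names are as originally proposed and get renamed later in this review):

    import pandas as pd
    from pandas.io import gbq

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # With this change, to_gbq creates 'test_dataset.new_table' on the
    # fly if it does not already exist, instead of failing.
    gbq.to_gbq(df, 'test_dataset.new_table', 'my-project-id')

    # The new existence check can also be called directly
    # (signature as shown in the diff below).
    gbq.gbq_table_exists('test_dataset.new_table', project_id='my-project-id')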

@ghost ghost closed this Aug 20, 2015
@ghost ghost force-pushed the allow-creation-of-gbq-tables branch from 9c27c08 to 3e35c84 Compare August 20, 2015 03:48
@ghost ghost reopened this Aug 20, 2015
@ghost ghost force-pushed the allow-creation-of-gbq-tables branch from 3474708 to a0feef4 Compare August 20, 2015 04:10
@@ -436,3 +498,57 @@ def generate_bq_schema(df, default_type='STRING'):
'type': type_mapping.get(dtype.kind, default_type)})

return {'fields': fields}

def gbq_table_exists(table, project_id=None):
Contributor

name this table_exists for compat with other pandas names
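For reference, a sketch of how such an existence check is commonly written against the BigQuery v2 API with google-api-python-client (an authenticated service object is passed in explicitly here for self-containment; the actual implementation in this PR may differ):

    from apiclient.errors import HttpError

    def table_exists(service, table, project_id):
        # 'table' is of the form 'dataset.tablename'
        dataset_id, table_id = table.rsplit('.', 1)
        try:
            service.tables().get(projectId=project_id,
                                 datasetId=dataset_id,
                                 tableId=table_id).execute()
            return True
        except HttpError as ex:
            if ex.resp.status == 404:  # not found: the table does not exist
                return False
            raise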

@ghost ghost force-pushed the allow-creation-of-gbq-tables branch 5 times, most recently from f6977cb to 60fdf42 Compare August 22, 2015 14:30
@ghost
Author

ghost commented Aug 22, 2015

  • Renamed gbq_table_exists to table_exists
  • Renamed create_gbq_table to create_table
  • Made project_id a required parameter in create_table
  • Reformatted descriptor text in the to_gbq function into bullet points
  • Removed redundant checks in test_gbq.py, e.g. if PROJECT_ID and not missing_bq()
  • Modified read_gbq to support the destination_table argument, which allows users to redirect the query results directly to a BigQuery table instead of a pandas DataFrame
  • Added an if_exists argument to read_gbq and to_gbq to support appending to or replacing a table in Google BigQuery, based on the argument provided (see the sketch after this list):
    • If the destination table has a different schema and if_exists is set to append, raise InvalidSchemaError
    • If the destination table has a different schema and if_exists is set to replace, delete and re-create the destination table, then wait 120 seconds before inserting data (https://code.google.com/p/google-bigquery/issues/detail?id=191)
  • Cleaned up code that does not adhere to PEP8
  • Fixed issues with setUp and tearDown in test_gbq.py
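A sketch of the if_exists behavior described above (the project id and table names are hypothetical):

    import pandas as pd
    from pandas.io import gbq

    df = pd.DataFrame({'a': [1, 2, 3]})

    # append: insert into the existing table; raises InvalidSchemaError
    # if the DataFrame schema differs from the destination table's schema
    gbq.to_gbq(df, 'test_dataset.my_table', 'my-project-id', if_exists='append')

    # replace: drop and re-create the destination table, then wait
    # 120 seconds before inserting (see the linked BigQuery issue)
    gbq.to_gbq(df, 'test_dataset.my_table', 'my-project-id', if_exists='replace')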

Ready for review. Ran unit tests in TestReadGBQIntegration & TestToGBQIntegration. All tests passed.


data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                         index_col='index_column_name',
                         col_order=['col1', 'col2', 'col3'], project_id=projectid,
Contributor

this should certainly not be an argument to pd.read_gbq, which by definition returns a DataFrame. pls remove this. Not against having this functionality, but it should clearly be defined and not added on to something else.

Author

I will restore pd.read_gbq to its original functionality. I like the idea of creating a new function, pd.to_gbq_large(), which is intended for larger queries. The reason for adding the word large is that the BigQuery allowLargeResults parameter requires a destination table to be specified. Should I create a new branch/PR for this feature? Please confirm the following looks good (a hypothetical call is sketched after the signature).

function name: pd.to_gbq_large()

    Parameters
    ----------
    query : str
        SQL-Like Query to return data values
    project_id : str
        Google BigQuery Account project ID.
    reauth : boolean (default False)
        Force Google BigQuery to reauthenticate the user. This is useful
        if multiple accounts are used.
    destination_table : str
        Name of table to be written, in the form 'dataset.tablename'.
        The results will be sent directly to the provided destination table.
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.

    Returns
    -------
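A hypothetical call matching the proposed signature (pd.to_gbq_large was never merged under this name; shown only to make the proposal concrete):

    import pandas as pd

    pd.to_gbq_large('SELECT col1, col2 FROM test_dataset.big_table',
                    project_id='my-project-id',
                    destination_table='test_dataset.query_results',
                    if_exists='replace')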

Contributor

for sure create a new pr
but not sure why you actually need a new function; why wouldn't this just be part of to_gbq?

Author

I have prepared 2 proposals to allow the user to choose between uploading a pandas DataFrame to a BigQuery table or running a SQL-like query and having the results go to a BigQuery table. I will create a new PR once the decision is made.

###### Proposal 1

Modify pd.to_gbq() to accept both dataframe and query as arguments. Both will be optional, but at least one is required. In the new revision the dataframe parameter becomes optional so that users can either upload a pandas DataFrame or run a SQL-like query and have the results sent directly to a BigQuery table (both call styles are sketched after the validation snippet below).

function name: pd.to_gbq()

    Parameters
    ----------
    dataframe : DataFrame (optional)
        DataFrame to be written. Note: one of dataframe or query is required.
    query : str (optional)
        SQL-Like Query to return data values. Note: one of dataframe or query is required.
    destination_table : str
        Name of table to be written, in the form 'dataset.tablename'
    project_id : str
        Google BigQuery Account project ID.
    chunksize : int (default 10000)
        Number of rows to be inserted in each chunk from the dataframe.
    verbose : boolean (default True)
        Show percentage complete
    reauth : boolean (default False)
        Force Google BigQuery to reauthenticate the user. This is useful
        if multiple accounts are used.
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.
    """

    # only one of dataframe/query may be provided; note that a bare
    # `if dataframe and query` would raise, since a DataFrame has no
    # unambiguous truth value
    if dataframe is not None and query is not None:
        raise AssertionError('Only one of dataframe and query can be provided')

    # at least one of dataframe or query is required
    if dataframe is None and query is None:
        raise AssertionError('At least one of dataframe or query must be provided')
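The two call styles this proposal would allow (illustrative only; the names and values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

    # upload a DataFrame (query omitted)
    pd.to_gbq(dataframe=df, destination_table='test_dataset.t1',
              project_id='my-project-id')

    # materialize a query result into a table (dataframe omitted)
    pd.to_gbq(query='SELECT * FROM test_dataset.t1',
              destination_table='test_dataset.t2',
              project_id='my-project-id')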

###### Modified Proposal 2

  • Create a new pd function to support running a SQL-like query and having the results go directly to a BigQuery destination table. This can be used for running large queries which require the BigQuery allowLargeResults parameter.
    Proposed function name: pd.read_gbq_large

My understanding is that the allowLargeResults parameter only applies to SQL-like queries which return large results, not to uploading data (e.g. a pandas DataFrame). Given that, my feeling is this feature is more similar to pd.read_gbq than pd.to_gbq.

Thank you for taking the time to help me iron out this one. I think it will be a useful feature.

@ghost ghost closed this Aug 23, 2015
@jreback
Contributor

jreback commented Sep 13, 2015

ok gr8!

having tests that work on Travis will be a big help

even better would be a test project id (iirc that is hard to get) - though I can prob set some up

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch from ec26f4b to 1484ef9 Compare September 13, 2015 00:52
@parthea
Contributor

parthea commented Sep 13, 2015

Getting all the tests working with a test project id could be done using the following steps:

  • Create a Google account
  • Register for BigQuery and create a project id
  • Set the project id in test_gbq.py. Run the unit tests and go through the authentication process.
  • Copy the file that is generated: pandas/io/tests/bigquery_credentials.dat
  • Place bigquery_credentials.dat into pandas/io/tests/ at the start of each build.

I have seen an advertisement from Google with a 2-month (or $300) free trial.
https://cloud.google.com/free-trial/
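For context, the guard in test_gbq.py looks roughly like this (paraphrased sketch; the skip message is quoted verbatim later in this thread):

    import nose

    PROJECT_ID = None  # set to a real BigQuery project id to enable the tests

    def _skip_if_no_project_id():
        if not PROJECT_ID:
            raise nose.SkipTest(
                "Cannot run integration tests without a project id")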

@jacobschaer
Contributor

We've tried several times to work something out with Google for this and haven't gotten anywhere. Ideally they would provide an account just for integration testing for this project - perhaps limiting access to Travis IPs, etc. @azbones - have we heard anything back from them?

@parthea - that's basically what we did.

  1. Created BigQuery account
  2. Ran gbq once to generate the credentials file
  3. Put the credentials file on a server with Jenkins
  4. Jenkins runs on commit, moving the credentials file to the correct place and doing a string substitution with the project ID (in the test files).

The catch was that Jenkins needed to run at least often enough to keep the token valid, but that turned out to not be too big of an issue.
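A sketch of that restore-and-substitute step in Python (the credentials path and the GBQ_PROJECT_ID environment variable are assumptions about the Jenkins setup):

    import os
    import re
    import shutil

    # restore the cached OAuth token so gbq does not prompt interactively
    shutil.copy('/var/lib/jenkins/secrets/bigquery_credentials.dat',
                'pandas/io/tests/bigquery_credentials.dat')

    # inject the real project id into the test file
    path = 'pandas/io/tests/test_gbq.py'
    with open(path) as f:
        src = f.read()
    src = re.sub(r"^PROJECT_ID = .*$",
                 "PROJECT_ID = %r" % os.environ['GBQ_PROJECT_ID'],
                 src, flags=re.M)
    with open(path, 'w') as f:
        f.write(src)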

@parthea
Contributor

parthea commented Sep 13, 2015

@jacobschaer
That's great! So everything is in place except a free account for integration testing, which may take some time. Is it possible to use a BigQuery trial in the meantime? That should buy us 2 months. At the same time, we can find out what the cost would be for a paid account (you can see how long the $300 lasts in the trial), and it will also give us a chance to tell Google what we're using it for, in hopes that we can get an account for integration testing. It may help open the communications with Google.

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch 2 times, most recently from 6469e52 to 6bd4562 Compare September 13, 2015 14:00
@@ -1 +1,3 @@
blosc
httplib2
google-api-python-client == 1.2
Contributor

you don't need to have this work in EVERY version. In fact let's just care about 2.7 (where you specify exact versions, e.g. the 1.2 client) and 3.4 (where you don't, e.g. the latest)

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 let's open a new issue about proper testing of this on Travis. I am working on getting some credentials.

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 update the whatsnew to show closing #10656 as well

@parthea parthea force-pushed the allow-creation-of-gbq-tables branch from 9e1cca6 to 9ffd40d Compare September 13, 2015 15:41
@parthea
Contributor

parthea commented Sep 13, 2015

This should close #10652

@jreback
Contributor

jreback commented Sep 13, 2015

ok that is the correct issue

@azbones

azbones commented Sep 13, 2015

@jacobschaer I tried reaching out to Google's developer relations team and the BigQuery engineering manager several times, but they never came up with a solution for how to conduct testing with an account that would be public. It is unfortunate as I gave them a few different ideas about how to approach it (delete the data sets periodically, limit total data set size, etc.), but got no response. You would think they would want to encourage this kind of work...

@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 ok, lgtm. ping on green!

@parthea
Contributor

parthea commented Sep 13, 2015

Thanks!

Travis is green. The only message in the skipped tests for the gbq module is 'Cannot run integration tests without a project id'.

jreback added a commit that referenced this pull request Sep 13, 2015
ENH: #8325 Add ability to create tables using the gbq module.
@jreback jreback merged commit ae28073 into pandas-dev:master Sep 13, 2015
@jreback
Contributor

jreback commented Sep 13, 2015

@parthea1 big thanks! this was a really nice fixup!

@jreback
Contributor

jreback commented Sep 13, 2015

for your next trick...

blaze/odo#285

@cpcloud manages this project and it would be straightforward to add the routines needed to support BQ (a further and more complicated project would be to actually use a Blaze expression as the query, but as a first step io to/from are quite useful)

@parthea parthea deleted the allow-creation-of-gbq-tables branch September 13, 2015 20:45
@parthea
Contributor

parthea commented Sep 14, 2015

I would definitely be up for working on the odo project as I enjoy a great challenge. I have a few work/personal commitments in the next few weeks. I will begin looking into the odo project in early October.
