ENH: #8325 Add ability to create tables using the gbq module. #10857
Conversation
Force-pushed from 9c27c08 to 3e35c84
Force-pushed from 3474708 to a0feef4
@@ -436,3 +498,57 @@ def generate_bq_schema(df, default_type='STRING'):
                'type': type_mapping.get(dtype.kind, default_type)})

    return {'fields': fields}


def gbq_table_exists(table, project_id=None):
name this table_exists for compat with other pandas names
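For reference, a minimal sketch of what a renamed table_exists helper might look like against the BigQuery v2 tables().get() endpoint. The service argument (an authorized client from build('bigquery', 'v2', http=...)) and the dataset/table parsing are assumptions for illustration, not the PR's actual code:

```python
from apiclient.errors import HttpError  # google-api-python-client 1.2


def table_exists(table, project_id, service):
    # 'table' follows the 'dataset.tablename' convention used by to_gbq
    dataset_id, table_id = table.rsplit('.', 1)
    try:
        # tables().get() raises an HttpError with a 404 status
        # when the table does not exist
        service.tables().get(projectId=project_id,
                             datasetId=dataset_id,
                             tableId=table_id).execute()
        return True
    except HttpError as ex:
        if ex.resp.status == 404:
            return False
        raise
```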
Force-pushed from f6977cb to 60fdf42
Ready for review. Ran unit tests in TestReadGBQIntegration & TestToGBQIntegration. All tests passed.
data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                         index_col='index_column_name',
                         col_order=['col1', 'col2', 'col3'],
                         project_id=projectid,
this should certainly not be an argument to pd.read_gbq, which by definition returns a DataFrame. pls remove this. Not against having this functionality, but it should clearly be defined and not added on to something else.
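For contrast, a hedged sketch of the read_gbq contract the reviewer wants preserved; the query, column names, and project id here are hypothetical:

```python
import pandas as pd

# read_gbq runs the query and hands the result set back as a DataFrame;
# it takes no destination_table argument
df = pd.read_gbq('SELECT col1, col2 FROM test_dataset.test_table',
                 project_id='my-project-id')
```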
I will restore pd.read_gbq to its original functionality. I like the idea of creating a new function, pd.to_gbq_large(), which is intended for larger queries. The reason for adding the word large is that the BigQuery allowLargeResults parameter requires a destination table to be specified. Should I create a new branch/PR for this feature? Please confirm the following looks good.
function name: pd.to_gbq_large()

Parameters
----------
query : str
    SQL-Like Query to return data values
project_id : str
    Google BigQuery Account project ID.
reauth : boolean (default False)
    Force Google BigQuery to reauthenticate the user. This is useful
    if multiple accounts are used.
destination_table : string
    Name of table to be written, in the form 'dataset.tablename'.
    The results will be sent directly to the provided destination table.
if_exists : {'fail', 'replace', 'append'}, default 'fail'
    - fail: If table exists, do nothing.
    - replace: If table exists, drop it, recreate it, and insert data.
    - append: If table exists, insert data. Create if does not exist.
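To make the proposal concrete, here is a sketch of how such a function might drive the BigQuery v2 jobs API. The configuration keys (allowLargeResults, destinationTable, writeDisposition) are the documented job settings; the function name, the service argument, and the table parsing are assumptions:

```python
def run_query_to_table(service, project_id, query, destination_table,
                       write_disposition='WRITE_EMPTY'):
    # 'service' is assumed to be an authorized bigquery v2 client;
    # allowLargeResults requires destinationTable to be set
    dataset_id, table_id = destination_table.rsplit('.', 1)
    job_body = {
        'configuration': {
            'query': {
                'query': query,
                'allowLargeResults': True,
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': table_id,
                },
                'writeDisposition': write_disposition,
            }
        }
    }
    # inserts an asynchronous query job; a caller would poll
    # jobs().get() until the job status is DONE
    return service.jobs().insert(projectId=project_id,
                                 body=job_body).execute()
```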
for sure create a new PR
but not sure why you actually need a new function; why wouldn't this just be part of to_gbq?
I have prepared 2 proposals in order to allow the user to choose to upload a pandas DataFrame to a BigQuery table or run a SQL-Like query and have the results go to a BigQuery table. I will create a new PR once the decision is made.
###### Proposal 1
Modify pd.to_gbq() to support both dataframe and query as arguments. Both will be optional, but at least one is required. In the new revision the dataframe parameter will be changed to optional so that users can either upload a pandas DataFrame or run a SQL-Like Query and have the results sent directly to a BigQuery table.
function name: pd.to_gbq()

Parameters
----------
dataframe : DataFrame (optional)
    DataFrame to be written. Note: one of dataframe or query is required.
query : str (optional)
    SQL-Like Query to return data values. Note: one of dataframe or query is required.
destination_table : string
    Name of table to be written, in the form 'dataset.tablename'
project_id : str
    Google BigQuery Account project ID.
chunksize : int (default 10000)
    Number of rows to be inserted in each chunk from the dataframe.
verbose : boolean (default True)
    Show percentage complete
reauth : boolean (default False)
    Force Google BigQuery to reauthenticate the user. This is useful
    if multiple accounts are used.
if_exists : {'fail', 'replace', 'append'}, default 'fail'
    - fail: If table exists, do nothing.
    - replace: If table exists, drop it, recreate it, and insert data.
    - append: If table exists, insert data. Create if does not exist.
"""
# only one will survive
if dataframe and query:
raise AssertionError('Only one of dataframe and query can be provided')
# at least one of dataframe or query is required
if not (dataframe or query):
raise AssertionError('At least one of dataframe or query must be provided')
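Under Proposal 1, usage might then look roughly like this; the table and project names are hypothetical:

```python
import pandas as pd

# upload an existing DataFrame, as to_gbq does today
pd.to_gbq(dataframe=df, destination_table='test_dataset.test_table',
          project_id='my-project-id', if_exists='append')

# or materialize a query's results directly into a destination table
pd.to_gbq(query='SELECT col1, col2 FROM test_dataset.source_table',
          destination_table='test_dataset.test_table',
          project_id='my-project-id', if_exists='replace')
```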
###### Modified Proposal 2
Create a new pd function to support running a SQL-Like Query and having the results go directly to a BigQuery destination table. This can be used for running large queries which require the BigQuery allowLargeResults parameter.
Proposed function name: pd.read_gbq_large
My understanding is that the allowLargeResults parameter is only for running SQL-Like queries which return large results, not for uploading data (e.g., a pandas DataFrame). With that understanding, my feeling is this feature is more similar to pd.read_gbq than pd.to_gbq.
Thank you for taking the time to help me iron out this one. I think it will be a useful feature.
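Whichever proposal wins, the if_exists semantics map naturally onto BigQuery's documented writeDisposition job settings; a sketch (the dict name is mine, and note that WRITE_EMPTY errors out rather than silently doing nothing):

```python
# pandas-style if_exists flag -> BigQuery writeDisposition
WRITE_DISPOSITIONS = {
    'fail': 'WRITE_EMPTY',        # job errors if the table already has data
    'replace': 'WRITE_TRUNCATE',  # overwrite existing table contents
    'append': 'WRITE_APPEND',     # add rows to the existing table
}
```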
ok gr8! Having tests that work on Travis will be a big help. Even better would be a test project id (iirc that is hard to get), though I can prob set some up.
Force-pushed from ec26f4b to 1484ef9
Getting all the tests working with a test project id could be done using the following steps:
I have seen an advertisement from BigQuery offering a 2-month (or $300) free trial.
We've tried several times to work something out with Google for this and haven't gotten anywhere. Ideally they would provide an account just for integration testing for this project - perhaps limiting access to Travis IPs, etc. @azbones - have we heard anything back from them? @parthea - that's basically what we did.
The catch was that Jenkins needed to run at least often enough to keep the token valid, but that turned out to not be too big of an issue.
Force-pushed from 1484ef9 to d997a65
@jacobschaer
Force-pushed from 6469e52 to 6bd4562
@@ -1 +1,3 @@
 blosc
+httplib2
+google-api-python-client == 1.2
you don't need to have this work in EVERY version. In fact let's just care about 2.7 (where you specify exact versions, e.g. the 1.2 client) and 3.4 (where you don't, e.g. the latest)
@parthea1 let's open a new issue about proper testing of this on Travis. I am working on getting some credentials.
@parthea1 update the whatsnew to show closing #10656 as well
Force-pushed from 9e1cca6 to 9ffd40d
This should close #10652
ok that is the correct issue
Force-pushed from 9ffd40d to 2622cb3
@jacobschaer I tried reaching out to Google's developer relations team and the BigQuery engineering manager several times, but they never came up with a solution for how to conduct testing with an account that would be public. It is unfortunate as I gave them a few different ideas about how to approach it (delete the data sets periodically, limit total data set size, etc.), but got no response. You would think they would want to encourage this kind of work...
@parthea1 ok, lgtm. Ping on green!
Thanks! Travis is green. The only message in the skipped tests for the gbq module is
ENH: #8325 Add ability to create tables using the gbq module.
@parthea1 big thanks! this was a really nice fixup!
For your next trick... @cpcloud manages the odo project, and it would be straightforward to add the routines needed to support BQ (a further and more complicated project would be to actually use a Blaze expression as the query, but as a first step io to/from are quite useful).
I would definitely be up for working on the odo project as I enjoy a great challenge. I have a few work/personal commitments in the next few weeks. I will begin looking into the odo project in early October.
closes #8325
closes #10652
- Adds a gbq.gbq_table_exists() function to the gbq module
- Adds a gbq.create_gbq_table() function to the gbq module