Commit ae28073

Merge pull request #10857 from parthea/allow-creation-of-gbq-tables
ENH: #8325 Add ability to create tables using the gbq module.
2 parents 840f88e + 2622cb3

10 files changed: +745 -249 lines changed

ci/requirements-2.7.pip

+2
@@ -1 +1,3 @@
 blosc
+httplib2
+google-api-python-client == 1.2

ci/requirements-2.7.txt

-2
@@ -20,6 +20,4 @@ patsy
 pymysql=0.6.3
 html5lib=1.0b2
 beautiful-soup=4.2.1
-httplib2=0.8
 python-gflags=2.0
-google-api-python-client=1.2

ci/requirements-2.7_SLOW.txt

-2
@@ -20,6 +20,4 @@ psycopg2
 pymysql
 html5lib
 beautiful-soup
-httplib2
 python-gflags
-google-api-python-client

ci/requirements-3.4.pip

+2
@@ -1 +1,3 @@
 blosc
+httplib2
+google-api-python-client

doc/source/api.rst

+4
@@ -110,6 +110,10 @@ Google BigQuery

    read_gbq
    to_gbq
+   generate_bq_schema
+   create_table
+   delete_table
+   table_exists

 .. currentmodule:: pandas

doc/source/io.rst

+178 -46

@@ -3951,29 +3951,50 @@ The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
 analytics web service to simplify retrieving results from BigQuery tables
 using SQL-like queries. Result sets are parsed into a pandas
 DataFrame with a shape and data types derived from the source table.
-Additionally, DataFrames can be appended to existing BigQuery tables if
-the destination table is the same shape as the DataFrame.
+Additionally, DataFrames can be inserted into new BigQuery tables or appended
+to existing tables.

-For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
+.. warning::
+
+   To use this module, you will need a valid BigQuery account. Refer to the
+   `BigQuery Documentation <https://developers.google.com/bigquery/>`__ for details on the service itself.
+
+The key functions are:

-As an example, suppose you want to load all data from an existing BigQuery
-table : `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
-function.
+.. currentmodule:: pandas.io.gbq
+
+.. autosummary::
+   :toctree: generated/
+
+   read_gbq
+   to_gbq
+   generate_bq_schema
+   create_table
+   delete_table
+   table_exists
+
+.. currentmodule:: pandas
+
+Querying
+''''''''
+
+Suppose you want to load all data from an existing BigQuery table
+``test_dataset.test_table`` into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function.

 .. code-block:: python

     # Insert your BigQuery Project ID Here
     # Can be found in the Google web console
     projectid = "xxxxxxxx"

-    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)

 You will then be authenticated to the specified BigQuery account
 via Google's Oauth2 mechanism. In general, this is as simple as following the
 prompts in a browser window which will be opened for you. Should the browser not
 be available, or fail to launch, a code will be provided to complete the process
 manually. Additional information on the authentication mechanism can be found
-`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.

 You can define which column from BigQuery to use as an index in the
 destination DataFrame as well as a preferred column order as follows:

@@ -3982,56 +4003,167 @@ destination DataFrame as well as a preferred column order as follows:

 .. code-block:: python

     data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
-                             index_col='index_column_name',
-                             col_order=['col1', 'col2', 'col3'], project_id = projectid)
+                             projectid, index_col='index_column_name',
+                             col_order=['col1', 'col2', 'col3'])

-Finally, you can append data to a BigQuery table from a pandas DataFrame
-using the :func:`~pandas.io.to_gbq` function. This function uses the
-Google streaming API which requires that your destination table exists in
-BigQuery. Given the BigQuery table already exists, your DataFrame should
-match the destination table in column order, structure, and data types.
-DataFrame indexes are not supported. By default, rows are streamed to
-BigQuery in chunks of 10,000 rows, but you can pass other chuck values
-via the ``chunksize`` argument. You can also see the progess of your
-post via the ``verbose`` flag which defaults to ``True``. The http
-response code of Google BigQuery can be successful (200) even if the
-append failed. For this reason, if there is a failure to append to the
-table, the complete error response from BigQuery is returned which
-can be quite long given it provides a status for each row. You may want
-to start with smaller chunks to test that the size and types of your
-dataframe match your destination table to make debugging simpler.
+.. note::
+
+   You can find your project id in the `BigQuery management console <https://code.google.com/apis/console/b/0/?noredirect>`__.
+
+.. note::
+
+   You can toggle the verbose output via the ``verbose`` flag which defaults to ``True``.
+
+Writing DataFrames
+''''''''''''''''''
+
+Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'my_string': list('abc'),
+                      'my_int64': list(range(1, 4)),
+                      'my_float64': np.arange(4.0, 7.0),
+                      'my_bool1': [True, False, True],
+                      'my_bool2': [False, True, False],
+                      'my_dates': pd.date_range('now', periods=3)})
+
+   df
+   df.dtypes

 .. code-block:: python

-    df = pandas.DataFrame({'string_col_name' : ['hello'],
-                           'integer_col_name' : [1],
-                           'boolean_col_name' : [True]})
-    df.to_gbq('my_dataset.my_table', project_id = projectid)
+    df.to_gbq('my_dataset.my_table', projectid)
+
+.. note::
+
+   If the destination table does not exist, a new table will be created. The
+   destination dataset id must already exist in order for a new table to be created.
+
+The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'``
+or ``'append'`` if the destination table already exists. The default value is ``'fail'``.
+
+For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise
+a ``TableCreationError`` if the destination table already exists.
+
+.. code-block:: python

-The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__
+    df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')

-While BigQuery uses SQL-like syntax, it has some important differences
-from traditional databases both in functionality, API limitations (size and
-quantity of queries or uploads), and how Google charges for use of the service.
-You should refer to Google documentation often as the service seems to
-be changing and evolving. BiqQuery is best for analyzing large sets of
-data quickly, but it is not a direct replacement for a transactional database.
+.. note::

-You can access the management console to determine project id's by:
-<https://code.google.com/apis/console/b/0/?noredirect>
+   If the ``if_exists`` argument is set to ``'append'``, the destination dataframe will
+   be written to the table using the defined table schema and column types. The
+   dataframe must match the destination table in column order, structure, and
+   data types.
+   If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a
+   different schema, a delay of 2 minutes will be forced to ensure that the new schema
+   has propagated in the Google environment. See
+   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

-As of 0.15.2, the gbq module has a function ``generate_bq_schema`` which
-will produce the dictionary representation of the schema.
+Writing large DataFrames can result in errors due to size limitations being exceeded.
+This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`.
+For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time:

 .. code-block:: python

-    df = pandas.DataFrame({'A': [1.0]})
-    gbq.generate_bq_schema(df, default_type='STRING')
+    df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)

-.. warning::
+You can also see the progress of your post via the ``verbose`` flag which defaults to ``True``.
+For example:
+
+.. code-block:: python
+
+   In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)
+
+           Streaming Insert is 10% Complete
+           Streaming Insert is 20% Complete
+           Streaming Insert is 30% Complete
+           Streaming Insert is 40% Complete
+           Streaming Insert is 50% Complete
+           Streaming Insert is 60% Complete
+           Streaming Insert is 70% Complete
+           Streaming Insert is 80% Complete
+           Streaming Insert is 90% Complete
+           Streaming Insert is 100% Complete
+
+.. note::
+
+   If an error occurs while streaming data to BigQuery, see
+   `Troubleshooting BigQuery Errors <https://cloud.google.com/bigquery/troubleshooting-errors>`__.
+
+.. note::
+
+   The BigQuery SQL query language has some oddities, see the
+   `BigQuery Query Reference Documentation <https://developers.google.com/bigquery/query-reference>`__.
+
+.. note::
+
+   While BigQuery uses SQL-like syntax, it has some important differences from traditional
+   databases both in functionality, API limitations (size and quantity of queries or uploads),
+   and how Google charges for use of the service. You should refer to the `Google BigQuery documentation <https://developers.google.com/bigquery/>`__
+   often as the service seems to be changing and evolving. BigQuery is best for analyzing large
+   sets of data quickly, but it is not a direct replacement for a transactional database.
+
+Creating BigQuery Tables
+''''''''''''''''''''''''
+
+As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.create_table` which allows users
+to create a table in BigQuery. The only requirement is that the dataset must already exist.
+The schema may be generated from a pandas DataFrame using the :func:`~pandas.io.gbq.generate_bq_schema` function below.
+
+For example:
+
+.. code-block:: python
+
+    gbq.create_table('my_dataset.my_table', schema, projectid)
+
+As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will
+produce the dictionary representation schema of the specified pandas DataFrame.
+
+.. code-block:: python
+
+   In [10]: gbq.generate_bq_schema(df, default_type='STRING')
+
+   Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
+             {'name': 'my_bool2', 'type': 'BOOLEAN'},
+             {'name': 'my_dates', 'type': 'TIMESTAMP'},
+             {'name': 'my_float64', 'type': 'FLOAT'},
+             {'name': 'my_int64', 'type': 'INTEGER'},
+             {'name': 'my_string', 'type': 'STRING'}]}
+
+Deleting BigQuery Tables
+''''''''''''''''''''''''
+
+As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.delete_table` which allows users to delete a table
+in Google BigQuery.
+
+For example:
+
+.. code-block:: python
+
+    gbq.delete_table('my_dataset.my_table', projectid)
+
+The :func:`~pandas.io.gbq.table_exists` function can be used to check whether a table
+exists prior to attempting a delete. The return value is of type boolean.
+
+For example:
+
+.. code-block:: python
+
+   In [12]: gbq.table_exists('my_dataset.my_table', projectid)
+   Out[12]: True
+
+.. note::

-To use this module, you will need a valid BigQuery account. See
-<https://cloud.google.com/products/big-query> for details on the
-service.
+   If you delete and re-create a BigQuery table with the same name, but a different table schema,
+   you must wait 2 minutes before streaming data into the table. As a workaround, consider creating
+   the new table with a different name. Refer to
+   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

 .. _io.stata:
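
Taken together, the documentation changes above describe a small table lifecycle. The
following is a minimal end-to-end sketch of that workflow. It is not part of the commit
itself, and it assumes the placeholder project id "xxxxxxxx" and an existing dataset
``my_dataset``, as in the examples in the diff:

    import numpy as np
    import pandas as pd
    from pandas.io import gbq

    projectid = "xxxxxxxx"  # placeholder; substitute your BigQuery project id

    df = pd.DataFrame({'my_string': list('abc'),
                       'my_int64': list(range(1, 4)),
                       'my_float64': np.arange(4.0, 7.0)})

    # Build a BigQuery schema dict from the DataFrame's dtypes
    schema = gbq.generate_bq_schema(df, default_type='STRING')

    # Create the table only when it is not already present
    if not gbq.table_exists('my_dataset.my_table', projectid):
        gbq.create_table('my_dataset.my_table', schema, projectid)

    # Stream the rows; per the docstring, 'append' also creates the
    # table if it does not exist
    df.to_gbq('my_dataset.my_table', projectid, if_exists='append')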

doc/source/whatsnew/v0.17.0.txt

+10
@@ -320,6 +320,15 @@ has been changed to make this keyword unnecessary - the change is shown below.
 Excel files saved in version 0.16.2 or prior that had index names will still able to be read in,
 but the ``has_index_names`` argument must specified to ``True``.

+.. _whatsnew_0170.gbq:
+
+Google BigQuery Enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- Added ability to automatically create a table using the :func:`pandas.io.gbq.to_gbq` function if the destination table does not exist (:issue:`8325`).
+- Added ability to replace an existing table and schema when calling the :func:`pandas.io.gbq.to_gbq` function via the ``if_exists`` argument. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
+- Added the following functions to the gbq module: :func:`pandas.io.gbq.table_exists`, :func:`pandas.io.gbq.create_table`, and :func:`pandas.io.gbq.delete_table`. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
+- ``InvalidColumnOrder`` and ``InvalidPageToken`` in the gbq module will raise ``ValueError`` instead of ``IOError``.
+
 .. _whatsnew_0170.enhancements.other:

 Other enhancements

@@ -1138,3 +1147,4 @@ Bug Fixes
 - Bug in ``DatetimeIndex`` cannot infer negative freq (:issue:`11018`)
 - Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`)
 - Bug in ``Index`` dtype may not applied properly (:issue:`11017`)
+- Bug in ``io.gbq`` when testing for minimum google api client version (:issue:`10652`)
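
As a hedged illustration of the exception change noted in the last enhancement bullet
above (this snippet is not part of the commit, and the column name is hypothetical):

    # Per the note above, InvalidColumnOrder is now raised as a ValueError
    # rather than an IOError, so callers validating col_order catch ValueError.
    try:
        pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid,
                    col_order=['no_such_column'])
    except ValueError as err:
        print('invalid col_order:', err)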

pandas/core/frame.py

+8 -11

@@ -811,20 +811,12 @@ def to_dict(self, orient='dict'):
         else:
             raise ValueError("orient '%s' not understood" % orient)

-    def to_gbq(self, destination_table, project_id=None, chunksize=10000,
-               verbose=True, reauth=False):
+    def to_gbq(self, destination_table, project_id, chunksize=10000,
+               verbose=True, reauth=False, if_exists='fail'):
         """Write a DataFrame to a Google BigQuery table.

         THIS IS AN EXPERIMENTAL LIBRARY

-        If the table exists, the dataframe will be written to the table using
-        the defined table schema and column types. For simplicity, this method
-        uses the Google BigQuery streaming API. The to_gbq method chunks data
-        into a default chunk size of 10,000. Failures return the complete error
-        response which can be quite long depending on the size of the insert.
-        There are several important limitations of the Google streaming API
-        which are `here <https://developers.google.com/bigquery/streaming-data-into-bigquery>`__
-
         Parameters
         ----------
         dataframe : DataFrame

@@ -840,13 +832,18 @@ def to_gbq(self, destination_table, project_id=None, chunksize=10000,
         reauth : boolean (default False)
             Force Google BigQuery to reauthenticate the user. This is useful
             if multiple accounts are used.
+        if_exists : {'fail', 'replace', 'append'}, default 'fail'
+            'fail': If table exists, do nothing.
+            'replace': If table exists, drop it, recreate it, and insert data.
+            'append': If table exists, insert data. Create if does not exist.

+            .. versionadded:: 0.17.0
         """

         from pandas.io import gbq
         return gbq.to_gbq(self, destination_table, project_id=project_id,
                           chunksize=chunksize, verbose=verbose,
-                          reauth=reauth)
+                          reauth=reauth, if_exists=if_exists)

     @classmethod
     def from_records(cls, data, index=None, exclude=None, columns=None,
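
A brief usage sketch of the updated ``DataFrame.to_gbq`` signature (not part of the
commit; the destination table and project id are the placeholders used in the docs above):

    import pandas as pd

    projectid = "xxxxxxxx"  # placeholder project id

    df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

    # The default if_exists='fail' raises gbq.TableCreationError when the
    # destination table already exists; 'replace' drops and recreates it,
    # and 'append' streams onto the existing rows.
    df.to_gbq('my_dataset.my_table', projectid, chunksize=10000,
              verbose=True, if_exists='append')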
