Commit ae28073

Merge pull request #10857 from parthea/allow-creation-of-gbq-tables
ENH: #8325 Add ability to create tables using the gbq module.
2 parents 840f88e + 2622cb3

10 files changed: +745 -249 lines changed

ci/requirements-2.7.pip

+2
@@ -1 +1,3 @@
 blosc
+httplib2
+google-api-python-client == 1.2

ci/requirements-2.7.txt

-2
@@ -20,6 +20,4 @@ patsy
 pymysql=0.6.3
 html5lib=1.0b2
 beautiful-soup=4.2.1
-httplib2=0.8
 python-gflags=2.0
-google-api-python-client=1.2

ci/requirements-2.7_SLOW.txt

-2
@@ -20,6 +20,4 @@ psycopg2
 pymysql
 html5lib
 beautiful-soup
-httplib2
 python-gflags
-google-api-python-client

ci/requirements-3.4.pip

+2
@@ -1 +1,3 @@
 blosc
+httplib2
+google-api-python-client

doc/source/api.rst

+4
@@ -110,6 +110,10 @@ Google BigQuery

    read_gbq
    to_gbq
+   generate_bq_schema
+   create_table
+   delete_table
+   table_exists

 .. currentmodule:: pandas

doc/source/io.rst

+178 -46

@@ -3951,29 +3951,50 @@ The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
 analytics web service to simplify retrieving results from BigQuery tables
 using SQL-like queries. Result sets are parsed into a pandas
 DataFrame with a shape and data types derived from the source table.
-Additionally, DataFrames can be appended to existing BigQuery tables if
-the destination table is the same shape as the DataFrame.
+Additionally, DataFrames can be inserted into new BigQuery tables or appended
+to existing tables.

-For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
+.. warning::
+
+   To use this module, you will need a valid BigQuery account. Refer to the
+   `BigQuery Documentation <https://developers.google.com/bigquery/>`__ for details on the service itself.
+
+The key functions are:

-As an example, suppose you want to load all data from an existing BigQuery
-table : `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
-function.
+.. currentmodule:: pandas.io.gbq
+
+.. autosummary::
+   :toctree: generated/
+
+   read_gbq
+   to_gbq
+   generate_bq_schema
+   create_table
+   delete_table
+   table_exists
+
+.. currentmodule:: pandas
+
+Querying
+''''''''
+
+Suppose you want to load all data from an existing BigQuery table
+``test_dataset.test_table`` into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function.

 .. code-block:: python

     # Insert your BigQuery Project ID Here
     # Can be found in the Google web console
     projectid = "xxxxxxxx"

-    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)

 You will then be authenticated to the specified BigQuery account
 via Google's Oauth2 mechanism. In general, this is as simple as following the
 prompts in a browser window which will be opened for you. Should the browser not
 be available, or fail to launch, a code will be provided to complete the process
 manually. Additional information on the authentication mechanism can be found
-`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.

 You can define which column from BigQuery to use as an index in the
 destination DataFrame as well as a preferred column order as follows:

@@ -3982,56 +4003,167 @@ destination DataFrame as well as a preferred column order as follows:

 .. code-block:: python

     data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
-                             index_col='index_column_name',
-                             col_order=['col1', 'col2', 'col3'], project_id = projectid)
+                             projectid, index_col='index_column_name',
+                             col_order=['col1', 'col2', 'col3'])

-Finally, you can append data to a BigQuery table from a pandas DataFrame
-using the :func:`~pandas.io.to_gbq` function. This function uses the
-Google streaming API which requires that your destination table exists in
-BigQuery. Given the BigQuery table already exists, your DataFrame should
-match the destination table in column order, structure, and data types.
-DataFrame indexes are not supported. By default, rows are streamed to
-BigQuery in chunks of 10,000 rows, but you can pass other chuck values
-via the ``chunksize`` argument. You can also see the progess of your
-post via the ``verbose`` flag which defaults to ``True``. The http
-response code of Google BigQuery can be successful (200) even if the
-append failed. For this reason, if there is a failure to append to the
-table, the complete error response from BigQuery is returned which
-can be quite long given it provides a status for each row. You may want
-to start with smaller chunks to test that the size and types of your
-dataframe match your destination table to make debugging simpler.
+.. note::
+
+   You can find your project id in the `BigQuery management console <https://code.google.com/apis/console/b/0/?noredirect>`__.
+
+.. note::
+
+   You can toggle the verbose output via the ``verbose`` flag which defaults to ``True``.
+
+Writing DataFrames
+''''''''''''''''''
+
+Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'my_string': list('abc'),
+                      'my_int64': list(range(1, 4)),
+                      'my_float64': np.arange(4.0, 7.0),
+                      'my_bool1': [True, False, True],
+                      'my_bool2': [False, True, False],
+                      'my_dates': pd.date_range('now', periods=3)})
+
+   df
+   df.dtypes

 .. code-block:: python

-    df = pandas.DataFrame({'string_col_name' : ['hello'],
-                           'integer_col_name' : [1],
-                           'boolean_col_name' : [True]})
-    df.to_gbq('my_dataset.my_table', project_id = projectid)
+    df.to_gbq('my_dataset.my_table', projectid)
+
+.. note::
+
+   If the destination table does not exist, a new table will be created. The
+   destination dataset id must already exist in order for a new table to be created.
+
+The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'``
+or ``'append'`` if the destination table already exists. The default value is ``'fail'``.
+
+For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise
+a ``TableCreationError`` if the destination table already exists.
+
+.. code-block:: python

-The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__
+    df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')

-While BigQuery uses SQL-like syntax, it has some important differences
-from traditional databases both in functionality, API limitations (size and
-quantity of queries or uploads), and how Google charges for use of the service.
-You should refer to Google documentation often as the service seems to
-be changing and evolving. BiqQuery is best for analyzing large sets of
-data quickly, but it is not a direct replacement for a transactional database.
+.. note::

-You can access the management console to determine project id's by:
-<https://code.google.com/apis/console/b/0/?noredirect>
+   If the ``if_exists`` argument is set to ``'append'``, the destination dataframe will
+   be written to the table using the defined table schema and column types. The
+   dataframe must match the destination table in column order, structure, and
+   data types.
+   If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a
+   different schema, a delay of 2 minutes will be forced to ensure that the new schema
+   has propagated in the Google environment. See
+   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

-As of 0.15.2, the gbq module has a function ``generate_bq_schema`` which
-will produce the dictionary representation of the schema.
+Writing large DataFrames can result in errors due to size limitations being exceeded.
+This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`.
+For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time:

 .. code-block:: python

-    df = pandas.DataFrame({'A': [1.0]})
-    gbq.generate_bq_schema(df, default_type='STRING')
+    df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)

-.. warning::
+You can also see the progress of your post via the ``verbose`` flag which defaults to ``True``.
+For example:
+
+.. code-block:: python
+
+   In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)
+
+           Streaming Insert is 10% Complete
+           Streaming Insert is 20% Complete
+           Streaming Insert is 30% Complete
+           Streaming Insert is 40% Complete
+           Streaming Insert is 50% Complete
+           Streaming Insert is 60% Complete
+           Streaming Insert is 70% Complete
+           Streaming Insert is 80% Complete
+           Streaming Insert is 90% Complete
+           Streaming Insert is 100% Complete
+
+.. note::
+
+   If an error occurs while streaming data to BigQuery, see
+   `Troubleshooting BigQuery Errors <https://cloud.google.com/bigquery/troubleshooting-errors>`__.
+
+.. note::
+
+   The BigQuery SQL query language has some oddities, see the
+   `BigQuery Query Reference Documentation <https://developers.google.com/bigquery/query-reference>`__.
+
+.. note::
+
+   While BigQuery uses SQL-like syntax, it has some important differences from traditional
+   databases both in functionality, API limitations (size and quantity of queries or uploads),
+   and how Google charges for use of the service. You should refer to the `Google BigQuery documentation <https://developers.google.com/bigquery/>`__
+   often as the service seems to be changing and evolving. BigQuery is best for analyzing large
+   sets of data quickly, but it is not a direct replacement for a transactional database.
+
+Creating BigQuery Tables
+''''''''''''''''''''''''
+
+As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.create_table` which allows users
+to create a table in BigQuery. The only requirement is that the dataset must already exist.
+The schema may be generated from a pandas DataFrame using the :func:`~pandas.io.gbq.generate_bq_schema` function below.
+
+For example:
+
+.. code-block:: python
+
+    gbq.create_table('my_dataset.my_table', schema, projectid)
+
+As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will
+produce the dictionary representation schema of the specified pandas DataFrame.
+
+.. code-block:: python
+
+   In [10]: gbq.generate_bq_schema(df, default_type='STRING')
+
+   Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
+             {'name': 'my_bool2', 'type': 'BOOLEAN'},
+             {'name': 'my_dates', 'type': 'TIMESTAMP'},
+             {'name': 'my_float64', 'type': 'FLOAT'},
+             {'name': 'my_int64', 'type': 'INTEGER'},
+             {'name': 'my_string', 'type': 'STRING'}]}
+
+Deleting BigQuery Tables
+''''''''''''''''''''''''
+
+As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.delete_table` which allows users to delete a table
+in Google BigQuery.
+
+For example:
+
+.. code-block:: python
+
+    gbq.delete_table('my_dataset.my_table', projectid)
+
+The :func:`~pandas.io.gbq.table_exists` function can be used to check whether a table
+exists prior to attempting a delete. The return value is of type boolean.
+
+For example:
+
+.. code-block:: python
+
+   In [12]: gbq.table_exists('my_dataset.my_table', projectid)
+   Out[12]: True
+
+.. note::

-To use this module, you will need a valid BigQuery account. See
-<https://cloud.google.com/products/big-query> for details on the
-service.
+   If you delete and re-create a BigQuery table with the same name, but a different table schema,
+   you must wait 2 minutes before streaming data into the table. As a workaround, consider creating
+   the new table with a different name. Refer to
+   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

 .. _io.stata:
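
Taken together, the documentation changes above describe a small table lifecycle. The
following is a minimal end-to-end sketch of that workflow. It is not part of the commit
itself, and it assumes the placeholder project id "xxxxxxxx" and an existing dataset
``my_dataset``, as in the examples in the diff:

    import numpy as np
    import pandas as pd
    from pandas.io import gbq

    projectid = "xxxxxxxx"  # placeholder; substitute your BigQuery project id

    df = pd.DataFrame({'my_string': list('abc'),
                       'my_int64': list(range(1, 4)),
                       'my_float64': np.arange(4.0, 7.0)})

    # Build a BigQuery schema dict from the DataFrame's dtypes
    schema = gbq.generate_bq_schema(df, default_type='STRING')

    # Create the table only when it is not already present
    if not gbq.table_exists('my_dataset.my_table', projectid):
        gbq.create_table('my_dataset.my_table', schema, projectid)

    # Stream the rows; per the docstring, 'append' also creates the
    # table if it does not exist
    df.to_gbq('my_dataset.my_table', projectid, if_exists='append')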

doc/source/whatsnew/v0.17.0.txt

+10
@@ -320,6 +320,15 @@ has been changed to make this keyword unnecessary - the change is shown below.
 Excel files saved in version 0.16.2 or prior that had index names will still able to be read in,
 but the ``has_index_names`` argument must specified to ``True``.

+.. _whatsnew_0170.gbq:
+
+Google BigQuery Enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- Added ability to automatically create a table using the :func:`pandas.io.gbq.to_gbq` function if the destination table does not exist (:issue:`8325`).
+- Added ability to replace an existing table and schema when calling the :func:`pandas.io.gbq.to_gbq` function via the ``if_exists`` argument. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
+- Added the following functions to the gbq module: :func:`pandas.io.gbq.table_exists`, :func:`pandas.io.gbq.create_table`, and :func:`pandas.io.gbq.delete_table`. See the :ref:`docs <io.bigquery>` for more details (:issue:`8325`).
+- ``InvalidColumnOrder`` and ``InvalidPageToken`` in the gbq module will raise ``ValueError`` instead of ``IOError``.
+
 .. _whatsnew_0170.enhancements.other:

 Other enhancements

@@ -1138,3 +1147,4 @@ Bug Fixes
 - Bug in ``DatetimeIndex`` cannot infer negative freq (:issue:`11018`)
 - Remove use of some deprecated numpy comparison operations, mainly in tests. (:issue:`10569`)
 - Bug in ``Index`` dtype may not applied properly (:issue:`11017`)
+- Bug in ``io.gbq`` when testing for minimum google api client version (:issue:`10652`)
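
As a hedged illustration of the exception change noted in the last enhancement bullet
above (this snippet is not part of the commit, and the column name is hypothetical):

    # Per the note above, InvalidColumnOrder is now raised as a ValueError
    # rather than an IOError, so callers validating col_order catch ValueError.
    try:
        pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid,
                    col_order=['no_such_column'])
    except ValueError as err:
        print('invalid col_order:', err)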

pandas/core/frame.py

+8 -11

@@ -811,20 +811,12 @@ def to_dict(self, orient='dict'):
         else:
             raise ValueError("orient '%s' not understood" % orient)

-    def to_gbq(self, destination_table, project_id=None, chunksize=10000,
-               verbose=True, reauth=False):
+    def to_gbq(self, destination_table, project_id, chunksize=10000,
+               verbose=True, reauth=False, if_exists='fail'):
         """Write a DataFrame to a Google BigQuery table.

         THIS IS AN EXPERIMENTAL LIBRARY

-        If the table exists, the dataframe will be written to the table using
-        the defined table schema and column types. For simplicity, this method
-        uses the Google BigQuery streaming API. The to_gbq method chunks data
-        into a default chunk size of 10,000. Failures return the complete error
-        response which can be quite long depending on the size of the insert.
-        There are several important limitations of the Google streaming API
-        which are `here <https://developers.google.com/bigquery/streaming-data-into-bigquery>`__
-
         Parameters
         ----------
         dataframe : DataFrame

@@ -840,13 +832,18 @@ def to_gbq(self, destination_table, project_id=None, chunksize=10000,
         reauth : boolean (default False)
             Force Google BigQuery to reauthenticate the user. This is useful
             if multiple accounts are used.
+        if_exists : {'fail', 'replace', 'append'}, default 'fail'
+            'fail': If table exists, do nothing.
+            'replace': If table exists, drop it, recreate it, and insert data.
+            'append': If table exists, insert data. Create if does not exist.

+            .. versionadded:: 0.17.0
         """

         from pandas.io import gbq
         return gbq.to_gbq(self, destination_table, project_id=project_id,
                           chunksize=chunksize, verbose=verbose,
-                          reauth=reauth)
+                          reauth=reauth, if_exists=if_exists)

     @classmethod
     def from_records(cls, data, index=None, exclude=None, columns=None,
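
A brief usage sketch of the updated ``DataFrame.to_gbq`` signature (not part of the
commit; the destination table and project id are the placeholders used in the docs above):

    import pandas as pd

    projectid = "xxxxxxxx"  # placeholder project id

    df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

    # The default if_exists='fail' raises gbq.TableCreationError when the
    # destination table already exists; 'replace' drops and recreates it,
    # and 'append' streams onto the existing rows.
    df.to_gbq('my_dataset.my_table', projectid, chunksize=10000,
              verbose=True, if_exists='append')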
