ENH: Google BigQuery IO Module #4140

Closed
wants to merge 2 commits into from
1 change: 1 addition & 0 deletions ci/requirements-2.6.txt
@@ -4,3 +4,4 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
bigquery==2.0.15
1 change: 1 addition & 0 deletions ci/requirements-2.7.txt
@@ -18,3 +18,4 @@ MySQL-python==1.2.4
scipy==0.10.0
beautifulsoup4==4.2.1
statsmodels==0.5.0
bigquery==2.0.15
1 change: 1 addition & 0 deletions ci/requirements-2.7_LOCALE.txt
@@ -16,3 +16,4 @@ lxml==3.2.1
scipy==0.10.0
beautifulsoup4==4.2.1
statsmodels==0.5.0
bigquery==2.0.15
1 change: 1 addition & 0 deletions doc/source/install.rst
@@ -114,6 +114,7 @@ Optional Dependencies
:func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
distributions will have xclip and/or xsel immediately available for
installation.
* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__: Needed for :mod:`pandas.io.gbq`
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:

65 changes: 65 additions & 0 deletions doc/source/io.rst
@@ -38,6 +38,7 @@ object.
* ``read_json``
* ``read_msgpack`` (experimental)
* ``read_html``
* ``read_gbq`` (experimental)
* ``read_stata``
* ``read_clipboard``
* ``read_pickle``
@@ -51,6 +52,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
* ``to_json``
* ``to_msgpack`` (experimental)
* ``to_html``
* ``to_gbq`` (experimental)
* ``to_stata``
* ``to_clipboard``
* ``to_pickle``
@@ -2905,7 +2907,70 @@ There are a few other available functions:
For now, writing your DataFrame into a database works only with
**SQLite**. Moreover, the **index** will currently be **dropped**.

Google BigQuery (Experimental)
------------------------------

The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
DataFrame with a shape derived from the source table. Additionally,
DataFrames can be uploaded into BigQuery datasets as tables
if their datatypes are compatible with BigQuery's. The module's overall
structure and functions are loosely based on those in
:mod:`pandas.io.sql`.

For specifics on the service itself, see https://developers.google.com/bigquery/

As an example, suppose you want to load all data from an existing BigQuery
table, ``test_dataset.test_table``, into a DataFrame.

.. code-block:: python

   from pandas.io import gbq
   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table')

The user is then authenticated by the ``bq`` command-line client. This
usually involves the default browser opening a login page, though the
process can be completed entirely from the command line if necessary.
Datasets and additional parameters can be configured with ``bq``,
passed as options to ``read_gbq``, or set using Google's gflags. The
last option is not officially supported by this module, though care was
taken to ensure that the flags are honored regardless of how you call
the method.

Additionally, you can define which column to use as an index as well as a preferred column order as follows:

.. code-block:: python

   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
                             index_col='index_column_name', col_order=['col1', 'col2', 'col3'])

Finally, to create a BigQuery table, ``my_dataset.my_table``, from the rows of a DataFrame, ``df``:

.. code-block:: python

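   import pandas

   # Pass ``columns`` explicitly so the column order matches ``schema``.
   df = pandas.DataFrame({'string_col_name': ['hello'],
                          'integer_col_name': [1],
                          'boolean_col_name': [True]},
                         columns=['string_col_name', 'integer_col_name',
                                  'boolean_col_name'])
   schema = ['STRING', 'INTEGER', 'BOOLEAN']
   data_frame = gbq.to_gbq(df, 'my_dataset.my_table', if_exists='fail', schema=schema)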
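
The ``schema`` list above has to line up with the DataFrame's columns, one BigQuery type name per column. As a minimal sketch of how such a list could be derived from a sample row, the helper below maps plain Python values to type names. ``infer_bq_schema`` and its type table are hypothetical and not part of :mod:`pandas.io.gbq`; only the ``STRING``, ``INTEGER`` and ``BOOLEAN`` names come from the example above, and ``FLOAT`` is added as an assumption.

```python
# Hypothetical mapping from Python value types to BigQuery type names.
# bool is listed before int because bool is a subclass of int.
_TYPE_NAMES = [
    (bool, 'BOOLEAN'),
    (int, 'INTEGER'),
    (float, 'FLOAT'),
    (str, 'STRING'),
]


def infer_bq_schema(row):
    """Return one BigQuery type name per value in ``row``.

    ``row`` is a sequence of sample values, one per column, in column order.
    """
    schema = []
    for value in row:
        for py_type, bq_name in _TYPE_NAMES:
            if isinstance(value, py_type):
                schema.append(bq_name)
                break
        else:
            # No entry in the table matched this value.
            raise TypeError('no BigQuery type known for %r' % (value,))
    return schema


schema = infer_bq_schema(['hello', 1, True])
print(schema)  # ['STRING', 'INTEGER', 'BOOLEAN']
```

Checking ``bool`` first matters: otherwise ``True`` would be reported as ``INTEGER``.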

To append more rows to this table, pass ``if_exists='append'``:

.. code-block:: python

   df2 = pandas.DataFrame({'string_col_name': ['hello2'],
                           'integer_col_name': [2],
                           'boolean_col_name': [False]},
                          columns=['string_col_name', 'integer_col_name',
                                   'boolean_col_name'])
   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append')
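
To make the difference between ``if_exists='fail'`` and ``if_exists='append'`` concrete, here is a toy in-memory model of the two modes. It is only a sketch: ``write_table`` is a hypothetical helper that never contacts BigQuery, and the exception type and the behaviour of ``'append'`` on a missing table are assumptions rather than documented behaviour of ``to_gbq``.

```python
def write_table(tables, name, rows, if_exists='fail'):
    """Toy model of ``if_exists``; ``tables`` maps table names to row lists."""
    if if_exists not in ('fail', 'append'):
        raise ValueError('unsupported if_exists value: %r' % (if_exists,))
    if name in tables:
        if if_exists == 'fail':
            # Refuse to touch a table that is already there.
            raise ValueError('table %s already exists' % name)
        tables[name].extend(rows)
    else:
        # Assumption: both modes create the table when it is missing.
        tables[name] = list(rows)
    return tables


tables = {}
write_table(tables, 'my_dataset.my_table', [('hello', 1, True)], if_exists='fail')
write_table(tables, 'my_dataset.my_table', [('hello2', 2, False)], if_exists='append')
print(len(tables['my_dataset.my_table']))  # 2
```

Calling the helper a second time with ``if_exists='fail'`` raises, which mirrors why the second upload above switches to ``'append'``.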



.. note::

   There is a hard cap of 128 MB (compressed) on BigQuery result sets. Also,
   the BigQuery SQL query language has some oddities; see
   https://developers.google.com/bigquery/query-reference
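
Given the cap on result sets, one way to keep a query small is to select only the columns you need and add a ``LIMIT`` clause instead of using ``SELECT *``. The sketch below assembles such a query string; ``build_query`` is a hypothetical helper, not part of :mod:`pandas.io.gbq`, and it does no quoting or escaping, so it is only suitable for trusted table and column names.

```python
def build_query(table, columns=None, limit=None):
    """Build a simple SELECT statement for ``table``.

    ``columns`` is an optional list of column names (all columns if omitted);
    ``limit`` is an optional maximum number of rows.
    """
    select_list = ', '.join(columns) if columns else '*'
    query = 'SELECT %s FROM %s' % (select_list, table)
    if limit is not None:
        query += ' LIMIT %d' % limit
    return query


query = build_query('test_dataset.test_table', ['col1', 'col2'], limit=1000)
print(query)  # SELECT col1, col2 FROM test_dataset.test_table LIMIT 1000
```

The resulting string can then be passed as the first argument to ``read_gbq``.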

STATA Format
------------

1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -78,6 +78,7 @@ Experimental Features
- Add msgpack support via ``pd.read_msgpack()`` and ``pd.to_msgpack()`` / ``df.to_msgpack()`` for serialization
of arbitrary pandas (and python objects) in a lightweight portable binary format (:issue:`686`)
- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
- Added :mod:`pandas.io.gbq` for reading Google BigQuery query results into a DataFrame and writing DataFrames to BigQuery tables. (:issue:`4140`)

Improvements to existing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~