
Commit 0b927c0

Author: Jacob Schaer (committed)

Updated Documentation for Google BigQuery Module, and Added to_gbq to frame.py

1 parent d686154 · commit 0b927c0

File tree

4 files changed: +156 −27 lines changed

doc/source/api.rst (+11)

```diff
@@ -89,8 +89,19 @@ SQL
    read_frame
    write_frame
 
+Google BigQuery
+~~~~~~~~~~~~~~~
+.. currentmodule:: pandas.io.gbq
+
+.. autosummary::
+   :toctree: generated/
+
+   read_gbq
+   to_gbq
+
 .. currentmodule:: pandas
 
+
 STATA
 ~~~~~
```
doc/source/io.rst (+46 −26)

```diff
@@ -2932,56 +2932,76 @@ if the source datatypes are compatible with BigQuery ones.
 For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
 
 As an example, suppose you want to load all data from an existing table
 ``test_dataset.test_table`` into BigQuery and pull it into a ``DataFrame``.
 
-::
+.. code-block:: python
 
    from pandas.io import gbq
-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table')
+
+   # Insert your BigQuery Project ID here.
+   # It can be found in the web console, or
+   # with the command line tool ``bq ls``.
+   projectid = "xxxxxxxx"
+
+   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id=projectid)
 
 The user will then be authenticated by the ``bq`` command line client -
 this usually involves the default browser opening to a login page,
 though the process can be done entirely from command line if necessary.
 Datasets and additional parameters can be either configured with ``bq``,
 passed in as options to :func:`~pandas.read_gbq`, or set using Google's
 ``gflags`` (this is not officially supported by this module, though care was
 taken to ensure that they should be followed regardless of how you call the
 method).
 
 Additionally, you can define which column to use as an index as well as a preferred column order as follows:
 
-::
+.. code-block:: python
 
    data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
                              index_col='index_column_name',
-                             col_order='[col1, col2, col3,...]')
+                             col_order=['col1', 'col2', 'col3'], project_id=projectid)
 
-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`,
-from the rows of DataFrame, `df`:
+Finally, if you would like to create a BigQuery table, ``my_dataset.my_table``,
+from the rows of the DataFrame ``df``:
 
-::
+.. code-block:: python
 
    df = pandas.DataFrame({'string_col_name': ['hello'],
                           'integer_col_name': [1],
                           'boolean_col_name': [True]})
    schema = ['STRING', 'INTEGER', 'BOOLEAN']
-   data_frame = gbq.to_gbq(df, 'my_dataset.my_table', if_exists='fail',
-                           schema=schema)
+   data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
+                           if_exists='fail', schema=schema, project_id=projectid)
 
 To add more rows to this, simply:
 
-::
+.. code-block:: python
 
    df2 = pandas.DataFrame({'string_col_name': ['hello2'],
                            'integer_col_name': [2],
                            'boolean_col_name': [False]})
-   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append')
+   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id=projectid)
 
 .. note::
 
+   A default project id can be set using the command line: ``bq init``.
+
    There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the
-   BigQuery SQL query language has some oddities, see `here
-   <https://developers.google.com/bigquery/query-reference>`__
+   BigQuery SQL query language has some oddities; see `here
+   <https://developers.google.com/bigquery/query-reference>`__.
+
+   You can access the management console to determine project ids here:
+   `<https://code.google.com/apis/console/b/0/?noredirect>`__
+
+.. warning::
+
+   To use this module, you will need a BigQuery account. See
+   `<https://cloud.google.com/products/big-query>`__ for details.
+
+   As of 10/10/13, there is a bug in Google's API preventing result sets
+   from being larger than 100,000 rows. A patch is scheduled for the week of
+   10/14/13.
 
 .. _io.stata:
```
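The ``schema`` list in the ``to_gbq`` example above must line up with the DataFrame's columns, one BigQuery type per column. A minimal sketch of deriving such a list automatically (``infer_gbq_schema`` and its dtype mapping are hypothetical helpers for illustration; they are not part of pandas):

```python
import pandas as pd

# Hypothetical helper (not part of pandas): derive a BigQuery schema
# list, as consumed by gbq.to_gbq, from a DataFrame's dtypes.
# The dtype-to-type mapping is an assumption for illustration only.
_DTYPE_TO_BQ = {'object': 'STRING', 'int64': 'INTEGER',
                'float64': 'FLOAT', 'bool': 'BOOLEAN'}

def infer_gbq_schema(df):
    """Return one BigQuery type name per DataFrame column, in order."""
    return [_DTYPE_TO_BQ[str(dtype)] for dtype in df.dtypes]

df = pd.DataFrame({'string_col_name': ['hello'],
                   'integer_col_name': [1],
                   'boolean_col_name': [True]})
schema = infer_gbq_schema(df)  # ['STRING', 'INTEGER', 'BOOLEAN']
```

Column order follows the DataFrame's own column order, so the derived list matches what ``to_gbq`` would insert.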

doc/source/v0.13.0.txt (+64 −1)

```diff
@@ -8,7 +8,7 @@ enhancements along with a large number of bug fixes.
 
 Highlights include support for a new index type ``Float64Index``, support for new methods of interpolation, updated ``timedelta`` operations, and a new string manipulation method ``extract``.
 Several experimental features are added, including new ``eval/query`` methods for expression evaluation, support for ``msgpack`` serialization,
-and an io interface to google's ``BigQuery``.
+and an io interface to Google's ``BigQuery``.
 
 .. warning::
 
@@ -648,6 +648,69 @@ Experimental
 
      os.remove('foo.msg')
 
+- ``pandas.io.gbq`` provides a simple way to extract from, and load data into,
+  Google's BigQuery Data Sets by way of pandas DataFrames. BigQuery is a high
+  performance SQL-like database service, useful for performing ad-hoc queries
+  against extremely large datasets. :ref:`See the docs <io.gbq>`.
+
+  .. code-block:: python
+
+     from pandas.io import gbq
+
+     # A query to select the average monthly temperatures in the
+     # year 2000 across the USA. The dataset,
+     # publicdata:samples.gsod, is available on all BigQuery accounts,
+     # and is based on NOAA gsod data.
+
+     query = """SELECT station_number as STATION,
+     month as MONTH, AVG(mean_temp) as MEAN_TEMP
+     FROM publicdata:samples.gsod
+     WHERE YEAR = 2000
+     GROUP BY STATION, MONTH
+     ORDER BY STATION, MONTH ASC"""
+
+     # Fetch the result set for this query.
+
+     # Your Google BigQuery Project ID; to find this, see your dashboard:
+     # https://code.google.com/apis/console/b/0/?noredirect
+     projectid = "xxxxxxxx"
+
+     df = gbq.read_gbq(query, project_id=projectid)
+
+     # Use pandas to process and reshape the dataset
+
+     df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
+     df3 = pandas.concat([df2.min(), df2.mean(), df2.max()],
+                         axis=1, keys=["Min Temp", "Mean Temp", "Max Temp"])
+
+  The resulting DataFrame is::
+
+             Min Temp  Mean Temp    Max Temp
+     MONTH
+     1    -53.336667  39.827892   89.770968
+     2    -49.837500  43.685219   93.437932
+     3    -77.926087  48.708355   96.099998
+     4    -82.892858  55.070087   97.317240
+     5    -92.378261  61.428117  102.042856
+     6    -77.703334  65.858888  102.900000
+     7    -87.821428  68.169663  106.510714
+     8    -89.431999  68.614215  105.500000
+     9    -86.611112  63.436935  107.142856
+     10   -78.209677  56.880838   92.103333
+     11   -50.125000  48.861228   94.996428
+     12   -50.332258  42.286879   94.396774
+
+  .. warning::
+
+     To use this module, you will need a BigQuery account. See
+     `<https://cloud.google.com/products/big-query>`__ for details.
+
+     As of 10/10/13, there is a bug in Google's API preventing result sets
+     from being larger than 100,000 rows. A patch is scheduled for the week of
+     10/14/13.
+
 .. _whatsnew_0130.refactoring:
 
 Internal Refactoring
```
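The pivot-and-concat reshape in the whatsnew example can be tried locally on stand-in data, with no BigQuery account; the toy station/month values below are invented for illustration:

```python
import pandas as pd

# Stand-in for the query result: two stations, two months each.
df = pd.DataFrame({'STATION': [1, 1, 2, 2],
                   'MONTH': [1, 2, 1, 2],
                   'MEAN_TEMP': [10.0, 20.0, 30.0, 40.0]})

# One row per station, one column per month.
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')

# Per-month min/mean/max across stations, side by side.
df3 = pd.concat([df2.min(), df2.mean(), df2.max()],
                axis=1, keys=['Min Temp', 'Mean Temp', 'Max Temp'])
```

``df3`` is indexed by MONTH with one column per statistic, which is exactly the shape of the temperature table shown in the whatsnew entry.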

pandas/core/frame.py (+35)

```diff
@@ -671,6 +671,41 @@ def to_dict(self, outtype='dict'):
         else:  # pragma: no cover
             raise ValueError("outtype %s not understood" % outtype)
 
+    def to_gbq(self, destination_table, schema=None, col_order=None,
+               if_exists='fail', **kwargs):
+        """
+        Write a DataFrame to a Google BigQuery table. If the table exists,
+        the DataFrame will be appended. If not, a new table will be created,
+        in which case the schema will have to be specified. By default,
+        rows will be written in the order they appear in the DataFrame, though
+        the user may specify an alternative order.
+
+        Parameters
+        ----------
+        destination_table : string
+            name of table to be written, in the form 'dataset.tablename'
+        schema : sequence (optional)
+            list of column types in order for data to be inserted, e.g.
+            ['INTEGER', 'TIMESTAMP', 'BOOLEAN']
+        col_order : sequence (optional)
+            order in which columns are to be inserted, e.g.
+            ['primary_key', 'birthday', 'username']
+        if_exists : {'fail', 'replace', 'append'} (optional)
+            fail: If table exists, raise TableExists.
+            replace: If table exists, drop it, recreate it, and insert data.
+            append: If table exists, insert data. Create if does not exist.
+        kwargs are passed to the Client constructor
+
+        Raises
+        ------
+        SchemaMissing :
+            Raised if the 'if_exists' parameter is set to 'replace', but
+            no schema is specified
+        TableExists :
+            Raised if the specified 'destination_table' exists but the
+            'if_exists' parameter is set to 'fail' (the default)
+        InvalidSchema :
+            Raised if the 'schema' parameter does not match the provided
+            DataFrame
+        """
+        from pandas.io import gbq
+        return gbq.to_gbq(self, destination_table, schema=schema,
+                          col_order=col_order, if_exists=if_exists, **kwargs)
+
     @classmethod
     def from_records(cls, data, index=None, exclude=None, columns=None,
                      coerce_float=False, nrows=None):
```
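The ``if_exists`` contract described in the docstring can be sketched against an in-memory stand-in (``to_gbq_sketch`` and the dict-backed store are illustrative assumptions, not the pandas implementation, which talks to the BigQuery service):

```python
# Illustrative sketch of the if_exists contract from the to_gbq
# docstring, using a plain dict as a stand-in table store.
# TableExists mirrors the documented exception name.

class TableExists(Exception):
    pass

def to_gbq_sketch(rows, destination_table, store, if_exists='fail'):
    if destination_table in store:
        if if_exists == 'fail':      # default: refuse to touch existing table
            raise TableExists(destination_table)
        if if_exists == 'replace':   # drop, recreate, and insert data
            store[destination_table] = list(rows)
        elif if_exists == 'append':  # insert additional rows
            store[destination_table].extend(rows)
    else:                            # create the table on first write
        store[destination_table] = list(rows)

store = {}
to_gbq_sketch([{'a': 1}], 'my_dataset.my_table', store, if_exists='append')
to_gbq_sketch([{'a': 2}], 'my_dataset.my_table', store, if_exists='append')
```

Note that ``'append'`` creates the table when it is missing, matching the docstring's "Create if does not exist".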
