@@ -3951,29 +3951,50 @@ The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
DataFrame with a shape and data types derived from the source table.
- Additionally, DataFrames can be appended to existing BigQuery tables if
- the destination table is the same shape as the DataFrame.
+ Additionally, DataFrames can be inserted into new BigQuery tables or appended
+ to existing tables.

- For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
+ .. warning::
+
+    To use this module, you will need a valid BigQuery account. Refer to the
+    `BigQuery Documentation <https://developers.google.com/bigquery/>`__ for details on the service itself.
+
+ The key functions are:

- As an example, suppose you want to load all data from an existing BigQuery
- table: `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
- function.
+ .. currentmodule:: pandas.io.gbq
+
+ .. autosummary::
+    :toctree: generated/
+
+    read_gbq
+    to_gbq
+    generate_bq_schema
+    create_table
+    delete_table
+    table_exists
+
+ .. currentmodule:: pandas
+
+ Querying
+ ''''''''
+
+ Suppose you want to load all data from an existing BigQuery table: `test_dataset.test_table`
+ into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function.

.. code-block:: python

   # Insert your BigQuery Project ID Here
   # Can be found in the Google web console
   projectid = "xxxxxxxx"

- data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+ data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)

You will then be authenticated to the specified BigQuery account
via Google's OAuth2 mechanism. In general, this is as simple as following the
prompts in a browser window which will be opened for you. Should the browser not
be available, or fail to launch, a code will be provided to complete the process
manually. Additional information on the authentication mechanism can be found
- `here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__
+ `here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.
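+
+ If you need to discard cached credentials and trigger the browser flow again (for
+ example, to switch Google accounts), a minimal sketch, assuming the ``reauth`` keyword
+ of :func:`~pandas.io.gbq.read_gbq`, looks like:
+
+ .. code-block:: python
+
+    # reauth=True forces a fresh OAuth2 grant instead of reusing cached credentials
+    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
+                             projectid, reauth=True)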

You can define which column from BigQuery to use as an index in the
destination DataFrame as well as a preferred column order as follows:
@@ -3982,56 +4003,167 @@ destination DataFrame as well as a preferred column order as follows:

   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                            index_col='index_column_name',
-                           col_order=['col1', 'col2', 'col3'], project_id = projectid)
-
- Finally, you can append data to a BigQuery table from a pandas DataFrame
- using the :func:`~pandas.io.to_gbq` function. This function uses the
- Google streaming API which requires that your destination table exists in
- BigQuery. Given the BigQuery table already exists, your DataFrame should
- match the destination table in column order, structure, and data types.
- DataFrame indexes are not supported. By default, rows are streamed to
- BigQuery in chunks of 10,000 rows, but you can pass other chunk values
- via the ``chunksize`` argument. You can also see the progress of your
- post via the ``verbose`` flag which defaults to ``True``. The http
- response code of Google BigQuery can be successful (200) even if the
- append failed. For this reason, if there is a failure to append to the
- table, the complete error response from BigQuery is returned which
- can be quite long given it provides a status for each row. You may want
- to start with smaller chunks to test that the size and types of your
- dataframe match your destination table to make debugging simpler.
+                           col_order=['col1', 'col2', 'col3'], project_id=projectid)
+
+ .. note::
+
+    You can find your project id in the `BigQuery management console <https://code.google.com/apis/console/b/0/?noredirect>`__.
+
+
+ .. note::
+
+    You can toggle the verbose output via the ``verbose`` flag which defaults to ``True``.
+
+ Writing DataFrames
+ ''''''''''''''''''
+
+ Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`.
+
+ .. ipython:: python
+
+    df = pd.DataFrame({'my_string': list('abc'),
+                       'my_int64': list(range(1, 4)),
+                       'my_float64': np.arange(4.0, 7.0),
+                       'my_bool1': [True, False, True],
+                       'my_bool2': [False, True, False],
+                       'my_dates': pd.date_range('now', periods=3)})
+
+    df
+    df.dtypes

.. code-block:: python

-    df = pandas.DataFrame({'string_col_name': ['hello'],
-                           'integer_col_name': [1],
-                           'boolean_col_name': [True]})
-    df.to_gbq('my_dataset.my_table', project_id = projectid)
+    df.to_gbq('my_dataset.my_table', projectid)
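+
+ To sanity-check the write, you could read the table back with :func:`~pandas.io.gbq.read_gbq`;
+ note that rows inserted via the streaming API may take some time to become visible to queries:
+
+ .. code-block:: python
+
+    # Read the table back; streamed rows can lag before they appear in query results
+    df2 = pd.read_gbq('SELECT * FROM my_dataset.my_table', projectid)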
+
+ .. note::
+
+    If the destination table does not exist, a new table will be created. The
+    destination dataset id must already exist in order for a new table to be created.
+
+ The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'``
+ or ``'append'`` if the destination table already exists. The default value is ``'fail'``.
+
+ For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise
+ a ``TableCreationError`` if the destination table already exists.
+
+ .. code-block:: python

- The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__
+
+    df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')
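+
+ If you would rather handle this case in code than let the error propagate, a minimal
+ sketch, assuming ``TableCreationError`` is importable from :mod:`pandas.io.gbq`, is:
+
+ .. code-block:: python
+
+    from pandas.io import gbq
+
+    try:
+        df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')
+    except gbq.TableCreationError:
+        # The table already exists; fall back to appending to it
+        df.to_gbq('my_dataset.my_table', projectid, if_exists='append')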

- While BigQuery uses SQL-like syntax, it has some important differences
- from traditional databases both in functionality, API limitations (size and
- quantity of queries or uploads), and how Google charges for use of the service.
- You should refer to Google documentation often as the service seems to
- be changing and evolving. BigQuery is best for analyzing large sets of
- data quickly, but it is not a direct replacement for a transactional database.
+ .. note::

- You can access the management console to determine project id's by:
- <https://code.google.com/apis/console/b/0/?noredirect>
+    If the ``if_exists`` argument is set to ``'append'``, the destination DataFrame will
+    be written to the table using the defined table schema and column types. The
+    DataFrame must match the destination table in column order, structure, and
+    data types.
+
+    If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a
+    different schema, a delay of 2 minutes will be forced to ensure that the new schema
+    has propagated in the Google environment. See
+    `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.
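+
+ For example, to append ``df`` to an existing table, or to replace the table outright:
+
+ .. code-block:: python
+
+    # Append rows; df must match the existing table's schema and column order
+    df.to_gbq('my_dataset.my_table', projectid, if_exists='append')
+
+    # Drop the existing table and re-create it with df's schema before writing
+    df.to_gbq('my_dataset.my_table', projectid, if_exists='replace')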

- As of 0.15.2, the gbq module has a function ``generate_bq_schema`` which
- will produce the dictionary representation of the schema.
+ Writing large DataFrames can result in errors due to size limitations being exceeded.
+ This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`.
+ For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time:

.. code-block:: python

-    df = pandas.DataFrame({'A': [1.0]})
-    gbq.generate_bq_schema(df, default_type='STRING')
+    df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)

- .. warning::
+ You can also see the progress of your post via the ``verbose`` flag which defaults to ``True``.
+ For example:
+
+ .. code-block:: python
+
+    In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)
+
+    Streaming Insert is 10% Complete
+    Streaming Insert is 20% Complete
+    Streaming Insert is 30% Complete
+    Streaming Insert is 40% Complete
+    Streaming Insert is 50% Complete
+    Streaming Insert is 60% Complete
+    Streaming Insert is 70% Complete
+    Streaming Insert is 80% Complete
+    Streaming Insert is 90% Complete
+    Streaming Insert is 100% Complete
+
+ .. note::
+
+    If an error occurs while streaming data to BigQuery, see
+    `Troubleshooting BigQuery Errors <https://cloud.google.com/bigquery/troubleshooting-errors>`__.
+
+ .. note::
+
+    The BigQuery SQL query language has some oddities, see the
+    `BigQuery Query Reference Documentation <https://developers.google.com/bigquery/query-reference>`__.
+
+ .. note::
+
+    While BigQuery uses SQL-like syntax, it has some important differences from traditional
+    databases both in functionality, API limitations (size and quantity of queries or uploads),
+    and how Google charges for use of the service. You should refer to `Google BigQuery documentation <https://developers.google.com/bigquery/>`__
+    often as the service seems to be changing and evolving. BigQuery is best for analyzing large
+    sets of data quickly, but it is not a direct replacement for a transactional database.
+
+
+ Creating BigQuery Tables
+ ''''''''''''''''''''''''
+
+ As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.create_table` which allows users
+ to create a table in BigQuery. The only requirement is that the dataset must already exist.
+ The schema may be generated from a pandas DataFrame using the :func:`~pandas.io.gbq.generate_bq_schema` function below.
+
+ For example:
+
+ .. code-block:: python
+
+    gbq.create_table('my_dataset.my_table', schema, projectid)
+
+ As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will
+ produce the dictionary representation of the schema for the specified pandas DataFrame.
+
+ .. code-block:: python
+
+    In [10]: gbq.generate_bq_schema(df, default_type='STRING')
+
+    Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
+              {'name': 'my_bool2', 'type': 'BOOLEAN'},
+              {'name': 'my_dates', 'type': 'TIMESTAMP'},
+              {'name': 'my_float64', 'type': 'FLOAT'},
+              {'name': 'my_int64', 'type': 'INTEGER'},
+              {'name': 'my_string', 'type': 'STRING'}]}
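+
+ Putting the two together, a minimal sketch that derives a schema from ``df`` and uses it
+ to create a new, empty table (``my_dataset.new_table`` is a hypothetical name):
+
+ .. code-block:: python
+
+    # Build a BigQuery schema dict from the DataFrame's dtypes, then create the table
+    schema = gbq.generate_bq_schema(df, default_type='STRING')
+    gbq.create_table('my_dataset.new_table', schema, projectid)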
+
+ Deleting BigQuery Tables
+ ''''''''''''''''''''''''
+
+ As of 0.17.0, the gbq module has a function :func:`~pandas.io.gbq.delete_table` which allows users to delete a table
+ in Google BigQuery.
+
+ For example:
+
+ .. code-block:: python
+
+    gbq.delete_table('my_dataset.my_table', projectid)
+
+ The :func:`~pandas.io.gbq.table_exists` function can be used to check whether a table
+ exists prior to calling ``create_table`` or ``delete_table``. The return value is boolean.
+
+ For example:
+
+ .. code-block:: python
+
+    In [12]: gbq.table_exists('my_dataset.my_table', projectid)
+    Out[12]: True
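+
+ Combining the two, a short sketch that deletes the table only if it is actually present:
+
+ .. code-block:: python
+
+    # Guard the delete so a missing table is not treated as an error
+    if gbq.table_exists('my_dataset.my_table', projectid):
+        gbq.delete_table('my_dataset.my_table', projectid)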
+
+ .. note::

-    To use this module, you will need a valid BigQuery account. See
-    <https://cloud.google.com/products/big-query> for details on the
-    service.
+
+    If you delete and re-create a BigQuery table with the same name, but a different table schema,
+    you must wait 2 minutes before streaming data into the table. As a workaround, consider creating
+    the new table with a different name. Refer to
+    `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

.. _io.stata: