@@ -4652,293 +4652,12 @@ And then issue the following queries:
 Google BigQuery
 ---------------
 
-.. versionadded:: 0.13.0
-
-The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
-analytics web service to simplify retrieving results from BigQuery tables
-using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape and data types derived from the source table.
-Additionally, DataFrames can be inserted into new BigQuery tables or appended
-to existing tables.
-
-.. warning::
-
-   To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
-   for details on the service itself.
-
-The key functions are:
-
-.. currentmodule:: pandas.io.gbq
-
-.. autosummary::
-   :toctree: generated/
-
-   read_gbq
-   to_gbq
-
-.. currentmodule:: pandas
-
-
-Supported Data Types
-''''''''''''''''''''
-
-Pandas supports the following `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
-``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
-``TIMESTAMP`` (microsecond precision). The data types ``BYTES`` and ``RECORD``
-are not supported.
-
-Integer and boolean ``NA`` handling
-'''''''''''''''''''''''''''''''''''
-
-.. versionadded:: 0.20
-
-Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
-support for integer and boolean types, this module will store ``INTEGER`` or
-``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
-Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``
-respectively.
-
-This is the opposite of the default pandas behaviour, which promotes integer
-types to float in order to store NAs. See the :ref:`gotchas <gotchas.intna>`
-for a detailed explanation.
-
-While this trade-off works well for most cases, it breaks down when storing
-values greater than 2**53. Such values in BigQuery can represent identifiers,
-and unnoticed precision loss for identifiers is what we want to avoid.
-
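-The cutoff at 2**53 follows from the 64-bit float format: beyond that point
-not every integer is exactly representable, as this quick check (plain
-Python, independent of BigQuery) illustrates:
-
-.. code-block:: python
-
-   n = 2 ** 53
-   # n + 1 rounds back to n when coerced to float, silently losing precision
-   float(n + 1) == float(n)  # True
-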
-.. _io.bigquery_deps:
-
-Dependencies
-''''''''''''
-
-This module requires the following additional dependencies:
-
-- `httplib2 <https://github.com/httplib2/httplib2>`__: HTTP client
-- `google-api-python-client <http://github.com/google/google-api-python-client>`__: Google's API client
-- `oauth2client <https://github.com/google/oauth2client>`__: authentication and authorization for Google's API
-
-.. _io.bigquery_authentication:
-
-Authentication
-''''''''''''''
-
-.. versionadded:: 0.18.0
-
-Authentication to the Google ``BigQuery`` service is via ``OAuth 2.0``.
-It is possible to authenticate with either user account credentials or service account credentials.
-
-Authenticating with user account credentials is as simple as following the prompts in a browser window,
-which will be opened for you automatically. You will be authenticated to the specified
-``BigQuery`` account using the product name ``pandas GBQ``. This flow only works on the local host;
-remote authentication using user account credentials is not currently supported in pandas.
-Additional information on the authentication mechanism can be found
-`here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.
-
-Authentication with service account credentials is possible via the ``private_key`` parameter. This method
-is particularly useful when working on remote servers (e.g. a Jupyter notebook on a remote host).
-Additional information on service accounts can be found
-`here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.
-
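-As a minimal sketch (the project id and key path are placeholders):
-
-.. code-block:: python
-
-   # Authenticate with a service account key file instead of the browser flow
-   df = pd.read_gbq('SELECT * FROM test_dataset.test_table',
-                    project_id='my-project-id',
-                    private_key='/path/to/key.json')
-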
-Authentication via ``application default credentials`` is also possible. This is only valid
-if the ``private_key`` parameter is not provided. This method also requires that
-the credentials can be fetched from the environment the code is running in.
-Otherwise, the OAuth2 client-side authentication is used.
-Additional information is available in the documentation on
-`application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__.
-
-.. versionadded:: 0.19.0
-
-.. note::
-
-   The ``private_key`` parameter can be set to either the file path of the service account key
-   in JSON format, or the key contents of the service account key in JSON format.
-
-.. note::
-
-   A private key can be obtained from the Google developers console by clicking
-   `here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use the JSON key type.
-
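-A sketch of the two forms described in the note above (paths are placeholders):
-
-.. code-block:: python
-
-   # Pass the path to the JSON key file ...
-   pd.read_gbq('SELECT 1', 'my-project-id', private_key='/path/to/key.json')
-
-   # ... or the JSON key contents themselves
-   with open('/path/to/key.json') as f:
-       pd.read_gbq('SELECT 1', 'my-project-id', private_key=f.read())
-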
-.. _io.bigquery_reader:
-
-Querying
-''''''''
-
-Suppose you want to load all data from an existing BigQuery table ``test_dataset.test_table``
-into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function.
-
-.. code-block:: python
-
-   # Insert your BigQuery Project ID here;
-   # it can be found in the Google web console
-   projectid = "xxxxxxxx"
-
-   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)
-
-
-You can define which column from BigQuery to use as an index in the
-destination DataFrame as well as a preferred column order as follows:
-
-.. code-block:: python
-
-   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
-                            projectid,
-                            index_col='index_column_name',
-                            col_order=['col1', 'col2', 'col3'])
-
-
-Starting with 0.20.0, you can specify the query config as a parameter to use additional options for your job.
-For more information about query configuration parameters, see
-`here <https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query>`__.
-
-.. code-block:: python
-
-   configuration = {
-       'query': {
-           "useQueryCache": False
-       }
-   }
-   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
-                            projectid,
-                            configuration=configuration)
-
-
-.. note::
-
-   You can find your project id in the `Google developers console <https://console.developers.google.com>`__.
-
-
-.. note::
-
-   You can toggle the verbose output via the ``verbose`` flag, which defaults to ``True``.
-
-.. note::
-
-   The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL
-   or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``. For more information
-   on BigQuery's standard SQL, see the `BigQuery SQL Reference
-   <https://cloud.google.com/bigquery/sql-reference/>`__.
-
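-For example, to run the earlier query against standard SQL (a sketch; note
-that standard SQL quotes table names with backticks rather than the legacy
-bracket syntax):
-
-.. code-block:: python
-
-   data_frame = pd.read_gbq('SELECT * FROM `test_dataset.test_table`',
-                            projectid, dialect='standard')
-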
-.. _io.bigquery_writer:
-
-Writing DataFrames
-''''''''''''''''''
-
-Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`.
-
-.. ipython:: python
-
-   df = pd.DataFrame({'my_string': list('abc'),
-                      'my_int64': list(range(1, 4)),
-                      'my_float64': np.arange(4.0, 7.0),
-                      'my_bool1': [True, False, True],
-                      'my_bool2': [False, True, False],
-                      'my_dates': pd.date_range('now', periods=3)})
-
-   df
-   df.dtypes
-
-.. code-block:: python
-
-   df.to_gbq('my_dataset.my_table', projectid)
-
-.. note::
-
-   The destination table and destination dataset will automatically be created if they do not already exist.
-
-The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'``
-or ``'append'`` if the destination table already exists. The default value is ``'fail'``.
-
-For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise
-a ``TableCreationError`` if the destination table already exists.
-
-.. code-block:: python
-
-   df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')
-
-.. note::
-
-   If the ``if_exists`` argument is set to ``'append'``, the destination dataframe will
-   be written to the table using the defined table schema and column types. The
-   dataframe must match the destination table in structure and data types.
-   If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a
-   different schema, a delay of 2 minutes will be forced to ensure that the new schema
-   has propagated in the Google environment. See
-   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.
-
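-Continuing the example above, the other two modes might look like this
-(a sketch):
-
-.. code-block:: python
-
-   # Overwrite the existing table (a schema change forces the 2 minute delay)
-   df.to_gbq('my_dataset.my_table', projectid, if_exists='replace')
-
-   # Append rows; df must match the existing table's schema
-   df.to_gbq('my_dataset.my_table', projectid, if_exists='append')
-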
-Writing large DataFrames can result in errors due to size limitations being exceeded.
-This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`.
-For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time:
-
-.. code-block:: python
-
-   df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)
-
-You can also see the progress of your post via the ``verbose`` flag, which defaults to ``True``.
-For example:
-
-.. code-block:: ipython
-
-   In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)
-
-           Streaming Insert is 10% Complete
-           Streaming Insert is 20% Complete
-           Streaming Insert is 30% Complete
-           Streaming Insert is 40% Complete
-           Streaming Insert is 50% Complete
-           Streaming Insert is 60% Complete
-           Streaming Insert is 70% Complete
-           Streaming Insert is 80% Complete
-           Streaming Insert is 90% Complete
-           Streaming Insert is 100% Complete
-
-.. note::
-
-   If an error occurs while streaming data to BigQuery, see
-   `Troubleshooting BigQuery Errors <https://cloud.google.com/bigquery/troubleshooting-errors>`__.
-
-.. note::
-
-   The BigQuery SQL query language has some oddities; see the
-   `BigQuery Query Reference Documentation <https://cloud.google.com/bigquery/query-reference>`__.
-
-.. note::
-
-   While BigQuery uses SQL-like syntax, it has some important differences from traditional
-   databases, both in functionality, in API limitations (on the size and number of queries
-   or uploads), and in how Google charges for use of the service. You should refer to the
-   `Google BigQuery documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
-   often, as the service is continually changing and evolving. BigQuery is best for analyzing
-   large sets of data quickly, but it is not a direct replacement for a transactional database.
-
-.. _io.bigquery_create_tables:
-
-Creating BigQuery Tables
-''''''''''''''''''''''''
-
 .. warning::
 
-   As of 0.17, the function :func:`~pandas.io.gbq.generate_bq_schema` has been deprecated and will be
-   removed in a future version.
-
-As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will
-produce the dictionary representation schema of the specified pandas DataFrame.
-
-.. code-block:: ipython
-
-   In [10]: gbq.generate_bq_schema(df, default_type='STRING')
-
-   Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
-                        {'name': 'my_bool2', 'type': 'BOOLEAN'},
-                        {'name': 'my_dates', 'type': 'TIMESTAMP'},
-                        {'name': 'my_float64', 'type': 'FLOAT'},
-                        {'name': 'my_int64', 'type': 'INTEGER'},
-                        {'name': 'my_string', 'type': 'STRING'}]}
-
-.. note::
-
-   If you delete and re-create a BigQuery table with the same name, but a different table schema,
-   you must wait 2 minutes before streaming data into the table. As a workaround, consider creating
-   the new table with a different name. Refer to
-   `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.
+   Starting in 0.20.0, pandas has split off Google BigQuery support into the
+   separate package ``pandas-gbq``. You can ``pip install pandas-gbq`` to get it.
 
+Documentation is now hosted `here <https://pandas-gbq.readthedocs.io/>`__.
 
 .. _io.stata:
 