ENH: Use tz-aware dtype for timestamp columns #263

Closed · wants to merge 7 commits
2 changes: 1 addition & 1 deletion ci/requirements-2.7.pip
@@ -1,5 +1,5 @@
 mock
-pandas==0.19.0
+pandas==0.20.0
 google-auth==1.4.1
 google-auth-oauthlib==0.0.1
 google-cloud-bigquery==1.9.0
2 changes: 1 addition & 1 deletion ci/requirements-3.5.pip
@@ -1,4 +1,4 @@
-pandas==0.19.0
+pandas==0.20.0
 google-auth==1.4.1
 google-auth-oauthlib==0.0.1
 google-cloud-bigquery==1.9.0
14 changes: 11 additions & 3 deletions docs/source/changelog.rst
@@ -6,14 +6,12 @@ Changelog
 0.10.0 / TBD
 ------------
 
-- This fixes a bug where pandas-gbq could not upload an empty database. (:issue:`237`)
-
 Dependency updates
 ~~~~~~~~~~~~~~~~~~
 
 - Update the minimum version of ``google-cloud-bigquery`` to 1.9.0.
   (:issue:`247`)
-- Update the minimum version of ``pandas`` to 0.19.0. (:issue:`262`)
+- Update the minimum version of ``pandas`` to 0.20.0. (:issue:`263`)
 
 Internal changes
 ~~~~~~~~~~~~~~~~
@@ -23,11 +21,21 @@ Internal changes

 Enhancements
 ~~~~~~~~~~~~
+
 - Allow ``table_schema`` in :func:`to_gbq` to contain only a subset of columns,
   with the rest being populated using the DataFrame dtypes (:issue:`218`)
   (contributed by @johnpaton)
 - Read ``project_id`` in :func:`to_gbq` from provided ``credentials`` if
   available (contributed by @daureg)
+- ``read_gbq`` uses the timezone-aware ``DatetimeTZDtype(unit='ns',
+  tz='UTC')`` dtype for BigQuery ``TIMESTAMP`` columns. (:issue:`263`)
+
+Bug fixes
+~~~~~~~~~
+
+- Fix a bug where pandas-gbq could not upload an empty database.
+  (:issue:`237`)
 
 
 .. _changelog-0.9.0:

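As context for the ``table_schema`` enhancement listed above, here is a minimal sketch of passing a partial schema to ``to_gbq``; the project, dataset, and table names are placeholders, and the fallback behavior for unlisted columns is as described in the changelog entry.

```python
import pandas
import pandas_gbq

df = pandas.DataFrame(
    {
        "name": ["alpha", "beta"],
        "created": pandas.Timestamp("2019-02-01", tz="UTC"),
    }
)

# Only `created` is pinned to TIMESTAMP; the `name` column's BigQuery
# type is populated from the DataFrame dtype (placeholder names below).
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",
    project_id="my-project",
    table_schema=[{"name": "created", "type": "TIMESTAMP"}],
)
```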
64 changes: 47 additions & 17 deletions docs/source/reading.rst
@@ -9,21 +9,32 @@ Suppose you want to load all data from an existing BigQuery table

 .. code-block:: python
 
-    # Insert your BigQuery Project ID Here
-    # Can be found in the Google web console
+    import pandas_gbq
+
+    # TODO: Set your BigQuery Project ID.
     projectid = "xxxxxxxx"
 
-    data_frame = read_gbq('SELECT * FROM test_dataset.test_table', projectid)
+    data_frame = pandas_gbq.read_gbq(
+        'SELECT * FROM `test_dataset.test_table`',
+        project_id=projectid)
 
+.. note::
+
+    A project ID is sometimes optional if it can be inferred during
+    authentication, but it is required when authenticating with user
+    credentials. You can find your project ID in the `Google Cloud console
+    <https://console.cloud.google.com>`__.
+
 You can define which column from BigQuery to use as an index in the
 destination DataFrame as well as a preferred column order as follows:
 
 .. code-block:: python
 
-    data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
-                          index_col='index_column_name',
-                          col_order=['col1', 'col2', 'col3'], projectid)
+    data_frame = pandas_gbq.read_gbq(
+        'SELECT * FROM `test_dataset.test_table`',
+        project_id=projectid,
+        index_col='index_column_name',
+        col_order=['col1', 'col2', 'col3'])
 
 
 You can specify the query config as parameter to use additional options of
@@ -37,20 +48,39 @@ your job. For more information about query configuration parameters see `here
        "useQueryCache": False
        }
    }
-    data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
-                          configuration=configuration, projectid)
+    data_frame = read_gbq(
+        'SELECT * FROM `test_dataset.test_table`',
+        project_id=projectid,
+        configuration=configuration)
 
 
 .. note::
+
+    The ``dialect`` argument can be used to indicate whether to use
+    BigQuery's ``'legacy'`` SQL or BigQuery's ``'standard'`` SQL (beta). The
+    default value is ``'standard'``. For more information on BigQuery's standard
+    SQL, see `BigQuery SQL Reference
+    <https://cloud.google.com/bigquery/docs/reference/standard-sql/>`__
 
-    You can find your project id in the `Google developers console
-    <https://console.developers.google.com>`__.
+.. code-block:: python
+
+    data_frame = pandas_gbq.read_gbq(
+        'SELECT * FROM [test_dataset.test_table]',
+        project_id=projectid,
+        dialect='legacy')
 
-.. note::
-
-    The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL
-    or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``, though this will change
-    in a subsequent release to ``'standard'``. For more information
-    on BigQuery's standard SQL, see `BigQuery SQL Reference
-    <https://cloud.google.com/bigquery/sql-reference/>`__
+
+.. _reading-dtypes:
+
+Inferring the DataFrame's dtypes
+--------------------------------
+
+The :func:`~pandas_gbq.read_gbq` method infers the pandas dtype for each column, based on the BigQuery table schema.
+
+================== ====================================
+BigQuery Data Type dtype
+================== ====================================
+FLOAT              float
+TIMESTAMP          DatetimeTZDtype(unit='ns', tz='UTC')
+DATETIME           datetime64[ns]
+TIME               datetime64[ns]
+DATE               datetime64[ns]
+================== ====================================
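To illustrate the mapping in the table above, a small local sketch of the tz-aware dtype that ``read_gbq`` now uses for ``TIMESTAMP`` columns (built with plain pandas rather than an actual BigQuery query):

```python
import pandas

# TIMESTAMP columns now arrive timezone-aware rather than naive.
utc_col = pandas.Series(
    pandas.to_datetime(["2019-02-04 12:00:00"]).tz_localize("UTC")
)
print(utc_col.dtype)  # datetime64[ns, UTC]

# DATETIME, TIME, and DATE columns stay timezone-naive.
naive_col = pandas.Series(pandas.to_datetime(["2019-02-04 12:00:00"]))
print(naive_col.dtype)  # datetime64[ns]
```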
25 changes: 14 additions & 11 deletions pandas_gbq/gbq.py
@@ -644,21 +644,24 @@ def delete_and_recreate_table(self, dataset_id, table_id, table_schema):


 def _bqschema_to_nullsafe_dtypes(schema_fields):
-    # Only specify dtype when the dtype allows nulls. Otherwise, use pandas's
-    # default dtype choice.
-    #
-    # See:
-    # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
-    # #missing-data-casting-rules-and-indexing
+    """Specify explicit dtypes based on BigQuery schema.
+
+    This function only specifies a dtype when the dtype allows nulls.
+    Otherwise, use pandas's default dtype choice.
+
+    See: http://pandas.pydata.org/pandas-docs/dev/missing_data.html
+    #missing-data-casting-rules-and-indexing
+    """
+    import pandas
+
+    # If you update this mapping, also update the table at
+    # `docs/source/reading.rst`.
     dtype_map = {
         "FLOAT": np.dtype(float),
-        # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
-        # support datetime64[ns, UTC] as dtype in DataFrame constructors. See:
-        # https://github.com/pandas-dev/pandas/issues/12513
-        "TIMESTAMP": "datetime64[ns]",
+        "TIMESTAMP": pandas.DatetimeTZDtype(tz="UTC"),
+        "DATETIME": "datetime64[ns]",
         "TIME": "datetime64[ns]",
         "DATE": "datetime64[ns]",
-        "DATETIME": "datetime64[ns]",
     }
 
     dtypes = {}
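A rough sketch of how a null-safe dtype mapping like the one above can be applied to a query result; the schema literal and the final dict comprehension are illustrative, not the module's actual code path:

```python
import numpy as np
import pandas

# Illustrative BigQuery schema fields (shape mirrors the REST API).
schema_fields = [
    {"name": "created", "type": "TIMESTAMP"},
    {"name": "score", "type": "FLOAT"},
    {"name": "label", "type": "STRING"},  # not in dtype_map: pandas default
]

dtype_map = {
    "FLOAT": np.dtype(float),
    "TIMESTAMP": pandas.DatetimeTZDtype(tz="UTC"),
}

# Only null-safe dtypes are forced; other columns keep pandas's default.
dtypes = {
    field["name"]: dtype_map[field["type"]]
    for field in schema_fields
    if field["type"] in dtype_map
}
print(dtypes)  # {'created': datetime64[ns, UTC], 'score': dtype('float64')}
```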
2 changes: 1 addition & 1 deletion setup.py
@@ -18,7 +18,7 @@ def readme():

 INSTALL_REQUIRES = [
     "setuptools",
-    "pandas>=0.19.0",
+    "pandas>=0.20.0",
     "pydata-google-auth",
     "google-auth",
     "google-auth-oauthlib",
2 changes: 1 addition & 1 deletion tests/system/test_gbq.py
@@ -368,7 +368,7 @@ def test_should_properly_handle_arbitrary_datetime(self, project_id):
         "expression, is_expected_dtype",
         [
             ("current_date()", pandas.api.types.is_datetime64_ns_dtype),
-            ("current_timestamp()", pandas.api.types.is_datetime64_ns_dtype),
+            ("current_timestamp()", pandas.api.types.is_datetime64tz_dtype),
             ("current_datetime()", pandas.api.types.is_datetime64_ns_dtype),
             ("TRUE", pandas.api.types.is_bool_dtype),
             ("FALSE", pandas.api.types.is_bool_dtype),
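The updated parametrization leans on pandas's dtype predicates; a short sketch of the distinction under test, with local data standing in for real query results:

```python
import pandas
from pandas.api.types import is_datetime64_ns_dtype, is_datetime64tz_dtype

utc_col = pandas.Series(pandas.to_datetime(["2019-02-04"]).tz_localize("UTC"))
naive_col = pandas.Series(pandas.to_datetime(["2019-02-04"]))

assert is_datetime64tz_dtype(utc_col)        # current_timestamp() results
assert not is_datetime64tz_dtype(naive_col)  # current_datetime()/current_date()
assert is_datetime64_ns_dtype(naive_col)     # naive columns keep datetime64[ns]
```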