fix: support converting empty `time` Series to pyarrow Array #11
Currently getting |
d225329 to e793ec6
I'm a bit surprised this doesn't happen for |
Seems to work fine on latest pandas. 🤔
@@ -98,7 +98,8 @@ def astype(self, dtype, copy=True):

     def __arrow_array__(self, type=None):
         return pyarrow.array(
-            self.to_numpy(), type=type if type is not None else pyarrow.time64("ns"),
+            self.to_numpy(dtype="object"),
This might have performance implications, but it does seem to prevent the cast to float64 for empty arrays. Also, the dtype seems to be `object` whenever there are any values in the array, anyway.
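The dtype behavior described in this comment can be reproduced with plain NumPy. This is only a sketch: it assumes the extension array's conversion ends up building a NumPy array from a sequence of Python objects, and it does not exercise the db-dtypes or pyarrow code paths themselves.

```python
import datetime as dt

import numpy

# With no elements to inspect, NumPy falls back to its default float64 dtype.
empty_default = numpy.array([])

# Requesting object dtype explicitly keeps the array generic, even when empty.
empty_object = numpy.array([], dtype="object")

# When time values are present, NumPy already infers object dtype on its own.
times = numpy.array([dt.time(12, 30, 15), None])

print(empty_default.dtype)  # float64
print(empty_object.dtype)   # object
print(times.dtype)          # object
```

Only the empty case changes when `dtype="object"` is passed, which matches the observation that non-empty arrays were already `object`.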
I approached this by adding the missing `to_numpy()` for pandas <1, which just uses `astype('object')`.
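A version-conditional fallback of this kind might look roughly like the sketch below. `ObjectArraySketch`, `release_tuple`, and `add_to_numpy_fallback` are hypothetical names made up for illustration, and the version string is passed in by hand; this is not the code from either PR.

```python
import numpy


def release_tuple(version):
    """Turn a version string such as '0.24.2' into (0, 24) for comparison."""
    return tuple(int(part) for part in version.split(".")[:2])


class ObjectArraySketch:
    """Hypothetical stand-in for an extension array (not the db-dtypes class)."""

    def __init__(self, values):
        self._values = list(values)

    def astype(self, dtype):
        # Build a NumPy array of the requested dtype from the stored values.
        return numpy.array(self._values, dtype=dtype)


def add_to_numpy_fallback(cls, pandas_version):
    """Attach to_numpy() only when the given pandas version predates 1.0."""
    if release_tuple(pandas_version) < (1,):

        def to_numpy(self, dtype="object"):
            # Defaulting to object dtype avoids the float64 result for empty input.
            return self.astype(dtype)

        cls.to_numpy = to_numpy
    return cls


# Pretend an old pandas is installed; the version string here is an assumption.
add_to_numpy_fallback(ObjectArraySketch, "0.24.2")
empty = ObjectArraySketch([]).to_numpy()
print(empty.dtype, len(empty))  # object 0
```

On newer versions the helper leaves the class untouched, so whatever `to_numpy()` the base class provides is used instead.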
Sweet. That seems to have done the trick. I copied the implementation from your PR https://github.com/googleapis/python-db-dtypes-pandas/pull/9/files#diff-1956943f14005805ef968dfc37c26fe3eee995786f62a1c66dee5a29d9b1a251R111
            ),
        ),
        (
            pandas.Series(
                [
                    dt.time(0, 0, 0, 0),
                    dt.time(12, 30, 15, 125_000),
                    dt.time(23, 59, 59, 999_999),
                ],
                dtype="time",
            ),
            pyarrow.array(
                [
                    dt.time(0, 0, 0, 0),
                    dt.time(12, 30, 15, 125_000),
                    dt.time(23, 59, 59, 999_999),
                ],
                type=pyarrow.time64("ns"),
            ),
        ),
    ),
)
def test_to_arrow(series, expected):
    array = pyarrow.array(series)
    assert array.equals(expected)


@pytest.mark.parametrize(
    ("series", "expected"),
    (
        (pandas.Series([], dtype="date"), pyarrow.array([], type=pyarrow.date64())),
        (
            pandas.Series([None, None, None], dtype="date"),
            pyarrow.array([None, None, None], type=pyarrow.date64()),
        ),
        (
            pandas.Series(
                [dt.date(2021, 9, 27), None, dt.date(2011, 9, 27)], dtype="date"
            ),
            pyarrow.array(
                [dt.date(2021, 9, 27), None, dt.date(2011, 9, 27)],
                type=pyarrow.date64(),
            ),
        ),
        (
            pandas.Series(
                [dt.date(1677, 9, 22), dt.date(1970, 1, 1), dt.date(2262, 4, 11)],
                dtype="date",
            ),
            pyarrow.array(
                [dt.date(1677, 9, 22), dt.date(1970, 1, 1), dt.date(2262, 4, 11)],
                type=pyarrow.date64(),
            ),
        ),
        (pandas.Series([], dtype="time"), pyarrow.array([], type=pyarrow.time32("ms"))),
        (
            pandas.Series([None, None, None], dtype="time"),
            pyarrow.array([None, None, None], type=pyarrow.time32("ms")),
        ),
        (
            pandas.Series(
                [dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_000)], dtype="time"
            ),
            pyarrow.array(
                [dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_000)],
                type=pyarrow.time32("ms"),
            ),
        ),
        (
            pandas.Series(
                [dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_999)], dtype="time"
            ),
            pyarrow.array(
                [dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_999)],
                type=pyarrow.time64("us"),
            ),
        ),
        (
            pandas.Series(
                [
                    dt.time(0, 0, 0, 0),
                    dt.time(12, 30, 15, 125_000),
                    dt.time(23, 59, 59, 999_999),
                ],
                dtype="time",
            ),
            pyarrow.array(
                [
                    dt.time(0, 0, 0, 0),
                    dt.time(12, 30, 15, 125_000),
                    dt.time(23, 59, 59, 999_999),
                ],
                type=pyarrow.time64("us"),
            ),
        ),
    ),
)
def test_to_arrow_w_arrow_type(series, expected):
    array = pyarrow.array(series, type=expected.type)
    assert array.equals(expected)
Your tests are nicer than mine. :)
Thanks! They did catch a bug with empty arrays, so I'm glad I wrote them.
Looks good. I just can't say about the possible performance impact of `self.to_numpy(dtype="object")`.
Are we going to use that conditionally for pandas<1 as Jim suggested?
Thankfully I think this approach with using an extension dtype leaves us room for optimization. We might even switch to backing everything with pyarrow arrays instead of numpy arrays someday.
Yes, that seems to be the right approach. Just pushed a commit.
@for_date_and_time
def test_date___arrow__array__(dtype):
Removed because it's redundant with the new `test_arrow.py` tests.
🤖 I have created a release \*beep\* \*boop\*

---

## 0.1.0 (2021-09-29)

### Features

* add `time` and `date` dtypes ([f104171](https://www.github.com/googleapis/python-db-dtypes-pandas/commit/f10417111642e8f5f4b9af790367af930d15a056))

### Bug Fixes

* support converting empty `time` Series to pyarrow Array ([#11](https://www.github.com/googleapis/python-db-dtypes-pandas/issues/11)) ([7675b15](https://www.github.com/googleapis/python-db-dtypes-pandas/commit/7675b157feb842628fa731cc6a472aa9e6b92903))
* support Pandas 0.24 ([#8](https://www.github.com/googleapis/python-db-dtypes-pandas/issues/8)) ([e996883](https://www.github.com/googleapis/python-db-dtypes-pandas/commit/e996883bc9c76fe5f593e9c19a9d2a1c13501f5e))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Fixes #10 🦕