Skip to content

BUG: preserve timezone info when writing empty tz-aware series to HDF5 #37072

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Oct 31, 2020

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Oct 12, 2020

xref #20595 (stale PR that looked into this)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i am not sure how to grok this one as i think its on top of the other; in the future put them both in 1 PR is ok for this size change, or now, wait till the other is merged then rebase this one (or make this orthogonal)

@@ -2960,11 +2960,17 @@ def write_array_empty(self, key: str, value: ArrayLike):
node._v_attrs.value_type = str(value.dtype)
node._v_attrs.shape = value.shape

def write_array(self, key: str, value: ArrayLike, items: Optional[Index] = None):
# TODO: we only have one test that gets here, the only EA
def write_array(self, key: str, obj, items: Optional[Index] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the typing changed? can you be explicit here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a type hint.

AFAIU for series to_numpy() drops timezone info so we need to pass the original object in SeriesFixed.read(...)and special case here on timezone-aware data

xref (from stale PR that worked on this) https://github.com/pandas-dev/pandas/pull/20595/files#r179816447

elif is_datetime64tz_dtype(obj):
value = obj.dt._get_values()
else:
value = obj.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .to_numpy()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback jreback added IO HDF5 read_hdf, HDFStore Timezones Timezone data dtype labels Oct 14, 2020
@arw2019
Copy link
Member Author

arw2019 commented Oct 15, 2020

ok i am not sure how to grok this one as i think its on top of the other; in the future put them both in 1 PR is ok for this size change, or now, wait till the other is merged then rebase this one (or make this orthogonal)

Ok will do! For this one I'll wait until the other is merge (since this is a dependant)

@arw2019 arw2019 force-pushed the GH20594 branch 2 times, most recently from d30f5bf to 2ea93bb Compare October 15, 2020 12:39
@arw2019
Copy link
Member Author

arw2019 commented Oct 16, 2020

@jreback this is ready for re-review

# that gets passed is DatetimeArray, and we never have
# both self._filters and EA
assert isinstance(value, (np.ndarray, ABCExtensionArray)), type(value)

if isinstance(obj, (np.ndarray, ABCExtensionArray)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the annotation of obj as a Series accurate? if so, this branch is never hit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not. Should be SeriesOrFrame (will fix)

if isinstance(obj, (np.ndarray, ABCExtensionArray)):
value = obj
elif is_datetime64tz_dtype(obj):
value = obj.dt._get_values()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont know off the top of my head what _get_values returns, is it a DatetimeArray? The standard pattern would be to use extract_array(obj, extract_numpy=True) at the beginning of this method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_values returns a DatetimeIndex

https://github.com/pandas-dev/pandas/blob/10e5ad7109f1bb3c47b48b6f0df185e108c5493d/pandas/core/indexes/accessors.py#L47-54

But actually the extract_array pattern does the job of this entire if/elif/else block here - thanks for suggesting! (I'll remember for future)

@arw2019
Copy link
Member Author

arw2019 commented Oct 31, 2020

CI Green + ready for re-review

@jreback jreback added this to the 1.2 milestone Oct 31, 2020
@jreback jreback merged commit 99d7fad into pandas-dev:master Oct 31, 2020
@jreback
Copy link
Contributor

jreback commented Oct 31, 2020

thanks @arw2019

@arw2019 arw2019 deleted the GH20594 branch November 1, 2020 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: HDFStore failures on timezone-aware data
3 participants