-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Handle numpy strings in index names in HDF #13492 #16444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Handle numpy strings in index names in HDF #13492 #16444
Conversation
I found that pytables is returning a numpy.str_ as the index name for both indices. I don't think this has to do with how we write it, and pytables just chooses to write / read it that way. So we can handle this on the read side. There is some trickiness around numpy types in py2/3 and python string types in py2/3. So I decided to split the problem in two steps:
The latter is done by _ensure_decoded, so for the former I added an _ensure_python_type. This way we should cover most situations. |
The numpy type situation is complicated across py2/3 and string types, but luckily they have a function that handles all that: https://github.com/numpy/numpy/blob/v1.12.1/numpy/core/numerictypes.py#L74 |
as an additional test, try using a unicode index names as well :> |
I prefer [17] for the conversion I think. |
@jreback so that was the original plan (and it works on my python 3.6 install) but the worry is that the behavior for what the heck happens with bytes vs strings (is it always going to return str_?) or py2/py3 (where in py2 np.str_ means bytes and in py3 np.str_ means unicode), and then how that interacts with .decode("utf-8") in the _ensure_decoded? If you're confident the above approach will work, more than happy to change. Will add the test! |
So all string data is stored as bytes in PyTables. I think we encode to utf-8 (and decode the same). Though this is really just for table format. I dont' really remember what happens for fixed format for meta data. You might want to trace it and see. |
Codecov Report
@@ Coverage Diff @@
## master #16444 +/- ##
==========================================
+ Coverage 90.75% 90.92% +0.16%
==========================================
Files 161 161
Lines 51089 49244 -1845
==========================================
- Hits 46368 44777 -1591
+ Misses 4721 4467 -254
Continue to review full report at Codecov.
|
@makmanalp it seems like all the tests passed, so that's good. Would you mind including the example from import pandas as pd
import numpy as np
import datetime
idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)
with pd.HDFStore("test.h5", "w") as store:
store.put("test", s, "fixed")
with pd.HDFStore("test.h5", "r") as store:
s1 = store["test"] and ensure that |
@jreback OK, if it's bytes only then that should simplify things a bit. |
@makmanalp will you have a chance to update this in the next day or so? We're releasing 0.20.2 by the end of the week. Not a big deal to push. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a test and a whatsnew entry.
@TomAugspurger sorry for dropping the ball on this. I'm going to give it a shot tomorrow during the day. The test and whatsnew is easy, but double checking how things work in fixed format is a bit harder for me. Feel free to push it off the release if so. |
OK, oddly enough, in the test cases, I can't seem to use the ensure_clean_store() helper, perhaps there is something about the closing / opening of the file that changes things? I'm removing the bugfix that I made, and expecting the test to fail: with ensure_clean_store(self.path) as store:
store.put('frame', df, format='fixed')
recons = store['frame']
tm.assert_frame_equal(recons, df)
assert type(df.index.name) == str
assert type(df.columns.name) == str
# Won't fail (surprising)
with pd.HDFStore("test.h5", "w") as store:
store.put("test", df, "fixed")
with pd.HDFStore("test.h5", "r") as store:
df2 = store["test"]
assert type(df2.index.name) == str
assert type(df2.columns.name) == str
# Will fail! (which is what I want) I thought of using |
48b91c5
to
40c8f1a
Compare
OK so I fixed the issue with ensure_clean_store because I found ensure_clean_path. Re: picking a unicode index name for the test, how are we going to make this work for py2/py3 again? Do we want a |
Ah, OK, so I found the |
I think the |
@TomAugspurger yeah - I've now realized we don't need to care about because I can use the u_safe thing helps because you can use the |
Just for posterity, conversation on numpy strings / bytestrings / unicode in numpy gitter channel:
|
So I don't think the types converted / encoded / decoded at all for index names in fixed format, at least not within pandas: Line 2485 in fb47ee5
and also Line 2529 in fb47ee5
But I may not be understanding this code fully. edit: It's also odd that calling type() on the on the index name is what's blowing up with an unicode error, and not printing it out or something: |
pandas/io/pytables.py
Outdated
@@ -2567,7 +2575,7 @@ def read_index_node(self, node, start=None, stop=None): | |||
name = None | |||
|
|||
if 'name' in node._v_attrs: | |||
name = node._v_attrs.name | |||
name = _ensure_decoded(_ensure_python_type(node._v_attrs.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try replacing this with name = compat.text_type(node._v_attrs.name)
. That will be unicode
on python 2 and str
on python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need the _ensure_python_type
stuff then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you can't just do this unconditionally, as the name could be something like None
or the literal 10
(not a string). So it'll be something like
name = node._v_attrs.name
if isinstance(name, compat.string_types):
name = compat.text_type(name)
pandas/tests/io/test_pytables.py
Outdated
df2 = read_hdf(path, 'df') | ||
assert_frame_equal(df, df2) | ||
|
||
assert type(df2.index.name) == str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then replace str
here and below with compat.text_type
. But I would also add an assert that the exact string matches.
pandas/tests/io/test_pytables.py
Outdated
assert type(df2.columns.name) == str | ||
|
||
with ensure_clean_path(self.path) as path: | ||
df.to_hdf(path, 'df', format='table') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can parametrize format
I think
@pytest.mark.parametrized('format', ['table', 'fixed'])
def test_store_index_name_numpy_str(self, format):
...
e11c90d
to
a90b215
Compare
Thanks a bunch @TomAugspurger! The latest push should handle all your comments correctly, and pass the 2.7 test this time! |
pandas/tests/io/test_pytables.py
Outdated
assert_frame_equal(df, df2) | ||
|
||
assert type(df2.index.name) == text_type | ||
assert df2.index.name == u('rows\u05d0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I guess this is unnecessary since you have the assert_frame_equal
up above. Oh well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this too!
a90b215
to
ab75d27
Compare
ab75d27
to
90f63b0
Compare
(last push fixes a minor lint error that escaped me after the latest changes: |
@@ -70,6 +70,7 @@ I/O | |||
- Bug that raised IndexError HTML-rendering an empty DataFrame (:issue:`15953`) | |||
- Bug in ``pd.read_csv()`` in which tarfile object inputs were raising an error in Python 2.x for the C engine (:issue:`16530`) | |||
- Bug where ``DataFrame.to_html()`` ignored the ``index_names`` parameter (:issue:`16493`) | |||
- Bug where ``pd.read_hdf()`` returns numpy strings for index names (:issue:`13492`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specify this is for nvm I see that this is for both.format='fixed'
take this back. I see its only for fixed stores, so mention that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where numpy strings were returned, rather than python strings for a named Index.
pandas/io/pytables.py
Outdated
@@ -2568,6 +2568,8 @@ def read_index_node(self, node, start=None, stop=None): | |||
|
|||
if 'name' in node._v_attrs: | |||
name = node._v_attrs.name | |||
if isinstance(name, compat.string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write this as _ensure_str
(like _ensure_decoded
) and use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment (and issue reference) of why this is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor code change. ping on green.
@makmanalp can you update |
Got it, merging on green and will cherry pick for 0.20.2; |
Thanks @makmanalp |
…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str (cherry picked from commit 18c316b)
…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str
…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str
Version 0.20.2 * tag 'v0.20.2': (68 commits) RLS: v0.20.2 DOC: Update release.rst DOC: Whatsnew fixups (pandas-dev#16596) ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460) BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444) PERF: vectorize _interp_limit (pandas-dev#16592) DOC: whatsnew 0.20.2 edits (pandas-dev#16587) API: Make is_strictly_monotonic_* private (pandas-dev#16576) BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565) Strictly monotonic (pandas-dev#16555) ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026) fix linting BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244) BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317) return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486) BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549) BUG: Fixed tput output on windows (pandas-dev#16496) Strictly monotonic (pandas-dev#16555) BUG: fixed wrong order of ordered labels in pd.cut() BUG: Fixed to_html ignoring index_names parameter ...
Work in progress!
git diff upstream/master --name-only -- '*.py' | flake8 --diff