BUG: Handle numpy strings in index names in HDF #13492 #16444

makmanalp · 2017-05-23T00:11:36Z

Work in progress!

closes ERR: raise on invalid coulmns using a fixed HDFStore #13492
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

makmanalp · 2017-05-23T00:17:20Z

I found that pytables is returning a numpy.str_ as the index name for both indices. I don't think this has to do with how we write it, and pytables just chooses to write / read it that way. So we can handle this on the read side. There is some trickiness around numpy types in py2/3 and python string types in py2/3. So I decided to split the problem in two steps:

Handle converting numpy to python string types
Handle converting byte types to unicode codepoints.

The latter is done by _ensure_decoded, so for the former I added an _ensure_python_type. This way we should cover most situations.

makmanalp · 2017-05-23T00:17:58Z

The numpy type situation is complicated across py2/3 and string types, but luckily they have a function that handles all that: https://github.com/numpy/numpy/blob/v1.12.1/numpy/core/numerictypes.py#L74

jreback · 2017-05-23T00:23:11Z

as an additional test, try using a unicode index names as well :>

jreback · 2017-05-23T00:25:11Z

In [14]: s = np.str_('foo')

In [15]: type(s)
Out[15]: numpy.str_

In [16]: type(s.item())
Out[16]: str

In [17]: type(str(s))
Out[17]: str

I prefer [17] for the conversion I think.

makmanalp · 2017-05-23T00:29:13Z

@jreback so that was the original plan (and it works on my python 3.6 install) but the worry is that the behavior for what the heck happens with bytes vs strings (is it always going to return str_?) or py2/py3 (where in py2 np.str_ means bytes and in py3 np.str_ means unicode), and then how that interacts with .decode("utf-8") in the _ensure_decoded? If you're confident the above approach will work, more than happy to change.

Will add the test!

jreback · 2017-05-23T00:32:22Z

@makmanalp

So all string data is stored as bytes in PyTables. I think we encode to utf-8 (and decode the same). Though this is really just for table format. I dont' really remember what happens for fixed format for meta data. You might want to trace it and see.

codecov · 2017-05-23T05:25:57Z

Codecov Report

Merging #16444 into master will increase coverage by 0.16%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16444      +/-   ##
==========================================
+ Coverage   90.75%   90.92%   +0.16%     
==========================================
  Files         161      161              
  Lines       51089    49244    -1845     
==========================================
- Hits        46368    44777    -1591     
+ Misses       4721     4467     -254

Flag	Coverage Δ
#multiple	`88.68% <80%> (+0.08%)`	⬆️
#single	`40.24% <100%> (+0.07%)`	⬆️

Impacted Files	Coverage Δ
pandas/io/pytables.py	`93.12% <100%> (+0.01%)`	⬆️
pandas/core/computation/expressions.py	`0% <0%> (-90.48%)`	⬇️
pandas/io/parsers.py	`95.42% <0%> (-0.27%)`	⬇️
pandas/io/msgpack/_version.py	`44.65% <0%> (-0.22%)`	⬇️
pandas/io/excel.py	`80.55% <0%> (-0.09%)`	⬇️
pandas/core/indexes/multi.py	`96.65% <0%> (ø)`	⬆️
pandas/core/groupby.py	`92.07% <0%> (+0.03%)`	⬆️
pandas/io/formats/style.py	`100% <0%> (+4.07%)`	⬆️
pandas/core/util/hashing.py	`98.34% <0%> (+5.37%)`	⬆️
pandas/io/json/json.py	`100% <0%> (+9.63%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ec98d8...70a55ec. Read the comment docs.

TomAugspurger · 2017-05-23T13:05:18Z

@makmanalp it seems like all the tests passed, so that's good.

Would you mind including the example from

import pandas as pd
import numpy as np
import datetime

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

with pd.HDFStore("test.h5", "w") as store:
    store.put("test", s, "fixed")

with pd.HDFStore("test.h5", "r") as store:
    s1 = store["test"]

and ensure that type(s1.index.name) is indeed a str. And maybe use a unicode name that isn't encodable as ASCII?

makmanalp · 2017-05-23T13:10:10Z

@jreback OK, if it's bytes only then that should simplify things a bit.
@TomAugspurger will do!

TomAugspurger · 2017-05-30T21:37:13Z

@makmanalp will you have a chance to update this in the next day or so? We're releasing 0.20.2 by the end of the week. Not a big deal to push.

jreback

needs a test and a whatsnew entry.

makmanalp · 2017-05-31T04:26:06Z

@TomAugspurger sorry for dropping the ball on this. I'm going to give it a shot tomorrow during the day. The test and whatsnew is easy, but double checking how things work in fixed format is a bit harder for me. Feel free to push it off the release if so.

makmanalp · 2017-06-01T03:18:27Z

OK, oddly enough, in the test cases, I can't seem to use the ensure_clean_store() helper, perhaps there is something about the closing / opening of the file that changes things? I'm removing the bugfix that I made, and expecting the test to fail:

    with ensure_clean_store(self.path) as store:
        store.put('frame', df, format='fixed')
        recons = store['frame']

        tm.assert_frame_equal(recons, df)

        assert type(df.index.name) == str
        assert type(df.columns.name) == str
        # Won't fail (surprising)

    with pd.HDFStore("test.h5", "w") as store:
        store.put("test", df, "fixed")

    with pd.HDFStore("test.h5", "r") as store:
        df2 = store["test"]
        assert type(df2.index.name) == str
        assert type(df2.columns.name) == str
        # Will fail! (which is what I want)

I thought of using self._check_roundtrip but then I don't have access to the dataframe to be able to assert the assertions about the name types. Any pointers on the right way to do this?

makmanalp · 2017-06-01T15:43:23Z

OK so I fixed the issue with ensure_clean_store because I found ensure_clean_path. Re: picking a unicode index name for the test, how are we going to make this work for py2/py3 again? Do we want a u"blah" string? Do we need a # coding=utf-8 on top of the file?

makmanalp · 2017-06-01T15:57:21Z

Ah, OK, so I found the compat module and u_safe and text_type, will give a shot at those.

TomAugspurger · 2017-06-01T15:57:33Z

I think the # coding is only required if you use non-ascii characters in the source.

makmanalp · 2017-06-01T16:01:08Z

@TomAugspurger yeah - I've now realized we don't need to care about because I can use the u_safe thing helps because you can use the \u05d0 style notation. Thank you!

makmanalp · 2017-06-01T17:57:37Z

Just for posterity, conversation on numpy strings / bytestrings / unicode in numpy gitter channel:


Mehmet Ali "Mali" Akmanalp @makmanalp May 22 19:12
Hi, does anyone know the difference between numpy.str and numpy.string (i.e. numpy.bytes_) and how to handle them in different python versions?
I'm trying to do a pull request for pandas where pytables is creating a np.str_ object that we need to decode into a regular ol' string

Julian Taylor @juliantaylor May 23 04:52
np.str_ is the equivalent to python str and maps to bytes in python2 and unicode in python3
np.string is always bytes and should not be used
the only real string type numpy has is np.unicode
that works in both python2 and 3

Mehmet Ali "Mali" Akmanalp @makmanalp May 23 09:27
@juliantaylor and how do conversions within those numpy types and/or from those to python types work?
(and thank you for the clarification)
My current understanding is that .item() does the right thing in terms of numpy->python, regardless of python version

Julian Taylor @juliantaylor May 23 11:31
@makmanalp no, item will return bytes in python3, which happens to be string in python2
the 'S' dtype is flat out broken, if you need actual bytes you are better of using int8 arrays, if you need strings you need to use unicode
the 'S' type has no concept of encoding, so even putting encoded data into it is troublesome as it strips trailing nulls

makmanalp · 2017-06-01T18:23:01Z

So I don't think the types converted / encoded / decoded at all for index names in fixed format, at least not within pandas:

pandas/pandas/io/pytables.py

Line 2485 in fb47ee5

node._v_attrs.name = index.name

and also

pandas/pandas/io/pytables.py

Line 2529 in fb47ee5

node._v_attrs.name = name

But I may not be understanding this code fully.

edit: It's also odd that calling type() on the on the index name is what's blowing up with an unicode error, and not printing it out or something:
https://travis-ci.org/pandas-dev/pandas/jobs/238406379#L1371

TomAugspurger · 2017-06-01T18:56:03Z

pandas/io/pytables.py

@@ -2567,7 +2575,7 @@ def read_index_node(self, node, start=None, stop=None):
        name = None

        if 'name' in node._v_attrs:
-            name = node._v_attrs.name
+            name = _ensure_decoded(_ensure_python_type(node._v_attrs.name))


Try replacing this with name = compat.text_type(node._v_attrs.name). That will be unicode on python 2 and str on python 3.

You shouldn't need the _ensure_python_type stuff then.

I guess you can't just do this unconditionally, as the name could be something like None or the literal 10 (not a string). So it'll be something like

name = node._v_attrs.name if isinstance(name, compat.string_types): name = compat.text_type(name)

TomAugspurger · 2017-06-01T19:03:42Z

pandas/tests/io/test_pytables.py

+            df2 = read_hdf(path, 'df')
+            assert_frame_equal(df, df2)
+
+            assert type(df2.index.name) == str


And then replace str here and below with compat.text_type. But I would also add an assert that the exact string matches.

TomAugspurger · 2017-06-01T19:05:11Z

pandas/tests/io/test_pytables.py

+            assert type(df2.columns.name) == str
+
+        with ensure_clean_path(self.path) as path:
+            df.to_hdf(path, 'df', format='table')


You can parametrize format I think

@pytest.mark.parametrized('format', ['table', 'fixed'])
def test_store_index_name_numpy_str(self, format):
...

makmanalp · 2017-06-01T19:38:40Z

Thanks a bunch @TomAugspurger! The latest push should handle all your comments correctly, and pass the 2.7 test this time!

TomAugspurger · 2017-06-01T19:43:31Z

pandas/tests/io/test_pytables.py

+            assert_frame_equal(df, df2)
+
+            assert type(df2.index.name) == text_type
+            assert df2.index.name == u('rows\u05d0')


Oh, I guess this is unnecessary since you have the assert_frame_equal up above. Oh well.

Fixed this too!

makmanalp · 2017-06-01T20:53:32Z

(last push fixes a minor lint error that escaped me after the latest changes: pandas/tests/io/test_pytables.py:2931:24: E128 continuation line under-indented for visual indent)

jreback · 2017-06-02T00:30:14Z

doc/source/whatsnew/v0.20.2.txt

@@ -70,6 +70,7 @@ I/O
 - Bug that raised IndexError HTML-rendering an empty DataFrame (:issue:`15953`)
 - Bug in ``pd.read_csv()`` in which tarfile object inputs were raising an error in Python 2.x for the C engine (:issue:`16530`)
 - Bug where ``DataFrame.to_html()`` ignored the ``index_names`` parameter (:issue:`16493`)
+- Bug where ``pd.read_hdf()`` returns numpy strings for index names (:issue:`13492`)


~~specify this is for format='fixed'~~ nvm I see that this is for both.

take this back. I see its only for fixed stores, so mention that.

where numpy strings were returned, rather than python strings for a named Index.

jreback · 2017-06-02T00:31:50Z

pandas/io/pytables.py

@@ -2568,6 +2568,8 @@ def read_index_node(self, node, start=None, stop=None):

        if 'name' in node._v_attrs:
            name = node._v_attrs.name
+            if isinstance(name, compat.string_types):


write this as _ensure_str (like _ensure_decoded) and use it here.

add a comment (and issue reference) of why this is needed

jreback

minor code change. ping on green.

jreback · 2017-06-02T22:31:12Z

@makmanalp can you update

TomAugspurger · 2017-06-04T00:59:49Z

Got it, merging on green and will cherry pick for 0.20.2;

TomAugspurger · 2017-06-04T10:44:32Z

Thanks @makmanalp

…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str (cherry picked from commit 18c316b)

* BUG: Handle numpy strings in index names in HDF5 #13492 * REF: refactor to _ensure_str (cherry picked from commit 18c316b)

…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str

Version 0.20.2 * tag 'v0.20.2': (68 commits) RLS: v0.20.2 DOC: Update release.rst DOC: Whatsnew fixups (pandas-dev#16596) ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460) BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444) PERF: vectorize _interp_limit (pandas-dev#16592) DOC: whatsnew 0.20.2 edits (pandas-dev#16587) API: Make is_strictly_monotonic_* private (pandas-dev#16576) BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565) Strictly monotonic (pandas-dev#16555) ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026) fix linting BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244) BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317) return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486) BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549) BUG: Fixed tput output on windows (pandas-dev#16496) Strictly monotonic (pandas-dev#16555) BUG: fixed wrong order of ordered labels in pd.cut() BUG: Fixed to_html ignoring index_names parameter ...

jreback added the IO HDF5 read_hdf, HDFStore label May 23, 2017

jreback added the 32bit 32-bit systems label May 23, 2017

TomAugspurger added this to the 0.20.2 milestone May 23, 2017

jreback requested changes May 30, 2017

View reviewed changes

jreback removed this from the 0.20.2 milestone May 30, 2017

makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from 48b91c5 to 40c8f1a Compare June 1, 2017 15:36

TomAugspurger reviewed Jun 1, 2017

View reviewed changes

makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from e11c90d to a90b215 Compare June 1, 2017 19:36

TomAugspurger reviewed Jun 1, 2017

View reviewed changes

TomAugspurger approved these changes Jun 1, 2017

View reviewed changes

makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from a90b215 to ab75d27 Compare June 1, 2017 19:52

makmanalp changed the title ~~BUG: Handle np strings in index.name in HDF #13492 (Work in Progress)~~ BUG: Handle np strings in index.name in HDF #13492 Jun 1, 2017

makmanalp changed the title ~~BUG: Handle np strings in index.name in HDF #13492~~ BUG: Handle numpy strings in index names in HDF #13492 Jun 1, 2017

BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

90f63b0

makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from ab75d27 to 90f63b0 Compare June 1, 2017 20:52

jreback reviewed Jun 2, 2017

View reviewed changes

jreback approved these changes Jun 2, 2017

View reviewed changes

jreback added the Needs Backport label Jun 2, 2017

jreback added this to the 0.20.2 milestone Jun 2, 2017

REF: refactor to _ensure_str

70a55ec

TomAugspurger merged commit 18c316b into pandas-dev:master Jun 4, 2017

TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)

7286bc7

* BUG: Handle numpy strings in index names in HDF5 #13492 * REF: refactor to _ensure_str (cherry picked from commit 18c316b)

TomAugspurger removed the Needs Backport label Jun 4, 2017

Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017

BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pa…

a0174eb

…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str

stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017

BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pa…

05e41e4

…ndas-dev#16444) * BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492 * REF: refactor to _ensure_str

BUG: Handle numpy strings in index names in HDF #13492 #16444

BUG: Handle numpy strings in index names in HDF #13492 #16444

Conversation

makmanalp commented May 23, 2017 • edited Loading

makmanalp commented May 23, 2017

makmanalp commented May 23, 2017

jreback commented May 23, 2017

jreback commented May 23, 2017

makmanalp commented May 23, 2017

jreback commented May 23, 2017

codecov bot commented May 23, 2017 • edited Loading

Codecov Report

TomAugspurger commented May 23, 2017

makmanalp commented May 23, 2017

TomAugspurger commented May 30, 2017

jreback left a comment

Choose a reason for hiding this comment

makmanalp commented May 31, 2017

makmanalp commented Jun 1, 2017

makmanalp commented Jun 1, 2017

makmanalp commented Jun 1, 2017

TomAugspurger commented Jun 1, 2017

makmanalp commented Jun 1, 2017

makmanalp commented Jun 1, 2017

makmanalp commented Jun 1, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

makmanalp commented Jun 1, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

makmanalp commented Jun 1, 2017

jreback Jun 2, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

jreback commented Jun 2, 2017

TomAugspurger commented Jun 4, 2017

TomAugspurger commented Jun 4, 2017

makmanalp commented May 23, 2017 •

edited

Loading

codecov bot commented May 23, 2017 •

edited

Loading

makmanalp commented Jun 1, 2017 •

edited

Loading

jreback Jun 2, 2017 •

edited

Loading