Skip to content

BUG: Handle numpy strings in index names in HDF #13492 #16444

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

makmanalp
Copy link
Contributor

@makmanalp makmanalp commented May 23, 2017

Work in progress!

@jreback jreback added the IO HDF5 read_hdf, HDFStore label May 23, 2017
@makmanalp
Copy link
Contributor Author

I found that pytables is returning a numpy.str_ as the index name for both indices. I don't think this has to do with how we write it, and pytables just chooses to write / read it that way. So we can handle this on the read side. There is some trickiness around numpy types in py2/3 and python string types in py2/3. So I decided to split the problem in two steps:

  1. Handle converting numpy to python string types
  2. Handle converting byte types to unicode codepoints.

The latter is done by _ensure_decoded, so for the former I added an _ensure_python_type. This way we should cover most situations.

@makmanalp
Copy link
Contributor Author

The numpy type situation is complicated across py2/3 and string types, but luckily they have a function that handles all that: https://github.com/numpy/numpy/blob/v1.12.1/numpy/core/numerictypes.py#L74

@jreback
Copy link
Contributor

jreback commented May 23, 2017

as an additional test, try using a unicode index names as well :>

@jreback jreback added the 32bit 32-bit systems label May 23, 2017
@jreback
Copy link
Contributor

jreback commented May 23, 2017

In [14]: s = np.str_('foo')

In [15]: type(s)
Out[15]: numpy.str_

In [16]: type(s.item())
Out[16]: str

In [17]: type(str(s))
Out[17]: str

I prefer [17] for the conversion I think.

@makmanalp
Copy link
Contributor Author

@jreback so that was the original plan (and it works on my python 3.6 install) but the worry is that the behavior for what the heck happens with bytes vs strings (is it always going to return str_?) or py2/py3 (where in py2 np.str_ means bytes and in py3 np.str_ means unicode), and then how that interacts with .decode("utf-8") in the _ensure_decoded? If you're confident the above approach will work, more than happy to change.

Will add the test!

@jreback
Copy link
Contributor

jreback commented May 23, 2017

@makmanalp

So all string data is stored as bytes in PyTables. I think we encode to utf-8 (and decode the same). Though this is really just for table format. I dont' really remember what happens for fixed format for meta data. You might want to trace it and see.

@codecov
Copy link

codecov bot commented May 23, 2017

Codecov Report

Merging #16444 into master will increase coverage by 0.16%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16444      +/-   ##
==========================================
+ Coverage   90.75%   90.92%   +0.16%     
==========================================
  Files         161      161              
  Lines       51089    49244    -1845     
==========================================
- Hits        46368    44777    -1591     
+ Misses       4721     4467     -254
Flag Coverage Δ
#multiple 88.68% <80%> (+0.08%) ⬆️
#single 40.24% <100%> (+0.07%) ⬆️
Impacted Files Coverage Δ
pandas/io/pytables.py 93.12% <100%> (+0.01%) ⬆️
pandas/core/computation/expressions.py 0% <0%> (-90.48%) ⬇️
pandas/io/parsers.py 95.42% <0%> (-0.27%) ⬇️
pandas/io/msgpack/_version.py 44.65% <0%> (-0.22%) ⬇️
pandas/io/excel.py 80.55% <0%> (-0.09%) ⬇️
pandas/core/indexes/multi.py 96.65% <0%> (ø) ⬆️
pandas/core/groupby.py 92.07% <0%> (+0.03%) ⬆️
pandas/io/formats/style.py 100% <0%> (+4.07%) ⬆️
pandas/core/util/hashing.py 98.34% <0%> (+5.37%) ⬆️
pandas/io/json/json.py 100% <0%> (+9.63%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ec98d8...70a55ec. Read the comment docs.

@TomAugspurger
Copy link
Contributor

@makmanalp it seems like all the tests passed, so that's good.

Would you mind including the example from

import pandas as pd
import numpy as np
import datetime

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

with pd.HDFStore("test.h5", "w") as store:
    store.put("test", s, "fixed")

with pd.HDFStore("test.h5", "r") as store:
    s1 = store["test"]

and ensure that type(s1.index.name) is indeed a str. And maybe use a unicode name that isn't encodable as ASCII?

@TomAugspurger TomAugspurger added this to the 0.20.2 milestone May 23, 2017
@makmanalp
Copy link
Contributor Author

@jreback OK, if it's bytes only then that should simplify things a bit.
@TomAugspurger will do!

@TomAugspurger
Copy link
Contributor

@makmanalp will you have a chance to update this in the next day or so? We're releasing 0.20.2 by the end of the week. Not a big deal to push.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a test and a whatsnew entry.

@jreback jreback removed this from the 0.20.2 milestone May 30, 2017
@makmanalp
Copy link
Contributor Author

@TomAugspurger sorry for dropping the ball on this. I'm going to give it a shot tomorrow during the day. The test and whatsnew is easy, but double checking how things work in fixed format is a bit harder for me. Feel free to push it off the release if so.

@makmanalp
Copy link
Contributor Author

OK, oddly enough, in the test cases, I can't seem to use the ensure_clean_store() helper, perhaps there is something about the closing / opening of the file that changes things? I'm removing the bugfix that I made, and expecting the test to fail:

    with ensure_clean_store(self.path) as store:
        store.put('frame', df, format='fixed')
        recons = store['frame']

        tm.assert_frame_equal(recons, df)

        assert type(df.index.name) == str
        assert type(df.columns.name) == str
        # Won't fail (surprising)

    with pd.HDFStore("test.h5", "w") as store:
        store.put("test", df, "fixed")

    with pd.HDFStore("test.h5", "r") as store:
        df2 = store["test"]
        assert type(df2.index.name) == str
        assert type(df2.columns.name) == str
        # Will fail! (which is what I want)

I thought of using self._check_roundtrip but then I don't have access to the dataframe to be able to assert the assertions about the name types. Any pointers on the right way to do this?

@makmanalp makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from 48b91c5 to 40c8f1a Compare June 1, 2017 15:36
@makmanalp
Copy link
Contributor Author

OK so I fixed the issue with ensure_clean_store because I found ensure_clean_path. Re: picking a unicode index name for the test, how are we going to make this work for py2/py3 again? Do we want a u"blah" string? Do we need a # coding=utf-8 on top of the file?

@makmanalp
Copy link
Contributor Author

Ah, OK, so I found the compat module and u_safe and text_type, will give a shot at those.

@TomAugspurger
Copy link
Contributor

I think the # coding is only required if you use non-ascii characters in the source.

@makmanalp
Copy link
Contributor Author

@TomAugspurger yeah - I've now realized we don't need to care about because I can use the u_safe thing helps because you can use the \u05d0 style notation. Thank you!

@makmanalp
Copy link
Contributor Author

Just for posterity, conversation on numpy strings / bytestrings / unicode in numpy gitter channel:


Mehmet Ali "Mali" Akmanalp @makmanalp May 22 19:12
Hi, does anyone know the difference between numpy.str and numpy.string (i.e. numpy.bytes_) and how to handle them in different python versions?
I'm trying to do a pull request for pandas where pytables is creating a np.str_ object that we need to decode into a regular ol' string

Julian Taylor @juliantaylor May 23 04:52
np.str_ is the equivalent to python str and maps to bytes in python2 and unicode in python3
np.string is always bytes and should not be used
the only real string type numpy has is np.unicode
that works in both python2 and 3

Mehmet Ali "Mali" Akmanalp @makmanalp May 23 09:27
@juliantaylor and how do conversions within those numpy types and/or from those to python types work?
(and thank you for the clarification)
My current understanding is that .item() does the right thing in terms of numpy->python, regardless of python version

Julian Taylor @juliantaylor May 23 11:31
@makmanalp no, item will return bytes in python3, which happens to be string in python2
the 'S' dtype is flat out broken, if you need actual bytes you are better of using int8 arrays, if you need strings you need to use unicode
the 'S' type has no concept of encoding, so even putting encoded data into it is troublesome as it strips trailing nulls

@makmanalp
Copy link
Contributor Author

makmanalp commented Jun 1, 2017

So I don't think the types converted / encoded / decoded at all for index names in fixed format, at least not within pandas:

node._v_attrs.name = index.name

and also

node._v_attrs.name = name

But I may not be understanding this code fully.

edit: It's also odd that calling type() on the on the index name is what's blowing up with an unicode error, and not printing it out or something:
https://travis-ci.org/pandas-dev/pandas/jobs/238406379#L1371

@@ -2567,7 +2575,7 @@ def read_index_node(self, node, start=None, stop=None):
name = None

if 'name' in node._v_attrs:
name = node._v_attrs.name
name = _ensure_decoded(_ensure_python_type(node._v_attrs.name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try replacing this with name = compat.text_type(node._v_attrs.name). That will be unicode on python 2 and str on python 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need the _ensure_python_type stuff then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can't just do this unconditionally, as the name could be something like None or the literal 10 (not a string). So it'll be something like

name = node._v_attrs.name
if isinstance(name, compat.string_types):
    name = compat.text_type(name)

df2 = read_hdf(path, 'df')
assert_frame_equal(df, df2)

assert type(df2.index.name) == str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then replace str here and below with compat.text_type. But I would also add an assert that the exact string matches.

assert type(df2.columns.name) == str

with ensure_clean_path(self.path) as path:
df.to_hdf(path, 'df', format='table')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can parametrize format I think

@pytest.mark.parametrized('format', ['table', 'fixed'])
def test_store_index_name_numpy_str(self, format):
...

@makmanalp makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from e11c90d to a90b215 Compare June 1, 2017 19:36
@makmanalp
Copy link
Contributor Author

Thanks a bunch @TomAugspurger! The latest push should handle all your comments correctly, and pass the 2.7 test this time!

assert_frame_equal(df, df2)

assert type(df2.index.name) == text_type
assert df2.index.name == u('rows\u05d0')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I guess this is unnecessary since you have the assert_frame_equal up above. Oh well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this too!

@makmanalp makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from a90b215 to ab75d27 Compare June 1, 2017 19:52
@makmanalp makmanalp changed the title BUG: Handle np strings in index.name in HDF #13492 (Work in Progress) BUG: Handle np strings in index.name in HDF #13492 Jun 1, 2017
@makmanalp makmanalp changed the title BUG: Handle np strings in index.name in HDF #13492 BUG: Handle numpy strings in index names in HDF #13492 Jun 1, 2017
@makmanalp makmanalp force-pushed the bug-13492-hdf-invalid-columns branch from ab75d27 to 90f63b0 Compare June 1, 2017 20:52
@makmanalp
Copy link
Contributor Author

(last push fixes a minor lint error that escaped me after the latest changes: pandas/tests/io/test_pytables.py:2931:24: E128 continuation line under-indented for visual indent)

@@ -70,6 +70,7 @@ I/O
- Bug that raised IndexError HTML-rendering an empty DataFrame (:issue:`15953`)
- Bug in ``pd.read_csv()`` in which tarfile object inputs were raising an error in Python 2.x for the C engine (:issue:`16530`)
- Bug where ``DataFrame.to_html()`` ignored the ``index_names`` parameter (:issue:`16493`)
- Bug where ``pd.read_hdf()`` returns numpy strings for index names (:issue:`13492`)
Copy link
Contributor

@jreback jreback Jun 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify this is for format='fixed' nvm I see that this is for both.

take this back. I see its only for fixed stores, so mention that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where numpy strings were returned, rather than python strings for a named Index.

@@ -2568,6 +2568,8 @@ def read_index_node(self, node, start=None, stop=None):

if 'name' in node._v_attrs:
name = node._v_attrs.name
if isinstance(name, compat.string_types):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write this as _ensure_str (like _ensure_decoded) and use it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment (and issue reference) of why this is needed

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor code change. ping on green.

@jreback jreback added this to the 0.20.2 milestone Jun 2, 2017
@jreback
Copy link
Contributor

jreback commented Jun 2, 2017

@makmanalp can you update

@TomAugspurger
Copy link
Contributor

Got it, merging on green and will cherry pick for 0.20.2;

@TomAugspurger TomAugspurger merged commit 18c316b into pandas-dev:master Jun 4, 2017
@TomAugspurger
Copy link
Contributor

Thanks @makmanalp

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 4, 2017
…ndas-dev#16444)

* BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

* REF: refactor to _ensure_str

(cherry picked from commit 18c316b)
TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

(cherry picked from commit 18c316b)
Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017
…ndas-dev#16444)

* BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

* REF: refactor to _ensure_str
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
…ndas-dev#16444)

* BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

* REF: refactor to _ensure_str
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 12, 2017
Version 0.20.2

* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (pandas-dev#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460)
  BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444)
  PERF: vectorize _interp_limit (pandas-dev#16592)
  DOC: whatsnew 0.20.2 edits (pandas-dev#16587)
  API: Make is_strictly_monotonic_* private (pandas-dev#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)
  Strictly monotonic (pandas-dev#16555)
  ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549)
  BUG: Fixed tput output on windows (pandas-dev#16496)
  Strictly monotonic (pandas-dev#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
32bit 32-bit systems IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: raise on invalid coulmns using a fixed HDFStore
3 participants