Skip to content

BUG: invalid column names in a HDF5 table format #9057 #10098

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 7, 2015

Conversation

jameshiebert
Copy link
Contributor

Have DataFrame.to_hdf() raise an error when using pytables with non-string
column types

closes #9057

@@ -257,6 +257,30 @@ def _tables():
def to_hdf(path_or_buf, key, value, mode=None, complevel=None, complib=None,
append=None, **kwargs):
""" store this object, close it if we opened it """

# PyTables has some limitations that we need to check for
if kwargs.get('format', None) == 'table':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place for this. Should go here: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1546

Just add on logic to validate. This is an IndexCol at this point, so should be straightforward.

@jameshiebert
Copy link
Contributor Author

I'm definitely open to moving this logic to the most appropriate place, and will rely on your judgement as to where that is. As far as I can tell, though, the code path through to_hdf() doesn't ever get to validate_col() prior to the error that we're trying to prevent/mask with this patch.

The calls to validate happen after table_creation (i.e. where the original error comes from).

For example:

james@basalt ~/code/git/pandas/pandas $ git --no-pager diff | tail
     if append:
         f = lambda store: store.append(key, value, **kwargs)
@@ -1569,6 +1569,7 @@ class IndexCol(StringMixin):

     def validate_col(self, itemsize=None):
         """ validate this column: return the compared against itemsize """
+        raise Exception("We'll never make it here")

         # validate this column for string truncation (or reset to the max size)
         if _ensure_decoded(self.kind) == u('string'):
(testenv)james@basalt ~/code/git/pandas $ nosetests -v pandas/io/tests/test_pytables.py:TestHDFStore.test_to_hdf_with_integer_column_names
test_to_hdf_with_integer_column_names (pandas.io.tests.test_pytables.TestHDFStore) ... ERROR

======================================================================
ERROR: test_to_hdf_with_integer_column_names (pandas.io.tests.test_pytables.TestHDFStore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/james/code/git/pandas/pandas/io/tests/test_pytables.py", line 4625, in test_to_hdf_with_integer_column_names
    self.assertRaises(ValueError, df.to_hdf, path, 'df', format='table', data_columns=True)
  File "/home/james/code/git/pandas/pandas/util/testing.py", line 1576, in assertRaises
    _callable(*args, **kwargs)
  File "/home/james/code/git/pandas/pandas/core/generic.py", line 918, in to_hdf
    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/home/james/code/git/pandas/pandas/io/pytables.py", line 292, in to_hdf
    f(store)
  File "/home/james/code/git/pandas/pandas/io/pytables.py", line 287, in <lambda>
    f = lambda store: store.put(key, value, **kwargs)
  File "/home/james/code/git/pandas/pandas/io/pytables.py", line 835, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/home/james/code/git/pandas/pandas/io/pytables.py", line 1284, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/home/james/code/git/pandas/pandas/io/pytables.py", line 3792, in write
    table = self._handle.create_table(self.group, **options)
  File "/home/james/miniconda3/envs/testenv/lib/python3.4/site-packages/tables/file.py", line 1060, in create_table
    chunkshape=chunkshape, byteorder=byteorder)
  File "/home/james/miniconda3/envs/testenv/lib/python3.4/site-packages/tables/table.py", line 811, in __init__
    self.description = Description(description)
  File "/home/james/miniconda3/envs/testenv/lib/python3.4/site-packages/tables/description.py", line 466, in __init__
    if name.startswith('_v_'):
AttributeError: 'numpy.int64' object has no attribute 'startswith'

----------------------------------------------------------------------
Ran 1 test in 0.010s

FAILED (errors=1)

Should there be a separate validation call somewhere higher in write()? Or am I missing something obvious here?

@jreback
Copy link
Contributor

jreback commented May 11, 2015

Try something like this

diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py
index 458a245..aea9336 100644
--- a/pandas/io/pytables.py
+++ b/pandas/io/pytables.py
@@ -1535,6 +1535,12 @@ class IndexCol(StringMixin):
                 self.typ = _tables(
                 ).StringCol(itemsize=min_itemsize, pos=self.pos)

+    def validate(self, handler, append, **kwargs):
+        self.validate_names()
+
+    def validate_names(self):
+        pass
+
     def validate_and_set(self, handler, append, **kwargs):
         self.set_table(handler.table)
         self.validate_col()
@@ -2080,6 +2086,10 @@ class DataIndexableCol(DataCol):
     """ represent a data column that can be indexed """
     is_data_indexable = True

+    def validate_names(self):
+        if not Index(self.values).is_object():
+            raise ValueError("cannot have a non-object label DataIndexableCol")
+
     def get_atom_string(self, block, itemsize):
         return _tables().StringCol(itemsize=itemsize)

@@ -3752,6 +3762,9 @@ class AppendableTable(LegacyTable):
                          min_itemsize=min_itemsize,
                          **kwargs)

+        for a in self.axes:
+            a.validate(self, append)
+
         if not self.is_exists:

@jameshiebert
Copy link
Contributor Author

Hmm, yeah, that works. Can I ask, what's the expected behaviour when a user tries to use unicode column name? Is this test actually testing for the right behaviour or is it irrelevant?

@jreback jreback changed the title BUG GH9057 BUG: invalid column names in a HDF5 table format #9057 May 12, 2015
@jreback
Copy link
Contributor

jreback commented May 12, 2015

@jameshiebert

  • can you comprehensively test for the this a bit more (e.g. make datetimes the actual names of the columns for example), I think we can only allow string-likes.
  • I think your unicode test is good. best to prob go thru all of the index types and make sure that ONLY the string-like ones work.

e.g.

for i in [tm.makeIntegerIndex(),tm.makeUnicodeIndex()......]:
    # test that these raise

All that said, I unicode should work in PY3! (but not PY2).

What we are actually asking here is if a non-ascii encoded string will work.

jameshiebert added a commit to jameshiebert/pandas that referenced this pull request May 13, 2015
@jameshiebert
Copy link
Contributor Author

@jreback Thanks for the suggestions. I used your pytables.py code straight-up, but as for the test, I think that what I pushed covers all of the index types. If I missed anything, please let me know.


for index in types_should_run:
df = DataFrame(np.random.randn(10, 2), columns=index(2))
with ensure_clean_path(self.path) as path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the ones that should run, I would then select out say the first element and make sure you are getting a non-zero length frame, e.g.

result = pd.read_hdf(path,'df',where="index = [{0}]".format(df.index[0]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update to my comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not following. Do you want me address this comment above? I thought that I did so two lines down...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, didn't see that

@jreback
Copy link
Contributor

jreback commented May 13, 2015

pls add a release note in v0.17.0.txt

@jameshiebert jameshiebert force-pushed the bugfix-9057-error-only branch from 4f0325d to a01d8c0 Compare May 13, 2015 19:01
jameshiebert added a commit to jameshiebert/pandas that referenced this pull request May 13, 2015
@jreback
Copy link
Contributor

jreback commented Jun 2, 2015

can you rebase and squash, contributing docs are here if you need help

@jameshiebert jameshiebert force-pushed the bugfix-9057-error-only branch from a01d8c0 to 8273356 Compare June 2, 2015 15:58
@jameshiebert
Copy link
Contributor Author

@jreback It should be good to go.

@@ -73,7 +73,7 @@ Bug Fixes
- Bug in ``Timestamp``'s' ``microsecond``, ``quarter``, ``dayofyear``, ``week`` and ``daysinmonth`` properties return ``np.int`` type, not built-in ``int``. (:issue:`10050`)
- Bug in ``NaT`` raises ``AttributeError`` when accessing to ``daysinmonth``, ``dayofweek`` properties. (:issue:`10096`)


- Bug in ``DataFrame.to_hdf()`` where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (:issue:`9057`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are now going to release 0.16.2, so can you remove from here (it no longer exists) and put there?

@jreback jreback modified the milestones: 0.16.2, 0.17.0 Jun 2, 2015
@jameshiebert jameshiebert force-pushed the bugfix-9057-error-only branch from 8273356 to 56fcc9d Compare June 3, 2015 15:42
@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

ok, why don't you rebase; ping when green.

Have DataFrame.to_hdf() raise an error when using pytables with non-string
column types. Fixes pandas-dev#9057
@jameshiebert jameshiebert force-pushed the bugfix-9057-error-only branch from 56fcc9d to 33cfaea Compare June 6, 2015 02:14
@jameshiebert
Copy link
Contributor Author

@jreback This PR should be good to go once again.

jreback added a commit that referenced this pull request Jun 7, 2015
BUG: invalid column names in a HDF5 table format #9057
@jreback jreback merged commit d142452 into pandas-dev:master Jun 7, 2015
@jreback
Copy link
Contributor

jreback commented Jun 7, 2015

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_hdf failing with integer columns and data_columns=True
2 participants