can't select a specific column from a HDFStore table with a MultiIndex DataFrame #6169

Closed
glyg opened this issue Jan 29, 2014 · 6 comments · Fixed by #6202

@glyg
Contributor

glyg commented Jan 29, 2014

I'm running into what seems to be a bug.
I'm using pandas version '0.13.0rc1-29-ga0a527b' from GitHub, with Python 3.3 on 64-bit Linux Mint 15.

Here's a minimal example that fails:

import numpy as np
import pandas as pd


index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo_name', 'bar_name'])


df_mi = pd.DataFrame(np.random.randn(10, 3), index=index,
                     columns=['A', 'B', 'C'])

with pd.get_store('minimal_io.h5') as store:
    store.put('df_mi', df_mi, format='table')

with pd.get_store('minimal_io.h5') as store:
    ixs = store.select('df_mi', "columns=['A']")

And here is the error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-005cf4e724e0> in <module>()
     17 
     18 with pd.get_store('minimal_io.h5') as store:
---> 19     ixs = store.select('df_mi', "columns=['A']")

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    622 
    623         return TableIterator(self, func, nrows=s.nrows, start=start, stop=stop,
--> 624                              auto_close=auto_close).get_values()
    625 
    626     def select_as_coordinates(

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/io/pytables.py in get_values(self)
   1252 
   1253     def get_values(self):
-> 1254         results = self.func(self.start, self.stop)
   1255         self.close()
   1256         return results

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/io/pytables.py in func(_start, _stop)
    611         def func(_start, _stop):
    612             return s.read(where=where, start=_start, stop=_stop,
--> 613                           columns=columns, **kwargs)
    614 
    615         if iterator or chunksize is not None:

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/io/pytables.py in read(self, columns, **kwargs)
   3796         df = super(AppendableMultiFrameTable, self).read(
   3797             columns=columns, **kwargs)
-> 3798         df = df.set_index(self.levels)
   3799 
   3800         # remove names for 'level_%d'

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   2327                 names.append(None)
   2328             else:
-> 2329                 level = frame[col].values
   2330                 names.append(col)
   2331                 if drop:

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/frame.py in __getitem__(self, key)
   1626             return self._getitem_multilevel(key)
   1627         else:
-> 1628             return self._getitem_column(key)
   1629 
   1630     def _getitem_column(self, key):

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/frame.py in _getitem_column(self, key)
   1633         # get column
   1634         if self.columns.is_unique:
-> 1635             return self._get_item_cache(key)
   1636 
   1637         # duplicate columns & possible reduce dimensionaility

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/generic.py in _get_item_cache(self, item)
    976         res = cache.get(item)
    977         if res is None:
--> 978             values = self._data.get(item)
    979             res = self._box_item_values(item, values)
    980             cache[item] = res

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/internals.py in get(self, item)
   2738                 return self.get_for_nan_indexer(indexer)
   2739 
-> 2740             _, block = self._find_block(item)
   2741             return block.get(item)
   2742         else:

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/internals.py in _find_block(self, item)
   3049 
   3050     def _find_block(self, item):
-> 3051         self._check_have(item)
   3052         for i, block in enumerate(self.blocks):
   3053             if item in block:

/home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/internals.py in _check_have(self, item)
   3056     def _check_have(self, item):
   3057         if item not in self.items:
-> 3058             raise KeyError('no item named %s' % com.pprint_thing(item))
   3059 
   3060     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: 'no item named foo_name'

> /home/guillaume/python3/lib/python3.3/site-packages/pandas-0.13.0rc1_29_ga0a527b-py3.3-linux-x86_64.egg/pandas/core/internals.py(3058)_check_have()
   3057         if item not in self.items:
-> 3058             raise KeyError('no item named %s' % com.pprint_thing(item))
   3059
@jreback
Contributor

jreback commented Jan 29, 2014

This is a bug. Because of the way multi-indexes are stored, the index levels are columns and so must be retrieved EVEN when specifying the columns filter.

Care to do a PR to fix it? (It's slightly tricky, but not too much.)

@glyg
Contributor Author

glyg commented Jan 29, 2014

I guess I can give it a shot...

@glyg
Contributor Author

glyg commented Jan 30, 2014

So I did some digging, and I think I got the problem well defined, but not the solution.

First, for anyone passing by, this works:

ixs = store.select('df_mi', columns=['A'])

as well as that:

ixs = store.select('df_mi', "foo_name='bar'", columns=['A'])

So the problem is in the implementation of the read function of the class AppendableMultiFrameTable in pandas.io.pytables. If the columns argument is passed within a where=... expression, the following will not be executed:

        if columns is not None:
            for n in self.levels:
                if n not in columns:
                    columns.insert(0, n)

As a result, the columns containing the index won't be retrieved, because they are absent from the where expression, which is parsed higher in the class hierarchy.
So that's where I'm stuck. As the columns argument is effectively None, it makes no sense to insert the index columns into it. Passing columns=[n for n in self.levels] also fails, as does
columns = self.non_index_axes[0][1], which is the list of all the columns.
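The failure mode described above can be reproduced without HDF5 at all. The following is a minimal sketch, assuming the table behaves like a flat frame in which the index levels are ordinary columns; `read_columns` and the sample data are hypothetical stand-ins for illustration, not the pandas implementation.

```python
import numpy as np
import pandas as pd

levels = ["foo_name", "bar_name"]  # index level names, stored as ordinary columns

# Flat frame standing in for what the table holds on disk (an assumption).
flat = pd.DataFrame({
    "foo_name": ["foo", "foo", "bar", "bar"],
    "bar_name": ["one", "two", "one", "two"],
    "A": np.arange(4.0),
    "B": np.arange(4.0) * 10,
})


def read_columns(frame, columns, levels):
    """Hypothetical stand-in for the table read: filter columns, then rebuild the index."""
    return frame[columns].set_index(levels)


# Filtering to ['A'] alone drops the level columns, so set_index cannot find them.
try:
    read_columns(flat, ["A"], levels)
    failed = False
except KeyError:
    failed = True

# Prepending the missing level names first, as the loop quoted above does, avoids this.
columns = ["A"]
for name in levels:
    if name not in columns:
        columns.insert(0, name)

result = read_columns(flat, columns, levels)
```

Here `failed` ends up true for the filtered read, while `result` keeps only column A and recovers the two-level index.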

So the only way I see is to somehow modify the where kwarg before it is passed to super(AppendableMultiFrameTable, self).read(columns=columns, **kwargs), for example by extracting the columns=['A'] part, or by appending the level names to it...
I had a look at that, but I fail to see how to do so...
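To make the "append the level names to it" idea concrete, here is a rough sketch that rewrites the where string before it is parsed. The helper `add_levels_to_where` is hypothetical and not part of pandas; it assumes the filter appears in the simple literal form `columns=[...]`, and a real fix would more likely work on the parsed expression than on the raw string.

```python
import ast
import re

# Hypothetical helper: not part of pandas. It only handles the simple
# "columns=[...]" form with a literal list and no nested brackets.
_COLUMNS_RE = re.compile(r"\bcolumns\s*=\s*(\[[^\]]*\])")


def add_levels_to_where(where, levels):
    """Return `where` with any missing level names prepended to its columns list."""
    match = _COLUMNS_RE.search(where)
    if match is None:
        return where  # no columns filter in the expression; nothing to do
    columns = ast.literal_eval(match.group(1))
    for name in levels:
        if name not in columns:
            columns.insert(0, name)
    # Splice the extended list back into the original expression.
    return where[:match.start(1)] + repr(columns) + where[match.end(1):]


new_where = add_levels_to_where("columns=['A']", ["foo_name", "bar_name"])
unchanged = add_levels_to_where("foo_name='bar'", ["foo_name", "bar_name"])
```

With the example from this issue, `new_where` lists both level names ahead of 'A', and an expression without a columns filter is returned as-is.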

@jreback
Contributor

jreback commented Jan 30, 2014

so, start off by writing some tests (which use your example), to try to get it to fail.

Then step thru that example to see where the issue is.

@glyg
Contributor Author

glyg commented Jan 30, 2014

Well, I guess I understand the issue, but I don't see how to solve it. As far as I understand, it implies something like modifying the selection attribute of the AppendableMultiFrameTable object, and I don't see how to do that; it looks pretty obscure to me...

@jreback
Contributor

jreback commented Jan 30, 2014

Well, the easiest is simply to raise an error when columns is passed and it's in conflict (this is easy because it's a variable). Don't worry about the columns in where; I can intercept that too (but it's deeper).

Writing the tests as a PR would be great though.
