Since 0.13: passing pandas DataFrame/Series like numpy array breaks #6127

Closed
twiecki opened this issue Jan 27, 2014 · 19 comments
Comments

@twiecki
Contributor

twiecki commented Jan 27, 2014

As discussed in #6063:
I noticed that numpy-style access sometimes breaks under 0.13. While I haven't been able to pinpoint the issue, calls like pylab.hist(-df.ix[row, col_name]) fail with an x[0] index error, and I have to use pylab.hist(-df.ix[row, col_name].values) instead.

Here is a csv file for which this happens: https://gist.github.com/8651509

plt.hist(pd.read_csv('debug.csv'))

produces:

--------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-203-b20a1ff5d1db> in <module>()
----> 1 hist(pd.load('debug.pickle'))

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/matplotlib/pyplot.pyc in hist(x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, hold, **kwargs)
   2825                       histtype=histtype, align=align, orientation=orientation,
   2826                       rwidth=rwidth, log=log, color=color, label=label,
-> 2827                       stacked=stacked, **kwargs)
   2828         draw_if_interactive()
   2829     finally:

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8247         # Massage 'x' for processing.
   8248         # NOTE: Be sure any changes here is also done below to 'weights'
-> 8249         if isinstance(x, np.ndarray) or not iterable(x[0]):
   8250             # TODO: support masked arrays;
   8251             x = np.asarray(x)

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    482     def __getitem__(self, key):
    483         try:
--> 484             result = self.index.get_value(self, key)
    485             if isinstance(result, np.ndarray):
    486                 return self._constructor(result,index=[key]*len(result)).__finalize__(self)

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
   1030 
   1031         try:
-> 1032             return self._engine.get_value(s, k)
   1033         except KeyError as e1:
   1034             if len(self) > 0 and self.inferred_type == 'integer':

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2890)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2702)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3440)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6595)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6536)()

KeyError: 0

While passing .values works.

@jreback
Contributor

jreback commented Jan 27, 2014

what numpy/matplotlib are you using here?

@jreback
Contributor

jreback commented Jan 27, 2014

I think your df has float headers, which may be the problem.

In [1]: import pylab

In [2]: pylab.hist(pd.read_csv('debug.csv',header=None))
Out[2]: 
([array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])],
 array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]),
 <a list of 2 Lists of Patches objects>)
INSTALLED VERSIONS
------------------
commit: 1112cb74264d40a91ce2a80f6bbbf24298a72f40
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.0rc1-151-g1777c89
Cython: 0.20
numpy: 1.7.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 1.0.0
sphinx: 1.1.3
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 3.0.0
numexpr: 2.1
matplotlib: 1.2.0
openpyxl: 1.5.7
xlrd: 0.9.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.4
bs4: None
html5lib: None
bq: v2.0.15
apiclient: 1.0

@twiecki
Contributor Author

twiecki commented Jan 27, 2014

This does happen with a df that was read in via a csv that had proper columns; i.e. the error does not occur only when I load from the file I provided.

http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#internal-refactoring reads as if it could be the cause, but I'm obviously not familiar enough with the internals. I have only observed this when I slice a dataframe and select a column in a call like df.ix[slice, col_name], then pass the result to a function that expects numpy ndarrays.

INSTALLED VERSIONS
------------------
Python: 2.7.3.final.0
OS: Linux
Release: 3.2.0-29-generic
Processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C

pandas: 0.13.0
Cython: 0.19.1
Numpy: 1.8.0
Scipy: 0.14.0.dev-a3e9c7f
statsmodels: 0.6.0.dev-fe6e688
    patsy: 0.2.1
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: Not installed
PyTables: 2.4.0
    numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: Not installed
xlsxwriter: Not installed
sqlalchemy: Not installed
lxml: 2.3.2
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed

@jreback
Contributor

jreback commented Jan 27, 2014

can you try on master?

@twiecki
Contributor Author

twiecki commented Jan 27, 2014

So far couldn't reproduce on master!

@twiecki
Contributor Author

twiecki commented Jan 27, 2014

I'll close this and will reopen if problem resurfaces.

@twiecki twiecki closed this as completed Jan 27, 2014
@jreback
Contributor

jreback commented Jan 27, 2014

ok...gr8!

@twiecki
Contributor Author

twiecki commented Jan 28, 2014

ok, resurfaced.

Here is an updated file: https://gist.github.com/anonymous/8676957

I can trigger this by loading and passing this (loaded as anti_val):
hist(anti_val.ix[anti_val.cond == 'incong', 'rt'], bins=bins, histtype='step', normed=True);
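
A minimal sketch of why a selection like this can yield a Series that has no label 0. The frame below is made up for illustration (it is not the data from the gist), and `.loc` is used in place of `.ix`, which later pandas releases removed:

```python
import pandas as pd

# Hypothetical stand-in for anti_val; the real data live in the gist above.
anti_val = pd.DataFrame({
    "cond": ["cong", "incong", "cong", "incong"],
    "rt": [0.41, 0.55, 0.39, 0.61],
})

# Boolean-mask selection keeps the original row labels of the matching rows.
rt_incong = anti_val.loc[anti_val.cond == "incong", "rt"]
labels = list(rt_incong.index)  # row 0 is 'cong', so label 0 is absent

# Positional access is unaffected by the missing label.
first_value = rt_incong.iloc[0]
```

If the first row happened to match the mask, label 0 would be present and `x[0]` would succeed, which may be why the error appears with some data and not with others.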

@twiecki twiecki reopened this Jan 28, 2014
@jreback
Contributor

jreback commented Jan 28, 2014

can u put up the exact code u r using to load

@twiecki
Contributor Author

twiecki commented Jan 28, 2014

Hrm, I can't reproduce with the freshly loaded one, sorry... I guess I could pickle it but not sure how to upload that anywhere quick and easy.

@jreback
Contributor

jreback commented Jan 28, 2014

Dropbox public link

@jreback
Contributor

jreback commented Jan 29, 2014

any reason you don't use series.hist()?

@jreback
Contributor

jreback commented Jan 29, 2014

@twiecki can you repro? about to release 0.13.1

@twiecki
Contributor Author

twiecki commented Jan 30, 2014

Sorry, here's the pickle that can reproduce it:
https://www.dropbox.com/s/1k9hln4cvoc1pev/anti_val.pickle

df = pd.read_pickle('/tmp/anti_val.pickle')
hist(df.ix[df.cond == 'incong', 'rt']);
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-164-bb7f9e1b7a1b> in <module>()
      1 df = pd.read_pickle('/tmp/anti_val.pickle')
----> 2 hist(df.ix[df.cond == 'incong', 'rt']);

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/matplotlib/pyplot.pyc in hist(x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, hold, **kwargs)
   2825                       histtype=histtype, align=align, orientation=orientation,
   2826                       rwidth=rwidth, log=log, color=color, label=label,
-> 2827                       stacked=stacked, **kwargs)
   2828         draw_if_interactive()
   2829     finally:

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8247         # Massage 'x' for processing.
   8248         # NOTE: Be sure any changes here is also done below to 'weights'
-> 8249         if isinstance(x, np.ndarray) or not iterable(x[0]):
   8250             # TODO: support masked arrays;
   8251             x = np.asarray(x)

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    487     def __getitem__(self, key):
    488         try:
--> 489             result = self.index.get_value(self, key)
    490             if isinstance(result, np.ndarray):
    491                 return self._constructor(result,index=[key]*len(result)).__finalize__(self)

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
   1030 
   1031         try:
-> 1032             return self._engine.get_value(s, k)
   1033         except KeyError as e1:
   1034             if len(self) > 0 and self.inferred_type == 'integer':

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2957)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_value (pandas/index.c:2772)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3498)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6930)()

/home/ipython/envs/ipynb/local/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6871)()

KeyError: 0

@jreback
Contributor

jreback commented Jan 30, 2014

ok, this will reproduce it:

pylab.hist(Series([1,2,3],index=[1,2,3]))

Here's what is happening. Matplotlib assumes it is always passed an ndarray or an iterable, so it first checks whether x is an ndarray. Before 0.13 a Series WAS an ndarray, so it never got to the second part of the check. That second part evaluates x[0], which is normally the 0th element; on a Series with an integer index it is a label lookup, and since your index has no label 0, it raises a KeyError.

If your index contains the label 0, all is good.

Matplotlib should be trapping this exception, so I believe this is a trivial bug there.
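
The two halves of that check can be sketched without matplotlib. This is only an illustration of the behaviour described above, using the same Series as the reproduction; the variable names are made up:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=[1, 2, 3])

# Since 0.13 a Series is no longer an ndarray subclass,
# so the first half of the check is False.
is_ndarray = isinstance(s, np.ndarray)

# s[0] is a label lookup on an integer index, and there is no label 0.
try:
    s[0]
    raised = False
except KeyError:
    raised = True

# Converting to an ndarray first restores positional semantics.
first = np.asarray(s)[0]
```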

work-arounds:

  • use pandas .hist method
  • pass the actual .values
  • trap the exception (by wrapping .hist) with your own routine that does one of the above

IIRC matplotlib < 1.3 doesn't have this issue.
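
A rough sketch of the wrapping work-around, assuming it is enough to convert pandas objects passed positionally. `with_ndarray_args` and `first_element` are hypothetical names invented here; `first_element` merely stands in for a library routine that indexes `x[0]`:

```python
import functools

import numpy as np
import pandas as pd


def with_ndarray_args(func):
    """Convert pandas positional arguments to ndarrays before calling func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        converted = [
            np.asarray(a) if isinstance(a, (pd.Series, pd.DataFrame)) else a
            for a in args
        ]
        return func(*converted, **kwargs)
    return wrapper


def first_element(x):
    # Hypothetical stand-in for a library routine that indexes x[0].
    return x[0]


safe_first = with_ndarray_args(first_element)
result = safe_first(pd.Series([1, 2, 3], index=[1, 2, 3]))
```

The same decorator could presumably be applied to the plotting call, e.g. `with_ndarray_args(pylab.hist)`, though that is untested here and keyword arguments such as `weights` would still pass through unconverted.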

@jreback jreback closed this as completed Jan 30, 2014
@twiecki
Contributor Author

twiecki commented Jan 30, 2014

Hm, OK. I always thought it was a nice feature that a pandas df behaved like an ndarray. And I think this happened not only with hist(). Isn't there some way to fake the isinstance() check?

@jreback
Contributor

jreback commented Jan 30, 2014

I have tried that, but it's essentially a C-level call; it doesn't work, not even with a metaclass.
Well, it works for a lot of stuff, but numpy is hard-headed about it; there's no way to get around it.
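
One reading of the metaclass remark, as a small sketch: a metaclass `__instancecheck__` hook only governs checks made against classes that the metaclass creates, so it cannot make an object pass `isinstance(obj, np.ndarray)`. The class names are invented for illustration, and the sketch does not exercise NumPy's own C-level type checks:

```python
import numpy as np


class AlwaysInstance(type):
    # The hook is consulted for isinstance(obj, cls) only when cls
    # was created by this metaclass.
    def __instancecheck__(cls, obj):
        return True


class FakeArray(metaclass=AlwaysInstance):
    pass


claims_everything = isinstance(42, FakeArray)            # hook is used
passes_as_ndarray = isinstance(FakeArray(), np.ndarray)  # hook is not used
```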

blame it on matplotlib!!! maybe file a bug report!

@twiecki
Contributor Author

twiecki commented Jan 30, 2014

I see. Well, maybe it's better to be explicit in any case; I've just gotten used to passing it around like an ndarray. Agreed that it's a matplotlib problem.

diazona added a commit to diazona/pandas that referenced this issue Dec 16, 2015
This commit prevents the KeyError raised when DataFrame.plot() is called
with xerr or yerr being a Series or DataFrame whose index doesn't include 0.
The error comes from matplotlib code which tries to access xerr[0] or yerr[0],
so to solve the problem, we convert xerr and yerr from Pandas objects to
NumPy ndarrays before sending them through to matplotlib. This is
a different instance of the same type of problem in Github issues pandas-dev#4493
and pandas-dev#6127 (and perhaps others).