BUG: DataFrame.describe() breaks with a column index of object type and numeric entries #13288

pijucha · 2016-05-25T22:17:16Z

Preparing a commit for another issue in .describe(), I encountered this puzzling bug, surprisingly easy to trigger.

Symptoms

df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()
# Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

However:

df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN

# It's OK if we don't print on screen:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')

# Fixing this suspicious index (int works too):
x.columns = x.columns.astype(object)
x
Out[10]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

Same issue happens with a simpler data frame:

df0 = pd.DataFrame([1,2,3,4])
# It's  OK now
df0.describe()
Out[28]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

# Modify column index:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
# ...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
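
A quick way to check whether an installed pandas is affected is to force the repr of the result, since the report above shows that describe() itself returns and only the on-screen rendering fails. This is a minimal sketch using the same df0 as above; the name rendered_ok is only illustrative. On an affected version (such as 0.18.1) the repr call should raise the ValueError shown above.

```python
import pandas as pd

df0 = pd.DataFrame([1, 2, 3, 4])
df0.columns = pd.Index([0], dtype=object)   # object-dtype index holding an int

summary = df0.describe()   # per the report, this call itself succeeds
try:
    repr(summary)          # rendering is where the traceback appears
    rendered_ok = True
except ValueError as exc:
    rendered_ok = False
    print("affected:", exc)

print(summary.columns.dtype, rendered_ok)
```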

Current version (but the bug is also present in pandas release 0.18.1):

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...

Reason

Some internal function gets confused by dtypes of a column index, I guess. But the faulty index is created in .describe().

# Output from %debug df.describe()
# NDFrame.describe() in pandas/core/generic.py:
#
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946 
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955 
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d

_shallow_copy() in the marked line changes d.columns:

ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
> /home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1  4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')

Possible solutions

Lines 4957-4958 are actually used to fix issues that pd.concat brings about. They try to pass the column structure from self to d.
I think a simpler solution is replacing these lines with:

 d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
 d.columns = data.columns
 return d

or

d = pd.DataFrame(pd.concat(ldesc, axis=1), index = pd.Index(names), columns = data.columns)
return d

data is a subframe of self and retains the same column structure.

pd.concat has some parameters that help pass a hierarchical index but can't do anything on its own with a categorical one.

I'm going to submit a pull request with this fix together with some others related to describe(). I hope I haven't overlooked anything obvious, but if I have, any comments are very welcome.


jreback · 2016-05-26T11:09:02Z

simple enough to stringify column names.

jreback · 2016-05-26T11:22:19Z

https://github.com/pydata/pandas/blob/master/pandas/core/generic.py#L4957

should be

d.columns = Index(d.columns, dtype='object')

Separately, this violates our guarantees on Index creation. I think we should assert that the dtype of a created Index is object if it's not a sub-class.

@sinhrks

In [2]: i = Index([0,'A'])

In [3]: i._shallow_copy([0])
Out[3]: Index([0], dtype='int64')
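
For comparison, the public Index constructor can be used to see which dtype each kind of input gets. This sketch uses only the public constructor, not the internal _shallow_copy. The class name printed depends on the pandas version (older releases have a dedicated Int64Index subclass, as seen in the debugger output earlier in this thread), so only the dtypes are of interest here.

```python
import pandas as pd

mixed = pd.Index([0, 'A'])              # mixed labels: object dtype
ints = pd.Index([0])                    # integers only: int64 dtype
forced = pd.Index([0], dtype=object)    # explicitly requested object dtype

for idx in (mixed, ints, forced):
    print(type(idx).__name__, idx.dtype)
```

An object-dtype index holding the single label 0, as in forced, is the shape that describe() has to preserve for the data frames in this report.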

pijucha · 2016-05-26T14:53:38Z

Not sure if I understand. Don't we want d.columns to be of the same type as self.columns?

jreback · 2016-05-26T16:01:27Z

The trouble is the columns are split up by dtype, so the sub-indexes need to be constructed similarly. ._shallow_copy is an internal (to Index) method and should not be used here.

actually you don't even need to specify the dtype, I think

`d.columns = d.columns.copy()` will prob work here. The problem is `.values` converts to a base form which may change things (e.g. try this with a datetime for the columns and see what happens).
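
One way to see the "base form" problem with datetime columns is to round-trip a timezone-aware index through its raw NumPy values. This is a minimal sketch with a made-up one-element index, not code from describe(): the rebuilt index loses its timezone, while a plain copy keeps it.

```python
import pandas as pd

cols = pd.DatetimeIndex(['2016-05-26'], tz='UTC')

rebuilt = pd.Index(cols.values)   # round-trip through the raw NumPy array
copied = cols.copy()              # public copy of the index itself

print(rebuilt.tz, copied.tz)
```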

pijucha · 2016-05-26T19:13:52Z

Oh, you mean

d.columns = data.columns.copy()    #(1)

and earlier

d.columns = Index(data.columns, dtype='object').  #(2)

(1) does work. Or at least it passes tests from the repository plus some others I tried. (Actually, I ran nosetests with d.columns = data.columns but it shouldn't make a difference I guess.)

On the other hand, (2) fails with datetime64[ns] in columns. When I specify dtype=data.columns.dtype, it breaks with localized datetime.

My understanding is that since data = self.loc[bool_arr], data.columns is just a subset of the original column index and preserves its structure (including its dtype and the dtypes of its elements). So why not simply pass/copy it to d?

Another advantage of (1) is that we can skip the next line

d.columns.names = data.columns.names
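
Option (1) can be sketched outside pandas internals roughly as follows. This is only an illustration under simplifying assumptions: describe_numeric is a hypothetical helper, it handles numeric columns only, and it ignores the percentiles and include/exclude arguments of the real describe().

```python
import numpy as np
import pandas as pd


def describe_numeric(frame):
    """Hypothetical helper: summarize numeric columns, keeping the column index."""
    data = frame.select_dtypes(include=[np.number])
    # One summary Series per numeric column, in column order.
    ldesc = [s.describe() for _, s in data.items()]
    d = pd.concat(ldesc, axis=1)
    # Option (1): copy the column index instead of rebuilding it from raw values.
    d.columns = data.columns.copy()
    return d


df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
out = describe_numeric(df)
print(out.columns.dtype)
print(out)
```

Because the index is copied rather than rebuilt from .values, the output keeps whatever dtype the selected columns already had.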

jreback · 2016-05-26T19:20:35Z

yeah, (2) is not what we want; we don't need to coerce. So use (1). This was using some internal code which it shouldn't have. (Index has a public API and we try to use it whenever possible, except when deeply needed.) The reason is that there are certain guarantees, which in this case were violated (the separate issue I opened).

pijucha · 2016-05-26T20:20:38Z

@jreback Thanks for clarifying.

…s-dev#13288)

BUG pandas-dev#13104:
- Percentile identifiers are now rounded to the least precision that keeps them unique.
- Supplying duplicates in percentiles will raise ValueError.

BUG pandas-dev#13288:
- Fixed a column index of the output data frame. Previously, if a data frame had a column index of object type and the index contained numeric values, the output column index could be corrupt. It led to ValueError if the output was displayed.
- describe() will raise ValueError with an informative message on DataFrame without columns.

jreback added the Bug label May 26, 2016

jreback added Difficulty Novice labels May 26, 2016

jreback added this to the 0.18.2 milestone May 26, 2016

jreback mentioned this issue May 26, 2016

ERR: _shallow_copy should assert dtype #13294

Closed

pijucha changed the title ~~BUG: DataFrame.describe() breaks with numeric columns and a column index of object type~~ BUG: DataFrame.describe() breaks with a column index of object type and numeric entries May 26, 2016

pijucha mentioned this issue May 26, 2016

BUG: Fix describe(): percentiles (#13104), col index (#13288) #13298

Closed

jreback closed this as completed in 132c1c5 May 31, 2016
