BUG: DataFrame.describe() breaks with a column index of object type and numeric entries #13288

Closed
pijucha opened this issue May 25, 2016 · 7 comments

pijucha commented May 25, 2016

While preparing a commit for another issue in .describe(), I ran into this puzzling bug, which is surprisingly easy to trigger.

Symptoms

df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()
# Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

However:

df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN

# It's OK if we don't print on screen:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')

# Fixing this suspicious index (int works too):
x.columns = x.columns.astype(object)
x
Out[10]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

The same issue occurs with a simpler data frame:

df0 = pd.DataFrame([1,2,3,4])
# It's OK now
df0.describe()
Out[28]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

# Modify column index:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
# ...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

Current version (but the bug is also present in pandas release 0.18.1):

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...

Reason

My guess is that some internal function gets confused by a column index whose class (plain Index) and dtype (int64) disagree. The faulty index itself is created in .describe().

# Output from %debug df.describe()
# NDFrame.describe() in pandas/core/generic.py:
#
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946 
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955 
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d

_shallow_copy() in the marked line turns d.columns from an Int64Index into a plain Index that still reports dtype int64:

ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
> /home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1  4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')

Possible solutions

Lines 4957-4958 are actually used to fix issues that pd.concat brings about. They try to pass the column structure from self to d.
I think a simpler solution is replacing these lines with:

 d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
 d.columns = data.columns
 return d

or

d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d

data is a subframe of self and retains the same column structure.
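
As a rough check of that claim, here is a minimal sketch. It assumes a recent pandas release (so `items()` stands in for `iteritems()`), and it calls `Series.describe()` directly instead of the internal `describe_1d`. It selects the numeric sub-frame the way describe() does by default and reuses its column index for the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})

# Same default selection as describe(): numeric columns only.
data = df.select_dtypes(include=[np.number])

# Summarise each column, then glue the pieces together side by side.
ldesc = [s.describe() for _, s in data.items()]
d = pd.concat(ldesc, axis=1)

# Proposed fix: reuse the sub-frame's own column index.
d.columns = data.columns.copy()

print(type(d.columns).__name__, d.columns.dtype)
```

The column index of `d` stays a plain Index with object dtype, matching the original frame's columns.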

pd.concat has some parameters that help pass a hierarchical index but can't do anything on its own with a categorical one.

I'm going to submit a pull request with this fix together with some others related to describe(). I hope I haven't overlooked anything obvious. But if so, any comments are very welcome.


jreback commented May 26, 2016

simple enough to stringify column names.

jreback added the Bug label May 26, 2016

jreback commented May 26, 2016

https://github.com/pydata/pandas/blob/master/pandas/core/generic.py#L4957

should be

d.columns = Index(d.columns, dtype='object')

Separately, this violates our guarantees on Index creation. I think we should assert that the dtype of a created Index is object if it's not a sub-class.

@sinhrks

In [2]: i = Index([0,'A'])

In [3]: i._shallow_copy([0])
Out[3]: Index([0], dtype='int64')
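
For comparison, taking the same subset through the public Index API keeps the object dtype. This is a small sketch assuming a recent pandas release; `sub` is just an illustrative name.

```python
import pandas as pd

i = pd.Index([0, 'A'])      # mixed entries, so the dtype is object
sub = i.take([0])           # public API: positional subset of the index

print(sub.dtype)                           # object
print(pd.Index([0], dtype=object).dtype)   # object: an explicit dtype is honoured
```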


pijucha commented May 26, 2016

Not sure if I understand. Don't we want d.columns to be of the same type as self.columns?


jreback commented May 26, 2016

The trouble is the columns are split up by dtype, so the sub-indexes need to be constructed similarly. ._shallow_copy is an internal (to Index) method and should not be used here.

actually you don't even need to specify the dtype, I think

`d.columns = d.columns.copy()` will prob work here. The problem is `.values` converts to a base form which may change things (e.g. try this with a datetime for the columns and see what happens).
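
To illustrate the datetime case, here is a small sketch assuming a recent pandas release. The names `cols`, `via_values` and `via_copy` are illustrative only. A round trip through `.values` drops the time zone of a localized index, while `.copy()` keeps it.

```python
import pandas as pd

# A localized (tz-aware) datetime index, as might be used for columns.
cols = pd.date_range("2016-05-26", periods=2, tz="UTC")

via_values = pd.Index(cols.values)   # rebuilt from the raw NumPy array
via_copy = cols.copy()               # public API, index is kept as it is

print(via_values.tz)   # None
print(via_copy.tz)
```

The index rebuilt from `.values` is time-zone naive, so it no longer equals the original, whereas the copy does.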


pijucha commented May 26, 2016

Oh, you mean

d.columns = data.columns.copy()    #(1)

and earlier

d.columns = Index(data.columns, dtype='object')    #(2)

(1) does work. Or at least it passes tests from the repository plus some others I tried. (Actually, I ran nosetests with d.columns = data.columns but it shouldn't make a difference I guess.)

On the other hand, (2) fails with datetime64[ns] in columns. When I specify dtype=data.columns.dtype, it breaks with localized datetime.

My understanding is that since data = self.loc[bool_arr], data.columns is just a subset of the original column index and preserves its structure (including dtype and dtypes of its elements). So why not simply pass/copy it to d?

Another advantage of (1) is that we can skip the next line

d.columns.names = data.columns.names


jreback commented May 26, 2016

yeah, (2) is not what we want; we don't need to coerce, so use (1). This was using some internal code which it shouldn't have. (Index has a public API and we try to use it whenever possible, except when deeply needed.) The reason is that there are certain guarantees, which in this case were violated (the separate issue I opened).


pijucha commented May 26, 2016

@jreback Thanks for clarifying.

pijucha changed the title from "BUG: DataFrame.describe() breaks with numeric columns and a column index of object type" to "BUG: DataFrame.describe() breaks with a column index of object type and numeric entries" May 26, 2016
pijucha added a commit to pijucha/pandas that referenced this issue May 31, 2016
…s-dev#13288)

BUG pandas-dev#13104:
- Percentile identifiers are now rounded to the least precision
that keeps them unique.
- Supplying duplicates in percentiles will raise ValueError.

BUG pandas-dev#13288
- Fixed a column index of the output data frame.
Previously, if a data frame had a column index of object type and
the index contained numeric values, the output column index could
be corrupt. It led to ValueError if the output was displayed.

- describe() will raise ValueError with an informative message
on DataFrame without columns.
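
The two ValueError cases listed in the commit message can be exercised with a short sketch. It assumes a recent pandas release that still behaves as the message describes; `dup_error` and `empty_error` are illustrative names.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Duplicate percentiles should be rejected.
try:
    s.describe(percentiles=[0.25, 0.25])
    dup_error = None
except ValueError as exc:
    dup_error = exc

# A data frame without columns should not be describable.
try:
    pd.DataFrame().describe()
    empty_error = None
except ValueError as exc:
    empty_error = exc

print(type(dup_error).__name__, type(empty_error).__name__)
```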