describe() for boolean series #6625

Closed
waitingkuo opened this issue Mar 13, 2014 · 3 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations

@waitingkuo
Contributor

It seems that a boolean Series is treated as numeric data:

In [27]: import pandas as pd
In [28]: s = pd.Series([True, True, False])
In [29]: s.describe()
Out[29]: 
count            3
mean     0.6666667
std      0.5773503
min          False
25%            0.5
50%              1
75%              1
max           True
dtype: object

Should we treat it as non-numeric data instead? That is, produce output like the following:

In [31]: s.describe()
Out[31]: 
count        3
unique       2
top       True
freq         2
dtype: object
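
For reference, a minimal sketch of how the two summaries can be compared without changing describe() itself: casting to object yields the count/unique/top/freq fields requested above, while casting to int yields the numeric fields. The variable names are illustrative only, and what plain s.describe() returns for a bool Series may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([True, True, False])

# Categorical-style summary: treat the booleans as generic objects.
as_object = s.astype(object).describe()

# Numeric-style summary: treat True/False as 1/0.
as_number = s.astype(int).describe()

print(as_object)
print(as_number)
```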
@jreback
Contributor

jreback commented Mar 13, 2014

We can't change the actual fields, since describe() needs to return consistently for all dtypes. You could special-case it, though. I don't think this was done because the same operations on a frame will integerize the values of a boolean dtype.

So it should be done consistently (and the DataFrame fix is a bit non-trivial).

@jreback jreback added this to the 0.15.0 milestone Mar 13, 2014
@waitingkuo
Contributor Author

I can help special-case the boolean dtype.

Most of the code for DataFrame.describe() is the same as for Series.describe(). I'll refactor the shared parts to follow the "Don't Repeat Yourself" principle.

The current function returns only the numeric columns, unless all of the columns are non-numeric. I propose adding a parameter, say column_type, to choose which kind of columns to describe. It would default to 'numeric' to stay backward-compatible.

Example:

In [65]: df
Out[65]: 
   0  1      2                          3
0  1  a   True 2014-03-13 11:24:03.297115
1  2  b   True 2014-03-13 11:24:03.297125
2  3  c  False 2014-03-13 11:24:03.297128

[3 rows x 4 columns]

In [66]: df.describe(column_type='numeric')
Out[66]: 
         0
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0

[8 rows x 1 columns]


In [70]: df.describe(column_type='datetime')
Out[70]: 
                                 3
count                            3
unique                           3
first   2014-03-13 11:24:03.297115
last    2014-03-13 11:24:03.297128
top     2014-03-13 11:24:03.297115
freq                             1

[6 rows x 1 columns]

In [72]: df.describe(column_type='object')
Out[72]: 
        1     2
count   3     3
unique  3     2
top     a  True
freq    1     2

[4 rows x 2 columns]

In [75]: df.describe(column_type='all')
Out[75]: 
          0    1     2                           3
count   3.0  NaN   NaN                         NaN
mean    2.0  NaN   NaN                         NaN
std     1.0  NaN   NaN                         NaN
min     1.0  NaN   NaN                         NaN
25%     1.5  NaN   NaN                         NaN
50%     2.0  NaN   NaN                         NaN
75%     2.5  NaN   NaN                         NaN
max     3.0  NaN   NaN                         NaN
count   NaN    3     3                         NaN
unique  NaN    3     2                         NaN
top     NaN    a  True                         NaN
freq    NaN    1     2                         NaN
count   NaN  NaN   NaN                           3
unique  NaN  NaN   NaN                           3
first   NaN  NaN   NaN  2014-03-13 11:24:03.297115
last    NaN  NaN   NaN  2014-03-13 11:24:03.297128
top     NaN  NaN   NaN  2014-03-13 11:24:03.297115
freq    NaN  NaN   NaN                           1

[18 rows x 4 columns]
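
For readers arriving later: column_type is only the name proposed in this comment, not an existing argument. Later pandas releases expose include and exclude arguments on DataFrame.describe that cover similar ground. A minimal sketch, assuming a reasonably recent pandas; the column names are made up for illustration and the exact rows returned depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the example above (column names are hypothetical).
df = pd.DataFrame({
    "n": [1, 2, 3],
    "letter": ["a", "b", "c"],
    "flag": [True, True, False],
    "when": pd.to_datetime([
        "2014-03-13 11:24:03.297115",
        "2014-03-13 11:24:03.297125",
        "2014-03-13 11:24:03.297128",
    ]),
})

# Only the numeric column.
numeric = df.describe(include=[np.number])

# Only the boolean column, summarised with count/unique/top/freq.
flags = df.describe(include=["bool"])

# Every column; statistics that do not apply to a column are NaN.
everything = df.describe(include="all")

print(numeric)
print(flags)
print(everything)
```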

@jreback
Contributor

jreback commented Mar 13, 2014

I think you can simply make this work correctly for all columns; the resulting dtype of the frame would be object if you end up with mixed types, but that's just how it is.

Basically, have the Series describe deal with the dtype correctly, and have the frame describe just call it instead of the individual functions (don't use apply though, just iterate as it does now).

Also, don't add fields for now; maybe later.
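
A rough sketch of the structure suggested here, not the actual pandas implementation: a per-Series function picks the summary by dtype, and the frame-level function iterates over the columns and calls it. Both describe_series and describe_frame are hypothetical helper names, and casting booleans to object is just one possible way to special-case them.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def describe_series(s):
    """Hypothetical helper: choose the summary based on the Series dtype."""
    if is_bool_dtype(s):
        # Special-case booleans so they get count/unique/top/freq.
        return s.astype(object).describe()
    return s.describe()


def describe_frame(df):
    """Hypothetical helper: iterate over columns instead of using apply."""
    pieces = {}
    for name in df.columns:
        pieces[name] = describe_series(df[name])
    # Rows that do not apply to a column are filled with NaN.
    return pd.DataFrame(pieces)


df = pd.DataFrame({"n": [1, 2, 3], "flag": [True, True, False]})
result = describe_frame(df)
print(result)
```

Mixed numeric and non-numeric summaries end up in one frame, so statistics that do not apply to a column show up as NaN.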

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback jreback modified the milestones: 0.18.0, Next Major Release Feb 27, 2016

2 participants