
BUG: regression in max_info_columns behaviour? #6939


Closed
2 tasks
jorisvandenbossche opened this issue Apr 23, 2014 · 25 comments · Fixed by #7130
Labels
Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version

@jorisvandenbossche
Member

Update:

  • max_info_columns should not behave the same as max_info_rows. If max_info_columns is exceeded, it should flip to short summary (as verbose=False)
  • we could add a keyword argument like show_counts (or another name) to specify if you want to show the non-null counts (to be able to override the max_info_rows option for a specific info call)

When you have more columns than specified in max_info_columns, df.info() will now still show all columns, but just without the information about the number of null values:

In [1]: pd.__version__
Out[1]: '0.13.1-656-gf30278e'

In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
0    5 non-null float64
1    5 non-null float64
2    5 non-null float64
3    5 non-null float64
4    5 non-null float64
dtypes: float64(5)

In [5]: df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Columns: 5 entries, 0 to 4
dtypes: float64(5)

In [6]: pd.options.display.max_info_columns = 4

In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
0    float64
1    float64
2    float64
3    float64
4    float64
dtypes: float64(5)

while previously in 0.13 this gave the same behaviour as for info(verbose=False):

In [15]: pd.__version__
Out[15]: '0.13.0'

In [16]: pd.options.display.max_info_columns = 4

In [17]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Columns: 5 entries, 0 to 4
dtypes: float64(5)

which seems much more logical to me.

I suppose this is related to #5682, which added the behaviour to also show the dtype per column.

Update: it was deliberately added in #5974

@jorisvandenbossche
Member Author

So it was added in #5974. The rationale was that options.display.max_info_rows had been deprecated because of the new dataframe display, but that PR then reused it in df.info to decide whether to show the counts or not.
However, there is nothing about max_info_columns in that PR, and I personally also don't think that is what max_info_columns should be for.

Because now, if you have e.g. a dataframe with 1000 columns, df.info() will print 1000 lines to your terminal.

@jorisvandenbossche
Member Author

This is the intended behaviour of that PR (as far as I understand):

In [17]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
0    5 non-null float64
1    5 non-null float64
2    5 non-null float64
3    5 non-null float64
4    5 non-null float64
dtypes: float64(5)

In [18]: pd.options.display.max_info_rows = 4

In [19]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
0    float64
1    float64
2    float64
3    float64
4    float64
dtypes: float64(5)

@jreback
Contributor

jreback commented Apr 23, 2014

ok so why don't we hijack max_info_columns to determine info_verbose rather than adding a separate parameter?

@jorisvandenbossche
Member Author

That was how it was in 0.13 and before, and that seems more logical to me personally.

@jreback
Contributor

jreback commented Apr 23, 2014

cc @bjonen

so let's make that change rather than adding the info_verbose option in #6890

@jorisvandenbossche
Member Author

Although it is still a little bit different. The max_info_columns option sets the threshold for switching between the long and the short info output, while an info_verbose option would set the default for info(verbose=..) for all dataframes.

EDIT: I suppose a user can then just set max_info_columns=0 to always get the short summary info?

@jorisvandenbossche
Member Author

@cpcloud OK with this? (you were also involved a bit in the info view rework in 0.13.1, not exactly this but the showing of the dtypes)

@jreback
Contributor

jreback commented Apr 23, 2014

max_info_columns=None/0 should effectively trigger what info_verbose=False was going to do
(and you can always pass it in too)

@jorisvandenbossche
Member Author

Added two to-dos at the top of the issue:

  • max_info_columns should not behave the same as max_info_rows. If max_info_columns is exceeded, it should flip to short summary (as verbose=False)
  • we could add a keyword argument like show_counts (or another name) to specify if you want to show the non-null counts (to be able to override the max_info_rows option for a specific info call)

@jreback
Contributor

jreback commented May 1, 2014

cc @bjonen

you are working on this one?

@jreback
Contributor

jreback commented May 12, 2014

cc @sinhrks want to dig in on this?

would be really helpful

@TomAugspurger
Contributor

I can take this one.

@jreback
Contributor

jreback commented May 12, 2014

@TomAugspurger that would be awesome!

(I think it's only really necessary to address @jorisvandenbossche's first point in 0.14.0; the rest can wait) - though I think there IS a kw for that now anyhow

@TomAugspurger
Contributor

Just to make sure.

Current behavior:

In [9]: df = pd.DataFrame(np.random.randn(10, 5))

In [10]: pd.set_option('max_info_columns', 4)

In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 5 columns):
0    float64
1    float64
2    float64
3    float64
4    float64
dtypes: float64(5)

But that should be

In [12]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Columns: 5 entries, 0 to 4
dtypes: float64(5)

Which is the output you get from df.info(verbose=False)

@jreback
Contributor

jreback commented May 12, 2014

I think that is right, it closes #6568 with an automatic passing of verbose=False which I think is a regression from 0.13.0.

no tests currently for either max_info_columns or max_info_rows at all

@cpcloud
Member

cpcloud commented May 12, 2014

@jorisvandenbossche the change is fine by me

@TomAugspurger
Contributor

There's going to be a bit of a conflict between the options and the verbose keyword to info.

In [1]: df = pd.DataFrame(np.random.randn(10, 5))

# This stuff is fine. It doesn't exceed the limit
In [3]: df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Columns: 5 entries, 0 to 4
dtypes: float64(5)
In [4]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 5 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
4    10 non-null float64
dtypes: float64(5)

In [5]: pd.set_option('display.large_repr', 'info',
                      'display.max_info_columns', 4,
                      'display.max_columns', 2)
# Now we have too many cols, so we want to truncate the info repr:
In [6]: df
Out[6]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Columns: 5 entries, 0 to 4
dtypes: float64(5)

In [7]: df.info(verbose=False)  # same
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Columns: 5 entries, 0 to 4
dtypes: float64(5)
In [8]: df.info(verbose=True)  # No way to get all the columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Columns: 5 entries, 0 to 4
dtypes: float64(5)

Pretty much, should the verbose keyword override the display.max_info_columns setting? Should the last case print out the full info repr?

@jreback
Contributor

jreback commented May 12, 2014

yes, if a user actually specifies a 'local' keyword, it should override the 'global' setting

I think verbose=None should actually be the default

then you can see if the user actually passed anything

@TomAugspurger
Contributor

Ok... I think that to implement this, I'll need to change the default of verbose in df.info to None, and then treat None differently than False, is that OK? Because right now verbose defaults to True, which means we'd always be printing the column count summary, even if it exceeds the max_info_columns setting.

Cases

  1. verbose=None, max_info_columns exceeded -> don't print column count summary
  2. verbose=False, max_info_columns exceeded -> don't print column count summary (redundant)
  3. verbose=True, max_info_columns exceeded -> do print column count summary (precedence to verbose)
  4. verbose=None, max_info_columns not exceeded -> do print column count summary
  5. verbose=False, max_info_columns not exceeded -> don't print column count summary (precedence to verbose)
  6. verbose=True, max_info_columns not exceeded -> do print column count summary (redundant)
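
The six cases above can be sketched as a small resolution function. This is only an illustration of the proposed precedence, not the actual pandas code: the helper name `use_verbose_output` and its parameters are hypothetical, and treating "exceeded" as a strict greater-than comparison is an assumption.

```python
def use_verbose_output(verbose, n_columns, max_info_columns):
    """Return True for the long per-column output, False for the short summary.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Assumption: the option counts as "exceeded" only when the frame has
    # strictly more columns than max_info_columns.
    exceeded = n_columns > max_info_columns

    if verbose is None:
        # Nothing was passed explicitly, so the global option decides
        # (cases 1 and 4).
        return not exceeded

    # An explicit True/False always takes precedence over the option
    # (cases 2, 3, 5 and 6).
    return bool(verbose)
```

With 5 columns and max_info_columns set to 4, `use_verbose_output(None, 5, 4)` gives the short summary, while `use_verbose_output(True, 5, 4)` still gives the long output.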

@jreback
Contributor

jreback commented May 12, 2014

that looks right

@TomAugspurger
Contributor

notebook showing this http://nbviewer.ipython.org/gist/TomAugspurger/58838d627194a9113f66

Not sure if it's intentional, but the html info_repr doesn't have the

<class 'pandas.core.frame.DataFrame'>

at the top of the repr.

@jreback
Contributor

jreback commented May 12, 2014

hmm seems like they should be the same (except for the basic/table repr)

@TomAugspurger
Contributor

Yeah they probably should be. It's the same way on master. I'll see if I can track it down

@jorisvandenbossche
Member Author

@TomAugspurger About the conflict between the options and the verbose keyword to info, this conflict was already there in 0.13. But I agree with the proposed way to solve this (this is then even an improvement over 0.13, not only fixing the regression).

Your overview of cases seems indeed right, only "don't print column count summary" is not fully correct, it should rather be "don't print summary (count/dtype) of all columns separately", as whether the counts are shown or not is determined by max_info_rows (if exceeded then the counts are not shown but only the dtypes). But I suppose that you actually meant it the right way :-)

Thanks for taking this up!
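
To make the distinction concrete, here is a rough sketch of the two independent decisions: max_info_columns (or an explicit verbose) picks between the per-column listing and the short summary, and max_info_rows picks whether the non-null counts appear in that listing. The function `info_layout` is hypothetical, `show_counts` is only the keyword name floated at the top of this issue, and the strict greater-than thresholds are an assumption.

```python
def info_layout(n_rows, n_columns, max_info_rows, max_info_columns,
                verbose=None, show_counts=None):
    """Return (per_column, with_counts) for a hypothetical info() call.

    Illustration only; this is not the pandas implementation.
    """
    # Decision 1: list every column, or fall back to the short summary?
    if verbose is None:
        per_column = n_columns <= max_info_columns
    else:
        per_column = bool(verbose)

    # Decision 2: show non-null counts next to the dtypes?
    # show_counts is the proposed (hypothetical) per-call override.
    if show_counts is None:
        show_counts = n_rows <= max_info_rows

    # Counts only make sense when the columns are listed at all.
    return per_column, per_column and bool(show_counts)
```

For the 5x5 frame used in this issue, `info_layout(5, 5, max_info_rows=4, max_info_columns=100)` lists the columns with dtypes only, and `info_layout(5, 5, max_info_rows=100, max_info_columns=4)` falls back to the short summary.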

@jreback
Contributor

jreback commented May 14, 2014

@TomAugspurger progress on this?


4 participants