Skip to content

Fix repr of DataFrame with IntervalIndex #24134

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 13, 2018

Conversation

jorisvandenbossche
Copy link
Member

@TomAugspurger after the repr PR, the docs build catched an error: the repr of a DataFrame with an IntervalIndex started failing:

In [1]: df = pd.DataFrame({'A': [1, 2, 3, 4]}, 
   ...:                   index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))                                                                                                  

In [2]: df                                                                                                                                                                      
Out[2]: ---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/miniconda3/envs/dev/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/miniconda3/envs/dev/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403 
    404             return _default_pprint(obj, self, cycle)

~/miniconda3/envs/dev/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

~/scipy/pandas/pandas/core/base.py in __repr__(self)
     75         Yields Bytestring in Py2, Unicode String in py3.
     76         """
---> 77         return str(self)
     78 
     79 

~/scipy/pandas/pandas/core/base.py in __str__(self)
     54 
     55         if compat.PY3:
---> 56             return self.__unicode__()
     57         return self.__bytes__()
     58 

~/scipy/pandas/pandas/core/frame.py in __unicode__(self)
    626             width = None
    627         self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
--> 628                        line_width=width, show_dimensions=show_dimensions)
    629 
    630         return buf.getvalue()

~/scipy/pandas/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width)
    707                                            decimal=decimal,
    708                                            line_width=line_width)
--> 709         formatter.to_string()
    710 
    711         if buf is None:

~/scipy/pandas/pandas/io/formats/format.py in to_string(self)
    601         else:
    602 
--> 603             strcols = self._to_str_columns()
    604             if self.line_width is None:  # no need to wrap around just print
    605                 # the whole frame

~/scipy/pandas/pandas/io/formats/format.py in _to_str_columns(self)
    510         # may include levels names also
    511 
--> 512         str_index = self._get_formatted_index(frame)
    513 
    514         if not is_list_like(self.header) and not self.header:

~/scipy/pandas/pandas/io/formats/format.py in _get_formatted_index(self, frame)
    807                                      names=show_index_names, formatter=fmt)
    808         else:
--> 809             fmt_index = [index.format(name=show_index_names, formatter=fmt)]
    810         fmt_index = [tuple(_make_fixed_width(list(x), justify='left',
    811                                              minimum=(self.col_space or 0),

~/scipy/pandas/pandas/core/indexes/base.py in format(self, name, formatter, **kwargs)
    993             return header + list(self.map(formatter))
    994 
--> 995         return self._format_with_header(header, **kwargs)
    996 
    997     def _format_with_header(self, header, na_rep='NaN', **kwargs):

~/scipy/pandas/pandas/core/indexes/interval.py in _format_with_header(self, header, **kwargs)
   1012 
   1013     def _format_with_header(self, header, **kwargs):
-> 1014         return header + list(self._format_native_types(**kwargs))
   1015 
   1016     def _format_native_types(self, na_rep='', quoting=None, **kwargs):

~/scipy/pandas/pandas/core/indexes/interval.py in _format_native_types(self, na_rep, quoting, **kwargs)
   1016     def _format_native_types(self, na_rep='', quoting=None, **kwargs):
   1017         """ actually format my specific types """
-> 1018         from pandas.io.formats.format import IntervalArrayFormatter
   1019         return IntervalArrayFormatter(values=self,
   1020                                       na_rep=na_rep,

ImportError: cannot import name 'IntervalArrayFormatter'

What is in this PR "fixes" the immediate error, but, I see a difference with what was before:

On 0.23.4:

In [2]: i = pd.io.formats.format.IntervalArrayFormatter(pd.interval_range(1, 5))

In [3]: i.get_result()
Out[3]: ['(1, 2]', '(2, 3]', '(3, 4]', '(4, 5]']

On master:

In [22]: i = pd.io.formats.format.ExtensionArrayFormatter(pd.interval_range(1, 5))                                                                                              

In [23]: i.get_result()                                                                                                                                                         
Out[23]: [' (1, 2]', ' (2, 3]', ' (3, 4]', ' (4, 5]']

So there is now an extra space. So which means also the DataFrame repr starts with a space.

(still need to add tests, and it might this also breaks existing tests due to the whitespace change)

@pep8speaks
Copy link

Hello @jorisvandenbossche! Thanks for submitting the PR.

@jorisvandenbossche jorisvandenbossche added Bug Output-Formatting __repr__ of pandas objects, to_string Blocker Blocking issue or pull request for an upcoming release Interval Interval data type labels Dec 6, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Dec 6, 2018
@codecov
Copy link

codecov bot commented Dec 6, 2018

Codecov Report

Merging #24134 into master will not change coverage.
The diff coverage is 0%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24134   +/-   ##
=======================================
  Coverage    92.2%    92.2%           
=======================================
  Files         162      162           
  Lines       51701    51701           
=======================================
  Hits        47671    47671           
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.6% <0%> (ø) ⬆️
#single 43.02% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 94.73% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03134cb...ec88f66. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 6, 2018

Codecov Report

Merging #24134 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24134      +/-   ##
==========================================
+ Coverage    92.2%   92.22%   +0.01%     
==========================================
  Files         162      162              
  Lines       51700    51769      +69     
==========================================
+ Hits        47669    47742      +73     
+ Misses       4031     4027       -4
Flag Coverage Δ
#multiple 90.62% <100%> (+0.01%) ⬆️
#single 43.01% <70%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/format.py 97.97% <100%> (+0.01%) ⬆️
pandas/core/indexes/interval.py 95.26% <100%> (+0.52%) ⬆️
pandas/core/arrays/categorical.py 95.3% <0%> (-0.1%) ⬇️
pandas/core/frame.py 96.91% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.27% <0%> (ø) ⬆️
pandas/core/indexes/category.py 97.9% <0%> (ø) ⬆️
pandas/core/arrays/datetimes.py 98.56% <0%> (ø) ⬆️
pandas/core/generic.py 96.65% <0%> (ø) ⬆️
pandas/io/pytables.py 92.31% <0%> (ø) ⬆️
pandas/core/internals/managers.py 95.93% <0%> (+0.01%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0b9eea...f95ae76. Read the comment docs.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 8, 2018

I'll look into this later today.

It seems that we're now hitting the else in

for i, v in enumerate(vals):
if not is_float_type[i] and leading_space:
fmt_values.append(u' {v}'.format(v=_format(v)))
elif is_float_type[i]:
fmt_values.append(float_format(v))
else:
fmt_values.append(u' {v}'.format(v=_format(v)))
.

@TomAugspurger
Copy link
Contributor

Updated to pass through a leading_space argument and tests for the formatters that we removed.

from pandas.io.formats.format import ExtensionArrayFormatter
return ExtensionArrayFormatter(values=self,
na_rep=na_rep,
justify='all',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this new arg? just change the output tests, which are incorrect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 0.23.4, we didn't have the leading space for indexes.

In [2]: df = pd.Series(1, index=pd.IntervalIndex.from_breaks([1, 2, 3, 4])).to_frame()

In [3]: df
Out[3]:
        0
(1, 2]  1
(2, 3]  1
(3, 4]  1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn’t this the same issue you recently adjusted for DTi? this keywords just promote inconsistency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't think so.

AFAICT, this is happing because IntervalIndex and Series[Interval] are now both using GenericArrayFormatter to format the values. Series need a leading space, but indexes don't. So I think things should be more consistent. If you want I can remove the keyword and go back to the old implementation which just did the formatting on its own, but I suspect you don't want that :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i guess

@@ -936,7 +941,13 @@ def _format(x):
elif is_float_type[i]:
fmt_values.append(float_format(v))
else:
fmt_values.append(u' {v}'.format(v=_format(v)))
if leading_space is False:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so hacky, we should not be doing this

@@ -927,7 +930,9 @@ def _format(x):
vals = vals.values

is_float_type = lib.map_infer(vals, is_float) & notna(vals)
leading_space = is_float_type.any()
leading_space = self.leading_space
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document when this is set.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@jorisvandenbossche
Copy link
Member Author

this is happing because IntervalIndex and Series[Interval] are now both using GenericArrayFormatter to format the values. Series need a leading space, but indexes don't.

I don't really know the formatting code, but would it make sense to move this "adding a leading space or not" to the code that actually combines the list of strings into the actual repr? As it is only when concatting them with ',' in the middle for the Index repr that you need those spaces?
(just wondering, not for this PR, but might be an idea to clean this up)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 11, 2018 via email

@jreback
Copy link
Contributor

jreback commented Dec 13, 2018

ok, merging this though @TomAugspurger can you open an issue on #24134 (comment) which sounds good.

@jreback jreback merged commit 33ca356 into pandas-dev:master Dec 13, 2018
@jorisvandenbossche jorisvandenbossche deleted the interval-df-repr branch December 16, 2018 07:54
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Bug Interval Interval data type Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants