
Index repr changes to make them consistent #9901

Merged
merged 6 commits into from
May 9, 2015


jreback
Contributor

@jreback jreback commented Apr 14, 2015

This PR makes the repr of all Index types consistent.
closes #6482
closes #6295
replaces #9897

Previous Behavior

In [1]: pd.get_option('max_seq_items')
Out[1]: 100

In [2]: pd.Index(range(4),name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104),name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[4]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[5]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New Behavior


# this is here just to get a consistent display; it will normally get the value from the console width
In [1]:   pd.set_option('display.width',100)

In [2]:    pd.Index(range(4),name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64', name=u'foo')

In [3]:    pd.Index(range(25),name='foo')
Out[3]: 
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
            24],
           dtype='int64', name=u'foo')

In [4]:    pd.Index(range(104),name='foo')
Out[4]: 
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9, 
            ...
             94,  95,  96,  97,  98,  99, 100, 101, 102, 103],
           dtype='int64', name=u'foo', length=104)

In [5]:    pd.Index(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref', 'a_bit_a_longer_one']*2)
Out[5]: 
Index([u'datetime', u'sA', u'sB', u'sC', u'flow', u'error', u'temp', u'ref', u'a_bit_a_longer_one',
       u'datetime', u'sA', u'sB', u'sC', u'flow', u'error', u'temp', u'ref',
       u'a_bit_a_longer_one'],
      dtype='object')

In [6]:    pd.CategoricalIndex(['a','bb','ccc','dddd'],ordered=True,name='foobar')
Out[6]: CategoricalIndex([u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [7]:    pd.CategoricalIndex(['a','bb','ccc','dddd']*10,ordered=True,name='foobar')
Out[7]: 
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc',
                  u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb',
                  u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
                  u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [8]:    pd.CategoricalIndex(['a','bb','ccc','dddd']*100,ordered=True,name='foobar')
Out[8]: 
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', 
                  ...
                  u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category', length=400)

In [9]:    pd.CategoricalIndex(np.arange(1000),ordered=True,name='foobar')
Out[9]: 
CategoricalIndex([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9, 
                  ...
                  990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
                 categories=[0, 1, 2, 3, 4, 5, 6, 7, ...], ordered=True, name=u'foobar', dtype='category', length=1000)
In [10]:    pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[10]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [11]:    pd.date_range('20130101',periods=25,name='foo',tz='US/Eastern')
Out[11]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               '2013-01-11 00:00:00-05:00', '2013-01-12 00:00:00-05:00',
               '2013-01-13 00:00:00-05:00', '2013-01-14 00:00:00-05:00',
               '2013-01-15 00:00:00-05:00', '2013-01-16 00:00:00-05:00',
               '2013-01-17 00:00:00-05:00', '2013-01-18 00:00:00-05:00',
               '2013-01-19 00:00:00-05:00', '2013-01-20 00:00:00-05:00',
               '2013-01-21 00:00:00-05:00', '2013-01-22 00:00:00-05:00',
               '2013-01-23 00:00:00-05:00', '2013-01-24 00:00:00-05:00',
               '2013-01-25 00:00:00-05:00'],
              dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [12]:    pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[12]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00', 
               ...
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')

Note that MultiIndex reprs are multi-line and do not truncate sequences (e.g. of labels); this is consistent with previous versions (though it is easy to change).

In [1]: MultiIndex.from_product([list('abcdefg'),range(10)],names=['first','second'])
Out[1]: 
MultiIndex(levels=[[u'a', u'b', u'c', u'd', u'e', u'f', u'g'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=[u'first', u'second'])

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Output-Formatting __repr__ of pandas objects, to_string labels Apr 14, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 14, 2015
@jreback
Contributor Author

jreback commented Apr 14, 2015

@shoyer
Member

shoyer commented Apr 14, 2015

Looks like a great idea to me! Would be nice to squeeze name into repr for datetimeindex, too.

@jreback jreback force-pushed the repr-name branch 2 times, most recently from 5a79e96 to 9ff163e Compare April 15, 2015 00:19
@jorisvandenbossche
Member

I think adding the name is certainly OK!

However, I don't know if I like the stacking and indenting under each other. For short indices, this makes your output in the console a lot longer.

If we change the Index repr in such a way, I would do a more thorough clean-up at the same time, and also look at whether we can/want to make the DatetimeIndex more consistent (and I would then do this for 0.17 and not in a minor release).

@jorisvandenbossche
Member

Ah, I see that you updated the initial message :-) (my response was based on the mail I got)

So you also reformatted the datetime-like indices. I think that is a good idea, but to repeat above: maybe we should leave this for 0.17? And I would also ping the mailing list.

And maybe we should also think about if we want a difference between repr and str.

Small remark on the actual output: for DatetimeIndex, I think the dates should also be quoted so that the repr can be passed to eval (as it can for TimedeltaIndex and object Index).

@shoyer
Member

shoyer commented Apr 15, 2015

I do agree with @jorisvandenbossche that it's nicer to have the output on one line rather than spread over many.

As for larger unification of index repr, this is also a great idea! Given that this is unlikely to break any existing code, I think it would probably be OK in a point release (especially given how loosely pandas does semantic versioning already -- people already expect new features and changes in point releases).

My vote would be to include more than the first and last items in string output for all indexes. Perhaps the first 30 and last 30 items, like Series? I'd be happy even with just 5/5. This is actually a bit of a pet-peeve of mine with the old DatetimeIndex repr -- I would often end up converting the index to a series just to see more than 2 values.

@jreback
Contributor Author

jreback commented Apr 15, 2015

ok, this was pretty trivial (and now the short format works for everything).

max_seq_items is for this purpose, but I think we need to make it much smaller (its default is 100 now), or maybe adjust it a bit, e.g. for integers it's ok, but for datetimes it's way too big.

so this will print the short format if it's a small sequence, and the long format otherwise (trivial to make it either one)

In [1]: pd.set_option('max_seq_items',8)

In [2]: pd.date_range('20130101',freq='s',periods=2,tz='US/Eastern',name='foo')
Out[2]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-01 00:00:01-05:00'], name=u'foo', dtype='datetime64[ns]', length=2, freq='S', tz='US/Eastern')

In [3]: pd.date_range('20130101',freq='s',periods=4,tz='US/Eastern',name='foo')
Out[3]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-01 00:00:01-05:00', '2013-01-01 00:00:02-05:00', '2013-01-01 00:00:03-05:00'], name=u'foo', dtype='datetime64[ns]', length=4, freq='S', tz='US/Eastern')

In [4]: pd.date_range('20130101',freq='s',periods=6,tz='US/Eastern',name='foo')
Out[4]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-01 00:00:01-05:00', '2013-01-01 00:00:02-05:00', '2013-01-01 00:00:03-05:00', '2013-01-01 00:00:04-05:00', '2013-01-01 00:00:05-05:00'], name=u'foo', dtype='datetime64[ns]', length=6, freq='S', tz='US/Eastern')

In [5]: pd.date_range('20130101',freq='s',periods=8,tz='US/Eastern',name='foo')
Out[5]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-01 00:00:01-05:00', '2013-01-01 00:00:02-05:00', '2013-01-01 00:00:03-05:00', '2013-01-01 00:00:04-05:00', '2013-01-01 00:00:05-05:00', '2013-01-01 00:00:06-05:00', '2013-01-01 00:00:07-05:00'], name=u'foo', dtype='datetime64[ns]', length=8, freq='S', tz='US/Eastern')

In [6]: pd.date_range('20130101',freq='s',periods=10,tz='US/Eastern',name='foo')
Out[6]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-01 00:00:01-05:00', ..., '2013-01-01 00:00:08-05:00', '2013-01-01 00:00:09-05:00'],
              name=u'foo',
              dtype='datetime64[ns]',
              length=10,
              freq='S',
              tz='US/Eastern')
In [9]: Index(range(8),name='foo')
Out[9]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7], name=u'foo', dtype='int64')

In [10]: Index(range(10),name='foo')
Out[10]: 
Int64Index([0, 1, ..., 8, 9],
           name=u'foo',
           dtype='int64')

This is not really much of a change per se: you normally don't print indexes themselves that often, and it's just a display fix (and really a bug fix at that).

@jreback
Contributor Author

jreback commented Apr 15, 2015

starting to like these (I changed it to always print on the same line). MultiIndex (and CategoricalIndex) are the odd ones out here, but they have quite a bit more info to communicate

In [2]: pd.set_option('max_seq_items',8)

In [3]: pd.date_range('20130101',periods=80)
Out[3]: DatetimeIndex(['2013-01-01', '2013-01-02', ..., '2013-03-20', '2013-03-21'], dtype='datetime64[ns]', length=80, freq='D', tz=None)

In [4]: Index(range(8))
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

In [5]: Index(range(10))
Out[5]: Int64Index([0, 1, ..., 8, 9], dtype='int64')

In [6]: Index(range(10),name='foo')
Out[6]: Int64Index([0, 1, ..., 8, 9], name=u'foo', dtype='int64')

@jreback
Contributor Author

jreback commented Apr 16, 2015

ok, updated the top of the PR with new/previous behavior.

This unifies all Index repr (including MultiIndex, though that is unchanged and multi-line). I switched back to a single line, with the display.max_seq_items controlling if it does a short or long-format. All datetimelike are quoted (so actually a short-repr works, a long-one won't work).

The number to display in the short-repr is 2 on the head and 2 on the tail and is hard coded, but I suppose it could be 'computed', e.g. in theory you could show more integers than you can datetimes, but I will leave that for later.
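
A minimal sketch of the short-repr truncation described above, written in plain Python rather than taken from the pandas source. The names `short_items`, `short_repr` and `n_edge` are hypothetical; only the 2-head/2-tail rule and the `max_seq_items` threshold come from this thread. The "longer than `max_seq_items`" comparison is an assumption inferred from the `Index(range(8))` example with `max_seq_items` set to 8.

```python
def short_items(values, max_seq_items=8, n_edge=2):
    """Format each value; keep only n_edge items per side if the sequence is too long."""
    items = [repr(v) for v in values]
    if len(items) <= max_seq_items:  # assumption: truncate only when strictly longer
        return items
    return items[:n_edge] + ["..."] + items[-n_edge:]


def short_repr(klass, values, max_seq_items=8, **attrs):
    """Build a one-line 'Klass([...], key=value)' string (illustrative only)."""
    body = ", ".join(short_items(values, max_seq_items))
    tail = "".join(", %s=%r" % (key, value) for key, value in attrs.items())
    return "%s([%s]%s)" % (klass, body, tail)


print(short_repr("Int64Index", range(8), dtype="int64"))
print(short_repr("Int64Index", range(10), dtype="int64"))
```

Making `n_edge` a parameter is what a 'computed' number of items would need: a caller could pass a larger value for narrow integers and a smaller one for wide datetimes.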

@shoyer
Member

shoyer commented Apr 16, 2015

OK, I would probably default to showing 3 items rather than 2, but this is a step in the right direction.

As long as we're changing this, let's make sure that all index types can be run through eval, at least if they aren't truncated for length. So I think that means the length field needs to go.
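
To illustrate why a display-only `length` field gets in the way of eval, here is a small sketch using a toy class. `ToyIndex` is a hypothetical stand-in, not the pandas `Index`; the point is only that a repr round-trips through `eval` when every field it prints is a real constructor argument.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index; not the pandas class."""

    def __init__(self, data, dtype="int64", name=None):
        self.data = list(data)
        self.dtype = dtype
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, ToyIndex)
                and (self.data, self.dtype, self.name)
                == (other.data, other.dtype, other.name))

    def __repr__(self):
        text = "ToyIndex(%r, dtype=%r" % (self.data, self.dtype)
        if self.name is not None:
            text += ", name=%r" % self.name
        return text + ")"


idx = ToyIndex(range(4), name="foo")
round_tripped = eval(repr(idx))  # works: every printed field is a keyword argument
assert round_tripped == idx

# A display-only field such as length= is not accepted by the constructor,
# so a repr that includes it cannot be evaluated.
try:
    eval("ToyIndex([0, 1, 2, 3], dtype='int64', length=4)")
except TypeError as exc:
    length_error = exc
else:
    length_error = None
```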

@jorisvandenbossche
Member

I would vote for a single line. The terminal wraps it anyway, and otherwise you can get strange effects like:

In [3]: pd.Index(range(21))
Out[3]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
, 20],
           dtype='int64')

But we should check how this looks in a notebook.

Some other feedback:

  • not sure that the 'name' should go before dtype. It seems more logical to put optional ones as last.
  • agree about the length field. I know it is there in the current repr, but this is inconsistent with the others, and it is also not a real kwarg.
  • it feels strange to me that if you approach max_seq_items you can have a very long output, and then at once if you reach it, only a very short output. It seems more logical that the output is truncated at max_seq_items (eg if max_seq_items is set at 30, show the first 15 and last 15). But that can be an argument to change the default of max_seq_items to a much lower one (or to make this a new option specific for index)
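
The jump described in the last point can be shown by counting how many items each policy would display. This is a sketch under stated assumptions: both function names are hypothetical, the fixed-edge policy uses the hard-coded 2 head and 2 tail items mentioned earlier, and the alternative splits `max_seq_items` evenly between head and tail as suggested above.

```python
def shown_fixed_edges(n, max_seq_items, n_edge=2):
    """Items displayed when truncation keeps a fixed number on each side."""
    return n if n <= max_seq_items else 2 * n_edge


def shown_split_limit(n, max_seq_items):
    """Items displayed when truncation keeps max_seq_items // 2 on each side."""
    return n if n <= max_seq_items else 2 * (max_seq_items // 2)


for n in (29, 30, 31):
    print(n, shown_fixed_edges(n, 30), shown_split_limit(n, 30))
# 29 29 29
# 30 30 30
# 31 4 30
```

With a fixed number of edge items, one extra element drops the output from 30 items to 4; splitting the limit keeps the displayed count steady at the threshold.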

@jorisvandenbossche
Member

By the way, as this directly touches the output users see (and that can be delicate), I would certainly also bring this to the mailing list. Plus, just having more eyes looking at this than the three of us would be good. I can do it tomorrow, if it is not yet done by then.

@jreback
Contributor Author

jreback commented Apr 16, 2015

we could have an argument, say max_seq_visible, to control the head/tail ones. But honestly I would do that later and see if it's really needed.

I just put name/dtype (and length only shows up if you are truncating) first. but could easily change that order (you want dtype then name, ok)

If you want to put it on the mailing list, that's ok by me. I don't really think this is that big of a deal. The basic formats are pretty much unchanged and it's just cleaning up things. (the current PR is for 1 line btw).

@jreback
Contributor Author

jreback commented Apr 16, 2015

top-section is updated

@jorisvandenbossche jorisvandenbossche changed the title Index to show name Index repr changes to make them consistent Apr 17, 2015
@jreback jreback force-pushed the repr-name branch 2 times, most recently from 241d2bf to 66f5e12 Compare April 20, 2015 12:05
@jreback
Contributor Author

jreback commented Apr 20, 2015

I updated the top of the PR to show the repr of MultiIndex/CategoricalIndex (unchanged for MultiIndex from previous versions)

@jorisvandenbossche
Member

Shouldn't we keep the new CategoricalIndex consistent with the other Index types instead of with MultiIndex?

@jreback
Contributor Author

jreback commented Apr 20, 2015

@jorisvandenbossche you mean 1 line?

I think things get lost, e.g. ordered/name, as you have both codes/categories. Now codes themselves are not really necessary (and similarly for MultiIndex), e.g. in my example a MultiIndex is really just

In [7]: MultiIndex.from_product([list('abcdefg'),range(10)],names=['first','second']).levels[0]
Out[7]: Index([u'a', u'b', u'c', u'd', u'e', u'f', u'g'], dtype='object', name=u'first')

In [8]: MultiIndex.from_product([list('abcdefg'),range(10)],names=['first','second']).levels[1]
Out[8]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name=u'second')

Which IMHO is more informative, but it would be by-definition multi-line repr

and CategoricalIndex is closer to MultiIndex I think.

@jreback
Contributor Author

jreback commented Apr 28, 2015

In [12]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[12]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', ...,
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')

@jorisvandenbossche your proposal looks like adding a line feed after the 'data' portion, then putting all the rest on the next line, yes?

@jreback
Contributor Author

jreback commented May 8, 2015

going once, going twice........

@shoyer
Member

shoyer commented May 8, 2015

OK I'll test this out right now...

@shoyer
Member

shoyer commented May 8, 2015

I'm not a big fan of the way truncating looks with multiple lines right now. For example:

In [29]: pd.Index(['asd', 'zx', 'a'] * 10000)
Out[29]:
Index([u'asd',  u'zx',   u'a', u'asd',  u'zx',   u'a', u'asd',  u'zx',   u'a',
       u'asd',
       ...
         u'a', u'asd',  u'zx',   u'a', u'asd',  u'zx',   u'a', u'asd',  u'zx',
         u'a'],
      dtype='object', length=30000)

In [40]: pd.Index(np.arange(2000) * 100)
Out[40]:
Int64Index([     0,    100,    200,    300,    400,    500,    600,    700,
               800,    900,
            ...
            199000, 199100, 199200, 199300, 199400, 199500, 199600, 199700,
            199800, 199900],
           dtype='int64', length=2000)

My vote would probably be for showing not quite as many elements in the truncated version (3?) and putting everything on one line (at least the array elements). Closer to one of the earlier versions:

In [3]: pd.date_range('20130101',periods=80)
Out[3]: DatetimeIndex(['2013-01-01', '2013-01-02', ..., '2013-03-20', '2013-03-21'], dtype='datetime64[ns]', length=80, freq='D', tz=None)

In [4]: Index(range(8))
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

In [5]: Index(range(10))
Out[5]: Int64Index([0, 1, ..., 8, 9], dtype='int64')

In [6]: Index(range(10),name='foo')
Out[6]: Int64Index([0, 1, ..., 8, 9], name=u'foo', dtype='int64')

I know this is arguably less consistent with the logic for displaying series and frames, but I think it looks better.

@shoyer
Member

shoyer commented May 8, 2015

But to be clear, I would not hold up the release for my objections. Any of these changes are better than what we have now.

@jorisvandenbossche
Member

  • the truncating: it looks off in this case because the lengths of the strings used are such that there is only one element on the second line. I think there are some options:

    • we could detect that and fill the rest of the line (but we don't know the length of the following elements in advance, so that may break the alignment)
    • we could also use not a fixed number of values, but just one full row, the dots, a full row with the last elements. So for datetimes it is then maybe the first and last 2 elements, for integers it can be more?
    • Or we could also decide to return max_seq_len number of items, so max_seq_len/2 first and last. Then the line in the middle will not be that prominent
    • or the suggestion of @shoyer (this is also how numpy does it)
  • Something else I realized: this aligning is not really ideal for column names. Example:

    In [10]: pd.Index(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref',
    'and_a_longer_one'])
    Out[10]:
    Index([        u'datetime',               u'sA',               u'sB',
                     u'sC',             u'flow',            u'error',
                   u'temp',              u'ref', u'and_a_longer_one'],
      dtype='object')
    

    And apparently, numpy does not align arrays of object dtype (which has some logic, as these have a higher chance of being of different length). Maybe we should follow that?

    In [12]: pd.Index(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref',
    'and_a_longer_one']*3).values
    Out[12]:
    array(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref',
           'and_a_longer_one', 'datetime', 'sA', 'sB', 'sC', 'flow', 'error',
           'temp', 'ref', 'and_a_longer_one', 'datetime', 'sA', 'sB', 'sC',
           'flow', 'error', 'temp', 'ref', 'and_a_longer_one'], dtype=object)
    

@jreback
Contributor Author

jreback commented May 8, 2015

@shoyer @jorisvandenbossche
slight modification; I print now 2 head and 2 tail on the truncated display; each line is justified itself (but maxlen are the same). Doesn't affect fixed width types like datetimelikes, but integers look better, mixed width strings are somewhat better I think

In [1]:    pd.get_option('max_seq_items')
Out[1]: 100

In [2]:    pd.Index(range(4),name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64', name=u'foo')

In [3]:    pd.Index(range(25),name='foo')
Out[3]: 
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24],
           dtype='int64', name=u'foo')

In [4]:    pd.Index(range(104),name='foo')
Out[4]: 
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
            10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
            ...
            84, 85, 86, 87, 88, 89, 90, 91, 92, 93,
            94, 95, 96, 97, 98, 99, 100, 101, 102, 103],
           dtype='int64', name=u'foo', length=104)

In [5]:    pd.CategoricalIndex(['a','bb','ccc','dddd'],ordered=True,name='foobar')
Out[5]: CategoricalIndex([u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [6]:    pd.CategoricalIndex(['a','bb','ccc','dddd']*10,ordered=True,name='foobar')
Out[6]: 
CategoricalIndex([   u'a',   u'bb',  u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',  u'ccc', u'dddd',
                     u'a',   u'bb',  u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',  u'ccc', u'dddd',
                     u'a',   u'bb',  u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',  u'ccc', u'dddd',
                     u'a',   u'bb',  u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')

In [7]:    pd.CategoricalIndex(['a','bb','ccc','dddd']*100,ordered=True,name='foobar')
Out[7]: 
CategoricalIndex([   u'a',   u'bb',  u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',  u'ccc', u'dddd',
                     u'a',   u'bb',  u'ccc', u'dddd',
                  ...
                     u'a',   u'bb',  u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',
                   u'ccc', u'dddd',    u'a',   u'bb',  u'ccc', u'dddd',
                     u'a',   u'bb',  u'ccc', u'dddd'],
                 categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category', length=400)

In [8]:    pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[8]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [9]:    pd.date_range('20130101',periods=25,name='foo',tz='US/Eastern')
Out[9]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               '2013-01-11 00:00:00-05:00', '2013-01-12 00:00:00-05:00',
               '2013-01-13 00:00:00-05:00', '2013-01-14 00:00:00-05:00',
               '2013-01-15 00:00:00-05:00', '2013-01-16 00:00:00-05:00',
               '2013-01-17 00:00:00-05:00', '2013-01-18 00:00:00-05:00',
               '2013-01-19 00:00:00-05:00', '2013-01-20 00:00:00-05:00',
               '2013-01-21 00:00:00-05:00', '2013-01-22 00:00:00-05:00',
               '2013-01-23 00:00:00-05:00', '2013-01-24 00:00:00-05:00',
               '2013-01-25 00:00:00-05:00'],
              dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [10]:    pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[10]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               '2013-01-11 00:00:00-05:00', '2013-01-12 00:00:00-05:00',
               '2013-01-13 00:00:00-05:00', '2013-01-14 00:00:00-05:00',
               '2013-01-15 00:00:00-05:00', '2013-01-16 00:00:00-05:00',
               '2013-01-17 00:00:00-05:00', '2013-01-18 00:00:00-05:00',
               '2013-01-19 00:00:00-05:00', '2013-01-20 00:00:00-05:00',
               ...
               '2013-03-26 00:00:00-04:00', '2013-03-27 00:00:00-04:00',
               '2013-03-28 00:00:00-04:00', '2013-03-29 00:00:00-04:00',
               '2013-03-30 00:00:00-04:00', '2013-03-31 00:00:00-04:00',
               '2013-04-01 00:00:00-04:00', '2013-04-02 00:00:00-04:00',
               '2013-04-03 00:00:00-04:00', '2013-04-04 00:00:00-04:00',
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')

@jreback jreback force-pushed the repr-name branch 3 times, most recently from b784464 to 2746c13 Compare May 8, 2015 20:15
@jreback
Contributor Author

jreback commented May 8, 2015

ok, updated on the top of the PR. here are basically the rules:

  • 0, 1, 2 values on a single line, no justification
  • if we are < max_seq_items, wrap the output to make multi-line; no justification (datetimelikes are 'naturally justified' though; they just line up)
  • if we are >= max_seq_items, truncate. Justify non (string,categorical), print head/tail lines, these will wrap (e.g. have more than one line) for things that are datetimelike, but for strings, integers, they will not wrap at all (e.g. only 1 line)

The last point is a slight deviation as it then doesn't justify, say, integers for a categorical (but it makes the code a bit simpler, and I thought it looked a bit better, though my categorical example of 1000 categories makes it look worse...oh well).
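
The three rules above can be sketched as a small dispatch function. This is an illustration only: `choose_layout` and its return labels are hypothetical names, not part of the PR, and checking the 0/1/2-value case before the truncation case is an assumption, since the rules do not say which takes precedence when `max_seq_items` is very small.

```python
def choose_layout(n_items, max_seq_items=100):
    """Pick a display mode for an index of n_items values (illustrative labels)."""
    if n_items <= 2:
        return "single-line"      # 0, 1 or 2 values: one line, no justification
    if n_items < max_seq_items:
        return "wrap-to-width"    # full output, wrapped over lines as needed
    return "truncated"            # head lines, "...", tail lines, plus length=


for n in (2, 25, 100, 104):
    print(n, choose_layout(n))
```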

hsperr and others added 3 commits May 9, 2015 11:50
move tests to generically tests for index
generify __unicode__ for Index

adjust index display to max_seq_items
increase limits for max_seq_items & printing for Index

add extended repr for datetimelike indexes

fix tseries/test_base for repr

adjust docs for repr-name

use new format_data on all Index types
@jorisvandenbossche
Member

@jreback While on the train I made an alternative implementation to allow an unequal number of elements on one row, based on this branch (jorisvandenbossche@8ff7998). It is partly based on the array2string function of numpy.

The advantage is that the number of elements on all rows don't have to be equal, it just fills the row up to display width. As a consequence, this (all 3 elements on each row because of one larger string):

In [8]: pd.Index(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref', 'a_bit_a_longer_one']*2)
Index([u'datetime', u'sA', u'sB',
       u'sC', u'flow', u'error',
       u'temp', u'ref', u'and_a_longer_one',
       u'datetime', u'sA', u'sB',
       u'sC', u'flow', u'error',
       u'temp', u'ref', u'and_a_longer_one'],
      dtype='object')

now looks as:

In [5]: pd.Index(['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref','a_bit_a_longer_one']*2)
Out[5]:
Index([u'datetime', u'sA', u'sB', u'sC', u'flow', u'error', u'temp', u'ref',
       u'a_bit_a_longer_one', u'datetime', u'sA', u'sB', u'sC', u'flow',
       u'error', u'temp', u'ref', u'a_bit_a_longer_one'],
      dtype='object')

which looks a bit better in my opinion (but you can give more extreme examples where the difference is larger).

For the rest, everything looks exactly the same as before (the latest update of this PR). The only difference is that the justifying or not (depending on string/categorical or not) is now the same for truncated and non-truncated (as I do this with the same code), while in the latest update you made it not justify for non-truncated ones (but I think doing it the same is a bit more consistent).
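
A rough sketch of the row-filling idea described above, in plain Python. It is not the code from the linked commit or from numpy's `array2string`; `fill_rows` is a hypothetical helper, and the display width of 80 is an assumption. Each row is packed greedily until the next item would exceed the width, so rows may hold different numbers of elements.

```python
def fill_rows(items, prefix, width=80):
    """Greedily pack comma-separated items into rows no wider than `width`."""
    if not items:
        return prefix + "],"
    indent = " " * len(prefix)        # continuation rows align under the first item
    lines, current, row_empty = [], prefix, True
    last = len(items) - 1
    for i, item in enumerate(items):
        piece = item + ("," if i < last else "],")
        if row_empty:
            current += piece          # always place at least one item per row
            row_empty = False
        elif len(current) + 1 + len(piece) <= width:
            current += " " + piece
        else:
            lines.append(current)     # row is full: start a new one
            current = indent + piece
    lines.append(current)
    return "\n".join(lines)


names = ['datetime', 'sA', 'sB', 'sC', 'flow', 'error', 'temp', 'ref',
         'a_bit_a_longer_one'] * 2
rows = fill_rows(["u'%s'" % name for name in names], "Index([")
print(rows)
print("      dtype='object')")
```

Because no item is padded to the width of the longest one, a single long string only costs space on the row where it appears.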

@jreback
Contributor Author

jreback commented May 9, 2015

ok let me incorporate this and I'll update the top section

Conflicts:
	pandas/tseries/base.py

use new format_data

updates

Fix detection of good width

more fixes

Change [

Conflicts:
	pandas/core/index.py

more fixes

revsised according to comments
@jreback
Contributor Author

jreback commented May 9, 2015

@jorisvandenbossche @shoyer ok, updated with joris' commit. I had to slightly change the code around to avoid justification in certain cases, e.g. here you don't want to justify, otherwise the NaT would have a very long string (basically if it's going to be 1 line and it's non-truncated, then you don't justify)

In [13]: PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', 'NaT'], freq='H')
Out[13]: PeriodIndex(['2011-01-01 09:00', '2012-02-01 10:00', 'NaT'], dtype='int64', freq='H')
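
The effect of skipping justification can be seen in a small sketch. `format_items` and its two boolean flags are hypothetical names chosen for illustration; the rule they encode (no padding when the output is a single, non-truncated line) is the one stated above.

```python
def format_items(items, truncated, one_line):
    """Right-justify items to a common width, except for a short one-line repr."""
    if not items or (one_line and not truncated):
        return list(items)
    width = max(len(item) for item in items)
    return [item.rjust(width) for item in items]


periods = ["'2011-01-01 09:00'", "'2012-02-01 10:00'", "'NaT'"]

# One line, not truncated: items are left as they are.
print(", ".join(format_items(periods, truncated=False, one_line=True)))

# Justified: 'NaT' is padded to the width of the full timestamps.
print(", ".join(format_items(periods, truncated=False, one_line=False)))
```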

@jorisvandenbossche
Member

Ah, yes, the case of 1 line not needing justification, I forgot, thanks!

There are some more details we could improve, but I think this is looking good for now!

For later (follow-up PR), I was thinking of:

  • also splitting the attributes part over multiple lines if this is very long

  • for truncated output: it is now hard-coded at 10 elements, but this could maybe also be 1 line? (and possibly with a minimum number of elements for longer strings) Because now, small integers don't fill the full line with 10 elements, and e.g. for floats there is one element on the second line:

    In [7]: pd.Index(np.arange(400.))
    Out[7]:
    Float64Index([  0.0,   1.0,   2.0,   3.0,   4.0,   5.0,   6.0,   7.0,   8.0,
                    9.0,
                  ...
                  390.0, 391.0, 392.0, 393.0, 394.0, 395.0, 396.0, 397.0, 398.0,
                  399.0],
                 dtype='float64', length=400)
    

    This could eg easily be restricted to two times one line.

@jreback
Contributor Author

jreback commented May 9, 2015

@jorisvandenbossche yes, I was trying to avoid the wrap-around with a small number of elements; in fact that's why the example used 100 for display.width; if it's too short you get that kind of effect.

ok, bombs away. And I will add a follow-up issue.

jreback added a commit that referenced this pull request May 9, 2015
Index repr changes to make them consistent
@jreback jreback merged commit d5fdbc6 into pandas-dev:master May 9, 2015
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Output-Formatting __repr__ of pandas objects, to_string
Development

Successfully merging this pull request may close these issues.

index repr should show name(s) attribute ENH: pprint index/columns?
4 participants