pivot & unstack with nan in the index #9061

behzadnouri · 2014-12-12T02:10:36Z

closes #7466
on branch:

>>> df
     a   b   c
0   R1  C1  10
1   R2  C2  15
2  NaN  C3  17
3   R4  C4  20
>>> df.pivot('a', 'b', 'c').fillna('.')
b    C1  C2  C3  C4
a
NaN   .   .  17   .
R1   10   .   .   .
R2    .  15   .   .
R4    .   .   .  20
>>> df.pivot('b', 'a', 'c').fillna('.')
a  NaN  R1  R2  R4
b
C1   .  10   .   .
C2   .   .  15   .
C3  17   .   .   .
C4   .   .   .  20
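
For anyone reproducing the example above on a recent pandas release, here is a minimal sketch. It assumes keyword arguments for `pivot` (newer releases no longer accept the positional form used in the session above) and rebuilds the frame by hand; the variable names `by_a`, `by_b` and `nan_row` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the session above, rebuilt by hand.
df = pd.DataFrame({
    "a": ["R1", "R2", np.nan, "R4"],
    "b": ["C1", "C2", "C3", "C4"],
    "c": [10, 15, 17, 20],
})

# Keyword form of the two pivots shown above.
by_a = df.pivot(index="a", columns="b", values="c")
by_b = df.pivot(index="b", columns="a", values="c")

# The NaN key gets its own row; pick it out with a mask rather than a label.
nan_row = by_a[by_a.index.isna()]

print(by_a)
print(by_b)
```

The position of the NaN row or column in the result may differ between pandas versions, so the sketch locates it with `isna()` instead of relying on order.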

the pivot function relies on unstack, so unstack also needs to work with nans in the index:

>>> df
                4th     5th
1st 2nd 3rd
d   y   67   d.y.67  67.y.d
        39   d.y.39  39.y.d
    w   53   d.w.53  53.w.d
NaN w   72    .w.72  72.w. 
        57    .w.57  57.w. 
    NaN 80    . .80  80. . 
        31    . .31  31. . 
        18    . .18  18. . 
a   z   11   a.z.11  11.z.a
        30   a.z.30  30.z.a
c   z   59   c.z.59  59.z.c
        50   c.z.50  50.z.c
    NaN 62   c. .62  62. .c
e   NaN 59   e. .59  59. .e
        76   e. .76  76. .e
b   x   52   b.x.52  52.x.b
        14   b.x.14  14.x.b
        53   b.x.53  53.x.b
    NaN 60   b. .60  60. .b
        51   b. .51  51. .b

the 4th and 5th columns encode the index values of each row, so that the unstacked frame is easy to verify:

>>> df.unstack(level=1).fillna('-')
            4th                                     5th
2nd         NaN       w       x       y       z     NaN       w       x       y       z
1st 3rd
NaN 18    . .18       -       -       -       -  18. .        -       -       -       -
    31    . .31       -       -       -       -  31. .        -       -       -       -
    57        -   .w.57       -       -       -       -  57.w.        -       -       -
    72        -   .w.72       -       -       -       -  72.w.        -       -       -
    80    . .80       -       -       -       -  80. .        -       -       -       -
a   11        -       -       -       -  a.z.11       -       -       -       -  11.z.a
    30        -       -       -       -  a.z.30       -       -       -       -  30.z.a
b   14        -       -  b.x.14       -       -       -       -  14.x.b       -       -
    51   b. .51       -       -       -       -  51. .b       -       -       -       -
    52        -       -  b.x.52       -       -       -       -  52.x.b       -       -
    53        -       -  b.x.53       -       -       -       -  53.x.b       -       -
    60   b. .60       -       -       -       -  60. .b       -       -       -       -
c   50        -       -       -       -  c.z.50       -       -       -       -  50.z.c
    59        -       -       -       -  c.z.59       -       -       -       -  59.z.c
    62   c. .62       -       -       -       -  62. .c       -       -       -       -
d   39        -       -       -  d.y.39       -       -       -       -  39.y.d       -
    53        -  d.w.53       -       -       -       -  53.w.d       -       -       -
    67        -       -       -  d.y.67       -       -       -       -  67.y.d       -
e   59   e. .59       -       -       -       -  59. .e       -       -       -       -
    76   e. .76       -       -       -       -  76. .e       -       -       -       -

nan values are kept out of the levels and are represented by -1 in the labels:

>>> df.unstack(level=1).index
MultiIndex(levels=[['a', 'b', 'c', 'd', 'e'], [11, 14, 18, 30, 31, 39, 50, 51, 52, 53, 57, 59, 60, 62, 67, 72, 76, 80]],
           labels=[[-1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4], [2, 4, 10, 15, 17, 0, 3, 1, 7, 8, 9, 12, 6, 11, 13, 5, 9, 14, 11, 16]],
           names=['1st', '3rd'])
>>> df.unstack(level=1).columns
MultiIndex(levels=[['4th', '5th'], ['w', 'x', 'y', 'z']],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [-1, 0, 1, 2, 3, -1, 0, 1, 2, 3]],
           names=[None, '2nd'])
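
The levels/labels split shown above can be illustrated with a small pure-Python sketch. This is not the pandas implementation; `factorize_with_sentinel` is a hypothetical helper written only to show the idea that missing keys never enter the levels and are marked with -1 in the labels. (Newer pandas releases call the `labels` attribute `codes`.)

```python
import math


def factorize_with_sentinel(values):
    """Return (levels, codes) for a list of keys.

    levels: sorted distinct non-missing keys.
    codes:  one integer per input value, -1 where the key is missing.
    Hypothetical helper for illustration only.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    levels = sorted({v for v in values if not is_missing(v)})
    position = {v: i for i, v in enumerate(levels)}
    codes = [-1 if is_missing(v) else position[v] for v in values]
    return levels, codes


# A few keys in the style of the '2nd' level above.
second = ["y", "y", "w", "w", "w", float("nan"), float("nan")]
levels, codes = factorize_with_sentinel(second)
print(levels)  # ['w', 'y']
print(codes)   # [1, 1, 0, 0, 0, -1, -1]
```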

behzadnouri · 2014-12-12T16:27:17Z

benchmarks with -r 'unstack' and -r 'pivot' arguments:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
reshape_pivot_time_series                    | 311.2280 | 291.2517 |   1.0686 |
groupby_pivot_table                          |  31.4820 |  28.9483 |   1.0875 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [6715a60] : unstack with nan in the index
Base   [5bb5025] : DOC: initial setup for v0.16.0

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
unstack_sparse_keyspace                      |   2.7093 |   2.8173 |   0.9617 |
reshape_unstack_simple                       |   5.6877 |   5.5610 |   1.0228 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [6715a60] : unstack with nan in the index
Base   [5bb5025] : DOC: initial setup for v0.16.0

shoyer · 2014-12-14T02:29:54Z

This is not an area of the pandas codebase that I'm particularly familiar with, but I took a quick look and generally this looks good to me.

shoyer · 2014-12-19T00:35:32Z

@behzadnouri could you add release notes for this enhancement? A simple example like the one you show in your first post would be nice.

@hayd @TomAugspurger @sinhrks any thoughts?

hayd · 2014-12-19T02:58:17Z

pandas/core/reshape.py

+        # when index includes `nan`, need to lift levels/strides by 1
+        self.lift = 1 if -1 in self.index.labels[self.level] else 0
+
+        self.new_index_levels = list(index.levels)


Perhaps use .copy() here rather than list?
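
A rough sketch of what the `lift` in the quoted diff is for, based only on its comment: when the unstacked level contains -1 labels, every label is shifted up by one so that the missing key gets a slot of its own. The function name `column_slots` and its arguments are hypothetical, not taken from pandas/core/reshape.py.

```python
def column_slots(codes, n_level_values):
    """Map level codes to column positions.

    If any code is -1 (missing key), shift all codes up by one so the
    missing key occupies slot 0 and the real values follow it.
    Hypothetical helper for illustration only.
    """
    lift = 1 if -1 in codes else 0
    width = n_level_values + lift
    slots = [code + lift for code in codes]
    return slots, width


# Codes of the '2nd' level in the unstacked columns shown earlier:
# -1 for NaN, then w, x, y, z.
slots, width = column_slots([-1, 0, 1, 2, 3], 4)
print(slots, width)  # [0, 1, 2, 3, 4] 5
```

Without a missing key the shift is zero and the slots equal the codes.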

hayd · 2014-12-19T03:00:20Z

This looks pretty good to me too (pesky NaN).

behzadnouri · 2014-12-19T12:58:24Z

@shoyer given that #9101 was merged, I first need to refactor some of the logic that is common to the two.

sinhrks · 2014-12-20T01:34:01Z

Yes, looks nice.

behzadnouri · 2014-12-20T19:01:33Z

@shoyer added the release notes. I put this under bug fixes, since it fixes a special case where an existing API breaks, as opposed to adding a new feature.

Since there was some re-factoring with #9101 I reran the benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_pivot_table                          |  29.3674 |  29.2567 |   1.0038 |
reshape_pivot_time_series                    | 300.8887 | 295.7307 |   1.0174 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [06b0431] : unstack with nan in the index
Base   [8eee396] : Merge pull request #9104 from jreback/dti_iteration

BUG: Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
unstack_sparse_keyspace                      |   2.3497 |   2.7726 |   0.8475 |
reshape_unstack_simple                       |   5.3949 |   5.5247 |   0.9765 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [06b0431] : unstack with nan in the index
Base   [8eee396] : Merge pull request #9104 from jreback/dti_iteration

BUG: Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)

jreback · 2014-12-20T19:53:02Z

pandas/core/groupby.py

+    """
+    Given a list of labels at each level, returns a flat array of int64 ids
+    corresponding to unique tuples across the labels. If `retain_lex_rank`,
+    rank of returned ids preserve lexical ranks of labels.


Can you add a Parameters/Returns block?
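
To make the quoted docstring concrete, here is a pure-Python sketch of one way to get flat ids from per-level labels: a mixed-radix encoding, which keeps the ids in the same order as the label tuples. It is an illustration under stated assumptions, not the code in pandas/core/groupby.py: `flat_group_ids` is a hypothetical name, labels are assumed non-negative (a -1 would have to be shifted first), and overflow for very large key spaces is ignored.

```python
def flat_group_ids(labels, shape):
    """Encode each tuple of per-level labels as a single integer.

    labels: list of per-level lists of non-negative ints, all the same length.
    shape:  number of distinct values in each level.
    The last level varies fastest, so ids sort like the label tuples.
    Hypothetical helper for illustration only.
    """
    ids = [0] * len(labels[0])
    stride = 1
    for level_labels, size in zip(reversed(labels), reversed(shape)):
        for i, lab in enumerate(level_labels):
            ids[i] += lab * stride
        stride *= size
    return ids


labels = [[0, 0, 1, 1], [2, 0, 1, 0]]  # tuples (0,2), (0,0), (1,1), (1,0)
shape = (2, 3)
ids = flat_group_ids(labels, shape)
print(ids)  # [2, 0, 4, 3]
```

Sorting the ids gives the same order as sorting the tuples, which is the lexical-rank property the docstring mentions.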


jreback · 2014-12-22T13:11:19Z

@behzadnouri

thanks

What I am referring to above is that these types of structures are easy to create but very tricky to actually index.

In [20]: df.set_index(['1st','2nd','3rd']).unstack(0)          
Out[20]: 
        value                    
1st       NaN   a   b   c   d   e
2nd 3rd                          
NaN 18      7 NaN NaN NaN NaN NaN
    31      6 NaN NaN NaN NaN NaN
    51    NaN NaN  19 NaN NaN NaN
    59    NaN NaN NaN NaN NaN  13
    60    NaN NaN  18 NaN NaN NaN
    62    NaN NaN NaN  12 NaN NaN
    76    NaN NaN NaN NaN NaN  14
    80      5 NaN NaN NaN NaN NaN
w   53    NaN NaN NaN NaN   2 NaN
    57      4 NaN NaN NaN NaN NaN
    72      3 NaN NaN NaN NaN NaN
x   14    NaN NaN  16 NaN NaN NaN
    52    NaN NaN  15 NaN NaN NaN
    53    NaN NaN  17 NaN NaN NaN
y   39    NaN NaN NaN NaN   1 NaN
    67    NaN NaN NaN NaN   0 NaN
z   11    NaN   8 NaN NaN NaN NaN
    30    NaN   9 NaN NaN NaN NaN
    50    NaN NaN NaN  11 NaN NaN
    59    NaN NaN NaN  10 NaN NaN

This fails

In [21]: df.set_index(['1st','2nd','3rd']).unstack(0).loc[nan]
ValueError: cannot use label indexing with a null key

This is ok though

In [22]: df.set_index(['1st','2nd','3rd']).unstack(0).loc[(nan,18)]
Out[22]: 
       1st
value  NaN     7
       a     NaN
       b     NaN
       c     NaN
       d     NaN
       e     NaN
Name: (nan, 18), dtype: float64
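
One way to select such rows without passing NaN as a label is a boolean mask over the level values. This is a sketch, not something proposed in the thread: the column names mirror the session above, but the four-row frame is a made-up subset, and `wide`, `mask` and `selected` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small made-up subset with NaN in both the '1st' and '2nd' keys.
df = pd.DataFrame({
    "1st": [np.nan, np.nan, "b", "b"],
    "2nd": [np.nan, "w", "x", np.nan],
    "3rd": [18, 57, 14, 60],
    "value": [7, 4, 16, 18],
})
wide = df.set_index(["1st", "2nd", "3rd"])["value"].unstack("1st")

# Rows whose '2nd' key is missing, found by mask instead of by label.
mask = wide.index.get_level_values("2nd").isna()
selected = wide[mask]
print(selected)
```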

behzadnouri force-pushed the nan-pivot branch 4 times, most recently from 89d3473 to 11fdd46 Compare December 16, 2014 12:39

behzadnouri mentioned this pull request Dec 16, 2014

BUG: Bug in MultiIndex.has_duplicates when having many levels causes an indexer overflow (GH9075) #9077

Closed

shoyer added the Internals and Missing-data labels Dec 19, 2014

shoyer added this to the 0.16.0 milestone Dec 19, 2014

shoyer added the Enhancement label Dec 19, 2014

hayd reviewed Dec 19, 2014

shoyer added the MultiIndex label Dec 19, 2014

behzadnouri force-pushed the nan-pivot branch from 11fdd46 to 06b0431 Compare December 20, 2014 17:47

jreback reviewed Dec 20, 2014

behzadnouri force-pushed the nan-pivot branch 2 times, most recently from 7a53d97 to 489a04a Compare December 21, 2014 13:02

jreback mentioned this pull request Dec 21, 2014

PERF: use labels to find duplicates in multi-index #9125

Closed

unstack with nan in the index

e3afd5b

behzadnouri force-pushed the nan-pivot branch from 489a04a to e3afd5b Compare December 22, 2014 00:16

jreback added a commit that referenced this pull request Dec 22, 2014

Merge pull request #9061 from behzadnouri/nan-pivot

099a02c

pivot & unstack with nan in the index

jreback merged commit 099a02c into pandas-dev:master Dec 22, 2014

behzadnouri mentioned this pull request Dec 30, 2014

TST: tests for GH5873 #9169

Closed

behzadnouri deleted the nan-pivot branch January 10, 2015 20:21

behzadnouri mentioned this pull request Jan 18, 2015

TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

Merged

jreback mentioned this pull request Jan 25, 2015

BUG/WIP: fix pivot with nan indexes #7481

Closed

2 tasks

behzadnouri mentioned this pull request Feb 3, 2015

BUG: NaN level values in stack() and unstack() #9406

Closed

seth-p mentioned this pull request Feb 4, 2015

ENH/API: DataFrame.stack() support for level=None, sequentially=True/False, and NaN level values. #9023

Closed

seth-p mentioned this pull request Feb 16, 2015

BUG: DataFrame.unstack() with two levels containing NaN #9497

Closed
