pivot & unstack with nan in the index #9061

Merged
1 commit merged on Dec 22, 2014

Conversation

behzadnouri
Contributor

closes #7466
On this branch:

>>> df
     a   b   c
0   R1  C1  10
1   R2  C2  15
2  NaN  C3  17
3   R4  C4  20
>>> df.pivot('a', 'b', 'c').fillna('.')
b    C1  C2  C3  C4
a
NaN   .   .  17   .
R1   10   .   .   .
R2    .  15   .   .
R4    .   .   .  20
>>> df.pivot('b', 'a', 'c').fillna('.')
a  NaN  R1  R2  R4
b
C1   .  10   .   .
C2   .   .  15   .
C3  17   .   .   .
C4   .   .   .  20

The pivot function requires the unstack function to work with NaNs in the index:

>>> df
                4th     5th
1st 2nd 3rd
d   y   67   d.y.67  67.y.d
        39   d.y.39  39.y.d
    w   53   d.w.53  53.w.d
NaN w   72    .w.72  72.w. 
        57    .w.57  57.w. 
    NaN 80    . .80  80. . 
        31    . .31  31. . 
        18    . .18  18. . 
a   z   11   a.z.11  11.z.a
        30   a.z.30  30.z.a
c   z   59   c.z.59  59.z.c
        50   c.z.50  50.z.c
    NaN 62   c. .62  62. .c
e   NaN 59   e. .59  59. .e
        76   e. .76  76. .e
b   x   52   b.x.52  52.x.b
        14   b.x.14  14.x.b
        53   b.x.53  53.x.b
    NaN 60   b. .60  60. .b
        51   b. .51  51. .b

The 4th and 5th columns are there so that the frame is easy to verify:

>>> df.unstack(level=1).fillna('-')
            4th                                     5th
2nd         NaN       w       x       y       z     NaN       w       x       y       z
1st 3rd
NaN 18    . .18       -       -       -       -  18. .        -       -       -       -
    31    . .31       -       -       -       -  31. .        -       -       -       -
    57        -   .w.57       -       -       -       -  57.w.        -       -       -
    72        -   .w.72       -       -       -       -  72.w.        -       -       -
    80    . .80       -       -       -       -  80. .        -       -       -       -
a   11        -       -       -       -  a.z.11       -       -       -       -  11.z.a
    30        -       -       -       -  a.z.30       -       -       -       -  30.z.a
b   14        -       -  b.x.14       -       -       -       -  14.x.b       -       -
    51   b. .51       -       -       -       -  51. .b       -       -       -       -
    52        -       -  b.x.52       -       -       -       -  52.x.b       -       -
    53        -       -  b.x.53       -       -       -       -  53.x.b       -       -
    60   b. .60       -       -       -       -  60. .b       -       -       -       -
c   50        -       -       -       -  c.z.50       -       -       -       -  50.z.c
    59        -       -       -       -  c.z.59       -       -       -       -  59.z.c
    62   c. .62       -       -       -       -  62. .c       -       -       -       -
d   39        -       -       -  d.y.39       -       -       -       -  39.y.d       -
    53        -  d.w.53       -       -       -       -  53.w.d       -       -       -
    67        -       -       -  d.y.67       -       -       -       -  67.y.d       -
e   59   e. .59       -       -       -       -  59. .e       -       -       -       -
    76   e. .76       -       -       -       -  76. .e       -       -       -       -

NaN values are kept out of the levels and are represented by -1 in the labels:

>>> df.unstack(level=1).index
MultiIndex(levels=[['a', 'b', 'c', 'd', 'e'], [11, 14, 18, 30, 31, 39, 50, 51, 52, 53, 57, 59, 60, 62, 67, 72, 76, 80]],
           labels=[[-1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4], [2, 4, 10, 15, 17, 0, 3, 1, 7, 8, 9, 12, 6, 11, 13, 5, 9, 14, 11, 16]],
           names=['1st', '3rd'])
>>> df.unstack(level=1).columns
MultiIndex(levels=[['4th', '5th'], ['w', 'x', 'y', 'z']],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [-1, 0, 1, 2, 3, -1, 0, 1, 2, 3]],
           names=[None, '2nd'])
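
The levels/labels split shown above can be illustrated with a minimal pure-Python sketch. This is not the pandas implementation; `factorize_with_nan` is a hypothetical helper written only for this illustration. It encodes a column as sorted unique levels plus integer labels, with -1 standing in for NaN:

```python
import math


def factorize_with_nan(values):
    """Hypothetical helper: encode values as (levels, labels), NaN -> -1."""
    def is_nan(v):
        return isinstance(v, float) and math.isnan(v)

    # NaN never becomes a level; only real values are collected and sorted.
    levels = sorted({v for v in values if not is_nan(v)})
    position = {v: i for i, v in enumerate(levels)}
    # Each value is replaced by its position in levels, or -1 for NaN.
    labels = [-1 if is_nan(v) else position[v] for v in values]
    return levels, labels


nan = float("nan")
levels, labels = factorize_with_nan(["d", "d", nan, "a", "c", nan, "b"])
print(levels)  # ['a', 'b', 'c', 'd']
print(labels)  # [3, 3, -1, 0, 2, -1, 1]
```

As in the MultiIndex output above, the levels contain no NaN entry and the missing values show up only as -1 in the labels.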

@behzadnouri
Contributor Author

Benchmarks run with the -r 'unstack' and -r 'pivot' arguments:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
reshape_pivot_time_series                    | 311.2280 | 291.2517 |   1.0686 |
groupby_pivot_table                          |  31.4820 |  28.9483 |   1.0875 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [6715a60] : unstack with nan in the index
Base   [5bb5025] : DOC: initial setup for v0.16.0

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
unstack_sparse_keyspace                      |   2.7093 |   2.8173 |   0.9617 |
reshape_unstack_simple                       |   5.6877 |   5.5610 |   1.0228 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [6715a60] : unstack with nan in the index
Base   [5bb5025] : DOC: initial setup for v0.16.0
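
For reading the tables, a small sketch of how the ratio column relates to the timings. It assumes the ratio is simply head divided by base, which is consistent with the numbers shown; the `ratio` function is hypothetical and not part of the benchmark tooling:

```python
def ratio(head_ms, base_ms):
    """Assumed definition: head / base; below 1.0 means head is faster."""
    return head_ms / base_ms


# (head[ms], base[ms]) pairs copied from the tables above.
timings = {
    "reshape_pivot_time_series": (311.2280, 291.2517),
    "groupby_pivot_table": (31.4820, 28.9483),
    "unstack_sparse_keyspace": (2.7093, 2.8173),
    "reshape_unstack_simple": (5.6877, 5.5610),
}

for name, (head, base) in timings.items():
    print(f"{name:<28} {ratio(head, base):.4f}")
# ratios: 1.0686, 1.0875, 0.9617, 1.0228
```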

@shoyer
Member

shoyer commented Dec 14, 2014

This is not an area of the pandas codebase that I'm particularly familiar with, but I took a quick look and generally this looks good to me.

@behzadnouri force-pushed the nan-pivot branch 4 times, most recently from 89d3473 to 11fdd46, on December 16, 2014 12:39
@shoyer added the Internals (related to non-user accessible pandas implementation) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels on Dec 19, 2014
@shoyer added this to the 0.16.0 milestone on Dec 19, 2014
@shoyer
Member

shoyer commented Dec 19, 2014

@behzadnouri could you add release notes for this enhancement? A simple example like the one you show in your first post would be nice.

@hayd @TomAugspurger @sinhrks any thoughts?

# when index includes `nan`, need to lift levels/strides by 1
self.lift = 1 if -1 in self.index.labels[self.level] else 0

self.new_index_levels = list(index.levels)
Contributor

Perhaps use .copy() here rather than list?
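
The "lift levels/strides by 1" comment in the hunk above can be sketched in isolation. This is a toy illustration of the idea, not the code in this PR; `column_slots` and its arguments are hypothetical names. When a level contains the -1 label for NaN, every label is shifted up by one and the stride grows by one, so NaN gets its own slot at position 0:

```python
def column_slots(col_labels, n_levels):
    """Toy version of the 'lift' idea: make room for a NaN column.

    col_labels use -1 for NaN; n_levels counts the non-NaN levels.
    """
    lift = 1 if -1 in col_labels else 0
    stride = n_levels + lift  # one extra slot when NaN is present
    slots = [label + lift for label in col_labels]  # -1 -> 0, 0 -> 1, ...
    return lift, stride, slots


# Labels for the '2nd' level values [NaN, 'w', 'x', 'y', 'z'].
lift, stride, slots = column_slots([-1, 0, 1, 2, 3], n_levels=4)
print(lift, stride, slots)  # 1 5 [0, 1, 2, 3, 4]

# Without NaN nothing shifts.
print(column_slots([0, 1, 2, 3], n_levels=4))  # (0, 4, [0, 1, 2, 3])
```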

@hayd
Contributor

hayd commented Dec 19, 2014

This looks pretty good to me too (pesky NaN).

@behzadnouri
Contributor Author

@shoyer given that #9101 was merged, I first need to refactor some of the logic common to the two.

@sinhrks
Member

sinhrks commented Dec 20, 2014

Yes, looks nice.

@behzadnouri
Contributor Author

@shoyer added the release notes. I put this under bug fixes, since it fixes a special case where an existing API breaks, as opposed to adding a new feature.

Since there was some re-factoring with #9101 I reran the benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_pivot_table                          |  29.3674 |  29.2567 |   1.0038 |
reshape_pivot_time_series                    | 300.8887 | 295.7307 |   1.0174 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [06b0431] : unstack with nan in the index
Base   [8eee396] : Merge pull request #9104 from jreback/dti_iteration

BUG: Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
unstack_sparse_keyspace                      |   2.3497 |   2.7726 |   0.8475 |
reshape_unstack_simple                       |   5.3949 |   5.5247 |   0.9765 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [06b0431] : unstack with nan in the index
Base   [8eee396] : Merge pull request #9104 from jreback/dti_iteration

BUG: Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)

"""
Given a list of labels at each level, returns a flat array of int64 ids
corresponding to unique tuples across the labels. If `retain_lex_rank`,
rank of returned ids preserve lexical ranks of labels.
Contributor

Can you add a Parameters/Returns block?
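
As a rough illustration of what the docstring describes, and of the Parameters/Returns layout being requested, here is a toy pure-Python sketch. It is not the function from this PR: `flat_ids` is a hypothetical name, it always preserves lexical rank (there is no `retain_lex_rank` switch), and it relies on Python's unbounded integers, whereas an implementation over int64 would also need to guard against overflow.

```python
def flat_ids(labels, shape):
    """Toy sketch: combine per-level labels into one flat id per row.

    Parameters
    ----------
    labels : list of list of int
        One list of integer labels per level; -1 marks NaN.
    shape : list of int
        Number of distinct non-NaN values at each level.

    Returns
    -------
    list of int
        One id per row. Equal label tuples get equal ids, and ids sort
        in the same order as the label tuples.
    """
    # Lift a level by 1 when it contains -1 so every digit is non-negative.
    lifts = [1 if -1 in lab else 0 for lab in labels]
    sizes = [n + lift for n, lift in zip(shape, lifts)]
    ids = [0] * len(labels[0])
    # Mixed-radix encoding: earlier levels are the more significant digits.
    for lab, lift, size in zip(labels, lifts, sizes):
        ids = [i * size + (x + lift) for i, x in zip(ids, lab)]
    return ids


# Rows: (0, 2), (NaN, 0), (1, NaN), (0, 2)
print(flat_ids([[0, -1, 1, 0], [2, 0, -1, 2]], shape=[2, 3]))  # [7, 1, 8, 7]
```

The two identical rows share id 7, and the row whose first label is NaN sorts first.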

jreback added a commit that referenced this pull request Dec 22, 2014
pivot & unstack with nan in the index
@jreback merged commit 099a02c into pandas-dev:master Dec 22, 2014
@jreback
Contributor

jreback commented Dec 22, 2014

@behzadnouri

thanks

What I am referring to above is that these types of structures are easy to create but very tricky to actually index.

In [20]: df.set_index(['1st','2nd','3rd']).unstack(0)          
Out[20]: 
        value                    
1st       NaN   a   b   c   d   e
2nd 3rd                          
NaN 18      7 NaN NaN NaN NaN NaN
    31      6 NaN NaN NaN NaN NaN
    51    NaN NaN  19 NaN NaN NaN
    59    NaN NaN NaN NaN NaN  13
    60    NaN NaN  18 NaN NaN NaN
    62    NaN NaN NaN  12 NaN NaN
    76    NaN NaN NaN NaN NaN  14
    80      5 NaN NaN NaN NaN NaN
w   53    NaN NaN NaN NaN   2 NaN
    57      4 NaN NaN NaN NaN NaN
    72      3 NaN NaN NaN NaN NaN
x   14    NaN NaN  16 NaN NaN NaN
    52    NaN NaN  15 NaN NaN NaN
    53    NaN NaN  17 NaN NaN NaN
y   39    NaN NaN NaN NaN   1 NaN
    67    NaN NaN NaN NaN   0 NaN
z   11    NaN   8 NaN NaN NaN NaN
    30    NaN   9 NaN NaN NaN NaN
    50    NaN NaN NaN  11 NaN NaN
    59    NaN NaN NaN  10 NaN NaN

This fails

In [21]: df.set_index(['1st','2nd','3rd']).unstack(0).loc[nan]
ValueError: cannot use label indexing with a null key

This is ok though

In [22]: df.set_index(['1st','2nd','3rd']).unstack(0).loc[(nan,18)]
Out[22]: 
       1st
value  NaN     7
       a     NaN
       b     NaN
       c     NaN
       d     NaN
       e     NaN
Name: (nan, 18), dtype: float64
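
One general reason NaN keys are awkward can be shown without pandas. The sketch below does not reproduce the ValueError above; it only illustrates the floating-point behaviour that any label lookup has to work around (the `is_nan` helper is hypothetical):

```python
import math

nan = float("nan")
keys = [nan, "a", "b"]

# NaN never compares equal, not even to itself...
print(nan == nan)  # False

# ...so an equality-based search for a NaN key finds nothing,
matches = [i for i, k in enumerate(keys) if k == float("nan")]
print(matches)  # []


# while an explicit isnan check locates it.
def is_nan(v):
    return isinstance(v, float) and math.isnan(v)


print([i for i, k in enumerate(keys) if is_nan(k)])  # [0]
```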

Labels
Enhancement; Internals (related to non-user accessible pandas implementation); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); MultiIndex
Development

Successfully merging this pull request may close these issues.

BUG: pivot with nans gives a seemingly wrong result
5 participants