
ENH/API: DataFrame.stack() support for level=None, sequentially=True/False, and NaN level values. #9023


Closed
seth-p wants to merge 1 commit

Conversation

@seth-p (Contributor) commented Dec 6, 2014

closes #8851
closes #9399
closes #9406
closes #9533

@jreback (Contributor) commented Dec 6, 2014

I don't like these conversions to sets. Why not just use a list and make a scalar into a single-element list, thus guaranteeing you can always iterate over the levels?

(It's kind of what you are doing with a set, but we use lists for this purpose.)
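
For illustration, the normalization being suggested might look like this (a hypothetical sketch, not code from this PR):

def ensure_level_list(level):
    # Normalize a scalar level to a single-element list, so downstream
    # code can always iterate over the levels.
    if not isinstance(level, (list, tuple)):
        return [level]
    return list(level)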

@jreback added the API Design and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Dec 6, 2014
@jreback added this to the 0.16.0 milestone Dec 6, 2014
@seth-p (Contributor, Author) commented Dec 6, 2014

I didn't want to break backwards compatibility with the existing treatment of lists, e.g. the fact that df.stack(level=[0,1]) is equivalent to df.stack(level=0).stack(level=0), and is different from df.stack(level=[1,0]) (which is equivalent to df.stack(level=1).stack(level=0)). Note that the order of the levels in the list matters.

The behavior I implemented for df.stack(level={0, 1}) is different from df.stack(level=[0, 1]), in that I want to stack the two levels simultaneously, without introducing extra rows corresponding to unused combinations of level=0 and level=1 values. (Note that the dropna parameter isn't really sufficient to get at the desired behavior, since it doesn't make a distinction between values that were missing in the original DataFrame and values that are missing because the index value didn't exist in the original DataFrame.) Since here I want to stack the levels simultaneously, their order doesn't matter, which seems like the natural semantics of a set.
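
To make the distinction concrete, here is a minimal sketch (the set syntax in the last line is this PR's proposal, not released pandas):

import numpy as np
import pandas as pd

# Two-level columns that are not a full product: only ('A', 'x') and
# ('B', 'y') exist, and the cell at (0, ('B', 'y')) is genuinely missing.
columns = pd.MultiIndex.from_tuples([('A', 'x'), ('B', 'y')])
df = pd.DataFrame([[1.0, np.nan], [3.0, 4.0]], columns=columns)

# Sequential stacking materializes the unused combinations ('A', 'y') and
# ('B', 'x') as phantom NaN entries:
df.stack(level=[0, 1], dropna=False)   # 8 entries, 5 of them NaN

# dropna=True cannot distinguish the phantom NaNs from the one value that
# was actually missing in df, so it drops all five:
df.stack(level=[0, 1])                 # 3 entries

# The proposed simultaneous form would emit exactly one entry per cell of
# df, preserving the original NaN and omitting the phantoms:
# df.stack(level={0, 1})               # 4 entries, 1 of them NaN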

With this PR, one can do df.stack(level=[0, {1, 5}, 3]) to say: first stack level 0, then stack levels 1 and 5 simultaneously, and then stack level 3. That's probably excessive, but I figured that once I allowed for a distinction between level=[0, 1] and level={0, 1}, I might as well go all out and allow a list of levels and/or sets of levels.

An alternative, more restrictive implementation would be to stick with lists, but add a boolean simultaneous flag (defaulting to False for backwards compatibility) indicating that all the levels in the list should be stacked simultaneously. But it's not as flexible.

@seth-p (Contributor, Author) commented Dec 6, 2014

Argh, Python 2.6 doesn't support set comprehensions.

@shoyer (Member) commented Dec 7, 2014

Arguments whose meaning changes depending on their type seem very un-pythonic to me, so I am -1 on the set/list distinction.

Using a list of sets to control simultaneous stacking of levels is cute but complex, and I'm struggling to think of when this would actually be useful. I don't think it's so bad to call stack repeatedly.

However, I do think making stacking simultaneous would be a good change to the API. It is technically a break in backwards compatibility, so we'll need to think about how that could be rolled out. We could do that with a keyword. But I also think the number of people who would be affected by this is likely to be quite small (when the existing behavior is encountered, it is probably followed by dropna), and in the long term there wouldn't be any point to including the flag.

@seth-p (Contributor, Author) commented Dec 7, 2014

I agree that providing a list of sets is extremely unlikely. My main requirement is to be able to support simultaneous stacking of multiple levels, and in particular of all levels: if df.shape = (m, n), I'd like to be able to stack the columns to get a Series of length m * n, regardless of the level structure of df.columns (in particular if it is not a "product", i.e. where np.product([len(lev) for lev in df.columns.levels]) > n), and regardless of whether df has any missing values. (If df has no missing values, df.stack(level=list(range(len(df.columns.levels))), dropna=True) will do. But since dropna=True makes no distinction between values that are missing in the original data and the missing values for "phantom" level combinations created when stacking levels sequentially, this doesn't work in general.)
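
A rough pure-pandas sketch of that requirement -- exactly m * n output entries regardless of the column level structure (a hypothetical helper, assuming a flat df.index):

import pandas as pd

def stack_all_simultaneously(df):
    # Emit exactly one entry per cell of df by pairing each row label
    # with each existing column key, so "phantom" level combinations
    # never appear and original NaNs survive.
    tuples = []
    for row in df.index:
        for col in df.columns:
            col_key = col if isinstance(col, tuple) else (col,)
            tuples.append((row,) + col_key)
    idx = pd.MultiIndex.from_tuples(
        tuples, names=[df.index.name] + list(df.columns.names))
    # values.ravel() is row-major, matching the loop order above
    return pd.Series(df.values.ravel(), index=idx)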

I sort of like my current distinction between lists and sets, but if everyone else thinks it's non-pythonic, I'm happy to change it. It would be reasonably straightforward to change this PR to simply stack a list of levels either sequentially or simultaneously based on a new flag. The only thing one would give up is supporting a list of sets, which I agree is a rather unlikely use case.

As an aside, it would be useful, I think, to support an 'all' value for the levels parameter, so that one doesn't need to hard-code level=[1, 2, 3] or level=list(range(len(df.columns.levels))). Anything better than level='all'?

@seth-p force-pushed the multilevel_stack branch 2 times, most recently from c9bb60c to 41b30df on December 7, 2014
@shoyer (Member) commented Dec 7, 2014

> As an aside, it would be useful, I think, to support an 'all' value for the levels parameter, so that one doesn't need to hard-code level=[1, 2, 3] or level=list(range(len(df.columns.levels))). Anything better than level='all'?

I agree, this would be useful functionality. But using level='all' is a bad idea, because 'all' is a valid name for a level. Rather than adding a new keyword argument or sentinel value, I think this would be handled most cleanly by adding a new method: stack_all.
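
For concreteness, such a method could be a thin wrapper (a hypothetical sketch; stack_all was never added to pandas):

def stack_all(df, dropna=True):
    # Stack every column level at once, by position; no 'all' string
    # or sentinel value needed in the level argument.
    return df.stack(level=list(range(df.columns.nlevels)), dropna=dropna)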

Also: whatever we settle on here for changing stack, we should probably also implement for unstack (though unstack_all would be unnecessary).

@jreback (Contributor) commented Dec 7, 2014

Another option is to do something like (IIRC this is useful at other times as well):

df.stack(levels=pd.AllLevels)

@seth-p (Contributor, Author) commented Dec 8, 2014

I'm not sure why, when I looked earlier, I thought the changes I made to stack() weren't applicable to unstack(). The only difference I see now is that unstack() is missing the dropna parameter.

@jreback (Contributor) commented Jan 18, 2015

@seth-p can you revisit?

@seth-p (Contributor, Author) commented Jan 22, 2015

Haven't looked at this in a while. Will try to revisit over the next few days.

@seth-p (Contributor, Author) commented Feb 2, 2015

@jreback, @shoyer, question for you guys: is pd.core.reshape.stack() a "public" API that I should not alter (except in a backwards-compatible way), or am I free to change it as I see fit so long as pd.DataFrame.stack() remains backwards compatible? I think it should be regarded as internal -- it doesn't show up in the API Reference, http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html (though that's not necessarily complete) -- and that I can change it if I want, but I figured I'd check first.

@shoyer (Member) commented Feb 2, 2015

@seth-p I agree, I don't think we make any API guarantees for pandas.core -- only for the top level namespace.

@seth-p force-pushed the multilevel_stack branch 2 times, most recently from 9facf36 to 6b660ba on February 3, 2015
@seth-p changed the title from "ENH: DataFrame.stack() with 'level' a set or list of sets" to "ENH/API: DataFrame.stack() support for level=ALL_LEVELS and sequentially=True/False" Feb 3, 2015
@@ -123,6 +123,8 @@ class Index(IndexOpsMixin, PandasObject):

_engine_type = _index.ObjectEngine

ALL_LEVELS = -1000
A Member commented on this diff:

My preferred way to make sentinel values is usually ALL_LEVELS = object(). Then, use is to do comparisons.
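
Sketched, the suggested pattern looks like this (hypothetical code, not the PR's actual diff):

ALL_LEVELS = object()   # unique sentinel; no user-supplied level value can be `is` it

def stack(df, level=ALL_LEVELS, dropna=True):
    # Hypothetical wrapper showing the sentinel pattern: compare with
    # `is` (identity), never `==`.
    if level is ALL_LEVELS:
        level = list(range(df.columns.nlevels))
    return df.stack(level=level, dropna=dropna)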

@jorisvandenbossche (Member)

I don't really like the pd.Index.ALL_LEVELS. Can't we use None for this? df.unstack(level=None)

@seth-p (Contributor, Author) commented Feb 3, 2015

@jorisvandenbossche, ok, I've removed ALL_LEVELS and made df.stack(level=None) mean all levels.

I haven't made any corresponding changes to DataFrame.unstack(), as that code is very different from DataFrame.stack(). I would have expected DataFrame.unstack() to be implemented as DataFrame.T.stack().T, but it's not (except that it is implemented as DataFrame.T.stack() in the special case where level is a single number and df.index is a simple Index, not a MultiIndex).

In general the code (in reshape.py) that implements DataFrame.unstack() (unstack(), _unstack_frame(), and _unstack_multiple()) looks quite different from the code that implements DataFrame.stack() (stack(), stack_multiple(), and _stack_multi_columns() -- before the changes of this PR).

Comparing the results of df.unstack() to df.T.stack().T, I observe two differences:

  1. dtypes. Since dtypes are determined per column, transposing can mess them up. Given this, I'm still thinking about what to do for unstack().
  2. When a MultiIndex label is NaN, things are just plain weird. Perhaps there should be a separate issue for this. While the current code passes test_unstack_nan_index() in test_frame.py, the treatment of NaN labels seems very inconsistent/unreliable:
In [140]: df = pd.DataFrame(np.arange(4).reshape(2, 2),
                            columns=pd.MultiIndex.from_tuples([('A','a'), ('B', 'b')],
                                                              names=['Upper', 'Lower']),
                            index=Index([0, 1], name='Num'), dtype=np.float64)

In [141]: df_nan = pd.DataFrame(np.arange(4).reshape(2, 2),
                            columns=pd.MultiIndex.from_tuples([('A',np.nan), ('B', 'b')],
                                                              names=['Upper', 'Lower']),
                            index=Index([0, 1], name='Num'), dtype=np.float64)

In [148]: df
Out[148]:
Upper  A  B
Lower  a  b
Num
0      0  1
1      2  3

In [149]: df.stack()
Out[149]:
Upper       A   B
Num Lower
0   a       0 NaN
    b     NaN   1
1   a       2 NaN
    b     NaN   3

In [150]: df.T.unstack().T
Out[150]:
Upper       A   B
Num Lower
0   a       0 NaN
    b     NaN   1
1   a       2 NaN
    b     NaN   3

In [151]: df_nan
Out[151]:
Upper   A  B
Lower NaN  b
Num
0       0  1
1       2  3

In [152]: df_nan.stack()
Out[152]:
Upper      A  B
Num Lower
0   NaN    0  1
    b      0  1
1   NaN    2  3
    b      2  3

In [153]: df_nan.T.unstack().T
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-153-edbcaeb64f64> in <module>()
----> 1 df_nan.T.unstack().T

C:\Python34\lib\site-packages\pandas\core\frame.py in unstack(self, level)
   3486         """
   3487         from pandas.core.reshape import unstack
-> 3488         return unstack(self, level)
   3489
   3490     #----------------------------------------------------------------------

C:\Python34\lib\site-packages\pandas\core\reshape.py in unstack(obj, level)
    439     if isinstance(obj, DataFrame):
    440         if isinstance(obj.index, MultiIndex):
--> 441             return _unstack_frame(obj, level)
    442         else:
    443             return obj.T.stack(dropna=False)

C:\Python34\lib\site-packages\pandas\core\reshape.py in _unstack_frame(obj, level)
    479     else:
    480         unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 481                                value_columns=obj.columns)
    482         return unstacker.get_result()
    483

C:\Python34\lib\site-packages\pandas\core\reshape.py in __init__(self, values, index, level, value_columns)
    101
    102         self._make_sorted_values_labels()
--> 103         self._make_selectors()
    104
    105     def _make_sorted_values_labels(self):

C:\Python34\lib\site-packages\pandas\core\reshape.py in _make_selectors(self)
    143
    144         if mask.sum() < len(self.index):
--> 145             raise ValueError('Index contains duplicate entries, '
    146                              'cannot reshape')
    147

ValueError: Index contains duplicate entries, cannot reshape

@jorisvandenbossche (Member)

@seth-p maybe don't rush into changing it to None; it was just an idea of mine, and I don't know what others think of it.

And on unstack, what does it have to do with the issue you are solving here? Why does this need to be changed?

@seth-p (Contributor, Author) commented Oct 17, 2015

I rebased.

On my first attempt I couldn't get asv to work with either conda or virtualenv (am I the only one struggling under Windows? Certain packages simply don't install automatically with pip; I have to download version-specific wheels from http://www.lfd.uci.edu/~gohlke/pythonlibs/), but I'll try to get it to work.

@jorisvandenbossche (Member)

I don't have any problems on Windows with asv. What problems do you have? It is a very simple pure-Python package with no other difficult dependencies, so I just cloned it and did python setup.py install as they explain in their docs.

@seth-p (Contributor, Author) commented Oct 17, 2015

My problem isn't asv itself, but rather getting virtualenv/conda to set up the virtual environments with all the packages. OK, now I know that it's just me, so I'll try to figure it out.


@jreback (Contributor) commented Oct 17, 2015

This is very easy on Windows: http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-development-environment

virtualenv works, but you have to set it up carefully; conda is quite easy.

@jorisvandenbossche (Member)

But I advise you to use conda and not virtualenv, certainly on Windows.

@seth-p (Contributor, Author) commented Oct 17, 2015

Neither one worked for me -- both seemed to fail at "pip install --update " -- but I will try to figure it out.


@jreback (Contributor) commented Oct 17, 2015

@seth-p there is NO pip install with conda.

so you mean to say

cd ~/pandas
conda create -n new_env python=2.7
activate new_env
conda install --file ci/requirements_dev.txt

doesn't work for you?

@jorisvandenbossche (Member)

@seth-p and by the way, you shouldn't be installing things yourself (at least for running the benchmarks), as asv creates the environments itself.

@seth-p (Contributor, Author) commented Oct 17, 2015

@jreback, you're right, the pip install --upgrade .. was from virtualenv, not conda. Sorry about that.

Now asv seems to be working with conda. I don't know what I was doing wrong previously. Thanks for pushing me in that direction.

@seth-p (Contributor, Author) commented Oct 17, 2015

I now get the following error from asv/conda. Is it really just a matter of taking too long? If so, can I tell it to wait longer?

C:\Users\seth\github\pandas\asv_bench>asv continuous master HEAD
· Creating environments
· Discovering benchmarks
·· Uninstalling from py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
·· Error running C:\Users\seth\github\pandas\asv_bench\env\fb7d1550581d7ee2d55fa3f3eca5f25a\python.exe setup.py build
             STDOUT -------->
             running build
             running build_py
             creating build
             creating build\lib.win-amd64-3.4
             creating build\lib.win-amd64-3.4\pandas
             copying pandas\info.py -> build\lib.win-amd64-3.4\pandas
             copying pandas\version.py -> build\lib.win-amd64-3.4\pandas
             copying pandas\__init__.py -> build\lib.win-amd64-3.4\pandas
             creating build\lib.win-amd64-3.4\pandas\compat
             copying pandas\compat\chainmap.py -> build\lib.win-amd64-3.4\pandas\compat
             copying pandas\compat\chainmap_impl.py -> build\lib.win-amd64-3.4\pandas\compat
             ...
             << lot of "copying" and "warning" messages >>
             ...
             pandas\algos.c(189710) : warning C4244: 'function' : conversion from 'Py_ssize_t' to 'int', possible loss of data
             pandas\algos.c(189926) : warning C4244: 'function' : conversion from 'npy_int64' to 'int', possible loss of data
             pandas\algos.c(212430) : warning C4244: '=' : conversion from 'float' to '__pyx_t_5numpy_float16_t', possible loss of data
             pandas\algos.c(212678) : warning C4244: '=' : conversion from 'float' to '__pyx_t_5numpy_float16_t', possible loss of data
             STDERR -------->

Traceback (most recent call last):
  File "C:\Python34\Scripts\asv-script.py", line 9, in <module>
    load_entry_point('asv==0.2.dev818+c07fdbf2', 'console_scripts', 'asv')()
  File "C:\Python34\lib\site-packages\asv\main.py", line 36, in main
    result = args.func(args)
  File "C:\Python34\lib\site-packages\asv\commands\__init__.py", line 48, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "C:\Python34\lib\site-packages\asv\commands\continuous.py", line 49, in run_from_conf_args
    **kwargs
  File "C:\Python34\lib\site-packages\asv\commands\continuous.py", line 73, in run
    _machine_file=_machine_file)
  File "C:\Python34\lib\site-packages\asv\commands\run.py", line 198, in run
    benchmarks = Benchmarks(conf, regex=bench)
  File "C:\Python34\lib\site-packages\asv\benchmarks.py", line 289, in __init__
    benchmarks = self.disc_benchmarks(conf)
  File "C:\Python34\lib\site-packages\asv\benchmarks.py", line 331, in disc_benchmarks
    env.install_project(conf)
  File "C:\Python34\lib\site-packages\asv\environment.py", line 360, in install_project
    build_root = self.build_project(commit_hash)
  File "C:\Python34\lib\site-packages\asv\environment.py", line 338, in build_project
    self.run(['setup.py', 'build'], cwd=self._build_root)
  File "C:\Python34\lib\site-packages\asv\plugins\conda.py", line 125, in run
    return self.run_executable('python', args, **kwargs)
  File "C:\Python34\lib\site-packages\asv\environment.py", line 394, in run_executable
    return util.check_output([exe] + args, **kwargs)
  File "C:\Python34\lib\site-packages\asv\util.py", line 497, in check_output
    raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command 'C:\Users\seth\github\pandas\asv_bench\env\fb7d1550581d7ee2d55fa3f3eca5f25a\python.exe setup.py build' timed out

@jorisvandenbossche (Member)

@seth-p That was an issue with the timeout calculation, but it should be fixed in asv master: airspeed-velocity/asv#319 (if not, best to report it there).

@seth-p (Contributor, Author) commented Oct 18, 2015

I still get the timeout error after installing the latest master (pip install -U git+git://github.com/spacetelescope/asv.git, and confirming that my installed version has the changes in airspeed-velocity/asv@4397d75). Hrmph.

@seth-p (Contributor, Author) commented Oct 18, 2015

I overrode all the timeout defaults in asv itself (since I couldn't figure out how to do it on the command line or in asv.conf.json), and it runs until I get the following error message, which looks like an asv bug. I feel like I must have offended the Python gods, if other people don't encounter these issues...

C:\Users\seth\github\pandas\asv_bench>asv continuous master HEAD
· Creating environments
· Discovering benchmarks
·· Uninstalling from py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
·· Installing into py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
· Running 1320 total benchmarks (2 commits * 1 environments * 660 benchmarks)
[  0.00%] · For pandas commit hash cd9c777f:
[  0.00%] ·· Building for py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
[  0.00%] ·· Benchmarking py3.4-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlwt
[  0.08%] ··· Running attrs_caching.getattr_dataframe_index.time_getattr_dataframe_index
[  0.15%] ··· Running attrs_caching.setattr_dataframe_index.time_setattr_dataframe_indexTraceback (most recent call last):
  File "C:\Python34\Scripts\asv-script.py", line 9, in <module>
    load_entry_point('asv==0.2.dev818+c07fdbf2', 'console_scripts', 'asv')()
  File "C:\Python34\lib\site-packages\asv\main.py", line 36, in main
    result = args.func(args)
  File "C:\Python34\lib\site-packages\asv\commands\__init__.py", line 48, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "C:\Python34\lib\site-packages\asv\commands\continuous.py", line 49, in run_from_conf_args
    **kwargs
  File "C:\Python34\lib\site-packages\asv\commands\continuous.py", line 73, in run
    _machine_file=_machine_file)
  File "C:\Python34\lib\site-packages\asv\commands\run.py", line 276, in run
    profile=profile, skip=skipped_benchmarks)
  File "C:\Python34\lib\site-packages\asv\benchmarks.py", line 529, in run_benchmarks
    cwd=tmpdir)
  File "C:\Python34\lib\site-packages\asv\benchmarks.py", line 171, in run_benchmark
    log_result(display)
  File "C:\Python34\lib\site-packages\asv\benchmarks.py", line 91, in log_result
    log.add(" {0}{1}".format(padding, msg))
  File "C:\Python34\lib\site-packages\asv\console.py", line 355, in add
    _write_with_fallback(msg, sys.stdout.write, sys.stdout)
  File "C:\Python34\lib\site-packages\asv\console.py", line 157, in _write_with_fallback
    write(s)
  File "C:\Python34\lib\codecs.py", line 374, in write
    self.stream.write(data)
TypeError: must be str, not bytes

@jorisvandenbossche (Member)

@seth-p An alternative for now, to be able to run the benchmarks, is to run them using asv run --python same. This will run the benchmarks in the same Python environment you call asv from (so your development setup, where pandas master is already installed). Then you can run the benchmarks for master, and then for your branch, and manually compare the two.

Best to use something like asv run --python same -b stack to only run the benchmarks for stack/unstack.

I just tested the above approach, and I see some slowdown with this branch (~80%) for stack.

@seth-p (Contributor, Author) commented Oct 18, 2015

Thanks. Looks like they just fixed the asv bug in airspeed-velocity/asv#336. I'll take a look later today or tomorrow.


@seth-p (Contributor, Author) commented Oct 19, 2015

OK, I have asv working. For the three (un)stack benchmarks, my changes are 15-70% slower. I'll see if I can speed them up.

@jreback (Contributor) commented Nov 25, 2015

@seth-p how's this coming?

@seth-p (Contributor, Author) commented Nov 25, 2015

@jreback, I'm afraid I haven't had a chance to work on it.

@jreback (Contributor) commented Dec 15, 2015

xref #11847

@jreback (Contributor) commented Jan 20, 2016

Closing, but if you wish to rebase/update, pls do so.

Labels: API Design; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: None yet
5 participants