Skip to content

BUG: pandas.DataFrame().stack() raise an error, while expected is empty #36185

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Nov 26, 2020
Merged

BUG: pandas.DataFrame().stack() raise an error, while expected is empty #36185

merged 28 commits into from
Nov 26, 2020

Conversation

steveya
Copy link
Contributor

@steveya steveya commented Sep 7, 2020

@pep8speaks
Copy link

pep8speaks commented Sep 7, 2020

Hello @steveya! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-26 10:50:16 UTC

@jreback
Copy link
Contributor

jreback commented Sep 7, 2020

@steveya is this the correct issue reference?

@steveya
Copy link
Contributor Author

steveya commented Sep 8, 2020

@jrreback, It should fix the error seen in #36113, this is my first contribution so please let me know the specifics if I have made even an obvious error. Thanks.

Comment on lines 1277 to 1278
tm.assert_series_equal(
DataFrame().stack(), Series(index=MultiIndex([[], []], [[], []]), dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you expect that there should be MultiIndex?
Would it be reasonable to expect Series([], dtype=object)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that Series([1, 2, 3]).unstack() will throw an error as it does not know what to put on the columns, this is because it has only one level of index. I think the same should apply to an empty Series, so Series([]).unstack() would throw the same error and not become an empty DataFrame.

Given this constraint, for stack/unstack round-trip to work, DataFrame([]).stack() needs to return an empty Series with empty multi index with two levels, one from its original index and one from its empty column.

@@ -517,7 +517,7 @@ def factorize(index):
# For homogeneous EAs, frame._values will coerce to object. So
# we concatenate instead.
dtypes = list(frame.dtypes._values)
dtype = dtypes[0]
dtype = dtypes[0] if len(dtypes) > 0 else object
Copy link
Member

@ivanovmg ivanovmg Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you return the Series right away if the dataframe is empty?

        if not frame.empty:
            dtypes = list(frame.dtypes._values)
            dtype = dtypes[0]
        else:
            return Series([], dtype=object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you would still need to solve unstacking from an empty series.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likely you can just add a not frame.empty and frame._is_homogenerous_type on L516 might work

@jreback jreback changed the title BUG: GH36113 BUG: pandas.DataFrame().stack() raise an error, while expected is empty Sep 9, 2020
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 9, 2020
@@ -1273,6 +1273,18 @@ def test_stack_timezone_aware_values():
tm.assert_series_equal(result, expected)


def test_stack_empty_frame():
tm.assert_series_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

result=
expected=
tm.assert_series_equal (or frame)

pls parameterize these cases

add a comment with the issue number

@@ -517,7 +517,7 @@ def factorize(index):
# For homogeneous EAs, frame._values will coerce to object. So
# we concatenate instead.
dtypes = list(frame.dtypes._values)
dtype = dtypes[0]
dtype = dtypes[0] if len(dtypes) > 0 else object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likely you can just add a not frame.empty and frame._is_homogenerous_type on L516 might work

@steveya
Copy link
Contributor Author

steveya commented Sep 11, 2020

@jreback I have made the suggested changes.

@jreback jreback added this to the 1.2 milestone Sep 11, 2020
@steveya
Copy link
Contributor Author

steveya commented Sep 13, 2020

@jreback @TomAugspurger I have updated the code to raise proper error when a Series with a single level index will raise an exception when it is unstacked. a DataFrame with single level of index and column will no longer raise an exception when unstack is called (it will return a Series(). I will dig deeper into @TomAugspurger's example further.

@@ -322,7 +322,7 @@ Reshaping
- Bug in :meth:`DataFrame.pivot_table` with ``aggfunc='count'`` or ``aggfunc='sum'`` returning ``NaN`` for missing categories when pivoted on a ``Categorical``. Now returning ``0`` (:issue:`31422`)
- Bug in :func:`union_indexes` where input index names are not preserved in some cases. Affects :func:`concat` and :class:`DataFrame` constructor (:issue:`13475`)
- Bug in func :meth:`crosstab` when using multiple columns with ``margins=True`` and ``normalize=True`` (:issue:`35144`)
-
- Bug in :meth:`DataFrame.stack` for empty DataFrame (:issue:`36113`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate a bit on what is changing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have elaborate this a bit in the latest commit

# GH 36113
# Give nicer error messages when unstack a Series whose
# Index is not a MultiIndex.
raise ValueError("index must be a MultiIndex to unstack")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have a test that hits it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add: f'{type(obj.index)} was passed'

@tsu-shiuan
Copy link

Could we please create a patch for 0.25.3 when releasing ?

@jreback
Copy link
Contributor

jreback commented Sep 14, 2020

@tsu-shiuan thus won't even be backport to 1.x let alone 0.25.x

@tsu-shiuan
Copy link

@jreback Okay! ☹️

@steveya
Copy link
Contributor Author

steveya commented Nov 5, 2020

@jreback Sorry I have concluded that I am not sure how to do that. on my end, I have tried this

% git pull origin GH36113
From https://github.com/steveya/pandas

  • branch GH36113 -> FETCH_HEAD
    Already up to date.
    % git merge master
    Already up to date.
    % git push origin GH36113
    Everything up-to-date

This was what I did the 12 days ago and nothing seemed to have happened. Can you provide more guides please, thank you.

@arw2019
Copy link
Member

arw2019 commented Nov 5, 2020

@jreback Sorry I have concluded that I am not sure how to do that. on my end, I have tried this

% git pull origin GH36113
From https://github.com/steveya/pandas

  • branch GH36113 -> FETCH_HEAD
    Already up to date.
    % git merge master
    Already up to date.
    % git push origin GH36113
    Everything up-to-date

This was what I did the 12 days ago and nothing seemed to have happened. Can you provide more guides please, thank you.

do

git fetch upstream
git merge upstream/master
git push origin GH36113

You may have to resolve conflicts

@steveya
Copy link
Contributor Author

steveya commented Nov 5, 2020

@jreback I checked out the tests that failed. I cannot reproduce these errors locally and I am not sure how to fix them. Any suggestions?

Comment on lines 1199 to 1200
Series().unstack()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to specify dtype here in the Series to something (like dtype='float64', maybe).

I just looked at the test failure and it seems that it is caused by the warning in repr.

            if is_empty_data(data) and dtype is None:
                # gh-17261
>               warnings.warn(
                    "The default dtype for empty Series will be 'object' instead "
                    "of 'float64' in a future version. Specify a dtype explicitly "
                    "to silence this warning.",
                    DeprecationWarning,
                    stacklevel=2,
                )
E               DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

pandas/core/series.py:234: DeprecationWarning

@pytest.mark.parametrize("fill_value", [None, 0])
def test_stack_unstack_empty_frame(dropna, fill_value):
# GH 36113
result = DataFrame().stack(dropna=dropna).unstack(fill_value=fill_value)
Copy link
Member

@ivanovmg ivanovmg Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails in 32 bit.

>   def groupsort_indexer(const int64_t[:] index, Py_ssize_t ngroups):
E   ValueError: Buffer dtype mismatch, expected 'const int64_t' but got 'int'

Looks like also a problem with the dtype.
I am not sure if that is critical, but what if you specify here dtype=np.intp?

@ivanovmg
Copy link
Member

ivanovmg commented Nov 6, 2020

I see that the failures on py37-32bit persist.
Is it possible that dtype changes when performing stack/unstack?

=================================== FAILURES ===================================
__________________ test_stack_unstack_empty_frame[None-True] ___________________

dropna = True, fill_value = None

@pytest.mark.parametrize("dropna", [True, False])
@pytest.mark.parametrize("fill_value", [None, 0])
def test_stack_unstack_empty_frame(dropna, fill_value):
    # GH 36113
    result = (
      DataFrame(dtype=np.intp).stack(dropna=dropna).unstack(fill_value=fill_value)
    )

pandas/tests/frame/test_stack_unstack.py:1191:


pandas/core/series.py:3872: in unstack
return unstack(self, level, fill_value)
pandas/core/reshape/reshape.py:431: in unstack
obj.index, level=level, constructor=obj._constructor_expanddim
pandas/core/reshape/reshape.py:118: in init
self._make_selectors()
pandas/core/reshape/reshape.py:152: in _make_selectors
remaining_labels = self.sorted_labels[:-1]
pandas/_libs/properties.pyx:33: in pandas._libs.properties.CachedProperty.get
val = self.func(obj)
pandas/core/reshape/reshape.py:139: in sorted_labels
indexer, to_sort = self._indexer_and_to_sort
pandas/_libs/properties.pyx:33: in pandas._libs.properties.CachedProperty.get
val = self.func(obj)
pandas/core/reshape/reshape.py:132: in _indexer_and_to_sort
indexer = libalgos.groupsort_indexer(comp_index, ngroups)[0]


def groupsort_indexer(const int64_t[:] index, Py_ssize_t ngroups):
E ValueError: Buffer dtype mismatch, expected 'const int64_t' but got 'int'

pandas/_libs/algos.pyx:177: ValueError

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on the 32-bit, ping on green (or ping if you this doesn't fix)

@steveya
Copy link
Contributor Author

steveya commented Nov 20, 2020

@jreback the error persists after the change on 32bit

ValueError: Buffer dtype mismatch, expected 'const int64_t' but got 'int'

@jreback
Copy link
Contributor

jreback commented Nov 20, 2020

ok try this:

return comp_ids, obs_group_ids

here you want to add
return ensure_int64(comp_ids), ensure_int64(obs_group_ids)

if you can also update the Returns sections to indicate these are int64 indexers

@steveya
Copy link
Contributor Author

steveya commented Nov 22, 2020

@jreback by the Returns section you mean the doc for "compress_group_index" in sorting.py?

@jreback jreback removed this from the 1.2 milestone Nov 24, 2020
@jreback
Copy link
Contributor

jreback commented Nov 24, 2020

actually this should pass, @steveya can you merge master one more time.

@steveya
Copy link
Contributor Author

steveya commented Nov 25, 2020

@jreback yay there is only one remaining error in test_pivot.py in 32bit build.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2020

_ TestPivot.test_pivot_empty __________________________
[XPASS(strict)] GH 36579: fail on 32-bit system

oh this is easy, you fixed the test, so remove the xfail.

@jreback jreback added this to the 1.2 milestone Nov 25, 2020
@jreback jreback merged commit 0787b53 into pandas-dev:master Nov 26, 2020
@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

thanks @steveya very nice. thanks for sticking with it!

@steveya
Copy link
Contributor Author

steveya commented Nov 27, 2020

@jreback @arw2019 @ivanovmg , thank you all for helping me with my first PR.

@steveya steveya deleted the GH36113 branch November 27, 2020 07:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas.DataFrame().stack() raise an error, while expected is empty
7 participants