Skip to content

BUG: conversion of empty DataFrame to SparseDtype (#33113) #33118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 25, 2020
Merged

BUG: conversion of empty DataFrame to SparseDtype (#33113) #33118

merged 17 commits into from
Jun 25, 2020

Conversation

choucavalier
Copy link
Contributor

@choucavalier choucavalier commented Mar 29, 2020

@jreback
Copy link
Contributor

jreback commented Mar 29, 2020

rebase on master as have merged #33110 which might fix this (if so, you can just push the test itself)

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type labels Mar 29, 2020
@choucavalier
Copy link
Contributor Author

@jreback now this is synced with master, i removed my changes in generic.py and only left the test, which is still failing.

@jreback
Copy link
Contributor

jreback commented Mar 29, 2020

@jreback now this is synced with master, i removed my changes in generic.py and only left the test, which is still failing.

then there maybe still an issue

@@ -552,6 +553,10 @@ def test_astype_to_incorrect_datetimelike(self, unit):
with pytest.raises(TypeError, match=msg):
df.astype(dtype)

def test_astype_sparsedtype_empty_dataframe(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the same issue occurs with all extension array dtypes can you add a test to pandas\tests\extension\base\dtype.py instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!

@choucavalier
Copy link
Contributor Author

@simonjayhawkins i've added the new test.

@choucavalier choucavalier changed the title Fix conversion of empty DataFrame to SparseDtype (#33113) BUG: Fix conversion of empty DataFrame to SparseDtype (#33113) Mar 30, 2020
@choucavalier choucavalier changed the title BUG: Fix conversion of empty DataFrame to SparseDtype (#33113) BUG: conversion of empty DataFrame to SparseDtype (#33113) Mar 30, 2020

def test_astype_empty_dataframe(self, dtype):
empty_dataframe = pd.DataFrame()
assert empty_dataframe.astype(dtype) == empty_dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use tm.assert_frame_equal(result, empty_dataframe) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@choucavalier
Copy link
Contributor Author

i've added a small fix by simply returning an empty dataframe when there's nothing to concatenate.

@simonjayhawkins simonjayhawkins added this to the 1.1 milestone Mar 31, 2020
@simonjayhawkins
Copy link
Member

Thanks @tgy can you add a whatsnew

@choucavalier
Copy link
Contributor Author

@simonjayhawkins done!

@choucavalier
Copy link
Contributor Author

@simonjayhawkins @jreback it seems that using tm.assert_frame_equal is prevented by CI checks

##[error]pandas/tests/extension/base/dtype.py:119: 
tm.assert_frame_equal(empty_dataframe.astype(dtype), empty_dataframe)

maybe i should only do empty_dataframe.astype(dtype) instead, without checking the result.

what do you recommend?

there's also a typing validation error:

pandas/core/generic.py:5573: error: 
Incompatible return value type (got "DataFrame", expected "FrameOrSeries")

what should i do about that?

@simonjayhawkins
Copy link
Member

it seems that using tm.assert_frame_equal is prevented by CI checks

##[error]pandas/tests/extension/base/dtype.py:119: 
tm.assert_frame_equal(empty_dataframe.astype(dtype), empty_dataframe)

maybe i should only do empty_dataframe.astype(dtype) instead, without checking the result.

That was my mistake asking for the test to be moved in #33118 (comment). We'll probably need to keep the test where is was and use parameterization instead.

We don't have a fixture specifically for extension dtypes in conftest.py. (It maybe elsewhere), so could manually copy the dtypes from each of the /tests/extensions/.. tests for now.

@simonjayhawkins
Copy link
Member

there's also a typing validation error:

pandas/core/generic.py:5573: error: 
Incompatible return value type (got "DataFrame", expected "FrameOrSeries")

what should i do about that?

The issue only affects DataFrame? but for mypy the return type needs to the same as the input type. maybe the original fix, see #33118 (comment), is more appropriate?

@simonjayhawkins simonjayhawkins self-requested a review May 8, 2020 16:37
@choucavalier choucavalier requested a review from jreback June 21, 2020 14:52
@choucavalier
Copy link
Contributor Author

@jreback @simonjayhawkins I've updated the fix and moved the test. What do you guys think?

@pep8speaks
Copy link

pep8speaks commented Jun 21, 2020

Hello @tgy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-22 16:03:25 UTC

@simonjayhawkins
Copy link
Member

Thanks @tgy for being patient here and sorry for reviewing sooner

it seems that using tm.assert_frame_equal is prevented by CI checks

ah my bad. should have been self.assert_frame_equal

i've moved the test (again!) to pandas\tests\extension\base\casting.py

can you add a release note to doc\source\whatsnew\v1.1.0.rst

@choucavalier
Copy link
Contributor Author

@simonjayhawkins done! I was the one who postponed that issue :-) Thanks for reviewing so quickly on a Sunday

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tgy generally lgtm.

@@ -1068,6 +1068,7 @@ Sparse
- Bug in :meth:`Series.sum` with ``SparseArray`` raises ``TypeError`` (:issue:`25777`)
- Bug where :class:`DataFrame` containing :class:`SparseArray` filled with ``NaN`` when indexed by a list-like (:issue:`27781`, :issue:`29563`)
- The repr of :class:`SparseDtype` now includes the repr of its ``fill_value`` attribute. Previously it used ``fill_value``'s string representation (:issue:`34352`)
- Fixed bug where :class:`DataFrame` or :class:`Series` could not be cast to :class:`SparseDtype` when empty (:issue:`33113`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this fix is only for DataFrame. Series.astype works fine on master?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I tried and Series.astype works fine on master. Will fix the entry.

@@ -5548,6 +5548,9 @@ def astype(
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
return self._constructor(new_data).__finalize__(self, method="astype")

if not results:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be in elif is_extension_array_dtype(dtype) and self.ndim > 1: block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although then the issue mentioned in #33113 (comment) would not be fixed. maybe keep this where it is and a test for #33113 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not see your second comment. Ok I'll do that. Anyhow if we reach the "concat" part of the code and results is empty, then we have a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, one could argue that pd.concat of an empty list could return an empty frame or series instead of raising an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins Actually starting to think the other issue was not an issue but an error on the author's part. See #33113 (comment). Awaiting your answer to proceed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other issue is relevant.

on master and in this PR (latest commit)...

>>> df = pd.DataFrame([[1, 2], [3, 4]])
>>> df
   0  1
0  1  2
1  3  4
>>>
>>> res = df.astype(dict())
>>> res
   0  1
0  1  2
1  3  4
>>>
>>> res is df
False
>>>
>>> df = pd.DataFrame([dict(), dict()])
>>>
>>> df.astype(dict())
Traceback (most recent call last):
   ...
ValueError: No objects to concatenate
>>>

this PR before moving #33118 (comment) ...

>>> df = pd.DataFrame([dict(), dict()])
>>>
>>> df.astype(dict())
Empty DataFrame
Columns: []
Index: [0, 1]
>>>
>>> res is df
False

so it appears astype with an empty dict is a valid op with a copy returned. so this should also work on an empty DataFrame (even if created with empty dict).

these sort of circumstances can arise on data pipelines and having consistency reduces the complexity of downstream code.

so I think need to move the changes back and add a test for this.

Copy link
Contributor Author

@choucavalier choucavalier Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will revert the changes back and add a test. I still find it very strange to call .astype(T) where T is an instance of an object and not a type (except of course for string-defined such as .astype("float64")).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.astype can also accept a dict of column name -> data type as the dtype argument, https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.astype.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh that makes a lot of sense now.

@choucavalier
Copy link
Contributor Author

@simonjayhawkins let me know what you think :)

@simonjayhawkins
Copy link
Member

@simonjayhawkins let me know what you think :)

see #33118 (comment)

@simonjayhawkins
Copy link
Member

Thanks @tgy for adding the test. The test is not applicable to the extension array dtype tests. can you move it to pandas\tests\frame\methods\test_astype.py

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Jun 22, 2020

The test will still fail with name 'pd' is not defined and thereafter with AttributeError: 'TestAstype' object has no attribute 'assert_frame_equal'

you can run it locally

pytest pandas/tests/frame/methods/test_astype.py -k test_astype_empty_dtype_dict

can you also add assert result is not df

@choucavalier
Copy link
Contributor Author

Thanks, should work now. Locally it did.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tgy lgtm @jreback

@jreback jreback merged commit e37ff6e into pandas-dev:master Jun 25, 2020
@jreback
Copy link
Contributor

jreback commented Jun 25, 2020

thanks @tgy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.astype(SparseDtype(float)) on empty dataframe leads to "ValueError: No objects to concatenate"
4 participants