
BUG: conversion of empty DataFrame to SparseDtype (#33113) #33118

Merged
merged 17 commits into from
Jun 25, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -1068,6 +1068,7 @@ Sparse
- Bug in :meth:`Series.sum` with ``SparseArray`` raises ``TypeError`` (:issue:`25777`)
- Bug where :class:`DataFrame` containing :class:`SparseArray` filled with ``NaN`` when indexed by a list-like (:issue:`27781`, :issue:`29563`)
- The repr of :class:`SparseDtype` now includes the repr of its ``fill_value`` attribute. Previously it used ``fill_value``'s string representation (:issue:`34352`)
- Fixed bug where :class:`DataFrame` or :class:`Series` could not be cast to :class:`SparseDtype` when empty (:issue:`33113`)
Member

IIUC this fix is only for DataFrame. Series.astype works fine on master?

Contributor Author

Good point. I tried and Series.astype works fine on master. Will fix the entry.
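For anyone following along, the two cases being compared can be reproduced roughly as follows. This is a minimal sketch, assuming a pandas version that already includes this fix; the variable names are illustrative only.

```python
import pandas as pd

sparse = pd.SparseDtype("float64")

# Empty Series: reported in this thread as already working on master.
empty_series = pd.Series(dtype="float64").astype(sparse)

# Empty DataFrame: the case from issue #33113 that this PR targets.
empty_frame = pd.DataFrame().astype(sparse)

print(empty_series.dtype)
print(empty_frame.shape)
```

On a pandas version without the fix, the DataFrame line is the one expected to raise.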


ExtensionArray
^^^^^^^^^^^^^^
3 changes: 3 additions & 0 deletions pandas/core/generic.py
@@ -5548,6 +5548,9 @@ def astype(
            new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
            return self._constructor(new_data).__finalize__(self, method="astype")

        if not results:
Member

Maybe this should be in the `elif is_extension_array_dtype(dtype) and self.ndim > 1:` block.

Member

Although then the issue mentioned in #33113 (comment) would not be fixed. Maybe keep this where it is and add a test for #33113 (comment).

Contributor Author

Let's try!

Contributor Author

Sorry, I did not see your second comment. OK, I'll do that. Anyhow, if we reach the "concat" part of the code and results is empty, then we have a problem.

Contributor Author

Although, one could argue that pd.concat of an empty list could return an empty frame or series instead of raising an exception.
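To make the current behaviour concrete, here is a small sketch of what `pd.concat` does with an empty list. The error message matches the traceback quoted later in this thread; the `message` variable is only there for illustration.

```python
import pandas as pd

try:
    # An empty list gives concat nothing to infer a result from.
    pd.concat([], axis=1)
except ValueError as exc:
    message = str(exc)
else:
    message = None

print(message)
```

This is why the `if not results:` guard has to run before the `pd.concat` call rather than after it.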

Contributor Author

@simonjayhawkins Actually, I'm starting to think the other issue was not an issue but an error on the author's part. See #33113 (comment). Awaiting your answer to proceed.

Member

I think the other issue is relevant.

on master and in this PR (latest commit)...

>>> df = pd.DataFrame([[1, 2], [3, 4]])
>>> df
   0  1
0  1  2
1  3  4
>>>
>>> res = df.astype(dict())
>>> res
   0  1
0  1  2
1  3  4
>>>
>>> res is df
False
>>>
>>> df = pd.DataFrame([dict(), dict()])
>>>
>>> df.astype(dict())
Traceback (most recent call last):
   ...
ValueError: No objects to concatenate
>>>

this PR before moving #33118 (comment) ...

>>> df = pd.DataFrame([dict(), dict()])
>>>
>>> df.astype(dict())
Empty DataFrame
Columns: []
Index: [0, 1]
>>>
>>> res is df
False

So it appears astype with an empty dict is a valid op that returns a copy. This should therefore also work on an empty DataFrame (even if created with empty dict).

These sorts of circumstances can arise in data pipelines, and having consistency reduces the complexity of downstream code.

So I think we need to move the changes back and add a test for this.
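The guard being discussed can be sketched outside pandas internals like this. `astype_columnwise` is a hypothetical helper written for illustration, not the code in `generic.py`; it only mirrors the cast-per-column, then concat shape of the dict path under that assumption.

```python
import pandas as pd


def astype_columnwise(df, dtype_map):
    """Hypothetical helper: cast each column, then concat the results."""
    results = [
        col.astype(dtype_map[name]) if name in dtype_map else col.copy()
        for name, col in df.items()
    ]

    if not results:
        # No columns means nothing to concatenate, so return a copy
        # instead of letting pd.concat raise.
        return df.copy()

    result = pd.concat(results, axis=1)
    result.columns = df.columns
    return result


empty = pd.DataFrame([dict(), dict()])
res = astype_columnwise(empty, {})

df = pd.DataFrame([[1, 2], [3, 4]])
out = astype_columnwise(df, {0: "float64"})
```

With the guard in place, the empty frame and the non-empty frame go through the same call and both come back as new objects.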

Contributor Author (@choucavalier, Jun 22, 2020)

OK, I will revert the changes and add a test. I still find it very strange to call .astype(T) where T is an instance of an object and not a type (except, of course, for string-defined dtypes such as .astype("float64")).

Member

DataFrame.astype can also accept a dict of column name -> data type as the dtype argument, https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.astype.html
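A short sketch of the dict form, for reference. The column names `a` and `b` are made up for the example; columns not named in the dict keep their dtype.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Only column "a" is cast; "b" is left untouched.
res = df.astype({"a": "float64"})

print(res.dtypes)
```

An empty dict is then just the degenerate case where no column is cast.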

Contributor Author

Ahhh that makes a lot of sense now.

            return self.copy()

        # GH 19920: retain column metadata after concat
        result = pd.concat(results, axis=1, copy=False)
        result.columns = self.columns
6 changes: 6 additions & 0 deletions pandas/tests/extension/base/casting.py
@@ -50,3 +50,9 @@ def test_to_numpy(self, data):

        result = pd.Series(data).to_numpy()
        self.assert_equal(result, expected)

    def test_astype_empty_dataframe(self, dtype):
        # https://github.com/pandas-dev/pandas/issues/33113
        df = pd.DataFrame()
        result = df.astype(dtype)
        self.assert_frame_equal(result, df)
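Outside the extension test suite, the same check can be run as a standalone script. This is a rough sketch: the list of dtypes is an assumption chosen for illustration (the real test receives `dtype` from a fixture), and it assumes a pandas version that includes this fix.

```python
import pandas as pd

# Illustrative stand-ins for the `dtype` fixture used by the base test.
dtypes = ["Int64", "category", pd.SparseDtype("float64")]

checked = 0
for dtype in dtypes:
    df = pd.DataFrame()
    result = df.astype(dtype)
    # The cast of an empty frame should give back an equal, empty frame.
    pd.testing.assert_frame_equal(result, df)
    checked += 1

print(checked)
```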