-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Create empty dataframe with string dtype fails #33651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Create empty dataframe with string dtype fails #33651
Conversation
466a9ba
to
82dc446
Compare
if dtype is None or np.issubdtype(dtype, np.flexible): | ||
if is_dtype_equal(dtype, "string"): | ||
# GH 33623 | ||
nan_dtype = dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will be dtype.na_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I can't figure out how to fix this from "this will be dtype.na_value".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [15]: pd.Int32Dtype.na_value
Out[15]: <NA>
nan_dtype = dtype.na_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change to nan_dtype = dtype.na_value
, error occurs.
if not isinstance(dtype, (np.dtype, type(np.dtype))):
> dtype = dtype.dtype
E AttributeError: 'NAType' object has no attribute 'dtype'
pandas/core/dtypes/cast.py:1545: AttributeError
So I updated it like this.
if (
dtype is None
or is_extension_array_dtype(dtype)
or np.issubdtype(dtype, np.flexible)
):
nan_dtype = object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs a whatsnew note, but in bug fixes Extension array section in 1.1)
if dtype is None or np.issubdtype(dtype, np.flexible): | ||
if is_dtype_equal(dtype, "string"): | ||
# GH 33623 | ||
nan_dtype = dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update this
if dtype is None or np.issubdtype(dtype, np.flexible): | ||
if is_dtype_equal(dtype, "string"): | ||
# GH 33623 | ||
nan_dtype = dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [15]: pd.Int32Dtype.na_value
Out[15]: <NA>
nan_dtype = dtype.na_value
3de5485
to
333e36c
Compare
333e36c
to
ff26d95
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok code change and tests in the base extensions looks fine.
- pls add a whatsnew note (bug fix in Conversion section in 1.1)
- why are you xfailing? can you show what is failing,
@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests): | |||
|
|||
|
|||
class TestConstructors(base.BaseConstructorsTests): | |||
pass | |||
@pytest.mark.xfail(reason="bad is-na for empty data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this xfailed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- coerce_to_array() in core/arrays/integer.py doesn't accept array(nan).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would fix this here. What needs to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to allows values.ndim to be 0 in coerce_to_array().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead not pass a 0-dim array to coerce_to_array
? It's not clear to me why we need a 0-d array in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added processing to convert np.nan to [].
values = [] if values is np.nan else values
@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests): | |||
|
|||
|
|||
class TestConstructors(BaseInterval, base.BaseConstructorsTests): | |||
pass | |||
@pytest.mark.xfail(reason="bad is-na for empty data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this xfailed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- object is not supported for IntervalArray
- na_value of IntervalArray is float, so
AttributeError: 'float' object has no attribute 'dtype'
in construct_1d_arraylike_from_scalar().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would ideally fixed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if nan_dtype is dtype (IntervalDtype), can create df.
if is_interval_dtype(dtype):
nan_dtype = dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Can you add a whatsnew for this?
Can you dig a bit deeper on why integer array & interval needed to be xfailed? If it's not too much additional work it'd be good to get all of these at once.
def test_construct_empty_dataframe(self, dtype): | ||
# GH 33623 | ||
result = pd.DataFrame(columns=["a"], dtype=dtype) | ||
expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype) | |
expected = pd.DataFrame({"a": pd.array([], dtype=dtype}) |
This seems a bit safer way to get the expected result.
@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests): | |||
|
|||
|
|||
class TestConstructors(base.BaseConstructorsTests): | |||
pass | |||
@pytest.mark.xfail(reason="bad is-na for empty data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would fix this here. What needs to change?
@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests): | |||
|
|||
|
|||
class TestConstructors(BaseInterval, base.BaseConstructorsTests): | |||
pass | |||
@pytest.mark.xfail(reason="bad is-na for empty data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would ideally fixed here.
36204d3
to
74e6192
Compare
74e6192
to
92d75b7
Compare
pandas/core/arrays/integer.py
Outdated
@@ -184,6 +184,8 @@ def coerce_to_array( | |||
------- | |||
tuple of (values, mask) | |||
""" | |||
values = [] if values is np.nan else values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please look at the caller? This indicates that we're passing np.nan
to a place where we shouldn't be (probably IntegerArray._from_sequence
). That means there may be other ExtensionArrays facing the same issue. I'd much rather fix it at the source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means there may be other ExtensionArrays facing the same issue
Will we work this on other issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on the size of the required changes to get this working.
I'm not comfortable merging this until the problem is better understood. We should not be passing np.nan
to _from_sequence
. We should be passing []
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making changes here may not be necessary once the changes to sanitize_array in #33846 are merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.DataFrame(columns=["a"], dtype="Int64").dtypes
now works on master following #33846. have reverted this change.
if ( | ||
dtype is None | ||
or is_extension_array_dtype(dtype) | ||
or np.issubdtype(dtype, np.flexible) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think changing this fixes the interval case
if ( | |
dtype is None | |
or is_extension_array_dtype(dtype) | |
or np.issubdtype(dtype, np.flexible) | |
): | |
if dtype is None or ( | |
not is_extension_array_dtype(dtype) | |
and np.issubdtype(dtype, np.flexible) | |
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion doesn't work...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've pushed this change, works on my machine. can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @simonjayhawkins . works on my environment too.
Thanks @kotamatsuoka for working on this. can you add a whatsnew. |
@simonjayhawkins |
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -755,7 +755,7 @@ ExtensionArray | |||
- Fixed bug where :meth:`Series.value_counts` would raise on empty input of ``Int64`` dtype (:issue:`33317`) | |||
- Fixed bug in :class:`Series` construction with EA dtype and index but no data or scalar data fails (:issue:`26469`) | |||
- Fixed bug that caused :meth:`Series.__repr__()` to crash for extension types whose elements are multidimensional arrays (:issue:`33770`). | |||
|
|||
- Fixed bug where :meth:`init_dict` would raise on empty input (:issue:`27953` and :issue:`33623`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always make this user facing. init_dict is a private method. likely you want DataFrame(columns=.., dtype='string')
would fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jreback . I updated. Please review.
thanks @kotamatsuoka |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff