Skip to content

BUG: Create empty dataframe with string dtype fails #33651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kotamatsuoka
Copy link
Contributor

@kotamatsuoka kotamatsuoka commented Apr 19, 2020

@alimcmaster1 alimcmaster1 added Constructors Series/DataFrame/Index/pd.array Constructors Bug labels Apr 19, 2020
@alimcmaster1 alimcmaster1 added this to the 1.1 milestone Apr 19, 2020
@kotamatsuoka kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 466a9ba to 82dc446 Compare April 19, 2020 16:11
if dtype is None or np.issubdtype(dtype, np.flexible):
if is_dtype_equal(dtype, "string"):
# GH 33623
nan_dtype = dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be dtype.na_value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't figure out how to fix this from "this will be dtype.na_value".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [15]: pd.Int32Dtype.na_value                                                                                                                                                                                                                          
Out[15]: <NA>

nan_dtype = dtype.na_value

Copy link
Contributor Author

@kotamatsuoka kotamatsuoka Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change to nan_dtype = dtype.na_value, error occurs.

            if not isinstance(dtype, (np.dtype, type(np.dtype))):
>               dtype = dtype.dtype
E               AttributeError: 'NAType' object has no attribute 'dtype'

pandas/core/dtypes/cast.py:1545: AttributeError

So I updated it like this.

if (
            dtype is None
            or is_extension_array_dtype(dtype)
            or np.issubdtype(dtype, np.flexible)
        ):
    nan_dtype = object

@kotamatsuoka kotamatsuoka requested a review from jreback April 20, 2020 03:26
@simonjayhawkins simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 21, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also needs a whatsnew note, but in bug fixes Extension array section in 1.1)

if dtype is None or np.issubdtype(dtype, np.flexible):
if is_dtype_equal(dtype, "string"):
# GH 33623
nan_dtype = dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update this

@kotamatsuoka kotamatsuoka requested a review from jreback April 25, 2020 12:01
if dtype is None or np.issubdtype(dtype, np.flexible):
if is_dtype_equal(dtype, "string"):
# GH 33623
nan_dtype = dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [15]: pd.Int32Dtype.na_value                                                                                                                                                                                                                          
Out[15]: <NA>

nan_dtype = dtype.na_value

@kotamatsuoka kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 3de5485 to 333e36c Compare April 26, 2020 13:20
@kotamatsuoka kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 333e36c to ff26d95 Compare April 26, 2020 13:48
@kotamatsuoka kotamatsuoka requested a review from jreback April 27, 2020 13:03
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok code change and tests in the base extensions looks fine.

  • pls add a whatsnew note (bug fix in Conversion section in 1.1)
  • why are you xfailing? can you show what is failing,

@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests):


class TestConstructors(base.BaseConstructorsTests):
pass
@pytest.mark.xfail(reason="bad is-na for empty data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this xfailed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • coerce_to_array() in core/arrays/integer.py doesn't accept array(nan).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would fix this here. What needs to change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to allows values.ndim to be 0 in coerce_to_array().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead not pass a 0-dim array to coerce_to_array? It's not clear to me why we need a 0-d array in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added processing to convert np.nan to [].

values = [] if values is np.nan else values

@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests):


class TestConstructors(BaseInterval, base.BaseConstructorsTests):
pass
@pytest.mark.xfail(reason="bad is-na for empty data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this xfailed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • object is not supported for IntervalArray
  • na_value of IntervalArray is float, so AttributeError: 'float' object has no attribute 'dtype' in construct_1d_arraylike_from_scalar().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would ideally fixed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if nan_dtype is dtype (IntervalDtype), can create df.

if is_interval_dtype(dtype):
    nan_dtype = dtype

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you add a whatsnew for this?

Can you dig a bit deeper on why integer array & interval needed to be xfailed? If it's not too much additional work it'd be good to get all of these at once.

def test_construct_empty_dataframe(self, dtype):
# GH 33623
result = pd.DataFrame(columns=["a"], dtype=dtype)
expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)
expected = pd.DataFrame({"a": pd.array([], dtype=dtype})

This seems a bit safer way to get the expected result.

@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests):


class TestConstructors(base.BaseConstructorsTests):
pass
@pytest.mark.xfail(reason="bad is-na for empty data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would fix this here. What needs to change?

@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests):


class TestConstructors(BaseInterval, base.BaseConstructorsTests):
pass
@pytest.mark.xfail(reason="bad is-na for empty data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would ideally fixed here.

@kotamatsuoka kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch 2 times, most recently from 36204d3 to 74e6192 Compare April 29, 2020 06:19
@kotamatsuoka kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 74e6192 to 92d75b7 Compare April 29, 2020 07:02
@@ -184,6 +184,8 @@ def coerce_to_array(
-------
tuple of (values, mask)
"""
values = [] if values is np.nan else values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please look at the caller? This indicates that we're passing np.nan to a place where we shouldn't be (probably IntegerArray._from_sequence). That means there may be other ExtensionArrays facing the same issue. I'd much rather fix it at the source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means there may be other ExtensionArrays facing the same issue

Will we work this on other issues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on the size of the required changes to get this working.

I'm not comfortable merging this until the problem is better understood. We should not be passing np.nan to _from_sequence. We should be passing [].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making changes here may not be necessary once the changes to sanitize_array in #33846 are merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.DataFrame(columns=["a"], dtype="Int64").dtypes now works on master following #33846. have reverted this change.

Comment on lines 245 to 249
if (
dtype is None
or is_extension_array_dtype(dtype)
or np.issubdtype(dtype, np.flexible)
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing this fixes the interval case

Suggested change
if (
dtype is None
or is_extension_array_dtype(dtype)
or np.issubdtype(dtype, np.flexible)
):
if dtype is None or (
not is_extension_array_dtype(dtype)
and np.issubdtype(dtype, np.flexible)
):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion doesn't work...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've pushed this change, works on my machine. can you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @simonjayhawkins . works on my environment too.

@simonjayhawkins
Copy link
Member

Thanks @kotamatsuoka for working on this. can you add a whatsnew.

@kotamatsuoka
Copy link
Contributor Author

kotamatsuoka commented May 5, 2020

can you add a whatsnew.

@simonjayhawkins
I added a whatnew. Is it collect? This is first time.

@@ -755,7 +755,7 @@ ExtensionArray
- Fixed bug where :meth:`Series.value_counts` would raise on empty input of ``Int64`` dtype (:issue:`33317`)
- Fixed bug in :class:`Series` construction with EA dtype and index but no data or scalar data fails (:issue:`26469`)
- Fixed bug that caused :meth:`Series.__repr__()` to crash for extension types whose elements are multidimensional arrays (:issue:`33770`).

- Fixed bug where :meth:`init_dict` would raise on empty input (:issue:`27953` and :issue:`33623`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always make this user facing. init_dict is a private method. likely you want DataFrame(columns=.., dtype='string') would fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jreback . I updated. Please review.

@kotamatsuoka kotamatsuoka requested a review from jreback May 5, 2020 15:24
@jreback jreback merged commit cb7b294 into pandas-dev:master May 9, 2020
@jreback
Copy link
Contributor

jreback commented May 9, 2020

thanks @kotamatsuoka

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
5 participants