BUG: Create empty dataframe with string dtype fails #33651

kotamatsuoka · 2020-04-19T12:10:29Z

- closes BUG: Create empty dataframe with string dtype fails "data type not understood" #33623
- closes DataFrame constructor fails with extension dtype and columns #27953
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

pandas/core/internals/construction.py

jreback · 2020-04-20T01:19:34Z

pandas/core/internals/construction.py

-            if dtype is None or np.issubdtype(dtype, np.flexible):
+            if is_dtype_equal(dtype, "string"):
+                # GH 33623
+                nan_dtype = dtype


this will be dtype.na_value

can you update this

Sorry, I can't figure out how to fix this from "this will be dtype.na_value".

In [15]: pd.Int32Dtype.na_value
Out[15]: <NA>

nan_dtype = dtype.na_value

If I change this to nan_dtype = dtype.na_value, an error occurs:

    if not isinstance(dtype, (np.dtype, type(np.dtype))):
>       dtype = dtype.dtype
E       AttributeError: 'NAType' object has no attribute 'dtype'

pandas/core/dtypes/cast.py:1545: AttributeError

So I updated it like this.

            if (
                dtype is None
                or is_extension_array_dtype(dtype)
                or np.issubdtype(dtype, np.flexible)
            ):
                nan_dtype = object
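
The traceback above comes down to na_value being a missing-value scalar rather than a dtype. A minimal sketch of that distinction, using pd.Int64Dtype as an illustrative choice (the thread itself shows pd.Int32Dtype); the variable names are hypothetical:

```python
import pandas as pd

dtype = pd.Int64Dtype()

# na_value is the missing-value scalar for the dtype, not a dtype itself.
na_scalar = dtype.na_value
na_is_pd_na = na_scalar is pd.NA

# The scalar has no .dtype attribute, so code that expects a dtype-like
# object (as in the traceback quoted above) cannot accept it.
na_has_dtype_attr = hasattr(na_scalar, "dtype")

print(na_is_pd_na, na_has_dtype_attr)
```

This is why assigning dtype.na_value to a variable that is later treated as a dtype fails.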

pandas/tests/frame/test_constructors.py

jreback

also needs a whatsnew note (in the bug fixes, ExtensionArray section for 1.1)

pandas/tests/extension/test_common.py

jreback · 2020-04-23T17:55:55Z

pandas/core/internals/construction.py

-            if dtype is None or np.issubdtype(dtype, np.flexible):
+            if is_dtype_equal(dtype, "string"):
+                # GH 33623
+                nan_dtype = dtype


can you update this

jreback · 2020-04-25T21:59:52Z

pandas/core/internals/construction.py

-            if dtype is None or np.issubdtype(dtype, np.flexible):
+            if is_dtype_equal(dtype, "string"):
+                # GH 33623
+                nan_dtype = dtype


In [15]: pd.Int32Dtype.na_value
Out[15]: <NA>

nan_dtype = dtype.na_value

pandas/tests/extension/arrow/test_bool.py

pandas/tests/extension/base/constructors.py

jreback

ok code change and tests in the base extensions looks fine.

pls add a whatsnew note (bug fix in Conversion section in 1.1)
why are you xfailing? can you show what is failing?

jreback · 2020-04-27T20:37:28Z

pandas/tests/extension/test_integer.py

@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests):


 class TestConstructors(base.BaseConstructorsTests):
-    pass
+    @pytest.mark.xfail(reason="bad is-na for empty data")


why is this xfailed?

coerce_to_array() in core/arrays/integer.py doesn't accept array(nan).

Ideally we would fix this here. What needs to change?

We would need to allow values.ndim to be 0 in coerce_to_array().

Can we instead not pass a 0-dim array to coerce_to_array? It's not clear to me why we need a 0-d array in the first place.

I added processing to convert np.nan to [].

values = [] if values is np.nan else values
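
The 0-dim array mentioned in this thread can be reproduced with plain NumPy. The sketch below is only an illustration of the shapes involved, not the pandas code path; the variable names are hypothetical:

```python
import numpy as np

# Converting the scalar np.nan yields a 0-dimensional array.
from_scalar = np.asarray(np.nan)

# Converting an empty list yields a 1-dimensional array of length 0.
from_empty_list = np.asarray([], dtype="float64")

print(from_scalar.ndim, from_scalar.shape)
print(from_empty_list.ndim, from_empty_list.shape)

# The workaround quoted above relies on an identity check against np.nan.
values = np.nan
values = [] if values is np.nan else values
print(values)
```

An array constructor that expects 1-d input can handle the second form but not the first, which is why the reviewers ask where the scalar is coming from.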

jreback · 2020-04-27T20:37:36Z

pandas/tests/extension/test_interval.py

@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests):


 class TestConstructors(BaseInterval, base.BaseConstructorsTests):
-    pass
+    @pytest.mark.xfail(reason="bad is-na for empty data")


why is this xfailed?

object is not supported for IntervalArray.

The na_value of IntervalArray is a float, so construct_1d_arraylike_from_scalar() raises AttributeError: 'float' object has no attribute 'dtype'.

I think this would ideally be fixed here.

If nan_dtype is the dtype itself (IntervalDtype), the DataFrame can be created.

            if is_interval_dtype(dtype):
                nan_dtype = dtype
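
A short sketch of the interval case, assuming a pandas version that includes this fix. The subtype "int64" and the variable names are illustrative choices, not taken from the thread:

```python
import pandas as pd

interval_dtype = pd.IntervalDtype("int64")

# The missing-value scalar for interval dtype is a float, not a dtype.
na_scalar = interval_dtype.na_value
print(type(na_scalar))

# With the fix in place, an empty frame can be built with this dtype.
empty = pd.DataFrame(columns=["a"], dtype=interval_dtype)
print(empty.dtypes)
```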

pandas/tests/extension/arrow/test_bool.py

TomAugspurger

Thanks. Can you add a whatsnew for this?

Can you dig a bit deeper on why integer array & interval needed to be xfailed? If it's not too much additional work it'd be good to get all of these at once.

TomAugspurger · 2020-04-28T13:44:02Z

pandas/tests/extension/base/constructors.py

+    def test_construct_empty_dataframe(self, dtype):
+        # GH 33623
+        result = pd.DataFrame(columns=["a"], dtype=dtype)
+        expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)


Suggested change

-        expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)
+        expected = pd.DataFrame({"a": pd.array([], dtype=dtype)})

This seems a bit safer way to get the expected result.
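
As a rough standalone illustration of the suggestion, the sketch below builds both frames outside the test class and compares only column dtype and length. It assumes a pandas version with this fix; the dtype names in the loop and the helper name are illustrative. The index is deliberately left out of the comparison, since the two constructions need not produce the same index type.

```python
import pandas as pd


def build_empty_frames(dtype):
    """Hypothetical helper: the frame under test and the expected frame."""
    result = pd.DataFrame(columns=["a"], dtype=dtype)
    # Building the column from an explicitly typed empty array does not
    # depend on the DataFrame constructor path being tested.
    expected = pd.DataFrame({"a": pd.array([], dtype=dtype)})
    return result, expected


for name in ["string", "Int64"]:
    result, expected = build_empty_frames(name)
    print(name, result["a"].dtype == expected["a"].dtype, len(result))
```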

TomAugspurger · 2020-04-28T13:44:04Z

pandas/tests/extension/test_integer.py

@@ -186,7 +186,9 @@ class TestInterface(base.BaseInterfaceTests):


 class TestConstructors(base.BaseConstructorsTests):
-    pass
+    @pytest.mark.xfail(reason="bad is-na for empty data")


Ideally we would fix this here. What needs to change?

TomAugspurger · 2020-04-28T13:44:32Z

pandas/tests/extension/test_interval.py

@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests):


 class TestConstructors(BaseInterval, base.BaseConstructorsTests):
-    pass
+    @pytest.mark.xfail(reason="bad is-na for empty data")


I think this would ideally be fixed here.

pandas/tests/extension/arrow/test_bool.py

TomAugspurger · 2020-04-29T11:50:08Z

pandas/core/arrays/integer.py

@@ -184,6 +184,8 @@ def coerce_to_array(
    -------
    tuple of (values, mask)
    """
+    values = [] if values is np.nan else values


Can you please look at the caller? This indicates that we're passing np.nan to a place where we shouldn't be (probably IntegerArray._from_sequence). That means there may be other ExtensionArrays facing the same issue. I'd much rather fix it at the source.

That means there may be other ExtensionArrays facing the same issue

Will we work on this in other issues?

Depends on the size of the required changes to get this working.

I'm not comfortable merging this until the problem is better understood. We should not be passing np.nan to _from_sequence. We should be passing [].

Making changes here may not be necessary once the changes to sanitize_array in #33846 are merged.

pd.DataFrame(columns=["a"], dtype="Int64").dtypes now works on master following #33846. I have reverted this change.

simonjayhawkins · 2020-05-01T11:00:19Z

pandas/core/internals/construction.py

+            if (
+                dtype is None
+                or is_extension_array_dtype(dtype)
+                or np.issubdtype(dtype, np.flexible)
+            ):


I think changing this fixes the interval case

Suggested change

-            if (
-                dtype is None
-                or is_extension_array_dtype(dtype)
-                or np.issubdtype(dtype, np.flexible)
-            ):
+            if dtype is None or (
+                not is_extension_array_dtype(dtype)
+                and np.issubdtype(dtype, np.flexible)
+            ):

This suggestion doesn't work...

I've pushed this change, and it works on my machine. Can you elaborate?

Thanks @simonjayhawkins. It works in my environment too.
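
To see how the suggested condition behaves for different dtypes, here is a small sketch. The helper name and the sample dtypes are hypothetical; the point is that the extension-dtype check short-circuits before np.issubdtype is ever called with a pandas extension dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def uses_object_placeholder(dtype):
    """Hypothetical helper mirroring the suggested condition."""
    return dtype is None or (
        not is_extension_array_dtype(dtype) and np.issubdtype(dtype, np.flexible)
    )


checks = {
    "none": uses_object_placeholder(None),
    "numpy_str": uses_object_placeholder(np.dtype("U10")),
    "numpy_float": uses_object_placeholder(np.dtype("float64")),
    "string_ea": uses_object_placeholder(pd.StringDtype()),
    "int64_ea": uses_object_placeholder(pd.Int64Dtype()),
}
print(checks)
```

Under this condition, extension dtypes skip the branch entirely, unlike the earlier version of the change, which sent every extension dtype down the nan_dtype = object path.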

simonjayhawkins · 2020-05-03T15:54:07Z

Thanks @kotamatsuoka for working on this. Can you add a whatsnew?

kotamatsuoka · 2020-05-05T13:50:34Z

can you add a whatsnew.

@simonjayhawkins
I added a whatsnew. Is it correct? This is my first time.

jreback · 2020-05-05T14:16:43Z

doc/source/whatsnew/v1.1.0.rst

@@ -755,7 +755,7 @@ ExtensionArray
 - Fixed bug where :meth:`Series.value_counts` would raise on empty input of ``Int64`` dtype (:issue:`33317`)
 - Fixed bug in :class:`Series` construction with EA dtype and index but no data or scalar data fails (:issue:`26469`)
 - Fixed bug that caused :meth:`Series.__repr__()` to crash for extension types whose elements are multidimensional arrays (:issue:`33770`).
-
+- Fixed bug where :meth:`init_dict` would raise on empty input (:issue:`27953` and :issue:`33623`)


Always make this user-facing; init_dict is a private method. You likely want something like: DataFrame(columns=.., dtype='string') would fail.

Thanks @jreback. I updated it. Please review.

doc/source/whatsnew/v1.1.0.rst

jreback · 2020-05-09T19:56:19Z

thanks @kotamatsuoka

alimcmaster1 added Constructors Series/DataFrame/Index/pd.array Constructors Bug labels Apr 19, 2020

alimcmaster1 added this to the 1.1 milestone Apr 19, 2020

BUG: Create empty dataframe with string dtype fails

82dc446

kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 466a9ba to 82dc446 Compare April 19, 2020 16:11

jreback requested changes Apr 20, 2020

Changes according to comments

d1da8c8

kotamatsuoka requested a review from jreback April 20, 2020 03:26

simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 21, 2020

jreback requested changes Apr 23, 2020

Add test of empty dataframe in ExtensionDtype

66eda6a

kotamatsuoka requested a review from jreback April 25, 2020 12:01

jreback requested changes Apr 25, 2020

kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 3de5485 to 333e36c Compare April 26, 2020 13:20

Remove column fixtures

ff26d95

kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 333e36c to ff26d95 Compare April 26, 2020 13:48

kotamatsuoka requested a review from jreback April 27, 2020 13:03

jreback requested changes Apr 27, 2020

TomAugspurger reviewed Apr 28, 2020

kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch 2 times, most recently from 36204d3 to 74e6192 Compare April 29, 2020 06:19

Remove xfail in test_integer

92d75b7

kotamatsuoka force-pushed the empty-dataframe-with-string-dtype branch from 74e6192 to 92d75b7 Compare April 29, 2020 07:02

TomAugspurger reviewed Apr 29, 2020

kotamatsuoka requested review from jreback and TomAugspurger April 30, 2020 13:25

simonjayhawkins reviewed May 1, 2020

simonjayhawkins added 2 commits May 3, 2020 16:15

Merge remote-tracking branch 'upstream/master' into empty-dataframe-w…

0363ef0

…ith-string-dtype

revert changes to IntegerArray, fix IntervalArray, update test per co…

5fcfed6

…mment

add a whatsnew.

5ad5f8b

jreback requested changes May 5, 2020

kotamatsuoka added 2 commits May 5, 2020 23:52

update the whatsnew

9e15301

Merge branch 'master' into empty-dataframe-with-string-dtype

e8e1ed1

kotamatsuoka requested a review from jreback May 5, 2020 15:24

jreback approved these changes May 9, 2020

jreback merged commit cb7b294 into pandas-dev:master May 9, 2020

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

BUG: Create empty dataframe with string dtype fails (pandas-dev#33651)

0f2eca2
