numpydev ragged array dtype warning #31203

TomAugspurger · 2020-01-22T12:41:36Z

There are still two failures in the Categorical constructor I'm looking into. Not sure what's best yet.

In [2]: pd.Categorical(['a', ('a', 'b')])
/Users/taugspurger/sandbox/pandas/pandas/core/dtypes/cast.py:1066: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  v = np.array(v, copy=False)
Out[2]:
[a, (a, b)]
Categories (2, object): [(a, b), a]

I think that perhaps the user should see this warning, but have an option to pass through a dtype to silence it... Not sure yet.

Tagged for backport to keep the CI passing on that branch.

TomAugspurger · 2020-01-22T12:46:09Z

I think right now I'd prefer the user take care of that categorical issue by passing in an ndarray. I'm going to update the tests to reflect that, but am certainly open to suggestions for how we can handle it within pandas.

Edit: 78514cf

So users will see the warning if they just pass Categorical(['a', ('a',)]), and the warning won't be especially clear since pandas is the one calling asarray, but I don't see a clean way to get the dtype argument through there.
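As a minimal sketch of the workaround described above (the caller builds the ndarray before handing it to pandas), the snippet below fills a 1-D object array one element at a time, which sidesteps NumPy's nested-sequence shape inference entirely. The helper name `to_object_array` is hypothetical; it is not part of pandas or NumPy.

```python
import numpy as np


def to_object_array(values):
    """Hypothetical helper: 1-D object ndarray with one slot per input item."""
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        # Integer indexing stores the item as-is, even when it is a tuple.
        out[i] = value
    return out


arr = to_object_array(['a', ('a', 'b')])
print(arr.dtype, arr.shape)
```

The resulting `arr` is what a user would pass to `pd.Categorical` in place of the raw list.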

AlexKirko · 2020-01-22T12:53:34Z

@TomAugspurger
From the data science standpoint, there is no meaning to a categorical variable having some kind of specific type. You wouldn't treat it any differently whether the values it can take are numbers, strings or tuples of DataFrames.
We could consider implementing a dtype=object as a default, either in this PR or in the future.
I'm not familiar with the constructor implementation though, so take this with a grain of salt.

TomAugspurger · 2020-01-22T12:55:20Z

> We could consider implementing a dtype=object as a default, either in this PR or in the future.

We don't want that. You can have a categorical holding native types like integers, timestamps, etc. We want the .categories to continue to be backed by the appropriate index for that dtype.

In [4]: type(pd.Categorical([1, 2]).categories)
Out[4]: pandas.core.indexes.numeric.Int64Index
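A small sketch of the point above, assuming a reasonably recent pandas: the categories of a Categorical keep the dtype of the values they were built from, so forcing `dtype=object` everywhere would lose that. The exact index class name varies across pandas versions (the session above shows `Int64Index`), so the sketch only inspects the dtype and, for timestamps, the `DatetimeIndex` class.

```python
import pandas as pd

# Integer values: the categories keep an integer dtype, not object.
int_cat = pd.Categorical([1, 2])
print(int_cat.categories.dtype)

# Timestamps: the categories are backed by a DatetimeIndex.
ts_cat = pd.Categorical(pd.to_datetime(["2020-01-22", "2020-01-23"]))
print(type(ts_cat.categories).__name__)
```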

AlexKirko · 2020-01-22T13:05:39Z

Seems fair. Especially as the warning is thrown in a pretty obscure use case, and changing the default would mess with all cases.

jreback · 2020-01-22T13:39:32Z

this looks good @TomAugspurger

In [2]: pd.Categorical(['a', ('a', 'b')])
/Users/taugspurger/sandbox/pandas/pandas/core/dtypes/cast.py:1066: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  v = np.array(v, copy=False)
Out[2]:
[a, (a, b)]
Categories (2, object): [(a, b), a]

agreed that this should just pass through the warning (or maybe we want our own warning, since having non-scalar categories is odd); yes, you can do it, but you should have to be explicit about it

TomAugspurger · 2020-01-22T14:07:14Z

All green. Merging in ~1 hour to get CI passing again. I'll be able to do followups as needed.

TomAugspurger · 2020-01-22T14:08:11Z

pandas/core/indexes/multi.py

@@ -2058,7 +2058,7 @@ def drop(self, codes, level=None, errors="raise"):

        if not isinstance(codes, (np.ndarray, Index)):
            try:
-                codes = com.index_labels_to_array(codes)
+                codes = com.index_labels_to_array(codes, dtype=object)


FYI, at least in our tests, this isn't changing behavior. I temporarily added an assert com.index_labels_to_array(codes, dtype=object).dtype == com.index_labels_to_array(codes).dtype, which held, so we were always inferring object dtype here (in our tests).

simonjayhawkins · 2020-01-22T15:02:29Z

@TomAugspurger This needs a backport?

TomAugspurger · 2020-01-22T15:09:02Z

Yep. Not sure why the bot hasn't picked it up yet.

TomAugspurger · 2020-01-22T15:10:08Z

@meeseeksdev backport to 1.0.x

jbrockmendel · 2020-01-22T16:00:14Z

pandas/core/strings.py

@@ -79,7 +79,7 @@ def cat_core(list_of_columns: List, sep: str):
        return np.sum(arr_of_cols, axis=0)
    list_with_sep = [sep] * (2 * len(list_of_columns) - 1)
    list_with_sep[::2] = list_of_columns
-    arr_with_sep = np.asarray(list_with_sep)
+    arr_with_sep = np.asarray(list_with_sep, dtype=object)


darn, i thought i had already fixed this one
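To illustrate why the explicit `dtype=object` matters in the diff above, here is a simplified sketch of the same interleave-and-sum pattern. The function name `join_columns` and the sample columns are hypothetical; this is not the pandas implementation, only the part of it visible in the diff.

```python
import numpy as np


def join_columns(list_of_columns, sep):
    """Hypothetical, simplified version of the pattern shown in the diff."""
    # Interleave the separator between columns: [col0, sep, col1, sep, ...]
    list_with_sep = [sep] * (2 * len(list_of_columns) - 1)
    list_with_sep[::2] = list_of_columns
    # Arrays mixed with a scalar string form a ragged sequence, so the
    # object dtype is requested explicitly instead of being inferred.
    arr_with_sep = np.asarray(list_with_sep, dtype=object)
    # Summing along axis 0 adds the entries left to right, which
    # concatenates the strings element-wise.
    return np.sum(arr_with_sep, axis=0)


first = np.array(["a", "b"], dtype=object)
second = np.array(["x", "y"], dtype=object)
joined = join_columns([first, second], ",")
print(joined)
```

Here `joined` holds `'a,x'` and `'b,y'`. Without `dtype=object`, the `np.asarray` call is the one that triggers the ragged-sequence deprecation warning discussed in this PR.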

jbrockmendel · 2020-01-22T16:04:13Z

pandas/tests/extension/base/getitem.py

-        array = data_missing._from_sequence([na, fill_value, na])
+        array = data_missing._from_sequence(
+            [na, fill_value, na], dtype=data_missing.dtype
+        )


this sort of changes the interface. do we want authors to handle this on their own?

TomAugspurger · 2020-01-22

How does it change the interface?

We are reducing coverage of _from_sequence inferring the dtype from an untyped list. We could restore that if desired (and probably skip for problematic arrays).

jbrockmendel · 2020-01-22

> We are reducing coverage of _from_sequence inferring the dtype from an untyped list.

Yah, i guess that is a better description than "changes the interface"

mattip · 2020-01-22T17:48:54Z

Sorry for the breakage, and thanks for the quick fix. I thought I had tested this before re-merging the NumPy PR. Please ping me if I can help with any further issues.

TomAugspurger added 4 commits January 22, 2020 06:24

b65852f  strings
d204b83  from_sequence
e4c7289  jsonarray
619d2ce  mi.drop

TomAugspurger added this to the 1.0.0 milestone Jan 22, 2020

TomAugspurger added the Compat pandas objects compatability with Numpy or Python functions label Jan 22, 2020

78514cf  categorical

TomAugspurger merged commit 6752833 into pandas-dev:master Jan 22, 2020

TomAugspurger deleted the 31201-numpydev branch January 22, 2020 14:47

simonjayhawkins added the Still Needs Manual Backport label Jan 22, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 22, 2020

Backport PR pandas-dev#31203: numpydev ragged array dtype warning

dfc19c2

meeseeksmachine mentioned this pull request Jan 22, 2020

Backport PR #31203 on branch 1.0.x (numpydev ragged array dtype warning) #31209

Merged

simonjayhawkins removed the Still Needs Manual Backport label Jan 22, 2020

jbrockmendel reviewed Jan 22, 2020

TomAugspurger added a commit that referenced this pull request Jan 22, 2020

Backport PR #31203: numpydev ragged array dtype warning (#31209)

c103002

Co-authored-by: Tom Augspurger <[email protected]>
