Skip to content

numpydev ragged array dtype warning #31203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 22, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jan 22, 2020

Closes #31201

There are still two failures in the Categorical constructor I'm looking into. Not sure what's best yet.

In [2]: pd.Categorical(['a', ('a', 'b')])
/Users/taugspurger/sandbox/pandas/pandas/core/dtypes/cast.py:1066: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  v = np.array(v, copy=False)
Out[2]:
[a, (a, b)]
Categories (2, object): [(a, b), a]

I think that perhaps the user should see this warning, but have an option to pass through a dtype to silence it... Not sure yet.

Tagged for backport to keep the CI passing on that branch.

@TomAugspurger TomAugspurger added this to the 1.0.0 milestone Jan 22, 2020
@TomAugspurger TomAugspurger added the Compat pandas objects compatability with Numpy or Python functions label Jan 22, 2020
@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 22, 2020

I think right now I'd prefer the user take care of that categorical issue by passing in an ndarray. I'm going to update the tests to reflect that, but am certainly open to suggestions for how we can handle it within pandas.

Edit: 78514cf

So users will see the warning if they just pass Categorical(['a', ('a',)]), and the warning won't be especially clear since pandas is the one calling asarray, but I don't see a clean way to get the dtype argument through there.

@AlexKirko
Copy link
Member

AlexKirko commented Jan 22, 2020

@TomAugspurger
From the data science standpoint, there is no meaning to a categorical variable having some kind of specific type. You wouldn't treat it any differently whether the values it can take are numbers, strings or tuples of DataFrames.
We could consider implementing a dtype=object as a default, either in this PR or in the future.
I'm not familiar with the constructor implementation though, so take this with a grain of salt.

@TomAugspurger
Copy link
Contributor Author

We could consider implementing a dtype=object as a default, either in this PR or in the future.

We don't want that. You can have a categorical holding native types like integers, timestamps, etc. We want the .categories to continue to be backed by the appropriate index for that dtype

In [4]: type(pd.Categorical([1, 2]).categories)
Out[4]: pandas.core.indexes.numeric.Int64Index

@AlexKirko
Copy link
Member

Seems fair. Especially as the warning is thrown in a pretty obscure use case, and changing the default would mess with all cases.

@jreback
Copy link
Contributor

jreback commented Jan 22, 2020

this looks good @TomAugspurger

In [2]: pd.Categorical(['a', ('a', 'b')])
/Users/taugspurger/sandbox/pandas/pandas/core/dtypes/cast.py:1066: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  v = np.array(v, copy=False)
Out[2]:
[a, (a, b)]
Categories (2, object): [(a, b), a]

agreed that this should just pass thru the warning (or maybe we want to have a warning; to having non-scalar based categories is odd); yes you can do it, but you should have to be specific about this

@TomAugspurger
Copy link
Contributor Author

All green. Merging in ~1 hour to get CI passing again. I'll be able to do followups as needed.

@@ -2058,7 +2058,7 @@ def drop(self, codes, level=None, errors="raise"):

if not isinstance(codes, (np.ndarray, Index)):
try:
codes = com.index_labels_to_array(codes)
codes = com.index_labels_to_array(codes, dtype=object)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, at least in our tests, this isn't changing behavior. I added an assert com.index_labels_to_array(codes, dtype=object).dtype == com.index_labels_to_array(codes) temporarily, so we were always inferring object dtype here (in our tests).

@TomAugspurger TomAugspurger merged commit 6752833 into pandas-dev:master Jan 22, 2020
@TomAugspurger TomAugspurger deleted the 31201-numpydev branch January 22, 2020 14:47
@simonjayhawkins
Copy link
Member

@TomAugspurger This needs a backport?

@TomAugspurger
Copy link
Contributor Author

Yep. Not sure why the bot hasn't picked it up yet.

@TomAugspurger
Copy link
Contributor Author

@meeseeksdev backport to 1.0.x

@@ -79,7 +79,7 @@ def cat_core(list_of_columns: List, sep: str):
return np.sum(arr_of_cols, axis=0)
list_with_sep = [sep] * (2 * len(list_of_columns) - 1)
list_with_sep[::2] = list_of_columns
arr_with_sep = np.asarray(list_with_sep)
arr_with_sep = np.asarray(list_with_sep, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

darn, i thought i had already fixed this one

array = data_missing._from_sequence([na, fill_value, na])
array = data_missing._from_sequence(
[na, fill_value, na], dtype=data_missing.dtype
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sort of changes the interface. do we want authors to handle this on their own?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it change the interface?

We are reducing coverage of _from_sequence inferring the dtype from an untyped list. We could restore that if desired (and probably skip for problematic arrays).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are reducing coverage of _from_sequence inferring the dtype from an untyped list.

Yah, i guess that is a better description than "changes the interface"

@mattip
Copy link
Contributor

mattip commented Jan 22, 2020

Sorry for the breakage. and thanks for the quick fix. I thought I had tested this before re-merging the NumPy PR. Please ping me if I can help with any further issues.

TomAugspurger added a commit that referenced this pull request Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COMPAT: numpy object array coercion warnings
6 participants