Skip to content

BUG: pd.array([timedelta_like_strings]) should infer TimedeltaArray #33558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Apr 15, 2020 · 19 comments · Fixed by #41731
Closed

BUG: pd.array([timedelta_like_strings]) should infer TimedeltaArray #33558

jbrockmendel opened this issue Apr 15, 2020 · 19 comments · Fixed by #41731
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors
Milestone

Comments

@jbrockmendel
Copy link
Member

>>> left = pd.array(['59 days', '59 days', pd.NaT]))
>>> left
<StringArray>
['59 days', '59 days', <NA>]
Length: 3, dtype: string

>>> left = pd.array(['59 days', '59 days', pd.NaT], dtype='m8[ns]')
>>> left
<TimedeltaArray>
['59 days', '59 days', NaT]
Length: 3, dtype: timedelta64[ns]

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2020
@jorisvandenbossche
Copy link
Member

Why should it infer that? If passing strings., I think inferring a string type is best? And for conversions from string to timedelta64, we also have pd.to_timedelta

@TomAugspurger
Copy link
Contributor

I think that'd be too magical. I think the result dtype of pd.array should only depend on the types of the values, not the values themselves.

@jbrockmendel
Copy link
Member Author

  1. pd.array should mirror pd.Series and pd.Index.
  2. The presence of pd.NaT should be a big hint

@jorisvandenbossche
Copy link
Member

I don't think we necessarily need to copy "mistakes" from Series/Index constructors into pd.array. Whether this case is one is debatable for sure.
But personally, I don't think we should automatically convert strings to datetimelike dtypes (without dtype being specified).

BTW, pd.array already does not fully mirror Series/Index, as it has the "future" behaviour fo defaulting to eg nullable integers.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 15, 2020

Mmm I missed the NaT originally. That does change things, but I think it moves me to returning something like an object-dtype PandasArray.

What would your proposed behavior for something like

pd.array(['2 days', pd.NaT, 'a'])

be? A case where a value isn't valid for to_timedelta.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 15, 2020

The presence of NaT could, besides the option to coerce to datetimelike as you propose, also be a reason to not infer string dtype though, if we want to differentiate different missing value sentinels.

This is actually also what happens with ints. You get object dtype instead of float or datetimelike:

In [5]: pd.Series([1, 2, pd.NaT]) 
Out[5]: 
0      1
1      2
2    NaT
dtype: object

EDIT: so yes what Tom also said ;)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 15, 2020

If automatically coercing strings, the question also becomes which formats do automatically coerce and which not (apart from totally different strings as in Tom's example).

For example with datetime64:

In [10]: pd.Series(["2012/01/12", pd.NaT])    
Out[10]: 
0   2012-01-12
1          NaT
dtype: datetime64[ns]

In [11]: pd.Series(["01/12/2012", pd.NaT])    
Out[11]: 
0    01/12/2012
1           NaT
dtype: object

The one coerces, the other not, while both work fine in pd.to_datetime and result in the same datetime

@jbrockmendel
Copy link
Member Author

I'm coming at this from a high-level view of "pd.array/pd.Index/pd.Series" should share code and behavior (#27460). If there is not consensus on that, then this topic is moot.

The "01/12/2012" example is a good one. That strikes me as surprising behavior in the Series constructor.

@jorisvandenbossche
Copy link
Member

To be clear, I am fully on board that in the long term, those should share code (and basically Index/Series should just call pd.array under the hood, or something close to that).

But we could also deprecate the behaviour of inferring strings like that in Series/Index, in the course of unifying things.

@jbrockmendel
Copy link
Member Author

But we could also deprecate the behaviour of inferring strings like that in Series/Index, in the course of unifying things.

Conditional on the behavior+code converging, i have no real objection to that.

@TomAugspurger
Copy link
Contributor

+1 to consistency / sharing code. I think the only intentional difference is that array can return the new nullable types. For Series we'll need to wait for a global option / flag to the Series constructor to opt into that (and Index will need to wait for EA-backed indexes).

Are we agreed that the expected dtype for these mixed cases (strings and NaT) is object?

@jorisvandenbossche
Copy link
Member

Are we agreed that the expected dtype for these mixed cases (strings and NaT) is object?

I think that is best.

An option could also be to raise an error (I don't know how easy it is code-wise), and requiring users to specify dtype=object explicitly if they want object dtype. In the idea that, for this specific case (with people passing pd.NaT), the result wil almost never be what you want, so better to notify people of it.
Although that is also a slippery slope (in which cases to raise, and in which to default to object dtype), so probably object dtype is best (if not coercing to datetimelike).

@jbrockmendel
Copy link
Member Author

Are we agreed that the expected dtype for these mixed cases (strings and NaT) is object?

For the example with "a" in it, yes. For the datetime/timedelta-like strings, i'll need to give some thought to how id like Series/Index to behave longer-term.

@jbrockmendel jbrockmendel added API Design Constructors Series/DataFrame/Index/pd.array Constructors and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@jorisvandenbossche
Copy link
Member

Reopening this because I think there is still agreement that the current behaviour of inferring string dtype is not the best. I think the idea would be to rather return object dtype?
(or alternatively raise an error, to notify people they might be doing something wrong)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 12, 2020

I'm not sure what the current behavior is, but if possible I'd like to keep the inference of pd.array to just consider types. Something like pd.array([list_of_strings]) inferring based on values worries me (ambiguities, computational expense).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 12, 2020

See the example in the top post: there is a pd.NaT in the list_of_strings, which is not a string or not a generic NA value (it's a datetime-like), so should that still infer string dtype?

something like pd.array([list_of_strings]) inferring based on values worries me (ambiguities, computational expense).

If a list is provided, and not dtype is specified, isn't "inferring based on values" the only thing we can do?

@jbrockmendel
Copy link
Member Author

yah, i closed bc i thought there was consensus that the title of the issue was wrong; i.e. you guys convinced me.

@TomAugspurger
Copy link
Contributor

Ah, so Joris you reopened it to go the other way? To deprecate Series / Index's behavior of inferring?

@jorisvandenbossche
Copy link
Member

Sorry for the confusion here ;) (and ignore the title of the issue for a moment, that's indeed in need of an update if we agree what this issue is about)

The only reason that I reopened this is because (I thought) our earlier discussion (@TomAugspurger see your comment above at #33558 (comment) which says "Are we agreed that the expected dtype for these mixed cases (strings and NaT) is object?") concluded that the current behaviour (inferring this mixed case to string) is not wanted (we want to infer mixed case to object instead).
So since the example code that started this issue is not doing on master what we want it to do, it seems worth it to have an open issue about this (even though the original proposal of @jbrockmendel when opening this issue is different).

Ah, so Joris you reopened it to go the other way? To deprecate Series / Index's behavior of inferring?

Well, I was not actaully thinking about that, but now you say it .. ;) That's maybe indeed what we should do.

There are basically two "different" behaviours to consider (and when reopening this issue I was actually only thinking about the first):

  • pd.array(['59 days', pd.NaT]) -> infers string dtype
  • pd.Series(['59 days', pd.NaT]) -> infers timedelta64[ns] dtype

Ideally, we want to have both of those consistent, agreed?
And it seems there is also agreement on inferring as object dtype? (but noting again that raising an error is also still an option, users can specify dtype=object if they really meant to create an object-dtype series)

For the first one, I think we can still simply change, because pd.array()'s behaviour is still in flux anyway (string dtype is experimental).
The Series behaviour, if we agree on changing this default inference, indeed is something we need to deprecate. The question might be if we already want to deprecate this now, or have this integrated in a general move to pd.array behaviour / nullable dtypes we need to do at some point (there are other differences in behaviour between pd.Series() inference and pd.array() inference as well)

@jreback jreback added this to the 1.3 milestone May 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants