-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: require scalar result from the scalar .at indexer #33153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cc @pandas-dev/pandas-core |
The actual |
Seems to me #26989 does not address using .at when not all levels are specified. Perhaps you meant another issue? |
That might be how it is coded, but for example also the reference docstring of |
thinking this through again, would be ok with raising on a return value that is non -scalar (eg we have duplicates) |
As a user, I think of @TomAugspurger thoughts? |
Agreed with Brock. |
Or otherwise they might need to do checks afterwards because you don't know whether you got back a scalar, series or dataframe Indexing inherently can always generate errors depending on your data (eg the key you are checking is not present), so in general your code might always need to be robust against that. But I think being certain here about the return type would be nice. |
|
It can be a problem of the duplicate labels, but not all our methods therefore need to be able to handle such duplicate labels. We have other methods that raise an error in such a case. |
Hmm maybe I'm missing something but my feeling is that |
As an implementation matter, if we wanted to disallow non-scalar results, this would require a deprecation cycle, since ATM Series.at specifically supports non-unique index. That deprecation wouldn't be enforced until 2.0 which is a ways off. In the interim, raising AttributeError for DataFrame.at is clearly a bug and should be fixed (#33047). Fixing that now doesn't preclude later deciding to deprecate the behavior for both Series/DataFrame. |
Yes, I suppose that indeed happened like that. But the initial implementation also didn't consider duplicates (I checked a couple of version back, and then .at with a duplicate label returned an ndarray, so didn't even bother to check for it, I suppose for performance reasons as it assumed scalars). So now we are actually considering duplicate labels, we can make a choice. We have a lot of places in pandas where the return type of a method can be all kinds of things, while in general it is nice to have stricter typing (eg |
I think the inconsistency with
this is not easy at all to be more strict here. |
Since this is blocking a number of PRs I would say @jorisvandenbossche pls make an issue for discussion if you think we ought to consider fixing duplicate label indexing (for both .loc and .at). |
This is that issue? |
Ah, read over "both loc and at". But, I don't think those necessarily need to be handled / discussed at the same time. As there is quite some difference: Now, if no one agrees with me, I am of course fine to go with the majority. I only think this is a chance to clean up a part of our indexing API. |
So on the call we discussed this more, and there was not much appetetite for restricting the behaviour for indexes with duplicates keys, so I think the conclusion is that we can fix the DataFrame cases to return a Series/DataFrame instead of raising an error for duplicate keys. |
@jorisvandenbossche Is the return type in various cases of a possible non-scalar result also part of this issue, or should it be separated off into another issue? |
@jorisvandenbossche I think all the raising cases have been fixed. Can you double check? |
Moved off the 1.1 milestone. |
The
.at
indexer is documented as the fast, restricted version ofloc
to "access a scalar value" (https://pandas.pydata.org/docs/user_guide/indexing.html#fast-scalar-value-getting-and-setting).However, currently, there is not always a guarantee that the result will actually be a scalar. For example, with a Series with duplicate index values:
There are some other possible cases that could give a non-scalar value as well:
However, those cases currently all fail (error or produce wrong result, see below for examples).
So a question that can be posed: should we consider those cases as bugs, or should we rather actually require a scalar result and thus see the series behaviour above as too liberal?
Since those linked issues have open PRs to "fix" them, I think we should first decide on what guarantees on the
.at
API we want to provide. And personally, I think taking the documented intention of accessing a scalar would be a plus (it gives a certainty about the type of the result, i.e. always a scalar, and not sometimes a series or dataframe depening on the value of the label passed toat
).If we prefer the more strict scalar result requirement, we can deprecate the Series case (but only if accessing a duplicated label) and provide better error messages for the error cases.
Examples of the failing behaviours (on pandas 1.0.1):
The text was updated successfully, but these errors were encountered: