-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DTA/TDA/PA setitem incorrectly allowing i8 #33717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
value = self._unbox_scalar(value) | ||
value = array(value) | ||
if is_string_dtype(value.dtype): | ||
# We got a StringArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this assumption important or is value pre-validated
>>> value = [1, 2, object()]
>>> value = pd.array(value)
>>> value
<PandasArray>
[1, 2, <object object at 0x0000015F4B631DA0>]
Length: 3, dtype: object
>>>
>>> pd.core.dtypes.common.is_string_dtype(value.dtype)
True
>>>
might be worth adding type annotations to _validate_setitem_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes, that looks like a problem with is_string_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel addressing here? or followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a "problem" with is_string_dtype
, because that function simply checks for object dtype. That's how it has been for a long time, and not sure if it is easy to solve, as actually checking for strings would mean to infer the dtypes from the values each time. We have #15585 about this.
(it's a problem in practice of course, since it makes this function rather useless. But the solution is probably improving the new "string" dtype so this can become the default string dtype in the future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right actually I would say this is a bug in is_string_dtype
which is actually just checking for object
dtype and not an inferred string itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't infer string values from a dtype object, though ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure you can, this is excatly what infer_from_dtype
does; we just don't do it inside this function because it can be non-cheap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to punt on is_string_dtype by instead checking is_dtype_equal(value.dtype, "string")
if is_string_dtype(value.dtype): | ||
# We got a StringArray | ||
try: | ||
# TODO: Could use from_sequence_of_strings if implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So related to my comment above, this TODO item is not correct (also the "We got a StringArray" comment above is thus not correct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a suggestion for what to do here? _from_sequence is clearly too permissive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is _from_sequence
too permissive for this purpose? It seems that whathever is allowed to create an array of this dtype, can also be allowed to set values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this gets back to your point of from_sequence being too permissive for DTA/TDA/PA; it would let through e.g. float64 or int64 which shouldnt allow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This then becomes a bit ambiguous on what to allow and what not. Why allow ints to create but not to set, while allow strings in both creating and setting?
So why not disallow setting with string as well?
I suppose the answer is: that is the current behaviour .. But so to come back to your question what to use here: if you only want to allow actual strings, and not eg integers in an object dtype array, you can actually use infer_dtype
to check it are strings or datetime objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im assuming thats a rhetorical question; LMK if you actually are looking for an explanation.
So I'm clear: are you advocating for the status quo, or do you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's not really a rhetorical question, I actually find it inconsistent to allow certain data types in creation and others in setting (I am fine with being more strict in setting compared to creation, but then we should be strict altogether and also not allow strings in setting, IMO).
But it's a question that doesn't need to be solved in this PR, that for sure!
To be clear: I am fine with the code, I only commented about the comment not really being correct. But now you updated the dtype check, I assume the comment is actually correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually find it inconsistent to allow certain data types in creation and others in setting
Yes. The upshot is that the preferred way of minimizing inconsistencies is to make from_sequence more strict. Let's leave that for another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine, can you rebase.
@jorisvandenbossche ok here?
I supppose this is worth a whatsnew note as well |
Yes, all good |
thanks @jbrockmendel if you can add a whatsnew in a followup |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff