Skip to content

BUG: DTA/TDA/PA setitem incorrectly allowing i8 #33717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 30, 2020

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

value = self._unbox_scalar(value)
value = array(value)
if is_string_dtype(value.dtype):
# We got a StringArray
Copy link
Member

@simonjayhawkins simonjayhawkins Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this assumption important or is value pre-validated

>>> value = [1, 2, object()]
>>> value = pd.array(value)
>>> value
<PandasArray>
[1, 2, <object object at 0x0000015F4B631DA0>]
Length: 3, dtype: object
>>>
>>> pd.core.dtypes.common.is_string_dtype(value.dtype)
True
>>>

might be worth adding type annotations to _validate_setitem_value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes, that looks like a problem with is_string_dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel addressing here? or followup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a "problem" with is_string_dtype, because that function simply checks for object dtype. That's how it has been for a long time, and not sure if it is easy to solve, as actually checking for strings would mean to infer the dtypes from the values each time. We have #15585 about this.

(it's a problem in practice of course, since it makes this function rather useless. But the solution is probably improving the new "string" dtype so this can become the default string dtype in the future)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right actually I would say this is a bug in is_string_dtype which is actually just checking for object dtype and not an inferred string itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't infer string values from a dtype object, though ..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure you can, this is excatly what infer_from_dtype does; we just don't do it inside this function because it can be non-cheap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to punt on is_string_dtype by instead checking is_dtype_equal(value.dtype, "string")

@jreback jreback added the Datetime Datetime data dtype label Apr 23, 2020
if is_string_dtype(value.dtype):
# We got a StringArray
try:
# TODO: Could use from_sequence_of_strings if implemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So related to my comment above, this TODO item is not correct (also the "We got a StringArray" comment above is thus not correct)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a suggestion for what to do here? _from_sequence is clearly too permissive

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is _from_sequence too permissive for this purpose? It seems that whathever is allowed to create an array of this dtype, can also be allowed to set values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this gets back to your point of from_sequence being too permissive for DTA/TDA/PA; it would let through e.g. float64 or int64 which shouldnt allow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This then becomes a bit ambiguous on what to allow and what not. Why allow ints to create but not to set, while allow strings in both creating and setting?
So why not disallow setting with string as well?

I suppose the answer is: that is the current behaviour .. But so to come back to your question what to use here: if you only want to allow actual strings, and not eg integers in an object dtype array, you can actually use infer_dtype to check it are strings or datetime objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im assuming thats a rhetorical question; LMK if you actually are looking for an explanation.

So I'm clear: are you advocating for the status quo, or do you have something else in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not really a rhetorical question, I actually find it inconsistent to allow certain data types in creation and others in setting (I am fine with being more strict in setting compared to creation, but then we should be strict altogether and also not allow strings in setting, IMO).
But it's a question that doesn't need to be solved in this PR, that for sure!

To be clear: I am fine with the code, I only commented about the comment not really being correct. But now you updated the dtype check, I assume the comment is actually correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually find it inconsistent to allow certain data types in creation and others in setting

Yes. The upshot is that the preferred way of minimizing inconsistencies is to make from_sequence more strict. Let's leave that for another thread.

@jreback jreback added this to the 1.1 milestone Apr 23, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, can you rebase.

@jorisvandenbossche ok here?

@jreback
Copy link
Contributor

jreback commented Apr 30, 2020

I supppose this is worth a whatsnew note as well

@jorisvandenbossche
Copy link
Member

Yes, all good

@jreback jreback merged commit 085752f into pandas-dev:master Apr 30, 2020
@jreback
Copy link
Contributor

jreback commented Apr 30, 2020

thanks @jbrockmendel

if you can add a whatsnew in a followup

@jbrockmendel jbrockmendel deleted the dtlike-helpers-2 branch May 4, 2020 20:03
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants