BUG: Fixed PandasArray.__setitem__ with str #28119

Merged

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Aug 23, 2019

Closes #28118
Closes #28150

@TomAugspurger TomAugspurger added this to the 1.0 milestone Aug 23, 2019
@jbrockmendel
Member

LGTM

That said, I'm just noticing that below the changed code PandasArray can change its dtype and pin a new underlying ndarray. Does that seem sketchy to anyone else?

@simonjayhawkins simonjayhawkins added the ExtensionArray (Extending pandas with custom dtypes or arrays) and Bug labels Aug 23, 2019
@TomAugspurger
Contributor Author

> PandasArray can change its dtype and pin a new underlying ndarray.

I'm not too bothered by the new underlying ndarray part, since it's private. What part bothers you?

I do notice that .astype should probably be with copy=False.
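For context on the `copy=False` remark, here is a minimal NumPy-only sketch (not the PR's code; the variable names are illustrative). With `copy=False`, `ndarray.astype` returns the input array itself when no conversion is needed, whereas the default always allocates a new array. A real dtype change still copies either way.

```python
import numpy as np

arr = np.array([1, 2, 3])

# Same dtype, copy=False: NumPy hands back the original array object.
same = arr.astype(arr.dtype, copy=False)

# Same dtype, default copy=True: a fresh buffer is allocated.
copied = arr.astype(arr.dtype)

# A real dtype change must allocate a new buffer even with copy=False.
converted = arr.astype(object, copy=False)

print(same is arr)                        # identity is preserved
print(np.shares_memory(arr, copied))      # separate buffers
print(np.shares_memory(arr, converted))   # separate buffers
```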

@jbrockmendel
Member

> I'm not too bothered by the new underlying ndarray part, since it's private. What part bothers you?

It's liable to cause surprises with view-like semantics. e.g. I'd expect parr[:] or np.asarray(parr) to stay in sync with parr. (this discussion probably belongs in a separate issue)
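The view concern can be illustrated with plain NumPy arrays. This is a sketch under the assumption that "pinning a new underlying ndarray" amounts to replacing the backing buffer with the result of `astype`; `backing`, `view`, and `rebound` are illustrative names, not PandasArray internals.

```python
import numpy as np

backing = np.array([1, 2, 3])
view = backing[:]                  # shares memory with `backing`

# An in-place write is visible through the view.
backing[0] = 10

# Changing dtype creates a new buffer; writing to it leaves the
# earlier view untouched, so the two are no longer in sync.
rebound = backing.astype(object)
rebound[1] = "a"

print(view)      # still an integer array, second element unchanged
print(rebound)   # object array holding the string
```

Any object that swaps its buffer like this silently detaches views taken earlier, which is the surprise described above.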

@TomAugspurger
Contributor Author

Agreed that this can go in a followup :)

if is_object_dtype(self.dtype._dtype):
    t = np.dtype(object)
else:
    t = self.dtype._dtype
Member

Do we have a test that hits this branch?

Contributor Author

test_setitem_object_typecode[None] hits it (setting a string into an integer array).

Contributor

Simpler to leave the original code and just convert np.str to np.object (which is what we do inside the blocks manager and other places); maybe have a function to do this rather than rewriting the logic all over the place.

Contributor Author

I don't think that's appropriate for PandasArray. The idea is to take an arbitrary numpy array and box it in an extension array.

Contributor

and that’s exactly what is done in ObjectBlock now

pls refactor rather than adding logic

Contributor Author

I doubt it. I think I was mimicking the behavior of Series.__setitem__

In [4]: x = np.array([1, 2, 3])

In [5]: s = pd.Series(x)

In [6]: s.values is x
Out[6]: True

In [7]: s[0] = 'a'

In [8]: s.values is x
Out[8]: False

But I'm happy to be stricter here.

Contributor Author

That said, we'll also inherit things like

In [11]: x = np.array([1, 2, 3])

In [12]: x[0] = 5.5

In [13]: x
Out[13]: array([5, 2, 3])

But maybe that's OK, if the intent is to be close to NumPy here.
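To make the "close to NumPy" behavior concrete, the sketch below shows both cases discussed in this thread on a plain integer ndarray: a float is silently truncated, while a non-numeric string raises. It uses only NumPy and says nothing about what PandasArray itself does.

```python
import numpy as np

x = np.array([1, 2, 3])

# Assigning a float into an integer array truncates without warning.
x[0] = 5.5

# Assigning a non-numeric string cannot be converted and raises.
try:
    x[1] = "a"
    string_error = None
except ValueError as err:
    string_error = err

print(x)
print(type(string_error).__name__)
```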

Member

To what extent can we punt on the float treatment for now? I think there's a case to be made that we should raise instead of casting there, but don't want to bog this down any more.

Contributor Author

@TomAugspurger TomAugspurger Sep 10, 2019

Hmm I think our options are to always raise when the dtypes don't match, or adopt NumPy's behavior. I don't think I have a preference.

Member

The thought that pushes me towards raising is that if/when this is backing a Block, we want Block.setitem to try to set it on block.values and then fall back to casting.
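The "try to set, then fall back to casting" pattern could look roughly like the sketch below. Both helpers, `strict_setitem` and `setitem_with_fallback`, are hypothetical names invented for illustration; this is not pandas' Block code, and the safe-cast check via `np.can_cast` is just one possible definition of "dtypes match".

```python
import numpy as np


def strict_setitem(values, key, value):
    """Hypothetical strict setter: write in place or raise TypeError."""
    new = np.asarray(value)
    # Assumption: "fits" means the value's dtype casts safely to the
    # array's dtype, so floats and strings are rejected for int arrays.
    if not np.can_cast(new.dtype, values.dtype, casting="safe"):
        raise TypeError(
            f"cannot set {new.dtype} value into {values.dtype} array"
        )
    values[key] = new


def setitem_with_fallback(values, key, value):
    """Hypothetical caller: try in place first, otherwise cast to object."""
    try:
        strict_setitem(values, key, value)
        return values                      # same buffer, mutated in place
    except TypeError:
        out = values.astype(object)        # new buffer; original untouched
        out[key] = value
        return out


arr = np.array([1, 2, 3], dtype=np.int64)
in_place = setitem_with_fallback(arr, 0, 7)
upcast = setitem_with_fallback(arr, 1, "a")
print(in_place is arr, upcast.dtype)
```

With the strict setter raising, the decision to cast lives in one place (the caller), and the array itself never swaps its buffer behind the caller's back.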

@TomAugspurger
Contributor Author

Merging in a few hours.

Contributor

@jreback jreback left a comment

lgtm. minor whatsnew comments, merge on green.

@TomAugspurger
Contributor Author

Fixed the whatsnew. Merging.

@TomAugspurger TomAugspurger merged commit 5a227a4 into pandas-dev:master Sep 17, 2019
@TomAugspurger TomAugspurger deleted the PandasArray-setitem-object branch September 17, 2019 20:21
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* BUG: Fixed PandasArray.__setitem__ with str

Closes pandas-dev#28118
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Development

Successfully merging this pull request may close these issues.

API/BUG: PandasArray __setitem__ can change underlying buffer
PandasArray.__setitem__ fails for strings
5 participants