-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: 1D slices over extension types turn into N-dimensional slices over ExtensionArrays #42430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Is the logging output itself the problem? If so, couldn't you patch |
Workaround around pandas-dev/pandas#42430.
@jbrockmendel The pandas/pandas/core/arrays/base.py Lines 294 to 307 in b72b6b2
|
I think the docstring doesn't include it because the annotation doesn't include it, which is in turn because the typing module has trouble with it https://github.com/pandas-dev/pandas/blob/master/pandas/_typing.py#L204 |
Sorry @jbrockmendel , I missed the notification for your earlier response (#42430 (comment)). Our extension type currently uses a workaround similar to what you describe. We check the inputs to def __getitem__(self, item) -> Union["TensorArray", "TensorElement"]:
"""
See docstring in `Extension Array` class in `pandas/core/arrays/base.py`
for information about this method.
"""
# Return scalar if single value is selected, a TensorElement for single array
# element, or TensorArray for slice
if isinstance(item, int):
[...]
else:
# BEGIN workaround for Pandas issue #42430
if (pd.__version__ == "1.3.0" and isinstance(item, tuple) and len(item) > 1
and item[0] == Ellipsis):
if len(item) > 2:
# Hopefully this case is not possible, but can't be sure
raise ValueError(f"Workaround Pandas issue #42430 not implemented "
f"for tuple length > 2")
item = item[1]
# END workaround for issue #42430
[...] (see original code here). Having to use this kind of workaround is cumbersome for extension type developers. It's not clear what other variations of "multidimensional indexing expression that happens to work on a 1D numpy array" we're supposed to handle. If Pandas starts throwing additional variations on the theme at us, our extension types will break again. Personally, I think that this issue would be best resolved by adding the following code to the class def getitem_block_index(self, slicer: slice) -> Block:
"""
Override the superclass's implementation of this method, which assumes a 2D block, with
code that works on a 1D extension array.
"""
return type(self)(self.values[slicer], self._mgr_locs, ndim=self.ndim) I'd be happy to put in a PR with this change if it's acceptable. |
I guess I'd be OK with that short term, but longer-term there is a decent chance that we're going to move to having EAs support 2D, in which case that would be reverted and you'd need to handle this case |
BTW looking at your code, you might want to subclass NDArrayBackedExtensionArray rather than ExtensionArray; many of its methods are better-optimized. |
@jbrockmendel is |
Fair point, it isn't publicly exposed and I don't have any plans on exposing it anytime soon. On the other hand, I'm mostly responsible for that class and it isn't going anywhere. |
I've filed #42787 with the fix described above, with a slightly smaller patch due to other changes to the same code. @jbrockmendel we'll take a look at the |
Not really. There has been a lack of consensus about it for while now. |
Doesn't it make more sense to have a "Tensor" ExtensionArray, for storing one tensor per observation, all with the same dimensions? That would allow for having a "Time series" column or an "Image" column, for example. |
Are all of your tensors the same shape? This sounds like a use case for e.g. xarray |
I mean having mixed data. Consider a case in which you have medical data. You have some scalar measurements for each patient, each on different columns, an electrocardiogram as one 1d tensor column, an image of the brain activity, as one 2d tensor column, etc. |
@WillAyd the use case described here sounds a little bit like the nested data work you were doing a while back. thoughts? |
Sometimes it's useful to have a column of tensors without abandoning Pandas entirely. Our own use cases are similar to what @vnmabus describes -- the tensors are just one feature among many. For example we may have a DataFrame with one row for each token in a document and one column for each feature of the token. One of those features could be a BERT embedding or an n-gram, but most of them will be categorical, numeric, or span-valued. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Expected output:
Actual output:
Problem description
If a DataFrame contains extension types, the block manager turns a 1D slice in the form
DataFrame.iloc[slice(x, y, z)]
into an N-dimensional slice over the ExtensionArray of the form__getitem__((Ellipsis, slice(x, y, z))
. This problem breaks the__getitem__()
method on one of our extension types in Text Extensions for Pandas.The
__getitem__
methods in Pandas' built-in extension types happens to work when fed this kind of n-dimensional slice--- they just ignore the
Ellipsis
in the first position of the tuple. But I don't think they're supposed to accept that type of input.The root cause is the new method
Block.getitem_block_index()
, added during some performance optimizations introduced in #40353:This method assumes that the contents of the block are 2-dimensional. The
ExtensionBlock
class inherits this method fromEABackedBlock
, which inherits it fromBlock
.Output of
pd.show_versions()
INSTALLED VERSIONS
commit : f00ed8f
python : 3.7.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
The text was updated successfully, but these errors were encountered: