BUG: inconsistent behavior and crash in DataFrame.__setitem__ when >=3d ndarray is used #53366

Open
3 tasks done
determ1ne opened this issue May 24, 2023 · 3 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@determ1ne

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# BUG Behavior 1:

df = pd.DataFrame(np.zeros((4, 1)))
# error, expected
# ValueError: Expected a 1D array, got an array with shape (4, 2)
df['A'] = np.zeros((4, 2))
# no error, not expected
df["A"] = np.zeros((4, 2, 3))
print(df) # exception here

'''
>>> print(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1096, in __repr__
    return self.to_string(**repr_params)
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1273, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1099, in to_string
    string = string_formatter.to_string()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 30, in to_string
    text = self._get_string_representation()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 45, in _get_string_representation
    strcols = self._get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 36, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 614, in get_strcols
    strcols = self._get_strcols_without_index()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 878, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 892, in format_col
    return format_array(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1295, in format_array
    return fmt_obj.get_result()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1328, in get_result
    fmt_values = self._format_strings()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1576, in _format_strings
    return list(self.get_result_as_array())
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1543, in get_result_as_array
    formatted_values = format_values_with(float_format)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1523, in format_values_with
    result = _trim_zeros_float(values, self.decimal)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1972, in _trim_zeros_float
    while should_trim(trimmed):
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in should_trim
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in <listcomp>
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1959, in is_number_with_decimal
    return re.match(number_regex, x) is not None
  File "/home/azuk/.conda/envs/pandas-dev/lib/python3.10/re.py", line 190, in match
    return _compile(pattern, flags).match(string)
TypeError: cannot use a string pattern on a bytes-like object
'''

# BUG Behavior 2:

df = pd.DataFrame(np.zeros((4, 1)))
# ok 
df['A'] = np.zeros((4, 1))
# no error, not expected here
# expected ValueError: Expected a 1D array, got an array with shape (4, 2)
df['A'] = np.zeros((4, 2))

Issue Description

The dimension of the input array is only checked in BlockManager.insert:

if value.ndim == 2:
    value = value.T
    if len(value) > 1:
        raise ValueError(
            f"Expected a 1D array, got an array with shape {value.T.shape}"
        )

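  1. The code only checks for 2D ndarrays, so an ndarray with 3 or more dimensions can be assigned, which leaves the DataFrame in a state that crashes later (for example on print).
  2. The check only runs when a new column is inserted, so replacing the values of an existing column does not raise an exception.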
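
The two gaps above can be illustrated with a small standalone check. This is only a sketch under stated assumptions: `validate_column_value` is a hypothetical helper, not a pandas function, and it is not the fix proposed in any PR. It applies one rule to every array, whether the column is being inserted or replaced: a (n, 1) array is squeezed to 1D, and anything else that is not 1D is rejected.

```python
import numpy as np


def validate_column_value(value: np.ndarray, n_rows: int) -> np.ndarray:
    """Hypothetical helper (not part of pandas): return a 1D array or raise."""
    # Accept a single-column 2D array by dropping the trailing axis.
    if value.ndim == 2 and value.shape[1] == 1:
        value = value[:, 0]
    # Reject everything that is still not 1D, including 3D and higher.
    if value.ndim != 1:
        raise ValueError(
            f"Expected a 1D array, got an array with shape {value.shape}"
        )
    if len(value) != n_rows:
        raise ValueError(
            f"Length of values ({len(value)}) does not match number of rows ({n_rows})"
        )
    return value


# Try the shapes discussed in this report against a 4-row frame.
for shape in [(4,), (4, 1), (4, 2), (4, 2, 3)]:
    try:
        validate_column_value(np.zeros(shape), n_rows=4)
        print(shape, "accepted")
    except ValueError as exc:
        print(shape, "rejected:", exc)
```

Because the check does not depend on whether the column already exists, the insert and replace paths would behave the same under this rule.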

This issue has the same root cause as #51925.

Expected Behavior

A ValueError should be raised in both situations.

Installed Versions

INSTALLED VERSIONS
------------------
commit : f5a5c8d
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-70-generic
Version : #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+828.gf5a5c8d7f0
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.33
pytest : 7.3.1
hypothesis : 6.75.3
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : 3.1.1
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.4.0
fsspec : 2023.5.0
gcsfs : 2023.5.0
matplotlib : 3.7.1
numba : 0.57.0
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.5.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.15
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.5.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@determ1ne determ1ne added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 24, 2023
determ1ne added a commit to determ1ne/pandas that referenced this issue May 24, 2023
@rhshadrach rhshadrach added the Indexing Related to indexing on series/frames, not to indexes themselves label May 24, 2023
@lithomas1 lithomas1 added Error Reporting Incorrect or improved errors from pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 24, 2023
@adrien-berchet

adrien-berchet commented Feb 24, 2024

I had the same issue, anything new on this?

Also, note that it only happens for numpy.ndarray objects. Converting the array to a list works properly:

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3))

crashes as reported while

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3)).tolist()

works and the DF is:

In [2]: df
Out[2]: 
     a                                   b
0  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
1  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
2  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
3  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

which is what I wanted to achieve before I got this issue.
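
A possible variant of the workaround above, as a sketch rather than a recommendation: passing `list(arr)` instead of `arr.tolist()` splits the array along its first axis only, so each cell should hold a 2D NumPy array instead of nested Python lists. The column name `b` and the shapes are taken from the example above. The column ends up with object dtype, so vectorized numeric operations on it are not available.

```python
import numpy as np
import pandas as pd

arr = np.zeros((4, 2, 3))
df = pd.DataFrame({"a": np.zeros(4)})

# list(arr) yields four separate (2, 3) arrays, one per row.
df["b"] = list(arr)

cell = df["b"].iloc[0]
print(df["b"].dtype)
print(type(cell).__name__, cell.shape)
```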

Finally, trying to use the constructor detects the issue and crashes with a more understandable error:

pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:653, in _extract_index(data)
    651         raw_lengths.append(len(val))
    652     elif isinstance(val, np.ndarray) and val.ndim > 1:
--> 653         raise ValueError("Per-column arrays must each be 1-dimensional")
    655 if not indexes and not raw_lengths:
    656     raise ValueError("If using all scalar values, you must pass an index")

ValueError: Per-column arrays must each be 1-dimensional

(and again, casting to a list also works properly in this case)

EDIT: I can reproduce this issue with pandas==1.5.3 and pandas==2.2.1.
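
Until `__setitem__` validates its input, the constructor's check shown in the traceback above (`isinstance(val, np.ndarray) and val.ndim > 1`) can be applied by hand before assigning. The following is a minimal sketch: `set_column_checked` is a hypothetical wrapper, not a pandas API. It is also stricter than pandas itself, since it rejects (n, 1) arrays that `__setitem__` accepts.

```python
import numpy as np
import pandas as pd


def set_column_checked(df: pd.DataFrame, key, value) -> None:
    """Hypothetical wrapper (not a pandas API): refuse multi-dimensional ndarrays."""
    # Same condition the DataFrame constructor uses for per-column arrays.
    if isinstance(value, np.ndarray) and value.ndim > 1:
        raise ValueError("Per-column arrays must each be 1-dimensional")
    df[key] = value


df = pd.DataFrame({"a": np.zeros(4)})
set_column_checked(df, "b", np.ones(4))  # 1D array: assigned as usual

try:
    set_column_checked(df, "c", np.zeros((4, 2, 3)))
except ValueError as exc:
    # The frame is left untouched because the check runs before assignment.
    print(exc)
```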

@determ1ne
Author

The last code snippet pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))}) worked properly and raised the corresponding error. Casting to a list works because lists are always 1-d.

When I opened PR #53367 I didn't dig into how pandas deals with memory, but I managed to block the invalid DataFrame.__setitem__ calls, consistent with the ValueError you mentioned.

@ebo

ebo commented Jun 7, 2024

I have started working with CryoSat-2 data, where the radar waveform is 3D. I can convert it with ndarray.tolist(), but it would be nice if there were some way to use non-1-dimensional data within a DataFrame.

For reference, here is a trivial script to replicate. The data is available from ESA https://earth.esa.int/eogateway/missions/cryosat/data.

import pandas as pd
from netCDF4 import Dataset

fname = "CS_OFFL_SIR_SAR_1B_20220302T004121_20220302T004737_E001.nc"
with Dataset(fname, mode='r') as CS:
    test_df = pd.DataFrame({
        'Waveform': CS.variables['pwr_waveform_20_ku'][:]
    })

This gives the error: "ValueError: Per-column arrays must each be 1-dimensional"

If anyone knows of any tricks to get pandas to work with multidimensional data without casting to a list, please let me know.
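
One possible approach, sketched here under assumptions: flatten the trailing axes into columns and label them with a MultiIndex, so every cell stays a plain float. The array below is a synthetic stand-in for the waveform, and the level names `i` and `j` are placeholders, not names from the CryoSat-2 product.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a 3D variable with shape (records, i, j).
arr = np.arange(24, dtype=float).reshape(4, 2, 3)
n_rows, n_i, n_j = arr.shape

# One column per (i, j) pair, in the same order that reshape flattens them.
columns = pd.MultiIndex.from_product([range(n_i), range(n_j)], names=["i", "j"])
wide = pd.DataFrame(arr.reshape(n_rows, -1), columns=columns)

# The original array can be recovered by reshaping back.
restored = wide.to_numpy().reshape(arr.shape)
print(wide.shape)
print(np.array_equal(restored, arr))
```

This keeps a numeric dtype and vectorized operations, at the cost of a wide frame. For data that is naturally N-dimensional, a labeled-array library such as xarray may be a better fit than a DataFrame.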
