BUG: inconsistent behavior and crash in `DataFrame.__setitem__` when >=3d ndarray is used #53366

determ1ne · 2023-05-24T08:30:52Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# BUG Behavior 1:

df = pd.DataFrame(np.zeros((4, 1)))
# error, expected
# ValueError: Expected a 1D array, got an array with shape (4, 2)
df['A'] = np.zeros((4, 2))
# no error, not expected
df["A"] = np.zeros((4, 2, 3))
print(df) # exception here

'''
>>> print(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1096, in __repr__
    return self.to_string(**repr_params)
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1273, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1099, in to_string
    string = string_formatter.to_string()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 30, in to_string
    text = self._get_string_representation()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 45, in _get_string_representation
    strcols = self._get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 36, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 614, in get_strcols
    strcols = self._get_strcols_without_index()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 878, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 892, in format_col
    return format_array(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1295, in format_array
    return fmt_obj.get_result()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1328, in get_result
    fmt_values = self._format_strings()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1576, in _format_strings
    return list(self.get_result_as_array())
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1543, in get_result_as_array
    formatted_values = format_values_with(float_format)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1523, in format_values_with
    result = _trim_zeros_float(values, self.decimal)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1972, in _trim_zeros_float
    while should_trim(trimmed):
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in should_trim
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in <listcomp>
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1959, in is_number_with_decimal
    return re.match(number_regex, x) is not None
  File "/home/azuk/.conda/envs/pandas-dev/lib/python3.10/re.py", line 190, in match
    return _compile(pattern, flags).match(string)
TypeError: cannot use a string pattern on a bytes-like object
'''

# BUG Behavior 2:

df = pd.DataFrame(np.zeros((4, 1)))
# ok 
df['A'] = np.zeros((4, 1))
# no error, but one is expected here
# expected ValueError: Expected a 1D array, got an array with shape (4, 2)
df['A'] = np.zeros((4, 2))

Issue Description

The dimension of the input array is only checked in BlockManager.insert.

pandas/pandas/core/internals/managers.py

Lines 1404 to 1409 in f5a5c8d

    
        if value.ndim == 2:
            value = value.T
            if len(value) > 1:
                raise ValueError(
                    f"Expected a 1D array, got an array with shape {value.T.shape}"
                )

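The code checks only for 2-D ndarrays, so an ndarray with 3 or more dimensions is accepted on assignment and later crashes the DataFrame (for example, when it is printed).
The check also runs only when a column is inserted, so replacing the values of an existing column never raises an exception.

This issue has the same root cause as #51925.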

Expected Behavior

A ValueError should be raised in both situations.
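
As a rough illustration of the expected behavior, the check could cover every dimensionality rather than only the 2-D case. The sketch below is not pandas code: `ensure_1d_column` is a hypothetical helper name, and it only assumes that a single-column 2-D array should still be accepted, as in the `np.zeros((4, 1))` example above.

```python
import numpy as np


def ensure_1d_column(value: np.ndarray) -> np.ndarray:
    """Hypothetical helper (not part of pandas): accept only column-like arrays."""
    if value.ndim == 1:
        return value
    if value.ndim == 2 and value.shape[1] == 1:
        # A single-column 2-D array is unambiguous, so flatten it to 1-D.
        return value[:, 0]
    # Everything else (wider 2-D arrays, 3-D and above, 0-D) is rejected.
    raise ValueError(
        f"Expected a 1D array, got an array with shape {value.shape}"
    )
```

With a guard like this applied on both the insert path and the replace path, `np.zeros((4, 2))` and `np.zeros((4, 2, 3))` would be rejected in the same way regardless of whether column "A" already exists.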

Installed Versions

INSTALLED VERSIONS
------------------
commit : f5a5c8d
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-70-generic
Version : #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+828.gf5a5c8d7f0
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.33
pytest : 7.3.1
hypothesis : 6.75.3
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : 3.1.1
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.4.0
fsspec : 2023.5.0
gcsfs : 2023.5.0
matplotlib : 3.7.1
numba : 0.57.0
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.5.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.15
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.5.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : None
pyqt5 : None


adrien-berchet · 2024-02-24T10:19:45Z

I had the same issue, anything new on this?

Also, note that it only happens for numpy.ndarray objects. Casting the array to a list works properly:

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3))

crashes as reported while

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3)).tolist()

works and the DF is:

In [2]: df
Out[2]: 
     a                                   b
0  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
1  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
2  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
3  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

which is what I wanted to achieve before I got this issue.
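
If the cells should stay NumPy arrays rather than nested lists, one possible variation is to build a 1-D object array by hand and assign that. This is only a sketch under that assumption; the names `cells` and `restored` are made up for the example.

```python
import numpy as np
import pandas as pd

arr = np.zeros((4, 2, 3))

# Build a 1-D object array so that each cell holds one (2, 3) sub-array.
cells = np.empty(len(arr), dtype=object)
for i, sub in enumerate(arr):
    cells[i] = sub

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = cells  # a 1-D object array is a valid column

# Stack the cells to recover the original 3-D array.
restored = np.stack(list(df["b"]))
```

Each cell of column "b" is then a (2, 3) ndarray, at the cost of an object-dtype column that pandas cannot vectorize over.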

Finally, the constructor detects the problem and raises a more understandable error:

pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:653, in _extract_index(data)
    651         raw_lengths.append(len(val))
    652     elif isinstance(val, np.ndarray) and val.ndim > 1:
--> 653         raise ValueError("Per-column arrays must each be 1-dimensional")
    655 if not indexes and not raw_lengths:
    656     raise ValueError("If using all scalar values, you must pass an index")

ValueError: Per-column arrays must each be 1-dimensional

(and again, casting to a list also works properly in this case)

EDIT: I can reproduce this issue with pandas==1.5.3 and pandas==2.2.1.

determ1ne · 2024-02-27T02:43:08Z

The last code snippet, pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))}), behaved correctly and raised the appropriate error. Casting to a list works because pandas always treats a list as 1-D, storing each nested list as a single object.
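
A small check of the point about lists, using the same shapes as in the examples above (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

arr = np.zeros((4, 2, 3))

# The nested list is seen as a sequence of 4 Python objects,
# so the resulting Series is 1-D with object dtype.
s = pd.Series(arr.tolist())
print(s.ndim, s.dtype, len(s))

# Each cell is one nested list of 2 rows by 3 values.
first_cell = s.iloc[0]
print(len(first_cell), len(first_cell[0]))
```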

I didn't dig into how pandas handles the memory when I opened PR #53367, but the PR does block the invalid DataFrame.__setitem__ calls, consistent with the ValueError you mentioned.

ebo · 2024-06-07T19:39:29Z

I have started working with CryoSat-2 data, where the RADAR waveform is 3D. While I can convert it with ndarray.tolist, it would be nice if there were some way to use non-1-dimensional data within a DataFrame.

For reference, here is a trivial script to replicate. The data is available from ESA https://earth.esa.int/eogateway/missions/cryosat/data.

import pandas as pd
from netCDF4 import Dataset

fname = "CS_OFFL_SIR_SAR_1B_20220302T004121_20220302T004737_E001.nc"
with Dataset(fname, mode='r') as CS:
    # Passing the multi-dimensional waveform array directly raises the ValueError
    test_df = pd.DataFrame({
        'Waveform': CS.variables['pwr_waveform_20_ku'][:]
    })

This gives the error: "ValueError: Per-column arrays must each be 1-dimensional"

If anyone knows of any tricks to get pandas to work with multidimensional data without casting to a list, please let me know.
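
One commonly used approach, sketched here under assumptions, is to flatten the array into a long-format column indexed by a MultiIndex with one level per axis. The stand-in array, its shape, and the level names ("record", "burst", "sample") are invented for the example and are not taken from the CryoSat-2 file; the sketch also assumes the variable has already been read into a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Stand-in for a 3-D waveform variable; the shape is made up.
waveform = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)

# One index level per array axis; from_product matches C-order ravel().
index = pd.MultiIndex.from_product(
    [range(n) for n in waveform.shape],
    names=["record", "burst", "sample"],
)
long_df = pd.DataFrame({"Waveform": waveform.ravel()}, index=index)

# Look up a single element by its three coordinates.
value = long_df["Waveform"].loc[(1, 0, 2)]

# Reshape the column to get the original array back.
restored = long_df["Waveform"].to_numpy().reshape(waveform.shape)
```

This keeps a numeric dtype for the column, at the cost of one row per array element instead of one row per record.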

determ1ne added the Bug and Needs Triage labels May 24, 2023

determ1ne added a commit to determ1ne/pandas that referenced this issue May 24, 2023

DOC: Fixing pandas-dev#51925 and pandas-dev#53366 (b821bd1)

determ1ne mentioned this issue May 24, 2023

BUG: DataFrame bugs in ndarray assignments #53367 (Closed)

rhshadrach added the Indexing label May 24, 2023

lithomas1 added the Error Reporting label and removed the Needs Triage label May 24, 2023
