BUG: Cannot append to DataFrame with Timestamp column with non-nanosecond unit #55374


Closed
3 tasks done
AdrianDAlessandro opened this issue Oct 3, 2023 · 9 comments
Labels
Bug, Non-Nano (datetime64/timedelta64 with non-nanosecond resolution)

Comments

@AdrianDAlessandro
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({"time": pd.Timestamp(1513393355, unit="us"), "A": [0]})  # Note the microsecond unit
df.loc[1] = df.loc[0] # <-- This raises a ValueError for incompatible shapes

Issue Description

Appending a row to a DataFrame with .loc is broken when one of the columns contains a timestamp whose unit is not nanoseconds. Attempting to do so raises a ValueError.
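As a workaround sketch (assuming, per the report above, that only non-nanosecond resolutions are affected), casting the timestamp column to nanosecond resolution before appending avoids the error:

```python
import pandas as pd

# Workaround sketch (assumption: only non-nanosecond resolutions trigger
# the bug): cast the column to "ns" before appending with .loc.
df = pd.DataFrame({"time": pd.Timestamp(1513393355, unit="us"), "A": [0]})
df["time"] = df["time"].astype("datetime64[ns]")
df.loc[1] = df.loc[0]  # no longer raises
```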

Expected Behavior

The new row should be appended to the DataFrame successfully. The equivalent example below, using the nanosecond unit, works:

import pandas as pd
df = pd.DataFrame({"time": pd.Timestamp(1513393355, unit="ns"), "A": [0]})  # Note the nanosecond unit
df.loc[1] = df.loc[0]

Installed Versions

INSTALLED VERSIONS

commit : 77bc67a
python : 3.11.1.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0+untagged.33100.g77bc67a.dirty
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.0
hypothesis : 6.82.5
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.7.0
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : 1.2.2
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.6.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : None
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@AdrianDAlessandro AdrianDAlessandro added the Bug and Needs Triage labels Oct 3, 2023
AdrianDAlessandro added a commit to ImperialCollegeLondon/gridlington-datahub that referenced this issue Oct 3, 2023
@jadelsoufi

I could reproduce the issue and found that the error disappears if the timestamp is greater than pandas.Timestamp.max, which equals Timestamp('2262-04-11 23:47:16.854775807'), whatever the unit ('m', 's', 'ms', ...).

So, for example, the code below with the 'us' unit works fine:

df = pd.DataFrame({"time": pd.Timestamp(9.3e15, unit="us"), "A": [0]})
df.loc[1] = df.loc[0]

I know it doesn't help much, but it may give some direction for the investigation.

@AdrianDAlessandro
Contributor Author

I should say that when I first dug into this error, it occurred in the BlockManager inside the concat function. But concat doesn't raise the same error when used directly. It seems the blocks are computed differently when assigning a Series to a DataFrame row with .loc:

>>> import pandas as pd
>>> df = pd.DataFrame({"time": pd.Timestamp(1513393355, unit="us"), "A":[0]})
>>> pd.concat([df,df])
                        time  A
0 1970-01-01 00:25:13.393355  0
0 1970-01-01 00:25:13.393355  0
>>> df.loc[1] = df.loc[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 885, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 1883, in _setitem_with_indexer
    self._setitem_with_indexer_missing(indexer, value)
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 2241, in _setitem_with_indexer_missing
    self.obj._mgr = self.obj._append(value)._mgr
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 10227, in _append
    result = concat(
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 393, in concat
    return op.get_result()
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 680, in get_result
    new_data = concatenate_managers(
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/internals/concat.py", line 199, in concatenate_managers
    return BlockManager(tuple(blocks), axes)
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 916, in __init__
    self._verify_integrity()
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 923, in _verify_integrity
    raise_construction_error(tot_items, block.shape[1:], self.axes)
  File "/.../.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 2118, in raise_construction_error
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (1, 2), indices imply (2, 2)
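Since concat works when called directly, a sketch of an append that routes through pd.concat instead of the .loc enlargement path (a workaround under that observation, not a fix):

```python
import pandas as pd

# Append the first row by concatenating it back on, bypassing the
# .loc enlargement path that raises for non-nanosecond timestamps.
df = pd.DataFrame({"time": pd.Timestamp(1513393355, unit="us"), "A": [0]})
out = pd.concat([df, df.loc[[0]]], ignore_index=True)
print(out)
```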

@jadelsoufi

jadelsoufi commented Oct 9, 2023

Currently investigating the concat.py files: dtype datetime64[ns] is always assigned to self.objs[1]._mgr, whatever the unit of the pd.Timestamp() instance, in pandas/core/reshape/concat.py around line 665:

            for obj in self.objs:
                indexers = {}
                for ax, new_labels in enumerate(self.new_axes):
                    # ::-1 to convert BlockManager ax to DataFrame ax
                    if ax == self.bm_axis:
                        # Suppress reindexing on concat axis
                        continue

                    # 1-ax to convert BlockManager axis to DataFrame axis
                    obj_labels = obj.axes[1 - ax]
                    if not new_labels.equals(obj_labels):
                        indexers[ax] = obj_labels.get_indexer(new_labels)

                mgrs_indexers.append((obj._mgr, indexers))
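For reference, a small illustration of the get_indexer call in that loop (hypothetical labels, not taken from the reproducer): it maps the new axis labels onto positions in the object's existing axis, with -1 marking labels that would need reindexing.

```python
import pandas as pd

# get_indexer maps target labels onto positions in the existing index;
# -1 marks labels absent from it (these drive the reindexing above).
obj_labels = pd.Index(["time", "A"])
new_labels = pd.Index(["time", "A", "B"])
print(obj_labels.get_indexer(new_labels))  # [ 0  1 -1]
```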

@jadelsoufi

take

@maxdolle

Not necessarily a duplicate but pretty similar to #55067

@rachtsingh

I think I have a similar issue, with pandas 2.1.1 and pyarrow 13.0.0:

ipdb> for c in inputs:
    print(c.dtypes)

ts       datetime64[us, UTC]
ident                 object
val                   object
dtype: object
ts       datetime64[us, UTC]
ident                 object
val                   object
dtype: object
ts       datetime64[ns, UTC]
ident         string[python]
val                  Float32
dtype: object
ts       datetime64[us, UTC]
ident         string[python]
val                  Float64
dtype: object
ts       datetime64[us, UTC]
ident                 object
val                  Float64
dtype: object
ts       datetime64[us, UTC]
ident                 object
val                  Float64
dtype: object

ipdb> print(pd.concat(inputs).dtypes)
ts       object
ident    object
val      object
dtype: object

Let me know if I should open a new issue in case this is substantially different; I know some of these have already been addressed in main (and I haven't had a chance to check main), so I didn't want to spam.

@rachtsingh

Ah, this is a bug in pyarrow==13.0.0; downgrading to 12.0.0 solves it.

@jorisvandenbossche jorisvandenbossche added the Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) label and removed the Needs Triage label Oct 26, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.1.3 milestone Oct 26, 2023
@jorisvandenbossche jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023
@jbrockmendel
Member

No longer raises on main, possibly closed by #53641?

@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@mroeschke
Member

Looks to have been addressed by #53641 so closing
