BUG: AssertionError: Number of Block dimensions (1) must equal number of axes (2) when typing a column #35460

Closed
2 of 3 tasks
frd-glovo opened this issue Jul 29, 2020 · 3 comments · Fixed by #36115
Labels: Bug, Datetime, Regression, Timezones

@frd-glovo

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"e1_timestamp": [], "e2_timestamp": []})

# The problematic code. Remove this and there are no exceptions.
for timefield in ["e1_timestamp", "e2_timestamp"]:
    df[timefield] = pd.to_datetime(df[timefield], utc=True)

df = df.append(
    {
        "orderId": 1,
        "e1_timestamp": pd.to_datetime(pd.NaT, utc=True),
        "e2_timestamp": pd.to_datetime(pd.NaT, utc=True),
    },
    ignore_index=True,
)
df

Problem description

When initializing an empty DataFrame, setting its column types as shown in the example, and then appending a row, pandas raises AssertionError: Number of Block dimensions (1) must equal number of axes (2).

If we remove the for loop, it does work:

df = pd.DataFrame({"e1_timestamp": [], "e2_timestamp": []})

df = df.append(
    {
        "e1_timestamp": pd.to_datetime(pd.NaT, utc=True),
        "e2_timestamp": pd.to_datetime(pd.NaT, utc=True),
    },
    ignore_index=True,
)
df

We see that we could do df = df.astype({"e1_timestamp": 'datetime64[ns]', "e2_timestamp": 'datetime64[ns]'}) instead of the for loop, but there seems to be a bug in this new version: with pandas==1.0.5 the for loop does not raise an exception.
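One way to avoid appending onto an empty tz-aware frame, sketched here as an assumption and not as the fix adopted for this issue, is to collect the rows first, build the DataFrame once, and only then convert the timestamp columns to UTC. The `rows` list and the column order are illustrative.

```python
import pandas as pd

# Illustrative rows; in real code these would be accumulated as events arrive.
rows = [
    {"orderId": 1, "e1_timestamp": pd.NaT, "e2_timestamp": pd.NaT},
]

# Build the frame in one step instead of appending to an empty, typed frame.
df = pd.DataFrame(rows, columns=["orderId", "e1_timestamp", "e2_timestamp"])

# Convert the timestamp columns to tz-aware UTC after the data is in place.
for col in ("e1_timestamp", "e2_timestamp"):
    df[col] = pd.to_datetime(df[col], utc=True)

print(df.dtypes)
```

Because the conversion happens on a non-empty frame, this pattern never goes through the empty-frame append path described in the report.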

Expected Output

There should be no assertion error

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
Version : #1 SMP Tue Jul 2 22:58:16 UTC 2019
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.16.0
pytz : 2019.1
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.4.0
Cython : None
pytest : 4.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.0
sqlalchemy : 1.2.16
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
numba : None

@frd-glovo added the Bug and Needs Triage labels Jul 29, 2020
@frd-glovo
Author

Actually, it does not seem to be due to the for loop. This fails too:

import pandas as pd

df = pd.DataFrame({"e1_timestamp": [], "e2_timestamp": []})

# The problematic code. Remove this and there are no exceptions.
timefields = ["e1_timestamp", "e2_timestamp"]
df = df.astype({field: "datetime64[ns, UTC]" for field in timefields})

df = df.append(
    {
        "orderId": 1,
        "e1_timestamp": pd.to_datetime(pd.NaT, utc=True),
        "e2_timestamp": pd.to_datetime(pd.NaT, utc=True),
    },
    ignore_index=True,
)
df

However, if we remove the UTC bit in the type assignment:

import pandas as pd

df = pd.DataFrame({"e1_timestamp": [], "e2_timestamp": []})
timefields = ["e1_timestamp", "e2_timestamp"]

# Not using datetime64[ns, UTC] makes the exception disappear.
df = df.astype({field: "datetime64[ns]" for field in timefields})

df = df.append(
    {
        "orderId": 1,
        "e1_timestamp": pd.to_datetime(pd.NaT, utc=True),
        "e2_timestamp": pd.to_datetime(pd.NaT, utc=True),
    },
    ignore_index=True,
)
df

it no longer raises. However, the dtype of e1_timestamp is then datetime64[ns] instead of the desired datetime64[ns, UTC].
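If the tz-naive route is used as a stopgap, the UTC dtype can be restored afterwards with `Series.dt.tz_localize`. The sketch below assumes the naive values already represent UTC instants. The one-row frame is built directly here as a stand-in for the appended result.

```python
import pandas as pd

# Stand-in for the tz-naive result: one row of missing timestamps.
df = pd.DataFrame(
    {"orderId": [1], "e1_timestamp": [pd.NaT], "e2_timestamp": [pd.NaT]}
)
df = df.astype({"e1_timestamp": "datetime64[ns]", "e2_timestamp": "datetime64[ns]"})

# Attach the UTC timezone without shifting the stored values.
for col in ("e1_timestamp", "e2_timestamp"):
    df[col] = df[col].dt.tz_localize("UTC")

print(df.dtypes)
```

`tz_localize` only labels the existing values as UTC; if the naive values were in another timezone, they would need `tz_localize` with that zone followed by `tz_convert("UTC")`.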

@simonjayhawkins
Member

Thanks @frd-glovo for the report.

A simpler example with the same error:

>>> import pandas as pd
>>>
>>> pd.__version__
'1.2.0.dev0+10.g3b1d4f1ee'
>>>
>>> df = pd.DataFrame(columns=["a"]).astype("datetime64[ns, UTC]")
>>>
>>> df.append({"a": pd.NaT,}, ignore_index=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\frame.py", line 7737, in append
    return concat(
  File "C:\Users\simon\pandas\pandas\core\reshape\concat.py", line 287, in concat
    return op.get_result()
  File "C:\Users\simon\pandas\pandas\core\reshape\concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "C:\Users\simon\pandas\pandas\core\internals\concat.py", line 84, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "C:\Users\simon\pandas\pandas\core\internals\managers.py", line 133, in __init__
    raise AssertionError(
AssertionError: Number of Block dimensions (1) must equal number of axes (2)
>>>

It appears that appending pd.NaT or a tz-naive timestamp to an empty DataFrame with a tz-aware column raises this error. tz-aware timestamps append OK. Values of other types coerce the column dtype.

On a non-empty DataFrame, appending pd.NaT also works.
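The distinction between tz-naive and tz-aware values can be checked directly. The sketch below uses a hypothetical helper, `is_tz_aware`, that is not part of pandas. It illustrates that `pd.to_datetime(pd.NaT, utc=True)`, the value appended in the original report, is still a missing value with no timezone attached, so it likely falls into the same case as plain `pd.NaT`.

```python
import pandas as pd


def is_tz_aware(value):
    """Hypothetical helper: True if a datetime-like scalar carries a timezone."""
    return getattr(value, "tzinfo", None) is not None


nat_utc = pd.to_datetime(pd.NaT, utc=True)    # value used in the original report
naive = pd.Timestamp("2020-07-29")            # tz-naive timestamp
aware = pd.Timestamp("2020-07-29", tz="UTC")  # tz-aware timestamp

for name, value in [("nat_utc", nat_utc), ("naive", naive), ("aware", aware)]:
    print(name, "missing:", pd.isna(value), "tz-aware:", is_tz_aware(value))
```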

@simonjayhawkins added the Regression, Datetime, and Timezones labels and removed the Needs Triage label Jul 29, 2020
@simonjayhawkins added this to the 1.1.1 milestone Jul 29, 2020
@simonjayhawkins
Member

#35038 caused this regression

6509028 is the first bad commit
commit 6509028
Author: Simon Hawkins
Date: Thu Jul 16 23:50:38 2020 +0100

BUG: DataFrame.append with empty DataFrame and Series with tz-aware datetime value allocated object column (#35038)
