
BUG: datetime ExtensionDtype do not work with DataFrame #35767

Closed
3 tasks done
marco-neumann-by opened this issue Aug 17, 2020 · 7 comments
Labels
Bug; Closing Candidate (May be closeable, needs more eyeballs); ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@marco-neumann-by

  • I have checked that this issue has not already been reported. (at least I couldn't find one)

  • I have confirmed this bug exists on the latest version of pandas. (1.1.0)

  • (optional) I have confirmed this bug exists on the master branch of pandas. (934e9f840ebd2e8b5a5181b19a23e033bd3985a5)


Code Sample, a copy-pastable example

This is a high-level example that led to the investigation. It relies on rle-array (commit dfa79295a580d533ee9d2ea901e8808496dbcdc9 was used), because the pandas-provided DatetimeArray uses either a NumPy dtype or DatetimeTZDtype, and both of those cases somewhat work (see "Problem description").

import pandas as pd
from rle_array import RLEArray

array = RLEArray._from_sequence([], dtype="datetime64[ns]")
df = pd.DataFrame({"x": array})
Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    pd.DataFrame({"x": array})
  File ".../lib/python3.8/site-packages/pandas/core/frame.py", line 467, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 283, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1703, in form_blocks
    block_type = get_block_type(v)
  File ".../lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2672, in get_block_type
    assert not is_datetime64tz_dtype(values.dtype)
AssertionError

Problem description

See here:

def get_block_type(values, dtype=None):
    """
    Find the appropriate Block subclass to use for the given values and dtype.

    Parameters
    ----------
    values : ndarray-like
    dtype : numpy or pandas dtype

    Returns
    -------
    cls : class, subclass of Block
    """
    dtype = dtype or values.dtype
    vtype = dtype.type

    if is_sparse(dtype):
        # Need this first(ish) so that Sparse[datetime] is sparse
        cls = ExtensionBlock
    elif is_categorical_dtype(values.dtype):
        cls = CategoricalBlock
    elif issubclass(vtype, np.datetime64):
        assert not is_datetime64tz_dtype(values.dtype)
        cls = DatetimeBlock
    elif is_datetime64tz_dtype(values.dtype):
        cls = DatetimeTZBlock
    elif is_interval_dtype(dtype) or is_period_dtype(dtype):
        cls = ObjectValuesExtensionBlock
    elif is_extension_array_dtype(values.dtype):
        cls = ExtensionBlock
    elif issubclass(vtype, np.floating):
        cls = FloatBlock
    elif issubclass(vtype, np.timedelta64):
        assert issubclass(vtype, np.integer)
        cls = TimeDeltaBlock
    elif issubclass(vtype, np.complexfloating):
        cls = ComplexBlock
    elif issubclass(vtype, np.integer):
        cls = IntBlock
    elif dtype == np.bool_:
        cls = BoolBlock
    else:
        cls = ObjectBlock
    return cls

Datetime (and also interval) types are checked BEFORE extension types, which means that extension datetime types never end up in an ExtensionBlock. Ending up in an ExtensionBlock would be useful if:

  • the datetime objects are not compatible with NumPy
  • the data should not be converted to NumPy (e.g. due to compression, like in the rle-array case)
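
To make the ordering problem concrete, here is a minimal sketch that reduces the dispatch to two checks. It does not use pandas or NumPy: every class and the `pick_block` function are hypothetical stand-ins invented for illustration, and only the order of the checks is taken from the snippet above.

```python
# Toy model only: none of these classes are real pandas or NumPy objects.
class FakeNumpyDatetime64:
    """Plays the role of the scalar type np.datetime64."""


class FakeExtensionDtype:
    """Plays the role of the ExtensionDtype base class."""
    type = object


class FakeRLEDtype(FakeExtensionDtype):
    """Hypothetical extension dtype whose scalar type is datetime-like."""
    type = FakeNumpyDatetime64


def pick_block(dtype):
    """Mimic the branch order of get_block_type, reduced to two checks."""
    if issubclass(dtype.type, FakeNumpyDatetime64):
        # Reached first, based on the scalar type alone.
        return "DatetimeBlock"
    if isinstance(dtype, FakeExtensionDtype):
        # Never reached for extension dtypes with a datetime scalar type.
        return "ExtensionBlock"
    return "ObjectBlock"


print(pick_block(FakeRLEDtype()))        # DatetimeBlock
print(pick_block(FakeExtensionDtype()))  # ExtensionBlock
```

In this model the extension dtype with a datetime scalar type is claimed by the first branch, so the extension branch never sees it.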

Furthermore the invariant issubclass(vtype, np.datetime64) => not is_datetime64tz_dtype(values.dtype) does NOT hold for all extension dtypes, at least not under the current implementation of is_datetime64tz_dtype:

if isinstance(arr_or_dtype, ExtensionDtype):
    # GH#33400 fastpath for dtype object
    return arr_or_dtype.kind == "M"
if arr_or_dtype is None:
    return False
return DatetimeTZDtype.is_dtype(arr_or_dtype)
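
A small stand-alone sketch of why this fastpath can break the invariant. The classes and the function `is_tz_fastpath` are hypothetical stand-ins, not pandas code. The only thing copied from the snippet above is the rule "for any extension dtype, answer with kind == "M"". The sketch assumes a third-party dtype that reports kind "M" without carrying a timezone, which is presumably what happens in the rle-array case.

```python
# Toy model only: stand-ins for illustration, not pandas classes.
class FakeExtensionDtype:
    """Plays the role of the ExtensionDtype base class."""
    kind = "O"


class FakeNaiveWrapperDtype(FakeExtensionDtype):
    """Hypothetical third-party dtype wrapping tz-naive datetimes."""
    kind = "M"  # datetime-like, but there is no timezone attached


def is_tz_fastpath(dtype):
    """Mirror the quoted fastpath: trust kind == "M" for extension dtypes."""
    if isinstance(dtype, FakeExtensionDtype):
        return dtype.kind == "M"
    if dtype is None:
        return False
    return False  # the real code defers to DatetimeTZDtype.is_dtype here


# The tz-naive wrapper is reported as tz-aware, so an assertion of the form
# "datetime scalar type implies not tz-aware" fails for it.
print(is_tz_fastpath(FakeNaiveWrapperDtype()))  # True
print(is_tz_fastpath(FakeExtensionDtype()))     # False
```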

Expected Output

The code example works and df._data shows that the data ends up in an ExtensionBlock.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.0
Cython           : None
pytest           : 6.0.1
hypothesis       : None
sphinx           : 3.2.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.16.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.50.1
@marco-neumann-by added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Aug 17, 2020
@sbrugman
Contributor

Related to #35762

@jorisvandenbossche
Member

@marco-neumann-by thanks for the report!

I know this is messy in pandas itself (with DatetimeArray sometimes having a numpy dtype and sometimes a DatetimeTZDtype), but in principle your extension array should always have an ExtensionDtype. If that is not the case, pandas cannot guarantee that it will work (or rather it is guaranteed to not work, since we rely on the dtype for several aspects of EA functionality).

I suppose in this case the RLEArray is a wrapper of other ExtensionArrays, and therefore wrapping DatetimeArray? But the dtype of the RLEArray, you still control? (and thus can ensure to return an extension dtype?)

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 17, 2020
@jorisvandenbossche
Member

@sbrugman I think these are two different issues. The one you linked is specifically related to the Sparse dtype (and the problem that Sparse[datetime64] is apparently regarded as a datetime-like dtype), which is a special case in pandas and is not very well tested or supported. But it is not related to DatetimeArray being a non-proper ExtensionArray with a numpy dtype (in the case of tz-naive data).
(But let's discuss the details of that issue in #35762.)

@marco-neumann-by
Author

@jorisvandenbossche RLEArray has its own dtype (RLEDtype("datetime64[ns]") in the example above). Please look at the linked get_block_type code snippet: the issue is that vtype (which is NOT an extension dtype and to my knowledge always refers to NumPy types) is checked before is_extension_array_dtype(values.dtype).

@jorisvandenbossche
Member

Ah, I missed the difference between dtype and dtype.type in the checks in get_block_type. That's indeed not something we should make assumptions about (basically we now assume that any dtype with a dtype.type of np.datetime64 will be a numpy datetime64[ns] dtype).

I am not fully sure why the check is written the way it is now. We should probably test what fails if we change it to a stricter check for the datetime64[ns] dtype.

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Aug 18, 2020
@marco-neumann-by
Author

I had a look at the code in get_block_type: re-ordering it doesn't work trivially, because the pandas-provided datetime extension types would otherwise end up in an ExtensionBlock, which would break a lot of things. So we have the following "conflict":

  • DatetimeArray is implemented in a way that it relies on DatetimeBlock/DatetimeTZBlock but at the same time has an extension dtype
  • external extension arrays (even when they hold datetime data) probably want to end up in ExtensionBlock

So I think either DatetimeArray needs some changes, or some special handling specifically for DatetimeArray needs to be added to get_block_type.
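
One way to picture the "special handling" option is the toy dispatch below. It is only a sketch under stated assumptions, not a proposed patch: all classes and `pick_block` are hypothetical stand-ins, and the real get_block_type has many more branches. The idea is to match the built-in datetime dtypes by their exact class first, then route every other extension dtype to the extension branch, and only then fall back to the scalar-type check.

```python
# Toy model only: stand-ins for illustration, not pandas or NumPy classes.
class FakeNumpyDatetime64:
    """Plays the role of the scalar type np.datetime64."""


class FakeNumpyDatetimeDtype:
    """Plays the role of a plain NumPy datetime64[ns] dtype (not an extension)."""
    type = FakeNumpyDatetime64


class FakeExtensionDtype:
    """Plays the role of the ExtensionDtype base class."""
    type = object


class FakeBuiltinTZDtype(FakeExtensionDtype):
    """Plays the role of the built-in tz-aware datetime dtype."""
    type = FakeNumpyDatetime64


class FakeRLEDtype(FakeExtensionDtype):
    """Hypothetical third-party extension dtype holding datetime data."""
    type = FakeNumpyDatetime64


def pick_block(dtype):
    """Special-case the built-in dtype, then extensions, then scalar types."""
    if isinstance(dtype, FakeBuiltinTZDtype):
        return "DatetimeTZBlock"   # built-in type keeps its dedicated block
    if isinstance(dtype, FakeExtensionDtype):
        return "ExtensionBlock"    # external datetime-like dtypes land here
    if issubclass(dtype.type, FakeNumpyDatetime64):
        return "DatetimeBlock"     # only plain (non-extension) dtypes remain
    return "ObjectBlock"


print(pick_block(FakeBuiltinTZDtype()))      # DatetimeTZBlock
print(pick_block(FakeRLEDtype()))            # ExtensionBlock
print(pick_block(FakeNumpyDatetimeDtype()))  # DatetimeBlock
```

Whether this ordering is workable inside pandas depends on details not covered here, such as how DatetimeArray with a NumPy dtype is detected.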

@mroeschke mroeschke removed this from the 2.0 milestone Jan 13, 2023
@jbrockmendel
Member

get_block_type has been re-worked since the OP. Does the issue persist?

@jbrockmendel added the Closing Candidate (May be closeable, needs more eyeballs) label on Jul 28, 2023
5 participants