BUG: setting dtype='timestamp[ns][pyarrow]' removes precision from strings #49767

Open
2 of 3 tasks
MarcoGorelli opened this issue Nov 18, 2022 · 7 comments
Labels
Arrow (pyarrow functionality), Bug, datetime.date (stdlib datetime.date support), ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@MarcoGorelli
Member

MarcoGorelli commented Nov 18, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import *
import datetime as dt
import pyarrow as pa

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
print(ser)
print()
ser = pd.Series(data, dtype='datetime64[ns]')
print(ser)

Issue Description

The above outputs:

0    2020-01-01 01:01:01.000001
1           1999-01-01 00:00:00
dtype: timestamp[ns][pyarrow]

0   2020-01-01 01:01:01.000001
1   1999-01-01 00:00:00.000000
dtype: datetime64[ns]

Note how, in the pyarrow case, the two values are formatted differently: the second row is printed without fractional seconds.

This causes issues when parsing with to_datetime and specifying format=:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: string_data = ser.to_csv(index=False, header=None).splitlines()
   ...: pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [2], line 3
      1 ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
      2 string_data = ser.to_csv(index=False, header=None).splitlines()
----> 3 pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1110, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1108         result = _convert_and_box_cache(argc, cache_array)
   1109     else:
-> 1110         result = convert_listlike(argc, format)
   1111 else:
   1112     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:436, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    433         return res
    435 utc = tz == "utc"
--> 436 result, tz_parsed = objects_to_datetime64ns(
    437     arg,
    438     dayfirst=dayfirst,
    439     yearfirst=yearfirst,
    440     utc=utc,
    441     errors=errors,
    442     require_iso8601=require_iso8601,
    443     allow_object=True,
    444     format=format,
    445     exact=exact,
    446 )
    448 if tz_parsed is not None:
    449     # We can take a shortcut since the datetime64 numpy array
    450     # is in UTC
    451     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2144, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2142 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2143 try:
-> 2144     result, tz_parsed = tslib.array_to_datetime(
   2145         data.ravel("K"),
   2146         errors=errors,
   2147         utc=utc,
   2148         dayfirst=dayfirst,
   2149         yearfirst=yearfirst,
   2150         require_iso8601=require_iso8601,
   2151         format=format,
   2152         exact=exact,
   2153     )
   2154     result = result.reshape(data.shape, order=order)
   2155 except OverflowError as err:
   2156     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:442, in pandas._libs.tslib.array_to_datetime()
    440 @cython.wraparound(False)
    441 @cython.boundscheck(False)
--> 442 cpdef array_to_datetime(
    443     ndarray[object] values,
    444     str errors='raise',

File ~/pandas-dev/pandas/_libs/tslib.pyx:622, in pandas._libs.tslib.array_to_datetime()
    620     continue
    621 elif is_raise:
--> 622     raise ValueError(
    623         f"time data \"{val}\" at position {i} doesn't "
    624         f"match format \"{format}\""

ValueError: time data "1999-01-01 00:00:00" at position 1 doesn't match format "%Y-%m-%d %H:%M:%S.%f"
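
The failure comes down to strict format matching: a string with no fractional part cannot match a format that ends in `.%f`. A minimal sketch of the same mismatch using only the standard library, assuming `datetime.strptime` as a stand-in for the strict parsing path (it is not pandas' implementation); `parse_strict` and `parse_lenient` are hypothetical helper names:

```python
import datetime as dt

FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_strict(value: str) -> dt.datetime:
    # Requires the fractional-seconds part to be present.
    return dt.datetime.strptime(value, FMT)


def parse_lenient(value: str) -> dt.datetime:
    # fromisoformat accepts the value with or without fractional seconds.
    return dt.datetime.fromisoformat(value)


# The two strings produced for the pyarrow-backed Series above.
strings = ["2020-01-01 01:01:01.000001", "1999-01-01 00:00:00"]

try:
    [parse_strict(s) for s in strings]
    strict_failed = False
except ValueError:
    # The second string has no ".%f" part, so strict parsing raises.
    strict_failed = True

parsed = [parse_lenient(s) for s in strings]
```

The lenient path only sidesteps the symptom on the reading side; the inconsistency in how the values are written out remains.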

Expected Behavior

If it printed

0    2020-01-01 01:01:01.000001
1    1999-01-01 00:00:00.000000
dtype: timestamp[ns][pyarrow]

just like it does for datetime64[ns], then the issue would be solved.

Installed Versions

INSTALLED VERSIONS

commit : 8020bf1
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+697.g8020bf1b25.dirty
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.56.4
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.3
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.43
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

@MarcoGorelli added the Bug and Needs Triage labels on Nov 18, 2022
@MarcoGorelli
Member Author

This only became apparent once the long-standing issue of to_datetime not respecting exact for ISO8601 formats was fixed (#49333).

@MarcoGorelli
Member Author

MarcoGorelli commented Nov 18, 2022

Much simpler reproducer of the underlying issue:

In [1]: str(Timestamp(2020, 1, 1, 1, 1, 1, 1))
Out[1]: '2020-01-01 01:01:01.000001'

In [2]: str(Timestamp(2020, 1, 1, 1))
Out[2]: '2020-01-01 01:00:00'

Same thing happens in the standard library:

In [5]: str(dt.datetime(2000, 1, 1, 1, 1, 1))
Out[5]: '2000-01-01 01:01:01'

In [6]: str(dt.datetime(2000, 1, 1, 1, 1, 1, 123))
Out[6]: '2000-01-01 01:01:01.000123'
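
For comparison, the standard library can be asked to always emit the fractional part. A small sketch using `datetime.isoformat` with `timespec="microseconds"`; the variable names are illustrative only, and this says nothing about how pandas should implement the fix:

```python
import datetime as dt

values = [
    dt.datetime(2000, 1, 1, 1, 1, 1),       # microsecond == 0
    dt.datetime(2000, 1, 1, 1, 1, 1, 123),  # microsecond == 123
]

# str() omits the fractional part when microsecond is zero.
default_strings = [str(v) for v in values]

# timespec="microseconds" pads every value to six fractional digits.
padded_strings = [v.isoformat(sep=" ", timespec="microseconds") for v in values]
```

With the padded form, every string matches `%Y-%m-%d %H:%M:%S.%f`.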

Could it be that

raise TypeError(
"ExtensionArray formatting should use ExtensionArrayFormatter"
)

should always print with %Y-%m-%d %H:%M:%S.%f?
Maybe this only needs doing when saving to csv?
Will look into this more

@MarcoGorelli
Member Author

Looks like date_format in to_csv doesn't work with the arrow type either:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser1 = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: ser2 = pd.Series(data, dtype='datetime64[ns]')
   ...: 

In [3]: ser1.to_csv(date_format='%d-%m-%Y')
Out[3]: ',0\n0,2020-01-01 01:01:01.000001\n1,1999-01-01 00:00:00\n'

In [4]: ser2.to_csv(date_format='%d-%m-%Y')
Out[4]: ',0\n0,01-01-2020\n1,01-01-1999\n'
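
To make the expected output concrete, here is a sketch that applies the date format to each value explicitly before writing. It uses only the standard library `csv` module rather than pandas' `to_csv`, and `to_csv_text` is a hypothetical helper, so treat it as an illustration of the intended result and not as a fix:

```python
import csv
import datetime as dt
import io

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]


def to_csv_text(values, date_format):
    """Write an index column and one value column, formatting each datetime."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["", "0"])  # header: unnamed index, column "0"
    for position, value in enumerate(values):
        writer.writerow([position, value.strftime(date_format)])
    return buffer.getvalue()


csv_text = to_csv_text(data, "%d-%m-%Y")
```

The resulting text matches what the datetime64[ns] Series produces in Out[4].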

@MarcoGorelli
Member Author

I think this might be the issue

if isinstance(arr, np.ndarray):

here arr.dtype.kind is 'M', not sure why it's excluded.

Would

-    if isinstance(arr, np.ndarray):
+    if isinstance(arr, np.ndarray) or is_extension_array_dtype(arr.dtype):

work? Will try

@mroeschke
Member

ensure_wrapped_if_datetimelike is called in a lot of places, so I'm not sure if it will always be valid to convert an ArrowExtensionArray (with a datelike type) to a DatetimeArray/TimedeltaArray

cc @jbrockmendel

@jbrockmendel
Member

haven't looked closely at the rest of the thread, but ensure_wrapped_if_datetimelike should only wrap np.ndarrays

@MarcoGorelli
Member Author

ok thanks, I'll see where else this could be handled

@rhshadrach added the ExtensionArray, datetime.date, and Arrow labels and removed the Needs Triage label on Nov 20, 2022