BUG: setting dtype='timestamp[ns][pyarrow]' removes precision from strings #49767

MarcoGorelli · 2022-11-18T10:45:24Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime as dt

import pandas as pd  # the pyarrow dtype below requires pyarrow to be installed

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
print(ser)
print()
ser = pd.Series(data, dtype='datetime64[ns]')
print(ser)

Issue Description

The above outputs:

0    2020-01-01 01:01:01.000001
1           1999-01-01 00:00:00
dtype: timestamp[ns][pyarrow]

0   2020-01-01 01:01:01.000001
1   1999-01-01 00:00:00.000000
dtype: datetime64[ns]

Note how, in the pyarrow case, the two rows are formatted differently: the second row has no fractional seconds.

This causes issues when parsing with to_datetime and specifying format=:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: string_data = ser.to_csv(index=False, header=None).splitlines()
   ...: pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [2], line 3
      1 ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
      2 string_data = ser.to_csv(index=False, header=None).splitlines()
----> 3 pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1110, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1108         result = _convert_and_box_cache(argc, cache_array)
   1109     else:
-> 1110         result = convert_listlike(argc, format)
   1111 else:
   1112     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:436, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    433         return res
    435 utc = tz == "utc"
--> 436 result, tz_parsed = objects_to_datetime64ns(
    437     arg,
    438     dayfirst=dayfirst,
    439     yearfirst=yearfirst,
    440     utc=utc,
    441     errors=errors,
    442     require_iso8601=require_iso8601,
    443     allow_object=True,
    444     format=format,
    445     exact=exact,
    446 )
    448 if tz_parsed is not None:
    449     # We can take a shortcut since the datetime64 numpy array
    450     # is in UTC
    451     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2144, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2142 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2143 try:
-> 2144     result, tz_parsed = tslib.array_to_datetime(
   2145         data.ravel("K"),
   2146         errors=errors,
   2147         utc=utc,
   2148         dayfirst=dayfirst,
   2149         yearfirst=yearfirst,
   2150         require_iso8601=require_iso8601,
   2151         format=format,
   2152         exact=exact,
   2153     )
   2154     result = result.reshape(data.shape, order=order)
   2155 except OverflowError as err:
   2156     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:442, in pandas._libs.tslib.array_to_datetime()
    440 @cython.wraparound(False)
    441 @cython.boundscheck(False)
--> 442 cpdef array_to_datetime(
    443     ndarray[object] values,
    444     str errors='raise',

File ~/pandas-dev/pandas/_libs/tslib.pyx:622, in pandas._libs.tslib.array_to_datetime()
    620     continue
    621 elif is_raise:
--> 622     raise ValueError(
    623         f"time data \"{val}\" at position {i} doesn't "
    624         f"match format \"{format}\""

ValueError: time data "1999-01-01 00:00:00" at position 1 doesn't match format "%Y-%m-%d %H:%M:%S.%f"
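The same mismatch can be shown without pandas. The following is a minimal sketch using only the standard library: a single strict format containing `.%f` rejects the string that lacks fractional seconds. The helper `parse_mixed` is a hypothetical name introduced purely for illustration; it is not part of pandas and is not a proposed fix, only a way to see that the two strings need two different formats.

```python
import datetime as dt

FMT_WITH_FRACTION = "%Y-%m-%d %H:%M:%S.%f"
FMT_NO_FRACTION = "%Y-%m-%d %H:%M:%S"


def parse_mixed(value: str) -> dt.datetime:
    """Hypothetical helper: try the fractional format, then fall back."""
    try:
        return dt.datetime.strptime(value, FMT_WITH_FRACTION)
    except ValueError:
        return dt.datetime.strptime(value, FMT_NO_FRACTION)


# The two strings that the pyarrow-backed series wrote to CSV.
strings = ["2020-01-01 01:01:01.000001", "1999-01-01 00:00:00"]

# A single strict format fails on the second string, as in the traceback.
try:
    [dt.datetime.strptime(s, FMT_WITH_FRACTION) for s in strings]
    strict_parse_failed = False
except ValueError:
    strict_parse_failed = True

# Falling back to the shorter format parses both strings.
parsed = [parse_mixed(s) for s in strings]
```

A per-value fallback like this hides the inconsistency rather than removing it, which is why the expected behaviour below asks for uniform output instead.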

Expected Behavior

If it printed

0    2020-01-01 01:01:01.000001
1    1999-01-01 00:00:00.000000
dtype: timestamp[ns][pyarrow]

, just like it does for datetime64[ns], then the issue would be solved.

Installed Versions

INSTALLED VERSIONS

commit : 8020bf1
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+697.g8020bf1b25.dirty
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.56.4
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.3
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.43
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None


MarcoGorelli · 2022-11-18T10:47:04Z

This only became apparent once the long-standing issue of to_datetime not respecting exact for ISO8601 formats was fixed (#49333).

MarcoGorelli · 2022-11-18T17:05:43Z

Much simpler reproducer of the underlying issue:

In [1]: str(Timestamp(2020, 1, 1, 1, 1, 1, 1))
Out[1]: '2020-01-01 01:01:01.000001'

In [2]: str(Timestamp(2020, 1, 1, 1))
Out[2]: '2020-01-01 01:00:00'

Same thing happens in the standard library:

In [5]: str(dt.datetime(2000, 1, 1, 1, 1, 1))
Out[5]: '2000-01-01 01:01:01'

In [6]: str(dt.datetime(2000, 1, 1, 1, 1, 1, 123))
Out[6]: '2000-01-01 01:01:01.000123'

Could it be that

pandas/pandas/io/formats/format.py

Lines 1406 to 1408 in 1c51e60

    
    raise TypeError(
        "ExtensionArray formatting should use ExtensionArrayFormatter"
    )

should always print with %Y-%m-%d %H:%M:%S.%f?
Maybe this only needs doing when saving to csv?
Will look into this more
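To illustrate what "always print with %Y-%m-%d %H:%M:%S.%f" would mean, here is a small standard-library sketch comparing the default `str()` output with two explicit, fixed-width alternatives. It only demonstrates the formatting behaviour of `datetime.datetime`; whether pandas should apply such a format to pyarrow-backed data, and where, is exactly the open question in this thread.

```python
import datetime as dt

FIXED_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

values = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

# Default str() drops the fractional part when microsecond == 0.
default_strings = [str(v) for v in values]

# An explicit format always emits six fractional digits.
fixed_strings = [v.strftime(FIXED_FORMAT) for v in values]

# isoformat with timespec="microseconds" gives the same fixed-width result.
iso_strings = [v.isoformat(sep=" ", timespec="microseconds") for v in values]

print(default_strings)
print(fixed_strings)
```

With either explicit variant, every string matches a single format and can be parsed back with `format='%Y-%m-%d %H:%M:%S.%f'`.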

MarcoGorelli · 2022-11-18T17:23:59Z

Looks like date_format in to_csv doesn't work with the arrow type either:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser1 = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: ser2 = pd.Series(data, dtype='datetime64[ns]')
   ...: 

In [3]: ser1.to_csv(date_format='%d-%m-%Y')
Out[3]: ',0\n0,2020-01-01 01:01:01.000001\n1,1999-01-01 00:00:00\n'

In [4]: ser2.to_csv(date_format='%d-%m-%Y')
Out[4]: ',0\n0,01-01-2020\n1,01-01-1999\n'

MarcoGorelli · 2022-11-18T17:46:23Z

I think this might be the issue

pandas/pandas/core/construction.py

Line 467 in 2517199

if isinstance(arr, np.ndarray):

here arr.dtype.kind is 'M', not sure why it's excluded.

Would

-    if isinstance(arr, np.ndarray):
+    if isinstance(arr, np.ndarray) or is_extension_array_dtype(arr.dtype):

work? Will try

mroeschke · 2022-11-18T18:17:16Z

ensure_wrapped_if_datetimelike is called in a lot of places, so I'm not sure if it will always be valid to convert an ArrowExtensionArray (with a datelike type) to a DatetimeArray/TimedeltaArray

cc @jbrockmendel

jbrockmendel · 2022-11-18T18:19:06Z

haven't looked closely at the rest of the thread, but ensure_wrapped_if_datetimelike should only wrap np.ndarrays

MarcoGorelli · 2022-11-18T18:24:24Z

ok thanks, I'll see where else this could be handled

MarcoGorelli added the Bug and Needs Triage labels Nov 18, 2022

rhshadrach added the ExtensionArray, datetime.date, and Arrow labels and removed the Needs Triage label Nov 20, 2022

jbrockmendel mentioned this issue Jan 20, 2023

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings? #50411

Closed
