
BUG: inconsistent parsing between Timestamp and to_datetime #52167


Closed
3 tasks done
arnaudlegout opened this issue Mar 24, 2023 · 22 comments · Fixed by #52195
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@arnaudlegout
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

pd.Timestamp('10 june 2000 8:30')
>>>
Timestamp('2000-06-10 08:30:00')

pd.to_datetime('10 june 2000 8:30')
>>>
 UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime('10 june 2000 8:30')

Timestamp('2000-06-10 08:30:00')

Issue Description

to_datetime used to be able to parse dates with the month written out in English. There is no ambiguity in this case (the position of the day is not in doubt), but it now raises a UserWarning.

Surprisingly, a Timestamp can be constructed without any warning from the same string that raised a warning with to_datetime.

Expected Behavior

Ideally, I would like to have no user warning with a string such as '8 July 2010' or 'July 8 2012'.
At a minimum, to_datetime and Timestamp should warn on the same strings.

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.11.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

@arnaudlegout arnaudlegout added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 24, 2023
@arnaudlegout
Contributor Author

date_range also parses this correctly, without a UserWarning:
pd.date_range('10 june 2000 8:30', periods=2)

@MarcoGorelli
Member

thanks for the report

this is expected: to_datetime can work on an array of elements, so it needs to check that they can all be parsed with the same format, whereas Timestamp only parses a single element so there's no risk of inconsistency between elements

having said that, the warning could probably be extended to say

Could not infer format, so each element will be parsed individually, falling back to `dateutil`.
To ensure parsing is consistent and as-expected, please specify a format.
Or, to accept the current behavior, pass `format='mixed'`
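
As a minimal illustration of the "please specify a format" advice in the warning, the sketch below passes an explicit format for the string from the report. The format string `%d %B %Y %H:%M` is an assumption about how that input is laid out (it also assumes English month names in the active locale), not something taken from the pandas sources.

```python
import warnings

import pandas as pd

raw = "10 june 2000 8:30"
# Assumed layout of the string above: day, full month name, year, hour:minute.
fmt = "%d %B %Y %H:%M"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of hiding repeats
    parsed = pd.to_datetime(raw, format=fmt)

# Keep only the "Could not infer format" warnings, if any were emitted.
infer_warnings = [w for w in caught if "Could not infer format" in str(w.message)]
print(parsed, len(infer_warnings))
```

With an explicit format there is nothing left to infer, so the fallback path described in the warning should not be needed.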

@arnaudlegout
Contributor Author

arnaudlegout commented Mar 24, 2023

I still don't get it. I understand that to_datetime has to possibly parse a sequence of dates. It tries to infer the format from the first date and apply the same formatting to others (unless the risky format='mixed' option is used). But in my case, there is a single date passed to to_datetime and there is no ambiguity on this first date. I don't understand why there should be a UserWarning here.

@MarcoGorelli
Member

arguably we could only warn if there are at least two non-null elements

maybe that's worth doing - interested in submitting a PR?

@arnaudlegout
Contributor Author

I had a look at the code. It looks a bit harder than I thought.
The warning is produced by core.tools.datetimes._guess_datetime_format_for_array. This function simply tries to guess the datetime format from the first non-NaN element. If the format of that element cannot be guessed, the warning is raised. So the warning is not raised when two elements in the same array have different formats, but when the format of the first non-NaN element cannot be guessed.

Suppressing this warning in this function when there is a single element seems like the wrong idea.

In my opinion, this warning should be raised later, if this function returns None and the array has length > 1.
I would move the warning from core.tools.datetimes._guess_datetime_format_for_array to
core/tools/datetimes._convert_listlike_datetimes

In core/tools/datetimes._convert_listlike_datetimes I would write something like

    if format is None:
        format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

    # NEW CODE BEGIN
    if format is None and len(arg) > 1:
        warnings.warn(
            "Could not infer format, so each element will be parsed "
            "individually, falling back to `dateutil`. To ensure parsing is "
            "consistent and as-expected, please specify a format.",
            UserWarning,
            stacklevel=find_stack_level(),
        )
    # NEW CODE END


    # `format` could be inferred, or user didn't ask for mixed-format parsing.
    if format is not None and format != "mixed":
        return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)

    result, tz_parsed = objects_to_datetime64ns(
        arg,
        dayfirst=dayfirst,
        yearfirst=yearfirst,
        utc=utc,
        errors=errors,
        allow_object=True,
    )

However, I am not completely sure of the possible side effects of this change. Can anybody tell me whether what I propose makes sense, or whether I am missing something?

@arnaudlegout
Contributor Author

By the way, is the message of the warning correct? I have no idea what "falling back to 'dateutil'" means or implies. I would personally just drop this sentence from the warning message, as it seems uninformative for a user.

@MarcoGorelli
Member

I'd suggest making the change within _guess_datetime_format_for_array, i.e.

-            warnings.warn(
-                "Could not infer format, so each element will be parsed "
-                "individually, falling back to `dateutil`. To ensure parsing is "
-                "consistent and as-expected, please specify a format.",
-                UserWarning,
-                stacklevel=find_stack_level(),
-            )
+            if tslib.first_non_null(arr[first_non_null+1:]) != -1:
+                warnings.warn(
+                    "Could not infer format, so each element will be parsed "
+                    "individually, falling back to `dateutil`. To ensure parsing is "
+                    "consistent and as-expected, please specify a format.",
+                    UserWarning,
+                    stacklevel=find_stack_level(),
+                )
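
To make the condition in the diff above concrete, here is a minimal pure-Python sketch of the "is there another non-null element after the first one" check. Both `first_non_null` and `should_warn` are hypothetical stand-ins written for illustration. The only thing assumed about the real `tslib.first_non_null` is what the `!= -1` comparison in the diff suggests: that it returns -1 when no non-null element is found. The sketch also leaves out the separate question of whether the format could be guessed.

```python
import math


def first_non_null(values):
    """Return the index of the first element that is not None/NaN, or -1.

    Hypothetical stand-in for illustration, not the pandas implementation.
    """
    for index, value in enumerate(values):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        return index
    return -1


def should_warn(values):
    """Warn only if a second non-null element exists after the first one."""
    first = first_non_null(values)
    if first == -1:
        return False
    return first_non_null(values[first + 1:]) != -1


print(should_warn(["10 june 2000 8:30"]))
print(should_warn(["10 june 2000 8:30", None, "11 june 2000 9:00"]))
```

A single date, or a single date surrounded by nulls, cannot be parsed inconsistently with anything else, which is the case the check is meant to exempt.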

@arnaudlegout
Contributor Author

Do you have any specific reasons for this suggestion?

_guess_datetime_format_for_array is called to find out whether the format can be guessed, so the length of the passed array should be irrelevant to its behavior. I expect it to return either a format or None, not to raise a warning linked to the length of the array.

In my view, it is more relevant to raise the warning in _convert_listlike_datetimes.

@MarcoGorelli
Member

because in there we're already doing the check for the first non-null item

if you wanted to do it one level up, you'd need to do that check again, or return the index of the first non-null item

@MarcoGorelli
Member

I'll open a PR so we can get this in by the 2.0 release

@arnaudlegout
Contributor Author

@MarcoGorelli you were faster than me!

I see your point about checking for a non-null item.

I still have one comment on the warning message.

"Could not infer format, so each element will be parsed "
"individually, falling back to `dateutil`. To ensure parsing is "
"consistent and as-expected, please specify a format."

When I read this message, I understand that the date cannot be parsed. Indeed, if the format of a date cannot be found, how can it be parsed later anyway? The phrase "Could not infer format, so each element will be parsed individually" seems contradictory. I then assumed the warning was raised when the date format is not uniform across the array (which could explain the message), but that is not the case.

My naive understanding is that to parse a date, you must infer its format, and that passing a format simply speeds up parsing because the format does not need to be inferred. It seems the internals are more complex.

I also have no idea what "falling back to dateutil" refers to or why it is useful in the warning message. It seems to expose an internal detail.

Currently, this warning is confusing and I believe it ought to be revised. I cannot really make a proposal, as my understanding of the date-parsing internals is not good enough, but I can give feedback on a new version if needed.

NOTE: I still do not fully understand how dates are parsed. Most of the code seems to be Cython, and I do not yet know how to navigate it in PyCharm. By the way, is there any documentation on how to seamlessly navigate python and cython/C code in an IDE?

@MarcoGorelli
Member

> @MarcoGorelli you were faster than me!

Apologies - I'd normally prefer to mentor people through contributions, but with the final 2.0.0 release on the horizon (possibly next week!), I figured I'd just go ahead and try to get this in

Regarding the warning - I'll try to explain what's going on:

  • first, pandas tries to infer the format. It can't guess all formats, but if it can guess the format of the given element, then it will be parsed using pandas' own parsers
  • if pandas can't infer the format, then a library called dateutil (which is one of pandas' dependencies) will be used to parse the elements one-by-one. This is slower, but also riskier because dateutil doesn't tell us which format it used, and so there's no guarantee that the same format will be used for each element

So, this is what the warning is trying to tell users about. If you have suggestions for how to better phrase it, that would be extremely welcome
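
The risk described above can be illustrated with a toy analogue that uses only the standard library. This is not how pandas or dateutil work internally: `CANDIDATE_FORMATS` and `parse_individually` are hypothetical names, and the sketch only mimics the idea of parsing each element on its own with whatever interpretation happens to fit.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order for every element.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def parse_individually(values):
    """Parse each string on its own, accepting the first format that fits."""
    parsed = []
    for value in values:
        for fmt in CANDIDATE_FORMATS:
            try:
                parsed.append((datetime.strptime(value, fmt), fmt))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"no candidate format matches {value!r}")
    return parsed


results = parse_individually(["01/02/2000", "13/02/2000"])
for dt, fmt in results:
    print(dt.date(), fmt)
# prints:
# 2000-01-02 %m/%d/%Y
# 2000-02-13 %d/%m/%Y
```

The two strings come from the same column, yet the first is read month-first and the second day-first, and nothing reports the switch. A single explicit format would instead reject the second value.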

> is there any documentation on how to seamlessly navigate python and cython/C code in an IDE

I wish 😄 But the legendary @WillAyd has some blog posts on debugging C extensions at https://willayd.com/

@arnaudlegout
Contributor Author

Thanks for the clarification. I would suggest:

"Could not infer format. Falling back to a slower and possibly inconsistent per-date parsing. Specifying the format is recommended for faster and consistent parsing." (24 words, 164 chars vs. 26 words, 170 chars in the original message)

> I wish 😄 But the legendary @WillAyd has some blog posts on debugging C extensions at https://willayd.com/

great, thanks!

@ax-va

ax-va commented Jul 17, 2023

Hey! I have the same problem when I use read_csv:
data = pd.read_csv('data.csv', index_col='Date', parse_dates=True)

The data in data.csv are

Date,Fremont Bridge Total,Fremont Bridge East Sidewalk,Fremont Bridge West Sidewalk
11/01/2019 12:00:00 AM,12,7,5
11/01/2019 01:00:00 AM,7,0,7
11/01/2019 02:00:00 AM,1,0,1
11/01/2019 03:00:00 AM,6,6,0
11/01/2019 04:00:00 AM,6,5,1
11/01/2019 05:00:00 AM,20,9,11
11/01/2019 06:00:00 AM,97,43,54

The warning message I then get:

UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.

But the dates are parsed correctly:

                     Fremont Bridge Total  Fremont Bridge East Sidewalk  Fremont Bridge West Sidewalk
Date                                                                                                 
2019-11-01 00:00:00                    12                             7                             5
2019-11-01 01:00:00                     7                             0                             7
2019-11-01 02:00:00                     1                             0                             1
2019-11-01 03:00:00                     6                             6                             0
2019-11-01 04:00:00                     6                             5                             1
2019-11-01 05:00:00                    20                             9                            11
2019-11-01 06:00:00                    97                            43                            54

data.info() gives

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2019-11-01 00:00:00 to 2019-11-01 06:00:00
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Fremont Bridge Total          7 non-null      int64
 1   Fremont Bridge East Sidewalk  7 non-null      int64
 2   Fremont Bridge West Sidewalk  7 non-null      int64
dtypes: int64(3)

If I try to set a format, for example,
data = pd.read_csv('data.csv', index_col='Date', parse_dates=True, date_format="%Y-%m-%d %H:%M:%S")
I get no warning, but the dates are not parsed:

                        Fremont Bridge Total  Fremont Bridge East Sidewalk  Fremont Bridge West Sidewalk
Date                                                                                                    
11/01/2019 12:00:00 AM                    12                             7                             5
11/01/2019 01:00:00 AM                     7                             0                             7
11/01/2019 02:00:00 AM                     1                             0                             1
11/01/2019 03:00:00 AM                     6                             6                             0
11/01/2019 04:00:00 AM                     6                             5                             1
11/01/2019 05:00:00 AM                    20                             9                            11
11/01/2019 06:00:00 AM                    97                            43                            54

data.info() gives

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 11/01/2019 12:00:00 AM to 11/01/2019 06:00:00 AM
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Fremont Bridge Total          7 non-null      int64
 1   Fremont Bridge East Sidewalk  7 non-null      int64
 2   Fremont Bridge West Sidewalk  7 non-null      int64
dtypes: int64(3)
memory usage: 224.0+ bytes

The pandas version is 2.0.3.

@MarcoGorelli
Member

your format's not correct, you're missing the meridiem directive
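
A minimal standard-library sketch of the mismatch, using one timestamp from the CSV above. The month-first reading of `11/01/2019` is an assumption, chosen to match the parsed output shown earlier in the thread.

```python
from datetime import datetime

sample = "11/01/2019 12:00:00 AM"

# The format tried first in the thread does not describe this string.
try:
    datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
    iso_matches = True
except ValueError:
    iso_matches = False

# %I is the 12-hour clock and %p is the AM/PM (meridiem) directive.
parsed = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")
print(iso_matches, parsed)
# prints: False 2019-11-01 00:00:00
```

`datetime.strptime` raises on a mismatch, whereas the `read_csv` call above left the column as plain strings, which is why the wrong format produced no warning and no parsed dates.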

@ax-va

ax-va commented Jul 17, 2023

You are right.

pd.read_csv('data.csv', index_col='Date', parse_dates=True, date_format="%m/%d/%Y %I:%M:%S %p")

parses correctly

@Chinaskidev

Hi everyone, I'm new to pandas. I am trying to analyze some data and I get the following UserWarning:

UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  df['Fecha'] = pd.to_datetime(df.Date)

This is my DataFrame code:

df['Fecha'] = pd.to_datetime(df.Date)
df.drop(columns='Date', inplace=True)

Does anyone have a solution?

@MarcoGorelli
Member

> please specify a format

You need to do this

@ReetiMauryaCrest

ReetiMauryaCrest commented Nov 12, 2024

@MarcoGorelli I also have an issue with this date parsing message. I am trying to parse a date series that has two formats mixed together: 08-06-2024 00:00:00 and 8/14/2024 12:00:00 AM (I know that this is an odd case, but I can't do anything about it). When I try to parse this, I get the warning mentioned above. What do you suggest is the best way for me to handle this case? I don't completely understand how mixed types are handled.

Someone on Stack Overflow suggested parsing the data in both formats and merging the results:

date1 = pd.to_datetime(df['date'], errors='coerce', format='%Y-%m-%d')
date2 = pd.to_datetime(df['date'], errors='coerce', format='%d.%m.%Y')
df['date'] = date1.fillna(date2)

but I don't understand this one either, and my only remaining option seems to be writing a custom function to infer my dates and applying it manually so that dateutil doesn't give inconsistent results. So, is that the best option to handle this?
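
For reference, here is a sketch of the same two-pass idea adapted to the two layouts quoted in this comment. Both format strings are assumptions: in particular, `08-06-2024` is read month-first here, which only the owner of the data can confirm.

```python
import pandas as pd

dates = pd.Series(["08-06-2024 00:00:00", "8/14/2024 12:00:00 AM"])

# Assumed layouts; errors="coerce" turns values that do not match into NaT.
dashed = pd.to_datetime(dates, format="%m-%d-%Y %H:%M:%S", errors="coerce")
slashed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Each pass fills the rows the other one could not parse.
combined = dashed.fillna(slashed)
print(combined)
```

Each row is parsed by exactly one explicit format, so nothing is left to per-element guessing. Rows that match neither format stay NaT and can be inspected afterwards.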

@afranklin238

I have the same problem as @ReetiMauryaCrest. I don't have control over how the data is entered, so I get a mixture of formats for dates. I want to be sure I'm using best practices. What would y'all suggest?

@MarcoGorelli
Member

hey

> I don't completely understand how mixed types are handled

if you pass format='mixed' then formats are inferred row-by-row

but personally i'd suggest doing some validation before your data reaches pandas

@ReetiMauryaCrest

Thanks @MarcoGorelli. I ended up writing a custom parsing function based on the 2 date formats I was receiving and applying it row by row. My main concern was that it would be too slow, but I haven't noticed any large delays so far, so I think it will work for me for now.
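
The function itself was not posted, so the following is only a guess at what such a row-wise parser could look like. `KNOWN_FORMATS` and `parse_known` are hypothetical names, and the two format strings are assumptions based on the examples given earlier in the thread.

```python
from datetime import datetime

# Assumed layouts for the two kinds of value described in the thread.
KNOWN_FORMATS = (
    "%m-%d-%Y %H:%M:%S",
    "%m/%d/%Y %I:%M:%S %p",
)


def parse_known(value):
    """Parse with the first known format that matches, else fail loudly."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unexpected date format: {value!r}")


print(parse_known("08-06-2024 00:00:00"))
print(parse_known("8/14/2024 12:00:00 AM"))
```

Raising on anything outside the known formats doubles as the kind of validation suggested above, at the cost of a Python-level loop when applied to each row.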
