-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Infer best datetime format from a sample #52626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this
couple of initial comments:
- this looks very complicated, is a simpler solution possible?
- this should be deterministic, and shouldn't depend on the result of
np.random
- can we use, say, the first 10 non-null elements? Or 10 equally spaced elements?
Thanks for the quick reply!
Probably, I'll try to make it simpler.
Will fix |
…time_format_inference_test
…eoGrin/pandas into datetime_format_inference_test
…eoGrin/pandas into datetime_format_inference_test
@MarcoGorelli this is ready for review (don't know if I should tag or just wait) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating!
this still look quite complicated - this is going to have be maintained long-term by multiple people and so I'm a bit hesitant to add a lot of code. Is there a simpler solution which would still improve datetime format inference?
.. ipython:: python | ||
:okwarning: | ||
|
||
pd.to_datetime([1, "foo"], errors="coerce") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the warning UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please...
is raised when any value is a string (if no format was found). Before, it was raised when the first non-null value was a string, so wouldn't be raised in this example, but would be raised on pd.to_datetime(["foo", 1], errors="coerce")
for instance.
Thanks for the reply!
I understand the concern, but I'm not sure how to make it simpler. One thing I can easily remove, and is perhaps a bit complicated, is the |
yeah |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for your PR I'm afraid this introduces too much complexity, sorry, it's going to be too hard to maintain, there's already very few people comfortable with this part of the codebase If you can find a simpler solution which improves format inference without greatly increasing the maintenance burden, then we can take. For example, just trying 5 elements and taking a majority vote would probably be an improvement and still be quite simple |
doc/source/whatsnew/v2.1.0.rst
file if fixing a bug or adding a new feature.Summary
pd.to_datetime
now tries to infer the datetime format of each string by consideringa random sample (instead of the first non-null sample),
and tries to find the format which work for most strings. If several
formats work as well, the one which matches the
dayfirst
parameter is returned. Ifformat="mixed"
, pandas does the same thing, then tries the second best format on thestrings which failed to parse with the first best format, and so on (instead of parsing each row
independently) (#52508).
Previous behavior:
New behavior:
Design questions:
dayfirst
parameter too often? Right now, we always prefer the format which matches the most strings, and only considerdayfirst
if several formats match as many strings (and raise a warning if we contradictdayfirst
). Forerror="raise"
it makes sense (we only select a format if it matches all non-null values), but I'm wondering if it's a problem forerrors="coerce"
orerrors="ignore"
: in this case, ifdayfirst = True
but%m-%d%-%Y
matches more values than%d-%m%-%Y
, we select%m-%d%-%Y
(and raise a warning). I think it's okay but it may be somewhat confusing for users. Some ideas otherwise:dayfirst=None
(instead ofFalse
) as the default inpd.to_datetime
, and respectdayfirst
if it's provided.dayfirst
: for instance ifdayfirst = True
,%m-%d%-%Y
needs to match >10% more values than%d-%m%-%Y
to be chosen.