Skip to content

BUG: Fix localize_pydatetime using meta datetimes as Timestamp (#25734) #25746

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ArtificialQualia
Copy link
Contributor

Proposed fix for #25734

The issue was due to localize_pydatetime assuming anything that was not PyDateTime_CheckExact must be a pandas Timestamp. However, in the rare case a datetime was created with a custom metaclass, that datetime would fail PyDateTime_CheckExact, but obviously was not a Timestamp.

Therefore, I made the not PyDateTime_CheckExact check more explicitly look for a Timestamp. As the Timestamp class could not be imported for a isinstance check due to a circular import, I took the route that many of the other is_* style checks took of adding a _typ property to Timestamp and inspecting that to determine identity.

@codecov
Copy link

codecov bot commented Mar 16, 2019

Codecov Report

Merging #25746 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25746      +/-   ##
==========================================
- Coverage   91.25%   91.25%   -0.01%     
==========================================
  Files         172      172              
  Lines       52977    52977              
==========================================
- Hits        48343    48342       -1     
- Misses       4634     4635       +1
Flag Coverage Δ
#multiple 89.82% <ø> (ø) ⬆️
#single 41.74% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/util/testing.py 88.98% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65c0441...2c4ee4d. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 16, 2019

Codecov Report

Merging #25746 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25746   +/-   ##
=======================================
  Coverage   91.25%   91.25%           
=======================================
  Files         172      172           
  Lines       52977    52977           
=======================================
  Hits        48343    48343           
  Misses       4634     4634
Flag Coverage Δ
#multiple 89.82% <ø> (ø) ⬆️
#single 41.74% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65c0441...27657f1. Read the comment docs.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on the fix

@mroeschke
Copy link
Member

This looks to be a regression in 0.24 (from this PR #21691) and eligible for backporting.

@mroeschke mroeschke added Bug Frequency DateOffsets labels Mar 16, 2019
@mroeschke
Copy link
Member

Will need to isort this file.

ERROR: /home/vsts/work/1/s/pandas/tests/tslibs/test_conversion.py Imports are incorrectly sorted.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls show if this causes any perf issues

return isinstance(obj, datetime)

class FakeDatetime(MetaDatetime("NewBase", (datetime,), {})):
pass
Copy link
Contributor Author

@ArtificialQualia ArtificialQualia Mar 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A metaclass is needed in this test case to be able to override __instancecheck__


with monkeypatch.context() as m:
# monkeypatch datetime everywhere
for mod_name, module in list(sys.modules.items()):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least some level of monkeypatching is necessary for the original issue to show up.

However, this entire for block could be replaced with m.setattr("pandas.tseries.offsets.datetime", FakeDatetime), but I wanted to future proof this test better and replicate what freezegun and other libraries that might monkeypatch datetime would do.

@pep8speaks
Copy link

pep8speaks commented Mar 18, 2019

Hello @ArtificialQualia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-18 02:37:58 UTC

@ArtificialQualia
Copy link
Contributor Author

pls show if this causes any perf issues

I ran the perf testing and did have some significant deviations with many of them in places I wouldn't expect to see any differences. I'll run them again tonight and see if the deviations are consistent.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final comment clarification note from me otherwise LGTM.

Okay to backport to the 0.24 branch @jreback? This was a regression in 0.24.

pass

with monkeypatch.context() as m:
# monkeypatch datetime everywhere
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand this comment to describe what freezegun (and other libraries) are doing?

@jreback
Copy link
Contributor

jreback commented Mar 18, 2019

no on the backport and this need perf checking

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear what problem you are trying to solve here. can you elaborate.

@@ -136,7 +136,7 @@ Categorical
Datetimelike
^^^^^^^^^^^^

-
- Bug with :class:`Timestamp` and :class:`CustomBusinessDay` arithmetic throwing an exception with datetime subclasses (:issue:`25734`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't throw an exception, rather we raise an exception. Can you even be more explicit here, what does this have to do with CBD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to scope this description to more what was in the original issue, but really it is due to freezegun monkeypatching datetime with a subclass of datetime. A few of the offsets.py classes turn things into datetime in their apply functions. That's why this affects CustomBusinessDay and not, for instance, BusinessDay. After a brief look, a few other classes might be affected by this bug, like Easter and FY5253.

However, in writing a custom datetime class for testing, another bug was discovered in convert_datetime_to_tsobject which prevents any subclassed datetime from being converted into a Timestamp. freezegun was not affected by this by chance because it happened to implement the nanosecond property in their custom datetime class, but that fix has been included in this PR as mentioned in another comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so say this affects subclasses of datetime. note btw, we don't provide any compat with this, but as usual will take patches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this affects datetime subclasses is already in the release notes, I'm not sure what you would like changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:class:CustomBusinessDay arithmetic is misleading

@jbrockmendel
Copy link
Member

I ran the perf testing and did have some significant deviations with many of them in places I wouldn't expect to see any differences. I'll run them again tonight and see if the deviations are consistent.

Yah this is a common issue with asv when looking at cython code. If it persists, just show some relevant %timeit results.

Because this is replacing a C function call with a python attribute lookup, I'd expect the performance to take a hit. The question is whether this is called enough times for that perf hit to be non-trivial.

Also, there are a handful of places where we use not Datetime_CheckExact(x) as a proxy for isinstance(x, Timestamp). Ideally if we fix one of these, we should fix them all.

@ArtificialQualia
Copy link
Contributor Author

%timeit Results

For the performance tests that over both runs had a factor > 1.1 (did not test the factor < 0.9) here are the %timeit results:

multiindex_object.GetLoc.time_large_get_loc
This PR:

In [4]: %timeit get_loc.time_large_get_loc()
11.8 µs ± 52.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Master:

In [4]: %timeit get_loc.time_large_get_loc()
11.7 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

inference.NumericInferOps.time_subtract(<class 'numpy.int16'>)
This PR:

In [5]: %timeit nio.time_subtract(np.int16)
431 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Master:

In [5]: %timeit nio.time_subtract(np.int16)
428 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeseries.ResampleSeries.time_resample('datetime', '5min', 'mean')
This PR:

In [4]: %timeit rs.time_resample('datetime', '5min', 'mean')
3.68 ms ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Master:

In [4]: %timeit rs.time_resample('datetime', '5min', 'mean')
3.78 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

series_methods.NSort.time_nsmallest('last')
This PR:

In [4]: %timeit ns.time_nsmallest('last')
2.04 ms ± 7.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Master:

In [4]: %timeit ns.time_nsmallest('last')
2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

multiindex_object.Integer.time_get_indexer
This PR:

In [4]: %timeit i.time_get_indexer()
1.27 ms ± 9.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Master:

In [4]: %timeit i.time_get_indexer()
1.28 ms ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Out of these, the only 'significant' result was series_methods.NSort.time_nsmallest('last') which is a little faster in this PR's code, but that's probably a fluke.

@jbrockmendel
Copy link
Member

For the %timeit checks, I was suggesting that you run just the affected code, e.g %timeit localize_pydatetime(whatever) (for values that would be affected by the PR). Or if there are other functions that call localize_pydatetime a lot in a loop, time those.

@ArtificialQualia
Copy link
Contributor Author

ArtificialQualia commented Mar 18, 2019

Also, there are a handful of places where we use not Datetime_CheckExact(x) as a proxy for isinstance(x, Timestamp). Ideally if we fix one of these, we should fix them all.

I'm happy to do that, but should that be done in a separate issue/PR?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so rather than using PyDateTime_CheckExact, if we switch this to PyDateTime_Check it will likely work, though IIRC we did this as a perf issue? @jbrockmendel do you remember

@@ -136,7 +136,7 @@ Categorical
Datetimelike
^^^^^^^^^^^^

-
- Bug with :class:`Timestamp` and :class:`CustomBusinessDay` arithmetic throwing an exception with datetime subclasses (:issue:`25734`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so say this affects subclasses of datetime. note btw, we don't provide any compat with this, but as usual will take patches.

@ArtificialQualia
Copy link
Contributor Author

so rather than using PyDateTime_CheckExact, if we switch this to PyDateTime_Check it will likely work, though IIRC we did this as a perf issue? @jbrockmendel do you remember

I thought that might work, but since we are using not PyDateTime_CheckExact as a proxy for isinstance(x, Timestamp), it wouldn't work to switch that out for PyDateTime_Check since Timestamp is itself a subclass of datetime.

@jbrockmendel
Copy link
Member

so rather than using PyDateTime_CheckExact, if we switch this to PyDateTime_Check it will likely work, though IIRC we did this as a perf issue?

@ArtificialQualia's comment about not PyDateTime_CheckExact(x) being a proxy for isinstance(x, Timestamp) is correct, as is @jreback's point that we are using this proxy for performance reasons.

An ideal solution would be to implement a is_timestamp function with C performance. Some speculation about what that might look like:

  • if we somehow merged Timestamp with _Timestamp, so that Timestamp itself were a cdef class, then I think type(x) is Timestamp would be a pointer comparison, i.e. fast. Might merit checking on the cython mailing list about this.
  • this might require a refactor to avoid circular dependencies. Presumably such an is_timestamp would be defined in tslibs.timestamps (which depends on tslibs.conversion), but ATM we use this heuristic in tslibs.conversion (alsotslibs.timedeltas and tslib)

@@ -136,7 +136,7 @@ Categorical
Datetimelike
^^^^^^^^^^^^

-
- Bug with :class:`Timestamp` and :class:`CustomBusinessDay` arithmetic throwing an exception with datetime subclasses (:issue:`25734`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:class:CustomBusinessDay arithmetic is misleading

obj.value += ts.nanosecond
obj.dts.ps = ts.nanosecond * 1000
except AttributeError:
# probably a subclass of datetime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this probably?

@jbrockmendel
Copy link
Member

pinged the cython mailing list about fast isinstance checks: https://groups.google.com/forum/#!topic/cython-users/AQSGs-Sx9Pc

It sounds like isinstance(x, _Timestamp) might be significantly faster than isinstance(x, Timestamp)

@jbrockmendel
Copy link
Member

After doing some profiling, it looks like the try/except AttributeError check is way slower than basically every other alternative:

Within cython, had a function run each of the following 10k times. Then ran that function in %timeit

                                 Timestamp     datetime
isinstance(x, _Timestamp) -->    16.1µs        17.3µs
isinstance(x, Timestamp)  -->    74.6µs        198µs
not PyDateTime_CheckExact(x) --> 119ns         113 ns
try/except  -->                  3.22 ms       3.21 ms

So the best "do it right" (isinstance(x, _Timestamp)) is 135x slower than the current check, but 200x better than the try/except check.

There are a couple of extant places where we use this check. I'm going to make a PR to get rid of those. @ArtificialQualia are you OK with putting this on hold until a performant solution is in place?

@ArtificialQualia
Copy link
Contributor Author

ArtificialQualia commented Mar 20, 2019

@jbrockmendel I tried to replicate your results based on your testing criteria, and got some similar results with some exceptions. For instance:

  • isinstance(x, _Timestamp) was as fast as not PyDateTime_CheckExact(x) within a few nanoseconds.
  • try/except was only in the ms order of magnitude when it actually catches an exception. When no exception was raised (which would almost always be the case in this code), it was only about 200x slower than isinstance(x, _Timestamp)

Either way, it seems to me as well the answer is to use isinstance(x, _Timestamp).

I'm happy to put this on hold until your other PR is merged.

@jbrockmendel
Copy link
Member

I tried to replicate your results based on your testing criteria, and got some similar results with some exceptions

When I tried this, I found the signature of the is_timestamp functions affected the results quite a bit. In particular, the timings I reported all had the signature cdef bint is_timestamp(datetime obj) as opposed to cdef bint is_timestamp(object obj). Can that account for the difference in results?

I'm happy to put this on hold until your other PR is merged.

My first attempt at this failed and I don't know when I'll be able to circle back to it. Better not to wait on me.

@ArtificialQualia
Copy link
Contributor Author

When I tried this, I found the signature of the is_timestamp functions affected the results quite a bit. In particular, the timings I reported all had the signature cdef bint is_timestamp(datetime obj) as opposed to cdef bint is_timestamp(object obj). Can that account for the difference in results?

My signature was cpdef inline void test_1(datetime dt), so there are some differences but the arg was the same at least.

I found a very large difference depending on how _Timestamp was imported. If it was imported with a python import then 10,000 isinstance(dt, _Timestamp) took 55.7 µs and 167 µs for Timestamp and datetime respectively. But if I did a cimport _Timestamp (requires additional .pxd changes of course) then that went back down to 70.2 ns and 64.5 ns respectively. Maybe that is where the difference is? Which import method did you use?

I'm happy to make the PR for the larger scope of changes, but if you were thinking of having someone else do it that is fine with me as well.

@ArtificialQualia
Copy link
Contributor Author

#25853 implements this

@ArtificialQualia ArtificialQualia deleted the fix-localize_pydatetime branch April 5, 2019 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas 0.24.x throws an error when using freezegun with CustomBusinessDay
5 participants