Skip to content

NaT inheritance #17435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Sep 5, 2017 · 3 comments · Fixed by #17793
Closed

NaT inheritance #17435

jbrockmendel opened this issue Sep 5, 2017 · 3 comments · Fixed by #17793
Labels
Clean Internals Related to non-user accessible pandas implementation

Comments

@jbrockmendel
Copy link
Member

pd.NaT is an instance of NaTType. NaTType inherits from _NaT, which inherits from _Timestamp, which inherits from datetime.

How would people feel about taking _Timestamp out of that chain? Note that _Timestamp is not exported, so no users should be checking e.g. isinstance(x, _Timestamp).

The motivation: if we can take _Timestamp out of the inheritance chain, then the entire NaT hierarchy along with the instance itself can be defined in its own dedicated module, with no intra-pandas dependencies other than util. This comes out to a little over 500 lines.

This opens up opportunities for simplification elsewhere, since there are a bunch of modules that import NaT but don't care about most of the rest of tslib (the same applies to the timezones functions in #17274). Of particular interest are _libs.lib, _libs.period, tseries.frequencies, and tseries.offsets.

The other upside is that with few dependencies, the module becomes much easier to test/benchmark/maintain.

Without _NaT inheriting from _Timestamp, a methods in _Timestamp become eligible for the @cython.final decorator. My (limited) understanding is that this provides a small performance improvement.

The main downside I see is that the docstring pinning done by _make_error_func, _make_nat_func, and _make_nan_func is a bit of a hassle.

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Sep 5, 2017
@gfyoung
Copy link
Member

gfyoung commented Sep 5, 2017

Hmm...this is an interesting proposal. Being able to disentangle the mess of Cython code beneath pandas would be good. Give it a shot!

@jbrockmendel
Copy link
Member Author

In case anyone is curious where I'm going with the whole tslibs thing:

Once the current set of PRs is resolved, I'll make one to separate out _NaT, NaTType and NaT. There are some very small optimizations that can be made, nothing worth writing home about. What is worthwhile is that lib can just import NaT and not have a dependency on tslib. (and algos which cimports from lib, and groupby which cimports from algos...)

Once NaT is separated, and assuming #17363 is merged, then all of the _TSObject logic can be separated (it takes some work to disentangle the references to Timestamp, but is worth it) into tslibs.conversion. index._to_i8 thematically belongs in this module, as do lib.gmtime, lib.to_datetime, lib.to_timestamp (which is a misnomer), lib.array_to_timestamp, and lib.time64_to_datetime.

tslibs.timedeltas clocks in at about 1400 lines, defines _Timedelta. A few funcs from core.indexes.timedeltas can naturally be moved up here. tslibs.timestamps (~1100 lines) defines _Timestamp. Timestamp and Timedelta still need to be defined in tslib because they refer to each other.

@jbrockmendel
Copy link
Member Author

OK, we've gotten to a point in tslibs where decisions need to be made about NaTType.

The proposed moduletslibs.nattype defines _NaT, NaTType, NaT, and _checknull_with_nat. It has no pandas dependencies outside of util (not even datetime.pxd), so is very light-weight. This in turn allows us to separate out _TSObject logic and disentangle lib from tslib. It is a big win.

That said, it is not a matter of cut/paste the same way as other tslibs modules have been. There are three potential sticking points.

  1. At the moment _NaT subclasses _Timestamp (which subclasses datetime). In the proposed change, _NaT subclasses datetime directly. This doesn't matter in and of itself, but it is technically a change that reviewers should be aware of.

  2. The two places where the _Timestamp inheritance are currently used are in _NaT.__sub__ and _NaT.__add__, which currently use:

            result = _Timestamp.__add__(self, other)
            # Timestamp.__add__ doesn't return DatetimeIndex/TimedeltaIndex
            if result is NotImplemented:
                return result

Under the proposed change, these NotImplementeds are implemented directly in the _NaT methods, which entails nonzero code duplication.

  1. A handful of NaTType method docstrings are currently copied from their Timestamp analogues. Many of these can just be gotten from datetime, but for some the workaround is to define the method in nattype, then import NaTType into tslib and patch the docstring there. So nattype is not 100% stand-alone.

@jreback I'd like to get your feedback on this when convenient; no hurry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants