-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Provide ExtensionDtype.construct_from_string by default #26562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide ExtensionDtype.construct_from_string by default #26562
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26562 +/- ##
==========================================
- Coverage 91.77% 91.76% -0.02%
==========================================
Files 174 174
Lines 50649 50652 +3
==========================================
- Hits 46483 46479 -4
- Misses 4166 4173 +7
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26562 +/- ##
==========================================
- Coverage 91.87% 91.87% -0.01%
==========================================
Files 174 174
Lines 50663 50657 -6
==========================================
- Hits 46548 46539 -9
- Misses 4115 4118 +3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a good idea to me.
I think the example in the docstring can then be removed? As the example is then the actual implementation.
Or update it to say that the default implementation is adequate in case that the dtype can be constructed without any arguments.
Seems reasonable. I'm not sure what the motivation was for not doing that in the first place. Possibly that we internally call |
Hello @datapythonista! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-05 15:37:26 UTC |
Thanks for the feedback. I fixed the docstrings, improving the example to show how to overwrite the method, and remove the method of the category dtype, since now was the same as the parent class. I think in a follow up PR we could make a generic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger IIRC we explicitly turned off the parsing of datetime w/tz dtypes (in the Dtype constructor).
@datapythonista thus I don't think this is an instructive example here (no objection to moving the integer usecase here though);
there is also a usage in intervals I think?
this would need a test for existence of this method as well
pandas/core/dtypes/base.py
Outdated
@@ -174,15 +174,26 @@ def construct_array_type(cls): | |||
@classmethod | |||
def construct_from_string(cls, string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add types
…rror and removing CategoricalDtype.construct_from_string
pandas/tests/dtypes/test_dtypes.py
Outdated
@@ -51,6 +51,36 @@ def test_pickle(self): | |||
assert result == self.dtype | |||
|
|||
|
|||
class TestBaseDtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such tests are already in the base extension tests (tests/extension/base/dtype), or can be expanded there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the feeling that those were only base tests for implementing tests for custom extensions, couldn't find a place where I thought it'd make sense to have the abstract class test. Where exactly would you have them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those were only base tests for implementing tests for custom extensions
Yes, but we have custom extensions ourselves, so we use those tests ourselves (and in addition we also have a few extra test EAs).
For example in tests/extension/decimal/array.py
you can remove the custom implementation and let it inherit, and then it is already tested in that way.
where I thought it'd make sense to have the abstract class test
What do you mean with abstract class test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand what you want to test.
@jreback wanted a test for the existence of the method construct_from_string
, which makes sense to me. Ideally, we'd already have a test, testing that the method raises an exception when called directly from ExtensionDtype
(what I called the abstract class before). Then I'd just replace that by checking that now it's implemented, and it can construct an instance given the string of the name of the dtype.
I saw that in extension/base/dtype.py
we already have a test for construct_from_string
, but that's meant for custom extensions (ours or from a third-party) to test that their implementation of construct_from_string
follows the expected behavior. The way of using that test is to create a subclass of BaseDtypeTests
.
I don't think we want to test that the method is implemented in the ExtensionDtype
(the base/abstract class) by checking a subclass of it. So, I see two options:
- Create a subclass of
BaseDtypeTests
defining that thedtype
to test is not a subclass butExtensionDtype
itself (it will fail for the abstract methods that still exist in `ExtensionDtype). - Create an independent test for
ExtensionDtype
and not their subclasses, that doesn't subclassBaseDtypeTests
(this is what I've done here, not testing onlyconstruct_from_string
but all the methods that I could tests, since there weren't tests for the the "default"/base implementations).
Am I missing something here? Sorry I don't understand exactly what you propose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see @jorisvandenbossche comments on where tests should go, we have a very general framework for these already.
pandas/tests/dtypes/test_dtypes.py
Outdated
assert isinstance(dtype_instance, self.dtype.__class__) | ||
with pytest.raises(TypeError, match="Cannot construct a 'DummyDtype' " | ||
"from 'another_type'"): | ||
self.dtype.__class__.construct_from_string('another_type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a useful test to add to the base extension tests, and the test_eq
as well. The defaults tests are maybe harder though.
pandas/tests/dtypes/test_dtypes.py
Outdated
@@ -51,6 +51,36 @@ def test_pickle(self): | |||
assert result == self.dtype | |||
|
|||
|
|||
class TestBaseDtype: | |||
def setup_method(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really don't want to use setup_method generally at all.
pandas/tests/dtypes/test_dtypes.py
Outdated
@@ -51,6 +51,36 @@ def test_pickle(self): | |||
assert result == self.dtype | |||
|
|||
|
|||
class TestBaseDtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see @jorisvandenbossche comments on where tests should go, we have a very general framework for these already.
@jorisvandenbossche @jreback I think the tests are in the location you meant now. Let me know if you have any further comment. |
Thanks! That's what I was thinking about. You are happy with that as well? I think there are still some custom implementation of |
I think the tests now are testing something slightly different as before. They're testing that new extension dtypes satisfy the implementation of the base class. Before I was testing the base class itself. I think both are reasonable, I'm happy this way too, but I think it also made sense what I implemented before. I deleted the two Thanks for the review, let me know if there is anything else that should be changed, I think it should be ready now. |
thanks @datapythonista |
This apparently caused some performance issues: http://pandas.pydata.org/speed/pandas/index.html#sparse.SparseDataFrameConstructor.time_from_scipy?commits=5a724b5cd796a6ede3cb95b8687eaf561e9d57b2 Will open a proper issue later. |
It seems a bit strange that this PR is the cause of that regression, as SparseDtype has its own implementation of |
Quick profiling of the specific benchmark points out the culprit: in the dtype checking code if doing a try/except it formats the full series in the error message. |
Ah, thanks for digging into it! It's unfortunate that such a minor change can have such broad consequences on performance. I assume this is from something like |
else: | ||
raise TypeError("cannot construct a CategoricalDtype") | ||
except AttributeError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anybody know why we had this "except AttributeError: pass" here?
We have had similar performance regressions before. In general we need to be careful with what we print in error messages, but this is of course easy to overlook, as we did in this case. But I would say: that's what we have benchmarks for, and they worked nicely to catch it here. |
I did #26776 with a possible fix. I also don't think it is that an important fix, as there are some strange things going on in |
git diff upstream/master -u -- "*.py" | flake8 --diff
I think it makes sense to provide a standard
construct_from_string
by default, instead of forcing subclasses ofExtensionDtype
to implement it.This way we can define a simple dtype with:
instead of:
CC: @TomAugspurger