Provide ExtensionDtype.construct_from_string by default #26562

datapythonista · 2019-05-29T16:26:27Z

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I think it makes sense to provide a standard construct_from_string by default, instead of forcing subclasses of ExtensionDtype to implement it.

This way we can define a simple dtype with:

class SimpleDtype(pandas.core.dtypes.dtypes.ExtensionDtype):
    name = 'simple'

    @property
    def type(self):
        return object

instead of:

class SimpleDtype(pandas.core.dtypes.dtypes.ExtensionDtype):
    name = 'simple'

    @property
    def type(self):
        return object

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()
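
For illustration, here is a minimal sketch of what such a default could look like. `BaseDtype` is a hypothetical stand-in so the snippet runs without pandas; it is not the actual `ExtensionDtype` code, and the error message simply mirrors the one in the example above.

```python
class BaseDtype:
    """Hypothetical stand-in for ExtensionDtype (not the pandas class)."""

    name = None  # subclasses set the string alias of the dtype

    @classmethod
    def construct_from_string(cls, string):
        # Assumed default: only the dtype's own name is accepted, and the
        # dtype is constructed without arguments.
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()


class SimpleDtype(BaseDtype):
    name = 'simple'

    @property
    def type(self):
        return object


# The subclass inherits construct_from_string without defining it.
dtype = SimpleDtype.construct_from_string('simple')
```

A dtype that needs constructor arguments would still have to override the method, since this default always calls `cls()` with no arguments.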

CC: @TomAugspurger

codecov · 2019-05-29T17:05:46Z

Codecov Report

Merging #26562 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #26562      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.02%     
==========================================
  Files         174      174              
  Lines       50649    50652       +3     
==========================================
- Hits        46483    46479       -4     
- Misses       4166     4173       +7

Flag	Coverage Δ
#multiple	`90.29% <0%> (-0.01%)`	⬇️
#single	`41.69% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/base.py	`94.44% <0%> (-5.56%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91da0c...20ec4e4. Read the comment docs.

codecov · 2019-05-29T17:05:57Z

Codecov Report

Merging #26562 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26562      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      174              
  Lines       50663    50657       -6     
==========================================
- Hits        46548    46539       -9     
- Misses       4115     4118       +3

Flag	Coverage Δ
#multiple	`90.4% <100%> (ø)`	⬆️
#single	`41.93% <100%> (+0.08%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/integer.py	`96.3% <ø> (-0.05%)`	⬇️
pandas/core/dtypes/dtypes.py	`97.54% <ø> (+0.2%)`	⬆️
pandas/core/dtypes/base.py	`100% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0c41f7...2897ad2. Read the comment docs.

jorisvandenbossche

Seems a good idea to me.

I think the example in the docstring can then be removed, as the example is then the actual implementation?
Or update it to say that the default implementation is adequate when the dtype can be constructed without any arguments.

TomAugspurger · 2019-05-29T18:45:44Z

Seems reasonable. I'm not sure what the motivation was for not doing that in the first place. Possibly that we internally call construct_from_string() inside try / except TypeError blocks in a few places, which would potentially mask the issue? But I think our tests should be adequate.
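
To make the masking concern concrete, here is a hypothetical sketch of a lookup that probes each dtype inside try / except TypeError. The names `find_dtype`, `GoodDtype` and `BuggyDtype` are invented for illustration and are not pandas internals. A TypeError caused by a genuine bug looks exactly like the "this string is not mine" signal, so the bug is silently skipped.

```python
class GoodDtype:
    name = 'good'

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()


class BuggyDtype:
    name = 'buggy'

    def __init__(self, unit):  # the constructor requires an argument
        self.unit = unit

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        # Bug: cls() is missing 'unit', which also raises TypeError.
        return cls()


def find_dtype(string, candidates):
    """Return the first dtype that accepts ``string``, or None."""
    for dtype_cls in candidates:
        try:
            return dtype_cls.construct_from_string(string)
        except TypeError:
            # Intended to mean "wrong string", but it also hides the bug.
            continue
    return None


found = find_dtype('good', [BuggyDtype, GoodDtype])
masked = find_dtype('buggy', [BuggyDtype, GoodDtype])  # bug is swallowed
```

Here `masked` ends up as `None` instead of surfacing the constructor error, which is the kind of failure that only dedicated tests would catch.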

pep8speaks · 2019-05-29T19:43:22Z

Hello @datapythonista! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-05 15:37:26 UTC

datapythonista · 2019-05-29T19:52:08Z

Thanks for the feedback. I fixed the docstrings, improving the example to show how to override the method, and removed the method from the category dtype, since it was now identical to the one in the parent class.

I think in a follow-up PR we could make a generic construct_from_string method that extracts the arguments from a string like datetime64[ns, UTC] and returns an instance of the class, passing everything inside the square brackets as *args. This way we shouldn't have to override construct_from_string for any dtype. I guess I'm not missing anything.
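
A rough sketch of that follow-up idea, under stated assumptions: `split_dtype_string` and `FakeDatetimeDtype` are hypothetical names invented here, the arguments are passed through as plain strings, and nothing below reflects how pandas actually parses dtype strings.

```python
import re

# name, optionally followed by a bracketed, comma-separated argument list
_DTYPE_PATTERN = re.compile(r"^(?P<name>[^\[\]]+)(\[(?P<args>.*)\])?$")


def split_dtype_string(string):
    """Split 'name[a, b]' into ('name', ['a', 'b'])."""
    match = _DTYPE_PATTERN.match(string)
    if match is None:
        raise TypeError("Cannot parse dtype string '{}'".format(string))
    raw_args = match.group('args')
    args = [arg.strip() for arg in raw_args.split(',')] if raw_args else []
    return match.group('name'), args


class FakeDatetimeDtype:
    """Hypothetical parametrized dtype, used only for this sketch."""

    name = 'datetime64'

    def __init__(self, *args):
        self.args = args

    @classmethod
    def construct_from_string(cls, string):
        name, args = split_dtype_string(string)
        if name != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        # Everything inside the square brackets is forwarded as *args.
        return cls(*args)


dtype = FakeDatetimeDtype.construct_from_string('datetime64[ns, UTC]')
```

One design question this leaves open is type conversion: the constructor receives strings such as `'ns'` and `'UTC'`, so each dtype would still need to interpret them.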

jreback

@TomAugspurger IIRC we explicitly turned off the parsing of datetime w/tz dtypes (in the Dtype constructor).

@datapythonista thus I don't think this is an instructive example here (no objection to moving the integer usecase here though);

there is also a usage in intervals I think?

this would need a test for existence of this method as well

jreback · 2019-05-30T01:05:46Z

pandas/core/dtypes/base.py

@@ -174,15 +174,26 @@ def construct_array_type(cls):
    @classmethod
    def construct_from_string(cls, string):


can you add types

jorisvandenbossche · 2019-05-30T11:55:51Z

pandas/tests/dtypes/test_dtypes.py

@@ -51,6 +51,36 @@ def test_pickle(self):
        assert result == self.dtype


+class TestBaseDtype:


Such tests are already in the base extension tests (tests/extension/base/dtype), or can be expanded there

I had the feeling that those were only base tests for implementing tests for custom extensions, and I couldn't find a place where I thought it'd make sense to have the abstract class test. Where exactly would you put them?

> those were only base tests for implementing tests for custom extensions

Yes, but we have custom extensions ourselves, so we use those tests ourselves (and in addition we also have a few extra test EAs).

For example in tests/extension/decimal/array.py you can remove the custom implementation and let it inherit, and then it is already tested in that way.

> where I thought it'd make sense to have the abstract class test

What do you mean with abstract class test?

I don't think I understand what you want to test.

@jreback wanted a test for the existence of the method construct_from_string, which makes sense to me. Ideally, we'd already have a test, testing that the method raises an exception when called directly from ExtensionDtype (what I called the abstract class before). Then I'd just replace that by checking that now it's implemented, and it can construct an instance given the string of the name of the dtype.

I saw that in extension/base/dtype.py we already have a test for construct_from_string, but that's meant for custom extensions (ours or from a third-party) to test that their implementation of construct_from_string follows the expected behavior. The way of using that test is to create a subclass of BaseDtypeTests.

I don't think we want to test that the method is implemented in the ExtensionDtype (the base/abstract class) by checking a subclass of it. So, I see two options:

1. Create a subclass of BaseDtypeTests whose dtype under test is not a subclass but ExtensionDtype itself (it will fail for the abstract methods that still exist in ExtensionDtype).

2. Create an independent test for ExtensionDtype itself rather than its subclasses, one that doesn't subclass BaseDtypeTests (this is what I've done here, testing not only construct_from_string but all the methods that I could test, since there weren't tests for the "default"/base implementations).

Am I missing something here? Sorry I don't understand exactly what you propose.
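
To illustrate the pattern under discussion, the sketch below uses invented stand-ins (`BaseDtype`, `DummyDtype`, `BaseDtypeChecks`) rather than the pandas test suite: a reusable class of checks is written against `self.dtype`, and a concrete subclass supplies the dtype under test. Plain `try`/`except` replaces `pytest.raises` so the snippet has no dependencies.

```python
class BaseDtype:
    """Hypothetical stand-in for the base dtype class."""

    name = None

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()


class DummyDtype(BaseDtype):
    name = 'dummy'  # relies entirely on the inherited default


class BaseDtypeChecks:
    """Reusable checks; subclasses set ``dtype`` to the instance under test."""

    dtype = None

    def check_construct_from_own_name(self):
        dtype_cls = type(self.dtype)
        result = dtype_cls.construct_from_string(self.dtype.name)
        assert isinstance(result, dtype_cls)

    def check_construct_from_another_type_raises(self):
        dtype_cls = type(self.dtype)
        try:
            dtype_cls.construct_from_string('another_type')
        except TypeError as err:
            assert "from 'another_type'" in str(err)
        else:
            raise AssertionError("TypeError was not raised")


class DummyDtypeChecks(BaseDtypeChecks):
    dtype = DummyDtype()


checks = DummyDtypeChecks()
checks.check_construct_from_own_name()
checks.check_construct_from_another_type_raises()
```

With this layout the default implementation is exercised through a subclass that does not override it, which is roughly the first option above without touching the abstract methods of the base class.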

jorisvandenbossche · 2019-05-30T16:51:58Z

pandas/tests/dtypes/test_dtypes.py

+        assert isinstance(dtype_instance, self.dtype.__class__)
+        with pytest.raises(TypeError, match="Cannot construct a 'DummyDtype' "
+                                            "from 'another_type'"):
+            self.dtype.__class__.construct_from_string('another_type')


I think this would be a useful test to add to the base extension tests, and the test_eq as well. The defaults tests are maybe harder though.

jreback · 2019-06-01T14:43:37Z

pandas/tests/dtypes/test_dtypes.py

@@ -51,6 +51,36 @@ def test_pickle(self):
        assert result == self.dtype


+class TestBaseDtype:
+    def setup_method(self):


we really don't want to use setup_method generally at all.

jreback · 2019-06-01T14:44:03Z

pandas/tests/dtypes/test_dtypes.py

@@ -51,6 +51,36 @@ def test_pickle(self):
        assert result == self.dtype


+class TestBaseDtype:


see @jorisvandenbossche comments on where tests should go, we have a very general framework for these already.

datapythonista · 2019-06-05T16:18:20Z

@jorisvandenbossche @jreback I think the tests are now in the location you meant. Let me know if you have any further comments.

jorisvandenbossche · 2019-06-05T18:51:18Z

Thanks! That's what I was thinking about. You are happy with that as well?

I think there are still some custom implementations of construct_from_string (now identical to the parent one) in the test arrays that can be removed (in decimal, json, arrow/bool).

datapythonista · 2019-06-06T10:13:21Z

I think the tests are now testing something slightly different than before. They're testing that new extension dtypes satisfy the implementation of the base class. Before, I was testing the base class itself. I think both are reasonable, and I'm happy this way too, but I think what I implemented before also made sense.

I deleted the two construct_from_string methods that were repeating the new base class behavior; there are no more to delete.

Thanks for the review, let me know if there is anything else that should be changed, I think it should be ready now.

jreback · 2019-06-06T14:28:27Z

thanks @datapythonista

TomAugspurger · 2019-06-08T12:29:24Z

This apparently caused some performance issues: http://pandas.pydata.org/speed/pandas/index.html#sparse.SparseDataFrameConstructor.time_from_scipy?commits=5a724b5cd796a6ede3cb95b8687eaf561e9d57b2

Will open a proper issue later.

jorisvandenbossche · 2019-06-10T16:38:20Z

It seems a bit strange that this PR is the cause of that regression, as SparseDtype has its own implementation of construct_from_string.

jorisvandenbossche · 2019-06-10T20:13:20Z

Quick profiling of the specific benchmark points to the culprit: the dtype-checking code does a try/except, and the error message raised inside it formats the full Series.

TomAugspurger · 2019-06-10T20:29:49Z

Ah, thanks for digging into it!

It's unfortunate that such a minor change can have such broad consequences on performance. I assume this is from something like pandas_dtype(thing), where thing is an array-like rather than a dtype? If we were more careful about not passing data to pandas_dtype, we would have been OK?

jorisvandenbossche · 2019-06-11T07:33:46Z

pandas/core/dtypes/dtypes.py

-            else:
-                raise TypeError("cannot construct a CategoricalDtype")
-        except AttributeError:
-            pass


Does anybody know why we had this "except AttributeError: pass" here?

jorisvandenbossche · 2019-06-11T07:38:41Z

We have had similar performance regressions before. In general we need to be careful with what we print in error messages, but this is of course easy to overlook, as we did in this case. But I would say: that's what we have benchmarks for, and they worked nicely to catch it here.
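
As a hedged illustration of why the message matters, the sketch below uses invented classes (`HugeRepr`, `SlowDtype`, `FastDtype`, `is_dtype`), not pandas code. The slow variant interpolates the rejected object into the message, so its full representation is built even though the caller discards the exception; the fast variant checks the type first and reports only the type name. This is one possible mitigation, not necessarily the fix that was adopted.

```python
class HugeRepr:
    """Stand-in for an object with a costly repr, e.g. a dict of many Series."""

    repr_calls = 0

    def __repr__(self):
        HugeRepr.repr_calls += 1
        return "x" * 10**6


class SlowDtype:
    name = 'slow'

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            # Formatting ``string`` builds its full representation.
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()


class FastDtype:
    name = 'fast'

    @classmethod
    def construct_from_string(cls, string):
        if not isinstance(string, str):
            # Only the type name is formatted, never the object itself.
            raise TypeError("Expected a string, got {}".format(
                type(string).__name__))
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()


def is_dtype(dtype_cls, obj):
    """Dtype check in the try/except style; the message is thrown away."""
    try:
        dtype_cls.construct_from_string(obj)
    except TypeError:
        return False
    return True


obj = HugeRepr()
is_dtype(SlowDtype, obj)
calls_after_slow = HugeRepr.repr_calls  # repr was built for the message
is_dtype(FastDtype, obj)
calls_after_fast = HugeRepr.repr_calls  # unchanged by the fast variant
```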

jorisvandenbossche · 2019-06-11T07:51:53Z

I did #26776 with a possible fix.

I also don't think it is that important a fix, as there are some strange things going on in SparseDataFrame._init_spmatrix: it passes a dict of {colname: SparseSeries} to the Index.difference method (while it should pass something like dict.values(), I think), so it is the conversion of this dict to an Index that takes a lot of time. That conversion checks whether the object is categorical, which produces the error message because dicts are not special-cased there, and so ends up printing a dict of 1000 Series.
Given that this is all deprecated, I am not going to put effort into cleaning that up.

Provide Extension.Dtype.construct_from_string by default

20ec4e4

datapythonista added the Dtype Conversions label May 29, 2019

jorisvandenbossche approved these changes May 29, 2019

datapythonista added 2 commits May 29, 2019 20:02

Merge remote-tracking branch 'upstream/master' into construct_from_st…

abb5eb1

…ring

Improving construct_from_string docstring, and remove unnecessary ove…

661d522

…rwrite of the method

Ignoring flake8 false positive

8ff1965

jreback requested changes May 30, 2019

Marc Garcia added 3 commits May 30, 2019 10:03

Merge remote-tracking branch 'upstream/master' into construct_from_st…

609910c

…ring

Addressing CI and PR comments: implementing tests, fixing docstring e…

a50301a

…rror and removing CategoricalDtype.construct_from_string

Adding type annotations

e04ed79

jorisvandenbossche reviewed May 30, 2019

Merge remote-tracking branch 'upstream/master' into construct_from_st…

ef2d005

…ring

jreback requested changes Jun 1, 2019

Marc Garcia added 3 commits June 5, 2019 15:11

Merge remote-tracking branch 'upstream/master' into construct_from_st…

54cca4f

…ring

Moving tests to pandas/tests/extension/

7fa61d2

Fixing test (error message)

2897ad2

TomAugspurger added this to the 0.25.0 milestone Jun 6, 2019

TomAugspurger approved these changes Jun 6, 2019

jreback approved these changes Jun 6, 2019

jreback merged commit 5a724b5 into pandas-dev:master Jun 6, 2019

jorisvandenbossche reviewed Jun 11, 2019

jorisvandenbossche mentioned this pull request Jun 11, 2019

PERF: avoid printing object in Dtype.construct_from_string message #26776

Merged

simonjayhawkins mentioned this pull request Dec 12, 2019

CLN: more consistent error message for ExtensionDtype.construct_from_string #30247

Merged
