Skip to content

Provide ExtensionDtype.construct_from_string by default #26562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 6, 2019
Merged

Provide ExtensionDtype.construct_from_string by default #26562

merged 11 commits into from
Jun 6, 2019

Conversation

datapythonista
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I think it makes sense to provide a standard construct_from_string by default, instead of forcing subclasses of ExtensionDtype to implement it.

This way we can define a simple dtype with:

class SimpleDtype(pandas.core.dtypes.dtypes.ExtensionDtype):
    name = 'simple'

    @property
    def type(self):
        return object

instead of:

class SimpleDtype(pandas.core.dtypes.dtypes.ExtensionDtype):
    name = 'simple'

    @property
    def type(self):
        return object

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("Cannot construct a '{}' from '{}'".format(
                cls.__name__, string))
        return cls()

CC: @TomAugspurger

@datapythonista datapythonista added the Dtype Conversions Unexpected or buggy dtype conversions label May 29, 2019
@codecov
Copy link

codecov bot commented May 29, 2019

Codecov Report

Merging #26562 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26562      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.02%     
==========================================
  Files         174      174              
  Lines       50649    50652       +3     
==========================================
- Hits        46483    46479       -4     
- Misses       4166     4173       +7
Flag Coverage Δ
#multiple 90.29% <0%> (-0.01%) ⬇️
#single 41.69% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/base.py 94.44% <0%> (-5.56%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91da0c...20ec4e4. Read the comment docs.

@codecov
Copy link

codecov bot commented May 29, 2019

Codecov Report

Merging #26562 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26562      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      174              
  Lines       50663    50657       -6     
==========================================
- Hits        46548    46539       -9     
- Misses       4115     4118       +3
Flag Coverage Δ
#multiple 90.4% <100%> (ø) ⬆️
#single 41.93% <100%> (+0.08%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/integer.py 96.3% <ø> (-0.05%) ⬇️
pandas/core/dtypes/dtypes.py 97.54% <ø> (+0.2%) ⬆️
pandas/core/dtypes/base.py 100% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0c41f7...2897ad2. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a good idea to me.

I think the example in the docstring can then be removed? As the example is then the actual implementation.
Or update it to say that the default implementation is adequate in case that the dtype can be constructed without any arguments.

@TomAugspurger
Copy link
Contributor

Seems reasonable. I'm not sure what the motivation was for not doing that in the first place. Possibly that we internally call construct_from_string() inside try / except TypeError blocks in a few places, which would potentially mask the issue? But I think our tests should be adequate.

@pep8speaks
Copy link

pep8speaks commented May 29, 2019

Hello @datapythonista! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-05 15:37:26 UTC

@datapythonista
Copy link
Member Author

Thanks for the feedback. I fixed the docstrings, improving the example to show how to overwrite the method, and remove the method of the category dtype, since now was the same as the parent class.

I think in a follow up PR we could make a generic construct_from_string method that extracts the arguments from a string like datetime64[ns, UTC] and returns an instance of the class passing the list of everything in the squared brackets as *args. This way we shouldn't have to overwrite construct_from_string for any dtype. I guess I'm not missing anything.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger IIRC we explicitly turned off the parsing of datetime w/tz dtypes (in the Dtype constructor).

@datapythonista thus I don't think this is an instructive example here (no objection to moving the integer usecase here though);

there is also a usage in intervals I think?

this would need a test for existence of this method as well

@@ -174,15 +174,26 @@ def construct_array_type(cls):
@classmethod
def construct_from_string(cls, string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add types

@@ -51,6 +51,36 @@ def test_pickle(self):
assert result == self.dtype


class TestBaseDtype:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such tests are already in the base extension tests (tests/extension/base/dtype), or can be expanded there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the feeling that those were only base tests for implementing tests for custom extensions, couldn't find a place where I thought it'd make sense to have the abstract class test. Where exactly would you have them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those were only base tests for implementing tests for custom extensions

Yes, but we have custom extensions ourselves, so we use those tests ourselves (and in addition we also have a few extra test EAs).

For example in tests/extension/decimal/array.py you can remove the custom implementation and let it inherit, and then it is already tested in that way.

where I thought it'd make sense to have the abstract class test

What do you mean with abstract class test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand what you want to test.

@jreback wanted a test for the existence of the method construct_from_string, which makes sense to me. Ideally, we'd already have a test, testing that the method raises an exception when called directly from ExtensionDtype (what I called the abstract class before). Then I'd just replace that by checking that now it's implemented, and it can construct an instance given the string of the name of the dtype.

I saw that in extension/base/dtype.py we already have a test for construct_from_string, but that's meant for custom extensions (ours or from a third-party) to test that their implementation of construct_from_string follows the expected behavior. The way of using that test is to create a subclass of BaseDtypeTests.

I don't think we want to test that the method is implemented in the ExtensionDtype (the base/abstract class) by checking a subclass of it. So, I see two options:

  • Create a subclass of BaseDtypeTests defining that the dtype to test is not a subclass but ExtensionDtype itself (it will fail for the abstract methods that still exist in `ExtensionDtype).
  • Create an independent test for ExtensionDtype and not their subclasses, that doesn't subclass BaseDtypeTests (this is what I've done here, not testing only construct_from_string but all the methods that I could tests, since there weren't tests for the the "default"/base implementations).

Am I missing something here? Sorry I don't understand exactly what you propose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see @jorisvandenbossche comments on where tests should go, we have a very general framework for these already.

assert isinstance(dtype_instance, self.dtype.__class__)
with pytest.raises(TypeError, match="Cannot construct a 'DummyDtype' "
"from 'another_type'"):
self.dtype.__class__.construct_from_string('another_type')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a useful test to add to the base extension tests, and the test_eq as well. The defaults tests are maybe harder though.

@@ -51,6 +51,36 @@ def test_pickle(self):
assert result == self.dtype


class TestBaseDtype:
def setup_method(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really don't want to use setup_method generally at all.

@@ -51,6 +51,36 @@ def test_pickle(self):
assert result == self.dtype


class TestBaseDtype:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see @jorisvandenbossche comments on where tests should go, we have a very general framework for these already.

@datapythonista
Copy link
Member Author

@jorisvandenbossche @jreback I think the tests are in the location you meant now. Let me know if you have any further comment.

@jorisvandenbossche
Copy link
Member

Thanks! That's what I was thinking about. You are happy with that as well?

I think there are still some custom implementation of construct_from_string (that are now identical as the parent one) in the test arrays that can be removed (in decimal, json, arrow/bool)

@datapythonista
Copy link
Member Author

I think the tests now are testing something slightly different as before. They're testing that new extension dtypes satisfy the implementation of the base class. Before I was testing the base class itself. I think both are reasonable, I'm happy this way too, but I think it also made sense what I implemented before.

I deleted the two construct_from_string methods that were repeating the new base class behavior, there are no more to delete.

Thanks for the review, let me know if there is anything else that should be changed, I think it should be ready now.

@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jun 6, 2019
@jreback jreback merged commit 5a724b5 into pandas-dev:master Jun 6, 2019
@jreback
Copy link
Contributor

jreback commented Jun 6, 2019

thanks @datapythonista

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche
Copy link
Member

It seems a bit strange that this PR is the cause of that regression, as SparseDtype has its own implementation of construct_from_dtype.

@jorisvandenbossche
Copy link
Member

Quick profiling of the specific benchmark points out the culprit: in the dtype checking code if doing a try/except it formats the full series in the error message.

@TomAugspurger
Copy link
Contributor

Ah, thanks for digging into it!

It's unfortunate that such a minor change can have such broad consequences on performance. I assume this is from something like pandas_dtype(thing), where thing is an array-like rather than a dtype? If we were more careful about not passing data to pandas_dtype, we would have been OK?

else:
raise TypeError("cannot construct a CategoricalDtype")
except AttributeError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anybody know why we had this "except AttributeError: pass" here?

@jorisvandenbossche
Copy link
Member

We have had similar performance regressions before. In general we need to be careful with what we print in error messages, but this is of course easy to overlook, as we did in this case. But I would say: that's what we have benchmarks for, and they worked nicely to catch it here.

@jorisvandenbossche
Copy link
Member

I did #26776 with a possible fix.

I also don't think it is that an important fix, as there are some strange things going on in SparseDataFrame._init_spmatrix, where they pass a dict of {colname: SparseSeries} to Index.difference method (while it should pass something like dict.values() I think), and so it is the conversion of this dict to an Index that is taking a lot of time (in there it is checking if this object is categorical, leading to the error message, as dicts are not special cased there, and leading to printing a dict of 1000 serieses).
Given that this is all deprecated, I am not going to put effort in cleaning that up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants