Skip to content

Bugfix for constructing empty Series, with index, using ExtensionDtyp… #44615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

Bugfix for constructing empty Series, with index, using ExtensionDtyp… #44615

wants to merge 12 commits into from

Conversation

venaturum
Copy link
Contributor

@venaturum venaturum commented Nov 25, 2021

…e with na_value of None

@pep8speaks
Copy link

pep8speaks commented Nov 25, 2021

Hello @venaturum! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-03-01 00:18:02 UTC

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add tests and whatsnew

@venaturum
Copy link
Contributor Author

@phofl a test for this would require defining an ExtensionDtype and ExtensionArray which is quite a lot of code to be adding. Are you aware of any existing "dummy" classes in the test suites that could be used?

@phofl
Copy link
Member

phofl commented Nov 25, 2021

Just use the example from the issue?

@venaturum
Copy link
Contributor Author

venaturum commented Nov 25, 2021

Although the example works for demonstrating the recursion error it doesn't end up giving the expected result due to implementations in the IntervalArray class.

I'll have a play and see how minimal I can make a test.

@venaturum
Copy link
Contributor Author

@phofl

This is as small as I can make it

import pandas as pd
from pandas.testing import assert_series_equal
import numbers
from collections.abc import Iterable
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
)

def test_empty_series_construction_with_na_value_None():

    class DummyClass:

        def __init__(self):
            pass

    @register_extension_dtype
    class DummyDtype(ExtensionDtype):

        type = DummyClass
        name = "dummy"
        na_value = None

        @classmethod
        def construct_from_string(cls, string):
            if string == cls.name:
                return cls()
            else:
                raise TypeError("Cannot construct a '{}' from " "'{}'".format(cls, string))

        @classmethod
        def construct_array_type(cls):
            return DummyArray


    class DummyArray(ExtensionArray):

        _dtype = DummyDtype()

        def __init__(self, values):
            self._values = values

        @property
        def dtype(self):
            return self._dtype

        def __len__(self):
            return len(self._values)

        def __getitem__(self, idx):
            if isinstance(idx, numbers.Integral):
                return self._values[idx]
            elif isinstance(idx, (Iterable, slice)):
                return DummyArray(self._values[idx])
            else:
                raise TypeError("Index type not supported", idx)

        @classmethod
        def _from_sequence(cls, scalars, dtype=None, copy=False):
            if isinstance(scalars, DummyClass):
                scalars = [scalars]
            return DummyArray(scalars)

    result = pd.Series(index=[1, 2, 3], dtype="dummy")
    expected = pd.Series([None] * 3, index=[1, 2, 3], dtype="dummy")
    assert_series_equal(result, expected)

I would need some guidance on where it in the test suite it belongs.

@phofl
Copy link
Member

phofl commented Nov 25, 2021

Int64Dtype works for example

@venaturum
Copy link
Contributor Author

@phofl

Unfortunately it won't pass the test, for the same reasons as IntervalArray

import pandas as pd
from pandas import Int64Dtype

Int64Dtype.na_value = None
dtype = Int64Dtype()

pd.Series(index=[1,2,3], dtype=dtype)

This results in

1    <NA>
2    <NA>
3    <NA>
dtype: Int64

when the desired result is

1    None
2    None
3    None
dtype: Int64

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the right place for this. tests of course are the first thing to do here.

@venaturum
Copy link
Contributor Author

@jreback would it be better handled within the following function?

def create_series_with_explicit_dtype(

@jbrockmendel
Copy link
Member

Unfortunately it won't pass the test, for the same reasons as IntervalArray [...]

Yah it'd be pretty convenient if subclasses could change Int64Dtype.na_value to, say, np.nan, which would be way less fragile.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2021

closing as this needs tests and new implementation. ping when ready / open a new PR

@jreback jreback closed this Dec 5, 2021
@venaturum
Copy link
Contributor Author

@jreback

Hey Jeff, the test here does the job, I just wish I could make it slimmer but I think that would require a different approach

As for a new implementation I'm happy to try something else but need guidance on where in the code it belongs.

Cheers.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2021

@venaturum not averse to a test like that

but should be included as a first step

@jreback jreback reopened this Dec 5, 2021
@venaturum
Copy link
Contributor Author

@venaturum not averse to a test like that

but should be included as a first step

Would the test be best placed in pandas\tests\extension\test_common.py ? ?

@jbrockmendel
Copy link
Member

Would the test be best placed in pandas\tests\extension\test_common.py ? ?

try tests.extension.base.constructors

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@venaturum pls add a whatsnew note in 1.4 EA section

cc @jbrockmendel @phofl thoughts here

@jreback jreback added Bug ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Dec 14, 2021
@venaturum
Copy link
Contributor Author

Note there is a test here

def test_series_constructor_no_data_with_index(self, dtype, na_value):
which tests what we need it to, it just needs to be used in conjunction with a Dtype with None na_value. Perhaps all the tests in this suite should be tested with an appropriate Dtype?

return self.data

@property
def dtype(self):
return DummyDtype()
return DummyArray._dtype()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need the inderection of _dtype? what is wrong with what was here before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because I'm subclassing DummyDtype and DummyArray in order to test several possible na_values

https://github.com/venaturum/pandas/blob/69117c1ebd1621a69164129169232c83a7439b19/pandas/tests/extension/test_common.py#L138

@pytest.mark.parametrize("na_value", [np.nan, pd.NA, None])
def test_empty_series_construction(na_value):
    class TempDType(DummyDtype):
        @classmethod
        def construct_array_type(cls):
            return TempArray

    TempDType.na_value = na_value

    class TempArray(DummyArray):
        _dtype = TempDType

    result = pd.Series(index=[1, 2, 3], dtype=TempDType())
    expected = pd.Series([na_value] * 3, index=[1, 2, 3], dtype=TempDType())
    tm.assert_series_equal(result, expected)

@venaturum venaturum marked this pull request as draft January 28, 2022 02:04
@venaturum
Copy link
Contributor Author

@jreback @jbrockmendel

A different approach to the tests can be found here

https://github.com/venaturum/pandas/blob/GH44602_dtype_na_value/pandas/tests/extension/test_xx.py

What is different:

  • existing test code is left alone
  • the BaseConstructorsTests suite is ran over the custom ExtensionArray, which includes a test for #GH44602
  • only na_value=None is being tested as opposed to alternatives such as pd.NA or np.nan

If this is a better approach I will abandon this PR and start again, or if there is some better way of integrating this approach into the existing tests then please let me know. Alternatively, I'm happy for someone else to take this and run with it.

scalars = [scalars]
return DummyArray(scalars)

def take(self, indices, allow_fill=False, fill_value=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need all of the new methods this implements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jbrockmendel , apologies for the slow reply. These methods, and those in the alternative solution are the minimum required for the tests to pass.

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, ping the current reviewers, and we can reopen.

@mroeschke mroeschke closed this Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: na_value = None in ExtensionDtype causes ExtensionArray tests to fail
6 participants