Skip to content

gh-372 : Fixing Series.astype() #519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Feb 25, 2023
Merged

gh-372 : Fixing Series.astype() #519

merged 34 commits into from
Feb 25, 2023

Conversation

ramvikrams
Copy link
Contributor

@ramvikrams ramvikrams commented Jan 25, 2023

I have created this draft PR and I'll keep commiting to it, so you can also check whether the changes are acceptable or not.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 25, 2023

You need to build a table first before doing a PR. Test locally as well before pushing. You have syntax errors.

@ramvikrams
Copy link
Contributor Author

You need to build a table first before doing a PR. Test locally as well before pushing. You have syntax errors.

Work is still under progress I'll keep commiting I created this draft PR so that you could be updated with the progress

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 25, 2023

Work is still under progress I'll keep commiting I created this draft PR so that you could be updated with the progress

OK, but I think you should build the table first before pushing any more code, so I can review the table. Once we have the table looking right, then we can move forward with code.

Also, you should test locally before pushing to avoid CI failures. E.g., poetry run poe test will run all the tests and will give you much faster feedback.

@ramvikrams
Copy link
Contributor Author

Work is still under progress I'll keep commiting I created this draft PR so that you could be updated with the progress

OK, but I think you should build the table first before pushing any more code, so I can review the table. Once we have the table looking right, then we can move forward with code.

Also, you should test locally before pushing to avoid CI failures. E.g., poetry run poe test will run all the tests and will give you much faster feedback.

I'll post a picture of the table in a 10 mins

@ramvikrams
Copy link
Contributor Author

image

Can you please check this table

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 25, 2023

image

Can you please check this table

Can you post the table as a Markdown table, so then it becomes easier to collaborate via my edits?

You need to add numpy types (np.int, np.float, etc.). You need to add the string representations, e.g., "int", "float", in addition to just int and float

I things like datetime.date and datetime.datetime will map to np.datetime64 . You have to experiment (and write associated tests).

I think it is the case that we need to remove some of the values from S1. for example, I think the datetime ones can be removed:

>>> import pandas as pd
>>> import datetime as dt
>>> a=[dt.datetime(2023, 1, 25), dt.datetime(2023, 1, 26)]
>>> a
[datetime.datetime(2023, 1, 25, 0, 0), datetime.datetime(2023, 1, 26, 0, 0)]
>>> s=pd.Series(a)
>>> s
0   2023-01-25
1   2023-01-26
dtype: datetime64[ns]
>>> s.dtype
dtype('<M8[ns]')
>>> type(s.dtype)
<class 'numpy.dtype[datetime64]'>

So in the above, I created a Series, containing python datetime, but pandas converted the dtype to numpy.datetime64, which probably should be added to S1.

Getting this right is not an easy task and you have to experiment to make sure that the types in S1 are the correct set, and that the arguments for astype() (and pd.Series.__new__() for that matter) are all coordinated.

@ramvikrams
Copy link
Contributor Author

Yeah i'll create a markdown table and do the required changes i'll do these changes tommorow as it's night here

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure there are entries in the left column for each of the types currently in S1

Once we get this table right, then I think we should split this into 2 PR's. One would remove elements from S1 in _typing.pyi . Once that is done, then we can work on astype().

@ramvikrams
Copy link
Contributor Author

I have updated the table can you please check it
we can delete the following from S1 :

datetime.date,
datetime.datetime,
datetime.time,
datetime.timedelta,
Timestamp,
Timedelta,
Period,
Interval[int],
Interval[float],
Interval[Timestamp],
Interval[Timedelta],

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 26, 2023

I have updated the table can you please check it we can delete the following from S1 :

datetime.date,
datetime.datetime,
datetime.time,
datetime.timedelta,

The above ones can be deleted because they get converted to numpy types.

Timestamp,
Timedelta,
Period,
Interval[int],
Interval[float],
Interval[Timestamp],
Interval[Timedelta],

We can't delete those because of how the typing works. But we can delete many of the references to Series[Timestamp], Series[Timedelta] and Series[Period]. throughout the stubs. See #520 that I'd like you do first.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Feb 7, 2023

I think the table is right

@ramvikrams
Copy link
Contributor Author

I think the table is right

ok then i'll start with it

@ramvikrams
Copy link
Contributor Author

Will we have to remove args like "int" which are in quotes because Flake8 says Quoted annotations should never be used in stubs

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For any of the tests such as:

check(assert_type(s.astype("bool"), "Series[bool]"), pd.Series, np.bool_)

you can remove the need to import Series, by changing it to this form:

check(assert_type(s.astype("bool"), "pd.Series[bool]"), pd.Series, np.bool_)

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer, and thanks for all the hard work.

Seeing the issue that DataFrame.astype() has no type for the dtype argument, we need to do a bit more to support that as well. So, can you do the following:

  1. Take all of the type combinations that you have in series.pyi and make them into new types in _typing.pyi using the following pattern:
BooleanDtypeArg: TypeAlias = type[bool] | type[np.bool_] | pd.BooleanDtype | Literal["bool"]
IntDtypeArg: TypeAlias = type[_str] | pd.StringDtype | Literal["str"]
#  etc for each set of arguments 
  1. Then modify series.pyi to import those aliases and use them in series.pyi, which should shorten up that piece of code.
  2. In _typing.pyi, change AsTypeArg to be the union of all of the types created in step 1 PLUS ExtensionDtype
  3. Create a new overload that should be last in series.pyi for dtype: ExtensionDtype that returns Series . Note - you may need to add some # type: ignore[misc] in different places because I'm pretty sure mypy will complain about this.
  4. To test the argument of dtype: ExtensionDtype, do the following:
from decimal import Decimal
from pandas.tests.extension.decimal import DecimalDtype

check(assert_type(pd.Series([Decimal(x) for x in [1,2,3]]).astype(DecimalDtype()), pd.Series), Decimal)

NOTE: If for some reason either type checker complains about importing DecimalDtype (because it will type check the test code), then you can skip this test for now, but just leave it in commented out with a note that the type checkers aren't liking that import.

  1. Add tests in tests/test_frame.py that test DataFrame.astype(). Only need to test for a few of the arguments, but should also check using the dict argument as well.

@ramvikrams
Copy link
Contributor Author

This is getting closer, and thanks for all the hard work.

thanks sir, i'll do the changes and get back

@ramvikrams
Copy link
Contributor Author

ramvikrams commented Feb 24, 2023

  1. In _typing.pyi, change AsTypeArg to be the union of all of the types created in step 1 PLUS ExtensionDtype

Actually I get this when creating a union
Use PEP 604 union types instead of typing.Union (e.g. "int | str" instead of "Union[int, str]").

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Feb 24, 2023

  1. In _typing.pyi, change AsTypeArg to be the union of all of the types created in step 1 PLUS ExtensionDtype

Actually I get this when creating a union Use PEP 604 union types instead of typing.Union (e.g. "int | str" instead of "Union[int, str]").

Follow the pattern in _typing.pyi and use | to create unions.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to move all the XXXDtypeArg declarations to right above AstypeArg, in addition to the other comments below

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Feb 25, 2023

It appears the CI is failing because we test without pandas being installed.

So go back to commenting out the tests for decimal (and the import), and I will investigate this later.

@ramvikrams
Copy link
Contributor Author

It appears the CI is failing because we test without pandas being installed.

So go back to commenting out the tests for decimal (and the import), and I will investigate this later.

Done sir

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some unresolved comments

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to move all the XXXDtypeArg declarations from the bottom of _typing.pyi to right above AstypeArg, in addition to the other comments below

@@ -1441,8 +1441,7 @@ class DataFrame(NDFrame, OpsMixin):
) -> DataFrame: ...
def astype(
self,
dtype: AstypeArg | Mapping[Hashable, AstypeArg] | Series,
# dtype: AstypeArg | dict[Any, AstypeArg],
dtype: AstypeArg | Mapping[Hashable, AstypeArg] | Series | Any,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any is unacceptable here. Please remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with all the changes sir

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ramvikrams . This was not an easy one to get right, and I anticipate that we may get some feedback when this is released.

@Dr-Irv Dr-Irv merged commit c6815aa into pandas-dev:main Feb 25, 2023
@ramvikrams
Copy link
Contributor Author

Thanks @ramvikrams . This was not an easy one to get right, and I anticipate that we may get some feedback when this is released.

yes if any further change is required i'll do it and thanks sir

@ramvikrams ramvikrams deleted the t90 branch February 26, 2023 08:52
twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023
* adding overloads to astype

* Created the table for astype

* Update table.rst

* Updated the table and added numpy dtypes

* Update table.rst

* updated np.datetime64

* Update table.rst

* added types in Timedelta

* removed not required args in Dtype

* removed np.timedelta64 in Timedelta

* Removed timedelta64

* expanding series astype

* Added type in args in dtype

* corrected the args

* adding a overload for 'category'  and normal changes

* added tests

* removed unused args

* corrected tests

* Delete table.rst

* added the bool overload to top and done the required test  changes

* added type_checker

* added types for check and did requested changes

* updated the check types

* added astype in dataframe and other changes

* Update test_series.py

* Update test_series.py

* added dict test for astype in datatest_frame and tests for ExtensionDtype in test_series

* commented out the decimal tests

* Update test_series.py

* updated dtype args in astype

* added any to list of args for astype

* changed dtype args
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Series.astype() to return a Series[type] dependent on argument
2 participants