Skip to content

ENH: set __module__ for objects in pandas pd.DataFrame API #55171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 23, 2024

Conversation

aimlnerd
Copy link
Contributor

@aimlnerd aimlnerd commented Sep 16, 2023

This PR add @set_module decorator around class pd.DataFrame

Before this PR

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.core.frame.DataFrame'>

After this PR

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.DataFrame'>

This change would discourage users from incorrectly using

from pandas.core.frame import DataFrame

Inspired by the similar implementation in numpy:
numpy/numpy#12382

@jbrockmendel
Copy link
Member

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

@jorisvandenbossche jorisvandenbossche added the Sprints Sprint Pull Requests label Sep 16, 2023
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 17, 2023

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

@jbrockmendel this has nothing to do with a * import though, so I don't fully understand your comment.
It is true that it does mask where the object actually lives, so it makes it harder to find the location (the file) in the source code. But this masking is actually the purpose, with the idea that users don't need to see this (with the drawback for contributors, but in making the trade-off, there are much more users, so I would prefer prioritizing that use case).

EDIT: opened #55178 to keep the general discussion there

@jorisvandenbossche
Copy link
Member

@deepak-george there are a bunch of tests where it seems we rely on this exact "pandas.core.frame" string that will have to be updated as well.

First, there are some doctest failures (i.e. docstrings that need to be updated):




=================================== FAILURES ===================================
______________ [doctest] pandas.core.indexing.IndexingMixin.iloc _______________
207         c    3
208         d    4
209         Name: 0, dtype: int64
210 
211         With a list of integers.
212 
213         >>> df.iloc[[0]]
214            a  b  c  d
215         0  1  2  3  4
216         >>> type(df.iloc[[0]])
Expected:
    <class 'pandas.core.frame.DataFrame'>
Got:
    <class 'pandas.DataFrame'>

/home/runner/micromamba/envs/test/lib/python3.10/site-packages/pandas/core/indexing.py:216: DocTestFailure
_________________ [doctest] pandas.core.indexing._iLocIndexer __________________
1583 c    3
1584 d    4
1585 Name: 0, dtype: int64
1586 
1587 With a list of integers.
1588 
1589 >>> df.iloc[[0]]
1590    a  b  c  d
1591 0  1  2  3  4
1592 >>> type(df.iloc[[0]])
Expected:
    <class 'pandas.core.frame.DataFrame'>
Got:
    <class 'pandas.DataFrame'>

/home/runner/micromamba/envs/test/lib/python3.10/site-packages/pandas/core/indexing.py:1592: DocTestFailure

And then there are also normal test failures like:

____________________________ test_info_int_columns _____________________________
[gw2] darwin -- Python 3.9.18 /Users/runner/micromamba/envs/test/bin/python3.9

    @pytest.mark.xfail(not IS64, reason="GH 36579: fail on 32-bit system")
    def test_info_int_columns():
        # GH#37245
        df = DataFrame({1: [1, 2], 2: [2, 3]}, index=["A", "B"])
        buf = StringIO()
        df.info(show_counts=True, buf=buf)
        result = buf.getvalue()
        expected = textwrap.dedent(
            """\
            <class 'pandas.core.frame.DataFrame'>
            Index: 2 entries, A to B
            Data columns (total 2 columns):
             #   Column  Non-Null Count  Dtype
            ---  ------  --------------  -----
             0   1       2 non-null      int64
             1   2       2 non-null      int64
            dtypes: int64(2)
            memory usage: 48.0+ bytes
            """
        )
>       assert result == expected
E       AssertionError

pandas/tests/io/formats/test_info.py:501: AssertionError

and for all, see the failing CI logs.

It might be possible to fix most of them with a search replace of "pandas.core.frame.DataFrame" -> "pandas.DataFrame"

@jorisvandenbossche jorisvandenbossche added the Output-Formatting __repr__ of pandas objects, to_string label Sep 17, 2023
@aimlnerd aimlnerd requested a review from noatamir as a code owner September 18, 2023 13:53
@aimlnerd
Copy link
Contributor Author

aimlnerd commented Sep 19, 2023

@jorisvandenbossche Thanks for the feedback! I have made the changes like pandas.core.frame.DataFrame -> pandas.DataFrame & pylint fix

Now all the tests except "Unit Tests / Linux-32-bit (pull_request)" passed. I am looking into the logs and trying to fix but not sure yet what is going on.

@aimlnerd
Copy link
Contributor Author

aimlnerd commented Sep 19, 2023

@jorisvandenbossche Thanks for the feedback! I have made the changes like pandas.core.frame.DataFrame -> pandas.DataFrame & pylint fix

Now all the tests except "Unit Tests / Linux-32-bit (pull_request)" passed. I am looking into the logs and trying to fix but not sure yet what is going on.

@jorisvandenbossche
../../numpy/meson.build:207:4: ERROR: Problem encountered: No BLAS library detected! Install one, or use the allow-noblas build option (note, this may be up to 100x slower for some linear algebra operations).

I believe BLAS is already installed, so not sure if this should be fixed in this PR.

@jorisvandenbossche
Copy link
Member

@deepak-george you can ignore that error, that looks unrelated (and probably if you merge with the latest main branch, that might be resolved)

@aimlnerd
Copy link
Contributor Author

aimlnerd commented Oct 3, 2023

@deepak-george you can ignore that error, that looks unrelated (and probably if you merge with the latest main branch, that might be resolved)

Ok. Now all the tests including the new test added to check this functionality have passed. Is there something else that need to be done before you could merge? Please let me know the next step.

…__module__

� Conflicts:
�	doc/source/user_guide/enhancingperf.rst
�	doc/source/user_guide/io.rst
�	doc/source/whatsnew/v0.25.0.rst
@aimlnerd
Copy link
Contributor Author

@jorisvandenbossche What is the next step for this PR? Could this be merged?
I have pulled the recent changes from main and all checks have passed as well.

@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Dec 14, 2023
@jorisvandenbossche
Copy link
Member

@aimlnerd sorry for the slow follow-up here. But we might want to only merge this after pandas 2.2 is cut, so we merge this in the main branch for 3.0. Reason for this: in theory it could be a breaking change if someone relies on the __module__ attribute, so for safety it might be better to only do this in the major release.

It is the plan to branch 2.2.x in one week or two, so we can revisit this shortly!

@aimlnerd
Copy link
Contributor Author

@jorisvandenbossche Thanks for the context! Please let me know when you want to revisit, happy to make changes and complete it!

@jorisvandenbossche
Copy link
Member

Sorry for the slow follow-up, but in the meantime main is targetting 3.0 for a while, so we can merge this now. @aimlnerd thanks again!

@jorisvandenbossche jorisvandenbossche merged commit e51039a into pandas-dev:main Mar 23, 2024
46 checks passed
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
@@ -503,3 +503,24 @@ def indent(text: str | None, indents: int = 1) -> str:
"future_version_msg",
"Substitution",
]


def set_module(module):
Copy link
Member

@rhshadrach rhshadrach Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just came across this today, I do not see the benefit of having this decorator. We either have:

class DataFrame:
    __module__ = "pandas"

or

@set_module("pandas")
class DataFrame:
    ...

There is a cost, particularly for reading code you have to chase down the definition. I'm also seeing 6.5% longer runtime to instantiate an empty class with this decorator, which is a small impact on import time for pandas.

These are certainly small costs, so if there is a gain to be had then great. But I'm just not seeing what the gain is.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't think I had thought about that option for classes (setting it afterwards like DataFrame.__module__ = "pandas" would move it further away, but is also an option), but we just copied the approach used in numpy.

If it has overhead, it's certainly another reason to do it the other way (and agree that too many decorators make the code harder too read)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that as the decorator is untyped, we also loose type definitions when applied to functions but I can add this to the todo list (along with a whats new) in the issue itself.

And also investigate the testing so that no public api classes/functions get missed.

Rather than holding up the merging of the open PRs, this discussion should probably be moved to the open issue also.

I'll go ahead and get the open ones merged so that I can update the issue with what's left to close out the issue.

The task was more about checking whether any other changes were needed and it was only the Series one that needed other changes. So if we want to not use the decorator this could be an easy followup rather than a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Sprints Sprint Pull Requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants