ENH: set module for objects in pandas pd.DataFrame API #55171

aimlnerd · 2023-09-16T15:46:31Z

part of ENH: set __module__ on top-level public objects #55178
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR add @set_module decorator around class pd.DataFrame

Before this PR

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.core.frame.DataFrame'>

After this PR

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.DataFrame'>

This change would discourage users from incorrectly using

from pandas.core.frame import DataFrame

Inspired by the similar implementation in numpy:
numpy/numpy#12382

jbrockmendel · 2023-09-16T15:50:09Z

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

jorisvandenbossche · 2023-09-17T11:36:41Z

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

@jbrockmendel this has nothing to do with a * import though, so I don't fully understand your comment.
It is true that it does mask where the object actually lives, so it makes it harder to find the location (the file) in the source code. But this masking is actually the purpose, with the idea that users don't need to see this (with the drawback for contributors, but in making the trade-off, there are much more users, so I would prefer prioritizing that use case).

EDIT: opened #55178 to keep the general discussion there

jorisvandenbossche · 2023-09-17T11:39:38Z

@deepak-george there are a bunch of tests where it seems we rely on this exact "pandas.core.frame" string that will have to be updated as well.

First, there are some doctest failures (i.e. docstrings that need to be updated):




=================================== FAILURES ===================================
______________ [doctest] pandas.core.indexing.IndexingMixin.iloc _______________
207         c    3
208         d    4
209         Name: 0, dtype: int64
210 
211         With a list of integers.
212 
213         >>> df.iloc[[0]]
214            a  b  c  d
215         0  1  2  3  4
216         >>> type(df.iloc[[0]])
Expected:
    <class 'pandas.core.frame.DataFrame'>
Got:
    <class 'pandas.DataFrame'>

/home/runner/micromamba/envs/test/lib/python3.10/site-packages/pandas/core/indexing.py:216: DocTestFailure
_________________ [doctest] pandas.core.indexing._iLocIndexer __________________
1583 c    3
1584 d    4
1585 Name: 0, dtype: int64
1586 
1587 With a list of integers.
1588 
1589 >>> df.iloc[[0]]
1590    a  b  c  d
1591 0  1  2  3  4
1592 >>> type(df.iloc[[0]])
Expected:
    <class 'pandas.core.frame.DataFrame'>
Got:
    <class 'pandas.DataFrame'>

/home/runner/micromamba/envs/test/lib/python3.10/site-packages/pandas/core/indexing.py:1592: DocTestFailure

And then there are also normal test failures like:

____________________________ test_info_int_columns _____________________________
[gw2] darwin -- Python 3.9.18 /Users/runner/micromamba/envs/test/bin/python3.9

    @pytest.mark.xfail(not IS64, reason="GH 36579: fail on 32-bit system")
    def test_info_int_columns():
        # GH#37245
        df = DataFrame({1: [1, 2], 2: [2, 3]}, index=["A", "B"])
        buf = StringIO()
        df.info(show_counts=True, buf=buf)
        result = buf.getvalue()
        expected = textwrap.dedent(
            """\
            <class 'pandas.core.frame.DataFrame'>
            Index: 2 entries, A to B
            Data columns (total 2 columns):
             #   Column  Non-Null Count  Dtype
            ---  ------  --------------  -----
             0   1       2 non-null      int64
             1   2       2 non-null      int64
            dtypes: int64(2)
            memory usage: 48.0+ bytes
            """
        )
>       assert result == expected
E       AssertionError

pandas/tests/io/formats/test_info.py:501: AssertionError

and for all, see the failing CI logs.

It might be possible to fix most of them with a search replace of "pandas.core.frame.DataFrame" -> "pandas.DataFrame"

aimlnerd · 2023-09-19T07:41:13Z

@jorisvandenbossche Thanks for the feedback! I have made the changes like pandas.core.frame.DataFrame -> pandas.DataFrame & pylint fix

Now all the tests except "Unit Tests / Linux-32-bit (pull_request)" passed. I am looking into the logs and trying to fix but not sure yet what is going on.

aimlnerd · 2023-09-19T08:05:09Z

@jorisvandenbossche Thanks for the feedback! I have made the changes like pandas.core.frame.DataFrame -> pandas.DataFrame & pylint fix

Now all the tests except "Unit Tests / Linux-32-bit (pull_request)" passed. I am looking into the logs and trying to fix but not sure yet what is going on.

@jorisvandenbossche
../../numpy/meson.build:207:4: ERROR: Problem encountered: No BLAS library detected! Install one, or use the allow-noblas build option (note, this may be up to 100x slower for some linear algebra operations).

I believe BLAS is already installed, so not sure if this should be fixed in this PR.

jorisvandenbossche · 2023-10-03T10:12:05Z

@deepak-george you can ignore that error, that looks unrelated (and probably if you merge with the latest main branch, that might be resolved)

aimlnerd · 2023-10-03T11:37:55Z

@deepak-george you can ignore that error, that looks unrelated (and probably if you merge with the latest main branch, that might be resolved)

Ok. Now all the tests including the new test added to check this functionality have passed. Is there something else that need to be done before you could merge? Please let me know the next step.

…__module__ � Conflicts: � doc/source/user_guide/enhancingperf.rst � doc/source/user_guide/io.rst � doc/source/whatsnew/v0.25.0.rst

aimlnerd · 2023-10-27T08:51:58Z

@jorisvandenbossche What is the next step for this PR? Could this be merged?
I have pulled the recent changes from main and all checks have passed as well.

jorisvandenbossche · 2023-12-14T09:50:12Z

@aimlnerd sorry for the slow follow-up here. But we might want to only merge this after pandas 2.2 is cut, so we merge this in the main branch for 3.0. Reason for this: in theory it could be a breaking change if someone relies on the __module__ attribute, so for safety it might be better to only do this in the major release.

It is the plan to branch 2.2.x in one week or two, so we can revisit this shortly!

aimlnerd · 2023-12-14T10:10:08Z

@jorisvandenbossche Thanks for the context! Please let me know when you want to revisit, happy to make changes and complete it!

jorisvandenbossche · 2024-03-23T12:24:45Z

Sorry for the slow follow-up, but in the meantime main is targetting 3.0 for a while, so we can merge this now. @aimlnerd thanks again!

…v#55171) Co-authored-by: Joris Van den Bossche <[email protected]>

rhshadrach · 2024-11-09T17:37:32Z

pandas/util/_decorators.py

@@ -503,3 +503,24 @@ def indent(text: str | None, indents: int = 1) -> str:
    "future_version_msg",
    "Substitution",
 ]
+
+
+def set_module(module):


Just came across this today, I do not see the benefit of having this decorator. We either have:

class DataFrame: __module__ = "pandas"

or

@set_module("pandas") class DataFrame: ...

There is a cost, particularly for reading code you have to chase down the definition. I'm also seeing 6.5% longer runtime to instantiate an empty class with this decorator, which is a small impact on import time for pandas.

These are certainly small costs, so if there is a gain to be had then great. But I'm just not seeing what the gain is.

To be honest, I don't think I had thought about that option for classes (setting it afterwards like DataFrame.__module__ = "pandas" would move it further away, but is also an option), but we just copied the approach used in numpy.

If it has overhead, it's certainly another reason to do it the other way (and agree that too many decorators make the code harder too read)

I assume that as the decorator is untyped, we also loose type definitions when applied to functions but I can add this to the todo list (along with a whats new) in the issue itself.

And also investigate the testing so that no public api classes/functions get missed.

Rather than holding up the merging of the open PRs, this discussion should probably be moved to the open issue also.

I'll go ahead and get the open ones merged so that I can update the issue with what's left to close out the issue.

The task was more about checking whether any other changes were needed and it was only the Series one that needed other changes. So if we want to not use the decorator this could be an easy followup rather than a blocker.

aimlnerd added 3 commits September 16, 2023 13:46

fix pandas.DataFrame.agg.html documents

156b89f

add set __module__ for pd.DataFrame

828b9c4

revert the documentaion change made

69689de

aimlnerd added 2 commits September 16, 2023 17:52

fix documentation change

43b6429

remove doc change

77036a5

jorisvandenbossche added the Sprints Sprint Pull Requests label Sep 16, 2023

jorisvandenbossche added the Output-Formatting __repr__ of pandas objects, to_string label Sep 17, 2023

pandas.core.frame.DataFrame -> pandas.DataFrame

a4ace57

aimlnerd requested a review from noatamir as a code owner September 18, 2023 13:53

add new line for test

c7ea1d3

Merge branch 'main' of https://github.com/pandas-dev/pandas into set-…

d19eccb

…__module__ � Conflicts: � doc/source/user_guide/enhancingperf.rst � doc/source/user_guide/io.rst � doc/source/whatsnew/v0.25.0.rst

jorisvandenbossche added this to the 3.0 milestone Dec 14, 2023

jorisvandenbossche added 2 commits March 23, 2024 09:47

Merge remote-tracking branch 'upstream/main' into set-__module__

86b7f3e

lint

d337a71

jorisvandenbossche approved these changes Mar 23, 2024

View reviewed changes

jorisvandenbossche merged commit e51039a into pandas-dev:main Mar 23, 2024
46 checks passed

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024

ENH: set __module__ for objects in pandas pd.DataFrame API (pandas-de…

22cc611

…v#55171) Co-authored-by: Joris Van den Bossche <[email protected]>

mutricyl mentioned this pull request Jul 12, 2024

follow up from #237 & #239 hgrecco/pint-pandas#240

Closed

jorisvandenbossche mentioned this pull request Nov 8, 2024

ENH: set __module__ on top-level public objects #55178

Open

11 tasks

rhshadrach reviewed Nov 9, 2024

View reviewed changes

amansharma612 mentioned this pull request Nov 16, 2024

ENH: set __module__ on concat function in concat.py #60334

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: set module for objects in pandas pd.DataFrame API #55171

ENH: set module for objects in pandas pd.DataFrame API #55171

aimlnerd commented Sep 16, 2023 •

edited

Loading

jbrockmendel commented Sep 16, 2023

jorisvandenbossche commented Sep 17, 2023 •

edited

Loading

jorisvandenbossche commented Sep 17, 2023

aimlnerd commented Sep 19, 2023 •

edited

Loading

aimlnerd commented Sep 19, 2023 •

edited

Loading

jorisvandenbossche commented Oct 3, 2023

aimlnerd commented Oct 3, 2023

aimlnerd commented Oct 27, 2023

jorisvandenbossche commented Dec 14, 2023

aimlnerd commented Dec 14, 2023

jorisvandenbossche commented Mar 23, 2024

rhshadrach Nov 9, 2024 •

edited

Loading

jorisvandenbossche Nov 9, 2024 •

edited

Loading

simonjayhawkins Nov 11, 2024

ENH: set __module__ for objects in pandas pd.DataFrame API #55171

ENH: set __module__ for objects in pandas pd.DataFrame API #55171

Conversation

aimlnerd commented Sep 16, 2023 • edited Loading

jbrockmendel commented Sep 16, 2023

jorisvandenbossche commented Sep 17, 2023 • edited Loading

jorisvandenbossche commented Sep 17, 2023

aimlnerd commented Sep 19, 2023 • edited Loading

aimlnerd commented Sep 19, 2023 • edited Loading

jorisvandenbossche commented Oct 3, 2023

aimlnerd commented Oct 3, 2023

aimlnerd commented Oct 27, 2023

jorisvandenbossche commented Dec 14, 2023

aimlnerd commented Dec 14, 2023

jorisvandenbossche commented Mar 23, 2024

rhshadrach Nov 9, 2024 • edited Loading

Choose a reason for hiding this comment

jorisvandenbossche Nov 9, 2024 • edited Loading

Choose a reason for hiding this comment

simonjayhawkins Nov 11, 2024

Choose a reason for hiding this comment

ENH: set module for objects in pandas pd.DataFrame API #55171

ENH: set module for objects in pandas pd.DataFrame API #55171

aimlnerd commented Sep 16, 2023 •

edited

Loading

jorisvandenbossche commented Sep 17, 2023 •

edited

Loading

aimlnerd commented Sep 19, 2023 •

edited

Loading

aimlnerd commented Sep 19, 2023 •

edited

Loading

rhshadrach Nov 9, 2024 •

edited

Loading

jorisvandenbossche Nov 9, 2024 •

edited

Loading