Skip to content

ENH: set __module__ on top-level public objects #55178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
5 of 11 tasks
jorisvandenbossche opened this issue Sep 17, 2023 · 8 comments
Open
5 of 11 tasks

ENH: set __module__ on top-level public objects #55178

jorisvandenbossche opened this issue Sep 17, 2023 · 8 comments
Labels
Enhancement Output-Formatting __repr__ of pandas objects, to_string
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 17, 2023

Currently the repr of the DataFrame class (and any other class or method in the main namespace) shows the "full code path" of where the object is actually defined:

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.core.frame.DataFrame'>

while we could also make it show the code path of how it is publicly exposed (and expected to be imported and used):

>>> pd.DataFrame
<class 'pandas.DataFrame'>

The above can be achieved by setting the __module__ attribute on the classes and methods. In numpy they already do this for several years, and so the repr of top-level functions or objects shows "numpy.<..>", and not things like "numpy.core.multiarray..". The main PR in numpy that implemented this: numpy/numpy#12382

I think the main benefits are:

  • Reduce the visual noise and hide implementation details for users (no regular user needs to know that DataFrame class is defined in pandas/core/frame.py)
  • Avoid that people tend to incorrectly import from where it is defined (i.e. discourage from pandas.core.frame import DataFrame, a pattern that we often see in downstream packages). I think this would also help for making pandas.core private (and potentially renaming it, xref DEPR: pandas/core #27522, cc @rhshadrach)

The main disadvantage is that we thus mask where an object lives, which makes it harder for contributors to figure that out. On the draft PR, @jbrockmendel also commented:

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

This does not change any * imports (it only changes the visual repr), but that aside, it certainly hides a bit more where something is defined, making it harder to find the location (the file) in the source code. But this masking is the purpose of the proposal, with the idea that this is better for users (see bullet points above). I certainly comes with the drawback for contributors, but in making the trade-off, there are much more users, so I would personally go with prioritizing that use case (and for contributors, there are still many other ways to find where something is defined: looking at our imports in the codebase, searching for "class DataFrame", ...).


Overview of objects:

@rhshadrach
Copy link
Member

  • Avoid that people tend to incorrectly import from where it is defined (i.e. discourage from pandas.core.frame import DataFrame, a pattern that we often see in downstream packages)

Wow: https://github.com/search?q=%22from+pandas.core.frame+import+DataFrame%22&type=code

I agree with the trade off proposed here. I think not exposing the internal locations in such a visible way to users is more valuable than aiding developers in navigating the pandas codebase in this fashion.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 20, 2023

FWIW, for numpy, if you use VS Code, and do a "Go to Definition" on any function that is defined in python (not ones that are C-python), it does go to the correct file. So contributors could figure out the locations of code that way. I tested this with np.linspace(). So I imagine the same thing would happen for pandas if we made this change.

@simonjayhawkins
Copy link
Member

@jorisvandenbossche can you clarify the benefits for the top level functions. I'm able to explain what the user will see differently when we set the module for the objects, but not clear when it comes to the functions.

@simonjayhawkins
Copy link
Member

eg. for a function

>>> pd.isna.__module__
'pandas.core.dtypes.missing'
>>> repr(pd.isna)
'<function isna at 0x7efcf5e3cee0>'
>>> 

after setting __module__

>>> pd.isna.__module__
'pandas'
>>> repr(pd.isna)
'<function isna at 0x7ff349aaa7a0>'
>>> 

@jorisvandenbossche
Copy link
Member Author

That's a good question. Based on what you show above, it might be good to just focus on the classes.

However, it might depend on the console you are using. Or how you are looking at it. It's strange that for isna, the result of repr(..) is different from the actual repr (but so maybe that is specific to IPython):

In [23]: repr(pd.isna)
Out[23]: '<function isna at 0x7f462079be20>'

In [24]: pd.isna
Out[24]: <function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

So here the actual repr this shows it as pandas.core.dtypes.missing.isna.

Now, it's certainly less important for functions (and also less common to look at them that way), but I think there is still some value is hiding the full path in the output above.

@simonjayhawkins
Copy link
Member

Now, it's certainly less important for functions (and also less common to look at them that way), but I think there is still some value is hiding the full path in the output above.

No problem. some functions have been done too.

@rhshadrach
Copy link
Member

It's strange that for isna, the result of repr(..) is different from the actual repr (but so maybe that is specific to IPython):

This is IPython logic.

https://github.com/ipython/ipython/blob/0615526f80691452f2e282c363bce197c0141561/IPython/lib/pretty.py#L806

@simonjayhawkins
Copy link
Member

the discussion above lists the top level objects exposed in the pandas/core/api.py.

In addition, following @rhshadrach comment #60268 (comment), the __module__ attribute for the public(ish) objects that are exposed in /pandas/api/typing/__init__.py should also be included in this task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

No branches or pull requests

4 participants