Skip to content

STY/TST: what standard to follow to import top-level objects in tests #33203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Apr 1, 2020 · 14 comments
Labels
Code Style Code style, linting, code_checks Needs Discussion Requires discussion from core team before further action Testing pandas testing functions or related to the test suite

Comments

@jorisvandenbossche
Copy link
Member

So this discussion regularly comes up (latest one being #33184), so let's try to decide which one of the two we prefer.

In test code, do we do:

from pandas import Series
...
Series(...)

or

import pandas as pd
...
pd.Series(..)

Personally, I prefer the second, and from recent discussions I have the impression @WillAyd and @datapythonista do as well (is that correct?)

To be clear: consistency within a single file is most important (so not mixing Series and pd.Series in a single file).
But when someone does a PR to fix such inconsistencies in a file, it would be good if there is agreement on whether such a PR should replace pd.Series with Series or rather replace Series with pd.Series.

@simonjayhawkins simonjayhawkins added Needs Discussion Requires discussion from core team before further action Code Style Code style, linting, code_checks Testing pandas testing functions or related to the test suite labels Apr 1, 2020
@TomAugspurger
Copy link
Contributor

I prefer the second.

@jreback
Copy link
Contributor

jreback commented Apr 1, 2020

1st is much better for testing & validation.

@jbrockmendel
Copy link
Member

I'm going to refer to the two options as the specific approach and the concise approach, respectively.

I've come to prefer the more-specific-imports approach.

The test suite is big, and it can be difficult to know "what is tested where?". With specific imports, I can look at the top of a file and rule out a lot of possibilities. This is especially helpful for e.g. tests.tslibs or tests.scalar where we have a carefully-constructed dependency structure and would like the tests to reflect that.

Similarly if Series is imported and DataFrame is not, that is useful information.

If pd is in the namespace, all bets are off.

@jorisvandenbossche
Copy link
Member Author

1st is much better for testing & validation.

@jreback Can you explain why? Testing works perfectly fine with both approaches I think.

If pd is in the namespace, all bets are off.

You can still search for pd. in a file / set of files? That will of course give more (duplicated) results, so certainly not as easy to process as the imports, but with a decent search UI it might still be OK?

I fully agree that a good and logical organization of the test suite is important, and you are doing a lot of work to clean this up. But personally, I think adding file docstrings explaining the purpose of a file + documentation explaining the general layout of how the tests are organized would be more useful / accessible for other contributors to understand this rather than expecting them to look at imports.

@WillAyd reasoning from #29275 (with which I agree):

This is getting into personal development preferences but as a counter argument I sometimes find these imports very annoying. When debugging the pd namespace is almost always available, yet when copy pasting blocks of code that use this import machinery you then have to go up and see what is imported from top of the module.

@jbrockmendel
Copy link
Member

yet when copy pasting blocks of code that use this import machinery you then have to go up and see what is imported from top of the module.

I'm sympathetic to this argument, but if we're going to make copy/paste-ability a criterion, fixtures are a bigger hurdle than imports.

@jreback
Copy link
Contributor

jreback commented Apr 1, 2020

being more explicit is way better and provides stronger consistency. allowing one to do pd.* is fine for user code but we want as strict a check in tests as possible.

I don't see any reason to use pd.* in tests unless of course we are actually tesing the namespace or functions which only exist there.

@jorisvandenbossche
Copy link
Member Author

I'm sympathetic to this argument, but if we're going to make copy/paste-ability a criterion, fixtures are a bigger hurdle than imports.

I fully agree on that ;) And we both argumented for that elsewhere. Now, for this specific issue, it's indeed an argument to say that the import is not the most important copy paste issue, but there are still a lot of tests that are not using fixtures (or not for the data being created).

being more explicit is way better and provides stronger consistency.

I don't think that any of both approaches is "more explicit" than the other.
(in fact, I would say that rather Series(..) is less explicit, as you don't see from which module it comes. Everybody contributing to pandas probably knows where it is coming from, though ;))

@datapythonista
Copy link
Member

My preference is just importing the module, mainly to make the code explicit. With a silly example, if there is a line add(2, 2), I need to check the rest of the file to know whether there is a def add(...), or a from mylib import add (and what "mylib" is).

That being said, I think in pandas code having Series and DataFrame shouldn't be a problem. And if I understand correctly, @jbrockmendel and @jreback find useful seeing a from pandas import Series at the top of the module, since they know that only Series is being tested. Seems like a fair point. I don't find it useful for myself at this point, but if it makes things clearer for some people, I'm ok adding a check to avoid using the pattern import pandas in the tests directory, and getting the cases fixed, if that's agreed.

@simonjayhawkins
Copy link
Member

Consistency in a single module may be more important than following a "foolish" consistency across the whole codebase.

However, recently in #32459 there was a similar discussion for imports in cython modules. The outcome seemed to be that the 2nd pattern above was preferred in that context.

for me, if there is no clear advantage of one method over the other for testing, I would prefer to adopt the 2nd pattern for consistency.

@TomAugspurger
Copy link
Contributor

So is the agreement to

  • Import the objects (Series, DataFrame, etc.) in object-specific tests like tests/series/test*.py
  • import pandas as pd elsewhere?

@jbrockmendel
Copy link
Member

jbrockmendel commented Apr 6, 2020

Import the objects (Series, DataFrame, etc.) in object-specific tests like tests/series/test*.py

The fewer things are tested in a module, the more I care about specific imports (recall above Im calling the two options "specific"/"concise"). So in tests.tslibs or tests.scalar I care a lot, in tests.indexes i care medium, and in tests.frame and in tests.io i dont care at all.

@TomAugspurger is this anything like what you meant with your last comment?

@TomAugspurger
Copy link
Contributor

Something like that, though your preferences may be a bit stronger. So the wording may be "prefer specific imports, unless you're testing general things"?

@jbrockmendel
Copy link
Member

though your preferences may be a bit stronger

Yah, strength of preferences in my previous comment should all be normalized by "eh, not that big a deal".

So the wording may be "prefer specific imports, unless you're testing general things"?

Which form of imports to use depends on what classes/functions X are being tested
in a file and where those lie in the dependency structure.

If X is something like `pd.concat` or `pd.read_csv` where the implementation
depends on essentially all of `pandas.core`,  then use `import pandas as pd`.

If X is something like `pd.NaT` or `tslibs.ccalendar` that have very few intra-pandas
dependencies, use specific imports to make it clear to a reader that higher-level 
objects are not tested in this file.

In between is something like TimedeltaIndex, where some test files require
higher-level objects (Series, DataFrame) and others rely on only TimedeltaIndex.
For those, authors will have to use their best judgement.

@mroeschke
Copy link
Member

xref #23018 as discussion points made on the same topic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Style Code style, linting, code_checks Needs Discussion Requires discussion from core team before further action Testing pandas testing functions or related to the test suite
Projects
None yet
Development

No branches or pull requests

7 participants