Skip to content

DEPR: pandas/core #27522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jreback opened this issue Jul 22, 2019 · 40 comments
Open

DEPR: pandas/core #27522

jreback opened this issue Jul 22, 2019 · 40 comments
Labels
Deprecate Functionality to remove in pandas
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Jul 22, 2019

we have a documentation note that pandas.core is private but we need an actual deprecation

see discussion starting here: #27471 (comment)

[Edit by rhshadrach] Current outline (will update as this evolves):

@jreback jreback added the Deprecate Functionality to remove in pandas label Jul 22, 2019
@jreback jreback added this to the 1.0 milestone Jul 22, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 22, 2019

Is there anything outside of the top level pandas namespace that is public? I know we have some usages of pandas.io.json.json_normalize and IIRC something with Categorical but I wonder if we should just move them to pandas and provide a generic warning for anything not coming from pandas to end users

@jbrockmendel
Copy link
Member

I share Will's sentiment, though from looking at the history it seems like there was an intentional move away from that before I got here.

If there is a shuffling of core files, I'd like to separate out Index/Series/DataFrame/internals from lower-level code that doesn't need to be "aware" of these classes

@jreback
Copy link
Contributor Author

jreback commented Jul 22, 2019

nothing should be public except pd namespace
but historically we had a lot (like everything) was

we moved everything a few years ago to create _libs and consolidate under core & io
but i think we need to be explicitly private

moving things around is much much easier after

@WillAyd
Copy link
Member

WillAyd commented Jul 22, 2019

So sounds like we need to:

  1. Audit external usage of anything not in the top level (json_normalize, etc...) and make available in pandas.init.py
  2. Document / throw FutureWarning when accessing other namespaces

Then actually prefix those namespaces with an underscore to privatize?

@huandzh
Copy link

huandzh commented Jul 23, 2019

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

Use case: aliyun/aliyun-odps-python-sdk#99

@simonjayhawkins
Copy link
Member

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

@jreback
Copy link
Contributor Author

jreback commented Jul 23, 2019

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

@simonjayhawkins
Copy link
Member

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

is this an exception (sorry for the pun) to the "nothing should be public except pd namespace" rule?

@jreback
Copy link
Contributor Author

jreback commented Jul 23, 2019

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

is this an exception (sorry for the pun) to the "nothing should be public except pd namespace" rule?

yes

@jreback
Copy link
Contributor Author

jreback commented Jul 23, 2019

pandas.api as well; the other top level packages should be private (though may be tricky / take a while to fully remove)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 23, 2019 via email

@huandzh
Copy link

huandzh commented Jul 24, 2019

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

is this an exception (sorry for the pun) to the "nothing should be public except pd namespace" rule?

yes

Then, OptionError raised by pandas.set_option will be available in pandas.exceptions afterwards, and so do other custom exceptions raised by top-level functions, am i right?

Anyway, it's better to have them in one and fixed place.

@simonjayhawkins
Copy link
Member

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

is this an exception (sorry for the pun) to the "nothing should be public except pd namespace" rule?

yes

Then, OptionError raised by pandas.set_option will be available in pandas.exceptions afterwards, and so do other custom exceptions raised by top-level functions, am i right?

Anyway, it's better to have them in one and fixed place.

it looks like this was added in #19790 and intended to be user facing. so it probably should be in the public api. can you raise an issue for this.

@huandzh
Copy link

huandzh commented Jul 24, 2019

Hi, it will be nice to also expose exceptions raised by top level functions, since downstream may want to handle them in their own code.

this may warrant a separate issue for discussion. @jorisvandenbossche what's you current feeling regarding custom exceptions exposed to users? #25569 (comment)

these are already mostly in pandas.exceptions which are public

is this an exception (sorry for the pun) to the "nothing should be public except pd namespace" rule?

yes

Then, OptionError raised by pandas.set_option will be available in pandas.exceptions afterwards, and so do other custom exceptions raised by top-level functions, am i right?
Anyway, it's better to have them in one and fixed place.

it looks like this was added in #19790 and intended to be user facing. so it probably should be in the public api. can you raise an issue for this.

#27553 submitted

@jorisvandenbossche
Copy link
Member

nothing should be public except pd namespace

I think it has become clear from the discussion, but to put it explicitly: this is not really correct (in the meaning of "only objects and functions in the pd namespace").

All submodules in the pd namespace are also (partly) public (and the objects they contain), except for pandas.core, pandas.util and pandas.compat. For pandas.io and pandas.tseries this is only "partly", as we say in the docs that for those submodules only those objects are public which are listed in the API docs. But so the arrays, errors, plotting and testing modules are fully public.
See the docs on this: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

I think the main confusing situation are those submodules that are only partly public (io and tseries), the objects of them we list in the API docs as public:

  • pandas.io:
    • mentioned in the API reference:
      • pandas.io.json: json_normalize, build_table_schema
      • pandas.io.formats.style: Styler
    • also used in the user guide: pd.io.sql.execute (and pd.io.sql.get_schema always had a somewhat dubious status) and pd.io.date_converters.parse_date_time
  • pandas.tseries:
    • mentioned in the API reference:
      • pandas.tseries.offsets: DateOffset, Tick, and all their subclasses
      • pandas.tseries.frequencies: to_offset
    • also used in the user guide: several objects from pandas.tseries.holiday
    • Additional note: in the user guide we use pd.offsets and not pd.tseries.offsets as in the reference guide, something we should make consistent.

To say that json_normalize is far from the only non-top-level function.

From the explicitly documented as private modules (core, compat, util), I think it is mainly pd.util.testing that actually sees quite some usage in the wild. Not sure if we want to do something about this. We moved some to pandas.testing, but I think people use more than what we exposed there (eg several "create dataframe" functions).

For the custom expections that are not included in pandas.errors, I opened #27656 (there are quite a few).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 30, 2019 via email

@TomAugspurger
Copy link
Contributor

Is deprecating this a priority for 1.0? According to https://dev.pandas.io/docs/development/policies.html, we're allowed to add depredations in x.y releases, so we can do this after 1.0 if no one has time for it.

@jorisvandenbossche
Copy link
Member

It's not fully clear to me what is actually being proposed here (a lot of the above discussion was about other top-level modules that are actually public). Is it about moving pandas.core to eg pandas._core and raise warnings for pandas.core ?

@jreback
Copy link
Contributor Author

jreback commented Nov 13, 2019

It's not fully clear to me what is actually being proposed here (a lot of the above discussion was about other top-level modules that are actually public). Is it about moving pandas.core to eg pandas._core and raise warnings for pandas.core ?

yes that’s what we did quite a while ago for the other codebases

@TomAugspurger
Copy link
Contributor

Just FYI, we do have public things currently exposed from pandas.core, things like Resampler, GroupBy, etc. If we're deprecating all of pandas.core then we'll need to add those elsewhere and update the API references.

@jorisvandenbossche
Copy link
Member

Moving this off the 1.0 milestone (deprecations are not blockers for 1.0 I think, since they can be introduced in any 1.x release)

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, 1.1 Jan 8, 2020
@lithomas1
Copy link
Member

take

@rhshadrach
Copy link
Member

I was thinking this was blocked since Styler is not in pandas.api.typing, but Styler is defined in pandas.io and not pandas.core.

@phofl - My recollection is that you had better ideas on how to implement this than I could conjure. Any interest in tackling this at some point as part of 2.x?

@jorisvandenbossche
Copy link
Member

Some general comments:

  • I think that for anything that is defined in pandas.core (and thus pandas._core in the future), we should set the __module__ of that object to the public namespace where it is exposed, to avoid leaking the "_core" details to users (-> ENH: set __module__ on top-level public objects #55178)
    IMO we should try to do that before the pandas._core rename lands in a release (not necessarily before the PR doing the rename lands, though). But at the same time, setting __module__ could be a somewhat breaking change (I just noticed dask checks that as a fast way to see if an object is a pandas DataFrame), so probably something to do in pandas 3.0
  • Given how widespread the usage of pandas.core imports are (also unnecessary ones, like importing DataFrame from pandas.core.frame), I think we will have to do this very slow.
  • I think we should try to provide a more helpful deprecation message than just a simple "import from pandas.core is deprecated, import from a public module instead". Generally there are two buckets: objects we know have a public import, and actual private objects. For the first category, I think we create some mapping of those, and ensure we point to the correct import in the deprecation message.
    For the second category, instead of essentially saying something like "use from pandas._core if needed at your own risk", we could also encourage users to open an issue to discuss whether that object should actually be made public (related to next bullet point).
  • We might need to evaluate a bunch of APIs about whether those are actually useful for advanced external use cases, and whether we want to expose them somewhere publicly (e.g. we already have some (d)type related helpers in pandas.api.types, but there might be other objects being used in the wild for valid use cases).
    Some quick examples (from a quick search on "pandas.core"), pyspark is using CachedAccessor, PandasObject, dask is using Resampler and Rolling from pandas.core (but those are now exposed in pandas.api.typing) and pandas.core.apply.reconstruct_func, fastparquet is using BaseMaskedDtype and IntegerArray (the latter already is exposed in pandas.arrays), pyarrow is using make_block and BlockManager, statsmodels uses our nanmean (I suppose they can just use numpy nowadays, though), modin is using find_common_type, several classes for inheriting docs (Rolling et al, but also CategoricalAccessor, StringMethods, CombinedDatetimelikeProperties, SparseAccessor), infer_dtype_from_object, reconstruct_func, is_bool_indexer, and several more; pyjanitor is using is_bool_indexer, concat_compat, apply_if_callable, _MergeOperation; geopandas is using CachedAccessor. And this are just some quick examples (and to be clear, most of those projects are using a lot more from pandas.core, but mostly things that have already a public entrypoing like in pandas.api.types).
    It's certainly not the case for all of those examples, but for some of them there might be a good reason that a downstream project is using those APIs. And then the question is: do we want to expose some of those in a public location? (and where?) Or do we just assume that those projects will keep using them from pandas._core (but then we risk many people simply adding try/excepts first importing from pandas._core falling back to importing from pandas.core, and then what did we achieve with this?)

@rhshadrach
Copy link
Member

  • It's certainly not the case for all of those examples, but for some of them there might be a good reason that a downstream project is using those APIs. And then the question is: do we want to expose some of those in a public location? (and where?) Or do we just assume that those projects will keep using them from pandas._core (but then we risk many people simply adding try/excepts first importing from pandas._core falling back to importing from pandas.core, and then what did we achieve with this?)

Since we're doing the deprecation very slowly (over 2+ years maybe?), there will be an extended period of time where you can import from either _core or core. By the time we remove core, I expect packages will only be supporting versions of pandas that have _core. If this is the case, there is no need for try-excepts.

With this, I think it's okay to think of this question as independent of our deprecation of core.

@lithomas1
Copy link
Member

IMO, this probably should require a PDEP, given the scope of this deprecation.

I missed the last dev call, was something like that discussed there?

@rhshadrach
Copy link
Member

rhshadrach commented Dec 18, 2023

IMO, this probably should require a PDEP, given the scope of this deprecation.

pandas.core is documented as being private. We're deprecating the move because users and downstream packages haven't always followed this, but in my opinion, this isn't technically a user-facing change.

@lithomas1
Copy link
Member

Even though this doesn't directly affect users, I think it would be good to provide clarity to downstream as to how they should go about replacing the internals from pandas.core they are using, and the timeline for doing so. Making this a PDEP would provide a lot more visibility than just discussing on the issue here.

There is also a large impact on contributors to the package as I imagine stuff like renaming core to _core will cause a bunch of conflicts, and maybe mess up the git history (not sure if git blame is able to handle this or not).

@rhshadrach
Copy link
Member

@pandas-dev/pandas-core - who thinks this does or doesn't need a PDEP.

👍 Needs a PDEP
👎 Doesn't need a PDEP.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 19, 2023

IMO, this probably should require a PDEP, given the scope of this deprecation.

pandas.core is documented as being private. We're deprecating the move because users and downstream packages haven't always followed this, but in my opinion, this isn't technically a user-facing change.

Yes, but we do have reference pages that list methods as part of core. If you go to https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html and then click on any of the DataFrameGroupBy methods, they are showing as being part of core .

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 19, 2023

Even though this doesn't directly affect users, I think it would be good to provide clarity to downstream as to how they should go about replacing the internals from pandas.core they are using, and the timeline for doing so. Making this a PDEP would provide a lot more visibility than just discussing on the issue here.

To me, this is a compelling reason to do so. Then we have an agreed upon path for the deprecation. I don't think we need a PDEP that is about renaming. We need a PDEP that explains the deprecation process.

@phofl
Copy link
Member

phofl commented Dec 20, 2023

We need a PDEP that explains the deprecation process.

That's what docs are there for

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 20, 2023

We need a PDEP that explains the deprecation process.

That's what docs are there for

Yes, but I think we have to agree on that deprecation process before it gets documented (effectively making the core of the PDEP the future documentation), as opposed to doing a PR that has a description of the deprecation process in docs, which we then have a debate over.

@mroeschke
Copy link
Member

I don't think this needs a PDEP (yet) because:

  1. There already seems to be general consensus from the core team to deprecate pandas.core
  2. There hasn't been objection to the plan proposed in DEPR: pandas/core #27522 (comment) (due to lack of discussion it appears)

From the comments so far it appears the deprecation plan needs more clarity/details(?). If the plan gets to a state where there's a gridlock in consensus then I think a PDEP vote would be warranted. Also IMO the best way to get feedback on this procedure from users would be to issue to deprecation as opposed to writing a PDEP

@rhshadrach
Copy link
Member

rhshadrach commented Dec 20, 2023

@Dr-Irv

but I think we have to agree on that deprecation process before it gets documented

That's what this issue is for. I've edited the OP here with the process as I currently see it, but will be updating based on feedback.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 20, 2023

That's what this issue is for. I've edited the OP here with the process as I currently see it, but will be updating based on feedback.

Fair enough. Your edits cleared things up. I think the section at the end entitled "Anytime after" could use more precision on the timing. Would those changes in that section be made in a major or minor release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging a pull request may close this issue.