Skip to content

Index constructor doesn't validate kwargs #26974

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Jun 20, 2019 · 21 comments · Fixed by #38597
Closed

Index constructor doesn't validate kwargs #26974

TomAugspurger opened this issue Jun 20, 2019 · 21 comments · Fixed by #38597
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Error Reporting Incorrect or improved errors from pandas Index Related to the Index class or subclasses
Milestone

Comments

@TomAugspurger
Copy link
Contributor

In [4]: pd.Index([1, 2], abc=1)                                                                                                                                                                                                                                                                                                                                           
Out[4]: Int64Index([1, 2], dtype='int64')

This should raise, since abc isn't a valid keyword.

@jbrockmendel jbrockmendel added Index Related to the Index class or subclasses Constructors Series/DataFrame/Index/pd.array Constructors labels Jul 21, 2019
@adriesse
Copy link

adriesse commented Aug 6, 2019

Is the goal in pandas, to raise an exception whenever a non-existing kwarg is used?

@TomAugspurger
Copy link
Contributor Author

Yes.

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 9, 2019
@bartkim0426
Copy link

Hi, is there someone working for it? If not, can I make PR for this?

@TomAugspurger
Copy link
Contributor Author

Sure. Here's how I would do it.

Start in Index.__new__ at

def __new__(
and look at all the places we pass through **kwargs. This will be index subclasses like Categorical, IntervalIndex, PeriodIndex, etc. Find the set of arguments that those accept. If any of them accept **kwargs, then we should start by narrowing down the valid set of kwargs first (in its own PR, one per subclass).

@jreback @jorisvandenbossche two API questions:

  1. What do we do with subclass-specific kwargs like closed for IntervalIndex? Do we add them to the Index class? Do we have a dedicated subclass_kwargs argument that's a dict of kwargs for the subclass (that's not a good name. Maybe not a good idea). Or do we keep kwargs, and just raise when an invalid one is passed?
  2. What do we do now with invalid kwargs? Raise, or warn that they'll raise in the future?

@jorisvandenbossche
Copy link
Member

What do we do with subclass-specific kwargs like closed for IntervalIndex?

They are currently passed through?

@TomAugspurger
Copy link
Contributor Author

Yes. The method depends a bit on the subclass. I glanced briefly, but CategoricalIndex gets **kwargs. IntervalIndex gets kwargs.get("closed", None). So they're either all passed through, or they get specific ones.

@jorisvandenbossche
Copy link
Member

We could also deprecate that? if you want the specific keywords, you can use the specific class constructor?

What do we do now with invalid kwargs? Raise, or warn that they'll raise in the future?

I would lean towards first a warning. Users could have unintentionally done this (eg typo).

@bartkim0426
Copy link

Thanks for detail explanation. I'll make a work for this - raise a error or just add warning.

@bartkim0426
Copy link

@TomAugspurger I try to modify code - make a warning for invalid kwargs.
Then we have to specify a list for correct kwargs - and should check every kwargs that they're in the correct list. Is this what you're going to change? I can't find python code that warns with invalid kwargs. Please give detail thinking about it. Thanks!

@TomAugspurger
Copy link
Contributor Author

Here's how I might do this:

  1. Ensure that every Index subclass does not take **kwargs.
  2. Make the named arguments in Index.__new__ just the arguments that are accepted by all subclasses (maybe treat tupleize_cols specially, since it's not meant to be passed through)
  3. Always pass **kwargs through to the subclasses constructor. This validates **kwargs for us when we're returning a subclass
  4. When returning an object-dtype Index (the last case in the if / elif chain), validate that **kwargs is just the expected arguments. I think just
expected_kwargs = {'data', 'dtype', 'copy', 'name', 'tuplize_cols'}
extra = set(kwargs) - expected_kwargs
if set(kwargs) - expected_kwargs:
    warnings.warn(...)
    return Index(subarr, dtype=dtype, copy=copy, name=name)

Does that make sense?

@bartkim0426
Copy link

@TomAugspurger
Thanks for the explanation and sorry for late response.

First, I checked that every subclass doesn't take **kwargs.
(NumericIndex, MultiIndex, CategoricalIndex, DatetimeIndex, TimedeltaIndex, PeriodIndex, RangeIndex, Int64Index, NumericIndex)

I was confused about always pass **kwargs through to the subclasses constructor.
For subclasses - if there are invalid args, TypeError occurred so I think we don't have to validate the subclasses.

Then is it enough to add validate code below __new__ of Index? I think

def Index(...)
    ...
    def __new__(
        cls,
        data=None,
        dtype=None,
        copy=False,
        name=None,
        fastpath=None,
        tupleize_cols=True,
        **kwargs
    ):
        expected_kwargs = {'data', 'dtype', 'copy', 'name', 'tuplize_cols'}
        if set(kwargs) - expected_kwargs:
            warnings.warn(...)
        ... 

Then users can see warnings while calling Index with invalid **kwargs, and faces error while calling index's subclasses. Please tell me if I misunderstand something. Thanks again!

@TomAugspurger
Copy link
Contributor Author

I don't think the maintainers have agreed on a desired behavior yet, sorry.

In another issue (can't find it now) @WillAyd expressed displeasure with accepting both name and names (for MultiIndex). So perhaps he agrees with @jorisvandenbossche's suggestion to deprecate passing through subclass-specific kwargs.

cc @jreback if you have thoughts on not passing through subclass-specific kwargs.

@WillAyd
Copy link
Member

WillAyd commented Oct 14, 2019

I'm not super familiar with all of the classes at play but do we even really need kwargs? Would rather align signatures where possible and handle signature exceptions as needed (is name(s) the only one?)

@jreback
Copy link
Contributor

jreback commented Oct 14, 2019

we don’t really need kwargs but is simpler

for example IntervalIndex has closed kwarg this needs to be passed when u construct from Index (and not II directly)

@jorisvandenbossche
Copy link
Member

for example IntervalIndex has closed kwarg this needs to be passed when u construct from Index (and not II directly)

But then if a user wants to use this keyword, then can use the IntervalIndex constructor instead of the basic Index constructor?

@jreback
Copy link
Contributor

jreback commented Oct 14, 2019

for example IntervalIndex has closed kwarg this needs to be passed when u construct from Index (and not II directly)

But then if a user wants to use this keyword, then can use the IntervalIndex constructor instead of the basic Index constructor?

sure though we currently allow one way to construct indexes via Index
so this would be a break

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 14, 2019

so this would be a break

Yes, sure, so we would need to do this with a deprecation, if we want to do this.
So the question is: de we want to support this?

In practice, to fix the original issue in this PR (validating kwargs), we don't need to disallow this, as those kwargs could be validated on the level of specific index constructors. This is basically what Tom proposed above at #26974 (comment)
So maybe we should move the discussion about passing through keyword args to a separate issue.
I think what Tom described above is a good way to solve the actual issue here.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2019

i agree with this proposed soon: #26974 (comment)

@TomAugspurger
Copy link
Contributor Author

Alrighty, sorry for bringing that other discussion in here.

For now @bartkim0426 can you implement #26974 (comment)?

I was confused about always pass **kwargs through to the subclasses constructor.
For subclasses - if there are invalid args, TypeError occurred so I think we don't have to validate the subclasses.

The idea would be for pd.Index([], dtype='category', not_a_kwargs="foo") to raise. I think you don't pass **kwargs through to all the index subclasses, there might be a case where we don't raise on that. Make sense?

@bartkim0426
Copy link

@TomAugspurger
I understand. Thank you for the detail explanation. I'll make a PR for this.

I still confused is it the right way to validate all return clause in __new__ because there're over 20 clauses.

@TomAugspurger
Copy link
Contributor Author

Ideally you only need to worry about the last one. Since all the subclass returns will error on there own if an invalid kwarg is passed.

@mroeschke mroeschke added Bug Error Reporting Incorrect or improved errors from pandas labels May 5, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Dec 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Error Reporting Incorrect or improved errors from pandas Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants