Skip to content

API: read_csv, to_csv line_terminator keyword inconsistency #35399

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 29 commits into from

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Jul 24, 2020

I followed discussion in #9568. The aim of this PR is to allow both line_terminator and lineterminator keyword args to read_csv and to_csv but only document line_terminator. As per @jorisvandenbossche's suggestion we preserve lineterminator to stay compatible with csv dialects.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arw2019 for the PR. some reservations regarding the api.

@@ -3004,6 +3004,7 @@ def to_csv(
quoting: Optional[int] = None,
quotechar: str = '"',
line_terminator: Optional[str] = None,
lineterminator: Optional[str] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this signature is displayed in the documentation so doing it like this could be confusing. (also this function is not kwargs only, so although bonkers, passing arguments as positional arguments would break)

maybe add a **kwargs at the end and document as 'for compatbility with csv module'

then in code something like kwargs.setdefault('lineterminator', line_terminator) and pass on kwargs instead of line_terminator to CSVFormatter which just passes them onto csvlib.writer

Copy link
Member Author

@arw2019 arw2019 Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I implemented this idea and added optional kwargs to the docstrings.

In to_csv it's maybe a little awkward

kwargs.setdefault("lineterminator", line_terminator)
line_terminator = line_terminator or kwargs["lineterminator"]

because we have to feed line_terminator to CSVFormatter but also we want to keep line_terminator explicit for the docs. It works but I'm happy to write this another way if these lines look odd

@@ -579,6 +579,7 @@ def read_csv(
compression="infer",
thousands=None,
decimal: str = ".",
line_terminator=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, in the signature, I think we should only have the one parameter and the compatibility keyword accepted though **kwargs

@simonjayhawkins simonjayhawkins added API - Consistency Internal Consistency of API/Behavior IO CSV read_csv, to_csv labels Jul 24, 2020
@simonjayhawkins simonjayhawkins added this to the 1.2 milestone Jul 24, 2020
@WillAyd
Copy link
Member

WillAyd commented Jul 27, 2020

I am -1 on this change as I don't think adding kwargs to these functions is a good idea; we struggle enough already with how large these signatures are

I don't know if any change here is really worth the churn

@arw2019
Copy link
Member Author

arw2019 commented Jul 27, 2020

I am -1 on this change as I don't think adding kwargs to these functions is a good idea; we struggle enough already with how large these signatures are

I don't know if any change here is really worth the churn

@WillAyd @simonjayhawkins #27544 might be relevant

@WillAyd
Copy link
Member

WillAyd commented Jul 27, 2020

That change is about keyword only calls, i.e. you can't call items positionally. That is separate from allowing kwargs, which I think we generally have moved away from because it actually can cause more confusion than good for end users

@arw2019
Copy link
Member Author

arw2019 commented Jul 27, 2020

That change is about keyword only calls, i.e. you can't call items positionally. That is separate from allowing kwargs, which I think we generally have moved away from because it actually can cause more confusion than good for end users

Ok, right.

xref #34976 added tests to check for incorrect keyword arguments to read_csv - this PR breaks those

The options seem to be:

  1. live with things as they are - close this and close API: read_csv,from_csv/to_csv keyword consistency #9568
  2. (maybe) find a better way to have both line_terminator and lineterminator. I'm not sure there's a way that doesn't cause at least as much confusion as the current solution.
  3. replace lineterminator with line_terminator in read_csv
  4. replace line_terminator with lineterminator in to_csv

I'm agnostic on which way to go - but happy to implement the changes if we're doing 2/3/4.

@simonjayhawkins
Copy link
Member

I think we generally have moved away from because it actually can cause more confusion than good for end users

agree with this point in general, however in this case we want to avoid the inconsistency AND allow both 'line_terminator' and 'lineterminator', see #9568 (comment) and #9568 (comment)

I'm not keen on having both parameters in the signature, so I feel the **kwargs is the lesser of two evils.

@WillAyd
Copy link
Member

WillAyd commented Jul 27, 2020

If we change something I would agree with the comment to just add both; it would be less code than kwargs and doesn't have any side effects. We can just document it as alias for XXX which I think we do for other things

@simonjayhawkins simonjayhawkins removed this from the 1.2 milestone Jul 29, 2020
@arw2019
Copy link
Member Author

arw2019 commented Aug 7, 2020

Checking in on this.

It seems like people do agree on having both line_terminator and lineterminator. The options are:

  1. have both arguments in the signature, and document lineterminator as alias to line_terminator (+1 @WillAyd)
  2. use line_terminator in signature and allow line_terminator as a kwarg. This would mean adding kwargs to both read_csv and to_csv - currently neither method has those (+1 @simonjayhawkins)

Looping in @jreback @jorisvandenbossche since you guys discussed this back in #9568

@@ -3141,6 +3146,9 @@ def to_csv(

from pandas.io.formats.csvs import CSVFormatter

kwargs.setdefault("lineterminator", line_terminator)
line_terminator = line_terminator or kwargs["lineterminator"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to watch out for corner case where user passes both

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Do we throw an error or would we use one (maybe line_terminator) and a throw warning?

@jbrockmendel
Copy link
Member

What about something like

def mywrapper(func):
    @wraps(func)
    def new_func(*args, **kwargs):
        if "lineterminator" in kwargs:
            kwargs["line_terminator"] = kwargs.pop("lineterminator")
        return func(*args, **kwargs)
    return new_func

@mywrapper
def read_csv(...)

inspect.signature shows the original signature unchanged. Will that make it to the docs correctly?

@arw2019
Copy link
Member Author

arw2019 commented Aug 18, 2020

What about something like

def mywrapper(func):
    @wraps(func)
    def new_func(*args, **kwargs):
        if "lineterminator" in kwargs:
            kwargs["line_terminator"] = kwargs.pop("lineterminator")
        return func(*args, **kwargs)
    return new_func

@mywrapper
def read_csv(...)

inspect.signature shows the original signature unchanged. Will that make it to the docs correctly?

I'll compile the docs and check!

@arw2019
Copy link
Member Author

arw2019 commented Aug 21, 2020

What about something like

def mywrapper(func):
    @wraps(func)
    def new_func(*args, **kwargs):
        if "lineterminator" in kwargs:
            kwargs["line_terminator"] = kwargs.pop("lineterminator")
        return func(*args, **kwargs)
    return new_func

@mywrapper
def read_csv(...)

inspect.signature shows the original signature unchanged. Will that make it to the docs correctly?

@jbrockmendel this renders fine in the docs. To move forward, should I add this decorator to pandas.util._decorators? I'd like to be systematic (we'd be using this at least twice, in read_csv and to_csv).

This is very cool - thanks for suggesting it!!

@@ -3065,6 +3065,7 @@ def to_csv(
decimal: Optional[str] = ".",
errors: str = "strict",
storage_options: StorageOptions = None,
**kwargs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that @simonjayhawkins and I have a different point of view here but I am strongly against adding kwargs to read_csv. The signature is already massive and adding this only makes things worse

I think we either stick to adding just one keyword arg or just leave this for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other reason i'm not keen on adding the alias to the signature is that we also have doublequote, escapechar, quotechar and skipinitialspace as parameters to read_csv. So if we have both line_terminator and lineterminator, some bright spark will want snake case equivalents of the others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you guys be happy with @jbrockmendel solution? I think it doesn't add kwargs and lineterminator won't appear in the signature

def mywrapper(func):
    @wraps(func)
    def new_func(*args, **kwargs):
        if "lineterminator" in kwargs:
            kwargs["line_terminator"] = kwargs.pop("lineterminator")
        return func(*args, **kwargs)
    return new_func

@mywrapper
def read_csv(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A decorator would be a good solution, however some of our decorators 'lose' the signature in the docs e.g. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_stata.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to keep both ad infinitum, it may be possible to mostly use deprecate_kwarg and just eat the warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins The "solution" to that issue is probably for the wrapper to rewrite the docstring with an explicit signature that can be created using Signature within the docstring. For example, after post processing the docstring for

def foo(x, y):
    """
    Does foo

    Parameters
    ---------
    x : DataFrame
    y : DataFrame
"""

would become

    """
    foo(x, y)

    Does foo

    Parameters
    ---------
    x : DataFrame
    y : DataFrame
"""

Of course, this would be for a different PR. The string variant is the method used to document the signature of Cython functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A decorator would be a good solution, however some of our decorators 'lose' the signature in the docs e.g. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_stata.html

@simonjayhawkins I tried adding this one to read_csvand compiling. Seems ok but I can't say how robust that is

@bashtage
Copy link
Contributor

bashtage commented Aug 26, 2020

Is this just a case for deprecate_kwarg or am I missing something? Is there a desire to keep both alive for some reason?

That only line_terminator is being documented is a strong suggestion that lineterminator should not be used (and eventually usable).

@simonjayhawkins
Copy link
Member

Is this just a case for deprecate_kwarg or am I missing something? Is there a desire to keep both alive for some reason?

see #9568 (comment) and #9568 (comment)

so from the issue, the current 'requirement' is to accept both. deprecate_kwarg is certainly the easier option.

@bashtage
Copy link
Contributor

so from the issue, the current 'requirement' is to accept both. deprecate_kwarg is certainly the easier option.

Deprecating doesn't mean it has to be removed -- it just tells users not to use it. Technically I think deprecate_kwarg satisfies the requirements of the old thread since both are accepted and only the preferred is retained.

@arw2019
Copy link
Member Author

arw2019 commented Aug 26, 2020

@bashtage @simonjayhawkins I'm down to do either (implement the decorator in a separate PR then come back here or deprecate)

@WillAyd
Copy link
Member

WillAyd commented Aug 26, 2020

To be honest I don't think this is that big of an issue that we need to add another layer of wrapping to the csv function, so slight -1 on that design

@arw2019
Copy link
Member Author

arw2019 commented Aug 26, 2020

How about in read_csv deprecating lineterminator in favour of line_terminator and leaving to_csv as is? @bashtage @simonjayhawkins @WillAyd

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2020

Closing this PR as I think still needs some discussion. If a point of interest please comment back on the original issue to get alignment on design approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: read_csv,from_csv/to_csv keyword consistency
5 participants