
API: read_csv, to_csv line_terminator keyword inconsistency #35399

Closed
wants to merge 29 commits

29 commits
30c9b83
add values.dtype.kind==f branch to array_with_unit_datetime
arw2019 Jun 27, 2020
2f25460
merge with master
arw2019 Jun 29, 2020
572363a
revert pandas/_libs/tslib.pyx
arw2019 Jun 29, 2020
b891030
merge with master
arw2019 Jun 30, 2020
ecd8ce3
merge with master
arw2019 Jun 30, 2020
ee55191
merge with master
arw2019 Jul 2, 2020
292fcdc
merge with master
arw2019 Jul 7, 2020
9e4ac71
Merge remote-tracking branch 'upstream/master'
arw2019 Jul 8, 2020
1d0ba61
merge with master
arw2019 Jul 8, 2020
b59831e
Merge branch 'master' of https://github.com/arw2019/pandas
arw2019 Jul 16, 2020
b954874
Merge remote-tracking branch 'upstream/master'
arw2019 Jul 16, 2020
ac0a7f1
merge with master
arw2019 Jul 16, 2020
bc55716
added line_terminator arg to read_csv
arw2019 Jul 24, 2020
ee69a76
added line_terminator, lineterminator args + tests
arw2019 Jul 24, 2020
4d00fea
merge with master
arw2019 Jul 24, 2020
c015da5
Merge remote-tracking branch 'upstream/master'
arw2019 Jul 24, 2020
73d6d11
fix csv api using kwargs
arw2019 Jul 24, 2020
1a6497f
TST: remove failing test - read_csv takes kwargs now
arw2019 Jul 25, 2020
3a88ef0
add space between kwargs and colon in docstring
arw2019 Jul 25, 2020
7fe8274
DOC: remove the semicolon after kwargs
arw2019 Jul 25, 2020
1c27b2c
added line_terminator arg to read_csv
arw2019 Jul 24, 2020
1912aa2
added line_terminator, lineterminator args + tests
arw2019 Jul 24, 2020
f54df81
fix csv api using kwargs
arw2019 Jul 24, 2020
cea28d8
TST: remove failing test - read_csv takes kwargs now
arw2019 Jul 25, 2020
85ddf44
add space between kwargs and colon in docstring
arw2019 Jul 25, 2020
a28657c
DOC: remove the semicolon after kwargs
arw2019 Jul 25, 2020
2b1333f
Merge branch 'csv-api' of https://github.com/arw2019/pandas into csv-api
arw2019 Jul 27, 2020
0786617
merge with master
arw2019 Aug 21, 2020
5e87bbc
small changes to docstrings
arw2019 Aug 21, 2020
9 changes: 6 additions & 3 deletions pandas/core/generic.py
@@ -3065,6 +3065,7 @@ def to_csv(
decimal: Optional[str] = ".",
errors: str = "strict",
storage_options: StorageOptions = None,
**kwargs,
Member

I know that @simonjayhawkins and I have a different point of view here, but I am strongly against adding kwargs to read_csv. The signature is already massive, and adding this only makes things worse.

I think we either stick to adding just one keyword arg or just leave this for now.

Member

The other reason I'm not keen on adding the alias to the signature is that we also have doublequote, escapechar, quotechar and skipinitialspace as parameters to read_csv. So if we have both line_terminator and lineterminator, some bright spark will want snake case equivalents of the others.

Member Author

Would you guys be happy with @jbrockmendel's solution? I think it doesn't add kwargs, and lineterminator won't appear in the signature:

from functools import wraps

def mywrapper(func):
    @wraps(func)
    def new_func(*args, **kwargs):
        if "lineterminator" in kwargs:
            kwargs["line_terminator"] = kwargs.pop("lineterminator")
        return func(*args, **kwargs)
    return new_func

@mywrapper
def read_csv(...):  # pseudocode: the existing read_csv signature stays unchanged

Member

A decorator would be a good solution; however, some of our decorators 'lose' the signature in the docs, e.g. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_stata.html

Contributor

If the goal is to keep both ad infinitum, it may be possible to mostly use deprecate_kwarg and just eat the warning.
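
A minimal sketch of that route, assuming pandas.util._decorators.deprecate_kwarg keeps its current (old_arg_name, new_arg_name) signature and emits a FutureWarning for the old spelling; the toy function below only stands in for read_csv:

import warnings

from pandas.util._decorators import deprecate_kwarg

# Toy stand-in for read_csv: map the csv-module spelling onto the pandas one.
@deprecate_kwarg(old_arg_name="lineterminator", new_arg_name="line_terminator")
def toy_read_csv(path, line_terminator=None):
    return line_terminator

# "Eat" the FutureWarning emitted for the old spelling.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    assert toy_read_csv("foo.csv", lineterminator="\r\n") == "\r\n"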

Contributor

@simonjayhawkins The "solution" to that issue is probably for the wrapper to rewrite the docstring with an explicit signature that can be created using Signature within the docstring. For example, after post-processing, the docstring for

def foo(x, y):
    """
    Does foo

    Parameters
    ----------
    x : DataFrame
    y : DataFrame
    """

would become

    """
    foo(x, y)

    Does foo

    Parameters
    ---------
    x : DataFrame
    y : DataFrame
"""

Of course, this would be for a different PR. The string variant is the method used to document the signature of Cython functions.
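
As a rough illustration of that post-processing step (the decorator name here is hypothetical, not something this PR adds), the wrapper could build the signature line with inspect.signature and prepend it to the wrapped function's docstring:

import inspect
from functools import wraps

def with_signature_in_docstring(func):
    # Hypothetical helper: prepend an explicit "foo(x, y)" line to the
    # docstring so that wrapped functions keep a visible signature in the docs.
    @wraps(func)
    def new_func(*args, **kwargs):
        return func(*args, **kwargs)

    signature_line = f"{func.__name__}{inspect.signature(func)}"
    new_func.__doc__ = f"{signature_line}\n\n{inspect.getdoc(func) or ''}"
    return new_func

@with_signature_in_docstring
def foo(x, y):
    """
    Does foo
    """

print(foo.__doc__.splitlines()[0])  # foo(x, y)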

Member Author

> A decorator would be a good solution; however, some of our decorators 'lose' the signature in the docs, e.g. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_stata.html

@simonjayhawkins I tried adding this one to read_csv and compiling. Seems OK, but I can't say how robust that is.

) -> Optional[str]:
r"""
Write object to a comma-separated values (csv) file.
@@ -3179,9 +3180,6 @@ def to_csv(
Specifies how encoding and decoding errors are to be handled.
See the errors argument for :func:`open` for a full list
of options.

.. versionadded:: 1.1.0

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
@@ -3190,6 +3188,11 @@
a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values

.. versionadded:: 1.2.0
kwargs
Additional keyword arguments passed to ``pd.to_csv`` for compatibility
with the `csv` module. Includes `lineterminator` (an alias of `line_terminator`).

.. versionadded:: 1.2.0

Returns
16 changes: 12 additions & 4 deletions pandas/io/parsers.py
@@ -27,7 +27,7 @@
ParserError,
ParserWarning,
)
from pandas.util._decorators import Appender
from pandas.util._decorators import Appender, _get_alias_from_kwargs

from pandas.core.dtypes.cast import astype_nansafe
from pandas.core.dtypes.common import (
@@ -285,7 +285,7 @@
Thousands separator.
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
lineterminator : str (length 1), optional
line_terminator : str (length 1), optional
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
@@ -346,6 +346,11 @@
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
kwargs
Additional keyword arguments passed to ``pd.read_csv`` for compatibility
with the `csv` module. Includes `lineterminator` (an alias of `line_terminator`).

.. versionadded:: 1.2.0

Returns
-------
@@ -580,7 +585,7 @@ def read_csv(
compression="infer",
thousands=None,
decimal: str = ".",
lineterminator=None,
line_terminator=None,
Member

Again, in the signature I think we should only have the one parameter, with the compatibility keyword accepted through **kwargs.
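
A minimal sketch of that shape (the helper name and error messages are illustrative, not what this PR does): only line_terminator appears in the signature, and the csv-module spelling is picked out of **kwargs:

def read_csv_sketch(filepath_or_buffer, line_terminator=None, **kwargs):
    # Accept the csv-module spelling only through **kwargs, keeping the
    # visible signature to the single pandas-style keyword.
    alias = kwargs.pop("lineterminator", None)
    if alias is not None:
        if line_terminator is not None:
            raise ValueError("Specify only one of line_terminator and lineterminator")
        line_terminator = alias
    if kwargs:
        # Reject anything else so typos still raise, mirroring the pre-PR behaviour.
        bad = next(iter(kwargs))
        raise TypeError(f"read_csv() got an unexpected keyword argument {bad!r}")
    return line_terminator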

quotechar='"',
quoting=csv.QUOTE_MINIMAL,
doublequote=True,
@@ -597,6 +602,7 @@ def read_csv(
memory_map=False,
float_precision=None,
storage_options=None,
**kwargs,
):
# gh-23761
#
@@ -634,6 +640,8 @@
engine = "c"
engine_specified = False

kwargs.setdefault("lineterminator", line_terminator)

kwds.update(
delimiter=delimiter,
engine=engine,
@@ -645,7 +653,6 @@ def read_csv(
quotechar=quotechar,
quoting=quoting,
skipinitialspace=skipinitialspace,
lineterminator=lineterminator,
header=header,
index_col=index_col,
names=names,
@@ -684,6 +691,7 @@ def read_csv(
infer_datetime_format=infer_datetime_format,
skip_blank_lines=skip_blank_lines,
storage_options=storage_options,
**kwargs,
)

return _read(filepath_or_buffer, kwds)
40 changes: 40 additions & 0 deletions pandas/tests/frame/test_to_csv.py
@@ -1,6 +1,7 @@
import csv
from io import StringIO
import os
import re

import numpy as np
import pytest
@@ -998,6 +999,45 @@ def test_to_csv_line_terminators(self):
with open(path, mode="rb") as f:
assert f.read() == expected

def test_to_csv_lineterminator_alternative_args(self):
# GH 9568
# examples from test_to_csv_line_terminators
# test equivalence of line_terminator vs. lineterminator keyword args

df = DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["one", "two", "three"])

# case 1: CRLF as line terminator

with tm.ensure_clean() as path:
df.to_csv(path, line_terminator="\r\n")

with open(path, mode="rb") as f:
res_line_terminator = f.read()

with tm.ensure_clean() as path:
df.to_csv(path, lineterminator="\r\n")

with open(path, mode="rb") as f:
res_lineterminator = f.read()

assert re.match(res_line_terminator, res_lineterminator)

# case 2: LF as line terminator

with tm.ensure_clean() as path:
df.to_csv(path, line_terminator="\n")

with open(path, mode="rb") as f:
res_line_terminator = f.read()

with tm.ensure_clean() as path:
df.to_csv(path, lineterminator="\n")

with open(path, mode="rb") as f:
res_lineterminator = f.read()

assert re.match(res_line_terminator, res_lineterminator)

def test_to_csv_from_csv_categorical(self):

# CSV with categoricals should result in the same output
29 changes: 27 additions & 2 deletions pandas/tests/io/formats/test_to_csv.py
@@ -1,5 +1,6 @@
import io
import os
import re
import sys

import numpy as np
@@ -330,10 +331,15 @@ def test_to_csv_multi_index(self):
@pytest.mark.parametrize("klass", [pd.DataFrame, pd.Series])
def test_to_csv_single_level_multi_index(self, ind, expected, klass):
# see gh-19589
result = klass(pd.Series([1], ind, name="data")).to_csv(
# GH9568 test for equivalence between line_terminator and lineterminator
result_line_terminator = klass(pd.Series([1], ind, name="data")).to_csv(
line_terminator="\n", header=True
)
assert result == expected
result_lineterminator = klass(pd.Series([1], ind, name="data")).to_csv(
lineterminator="\n", header=True
)
assert re.match(result_lineterminator, result_line_terminator)
assert re.match(result_line_terminator, expected)

def test_to_csv_string_array_ascii(self):
# GH 10813
@@ -436,6 +442,25 @@ def test_to_csv_string_with_crlf(self):
with open(path, "rb") as f:
assert f.read() == expected_crlf

def test_to_csv_string_line_terminator_alternative_args(self):
# GH 9568
# test equivalence of line_terminator vs. lineterminator keyword args

data = {"int": [1, 2, 3], "str_lf": ["abc", "d\nef", "g\nh\n\ni"]}
df = pd.DataFrame(data)

with tm.ensure_clean("crlf_test.csv") as path:
df.to_csv(path, line_terminator="\n", index=False)
with open(path, "rb") as f:
res_line_terminator = f.read()

with tm.ensure_clean("crlf_test.csv") as path:
df.to_csv(path, lineterminator="\n", index=False)
with open(path, "rb") as f:
res_lineterminator = f.read()

assert re.match(res_line_terminator, res_lineterminator)

def test_to_csv_stdout_file(self, capsys):
# GH 21561
df = pd.DataFrame(
2 changes: 0 additions & 2 deletions pandas/tests/io/parser/test_common.py
@@ -2080,8 +2080,6 @@ def test_unexpected_keyword_parameter_exception(all_parsers):
parser = all_parsers

msg = "{}\\(\\) got an unexpected keyword argument 'foo'"
with pytest.raises(TypeError, match=msg.format("read_csv")):
parser.read_csv("foo.csv", foo=1)
with pytest.raises(TypeError, match=msg.format("read_table")):
parser.read_table("foo.tsv", foo=1)
