Skip to content

Fix delim_whitespace behavior in read_table, read_csv #36709

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 8, 2020

Conversation

OlehKSS
Copy link
Contributor

@OlehKSS OlehKSS commented Sep 28, 2020

@@ -758,15 +761,17 @@ def read_table(
float_precision=None,
):
# TODO: validation duplicated in read_csv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you look for a way to de-duplicate the validation again? See Todo note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be impossible because read_table and read_csv have different default values ("\t" and ","). Since read_table uses read_csv internally, I have to pass "\t" explicitly. In that case I will explicitly provide the required sep value, thus validation will not work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could for example implement a function _read_csv called by read_csv and read_table where the default value is another parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to create a validation function instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved validation and actual call for file reading to a separate function.

@pep8speaks
Copy link

pep8speaks commented Sep 29, 2020

Hello @OlehKSS! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-07 21:31:37 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, ideally if you can simplify the logic at all and still have everything pass :->

pls add a note in bug fixes for 1.2 in I/O section.

if dialect is not None:
sep_override = delimiter is None and sep == default_sep
sep_override = (delimiter is None) and (sep is lib.no_default or sep == ",")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary anymore because we can no easily determine if sep is set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to give my reasoning behind this change. The problem is that this piece of code treats "," as a default value even if it provided explicitly. In this pull request, I try to make read_csv and read_table a distinction between the default value and explicitly provided parameter value (even if it's value is identical to the default one, i.e. ",", "\t"). This statement wasn't making that distinction in the first place and there was no reason to change that. That's why I kept old behavior by complicating this if-statement. In case you think it I should revert this change and make sep_override = False when the function is being called like read_csv(sep=","...), I can do it.

Though I can't answer your question, I hope my explanation will help. Otherwise, could provide some more input?

@jreback jreback requested a review from gfyoung September 30, 2020 02:22
@jreback jreback added API Design Bug IO CSV read_csv, to_csv labels Sep 30, 2020
@jreback jreback added this to the 1.2 milestone Sep 30, 2020
@jreback
Copy link
Contributor

jreback commented Sep 30, 2020

cc @gfyoung if you can have a look.

@gfyoung
Copy link
Member

gfyoung commented Sep 30, 2020

The changes look pretty reasonable, @jreback's comment notwithstanding.

Test failures look unrelated AFAICT.

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Oct 2, 2020

I updated the pull request:

  • added a note to whatsnew
  • moved duplicated validation to a separate function

@@ -3782,3 +3684,59 @@ def _make_reader(self, f):
self.skiprows,
self.infer_nrows,
)


def _check_defaults_read(read_kwds: dict, defaults: dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we instead pass by keyword args (you can still pass the defaults as a dict i think), type is using Dict[str, Any] though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok



def _check_defaults_read(read_kwds: dict, defaults: dict):
"""Check default values of input parameters, read a line file."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should return a new dict of the paramaters, it should actually do the reading; leave that to the caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


def _check_defaults_read(read_kwds: dict, defaults: dict):
"""Check default values of input parameters, read a line file."""
engine = read_kwds["engine"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a full-on doc-string here with the Parameters, Returns, and Raises section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Oct 3, 2020

@jreback @gfyoung @phofl
I implemented the requested changes in the latest commits. Could you take one more look?

@jreback
Copy link
Contributor

jreback commented Oct 7, 2020

lgtm, ping on green.

cc @gfyoung if any comments.

@gfyoung
Copy link
Member

gfyoung commented Oct 7, 2020

The changes look good, but not sure why CI isn't all green...

@OlehKSS
Copy link
Contributor Author

OlehKSS commented Oct 7, 2020

One CI test is failing because it exceeds a timeout of 60min (Windows_py38_np18) while running tests. It seems like there are no tests that actually crash, but one test results in Windows Access Violation in one of the threads, which seems raised by a dependency

@jreback jreback merged commit eb0c443 into pandas-dev:master Oct 8, 2020
@jreback
Copy link
Contributor

jreback commented Oct 8, 2020

thanks @OlehKSS very nice.

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Read_Table and Read_Csv does not raise when delim_whitespace=True and sep=default is given
5 participants