-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix delim_whitespace behavior in read_table, read_csv #36709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/parsers.py
Outdated
@@ -758,15 +761,17 @@ def read_table( | |||
float_precision=None, | |||
): | |||
# TODO: validation duplicated in read_csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you look for a way to de-duplicate the validation again? See Todo note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be impossible because read_table
and read_csv
have different default values ("\t" and ","). Since read_table
uses read_csv
internally, I have to pass "\t" explicitly. In that case I will explicitly provide the required sep
value, thus validation will not work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could for example implement a function _read_csv
called by read_csv
and read_table
where the default value is another parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to create a validation function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved validation and actual call for file reading to a separate function.
2ff79db
to
b7153d6
Compare
b7153d6
to
6e0f3f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good, ideally if you can simplify the logic at all and still have everything pass :->
pls add a note in bug fixes for 1.2 in I/O section.
pandas/io/parsers.py
Outdated
if dialect is not None: | ||
sep_override = delimiter is None and sep == default_sep | ||
sep_override = (delimiter is None) and (sep is lib.no_default or sep == ",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary anymore because we can no easily determine if sep is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to give my reasoning behind this change. The problem is that this piece of code treats "," as a default value even if it provided explicitly. In this pull request, I try to make read_csv
and read_table
a distinction between the default value and explicitly provided parameter value (even if it's value is identical to the default one, i.e. ",", "\t"). This statement wasn't making that distinction in the first place and there was no reason to change that. That's why I kept old behavior by complicating this if-statement. In case you think it I should revert this change and make sep_override = False
when the function is being called like read_csv(sep=","...)
, I can do it.
Though I can't answer your question, I hope my explanation will help. Otherwise, could provide some more input?
cc @gfyoung if you can have a look. |
The changes look pretty reasonable, @jreback's comment notwithstanding. Test failures look unrelated AFAICT. |
I updated the pull request:
|
pandas/io/parsers.py
Outdated
@@ -3782,3 +3684,59 @@ def _make_reader(self, f): | |||
self.skiprows, | |||
self.infer_nrows, | |||
) | |||
|
|||
|
|||
def _check_defaults_read(read_kwds: dict, defaults: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we instead pass by keyword args (you can still pass the defaults as a dict i think), type is using Dict[str, Any] though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
pandas/io/parsers.py
Outdated
|
||
|
||
def _check_defaults_read(read_kwds: dict, defaults: dict): | ||
"""Check default values of input parameters, read a line file.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should return a new dict of the paramaters, it should actually do the reading; leave that to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
pandas/io/parsers.py
Outdated
|
||
def _check_defaults_read(read_kwds: dict, defaults: dict): | ||
"""Check default values of input parameters, read a line file.""" | ||
engine = read_kwds["engine"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a full-on doc-string here with the Parameters, Returns, and Raises section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
lgtm, ping on green. cc @gfyoung if any comments. |
The changes look good, but not sure why CI isn't all green... |
One CI test is failing because it exceeds a timeout of 60min (Windows_py38_np18) while running tests. It seems like there are no tests that actually crash, but one test results in Windows Access Violation in one of the threads, which seems raised by a dependency |
thanks @OlehKSS very nice. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff