read_fwf docs #49832
Closed
9 commits
2bfa90a Updated documentation indicating default behaviour is to strip whites… (RonaldBarnes)
1823753 Merge branch 'pandas-dev:main' into read_fwf_docs (RonaldBarnes)
f297d99 Fix failed Sphinx lint issue. (RonaldBarnes)
49f24d5 Merge branch 'read_fwf_docs' of github.com:RonaldBarnes/pandas into r… (RonaldBarnes)
a0304a7 Added delimiter to _fwf_defaults. (RonaldBarnes)
7adb89d Changed comment from ## to # per flake8. (RonaldBarnes)
62b8125 Merge branch 'pandas-dev:main' into read_fwf_docs (RonaldBarnes)
ab111c7 Delimiters used by colspecs='infer' (RonaldBarnes)
9504520 Merge branch 'pandas-dev:main' into read_fwf_docs (RonaldBarnes)
This sounds like an anti-pattern; you should not use the function at all when you want to read the whole data as one column.
Hi @phofl,
I think there's a misunderstanding.
This PR is merely documenting the current anti-patterns that exist in `read_fwf`.
Anti-patterns observed:
`read_csv` will preserve white space
I think parts of `read_fwf` were designed to handle tabular, human-readable data, not flat database files. For reading tabular data, `read_table` seems the appropriate tool, IMHO.
TL;DR: This PR is attempting to accurately describe the current behaviour, as #16772 shows people are still confused by it and #16950 didn't address readers.py.
Also, thank you @jbrockmendel for labelling this as Docs! I could not figure out how to do that myself.
I'd rather fix this instead of documenting a workaround then
Happy to hear a fix is preferred. Working on that now.
Expecting controversy by breaking current default behaviour but will clearly document how to achieve current behaviour of stripping white space. Am inclined to also mention `read_table` as a potential solution for some users.
If anyone is using `delimiter="~"` as is mentioned as an example in the documentation, planning to continue to support such usage, but thinking to raise `FutureWarning` if `delimiter` keyword is used.
Is this reasonable / acceptable pandas policy?
Should I amend `doc/source/whatsnew/v1.5.3.rst` or `doc/source/whatsnew/v2.0.0.rst`?
Thank you for your help with all the issues with attempting a successful first PR!
2.0, only regression fixes are backported
So to summarise: If a delimiter is passed and the character is present: What happens in this case? If a delimiter is passed and does not exist, all whitespaces are preserved, correct?
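The cases being summarised here can be checked with a small experiment. This is only a sketch: it assumes pandas is installed and behaves as described in this thread, and the sample data, `colspecs`, and the helper `read_rows` are made up for illustration.

```python
import io

import pandas as pd

# Made-up fixed-width sample: two 4-character fields per line.
data = "ab~~cd  \n  ef~~gh\n"
colspecs = [(0, 4), (4, 8)]


def read_rows(**kwargs):
    """Hypothetical helper: read the sample as strings and return plain lists."""
    frame = pd.read_fwf(
        io.StringIO(data), colspecs=colspecs, header=None, dtype=str, **kwargs
    )
    return frame.to_numpy().tolist()


# No delimiter given: whitespace is stripped from both ends of each field,
# while the "~" characters are left alone.
default_rows = read_rows()

# Delimiter present in the data: "~" is stripped from the field ends,
# and whitespace is now preserved.
tilde_rows = read_rows(delimiter="~")

# Delimiter absent from the data: nothing is stripped except line endings.
absent_rows = read_rows(delimiter="|")

print(default_rows)
print(tilde_rows)
print(absent_rows)
```

Passing a character that never occurs in the file is the workaround for keeping whitespace that this PR set out to document.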
Correct on both counts.
Assigning default value(s) to delimiter:
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1168
Stripping delimiter(s) from each field (thus also removes `\n\r` from each line):
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1267
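The two steps linked above can be modelled in a few lines of plain Python. This is a simplified sketch of the described behaviour, not the pandas source: the function name `split_fixed_width` and the constant `DEFAULT_STRIP_CHARS` are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed default set of characters stripped when no delimiter is passed.
DEFAULT_STRIP_CHARS = "\n\r\t "


def split_fixed_width(
    line: str,
    colspecs: Sequence[Tuple[int, int]],
    delimiter: Optional[str] = None,
) -> List[str]:
    """Slice one line by (start, stop) column specs and strip each field.

    When a delimiter is given, only line endings and the delimiter
    characters are stripped, so ordinary spaces survive.
    """
    strip_chars = "\r\n" + delimiter if delimiter else DEFAULT_STRIP_CHARS
    return [line[start:stop].strip(strip_chars) for start, stop in colspecs]


line = "ab  cd  \n"
specs = [(0, 4), (4, 9)]

stripped = split_fixed_width(line, specs)        # whitespace removed
preserved = split_fixed_width(line, specs, "~")  # whitespace kept
print(stripped, preserved)
```

Because the strip set is built from the delimiter, passing any delimiter replaces the whitespace defaults rather than adding to them, which is why an unused character preserves spaces.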
Is the fact that we strip the character from the end of the fields documented anywhere? If no, we can definitely deprecate. This sounds odd
It is mentioned obliquely:
https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/io.rst
This is confusing, as for flat files any use of delimiters is unexpected since colspecs are defined (or, inferred - need to check the use of delimiters here).
In a flat file, there are no "filler character[s]", hence confusion.
Later, among the examples, is this:
All examples are using tabular (human-readable) data.
Some conflation between `read_table` and `read_fwf`, IMHO.
See mentions at #16772, from 2017, and follow-up questions still in 2022.
I think I've come up with a solution that
read_fwf
A newer PR can be found at: #51569