read_fwf docs #49832

Closed
wants to merge 9 commits into from
9 changes: 6 additions & 3 deletions doc/source/user_guide/io.rst
@@ -1366,8 +1366,10 @@ a different usage of the ``delimiter`` parameter:
* ``widths``: A list of field widths which can be used instead of 'colspecs'
if the intervals are contiguous.
* ``delimiter``: Characters to consider as filler characters in the fixed-width file.
-Can be used to specify the filler character of the fields
-if it is not spaces (e.g., '~').
+Default are space and tab characters.
+Used to specify the character(s) to strip from start and end of every field.
+To preserve whitespace, set to a character that does not exist in the data,
Member

This sounds like an anti-pattern; you should not use the function at all when you want to read the whole data as one column.

Author

@RonaldBarnes RonaldBarnes Nov 23, 2022

Hi @phofl,

I think there's a misunderstanding.

This PR is merely documenting the current anti-patterns that exist in read_fwf:

  1. The input file contains 172 fields (columns), precisely defined by colspecs.
  2. The file is read into a DataFrame.
  3. The data is mangled: whitespace is stripped from every field.
  4. To preserve whitespace, a delimiter argument is also required, and its value must be something that never appears at the start or end of any field.
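
The four steps above can be reproduced with a small sketch. The two-field sample data and its colspecs are made up for illustration (the real file has 172 fields), and `dtype=str` is passed so the values can be compared as raw strings. It assumes, as described in this thread, that a `delimiter` value absent from the data leaves whitespace untouched.

```python
import io

import pandas as pd

# Hypothetical two-field records, five characters per field.
data = "ab    cd  \n ef  gh   \n"
colspecs = [(0, 5), (5, 10)]

# Default behaviour: spaces and tabs are stripped from both ends of each field.
stripped = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None, dtype=str)

# Workaround: pass a delimiter that never occurs in the data ("\0" here),
# so that no whitespace is stripped from the fields.
preserved = pd.read_fwf(
    io.StringIO(data), colspecs=colspecs, header=None, dtype=str, delimiter="\0"
)

print(stripped.values.tolist())
print(preserved.values.tolist())
```

Comparing the two printed lists shows which call kept the padding inside each five-character field.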

Anti-patterns observed:

  • In a fixed-width data file, data should not be changed unless explicitly requested
  • Fixed-width files do not have delimiters, rather colspecs
  • read_csv will preserve white space

I think parts of read_fwf were designed to handle tabular, human readable data, not flat database files. For reading tabular data, read_table seems the appropriate tool, IMHO.

TL;DR: This PR attempts to accurately describe the current behaviour, as #16772 shows people are still confused by it and #16950 didn't address readers.py.

Also, thank you @jbrockmendel for labelling this as Docs! I could not figure out how to do that myself.

Member

I'd rather fix this than document a workaround, then.

Author

Happy to hear a fix is preferred. Working on that now.

I expect some controversy from breaking the current default behaviour, but I will clearly document how to keep the current behaviour of stripping whitespace. I am also inclined to mention read_table as a potential solution for some users.

If anyone is using delimiter="~", as in the documentation's example, I plan to keep supporting that usage, but I am thinking of raising a FutureWarning whenever the delimiter keyword is used.

Is this reasonable / acceptable pandas policy?
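
The FutureWarning idea above could look roughly like the following. This is only a sketch of the general pattern, not code from this PR: `resolve_fwf_delimiter`, the `_NO_DEFAULT` sentinel, and the warning text are all hypothetical names chosen for illustration.

```python
import warnings

_NO_DEFAULT = object()  # hypothetical sentinel meaning "keyword not passed"


def resolve_fwf_delimiter(delimiter=_NO_DEFAULT):
    """Return the characters to strip, warning if the caller set them explicitly."""
    if delimiter is _NO_DEFAULT:
        return " \t"  # default discussed in this PR: space and tab
    warnings.warn(
        "The 'delimiter' keyword of read_fwf may change in a future version.",
        FutureWarning,
        stacklevel=2,
    )
    return delimiter  # existing usage such as delimiter="~" keeps working


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_chars = resolve_fwf_delimiter()
    custom_chars = resolve_fwf_delimiter("~")

print(len(caught), repr(default_chars), repr(custom_chars))
```

The sentinel lets the function tell "keyword omitted" apart from "keyword passed with the default value", so only callers who actually pass `delimiter` see the warning.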

Should I amend doc/source/whatsnew/v1.5.3.rst or doc/source/whatsnew/v2.0.0.rst?

Thank you for your help with all the issues I ran into while attempting a successful first PR!

Member

2.0; only regression fixes are backported.

So, to summarise: if a delimiter is passed and the character is present in the data, what happens? If a delimiter is passed and does not exist in the data, all whitespace is preserved, correct?

Author

Correct on both counts.

  • If a delimiter is passed and the character is present, it is stripped from the start and end of every field.
  • If a delimiter is passed that is not a space character, whitespace is preserved.

Assigning default value(s) to delimiter:
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1168

Stripping the delimiter character(s) from each field (which also removes \n and \r from each line):
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1267
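
A simplified pure-Python model of that stripping step may make the two cases concrete. It is not the pandas implementation: `split_fixed_width` and the sample line are hypothetical, and the sketch assumes that "\r\n" is always added to the strip set, as the note above about \n and \r suggests.

```python
def split_fixed_width(line, colspecs, delimiter=" \t"):
    """Slice one line by half-open colspecs and strip each field."""
    # Assumption: newline characters are always stripped along with the delimiter.
    strip_chars = "\r\n" + delimiter
    return [line[start:stop].strip(strip_chars) for start, stop in colspecs]


line = "~~ab~ cd  \n"
colspecs = [(0, 5), (5, 10)]

# Default delimiter: whitespace is stripped, "~" is kept.
default_fields = split_fixed_width(line, colspecs)

# delimiter="~": "~" is stripped from both ends, whitespace is kept.
tilde_fields = split_fixed_width(line, colspecs, delimiter="~")

print(default_fields)  # ['~~ab~', 'cd']
print(tilde_fields)    # ['ab', ' cd  ']
```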

Member

Is the fact that we strip the character from the end of the fields documented anywhere? If not, we can definitely deprecate it. This sounds odd.

Author

It is mentioned obliquely:

https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/io.rst

The function parameters to read_fwf are largely the same as read_csv with two extra parameters, and a different usage of the delimiter parameter:
...
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').

This is confusing: for flat files, any use of delimiters is unexpected, since the colspecs are defined (or inferred; I still need to check how delimiters are used during inference).

In a flat file, there are no "filler character[s]", hence confusion.
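
On the question of inferred colspecs, here is a simplified pure-Python model of how a delimiter could drive inference. It is not pandas' code: `infer_colspecs` and the sample rows are hypothetical, and the model assumes a column is any run of positions that holds a non-delimiter character in at least one sampled row.

```python
def infer_colspecs(rows, delimiter=" \t"):
    """Guess half-open column intervals from positions holding non-delimiter chars."""
    width = max(len(row) for row in rows)
    occupied = [False] * width
    for row in rows:
        for i, ch in enumerate(row):
            if ch not in delimiter:
                occupied[i] = True

    colspecs = []
    start = None
    # The trailing False acts as a sentinel that closes the final run.
    for i, used in enumerate(occupied + [False]):
        if used and start is None:
            start = i
        elif not used and start is not None:
            colspecs.append((start, i))
            start = None
    return colspecs


rows = ["id   name ", "1    Ann  ", "22   Bob  "]

print(infer_colspecs(rows))                  # [(0, 2), (5, 9)]
print(infer_colspecs(rows, delimiter="\0"))  # [(0, 10)]
```

Under this model, a delimiter that never occurs in the data makes every position count as occupied, so inference collapses to a single column. Passing explicit colspecs avoids that.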

Later, among the examples, is this:

The parser will take care of extra white spaces around the columns so it's ok to have extra separation between the columns in the file.

All examples are using tabular (human-readable) data.

There seems to be some conflation of read_table and read_fwf, IMHO.

See the discussion in #16772, from 2017, with follow-up questions still arriving in 2022.

Author

"I'd rather fix this instead of documenting a workaround then"

I think I've come up with a solution that

  • causes minimal disruption to users depending on existing behaviour
  • clearly documents existing behaviour
  • adds 2 options to give finer-grained control over the whitespace handling in read_fwf

A newer PR can be found at: #51569

+i.e. "\0".

Consider a typical fixed-width data file:

@@ -1404,8 +1406,9 @@ column widths for contiguous columns:
df = pd.read_fwf("bar.csv", widths=widths, header=None)
df

-The parser will take care of extra white spaces around the columns
+The parser will take care of extra whitespace around the columns,
so it's ok to have extra separation between the columns in the file.
+To preserve whitespace around the columns, see ``delimiter``.

By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using the
first 100 rows of the file. It can do it only in cases when the columns are
14 changes: 12 additions & 2 deletions pandas/io/parsers/readers.py
@@ -440,7 +440,12 @@
"float_precision": None,
}

-_fwf_defaults = {"colspecs": "infer", "infer_nrows": 100, "widths": None}
+_fwf_defaults = {
+"colspecs": "infer",
+"infer_nrows": 100,
+"widths": None,
+"delimiter": " ", # space & [TAB]
+}

_c_unsupported = {"skipfooter"}
_python_unsupported = {"low_memory", "float_precision"}
@@ -1236,6 +1241,7 @@ def read_fwf(
*,
colspecs: Sequence[tuple[int, int]] | str | None = "infer",
widths: Sequence[int] | None = None,
+delimiter: str | None = " \t",
infer_nrows: int = 100,
**kwds,
) -> DataFrame | TextFileReader:
@@ -1256,7 +1262,7 @@ def read_fwf(
Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is
expected. A local file could be:
``file://localhost/path/to/table.csv``.
-colspecs : list of tuple (int, int) or 'infer'. optional
+colspecs : list of tuple (int, int) or 'infer', optional
A list of tuples giving the extents of the fixed-width
fields of each line as half-open intervals (i.e., [from, to[ ).
String value 'infer' can be used to instruct the parser to try
@@ -1265,6 +1271,10 @@ def read_fwf(
widths : list of int, optional
A list of field widths which can be used instead of 'colspecs' if
the intervals are contiguous.
+delimiter : str, default " \t" (space and tab), optional
+Character(s) to strip from start and end of each field. To
+preserve whitespace, must be non-default value (i.e. delimiter="\0").
+Used by `colspecs="infer"` to determine column boundaries.
infer_nrows : int, default 100
The number of rows to consider when letting the parser determine the
`colspecs`.