read_fwf does not take into account skiprows when inferring column width #11256

Closed
xchardon opened this issue Oct 7, 2015 · 3 comments
Labels: Bug, IO CSV (read_csv, to_csv)

xchardon commented Oct 7, 2015

read_fwf can automatically detect column widths, but the detection does not seem to take into account the rows to be skipped (the skiprows parameter).
So, if the file contains text before the data, column detection fails. This could be solved by using only the actual data lines (and the header) to infer column widths.
I worked around the issue by opening the text file, calling readline as many times as necessary (= skiprows), and then calling read_fwf on that file object (with skiprows=0).

jreback commented Oct 7, 2015

Can you post a copy-pastable example that reproduces this?

jreback added the IO CSV (read_csv, to_csv) label Oct 7, 2015

xchardon commented Oct 7, 2015

Here is a minimal example:

import pandas as pd

with open('testfwf.txt', 'w') as f:
    f.write("Text contained in the file header\n")
    f.write("\n")
    f.write("DataCol1   DataCol2\n")
    f.write("     0.0        1.0\n")
    f.write("   101.6      956.1\n")

skiprows = 2

print("Using the read_fwf function simply:")
df = pd.read_fwf('testfwf.txt', skiprows = skiprows)
print(list(df))

print("Manually skipping the text lines before using read_fwf:")
with open('testfwf.txt', 'r') as f:
    for i in range(skiprows):
        f.readline()
    df = pd.read_fwf(f, encoding="utf8")
    print(list(df))

It gives the following result:

Using the read_fwf function simply:
['DataCol1   DataCol2', 'Unnamed: 1', 'Unnamed: 2']
Manually skipping the text lines before using read_fwf:
['DataCol1', 'DataCol2']

I'm using Python 3.4.3 and pandas 0.16.2.
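
For readers on an affected version, another way around the problem is to skip inference entirely by passing explicit colspecs. The following is a minimal sketch under stated assumptions: the column boundaries were read off the sample lines by hand (they do not appear in the report), and the file content is held in an in-memory buffer instead of testfwf.txt.

```python
import io

import pandas as pd

# Same layout as the example file above, held in memory for convenience.
text = (
    "Text contained in the file header\n"
    "\n"
    "DataCol1   DataCol2\n"
    "     0.0        1.0\n"
    "   101.6      956.1\n"
)

# Assumed column boundaries as half-open [start, end) character offsets,
# measured by hand from the header and data lines (not taken from the report).
colspecs = [(0, 8), (8, 19)]

# With explicit colspecs no width inference takes place, so the free-text
# lines at the top cannot influence where the columns are cut.
df = pd.read_fwf(io.StringIO(text), skiprows=2, colspecs=colspecs)
print(list(df.columns))
print(df)
```

The trade-off is that the offsets have to be maintained by hand, so this only suits files whose layout is known and stable.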

jreback commented Oct 7, 2015

OK, this looks like a bug. Pull requests to fix it are welcome!

jreback added the Bug label Oct 7, 2015
jreback added this to the Next Major Release milestone Oct 7, 2015
jreback changed the milestone from Next Major Release to 0.20.0 Jan 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
Fix the fact that we don't skip the rows when inferring colspecs by
passing skiprows down the chain until it's needed.

- [X] closes pandas-dev#11256
- [X] 3 tests added / passed
- [X] passes `git diff upstream/master | flake8 --diff`
- [X] whatsnew entry

Author: D.S. McNeil

Closes pandas-dev#14028 from dsm054/bugfix/fwf_skiprows and squashes the following commits:

b5b3e66 [D.S. McNeil] BUG: read_fwf inference should respect skiprows (pandas-dev#11256)
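
To illustrate why inference has to respect skiprows, here is a simplified sketch of whitespace-based column detection. It is not the pandas implementation: the function name infer_colspecs and its max_rows parameter are hypothetical, and the approach (mark every character offset that is non-blank in any sampled row, then treat each run of marked offsets as a column) is only an assumption about the general idea.

```python
def infer_colspecs(lines, skiprows=0, max_rows=100):
    """Guess half-open (start, end) column spans from whitespace gaps.

    Hypothetical helper for illustration only; the first `skiprows` lines
    are ignored, as are blank lines among the sampled rows.
    """
    sample = lines[skiprows:skiprows + max_rows]
    rows = [ln.rstrip("\n") for ln in sample if ln.strip()]
    if not rows:
        return []

    width = max(len(r) for r in rows)
    # filled[i] is True if any sampled row has a non-space character at offset i.
    filled = [False] * width
    for r in rows:
        for i, ch in enumerate(r):
            if ch != " ":
                filled[i] = True

    # Each maximal run of filled offsets becomes one column span.
    # The trailing False sentinel closes a run that reaches the last offset.
    spans, start = [], None
    for i, is_filled in enumerate(filled + [False]):
        if is_filled and start is None:
            start = i
        elif not is_filled and start is not None:
            spans.append((start, i))
            start = None
    return spans


sample_lines = [
    "Text contained in the file header",
    "",
    "DataCol1   DataCol2",
    "     0.0        1.0",
    "   101.6      956.1",
]

print(infer_colspecs(sample_lines, skiprows=0))
print(infer_colspecs(sample_lines, skiprows=2))
```

On the sample file from the report, including the free-text line fills in the gap between the two data columns and produces three spans, while skipping the first two lines leaves a clean gap and produces two.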