
read_fwf() doesn't work properly when both skiprows and iterator options are used. #10261


Closed
arenius opened this issue Jun 3, 2015 · 4 comments · Fixed by #44621
Labels: Enhancement · Error Reporting (incorrect or improved errors from pandas) · IO CSV (read_csv, to_csv)

@arenius

arenius commented Jun 3, 2015

When read_fwf is used with iterator = True and skiprows = [list] arguments it doesn't properly skip all the rows in the skiprows list. Things work properly when either of those arguments is used in isolation.

Here is a simple bit of code to reproduce:

import pandas as pd

# Create a fixed-width file to test with.
df = pd.DataFrame({'a': range(10)})
with open('testfwf.txt', 'w') as f:
    f.write(df.to_string(index=False, header=False))

rows_to_skip = [0, 1, 2, 6, 9]

df_iter = pd.read_fwf('testfwf.txt', colspecs=[(0, 2)], names=['a'], iterator=True,
                      chunksize=2, skiprows=rows_to_skip)

print('The fixed width file in chunks with rows [0,1,2,6,9] skipped: ')
for chunk in df_iter:
    print(chunk)

print('Notice how row 6 of the fixed width file has not been skipped even though it should')
print('have been.')

It seems that rows are skipped only until the first row that isn't skipped. For example, the leading rows 0, 1, 2 are skipped. But once there are rows that aren't skipped, the skipping stops for all remaining rows until the end, when row 9 IS skipped.
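The behaviour described above can also be checked without writing a file. The following is a minimal sketch, not the reporter's code: it builds the same ten rows in memory (the `data` string and the `common` keyword dict are illustrative names) and compares a single read against a chunked read. The chunked result depends on the pandas version in use, so no expected output is given for it.

```python
from io import StringIO

import pandas as pd

# Ten single-column rows, right-aligned in a width of 2 (" 0" .. " 9").
data = "\n".join(f"{i:2d}" for i in range(10))
rows_to_skip = [0, 1, 2, 6, 9]
common = dict(colspecs=[(0, 2)], names=["a"], skiprows=rows_to_skip)

# Reading everything at once: rows 3, 4, 5, 7, 8 remain.
full = pd.read_fwf(StringIO(data), **common)["a"].tolist()

# Reading in chunks of two: the concatenated values should match `full`
# if skiprows is honoured across chunks.
chunked = [
    int(v)
    for chunk in pd.read_fwf(StringIO(data), chunksize=2, **common)
    for v in chunk["a"]
]

print("all at once:", full)
print("in chunks:  ", chunked)
print("consistent: ", full == chunked)
```

If the last line prints `False`, the installed version shows the problem reported here.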

@arenius
Author

arenius commented Jun 3, 2015

Oh, I forgot to add my version information:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.4
Cython: 0.22
numpy: 1.8.0
scipy: 0.14.0
statsmodels: None
IPython: 3.1.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jun 3, 2015

pls show the input file, or even better a self-contained example:

data = """
.......
"""
pd.read_fwf(StringIO(data), .....)

@jreback jreback added the IO CSV read_csv, to_csv label Jun 3, 2015
@arenius
Author

arenius commented Jun 3, 2015

I create an input file to be read at the very start of the code block.

@jreback
Contributor

jreback commented Jun 3, 2015

sorry, see that now

In [50]: pd.read_fwf('testfwf.txt', colspecs = [(0,2)], names = ['a'],skiprows=rows_to_skip)
Out[50]: 
   a
0  3
1  4
2  5
3  7
4  8

This looks correct to me. I suppose it's a problem with the iterator. I am pretty sure that skiprows should not be allowed with an iterator in general. The rows get renumbered each time the iterator runs, so it's pretty useless. I suppose it could be fixed, but it would take some effort. You are welcome to look at this in detail.
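One way to sidestep the renumbering concern is to drop the unwanted lines by their absolute line number before the parser sees them, and only then read in chunks. The sketch below assumes the input is small enough to filter in memory; `read_fwf_chunks_skipping` is a hypothetical helper written for illustration and is not part of pandas.

```python
from io import StringIO

import pandas as pd


def read_fwf_chunks_skipping(text, skip, **kwargs):
    """Hypothetical helper: filter lines by absolute index, then hand the rest to read_fwf."""
    skip = set(skip)
    kept = [line for i, line in enumerate(text.splitlines()) if i not in skip]
    # skiprows is deliberately not passed on; the filtering has already happened.
    return pd.read_fwf(StringIO("\n".join(kept)), **kwargs)


# Same shape of data as in the report: ten rows, one column of width 2.
data = "\n".join(f"{i:2d}" for i in range(10))

chunks = list(
    read_fwf_chunks_skipping(
        data, [0, 1, 2, 6, 9], colspecs=[(0, 2)], names=["a"], chunksize=2
    )
)
values = [int(v) for chunk in chunks for v in chunk["a"]]
print(values)
```

Because the skipped lines are removed up front, the parser never has to track skip positions across chunk boundaries. For files too large for memory, the same filtering could be done lazily with a generator feeding a file-like wrapper, which is left out here.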

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Jun 3, 2015
@jreback jreback added this to the 1.4 milestone Nov 26, 2021