read_fwf() doesn't work properly when both skiprows and iterator options are used. #10261

arenius · 2015-06-03T17:21:38Z

When read_fwf is used with both iterator = True and skiprows = [list], it doesn't skip all the rows in the skiprows list. Each argument works properly when used on its own.

Here is a simple bit of code to reproduce:

import pandas as pd

#Create a fixed width file to test with.
df = pd.DataFrame({'a': range(10)})
with open('testfwf.txt', 'w') as f:
    f.write(df.to_string(index = False, header = False))

rows_to_skip = [0,1,2,6,9]

df_iter = pd.read_fwf('testfwf.txt', colspecs = [(0,2)], names = ['a'], iterator = True,
                      chunksize = 2, skiprows = rows_to_skip)

print('The fixed width file in chunks with rows [0,1,2,6,9] skipped: ')
for df in df_iter:
    print(df)

print('Notice how row 6 of the fixed width file has not been skipped even though it should')
print('have been.')

It seems that rows are skipped only until the first row that isn't skipped. For example, the leading rows 0, 1, 2 are skipped. But since the rows after them aren't skipped, the skipping stops for all rows until the end, when row 9 IS skipped.

arenius · 2015-06-03T17:23:12Z

Oh, I forgot to add my version information:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.4
Cython: 0.22
numpy: 1.8.0
scipy: 0.14.0
statsmodels: None
IPython: 3.1.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

jreback · 2015-06-03T19:25:14Z

pls show the input file, or even better a self-contained example:

data = """
.......
"""
pd.read_fwf(StringIO(data), .....)

arenius · 2015-06-03T20:39:11Z

I create an input file to be read at the very start of the code block.

jreback · 2015-06-03T22:45:39Z

sorry, see that now

In [50]: pd.read_fwf('testfwf.txt', colspecs = [(0,2)], names = ['a'],skiprows=rows_to_skip)
Out[50]: 
   a
0  3
1  4
2  5
3  7
4  8

This looks correct to me. I suppose it's a problem with the iterator. I am pretty sure that skiprows should not be allowed with an iterator in general. The rows get renumbered each time the iterator runs, so it's pretty useless. I suppose it could be fixed, but it would take some effort. You are welcome to look at this in detail.
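To make the renumbering concern concrete, here is a toy sketch in plain Python. It is not pandas's parser, and `iter_chunks` is a hypothetical helper invented for illustration. It assumes the desired semantics are that skiprows refers to absolute row numbers in the file: rows are numbered once over the whole input and filtered before chunking, so chunk boundaries cannot change which rows are dropped.

```python
from itertools import islice


def iter_chunks(lines, skiprows, chunksize):
    """Yield lists of up to `chunksize` kept lines.

    Hypothetical helper for illustration only; not part of pandas.
    Rows are numbered once, across the whole input, before chunking.
    """
    skip = set(skiprows)
    # enumerate() assigns absolute row numbers, so the filter does not
    # depend on where one chunk ends and the next begins.
    kept = (line for rownum, line in enumerate(lines) if rownum not in skip)
    while True:
        chunk = list(islice(kept, chunksize))
        if not chunk:
            return
        yield chunk


# Same shape as the report: ten rows, rows [0, 1, 2, 6, 9] skipped, chunksize 2.
lines = [str(i) for i in range(10)]
chunks = list(iter_chunks(lines, skiprows=[0, 1, 2, 6, 9], chunksize=2))
print(chunks)  # [['3', '4'], ['5', '7'], ['8']]
```

Flattened, the chunks contain 3, 4, 5, 7, 8, which is the same set of rows as the non-iterator read_fwf output shown above.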

jreback added the IO CSV read_csv, to_csv label Jun 3, 2015

jreback added the Error Reporting Incorrect or improved errors from pandas label Jun 3, 2015

mroeschke added the Enhancement label Apr 18, 2021

phofl mentioned this issue Nov 25, 2021

BUG: read_fwf not handling skiprows correctly with iterator #44621

Merged

jreback added this to the 1.4 milestone Nov 26, 2021

jreback closed this as completed in #44621 Nov 26, 2021
