
read_fwf 'infer' where first hundred lines differ from other lines #15138

Closed
adamboche opened this issue Jan 16, 2017 · 5 comments · Fixed by #23238
Labels
Enhancement IO CSV read_csv, to_csv
Comments

@adamboche

Code Sample, a copy-pastable example if possible

  1     1   -13.120080   0.229   0.484  -0.378  -0.872
  1     2    -1.902843  -0.090   0.256   1.791   0.967
  1     3   -22.050698  -0.176  -0.394   0.922  -0.454
  1     4   -30.349928   0.081  -0.194  -0.327  -0.981
  1     5   -22.204160  -0.168  -0.197   0.984  -0.266
  1     6   -28.001753  -0.065   0.597  -0.203  -0.802
  1     7   -17.247524   0.108   0.194   0.474   0.774
  1     8   -28.014811   0.017   0.994   0.493   0.112
  1     9   -13.325491   0.259   0.189  -1.275   0.149
  1    10   -10.063621   0.327   0.108  -1.784   0.061
...
115    18     5.697000   0.391  -0.027   0.252   1.000
115    19     8.324000  -0.283   0.132   0.227  -0.216
115    20    48.451000   0.070  -0.041   0.379  -0.082
115    21     0.146000   0.677   0.031  -0.561  -0.149
115    22     1.443000  -0.706  -0.033  -0.222   0.035
115    23     4.595000   0.654  -0.081   0.774   0.997
115    24     0.146000  -0.677   0.031   0.561  -0.149
115    25     4.595000   0.654  -0.081   0.774   0.997
115    26     6.769000  -0.363  -0.093  -0.298   0.996
115    27    24.157000  -0.280  -0.324  -0.142  -0.946

Problem description

I have a long fixed-width file (>100k lines) whose head and tail are shown above, and I want to read it with pandas; pd.read_fwf seems like the way to do this. The problem is that read_fwf infers the column boundaries from only the first hundred lines. Those lines all start with '  1', so it decides the first column begins at offset 2. The last lines start with '115', so the leading '11' falls outside the inferred column, the field is read as '5', and I silently lose data.
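The truncation is easy to reproduce with a toy file of the same shape (a sketch; the values below are made up, not the original data):

```python
import io
import pandas as pd

# 100 rows whose first field is a narrow "  1", followed by rows whose
# first field "115" starts two characters earlier.
lines = ["  1    0.500"] * 100 + ["115   12.300"] * 10
data = "\n".join(lines)

# Column positions are inferred from the first hundred lines only, so
# the leading "11" of "115" falls outside the inferred first column.
df = pd.read_fwf(io.StringIO(data), header=None)
print(df.iloc[-1, 0])   # 5, not 115 -- the data is silently truncated
```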

A couple of approaches to solving this issue come to mind, though I'm sure there are others:

  • Don't infer until all lines are scanned
  • Take as an argument the number of lines to be scanned before concluding the format, including the option to scan all (e.g. infer_from_all)
  • Take as an argument which direction to scan -- top to bottom or bottom to top
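As a stopgap, passing explicit widths sidesteps inference entirely. The widths below (3, 6, 13, then four fields of 8) are read off the sample above and are an assumption about the real file:

```python
import io
import pandas as pd

# Field widths are an assumption taken from the snippet above; adjust
# for the real file. Giving them explicitly disables colspec inference.
widths = [3, 6, 13, 8, 8, 8, 8]

sample = (
    "  1     1   -13.120080   0.229   0.484  -0.378  -0.872\n"
    "115    27    24.157000  -0.280  -0.324  -0.142  -0.946\n"
)
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None)
print(df[0].tolist())   # [1, 115] -- no truncation
```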

Output of pd.show_versions()

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

@jreback
Contributor

jreback commented Jan 17, 2017

I think it would be reasonable to pass down a new parameter, maybe infer_nrows, with a default of 100. Then you can control the amount of inference you want. It could accept infer_nrows='all' to infer from all the rows, I suppose.
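This is the parameter that #23238 eventually added: since pandas 0.24, read_fwf accepts infer_nrows (an int, default 100; the ='all' variant was not implemented). A sketch of how widening the inference window fixes the example, using made-up data in the same shape as the report:

```python
import io
import pandas as pd

lines = ["  1    0.500"] * 100 + ["115   12.300"] * 10
data = "\n".join(lines)

# Default: widths inferred from the first 100 lines -> "115" truncated.
bad = pd.read_fwf(io.StringIO(data), header=None)

# infer_nrows (pandas >= 0.24): extend the inference window past the
# point where the wider rows appear.
good = pd.read_fwf(io.StringIO(data), header=None, infer_nrows=200)

print(bad.iloc[-1, 0], good.iloc[-1, 0])   # 5 115
```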

PRs would be welcome!

@jreback jreback added this to the 0.20.0 milestone Jan 17, 2017
@keshavramaswamy
Contributor

I am on it :)

@yakovkeselman

Ran into the same issue. Had to sort the file to put the larger numbers on top.

I'd suggest, at least for numeric columns, not inferring the left boundary of the first column but starting it at 0.

I'd rather see a very conservative approach where nothing gets stripped/truncated initially. Rather, we can trim strings later on, if needed.

Thanks!

@jreback
Contributor

jreback commented Mar 20, 2017

@yakovkeselman contributions welcome! This is pretty straightforward to actually implement.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@rdmontgomery
Contributor

I just submitted a PR to address this issue. I left out the infer_nrows='all' option because I couldn't find an easy/efficient manner of knowing the number of rows in the file. I could have read in the whole file and calculated it, but I wasn't sure how that would react to partial reads or partial buffers. Any ideas on that would be welcome. Thanks!
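For seekable handles, one conservative way to count rows without disturbing the subsequent parse is to save and restore the stream position (a sketch only, not what the PR does; as noted above, non-seekable streams such as pipes would instead need the read lines buffered and replayed):

```python
import io


def count_rows(buf):
    """Count lines in a seekable text buffer, restoring its position.

    A sketch only: non-seekable streams (pipes, sockets) have no tell()
    to return to, so their lines would have to be buffered and replayed.
    """
    pos = buf.tell()
    n = sum(1 for _ in buf)
    buf.seek(pos)
    return n


buf = io.StringIO("a\nb\nc\n")
print(count_rows(buf))   # 3
print(buf.read()[:1])    # a -- the buffer is back at the start
```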

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 11, 2018

5 participants