
read_fwf 'infer' where first hundred lines differ from other lines #15138

Closed
adamboche opened this issue Jan 16, 2017 · 5 comments · Fixed by #23238
Labels
Enhancement IO CSV read_csv, to_csv
Comments

@adamboche

Code Sample, a copy-pastable example if possible

  1     1   -13.120080   0.229   0.484  -0.378  -0.872
  1     2    -1.902843  -0.090   0.256   1.791   0.967
  1     3   -22.050698  -0.176  -0.394   0.922  -0.454
  1     4   -30.349928   0.081  -0.194  -0.327  -0.981
  1     5   -22.204160  -0.168  -0.197   0.984  -0.266
  1     6   -28.001753  -0.065   0.597  -0.203  -0.802
  1     7   -17.247524   0.108   0.194   0.474   0.774
  1     8   -28.014811   0.017   0.994   0.493   0.112
  1     9   -13.325491   0.259   0.189  -1.275   0.149
  1    10   -10.063621   0.327   0.108  -1.784   0.061
...
115    18     5.697000   0.391  -0.027   0.252   1.000
115    19     8.324000  -0.283   0.132   0.227  -0.216
115    20    48.451000   0.070  -0.041   0.379  -0.082
115    21     0.146000   0.677   0.031  -0.561  -0.149
115    22     1.443000  -0.706  -0.033  -0.222   0.035
115    23     4.595000   0.654  -0.081   0.774   0.997
115    24     0.146000  -0.677   0.031   0.561  -0.149
115    25     4.595000   0.654  -0.081   0.774   0.997
115    26     6.769000  -0.363  -0.093  -0.298   0.996
115    27    24.157000  -0.280  -0.324  -0.142  -0.946

Problem description

I have a long fixed-width file (>100k lines) whose head and tail are shown above, and I want to read it with pandas; pd.read_fwf seems like the way to do this. The problem is that read_fwf infers the column boundaries from only the first hundred lines. Those lines all start with '  1', so it decides the first column begins at offset 2. The last lines start with '115', so the leading '11' falls outside the inferred column, the field is read as '5', and I silently lose data.
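The truncation is easy to reproduce with a toy file of the same shape (a sketch; the values below are made up, not the original data):

```python
import io
import pandas as pd

# 100 rows whose first field is a narrow "  1", followed by rows whose
# first field "115" starts two characters earlier.
lines = ["  1    0.500"] * 100 + ["115   12.300"] * 10
data = "\n".join(lines)

# Column positions are inferred from the first hundred lines only, so
# the leading "11" of "115" falls outside the inferred first column.
df = pd.read_fwf(io.StringIO(data), header=None)
print(df.iloc[-1, 0])   # 5, not 115 -- the data is silently truncated
```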

A couple of approaches to solving this issue come to mind, though I'm sure there are others:

  • Don't infer until all lines are scanned
  • Take as an argument the number of lines to be scanned before concluding the format, including the option to scan all (e.g. infer_from_all)
  • Take as an argument which direction to scan -- top to bottom or bottom to top
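As a stopgap, passing explicit widths sidesteps inference entirely. The widths below (3, 6, 13, then four fields of 8) are read off the sample above and are an assumption about the real file:

```python
import io
import pandas as pd

# Field widths are an assumption taken from the snippet above; adjust
# for the real file. Giving them explicitly disables colspec inference.
widths = [3, 6, 13, 8, 8, 8, 8]

sample = (
    "  1     1   -13.120080   0.229   0.484  -0.378  -0.872\n"
    "115    27    24.157000  -0.280  -0.324  -0.142  -0.946\n"
)
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None)
print(df[0].tolist())   # [1, 115] -- no truncation
```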

Output of pd.show_versions()

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

@jreback
Contributor

jreback commented Jan 17, 2017

I think it would be reasonable to pass down a new parameter, maybe infer_nrows, with a default of 100. Then you can control the amount of inference you want. It could accept infer_nrows='all' to infer from all the rows, I suppose.
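This is the parameter that #23238 eventually added: since pandas 0.24, read_fwf accepts infer_nrows (an int, default 100; the ='all' variant was not implemented). A sketch of how widening the inference window fixes the example, using made-up data in the same shape as the report:

```python
import io
import pandas as pd

lines = ["  1    0.500"] * 100 + ["115   12.300"] * 10
data = "\n".join(lines)

# Default: widths inferred from the first 100 lines -> "115" truncated.
bad = pd.read_fwf(io.StringIO(data), header=None)

# infer_nrows (pandas >= 0.24): extend the inference window past the
# point where the wider rows appear.
good = pd.read_fwf(io.StringIO(data), header=None, infer_nrows=200)

print(bad.iloc[-1, 0], good.iloc[-1, 0])   # 5 115
```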

PRs would be welcome!

@jreback jreback added this to the 0.20.0 milestone Jan 17, 2017
@keshavramaswamy
Contributor

I am on it :)

@yakovkeselman

Ran into the same issue. Had to sort the file to put the larger numbers on top.

I'd suggest, at least for numeric columns, not inferring the left boundary of the first column but starting it at 0.

I'd rather see a very conservative approach where nothing gets stripped/truncated initially. Rather, we can trim strings later on, if needed.

Thanks!

@jreback
Contributor

jreback commented Mar 20, 2017

@yakovkeselman contributions welcome! This is pretty straightforward to actually implement.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@rdmontgomery
Contributor

I just submitted a PR to address this issue. I left out the infer_nrows='all' option because I couldn't find an easy/efficient manner of knowing the number of rows in the file. I could have read in the whole file and calculated it, but I wasn't sure how that would react to partial reads or partial buffers. Any ideas on that would be welcome. Thanks!
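For seekable handles, one conservative way to count rows without disturbing the subsequent parse is to save and restore the stream position (a sketch only, not what the PR does; as noted above, non-seekable streams such as pipes would instead need the read lines buffered and replayed):

```python
import io


def count_rows(buf):
    """Count lines in a seekable text buffer, restoring its position.

    A sketch only: non-seekable streams (pipes, sockets) have no tell()
    to return to, so their lines would have to be buffered and replayed.
    """
    pos = buf.tell()
    n = sum(1 for _ in buf)
    buf.seek(pos)
    return n


buf = io.StringIO("a\nb\nc\n")
print(count_rows(buf))   # 3
print(buf.read()[:1])    # a -- the buffer is back at the start
```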

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 11, 2018

5 participants