Skip to content

Auto-detect field widths in read_fwf when unspecified #4488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Aug 6, 2013 · 14 comments · Fixed by #4955
Closed

Auto-detect field widths in read_fwf when unspecified #4488

ghost opened this issue Aug 6, 2013 · 14 comments · Fixed by #4955
Labels
Enhancement IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label
Milestone

Comments

@ghost
Copy link

ghost commented Aug 6, 2013

Here's a quick hack for generating the widths arg.
It's basic, but it works for me.

def detect_fwf_widths(rows,wsratio=0.5,nlines=100):
    from collections import Counter # 2.6, is this in pd.compat?

    # could do a p-value thing instead, but this works fine in practice
    nlines = min(len(rows),nlines) 

    # columwise, get occurence counts, use threshold to prune candidates
    # for "blank".
    counts = [{k:v for k,v in d.iteritems()  if v > nlines*wsratio} 
              for d in map(Counter,zip(*rows))]

    # among candidates, find the most common and choose it as "blank"
    # then extract only the cols where "blank" appears "frequently".
    c=Counter()
    for d in counts: 
        c.update(d)
    delim=c.most_common(1)[0][0]
    cols = [i for i,d in enumerate(counts) if delim in d]

    # filter the col indexes to keep only the rightmost index of
    # each contiguous sequence
    width_list =[]
    lidx=0
    prev = cols[0]
    for v in cols[1:]:
        if v-prev != 1:
            width_list.append(prev+1-lidx)
            lidx=prev+1
        prev = v
    width_list.append(len(rows[0])-lidx)

    return width_list

any takers to flesh this out into a PR with tests?

@alefnula
Copy link
Contributor

I could take this, if that's OK? I just need a few pointers:

  1. Should this be used in the read_fwf function before passing the filepath_or_buffer to the _read or in the FixedWidthReader's __init__ function?
  2. I assume that this should read the first nlines and then try to determine the widths before parsing. But what about the non-seekable buffers? If the buffer passed in is just a generator and doesn't support seeking back to the beginning? I assume that FixedWidthReader is a better place for this since it can buffer the lines it used for determining the widths and then just yield them in it's next.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

your point 2 is correct, here is a fix for that (and maybe you should just do this in all cases)

don't start with the buffer/path itself, but wrap it in a StringIO like class that can handle this type of bufferring,
e.g. its essentially a pass-thru if you are not talking to a generator, otherwise its a seekable buffer

kind of like the Reader class you created in that generator question on SO; except it also would have a reset method (prob a more consistent name that is compat with StringIO though), that seeks to the beginning and buffers what it already read first to you before resuming getting from the generator

make sense?

@alefnula
Copy link
Contributor

@jreback Yes. I would just avoid buffering the whole data, just the last nlines max. Or even completely skip it if no detection is needed.

Also, is adding nlines=100 parameter to the read_fwf OK?

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

is that for the 'checking' part? if so yes,....but I would think you only need even a few lines (e.g. no header, and use didn't use skiprows, or if they did you skipped them), and I think you have to assume that the format is the same, yes? it IS fixed width (but then I guess you have to 'validate' it)...maybe raise if you find discrepancies?

@alefnula
Copy link
Contributor

Yes just for inferring the width of columns. There are a lot of things that should be taken into account, as you mentioned (lineterminator, skiprows, comment and nrows if I'm correct). I'll try to cover all of them in test cases.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

@alefnula I realize you are right, you prob do need more lines in order to 'figure out' where the proper terminations are....because the fields can be ragged within the 'unknown' widths

@alefnula
Copy link
Contributor

Think I found a simpler solution based on the minimal starting point of the column: detect_fwf_widths. I just have to work out the edge cases and implement the buffering.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@alefnula gr8....prob will need a number of tests for this....what API are you proposing for this?

maybe widths='detect'? (and make it the default?)

@alefnula
Copy link
Contributor

@jreback I added a detect_from_rows parameter.

detect_from_rows : number or list of rows that will be inspected in
    order to infer the column widths, if neither the 'colspecs' nor
    'widths' are specified (default=100).

I wanted to enable user to say either:

  1. Detect from first 20 rows: detect_from_rows=20
  2. Detect from rows 0, 4, 7 and 8: detect_from_rows=[0, 4, 7, 8]

If this just complicates things I can use just a simple widths='detect' and default to 100 rows. That would be even simpler to implement.

@alefnula
Copy link
Contributor

Also do you have an idea why did I get this error from Travis for every python interpreter except 2.7?!

IOError: [Errno 2] No such file or directory:
'/home/travis/virtualenv/python2.6_with_system_site_packages/lib/python2.6/site-packages/pandas-0.0.0-py2.6-linux-x86_64.egg/pandas/io/tests/data/test1.fwf'

The file is in the repository.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

You need to edit setup.py to make it copy that extension to the correct place

e.g. tests/data/*.fwf

# The build cache system does string matching below this point.
# if you change something, be careful.

setup(name=DISTNAME,
      version=FULLVERSION,
      maintainer=AUTHOR,
      packages=['pandas',
                'pandas.compat',
                'pandas.computation',
                'pandas.computation.tests',
                'pandas.core',
                'pandas.io',
                'pandas.rpy',
                'pandas.sandbox',
                'pandas.sparse',
                'pandas.sparse.tests',
                'pandas.stats',
                'pandas.util',
                'pandas.tests',
                'pandas.tools',
                'pandas.tools.tests',
                'pandas.tseries',
                'pandas.tseries.tests',
                'pandas.io.tests',
                'pandas.io.tests.test_json',
                'pandas.stats.tests',
                ],
      package_data={'pandas.io': ['tests/data/legacy_hdf/*.h5',
                                  'tests/data/legacy_pickle/0.10.1/*.pickle',
                                  'tests/data/legacy_pickle/0.11.0/*.pickle',
                                  'tests/data/legacy_pickle/0.12.0/*.pickle',
                                  'tests/data/*.csv',
                                  'tests/data/*.dta',
                                  'tests/data/*.txt',
                                  'tests/data/*.xls',
                                  'tests/data/*.xlsx',
                                  'tests/data/*.table',
                                  'tests/data/*.html',
                                  'tests/test_json/data/*.json'],
                    'pandas.tools': ['tests/*.csv'],
                    'pandas.tests': ['data/*.pickle',
                                     'data/*.csv'],
                    'pandas.tseries.tests': ['data/*.pickle',
                                             'data/*.csv']
                    },

@alefnula
Copy link
Contributor

Ok i think I found an optimal algorithm that uses a bitmask to detect the gaps between columns. detect_fwf_widths.

Implemented a simple buffering (I think it couldn't be simpler, but if you want me to change it I'll do whatever you want) get_rows and next. And also the tests are here. Some of the data set's are pretty messy but it can detect the columns correctly: example.

If you have more ideas about test files please post them. Also if the naming conventions are OK I'll change the documentation. If they are not, just tell me what to change and I'll do it.

@jtratner
Copy link
Contributor

Just to check: would it still work if you had 1000 columns? 10K? or even
just a few columns, but with very long names.

@jtratner
Copy link
Contributor

@alefnula, can you open a PR to discuss? Easier to see diffs there. (and you can always change it up by pushing / editing commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants