Auto-detect field widths in read_fwf when unspecified #4488

ghost · 2013-08-06T23:49:27Z

Here's a quick hack for generating the widths arg.
It's basic, but it works for me.

def detect_fwf_widths(rows,wsratio=0.5,nlines=100):
    from collections import Counter # 2.6, is this in pd.compat?

    # could do a p-value thing instead, but this works fine in practice
    nlines = min(len(rows),nlines) 

    # columwise, get occurence counts, use threshold to prune candidates
    # for "blank".
    counts = [{k:v for k,v in d.iteritems()  if v > nlines*wsratio} 
              for d in map(Counter,zip(*rows))]

    # among candidates, find the most common and choose it as "blank"
    # then extract only the cols where "blank" appears "frequently".
    c=Counter()
    for d in counts: 
        c.update(d)
    delim=c.most_common(1)[0][0]
    cols = [i for i,d in enumerate(counts) if delim in d]

    # filter the col indexes to keep only the rightmost index of
    # each contiguous sequence
    width_list =[]
    lidx=0
    prev = cols[0]
    for v in cols[1:]:
        if v-prev != 1:
            width_list.append(prev+1-lidx)
            lidx=prev+1
        prev = v
    width_list.append(len(rows[0])-lidx)

    return width_list

any takers to flesh this out into a PR with tests?

The text was updated successfully, but these errors were encountered:

alefnula · 2013-09-21T13:03:08Z

I could take this, if that's OK? I just need a few pointers:

Should this be used in the read_fwf function before passing the filepath_or_buffer to the _read or in the FixedWidthReader's __init__ function?
I assume that this should read the first nlines and then try to determine the widths before parsing. But what about the non-seekable buffers? If the buffer passed in is just a generator and doesn't support seeking back to the beginning? I assume that FixedWidthReader is a better place for this since it can buffer the lines it used for determining the widths and then just yield them in it's next.

jreback · 2013-09-21T13:09:18Z

your point 2 is correct, here is a fix for that (and maybe you should just do this in all cases)

don't start with the buffer/path itself, but wrap it in a StringIO like class that can handle this type of bufferring,
e.g. its essentially a pass-thru if you are not talking to a generator, otherwise its a seekable buffer

kind of like the Reader class you created in that generator question on SO; except it also would have a reset method (prob a more consistent name that is compat with StringIO though), that seeks to the beginning and buffers what it already read first to you before resuming getting from the generator

make sense?

alefnula · 2013-09-21T13:27:26Z

@jreback Yes. I would just avoid buffering the whole data, just the last nlines max. Or even completely skip it if no detection is needed.

Also, is adding nlines=100 parameter to the read_fwf OK?

jreback · 2013-09-21T13:31:02Z

is that for the 'checking' part? if so yes,....but I would think you only need even a few lines (e.g. no header, and use didn't use skiprows, or if they did you skipped them), and I think you have to assume that the format is the same, yes? it IS fixed width (but then I guess you have to 'validate' it)...maybe raise if you find discrepancies?

alefnula · 2013-09-21T13:43:29Z

Yes just for inferring the width of columns. There are a lot of things that should be taken into account, as you mentioned (lineterminator, skiprows, comment and nrows if I'm correct). I'll try to cover all of them in test cases.

jreback · 2013-09-21T16:28:57Z

@alefnula I realize you are right, you prob do need more lines in order to 'figure out' where the proper terminations are....because the fields can be ragged within the 'unknown' widths

alefnula · 2013-09-22T13:32:47Z

Think I found a simpler solution based on the minimal starting point of the column: detect_fwf_widths. I just have to work out the edge cases and implement the buffering.

jreback · 2013-09-22T13:34:27Z

@alefnula gr8....prob will need a number of tests for this....what API are you proposing for this?

maybe widths='detect'? (and make it the default?)

alefnula · 2013-09-22T13:44:06Z

@jreback I added a detect_from_rows parameter.

detect_from_rows : number or list of rows that will be inspected in
    order to infer the column widths, if neither the 'colspecs' nor
    'widths' are specified (default=100).

I wanted to enable user to say either:

Detect from first 20 rows: detect_from_rows=20
Detect from rows 0, 4, 7 and 8: detect_from_rows=[0, 4, 7, 8]

If this just complicates things I can use just a simple widths='detect' and default to 100 rows. That would be even simpler to implement.

alefnula · 2013-09-22T13:50:05Z

Also do you have an idea why did I get this error from Travis for every python interpreter except 2.7?!

IOError: [Errno 2] No such file or directory:
'/home/travis/virtualenv/python2.6_with_system_site_packages/lib/python2.6/site-packages/pandas-0.0.0-py2.6-linux-x86_64.egg/pandas/io/tests/data/test1.fwf'

The file is in the repository.

jreback · 2013-09-22T13:55:42Z

You need to edit setup.py to make it copy that extension to the correct place

e.g. tests/data/*.fwf

# The build cache system does string matching below this point.
# if you change something, be careful.

setup(name=DISTNAME,
      version=FULLVERSION,
      maintainer=AUTHOR,
      packages=['pandas',
                'pandas.compat',
                'pandas.computation',
                'pandas.computation.tests',
                'pandas.core',
                'pandas.io',
                'pandas.rpy',
                'pandas.sandbox',
                'pandas.sparse',
                'pandas.sparse.tests',
                'pandas.stats',
                'pandas.util',
                'pandas.tests',
                'pandas.tools',
                'pandas.tools.tests',
                'pandas.tseries',
                'pandas.tseries.tests',
                'pandas.io.tests',
                'pandas.io.tests.test_json',
                'pandas.stats.tests',
                ],
      package_data={'pandas.io': ['tests/data/legacy_hdf/*.h5',
                                  'tests/data/legacy_pickle/0.10.1/*.pickle',
                                  'tests/data/legacy_pickle/0.11.0/*.pickle',
                                  'tests/data/legacy_pickle/0.12.0/*.pickle',
                                  'tests/data/*.csv',
                                  'tests/data/*.dta',
                                  'tests/data/*.txt',
                                  'tests/data/*.xls',
                                  'tests/data/*.xlsx',
                                  'tests/data/*.table',
                                  'tests/data/*.html',
                                  'tests/test_json/data/*.json'],
                    'pandas.tools': ['tests/*.csv'],
                    'pandas.tests': ['data/*.pickle',
                                     'data/*.csv'],
                    'pandas.tseries.tests': ['data/*.pickle',
                                             'data/*.csv']
                    },

alefnula · 2013-09-23T21:47:00Z

Ok i think I found an optimal algorithm that uses a bitmask to detect the gaps between columns. detect_fwf_widths.

Implemented a simple buffering (I think it couldn't be simpler, but if you want me to change it I'll do whatever you want) get_rows and next. And also the tests are here. Some of the data set's are pretty messy but it can detect the columns correctly: example.

If you have more ideas about test files please post them. Also if the naming conventions are OK I'll change the documentation. If they are not, just tell me what to change and I'll do it.

jtratner · 2013-09-23T21:51:22Z

Just to check: would it still work if you had 1000 columns? 10K? or even
just a few columns, but with very long names.

jtratner · 2013-09-23T21:54:02Z

@alefnula, can you open a PR to discuss? Easier to see diffs there. (and you can always change it up by pushing / editing commits.

jreback mentioned this issue Sep 21, 2013

ignore newline while reading files? #4917

Closed

alefnula mentioned this issue Sep 23, 2013

ENH: Added colspecs detection to read_fwf #4955

Merged

jreback closed this as completed in #4955 Sep 30, 2013

ghost mentioned this issue Sep 30, 2013

read_fwf colspecs autodetection could be more flexible #5056

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-detect field widths in read_fwf when unspecified #4488

Auto-detect field widths in read_fwf when unspecified #4488

ghost commented Aug 6, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 22, 2013

jreback commented Sep 22, 2013

alefnula commented Sep 22, 2013

alefnula commented Sep 22, 2013

jreback commented Sep 22, 2013

alefnula commented Sep 23, 2013

jtratner commented Sep 23, 2013

jtratner commented Sep 23, 2013

Auto-detect field widths in read_fwf when unspecified #4488

Auto-detect field widths in read_fwf when unspecified #4488

Comments

ghost commented Aug 6, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 21, 2013

jreback commented Sep 21, 2013

alefnula commented Sep 22, 2013

jreback commented Sep 22, 2013

alefnula commented Sep 22, 2013

alefnula commented Sep 22, 2013

jreback commented Sep 22, 2013

alefnula commented Sep 23, 2013

jtratner commented Sep 23, 2013

jtratner commented Sep 23, 2013