BUG: DataFrame.to_hdf doesn't pass along min_itemsize for index #10381

Closed
TomAugspurger opened this issue Jun 17, 2015 · 4 comments
Labels
Bug IO HDF5 read_hdf, HDFStore
Milestone

Comments

@TomAugspurger
Contributor

Unless I'm seeing something wrong, min_itemsize for the index is not applied:

In [21]: df = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5)).set_index("A")

In [22]: df.to_hdf('store.h5', 'test', format='table', min_itemsize={'index': 10})

In [23]: store = pd.HDFStore('store.h5')

In [24]: store.get_storer('test').table
Out[24]:
/test/table (Table(5,)) ''
  description := {
  "index": StringCol(itemsize=3, shape=(), dflt=b'', pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1)}   # <---- I think this should be 10
  byteorder := 'irrelevant'
  chunkshape := (10922,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

And FYI, this raises (not sure if it should work):

In [25]: df.index.name = 'theindex'

In [26]: df.to_hdf('store.h5', 'test2', format='table', min_itemsize={'theindex': 10})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

Just a report right now... no time.

@TomAugspurger TomAugspurger added Bug IO HDF5 read_hdf, HDFStore labels Jun 17, 2015
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 17, 2015
@TomAugspurger
Contributor Author

FYI the solution here works. Something like

store.append('test', df, min_itemsize={'index': 30})

so it should just be a matter of passing along arguments.
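For anyone reading along, here is a minimal sketch of that workaround. It assumes pandas with PyTables installed; the scratch path and the extra row are made up for illustration, and the column width is read through PyTables' coldtypes mapping rather than from the printed table description.

```python
import os
import tempfile

import pandas as pd

# Same frame as in the report: a 3-character string index and one string column.
df = pd.DataFrame({"A": "foo", "B": "bar"}, index=range(5)).set_index("A")

# Hypothetical scratch file, not the store.h5 from the report.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

with pd.HDFStore(path, mode="w") as store:
    # append() forwards min_itemsize, so the index column is created wider
    # than the 3 characters actually present in the data.
    store.append("test", df, min_itemsize={"index": 30})
    index_width = store.get_storer("test").table.coldtypes["index"].itemsize

    # A later row with a longer index label fits into the existing table.
    longer = pd.DataFrame(
        {"B": ["baz"]}, index=pd.Index(["a much longer label"], name="A")
    )
    store.append("test", longer)
    n_rows = len(store.select("test"))

print(index_width, n_rows)
```

Reserving the width up front is the point of min_itemsize: string columns in a table have a fixed size, so a later append with longer values would otherwise not fit.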

@jreback
Contributor

jreback commented Jun 17, 2015

I think we should maybe add a bit more to the docs here. This is like saying that the default for column A should be the same as B, which is not very explicit.

That said, it should work for min_itemsize=30 (i.e. a default for all object columns).
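A rough sketch of the integer form, assuming pandas with PyTables installed and a hypothetical scratch file. It passes a single integer and then lists the base item size of every column in the resulting table, so the effect can be inspected on whatever version is installed; it goes through append(), which the thread reports as honoring min_itemsize.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": "foo", "B": "bar"}, index=range(5)).set_index("A")

# Hypothetical scratch file used only for this check.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

with pd.HDFStore(path, mode="w") as store:
    # A plain integer acts as a default width for the string columns.
    store.append("test", df, min_itemsize=30)
    table = store.get_storer("test").table
    # .base strips the (1,) sub-array shape of values_block_0, leaving the
    # per-string item size.
    widths = {name: dtype.base.itemsize for name, dtype in table.coldtypes.items()}

print(widths)
```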

@toobaz
Member

toobaz commented Oct 22, 2015

Notice that if I do

ddf = pd.DataFrame([['a', 'b', 1],
                    ['a', 'b', 2]],
                    columns=['A', 'B', 'C']).set_index(['A', 'B'])

and then

ddf['C'].to_hdf('/tmp/store.hdf', 'test',
          format="table",
          min_itemsize={'index' : 3})

(as far as I understand, the suggested workaround), I still get the error.

@toobaz
Member

toobaz commented Dec 6, 2016

Just for the record: the bug doesn't have to do with to_hdf() specifically, but rather with storing in table format without (explicitly) appending:

store.put('key', df, format='table', min_itemsize={'index' : 10})

will fail the same way.

I'm pushing a PR in a few seconds.
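A small check for whether an installed version still shows the problem, written as a sketch that assumes pandas with PyTables installed. The keys via_to_hdf and via_put and the scratch path are made up for illustration. On a release that includes the fix referenced in this thread, both widths should come out as 10; on an affected version they would stay at 3.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": "foo", "B": "bar"}, index=range(5)).set_index("A")

# Hypothetical scratch file used only for this check.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

# Path 1: DataFrame.to_hdf in table format, without append.
df.to_hdf(path, key="via_to_hdf", format="table", min_itemsize={"index": 10})

with pd.HDFStore(path) as store:
    # Path 2: HDFStore.put in table format (key first, then the frame).
    store.put("via_put", df, format="table", min_itemsize={"index": 10})
    widths = {
        key: store.get_storer(key).table.coldtypes["index"].itemsize
        for key in ("via_to_hdf", "via_put")
    }

print(widths)
```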

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.2, Next Major Release Dec 11, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016
closes #10381

Author: Pietro Battiston

Closes #14812 from toobaz/to_hdf_min_itemsize and squashes the following commits:

c07f1e4 [Pietro Battiston] Whatsnew
38b8fcc [Pietro Battiston] Tests for previous commit
c838afa [Pietro Battiston] BUG: set min_itemsize even when there is no need to validate (#10381)

(cherry picked from commit e833096)
ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016
closes pandas-dev#10381
