
Commit c45e769

Merge pull request #3357 from jreback/hdf_fix
ENH: HDFStore now auto creates data_columns if they are specified in min_itemsize
2 parents 6e86975 + d2b2d13

6 files changed: +107 −58 lines changed


RELEASE.rst

Lines changed: 1 addition & 0 deletions
@@ -178,6 +178,7 @@ pandas 0.11.0
 
   - added the method ``select_column`` to select a single column from a table as a Series.
   - deprecated the ``unique`` method, can be replicated by ``select_column(key,column).unique()``
+  - ``min_itemsize`` parameter will now automatically create data_columns for passed keys
 
   - Downcast on pivot if possible (GH3283_), adds argument ``downcast`` to ``fillna``
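The release note above is the user-visible behaviour change of this commit. A minimal sketch of what it means in practice (not part of the commit; the file name and data are illustrative, and the ``where`` string syntax shown is the modern form rather than the ``Term`` objects used in pandas 0.11):

import pandas as pd

df = pd.DataFrame({'A': ['foo'] * 5, 'B': ['bar'] * 5})

with pd.HDFStore('store_sketch.h5', mode='w') as store:
    # listing 'A' in min_itemsize now auto-creates it as a data_column
    store.append('df', df, min_itemsize={'A': 200})
    print(store.get_storer('df').data_columns)    # ['A']

    # because 'A' is a data_column, it can be used in an on-disk query
    print(store.select('df', where='A == "foo"'))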

doc/source/cookbook.rst

Lines changed: 3 additions & 0 deletions
@@ -282,6 +282,9 @@ The :ref:`HDFStores <io.hdf5>` docs
 
  `Troubleshoot HDFStore exceptions
  <http://stackoverflow.com/questions/15488809/how-to-trouble-shoot-hdfstore-exception-cannot-find-the-correct-atom-type>`__
 
+ `Setting min_itemsize with strings
+ <http://stackoverflow.com/questions/15988871/hdfstore-appendstring-dataframe-fails-when-string-column-contents-are-longer>`__
+
  Storing Attributes to a group node
 
  .. ipython:: python

doc/source/io.rst

Lines changed: 29 additions & 19 deletions
@@ -1391,7 +1391,7 @@ of rows in an object.
  Multiple Table Queries
  ~~~~~~~~~~~~~~~~~~~~~~
 
- New in 0.10.1 are the methods ``append_to_multple`` and
+ New in 0.10.1 are the methods ``append_to_multiple`` and
  ``select_as_multiple``, that can perform appending/selecting from
  multiple tables at once. The idea is to have one table (call it the
  selector table) that you index most/all of the columns, and perform your
@@ -1535,24 +1535,6 @@ Notes & Caveats
    ``tables``. The sizes of a string based indexing column
    (e.g. *columns* or *minor_axis*) are determined as the maximum size
    of the elements in that axis or by passing the parameter
-   ``min_itemsize`` on the first table creation (``min_itemsize`` can
-   be an integer or a dict of column name to an integer). If
-   subsequent appends introduce elements in the indexing axis that are
-   larger than the supported indexer, an Exception will be raised
-   (otherwise you could have a silent truncation of these indexers,
-   leading to loss of information). Just to be clear, this fixed-width
-   restriction applies to **indexables** (the indexing columns) and
-   **string values** in a mixed_type table.
-
-   .. ipython:: python
-
-      store.append('wp_big_strings', wp, min_itemsize = { 'minor_axis' : 30 })
-      wp = wp.rename_axis(lambda x: x + '_big_strings', axis=2)
-      store.append('wp_big_strings', wp)
-      store.select('wp_big_strings')
-
-      # we have provided a minimum minor_axis indexable size
-      store.root.wp_big_strings.table
 
  DataTypes
  ~~~~~~~~~
@@ -1589,6 +1571,34 @@ conversion may not be necessary in future versions of pandas)
     df
     df.dtypes
 
+ String Columns
+ ~~~~~~~~~~~~~~
+
+ The underlying implementation of ``HDFStore`` uses a fixed column width (itemsize) for string columns. A string column's itemsize is calculated as the maximum length of the data (for that column) passed to the ``HDFStore`` **in the first append**. If a subsequent append introduces a string for a column that is **larger** than the column can hold, an Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and allow a user-specified truncation to occur.
+
+ Pass ``min_itemsize`` on the first table creation to a priori specify the minimum length of a particular string column. ``min_itemsize`` can be an integer, or a dict mapping a column name to an integer. You can pass ``values`` as a key to allow all *indexables* or *data_columns* to have this min_itemsize.
+
+ Starting in 0.11, passing a ``min_itemsize`` dict will cause all passed columns to be created as *data_columns* automatically.
+
+ .. note::
+
+    If you are not passing any *data_columns*, then the ``min_itemsize`` will be the maximum of the length of any string passed.
+
+ .. ipython:: python
+
+    dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))
+    dfs
+
+    # A and B have a size of 30
+    store.append('dfs', dfs, min_itemsize = 30)
+    store.get_storer('dfs').table
+
+    # A is created as a data_column with a size of 30
+    # B's size is calculated
+    store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })
+    store.get_storer('dfs2').table
+
  External Compatibility
  ~~~~~~~~~~~~~~~~~~~~~~
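The special ``values`` key mentioned in the new String Columns section above is not exercised by its ipython example, but the new test further down covers it. A small illustrative sketch mirroring that test (not from the commit; names and sizes are arbitrary): with ``data_columns=['B']`` and ``min_itemsize={'values': 200}``, both the data_column ``B`` and the remaining string block receive the minimum itemsize.

import pandas as pd

dfs = pd.DataFrame({'A': ['foo'] * 5, 'B': ['bar'] * 5})

with pd.HDFStore('values_sketch.h5', mode='w') as store:
    store.append('dfs3', dfs, data_columns=['B'], min_itemsize={'values': 200})
    desc = store.get_storer('dfs3').table.description
    # both the data_column 'B' and the leftover string block are sized to 200
    print(desc.B.itemsize)               # 200
    print(desc.values_block_0.itemsize)  # 200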

doc/source/v0.11.0.txt

Lines changed: 18 additions & 15 deletions
@@ -229,9 +229,11 @@ API changes
   - Added to_series() method to indicies, to facilitate the creation of indexers
     (GH3275_)
 
- - In ``HDFStore``, added the method ``select_column`` to select a single column from a table as a Series.
+ - ``HDFStore``
 
- - In ``HDFStore``, deprecated the ``unique`` method, can be replicated by ``select_column(key,column).unique()``
+   - added the method ``select_column`` to select a single column from a table as a Series.
+   - deprecated the ``unique`` method, can be replicated by ``select_column(key,column).unique()``
+   - ``min_itemsize`` parameter to ``append`` will now automatically create data_columns for passed keys
 
  Enhancements
  ~~~~~~~~~~~~

@@ -244,25 +246,26 @@ Enhancements
  - Bottleneck is now a :ref:`Recommended Dependencies <install.recommended_dependencies>`, to accelerate certain
    types of ``nan`` operations
 
- - For ``HDFStore``, support ``read_hdf/to_hdf`` API similar to ``read_csv/to_csv``
+ - ``HDFStore``
 
-   .. ipython:: python
+   - support ``read_hdf/to_hdf`` API similar to ``read_csv/to_csv``
 
-      df = DataFrame(dict(A=range(5), B=range(5)))
-      df.to_hdf('store.h5','table',append=True)
-      read_hdf('store.h5', 'table', where = ['index>2'])
+     .. ipython:: python
 
-   .. ipython:: python
-      :suppress:
-      :okexcept:
+        df = DataFrame(dict(A=range(5), B=range(5)))
+        df.to_hdf('store.h5','table',append=True)
+        read_hdf('store.h5', 'table', where = ['index>2'])
+
+     .. ipython:: python
+        :suppress:
+        :okexcept:
 
-      os.remove('store.h5')
+        os.remove('store.h5')
 
- - In ``HDFStore``, provide dotted attribute access to ``get`` from stores
-   (e.g. ``store.df == store['df']``)
+   - provide dotted attribute access to ``get`` from stores, e.g. ``store.df == store['df']``
 
- - In ``HDFStore``, new keywords ``iterator=boolean``, and ``chunksize=number_in_a_chunk`` are
-   provided to support iteration on ``select`` and ``select_as_multiple`` (GH3076_)
+   - new keywords ``iterator=boolean``, and ``chunksize=number_in_a_chunk`` are
+     provided to support iteration on ``select`` and ``select_as_multiple`` (GH3076_)
 
  - You can now select timestamps from an *unordered* timeseries similarly to an *ordered* timeseries (GH2437_)
 
pandas/io/pytables.py

Lines changed: 32 additions & 16 deletions
@@ -2181,7 +2181,7 @@ def validate_min_itemsize(self, min_itemsize):
             if k == 'values':
                 continue
             if k not in q:
-                raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
+                raise ValueError("min_itemsize has the key [%s] which is not an axis or data_column" % k)
 
     @property
     def indexables(self):
@@ -2293,6 +2293,30 @@ def get_object(self, obj):
         """ return the data for this obj """
         return obj
 
+    def validate_data_columns(self, data_columns, min_itemsize):
+        """ take the input data_columns and min_itemsize and create a data_columns spec """
+
+        if not len(self.non_index_axes):
+            return []
+
+        axis_labels = self.non_index_axes[0][1]
+
+        # evaluate the passed data_columns, True == use all columns
+        # take only valid axis labels
+        if data_columns is True:
+            data_columns = axis_labels
+        elif data_columns is None:
+            data_columns = []
+
+        # if min_itemsize is a dict, add the keys (exclude 'values')
+        if isinstance(min_itemsize, dict):
+
+            existing_data_columns = set(data_columns)
+            data_columns.extend([k for k in min_itemsize.keys() if k != 'values' and k not in existing_data_columns])
+
+        # return valid columns in the order of our axis
+        return [c for c in data_columns if c in axis_labels]
+
     def create_axes(self, axes, obj, validate=True, nan_rep=None, data_columns=None, min_itemsize=None, **kwargs):
         """ create and return the axes
         leagcy tables create an indexable column, indexable index, non-indexable fields
@@ -2380,26 +2404,18 @@ def create_axes(self, axes, obj, validate=True, nan_rep=None, data_columns=None,
         for a in self.non_index_axes:
             obj = obj.reindex_axis(a[1], axis=a[0], copy=False)
 
-        # get out blocks
+        # figure out data_columns and get out blocks
         block_obj = self.get_object(obj)
-        blocks = None
-
-        if data_columns is not None and len(self.non_index_axes):
-            axis = self.non_index_axes[0][0]
-            axis_labels = self.non_index_axes[0][1]
-            if data_columns is True:
-                data_columns = axis_labels
-
-            data_columns = [c for c in data_columns if c in axis_labels]
+        blocks = block_obj._data.blocks
+        if len(self.non_index_axes):
+            axis, axis_labels = self.non_index_axes[0]
+            data_columns = self.validate_data_columns(data_columns, min_itemsize)
             if len(data_columns):
                 blocks = block_obj.reindex_axis(Index(axis_labels) - Index(
-                    data_columns), axis=axis, copy=False)._data.blocks
+                    data_columns), axis=axis, copy=False)._data.blocks
                 for c in data_columns:
                     blocks.extend(block_obj.reindex_axis(
-                        [c], axis=axis, copy=False)._data.blocks)
-
-        if blocks is None:
-            blocks = block_obj._data.blocks
+                        [c], axis=axis, copy=False)._data.blocks)
 
         # add my values
         self.values_axes = []
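To summarize the ``create_axes`` change above: the data_columns spec is now derived in one place, and every key of a ``min_itemsize`` dict (other than the special ``'values'`` key) is promoted to a data_column. A standalone sketch of that merging logic (plain Python, not the library code; ``axis_labels`` stands in for the table's non-index axis labels):

def merge_data_columns(data_columns, min_itemsize, axis_labels):
    # True means "make every column a data_column"; None means none were requested
    if data_columns is True:
        data_columns = list(axis_labels)
    elif data_columns is None:
        data_columns = []

    # keys of a min_itemsize dict (except 'values') are promoted to data_columns
    if isinstance(min_itemsize, dict):
        existing = set(data_columns)
        data_columns.extend(k for k in min_itemsize if k != 'values' and k not in existing)

    # keep only labels that actually exist on the axis
    return [c for c in data_columns if c in axis_labels]

print(merge_data_columns(None, {'A': 200}, ['A', 'B']))        # ['A']
print(merge_data_columns(['B'], {'A': 200}, ['A', 'B']))       # ['B', 'A']
print(merge_data_columns(['B'], {'values': 200}, ['A', 'B']))  # ['B']

The three calls mirror the expectations asserted in the new test cases in test_pytables.py below.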

pandas/io/tests/test_pytables.py

Lines changed: 24 additions & 8 deletions
@@ -694,25 +694,41 @@ def check_col(key,name,size):
 
         with ensure_clean(self.path) as store:
 
-            # infer the .typ on subsequent appends
+            def check_col(key,name,size):
+                self.assert_(getattr(store.get_storer(key).table.description,name).itemsize == size)
+
             df = DataFrame(dict(A = 'foo', B = 'bar'),index=range(10))
+
+            # a min_itemsize that creates a data_column
+            store.remove('df')
+            store.append('df', df, min_itemsize={'A' : 200 })
+            check_col('df', 'A', 200)
+            self.assert_(store.get_storer('df').data_columns == ['A'])
+
+            # a min_itemsize that creates a second data_column
+            store.remove('df')
+            store.append('df', df, data_columns = ['B'], min_itemsize={'A' : 200 })
+            check_col('df', 'A', 200)
+            self.assert_(store.get_storer('df').data_columns == ['B','A'])
+
+            # a min_itemsize with the special 'values' key sizes the data_column and the values block
+            store.remove('df')
+            store.append('df', df, data_columns = ['B'], min_itemsize={'values' : 200 })
+            check_col('df', 'B', 200)
+            check_col('df', 'values_block_0', 200)
+            self.assert_(store.get_storer('df').data_columns == ['B'])
+
+            # infer the .typ on subsequent appends
             store.remove('df')
             store.append('df', df[:5], min_itemsize=200)
             store.append('df', df[5:], min_itemsize=200)
             tm.assert_frame_equal(store['df'], df)
 
             # invalid min_itemsize keys
-
             df = DataFrame(['foo','foo','foo','barh','barh','barh'],columns=['A'])
-
             store.remove('df')
             self.assertRaises(ValueError, store.append, 'df', df, min_itemsize={'foo' : 20, 'foobar' : 20})
 
-            # invalid sizes
-            store.remove('df')
-            store.append('df', df[:3], min_itemsize=3)
-            self.assertRaises(ValueError, store.append, 'df', df[3:])
-
     def test_append_with_data_columns(self):
 
         with ensure_clean(self.path) as store:
