Commit 57f103a

Merge pull request #3949 from jreback/hdf_iterator
ENH: enable support for iterator with read_hdf in HDFStore (GH3937)
2 parents 95ca455 + 5758cc8 commit 57f103a

File tree: 5 files changed (+140 −51 lines)

RELEASE.rst

+1

@@ -101,6 +101,7 @@ pandas 0.11.1
     to select with a Storer; these are invalid parameters at this time
   - can now specify an ``encoding`` option to ``append/put``
     to enable alternate encodings (GH3750_)
+  - enable support for ``iterator/chunksize`` with ``read_hdf``
   - The repr() for (Multi)Index now obeys display.max_seq_items rather
    than numpy threshold print options. (GH3426_, GH3466_)
   - Added mangle_dupe_cols option to read_table/csv, allowing users

doc/source/io.rst

+12

@@ -1925,6 +1925,18 @@ The default is 50,000 rows returned in a chunk.
     for df in store.select('df', chunksize=3):
         print df
 
+.. note::
+
+   .. versionadded:: 0.11.1
+
+   You can also use the iterator with ``read_hdf``, which will open, then
+   automatically close the store when finished iterating.
+
+   .. code-block:: python
+
+      for df in read_hdf('store.h5', 'df', chunksize=3):
+          print df
+
 Note that the chunksize keyword applies to the **returned** rows. So if you
 are doing a query, then that set will be subdivided and returned in the
 iterator. Keep in mind that if you do not pass a ``where`` selection criteria
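
To make the documented behavior concrete, here is a minimal, self-contained sketch of the new usage. The file name 'store.h5', the key 'df', and the example data are illustrative; it uses the 0.11.1-era ``table=True`` keyword shown elsewhere in this commit, since the iterator only works on table-format stores.

# A minimal sketch of the usage documented above; names and data are
# illustrative, not part of the commit.
import numpy as np
from pandas import DataFrame, read_hdf

df = DataFrame(np.random.randn(10, 2), columns=['A', 'B'])
# write in the queryable *table* format that iteration requires
df.to_hdf('store.h5', 'df', table=True)

# read_hdf opens the store, yields 3-row chunks, and closes the store
# automatically once iteration completes
for chunk in read_hdf('store.h5', 'df', chunksize=3):
    print(chunk)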

doc/source/v0.11.1.txt

+58 −34

@@ -6,6 +6,11 @@ v0.11.1 (June ??, 2013)
 This is a minor release from 0.11.0 and includes several new features and
 enhancements along with a large number of bug fixes.
 
+Highlights include a consistent I/O API naming scheme, routines to read html,
+write multi-indexes to csv files, read & write STATA data files, read & write JSON format
+files, Python 3 support for ``HDFStore``, filtering of groupby expressions via ``filter``, and a
+revamped ``replace`` routine that accepts regular expressions.
+
 API changes
 ~~~~~~~~~~~
 
@@ -148,8 +153,8 @@ API changes
   ``bs4`` + ``html5lib`` when lxml fails to parse. a list of parsers to try
   until success is also valid
 
-Enhancements
-~~~~~~~~~~~~
+I/O Enhancements
+~~~~~~~~~~~~~~~~
 
 - ``pd.read_html()`` can now parse HTML strings, files or urls and return
   DataFrames, courtesy of @cpcloud. (GH3477_, GH3605_, GH3606_, GH3616_).
@@ -184,28 +189,6 @@ Enhancements
   accessible via ``read_json`` top-level function for reading,
   and ``to_json`` DataFrame method for writing, :ref:`See the docs<io.json>`
 
-- ``DataFrame.replace()`` now allows regular expressions on contained
-  ``Series`` with object dtype. See the examples section in the regular docs
-  :ref:`Replacing via String Expression <missing_data.replace_expression>`
-
-  For example you can do
-
-  .. ipython :: python
-
-     df = DataFrame({'a': list('ab..'), 'b': [1, 2, 3, 4]})
-     df.replace(regex=r'\s*\.\s*', value=np.nan)
-
-  to replace all occurrences of the string ``'.'`` with zero or more
-  instances of surrounding whitespace with ``NaN``.
-
-  Regular string replacement still works as expected. For example, you can do
-
-  .. ipython :: python
-
-     df.replace('.', np.nan)
-
-  to replace all occurrences of the string ``'.'`` with ``NaN``.
-
 - Multi-index column support for reading and writing csv format files
 
 - The ``header`` option in ``read_csv`` now accepts a
@@ -225,19 +208,62 @@ Enhancements
   with ``df.to_csv(..., index=False``), then any ``names`` on the columns index will
   be *lost*.
 
+  .. ipython:: python
+
+     from pandas.util.testing import makeCustomDataframe as mkdf
+     df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)
+     df.to_csv('mi.csv',tupleize_cols=False)
+     print open('mi.csv').read()
+     pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)
+
+  .. ipython:: python
+     :suppress:
+
+     import os
+     os.remove('mi.csv')
+
+- Support for ``HDFStore`` (via ``PyTables 3.0.0``) on Python3
+
+- Iterator support via ``read_hdf`` that automatically opens and closes the
+  store when iteration is finished. This is only for *tables*.
+
  .. ipython:: python
 
-     from pandas.util.testing import makeCustomDataframe as mkdf
-     df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)
-     df.to_csv('mi.csv',tupleize_cols=False)
-     print open('mi.csv').read()
-     pd.read_csv('mi.csv',header=[0,1,2,3],index_col=[0,1],tupleize_cols=False)
+     path = 'store_iterator.h5'
+     DataFrame(randn(10,2)).to_hdf(path,'df',table=True)
+     for df in read_hdf(path,'df', chunksize=3):
+         print df
 
  .. ipython:: python
-      :suppress:
+     :suppress:
 
-      import os
-      os.remove('mi.csv')
+     import os
+     os.remove(path)
+
+Other Enhancements
+~~~~~~~~~~~~~~~~~~
+
+- ``DataFrame.replace()`` now allows regular expressions on contained
+  ``Series`` with object dtype. See the examples section in the regular docs
+  :ref:`Replacing via String Expression <missing_data.replace_expression>`
+
+  For example you can do
+
+  .. ipython :: python
+
+     df = DataFrame({'a': list('ab..'), 'b': [1, 2, 3, 4]})
+     df.replace(regex=r'\s*\.\s*', value=np.nan)
+
+  to replace all occurrences of the string ``'.'`` with zero or more
+  instances of surrounding whitespace with ``NaN``.
+
+  Regular string replacement still works as expected. For example, you can do
+
+  .. ipython :: python
+
+     df.replace('.', np.nan)
+
+  to replace all occurrences of the string ``'.'`` with ``NaN``.
 
 - ``pd.melt()`` now accepts the optional parameters ``var_name`` and ``value_name``
   to specify custom column names of the returned DataFrame.
@@ -261,8 +287,6 @@ Enhancements
      pd.get_option('a.b')
      pd.get_option('b.c')
 
-- Support for ``HDFStore`` (via ``PyTables 3.0.0``) on Python3
-
 - The ``filter`` method for group objects returns a subset of the original
   object. Suppose we want to take only elements that belong to groups with a
   group sum greater than 2.
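
The release note above implies two calling modes, which the ``read_hdf`` change in the next file makes explicit: given a path string, ``read_hdf`` opens the store itself and passes ``auto_close=True`` so the iterator closes it when exhausted; given an already-open ``HDFStore``, the caller keeps control of open/close. A sketch under those assumptions (file and key names illustrative):

# Sketch of the two calling modes; names are illustrative.
import numpy as np
from pandas import DataFrame, HDFStore, read_hdf

path = 'store_iterator.h5'
DataFrame(np.random.randn(10, 2)).to_hdf(path, 'df', table=True)

# 1) path string: the store is opened for us and auto-closed
#    once the iterator is exhausted
for chunk in read_hdf(path, 'df', chunksize=3):
    print(chunk)

# 2) open store: iterate via select(); auto_close defaults to
#    False, so the caller closes the store explicitly
store = HDFStore(path)
for chunk in store.select('df', chunksize=3):
    print(chunk)
store.close()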

pandas/io/pytables.py

+46 −16

@@ -196,12 +196,27 @@ def to_hdf(path_or_buf, key, value, mode=None, complevel=None, complib=None, app
 
 def read_hdf(path_or_buf, key, **kwargs):
     """ read from the store, close it if we opened it """
-    f = lambda store: store.select(key, **kwargs)
+    f = lambda store, auto_close: store.select(key, auto_close=auto_close, **kwargs)
 
     if isinstance(path_or_buf, basestring):
-        with get_store(path_or_buf) as store:
-            return f(store)
-    f(path_or_buf)
+
+        # can't auto open/close if we are using an iterator
+        # so delegate to the iterator
+        store = HDFStore(path_or_buf)
+        try:
+            return f(store, True)
+        except:
+
+            # if there is an error, close the store
+            try:
+                store.close()
+            except:
+                pass
+
+            raise
+
+    # a passed store; user controls open/close
+    f(path_or_buf, False)
 
 class HDFStore(object):
     """
@@ -405,7 +420,7 @@ def get(self, key):
             raise KeyError('No object named %s in the file' % key)
         return self._read_group(group)
 
-    def select(self, key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs):
+    def select(self, key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False, **kwargs):
         """
         Retrieve pandas object stored in file, optionally based on where
         criteria
@@ -419,6 +434,7 @@ def select(self, key, where=None, start=None, stop=None, columns=None, iterator=
         columns : a list of columns that if not None, will limit the return columns
         iterator : boolean, return an iterator, default False
         chunksize : nrows to include in iteration, return an iterator
+        auto_close : boolean, should automatically close the store when finished, default is False
 
         """
         group = self.get_node(key)
@@ -434,9 +450,11 @@ def func(_start, _stop):
             return s.read(where=where, start=_start, stop=_stop, columns=columns, **kwargs)
 
         if iterator or chunksize is not None:
-            return TableIterator(func, nrows=s.nrows, start=start, stop=stop, chunksize=chunksize)
+            if not s.is_table:
+                raise TypeError("can only use an iterator or chunksize on a table")
+            return TableIterator(self, func, nrows=s.nrows, start=start, stop=stop, chunksize=chunksize, auto_close=auto_close)
 
-        return TableIterator(func, nrows=s.nrows, start=start, stop=stop).get_values()
+        return TableIterator(self, func, nrows=s.nrows, start=start, stop=stop, auto_close=auto_close).get_values()
 
     def select_as_coordinates(self, key, where=None, start=None, stop=None, **kwargs):
         """
@@ -473,7 +491,7 @@ def select_column(self, key, column, **kwargs):
         """
         return self.get_storer(key).read_column(column = column, **kwargs)
 
-    def select_as_multiple(self, keys, where=None, selector=None, columns=None, start=None, stop=None, iterator=False, chunksize=None, **kwargs):
+    def select_as_multiple(self, keys, where=None, selector=None, columns=None, start=None, stop=None, iterator=False, chunksize=None, auto_close=False, **kwargs):
         """ Retrieve pandas objects from multiple tables
 
         Parameters
@@ -541,9 +559,9 @@ def func(_start, _stop):
             return concat(objs, axis=axis, verify_integrity=True)
 
         if iterator or chunksize is not None:
-            return TableIterator(func, nrows=nrows, start=start, stop=stop, chunksize=chunksize)
+            return TableIterator(self, func, nrows=nrows, start=start, stop=stop, chunksize=chunksize, auto_close=auto_close)
 
-        return TableIterator(func, nrows=nrows, start=start, stop=stop).get_values()
+        return TableIterator(self, func, nrows=nrows, start=start, stop=stop, auto_close=auto_close).get_values()
 
 
     def put(self, key, value, table=None, append=False, **kwargs):
@@ -916,16 +934,20 @@ class TableIterator(object):
     Parameters
     ----------
 
-    func  : the function to get results
+    store : the reference store
+    func  : the function to get results
     nrows : the rows to iterate on
     start : the passed start value (default is None)
-    stop  : the passed stop value (default is None)
+    stop  : the passed stop value (default is None)
     chunksize : the passed chunking value (default is 50000)
+    auto_close : boolean, automatically close the store at the end of iteration,
+        default is False
    kwargs : the passed kwargs
    """
 
-    def __init__(self, func, nrows, start=None, stop=None, chunksize=None):
-        self.func = func
+    def __init__(self, store, func, nrows, start=None, stop=None, chunksize=None, auto_close=False):
+        self.store = store
+        self.func = func
         self.nrows = nrows or 0
         self.start = start or 0
 
@@ -937,6 +959,7 @@ def __init__(self, func, nrows, start=None, stop=None, chunksize=None):
             chunksize = 100000
 
         self.chunksize = chunksize
+        self.auto_close = auto_close
 
     def __iter__(self):
         current = self.start
@@ -950,9 +973,16 @@ def __iter__(self):
 
             yield v
 
+        self.close()
+
+    def close(self):
+        if self.auto_close:
+            self.store.close()
+
     def get_values(self):
-        return self.func(self.start, self.stop)
-
+        results = self.func(self.start, self.stop)
+        self.close()
+        return results
 
 class IndexCol(object):
     """ an index column description class

pandas/io/tests/test_pytables.py

+23 −1

@@ -2078,14 +2078,36 @@ def test_select_iterator(self):
             results = []
             for s in store.select('df',chunksize=100):
                 results.append(s)
+            self.assert_(len(results) == 5)
             result = concat(results)
             tm.assert_frame_equal(expected, result)
 
             results = []
             for s in store.select('df',chunksize=150):
                 results.append(s)
             result = concat(results)
-            tm.assert_frame_equal(expected, result)
+            tm.assert_frame_equal(result, expected)
+
+        with tm.ensure_clean(self.path) as path:
+
+            df = tm.makeTimeDataFrame(500)
+            df.to_hdf(path,'df_non_table')
+            self.assertRaises(TypeError, read_hdf, path,'df_non_table',chunksize=100)
+            self.assertRaises(TypeError, read_hdf, path,'df_non_table',iterator=True)
+
+        with tm.ensure_clean(self.path) as path:
+
+            df = tm.makeTimeDataFrame(500)
+            df.to_hdf(path,'df',table=True)
+
+            results = []
+            for x in read_hdf(path,'df',chunksize=100):
+                results.append(x)
+
+            self.assert_(len(results) == 5)
+            result = concat(results)
+            tm.assert_frame_equal(result, df)
+            tm.assert_frame_equal(result, read_hdf(path,'df'))
 
         # multiple
