
BUG: HDFStore fixes #2675


Merged: 10 commits, Jan 20, 2013
5 changes: 5 additions & 0 deletions RELEASE.rst
@@ -52,6 +52,7 @@ pandas 0.10.1
- added method ``unique`` to select the unique values in an indexable or data column
- added method ``copy`` to copy an existing store (and possibly upgrade)
- show the shape of the data on disk for non-table stores when printing the store
- added ability to read PyTables flavor tables (allows compatibility with other HDF5 systems)
- Add ``logx`` option to DataFrame/Series.plot (GH2327_, #2565)
- Support reading gzipped data from file-like object
- ``pivot_table`` aggfunc can be anything used in GroupBy.aggregate (GH2643_)
@@ -66,6 +67,8 @@ pandas 0.10.1
- correctly handle types passed to ``Term`` (e.g. ``index<1000``, when the index
is ``Int64``) (closes GH512_)
- handle Timestamp correctly in data_columns (closes GH2637_)
- ``contains`` correctly matches on non-natural names
- correctly store ``float32`` dtypes in tables (if no other float types are in the same table)
- Fix DataFrame.info bug with UTF8-encoded columns. (GH2576_)
- Fix DatetimeIndex handling of FixedOffset tz (GH2604_)
- More robust detection of being in IPython session for wide DataFrame
@@ -86,6 +89,7 @@ pandas 0.10.1
- refactored HDFStore to deal with non-table stores as objects; this will allow future enhancements
- removed keyword ``compression`` from ``put`` (replaced by keyword
``complib`` to be consistent across the library)
- warn with a ``PerformanceWarning`` if you are attempting to store types that will be pickled by PyTables

.. _GH512: https://github.com/pydata/pandas/issues/512
.. _GH1277: https://github.com/pydata/pandas/issues/1277
@@ -98,6 +102,7 @@ pandas 0.10.1
.. _GH2625: https://github.com/pydata/pandas/issues/2625
.. _GH2643: https://github.com/pydata/pandas/issues/2643
.. _GH2637: https://github.com/pydata/pandas/issues/2637
.. _GH2694: https://github.com/pydata/pandas/issues/2694

pandas 0.10.0
=============
36 changes: 30 additions & 6 deletions doc/source/io.rst
@@ -1211,7 +1211,7 @@ You can create/modify an index for a table with ``create_table_index`` after dat

Query via Data Columns
~~~~~~~~~~~~~~~~~~~~~~
You can designate (and index) certain columns that you want to be able to perform queries (other than the `indexable` columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query.
You can designate (and index) certain columns that you want to be able to perform queries on (other than the `indexable` columns, which you can always query). For instance, say you want to perform this common operation on-disk and return just the frame that matches this query. You can specify ``data_columns=True`` to force all columns to be ``data_columns``.

.. ipython:: python

@@ -1260,7 +1260,7 @@ To retrieve the *unique* values of an indexable or data column, use the method `

concat([ store.select('df_dc',c) for c in [ crit1, crit2 ] ])

**Table Object**
**Storer Object**

If you want to inspect the stored object, retrieve it via ``get_storer``. You could use this programmatically to, say, get the number of rows in an object, as in the sketch below.
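
A minimal sketch (assuming the ``df_dc`` table appended earlier in this section, and that the returned storer exposes an ``nrows`` attribute):

.. ipython:: python

# assumes 'df_dc' was appended to ``store`` above; nrows is the stored row count
store.get_storer('df_dc').nrows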

@@ -1363,17 +1363,40 @@ Notes & Caveats
# we have provided a minimum minor_axis indexable size
store.root.wp_big_strings.table

Compatibility
~~~~~~~~~~~~~
External Compatibility
~~~~~~~~~~~~~~~~~~~~~~

``HDFStore`` writes storer objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, ``HDFStore`` can read native ``PyTables`` format tables. It is possible to write an ``HDFStore`` object that can easily be imported into ``R`` using the ``rhdf5`` library. Create a table format store like this:

.. ipython:: python

store_export = HDFStore('export.h5')
store_export.append('df_dc',df_dc,data_columns=df_dc.columns)
store_export

.. ipython:: python
:suppress:

store_export.close()
import os
os.remove('export.h5')

Backwards Compatibility
~~~~~~~~~~~~~~~~~~~~~~~

Version 0.10.1 of ``HDFStore`` is backwards compatible for reading tables created in a prior version of pandas; however, query terms using the prior (undocumented) methodology are unsupported. ``HDFStore`` will issue a warning if you try to use a prior-version format file. To take advantage of the updates, you must read in the entire file and write it out again using the new format; the ``copy`` method does this for you. The group attribute ``pandas_version`` contains the version information. ``copy`` takes a number of options, please see the docstring.


.. ipython:: python
:suppress:

import os
legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')

.. ipython:: python

# a legacy store
import os
legacy_store = HDFStore('legacy_0.10.h5', 'r')
legacy_store = HDFStore(legacy_file_path,'r')
legacy_store

# copy (and return the new handle)
@@ -1397,6 +1420,7 @@ Performance
- You can pass ``chunksize=an integer`` to ``append``, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass ``expectedrows=an integer`` to the first ``append``, to set the TOTAL number of rows that ``PyTables`` will expect. This will optimize read/write performance.
- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
- A ``PerformanceWarning`` will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190> for more information and some solutions; a short sketch of a frame that triggers the warning follows.
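
The sketch below is illustrative only: the frame, column names, and file name are made up, and whether the warning is actually emitted depends on your PyTables setup. A column holding a mix of Python object types is the typical trigger, since PyTables must pickle it rather than store a native type.

.. ipython:: python

import numpy as np
from pandas import DataFrame, HDFStore

# hypothetical frame: the 'mixed' column holds several Python object types,
# so PyTables pickles it instead of storing a native (endemic) type
df_mixed = DataFrame({'a': np.random.randn(3), 'mixed': [1, 'foo', None]})
store_perf = HDFStore('perf_example.h5')
store_perf.put('df_mixed', df_mixed)   # may emit a PerformanceWarning
store_perf.close()

.. ipython:: python
:suppress:

import os
os.remove('perf_example.h5')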

Experimental
~~~~~~~~~~~~
3 changes: 3 additions & 0 deletions doc/source/v0.10.1.txt
@@ -119,12 +119,15 @@ Multi-table creation via ``append_to_multiple`` and selection via ``select_as_mu

**Enhancements**

- ``HDFStore`` now can read native PyTables table format tables
- You can pass ``nan_rep = 'my_nan_rep'`` to ``append``, to change the default nan representation on disk (which converts to/from `np.nan`); this defaults to `nan`.
- You can pass ``index`` to ``append``. This defaults to ``True``. This will automagically create indices on the *indexables* and *data columns* of the table
- You can pass ``chunksize=an integer`` to ``append``, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass ``expectedrows=an integer`` to the first ``append``, to set the TOTAL number of rows that ``PyTables`` will expect. This will optimize read/write performance.
- ``Select`` now supports passing ``start`` and ``stop`` to limit the selection space; see the sketch below.
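
An illustrative sketch (the ``store`` handle and the ``'df'`` table name are assumed to exist already; ``start`` and ``stop`` are row-number bounds on the on-disk selection):

.. ipython:: python

# hypothetical: read only the first five stored rows of the table 'df'
store.select('df', start=0, stop=5)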

**Bug Fixes**
- ``HDFStore`` tables can now store ``float32`` types correctly (cannot be mixed with ``float64`` however)

See the `full release notes
<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
2 changes: 1 addition & 1 deletion pandas/core/reshape.py
@@ -835,4 +835,4 @@ def block2d_to_blocknd(values, items, shape, labels, ref_items=None):
def factor_indexer(shape, labels):
""" given a tuple of shape and a list of Factor lables, return the expanded label indexer """
mult = np.array(shape)[::-1].cumprod()[::-1]
return np.sum(np.array(labels).T * np.append(mult, [1]), axis=1).T
return com._ensure_platform_int(np.sum(np.array(labels).T * np.append(mult, [1]), axis=1).T)