Skip to content

Fixed HDFSTore.groups() performance. #21543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 21, 2018

Conversation

spott
Copy link
Contributor

@spott spott commented Jun 19, 2018

No longer walks every node, but rather every group.

@spott
Copy link
Contributor Author

spott commented Jun 19, 2018

I'm not sure if this deserves it's own 'whatsnew' entry... it is pretty simple.

@jreback
Copy link
Contributor

jreback commented Jun 19, 2018

I believe we have an asv for this, can you run. and yes this does need a whatsnew, you can put in 0.23.2 perf.

@jreback jreback added Performance Memory or execution speed performance IO HDF5 read_hdf, HDFStore labels Jun 19, 2018
@spott
Copy link
Contributor Author

spott commented Jun 19, 2018

This is my first pull request for pandas. What is "asv"?

I've updated the whatsnew.

@codecov
Copy link

codecov bot commented Jun 19, 2018

Codecov Report

Merging #21543 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21543   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files         153      153           
  Lines       49563    49563           
=======================================
  Hits        45559    45559           
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <ø> (ø) ⬆️
#single 41.8% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/pytables.py 92.43% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 506935c...e9a02df. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jun 19, 2018

@spott
Copy link
Contributor Author

spott commented Jun 19, 2018

Thanks.

I ran the io.hdf.* benchmarks and got the following:

       before           after         ratio
     [49188296]       [38b70fcf]
+     7.02±0.07ms         25.1±1ms     3.57  io.hdf.HDFStoreDataFrame.time_query_store_table
+      28.9±0.4ms           50.7ms     1.75  io.hdf.HDFStoreDataFrame.time_read_store_mixed
+      16.3±0.4ms           28.3ms     1.74  io.hdf.HDFStoreDataFrame.time_query_store_table_wide
+     25.5±0.01ms           38.0ms     1.49  io.hdf.HDFStoreDataFrame.time_read_store
+        68.1±2ms           98.1ms     1.44  io.hdf.HDFStoreDataFrame.time_read_store_table_mixed
+        41.8±1ms         57.5±7ms     1.37  io.hdf.HDFStoreDataFrame.time_write_store_mixed
-          22.5ms           19.5ms     0.86  io.hdf.HDFStoreDataFrame.time_read_store_table_wide
-          29.8ms         25.1±1ms     0.84  io.hdf.HDFStoreDataFrame.time_write_store
-          10.4μs       8.63±0.9μs     0.83  io.hdf.HDFStoreDataFrame.time_store_repr
-      6.04±0.1ms       4.61±0.2ms     0.76  io.hdf.HDFStoreDataFrame.time_store_info

Which is obviously an issue.

Because this doesn't make sense (some of those tests don't even appear to be touching the part of the code that I changed...), I ran it again:

       before           after         ratio
     [49188296]       [38b70fcf]
+      28.2±0.3ms         41.0±1ms     1.45  io.hdf.HDFStoreDataFrame.time_write_store_mixed
+         200±3ms            284ms     1.42  io.hdf.HDFStoreDataFrame.time_write_store_table_dc
+        44.5±2ms         60.7±2ms     1.36  io.hdf.HDFStorePanel.time_read_store_table_panel
+     5.84±0.03μs       7.64±0.3μs     1.31  io.hdf.HDFStoreDataFrame.time_store_repr
+        49.3±3ms           58.0ms     1.18  io.hdf.HDFStoreDataFrame.time_write_store_table
+      70.6±0.4ms         80.8±2ms     1.15  io.hdf.HDFStoreDataFrame.time_write_store_table_mixed
-      28.4±0.2ms       24.4±0.2ms     0.86  io.hdf.HDFStoreDataFrame.time_read_store
-         159±2ms         136±10ms     0.85  io.hdf.HDF.time_read_hdf('table')
-         152±4ms          128±5ms     0.84  io.hdf.HDF.time_write_hdf('table')
-      16.6±0.9ms       13.1±0.2ms     0.79  io.hdf.HDFStoreDataFrame.time_query_store_table_wide
-      5.10±0.4ms      3.04±0.08ms     0.60  io.hdf.HDFStoreDataFrame.time_store_info

And I get what looks to be a completely different set of tests that are different.

A third time gives me:

       before           after         ratio
     [49188296]       [38b70fcf]
+      27.6±0.1ms       36.1±0.7ms     1.31  io.hdf.HDFStoreDataFrame.time_read_store_mixed
+      19.1±0.3ms       22.5±0.7ms     1.18  io.hdf.HDFStoreDataFrame.time_read_store_table_wide
-      71.7±0.7ms       56.6±0.1ms     0.79  io.hdf.HDFStoreDataFrame.time_write_store_table_mixed
-     26.0±0.07ms       19.0±0.2ms     0.73  io.hdf.HDFStoreDataFrame.time_write_store
-        60.6±3ms         41.4±1ms     0.68  io.hdf.HDFStorePanel.time_read_store_table_panel
-        42.7±2ms       26.9±0.1ms     0.63  io.hdf.HDFStoreDataFrame.time_write_store_mixed

Which is at least a little better. I think this test is fairly sensitive to what else is going on on my computer. In order to get a more accurate result, I think I'll need to clean boot and run the tests without anything else running.

@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

yeah there don't appear to be any benchmarks for this (except for .info()) which is related. can you see if it makes sense to add any benchmarks here (i don't remember exactly what .info hits)

@@ -27,7 +27,9 @@ Performance Improvements
- Improved performance of membership checks in :class:`CategoricalIndex`
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
is likewise much faster (:issue:`21369`)
-
- Improved performance of :meth:`HDFStore.groups` (and dependent functions like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to say dependent functions, you can just list .keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of places that this hits (.keys, .info, .iterate, len, items, and those that call those functions.). Should I ignore them, list the ones that I know of, or just drop the "dependent functions" part and keep it simple?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on 2nd thought this is ok

@spott
Copy link
Contributor Author

spott commented Jun 20, 2018

.info() hits .keys(), so there is some benchmarking supported here.

This particular performance issue wasn't really a problem till I hit a couple hundred (thousand?) keys in a hdf file, so I don't know if we want to create a dataset to test this.

@jreback jreback added this to the 0.23.2 milestone Jun 21, 2018
@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

I rebased, so ping on green.

@@ -27,7 +27,9 @@ Performance Improvements
- Improved performance of membership checks in :class:`CategoricalIndex`
(i.e. ``x in ci``-style checks are much faster). :meth:`CategoricalIndex.contains`
is likewise much faster (:issue:`21369`)
-
- Improved performance of :meth:`HDFStore.groups` (and dependent functions like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on 2nd thought this is ok

@jreback jreback merged commit 1c0740c into pandas-dev:master Jun 21, 2018
@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

thanks @spott

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve HDFStore.groups performance
3 participants