BUG/PERF: Stata value labels #11591

kshedden · 2015-11-13T14:17:48Z

closes #12014

This PR fixes a minor bug and introduces some performance enhancements, all related to value label reading in Stata files.

The bug is that when reading a Stata file incrementally, the value labels will be read even when specifying convert_categoricals=False (this does not happen when reading the entire file at once).
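For reference, the incremental code path looks roughly like the sketch below. The toy frame, column names, and file name are made up for illustration; only `read_stata`, `to_stata`, `chunksize`, and `convert_categoricals` are real pandas API.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; the categorical column is written with value labels.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "grade": pd.Categorical(["low", "high", "low", "mid"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels.dta")
    df.to_stata(path, write_index=False)

    # Incremental read: passing chunksize returns a StataReader that yields
    # DataFrames. With convert_categoricals=False each chunk should carry the
    # raw integer codes, so the value labels are not needed.
    with pd.read_stata(path, chunksize=2,
                       convert_categoricals=False) as reader:
        chunks = [chunk for chunk in reader]

raw = pd.concat(chunks, ignore_index=True)
```

With the fix, this path should skip value label parsing just as a whole-file read with `convert_categoricals=False` does.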

The performance enhancements are:

1. Read key/value information as an ndarray using np.frombuffer rather than in a Python loop.
2. When splitting the value labels at the offsets, we currently pass txt[off[i]:] to _null_terminate, which creates many copies of large segments of a potentially large byte array. I changed it so that only the relevant part of the string is copied.
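Both ideas can be sketched on a synthetic buffer. This is not the code in the PR: the layout (int32 offsets, int32 values, then null-terminated label text), the little-endian byte order, and the helper name `split_labels` are assumptions made for the example.

```python
import numpy as np

# Synthetic value-label block (assumed layout, for illustration only).
labels = [b"low", b"mid", b"high"]
txt = b"".join(lab + b"\x00" for lab in labels)  # null-terminated, packed
raw_off = np.array([0, 4, 8], dtype="<i4").tobytes()  # start of each label
raw_val = np.array([1, 2, 3], dtype="<i4").tobytes()  # integer code per label

# Idea 1: decode all offsets/values in one call instead of a Python loop
# that unpacks one integer at a time.
off = np.frombuffer(raw_off, dtype="<i4", count=3)
val = np.frombuffer(raw_val, dtype="<i4", count=3)


def split_labels(txt, off, val):
    """Hypothetical helper: map each value to its label text."""
    order = np.argsort(off)  # process labels in offset order
    off, val = off[order], val[order]
    n = len(off)
    out = {}
    for i in range(n):
        # Idea 2: slice up to the next offset, so only this label's bytes
        # are copied rather than the whole tail txt[off[i]:].
        end = off[i + 1] if i < n - 1 else len(txt)
        piece = txt[off[i]:end]
        # Cut at the first null byte, if any, and decode.
        out[int(val[i])] = piece.split(b"\x00", 1)[0].decode("latin-1")
    return out


value_labels = split_labels(txt, off, val)
```

Slicing `txt[off[i]:]` copies everything from the offset to the end of the buffer on every iteration, which is quadratic in the size of the text block; the bounded slice keeps the total copying linear.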

Relating to 2, further performance improvements might be possible since there is no trailing null byte to remove except for the last element of txt (thus some of the work in _null_terminate is superfluous).

Background: This is an issue when processing large Stata files with millions of distinct value labels.

jreback · 2015-11-13T14:34:27Z

ok, sounds gr8!.

just confirm that we have reasonable benchmarks in asv (the packers.py file is the only one for stata).
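An asv-style benchmark for this case might look like the sketch below. The class name, sizes, and file name are hypothetical and this is not the existing packers.py code; the `setup` / `time_*` / `teardown` method names follow asv's conventions.

```python
import os
import shutil
import tempfile

import numpy as np
import pandas as pd


class StataValueLabels:
    """Hypothetical asv benchmark: chunked read of a file with many labels."""

    def setup(self):
        n = 10000
        self.tmpdir = tempfile.mkdtemp()
        self.path = os.path.join(self.tmpdir, "labels.dta")
        codes = np.arange(n) % 1000
        # 1000 distinct categories, each written as a Stata value label.
        df = pd.DataFrame(
            {"x": pd.Categorical(["label_%d" % c for c in codes])}
        )
        df.to_stata(self.path, write_index=False)

    def time_read_chunks_no_labels(self):
        # asv times methods whose names start with "time_".
        with pd.read_stata(self.path, chunksize=1000,
                           convert_categoricals=False) as reader:
            for _ in reader:
                pass

    def teardown(self):
        shutil.rmtree(self.tmpdir, ignore_errors=True)
```

Generating the file in `setup` keeps the benchmark self-contained, at the cost of only covering what the Stata writer can produce.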

kshedden · 2015-11-15T15:44:36Z

Is there a place to put data files for ASV to read in? The stata writer is more limited than the stata reader, so it's hard to test some features like strls that are not part of dta version 114.

jreback · 2015-11-15T16:45:58Z

you can just make a directory under the asv_benchmarks

jreback · 2015-11-25T15:39:28Z

@kshedden want to update

kshedden · 2015-11-26T15:35:01Z

I added a test but haven't been able to get asv to run it. I don't have conda so switched the asv conf file to use virtualenv. It fails when using pip to install pytables:

(py3)-bash-4.1$ pip install --upgrade pytables
Collecting pytables
  Could not find a version that satisfies the requirement pytables (from versions: )
No matching distribution found for pytables

I'm using Python 3.4.2rc1 (and also switched the asv conf to use Python 3.4).

jreback · 2015-11-27T13:26:22Z

asv can work with pip but needs a slightly modified config file (e.g. a different one), as the package is named tables on pip (note that conda accepts both names).

jreback · 2015-11-27T13:27:23Z

pls add a whatsnew note (in performance). you don't necessarily need to include an asv benchmark (though nice if it's easy), but post a perf comparison at the top of the issue.

pv · 2015-11-27T15:01:26Z

Re conda vs. virtualenv, this may be of interest: airspeed-velocity/asv#322 (comment) airspeed-velocity/asv#329

jreback · 2015-11-29T18:11:40Z

@pv that's a nice feature....care to do a PR to upgrade pandas/asv.conf.json

jreback · 2015-12-11T01:08:57Z

@kshedden can you update

jreback · 2016-01-02T23:15:33Z

@kshedden can you update

kshedden · 2016-01-09T21:12:53Z

Sorry for the delay... I wasn't able to test this in ASV. I updated whatsnew, not sure what else needs to be done.

jreback · 2016-01-10T00:18:05Z

doc/source/whatsnew/v0.18.0.txt

@@ -416,7 +416,7 @@ Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

 - Improved performance of ``andrews_curves`` (:issue:`11534`)
-
+- Improved performance of ``StataReader``


add this issue number here

jreback · 2016-01-10T00:19:55Z

looks good
can u run flake8 on the diff and fix any issue

jreback · 2016-01-11T13:14:19Z

doc/source/whatsnew/v0.18.0.txt

@@ -440,6 +439,7 @@ Bug Fixes
 - Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`)
 - Accept unicode in ``Timedelta`` constructor (:issue:`11995`)

+- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`)


this is also a perf enhancement, yes? pls add a line in Performance (use this PR number for it)

jreback · 2016-01-11T13:14:40Z

minor comments. pls squash. ping when green.


kshedden · 2016-01-12T01:18:08Z

@jreback should be good to go

jreback · 2016-01-13T13:41:59Z

merged via 449ab6b

thanks!

jreback added the Performance and IO Stata labels Nov 13, 2015

kshedden force-pushed the stata_value_label branch from 66f6382 to 8294d55 on January 9, 2016 20:08

jreback reviewed Jan 10, 2016

kshedden mentioned this pull request Jan 11, 2016

BUG: Stata value labels #12014

Closed

jreback added this to the 0.18.0 milestone Jan 11, 2016

jreback reviewed Jan 11, 2016

Initial commit for PR

098c805

added text to whatsnew Update whatsnew flake8 edits edited whatsnew

kshedden force-pushed the stata_value_label branch from a024f4b to 098c805 on January 12, 2016 00:29

jreback closed this Jan 13, 2016
