Skip to content

BUG/PERF: Stata value labels #11591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kshedden
Copy link
Contributor

closes #12014

This PR fixes a minor bug and introduces some performance enhancements, all related to value label reading in Stata files.

The bug is that when reading a Stata file incrementally, the value labels will be read even when specifying convert_categoricals=False (this does not happen when reading the entire file at once).

The performance enhancements are:

  1. Read key/value information as an ndarray using np.frombuffer rather than as a Python loop.
  2. When splitting the value labels at the offsets, we currently pass txt[off[i]:] to null_terminate, which creates many copies of large segments of a potentially large byte array. I changed it so that
    only the relevant part of the string is copied.

Relating to 2, further performance improvements might be possible since there is no trailing null byte to remove except for the last element of txt (thus some of the work in _null_terminate is superfluous).

Background: This is an issue when processing large Stata files with millions of distinct value labels.

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

ok, sounds gr8!.

just confirm that we have reasonable benchmarks in asv (the packers.py) file is the only one for stata.

@jreback jreback added Performance Memory or execution speed performance IO Stata read_stata, to_stata labels Nov 13, 2015
@kshedden
Copy link
Contributor Author

Is there a place to put data files for ASV to read in? The stata writer is more limited than the stata reader, so it's hard to test some features like strls that are not part of dta version 114.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2015

you can just make a directory under the asv_benchmarks

@jreback
Copy link
Contributor

jreback commented Nov 25, 2015

@kshedden want to update

@kshedden
Copy link
Contributor Author

I added a test but haven't been able to get asv to run it. I don't have conda so switched the asv conf file to use virtualenv. It fails when using pip to install pytables:

(py3)-bash-4.1$ pip install --upgrade pytables Collecting pytables Could not find a version that satisfies the requirement pytables (from versions: ) No matching distribution found for pytables

I'm using python 3.4.2rc1 (and also switched asv conf to use python 3.4)

@jreback
Copy link
Contributor

jreback commented Nov 27, 2015

asv can work with pip but needs a slighty modified config file (e.g. a different one) as pip uses tables (not that conda accepts both of these).

@jreback
Copy link
Contributor

jreback commented Nov 27, 2015

pls add a whatsnew note (in performance). you don't necessarily need to include an asv benchmark (though nice if its easy), but post a perf comparison at the top of the issue.

@pv
Copy link
Contributor

pv commented Nov 27, 2015

Re conda vs. virtualenv, this may be of interest: airspeed-velocity/asv#322 (comment) airspeed-velocity/asv#329

@jreback
Copy link
Contributor

jreback commented Nov 29, 2015

@pv that's a nice feature....care to do a PR to upgrade pandas/asv.conf.json

@jreback
Copy link
Contributor

jreback commented Dec 11, 2015

@kshedden can you update

1 similar comment
@jreback
Copy link
Contributor

jreback commented Jan 2, 2016

@kshedden can you update

@kshedden
Copy link
Contributor Author

kshedden commented Jan 9, 2016

Sorry for the delay... I wasn't able to test this in ASV. I updated whatsnew, not sure what else needs to be done.

@@ -416,7 +416,7 @@ Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of ``andrews_curves`` (:issue:`11534`)

- Improved performance of ``StataReader``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this issue number here

@jreback
Copy link
Contributor

jreback commented Jan 10, 2016

looks good
can u run flake8 on the diff and fix any issue

@jreback jreback added this to the 0.18.0 milestone Jan 11, 2016
@@ -440,6 +439,7 @@ Bug Fixes
- Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`)
- Accept unicode in ``Timedelta`` constructor (:issue:`11995`)

- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also a perf enhancement, yes? pls add a line in Performance (use this PR number for it)

@jreback
Copy link
Contributor

jreback commented Jan 11, 2016

minor comments. pls squash. ping when green.

added text to whatsnew

Update whatsnew

flake8 edits

edited whatsnew
@kshedden
Copy link
Contributor Author

@jreback should be good to go

@jreback
Copy link
Contributor

jreback commented Jan 13, 2016

merged via 449ab6b

thanks!

@jreback jreback closed this Jan 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Stata value labels
3 participants