-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/PERF: Stata value labels #11591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/PERF: Stata value labels #11591
Conversation
ok, sounds gr8!. just confirm that we have reasonable benchmarks in |
Is there a place to put data files for ASV to read in? The stata writer is more limited than the stata reader, so it's hard to test some features like strls that are not part of dta version 114. |
you can just make a directory under the |
@kshedden want to update |
I added a test but haven't been able to get asv to run it. I don't have conda so switched the asv conf file to use virtualenv. It fails when using pip to install pytables:
I'm using python 3.4.2rc1 (and also switched asv conf to use python 3.4) |
asv can work with pip but needs a slighty modified config file (e.g. a different one) as pip uses |
pls add a whatsnew note (in performance). you don't necessarily need to include an asv benchmark (though nice if its easy), but post a perf comparison at the top of the issue. |
Re conda vs. virtualenv, this may be of interest: airspeed-velocity/asv#322 (comment) airspeed-velocity/asv#329 |
@pv that's a nice feature....care to do a PR to upgrade |
@kshedden can you update |
1 similar comment
@kshedden can you update |
66f6382
to
8294d55
Compare
Sorry for the delay... I wasn't able to test this in ASV. I updated whatsnew, not sure what else needs to be done. |
@@ -416,7 +416,7 @@ Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- Improved performance of ``andrews_curves`` (:issue:`11534`) | |||
|
|||
- Improved performance of ``StataReader`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this issue number here
looks good |
@@ -440,6 +439,7 @@ Bug Fixes | |||
- Bug in consistency of passing nested dicts to ``.groupby(...).agg(...)`` (:issue:`9052`) | |||
- Accept unicode in ``Timedelta`` constructor (:issue:`11995`) | |||
|
|||
- Bug in value label reading for ``StataReader`` when reading incrementally (:issue:`12014`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also a perf enhancement, yes? pls add a line in Performance (use this PR number for it)
minor comments. pls squash. ping when green. |
added text to whatsnew Update whatsnew flake8 edits edited whatsnew
a024f4b
to
098c805
Compare
@jreback should be good to go |
merged via 449ab6b thanks! |
closes #12014
This PR fixes a minor bug and introduces some performance enhancements, all related to value label reading in Stata files.
The bug is that when reading a Stata file incrementally, the value labels will be read even when specifying convert_categoricals=False (this does not happen when reading the entire file at once).
The performance enhancements are:
only the relevant part of the string is copied.
Relating to 2, further performance improvements might be possible since there is no trailing null byte to remove except for the last element of
txt
(thus some of the work in_null_terminate
is superfluous).Background: This is an issue when processing large Stata files with millions of distinct value labels.