-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
FIX: Correct Categorical behavior in StataReader #8836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Correct Categorical behavior in StataReader #8836
Conversation
092dfa1
to
2a746a7
Compare
This changes the default to returning ordered series. The old version would return ordered series when the series was fully labeled (ordered by alphabetical order of labels) and an unordered series when the series was partially labeled (since the labels cannot be sorted in this case). With this change all series are ordered by the underlying data values, whether partially labeled or fully labeled. |
fd74c7e
to
33be1a7
Compare
Added test from #8816 |
33be1a7
to
5493a7a
Compare
This seems to work and passes both new tests as well as old tests. Also handles partially labeled, float, etc. Comments welcome. |
@jreback Green |
Sounds good, especially since my implementation caused problems with an existing unit test that was added recently. |
Looks great, thanks! @jreback Would it be feasible to add "drop_order" and "reverse_order" methods to pd.Categorical? For any survey dataset, one-size fits all usually won't work (regardless of the default) and the easiest thing will be to use what works best and then change some columns ex post. Why these two?
|
these already exist like so: (they would similarly directly on a Categorical; that's all these .cat ops are doing anyhow)
|
I suppose what is left to decide for this PR is whether:
|
@jreback: Thanks, that's sweet. On 18.11.14 13:30, Kevin Sheppard wrote:
The behaviour across 0.14 and 0.15 is inconsistent anyhow, if I am not
I would also suggest to document that and @jreback's elaboration here: I can take a stab at that if you want.
|
@bashtage we can just fix/change this (and document it in v0.15.2). This is very very new, so if it changes slightly no big deal. No deprecation cycle needed. as @hmgaudecker points out, more docs are good for the Stata read/write of Categoricals (but I would put it in io.rst and provide a linlk back to categorical data (as I did for HDF5 changes)). I like then the default use case will be to order, if their are special case issues, people can turn it off (and always order them post-import). |
@bashtage in the release note you can just reference this PR number |
ae24038
to
0aefc17
Compare
I implemented the Aside from my typo in the issue number which I have wrong in the release notes, I think it should be ready. |
0aefc17
to
8ec2fc0
Compare
@jreback Green, subject to review (esp for docs) |
.. versionadded:: 0.15.2 | ||
|
||
``Categorical`` data can be exported to *Stata* data files as value labeled data. | ||
The exported data is consists of the underlying category codes as integer data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'is consists' -> 'consists'
|
||
|
||
Similarly, labeled data can be imported from *Stata* data files as ``Categorical`` | ||
variables by setting ``convert_categoricals=True``. Imported ``Categorical`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add that the default value is True
@bashtage aside from @jorisvandenbossche comments looks fine to me. maybe explain the 'partial ordering' issue and how |
8ec2fc0
to
34e6afc
Compare
@jorisvandenbossche @jreback Many doc edits...rewrote some text that probably only made sense to me to be simpler. |
@@ -45,6 +45,7 @@ Enhancements | |||
- Added ability to export Categorical data to to/from HDF5 (:issue:`7621`). Queries work the same as if it was an object array. However, the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here <io.hdf5-categorical>` for an example and caveats w.r.t. prior versions of pandas. | |||
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`). | |||
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`. | |||
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a link here to the new docs
@bashtage minor doc ref needed in the release notes, otherwise I am ok with this. Looks good! (needed quite a few note/warning in the docs), but when dealing with a non-trivial type conversion that's how it is! thanks! |
34e6afc
to
6cf2e48
Compare
Ensure that category codes have the same order as the underlying Stata data. Also adds a flag that allows categorical data to be treated as ordered or unordered when importing.
FIX: Correct Categorical behavior in StataReader
@bashtage thanks! as always, pls review the built docs for typos / how they look and submit a followup pr if needed. |
Ensure that category codes have the same order as the underlying Stata data.
xref #8816