FIX: Correct Categorical behavior in StataReader #8836

Merged
merged 1 commit into from
Nov 19, 2014

Conversation

bashtage
Contributor

Ensure that category codes have the same order as the underlying Stata data.

xref #8816

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch from 092dfa1 to 2a746a7 Compare November 17, 2014 14:53
@bashtage
Contributor Author

This changes the default to returning ordered series. The old version would return ordered series when the series was fully labeled (ordered by alphabetical order of labels) and an unordered series when the series was partially labeled (since the labels cannot be sorted in this case).

With this change all series are ordered by the underlying data values, whether partially labeled or fully labeled.
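
As a rough illustration of what "ordered by the underlying data values" means for a partially labeled column, the sketch below builds an ordered Categorical by hand. It is not the StataReader code: the `value_labels` mapping and the `raw` data are hypothetical, and it assumes unlabeled values simply keep their numeric value as the category.

```python
import pandas as pd

# Hypothetical Stata value labels: only 1 and 5 carry a label.
value_labels = {1: "low", 5: "high"}
# Hypothetical underlying integer data for one column.
raw = [1, 3, 5, 2, 1]

# Order the distinct underlying values first, then swap in labels where
# they exist. The category order therefore follows the data values and
# not the alphabetical order of the labels.
values_in_order = sorted(set(raw))
categories = [value_labels.get(v, v) for v in values_in_order]

labeled = [value_labels.get(v, v) for v in raw]
cat = pd.Categorical(labeled, categories=categories, ordered=True)

print(list(cat.categories))
print(cat.ordered)
```

Sorting the labels themselves would not work here, since the labeled and unlabeled categories are of different types; sorting the underlying values sidesteps that.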

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch 3 times, most recently from fd74c7e to 33be1a7 Compare November 17, 2014 17:06
@bashtage
Contributor Author

Added test from #8816

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch from 33be1a7 to 5493a7a Compare November 17, 2014 19:17
@bashtage
Contributor Author

@PKEuS @hmgaudecker @jreback

This seems to work and passes both new tests as well as old tests. Also handles partially labeled, float, etc.

Comments welcome.

@bashtage
Contributor Author

@jreback Green

@PKEuS
Contributor

PKEuS commented Nov 17, 2014

Sounds good, especially since my implementation caused problems with an existing unit test that was added recently.

@hmgaudecker

Looks great, thanks!

@jreback Would it be feasible to add "drop_order" and "reverse_order" methods to pd.Categorical? For any survey dataset, one-size fits all usually won't work (regardless of the default) and the easiest thing will be to use what works best and then change some columns ex post.

Why these two?

  • drop_order() for unordered Categoricals like gender
  • reverse_order() because often the numbers used for an ordinal ranking will be in arbitrary order. E.g., the self-reported health example (excellent, very good, ..., poor) you may find in Stata coded as 5, 4, 3, 2, 1 or 1, 2, 3, 4, 5 -- whatever the data producer thought looked best in Stata output. When Pandas displays the latter coding, excellent < very good < ... < poor looks odd.

Both are probably just syntactic sugar, but they'd be much easier to work with than what I'd guess is the current way of doing it.

@jreback
Contributor

jreback commented Nov 18, 2014

@hmgaudecker

these already exist, like so (they would work similarly directly on a Categorical; that's all these .cat ops are doing anyhow):

In [1]: s = Series(list('aabcde')).astype('category')

In [2]: s
Out[2]: 
0    a
1    a
2    b
3    c
4    d
5    e
dtype: category
Categories (5, object): [a < b < c < d < e]

In [3]: s.cat.ordered
Out[3]: True

In [4]: s.cat.ordered=False

In [5]: s
Out[5]: 
0    a
1    a
2    b
3    c
4    d
5    e
dtype: category
Categories (5, object): [a, b, c, d, e]

In [6]: s.cat.categories = s.cat.categories[::-1]

In [7]: s
Out[7]: 
0    e
1    e
2    d
3    c
4    b
5    a
dtype: category
Categories (5, object): [e, d, c, b, a]

In [8]: s.cat.ordered=True

In [9]: s
Out[9]: 
0    e
1    e
2    d
3    c
4    b
5    a
dtype: category
Categories (5, object): [e < d < c < b < a]
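
The session above reflects pandas as of this PR. In more recent pandas releases, as far as I am aware, assigning to `.cat.ordered` or `.cat.categories` is no longer supported, and `astype('category')` yields an unordered result by default. A minimal sketch of the two requested operations using the `.cat` accessor methods follows. Note one difference: assigning reversed categories, as in `In [6]`, renames them and so changes the displayed values, whereas `reorder_categories` only changes the order and leaves the values alone.

```python
import pandas as pd

s = pd.Series(list("aabcde")).astype("category")

# "drop_order": make the categorical explicitly unordered.
unordered = s.cat.as_unordered()

# "reverse_order": keep the values, reverse the category order, and
# mark the result as ordered.
reversed_order = s.cat.reorder_categories(
    list(s.cat.categories[::-1]), ordered=True
)

print(unordered.cat.ordered)
print(list(reversed_order.cat.categories))
print(list(reversed_order))
```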

@jreback jreback added Categorical Categorical Data Type IO Stata read_stata, to_stata labels Nov 18, 2014
@bashtage
Contributor Author

I suppose what is left to decide for this PR is whether:

  • Categoricals should be always ordered by default. This is a change from 0.15.1, where they are ordered by value labels if orderable; with this PR they are always ordered (the order comes from the underlying values).
  • Whether to include an input to control this. It could be a simple bool to hit all categoricals, e.g. order_categoricals=True, or it could be a list of column names to control it on a case-by-case basis, e.g. unordered_categoricals=['sex','state']. These are sugar since all information needed to do them is in the loaded data.
  • Is a deprecation cycle needed? I think of this as a bug fix, mostly to avoid needing a dep cycle (although I tried to make the case that the loss of information in the existing method is bug-like).

@hmgaudecker

@jreback: Thanks, that's sweet.

On 18.11.14 13:30, Kevin Sheppard wrote:

I suppose what is left to decide for this PR is whether:

  • Categoricals should be always ordered by default. This is a change
    from 0.15.1, where they are ordered by value labels if orderable;
    with this PR they are always ordered (the order comes from the
    underlying values).

+1

The behaviour across 0.14 and 0.15 is inconsistent anyhow, if I am not
mistaken. So that change seems to be due to a use case that has been
overlooked so far. Any workarounds that people might have come up with
should work with the new behaviour as well.

  • Whether to include an input to control this. It could be a simple
    bool to hit all categoricals, e.g. |order_categoricals=True|, or
    it could be a list of column names to control it on a case-by-case
    basis, e.g. |unordered_categoricals=['sex','state']|. These are
    sugar since all information needed to do them is in the loaded data.

How about a keyword unordered_categoricals=[] ? Then it should be fairly
obvious that the default behaviour is to order them, and to include a
list of column names for the unordered ones.

I would also suggest documenting that, along with @jreback's elaboration, here:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html#getting-data-in-out

I can take a stab at that if you want.

  • Is a deprecation cycle needed? I think of this as a bug fix,
    mostly to avoid needing a dep cycle (although I tried to make the
    case that the loss of information in the existing method is bug-like).

-1, see above

@jreback
Contributor

jreback commented Nov 18, 2014

@bashtage we can just fix/change this (and document it in v0.15.2). This is very very new, so if it changes slightly no big deal. No deprecation cycle needed.

as @hmgaudecker points out, more docs are good for the Stata read/write of Categoricals (but I would put it in io.rst and provide a link back to categorical data, as I did for the HDF5 changes).

I like order_categoricals=True|False

then the default use case will be to order; if there are special-case issues, people can turn it off (and always order them post-import).

@jreback jreback added this to the 0.15.2 milestone Nov 18, 2014
@jreback
Contributor

jreback commented Nov 18, 2014

@bashtage in the release note you can just reference this PR number

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch 5 times, most recently from ae24038 to 0aefc17 Compare November 18, 2014 16:26
@bashtage
Contributor Author

I implemented the order_categoricals, tests and added some docs.

Aside from a typo in the issue number in the release notes, I think it should be ready.
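
For readers who want to try the flag, here is a small round-trip sketch. The column name, data, and file name are made up for illustration; `order_categoricals` is the flag added in this PR, and `convert_categoricals` is left at its default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical survey column, stored as an ordered categorical.
df = pd.DataFrame(
    {
        "health": pd.Categorical(
            ["poor", "good", "excellent", "good"],
            categories=["poor", "good", "excellent"],
            ordered=True,
        )
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.dta")
    # Categorical columns are written as value-labeled integer data.
    df.to_stata(path, write_index=False)

    # Default: imported categoricals are ordered by the underlying values.
    ordered = pd.read_stata(path)
    # Opt out: same categories, but not marked as ordered.
    unordered = pd.read_stata(path, order_categoricals=False)

print(ordered["health"].cat.ordered)
print(unordered["health"].cat.ordered)

# An unordered import can still be ordered after the fact.
reordered = unordered["health"].cat.as_ordered()
```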

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch from 0aefc17 to 8ec2fc0 Compare November 18, 2014 18:47
@bashtage
Contributor Author

@jreback Green, subject to review (esp for docs)

.. versionadded:: 0.15.2

``Categorical`` data can be exported to *Stata* data files as value labeled data.
The exported data is consists of the underlying category codes as integer data
Member

'is consists' -> 'consists'



Similarly, labeled data can be imported from *Stata* data files as ``Categorical``
variables by setting ``convert_categoricals=True``. Imported ``Categorical``
Member

maybe add that the default value is True

@jreback
Contributor

jreback commented Nov 18, 2014

@bashtage aside from @jorisvandenbossche comments looks fine to me.

maybe explain the 'partial ordering' issue and how order_categoricals=True may have an adverse effect.

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch from 8ec2fc0 to 34e6afc Compare November 18, 2014 22:35
@bashtage
Contributor Author

@jorisvandenbossche @jreback Many doc edits...rewrote some text that probably only made sense to me to be simpler.

@@ -45,6 +45,7 @@ Enhancements
- Added ability to export Categorical data to to/from HDF5 (:issue:`7621`). Queries work the same as if it was an object array. However, the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here <io.hdf5-categorical>` for an example and caveats w.r.t. prior versions of pandas.
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`).

Contributor

maybe add a link here to the new docs

@jreback
Contributor

jreback commented Nov 18, 2014

@bashtage minor doc ref needed in the release notes, otherwise I am ok with this.

Looks good! (It needed quite a few notes/warnings in the docs, but when dealing with a non-trivial type conversion that's how it is!)

thanks!

@bashtage bashtage force-pushed the stata-monotonic-categoricals branch from 34e6afc to 6cf2e48 Compare November 19, 2014 02:19
Ensure that category codes have the same order as the underlying Stata data.
Also adds a flag that allows categorical data to be treated as ordered or
unordered when importing.
jreback added a commit that referenced this pull request Nov 19, 2014
FIX: Correct Categorical behavior in StataReader
@jreback jreback merged commit 750151c into pandas-dev:master Nov 19, 2014
@jreback
Contributor

jreback commented Nov 19, 2014

@bashtage thanks!
@PKEuS thanks!
@hmgaudecker thanks!

as always, pls review the built docs for typos / how they look and submit a followup pr if needed.

@bashtage bashtage deleted the stata-monotonic-categoricals branch November 19, 2014 16:06
@bashtage bashtage restored the stata-monotonic-categoricals branch November 19, 2014 16:06
@bashtage bashtage deleted the stata-monotonic-categoricals branch November 19, 2014 17:15