FIX: Correct Categorical behavior in StataReader #8836

bashtage · 2014-11-17T04:09:58Z

Ensure that category codes have the same order as the underlying Stata data.

bashtage · 2014-11-17T15:24:32Z

This changes the default to returning ordered series. The old version would return ordered series when the series was fully labeled (ordered by alphabetical order of labels) and an unordered series when the series was partially labeled (since the labels cannot be sorted in this case).

With this change all series are ordered by the underlying data values, whether partially labeled or fully labeled.
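As a rough illustration of the difference (a sketch of the idea only, not the StataReader code; the value labels and variable names below are made up), ordering categories by the underlying values versus alphabetically by label could look like this:

```python
import pandas as pd

# Hypothetical Stata value labels: underlying integer code -> label.
value_labels = {1: "poor", 2: "fair", 3: "good", 4: "excellent"}
raw = pd.Series([3, 1, 4, 2, 3])
labeled = raw.map(value_labels)

# Categories ordered by the underlying values (the behavior described above).
categories_by_value = [value_labels[code] for code in sorted(value_labels)]
by_value = pd.Categorical(labeled, categories=categories_by_value, ordered=True)

# Categories ordered alphabetically by label (the older behavior for
# fully labeled data, as described above).
categories_by_label = sorted(value_labels.values())
by_label = pd.Categorical(labeled, categories=categories_by_label, ordered=True)

print(list(by_value.categories))  # ['poor', 'fair', 'good', 'excellent']
print(list(by_label.categories))  # ['excellent', 'fair', 'good', 'poor']
```

With value ordering, the category codes follow the Stata coding, so comparisons such as "poor < good" agree with the original data.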

bashtage · 2014-11-17T17:11:26Z

Added test from #8816

bashtage · 2014-11-17T20:05:00Z

@PKEuS @hmgaudecker @jreback

This seems to work and passes both new tests as well as old tests. Also handles partially labeled, float, etc.

Comments welcome.

bashtage · 2014-11-17T22:18:27Z

@jreback Green

PKEuS · 2014-11-17T22:22:01Z

Sounds good, especially since my implementation caused problems with an existing unit test that was added recently.

hmgaudecker · 2014-11-18T06:45:59Z

Looks great, thanks!

@jreback Would it be feasible to add "drop_order" and "reverse_order" methods to pd.Categorical? For any survey dataset, one-size fits all usually won't work (regardless of the default) and the easiest thing will be to use what works best and then change some columns ex post.

Why these two?

- drop_order() for unordered Categoricals like gender
- reverse_order() because often the numbers used for an ordinal ranking will be in arbitrary order. E.g., the self-reported health example (excellent, very good, ..., poor) you may find in Stata coded as 5, 4, 3, 2, 1 or 1, 2, 3, 4, 5 -- whatever the data producer thought looked best in Stata output. When Pandas displays the latter coding, excellent < very good < ... < poor looks odd.

Both are probably just syntactic sugar, but they'd be much easier to work with than how I would guess it can be done now.

jreback · 2014-11-18T12:23:06Z

@hmgaudecker

these already exist, like so (they would work similarly directly on a Categorical; that's all these .cat ops are doing anyhow):

In [1]: s = Series(list('aabcde')).astype('category')

In [2]: s
Out[2]: 
0    a
1    a
2    b
3    c
4    d
5    e
dtype: category
Categories (5, object): [a < b < c < d < e]

In [3]: s.cat.ordered
Out[3]: True

In [4]: s.cat.ordered=False

In [5]: s
Out[5]: 
0    a
1    a
2    b
3    c
4    d
5    e
dtype: category
Categories (5, object): [a, b, c, d, e]

In [6]: s.cat.categories = s.cat.categories[::-1]

In [7]: s
Out[7]: 
0    e
1    e
2    d
3    c
4    b
5    a
dtype: category
Categories (5, object): [e, d, c, b, a]

In [8]: s.cat.ordered=True

In [9]: s
Out[9]: 
0    e
1    e
2    d
3    c
4    b
5    a
dtype: category
Categories (5, object): [e < d < c < b < a]
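For readers on later pandas releases, where (as far as I know) assigning to `.cat.ordered` and `.cat.categories` is no longer supported, a rough equivalent uses the accessor methods `as_unordered()` and `reorder_categories()`. This is only a sketch. Note that `reorder_categories()` keeps the values and reverses only the order of the categories, whereas assigning reversed categories in the session above renames them.

```python
import pandas as pd

s = pd.Series(list("aabcde")).astype("category")

# Drop any ordering (e.g. for a variable like gender).
unordered = s.cat.as_unordered()

# Reverse the category order and mark the result as ordered.
# The values themselves are unchanged; only their ranking flips.
reversed_order = s.cat.reorder_categories(
    list(s.cat.categories[::-1]), ordered=True
)

print(unordered.cat.ordered)                # False
print(list(reversed_order.cat.categories))  # ['e', 'd', 'c', 'b', 'a']
print(list(reversed_order))                 # ['a', 'a', 'b', 'c', 'd', 'e']
```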

bashtage · 2014-11-18T12:30:02Z

I suppose what is left to decide for this PR is:

1. Whether Categoricals should always be ordered by default. This is a change from 0.15.1, where they are ordered by value labels if orderable; with this PR they are always ordered (the order comes from the underlying values).
2. Whether to include an input to control this. It could be a simple bool to hit all categoricals, e.g. order_categoricals=True, or it could be a list of column names to control it on a case-by-case basis, e.g. unordered_categoricals=['sex','state']. These are sugar since all information needed to do them is in the loaded data.
3. Whether a deprecation cycle is needed. I think of this as a bug fix, mostly to avoid needing a dep cycle (although I tried to make the case that the loss of information in the existing method is bug-like).
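To illustrate why a per-column option would be sugar, here is a minimal sketch of doing the same thing after loading. The helper name `unorder_columns` and the example frame are hypothetical and not part of pandas or of this PR:

```python
import pandas as pd


def unorder_columns(df, columns):
    """Return a copy of df with the named categorical columns made unordered."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].cat.as_unordered()
    return out


# Made-up frame standing in for data loaded with ordered categoricals.
df = pd.DataFrame({
    "sex": pd.Categorical(
        ["male", "female", "female"],
        categories=["male", "female"],
        ordered=True,
    ),
    "health": pd.Categorical(
        ["good", "poor", "fair"],
        categories=["poor", "fair", "good"],
        ordered=True,
    ),
})

result = unorder_columns(df, ["sex"])
print(result["sex"].cat.ordered)     # False
print(result["health"].cat.ordered)  # True
```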

hmgaudecker · 2014-11-18T13:10:15Z

@jreback: Thanks, that's sweet.

On 18.11.14 13:30, Kevin Sheppard wrote:

> I suppose what is left to decide for this PR is whether:
>
> Categoricals should be always ordered by default. This is a change
> from 0.15.1 since they are ordered by value labels if orderable,
> while they are always ordered now (order comes from underlying
> values).

+1

The behaviour across 0.14 and 0.15 is inconsistent anyhow, if I am not
mistaken. So that change seems to be due to a use case that was
overlooked so far. Any workarounds that people might have come up with
should work with the new behaviour as well.

> Whether to include an input to control this. It could be a simple
> bool to hit all categoricals, e.g. |order_categoricals=True|, or
> it could be a list of column names to control it on a case-by-case
> basis, e.g. |unordered_categoricals=['sex','state']|. These are
> sugar since all information needed to do them is in the loaded data.

How about a keyword unordered_categoricals=[] ? Then it should be fairly
obvious that the default behaviour is to order them, and to include a
list of column names for the unordered ones.

I would also suggest documenting that and @jreback's elaboration here:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html#getting-data-in-out

I can take a stab at that if you want.

> Is a deprecation cycle needed? I think of this as a bug fix,
> mostly to avoid needing a dep cycle (although I tried to make the
> case that the loss of information in the existing method is bug-like).

-1, see above

jreback · 2014-11-18T13:22:50Z

@bashtage we can just fix/change this (and document it in v0.15.2). This is very very new, so if it changes slightly no big deal. No deprecation cycle needed.

as @hmgaudecker points out, more docs are good for the Stata read/write of Categoricals (but I would put it in io.rst and provide a link back to categorical data (as I did for HDF5 changes)).

I like order_categoricals=True|False

then the default use case will be to order; if there are special-case issues, people can turn it off (and always order them post-import).
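A minimal round-trip sketch of how the flag is used, assuming a pandas version in which `read_stata` accepts `order_categoricals` (the file name and data are made up):

```python
import os
import tempfile

import pandas as pd

# Made-up survey column; categories are listed in their intended order.
health = pd.Categorical(
    ["good", "poor", "fair", "good"],
    categories=["poor", "fair", "good"],
    ordered=True,
)
df = pd.DataFrame({"health": health})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.dta")
    # Categoricals are written as value-labeled integer data.
    df.to_stata(path, write_index=False)

    # Default: imported categoricals are ordered by the underlying values.
    ordered_df = pd.read_stata(path)
    # Opt out: same categories, but no ordering is imposed.
    unordered_df = pd.read_stata(path, order_categoricals=False)

print(ordered_df["health"].cat.ordered)
print(unordered_df["health"].cat.ordered)
```

Columns imported without an order can still be ordered afterwards, which is the "order them post-import" route mentioned above.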

jreback · 2014-11-18T13:23:59Z

@bashtage in the release note you can just reference this PR number

bashtage · 2014-11-18T18:37:48Z

I implemented the order_categoricals, tests and added some docs.

Aside from my typo in the issue number which I have wrong in the release notes, I think it should be ready.

bashtage · 2014-11-18T21:31:46Z

@jreback Green, subject to review (esp for docs)

jorisvandenbossche · 2014-11-18T21:33:02Z

doc/source/io.rst

+.. versionadded:: 0.15.2
+
+``Categorical`` data can be exported to *Stata* data files as value labeled data.
+The exported data is consists of the underlying category codes as integer data


'is consists' -> 'consists'

jorisvandenbossche · 2014-11-18T21:38:30Z

doc/source/io.rst

+
+
+Similarly, labeled data can be imported from *Stata* data files as ``Categorical``
+variables by setting ``convert_categoricals=True``.  Imported ``Categorical``


maybe add that the default value is True

jreback · 2014-11-18T21:47:33Z

@bashtage aside from @jorisvandenbossche comments looks fine to me.

maybe explain the 'partial ordering' issue and how order_categoricals=True may have an adverse effect.
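One reading of the partially labeled case mentioned earlier in the thread can be sketched as follows. This is an illustration under assumptions, not the StataReader implementation, and the labels and codes are made up. When only some codes carry a value label, the resulting categories mix labels and raw numbers, so they cannot be sorted as labels, but the underlying codes still define an order:

```python
import pandas as pd

# Made-up partially labeled variable: only codes 1 and 3 carry a value label.
value_labels = {1: "low", 3: "high"}
codes = [1, 2, 3, 2]

# Unlabeled codes fall back to the raw number, so the categories mix
# strings and integers. They are listed in order of the underlying code.
mixed = [value_labels.get(code, code) for code in sorted(set(codes))]

# Sorting the mixed labels directly is not possible in Python 3 ...
try:
    sorted(mixed)
    labels_sortable = True
except TypeError:
    labels_sortable = False

# ... but the underlying codes still give a well-defined order.
values = [value_labels.get(code, code) for code in codes]
partial = pd.Categorical(values, categories=mixed, ordered=True)

print(mixed)            # ['low', 2, 'high']
print(labels_sortable)  # False
```

Whether the imposed order "low < 2 < high" is meaningful depends on the data, which is presumably where an ordered import could mislead.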

bashtage · 2014-11-18T22:39:04Z

@jorisvandenbossche @jreback Many doc edits...rewrote some text that probably only made sense to me to be simpler.

jreback · 2014-11-18T23:45:16Z

doc/source/whatsnew/v0.15.2.txt

@@ -45,6 +45,7 @@ Enhancements
 - Added ability to export Categorical data to to/from HDF5 (:issue:`7621`). Queries work the same as if it was an object array. However, the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here <io.hdf5-categorical>` for an example and caveats w.r.t. prior versions of pandas.
 - Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
 - Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
+- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`).



maybe add a link here to the new docs

jreback · 2014-11-18T23:46:35Z

@bashtage minor doc ref needed in the release notes, otherwise I am ok with this.

Looks good! (needed quite a few note/warning in the docs), but when dealing with a non-trivial type conversion that's how it is!

thanks!

Ensure that category codes have the same order as the underlying Stata data. Also adds a flag that allows categorical data to be treated as ordered or unordered when importing.

FIX: Correct Categorical behavior in StataReader

jreback · 2014-11-19T11:20:47Z

@bashtage thanks!
@PKEuS thanks!
@hmgaudecker thanks!

as always, pls review the built docs for typos / how they look and submit a followup pr if needed.

bashtage force-pushed the stata-monotonic-categoricals branch from 092dfa1 to 2a746a7 Compare November 17, 2014 14:53

bashtage force-pushed the stata-monotonic-categoricals branch 3 times, most recently from fd74c7e to 33be1a7 Compare November 17, 2014 17:06

bashtage force-pushed the stata-monotonic-categoricals branch from 33be1a7 to 5493a7a Compare November 17, 2014 19:17

jreback added Categorical Categorical Data Type IO Stata read_stata, to_stata labels Nov 18, 2014

jreback mentioned this pull request Nov 18, 2014

StataReader: Support sorting categoricals #8816

Closed

jreback added this to the 0.15.2 milestone Nov 18, 2014

bashtage force-pushed the stata-monotonic-categoricals branch 5 times, most recently from ae24038 to 0aefc17 Compare November 18, 2014 16:26

bashtage force-pushed the stata-monotonic-categoricals branch from 0aefc17 to 8ec2fc0 Compare November 18, 2014 18:47

jorisvandenbossche reviewed Nov 18, 2014

bashtage force-pushed the stata-monotonic-categoricals branch from 8ec2fc0 to 34e6afc Compare November 18, 2014 22:35

jreback reviewed Nov 18, 2014

bashtage force-pushed the stata-monotonic-categoricals branch from 34e6afc to 6cf2e48 Compare November 19, 2014 02:19

BUG: Correct importing behavior for Categoricals in StataReader

6cf2e48

Ensure that category codes have the same order as the underlying Stata data. Also adds a flag that allows categorical data to be treated as ordered or unordered when importing.

jreback added a commit that referenced this pull request Nov 19, 2014

Merge pull request #8836 from bashtage/stata-monotonic-categoricals

750151c

FIX: Correct Categorical behavior in StataReader

jreback merged commit 750151c into pandas-dev:master Nov 19, 2014

bashtage deleted the stata-monotonic-categoricals branch November 19, 2014 16:06

bashtage restored the stata-monotonic-categoricals branch November 19, 2014 16:06

bashtage deleted the stata-monotonic-categoricals branch November 19, 2014 17:15
