Commit 750151c

Merge pull request #8836 from bashtage/stata-monotonic-categoricals
FIX: Correct Categorical behavior in StataReader
2 parents f504885 + 6cf2e48 commit 750151c

9 files changed, with 160 additions and 25 deletions.

doc/source/categorical.rst

+2-1
Original file line number | Diff line number | Diff line change
@@ -546,7 +546,8 @@ Getting Data In/Out
546546
Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype was implemented
547547
in 0.15.2. See :ref:`here <io.hdf5-categorical>` for an example and caveats.
548548

549-
Writing data to/from Stata format files was implemented in 0.15.2.
549+
Writing data to and reading data from *Stata* format files was implemented in
550+
0.15.2. See :ref:`here <io.stata-categorical>` for an example and caveats.
550551

551552
Writing to a CSV file will convert the data, effectively removing any information about the
552553
categorical (categories and ordering). So if you read back the CSV file you have to convert the

doc/source/io.rst

+53-6
@@ -3204,8 +3204,8 @@ format store like this:
32043204
.. ipython:: python
32053205
32063206
store_export = HDFStore('export.h5')
3207-
store_export.append('df_dc', df_dc, data_columns=df_dc.columns)
3208-
store_export
3207+
store_export.append('df_dc', df_dc, data_columns=df_dc.columns)
3208+
store_export
32093209
32103210
.. ipython:: python
32113211
:suppress:
@@ -3240,8 +3240,8 @@ number of options, please see the docstring.
32403240
legacy_store
32413241
32423242
# copy (and return the new handle)
3243-
new_store = legacy_store.copy('store_new.h5')
3244-
new_store
3243+
new_store = legacy_store.copy('store_new.h5')
3244+
new_store
32453245
new_store.close()
32463246
32473247
.. ipython:: python
@@ -3651,14 +3651,14 @@ You can access the management console to determine project id's by:
36513651

36523652
.. _io.stata:
36533653

3654-
STATA Format
3654+
Stata Format
36553655
------------
36563656

36573657
.. versionadded:: 0.12.0
36583658

36593659
.. _io.stata_writer:
36603660

3661-
Writing to STATA format
3661+
Writing to Stata format
36623662
~~~~~~~~~~~~~~~~~~~~~~~
36633663

36643664
The method :func:`~pandas.core.frame.DataFrame.to_stata` will write a DataFrame
@@ -3753,6 +3753,53 @@ Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used
37533753
import os
37543754
os.remove('stata.dta')
37553755
3756+
.. _io.stata-categorical:
3757+
3758+
Categorical Data
3759+
~~~~~~~~~~~~~~~~
3760+
3761+
.. versionadded:: 0.15.2
3762+
3763+
``Categorical`` data can be exported to *Stata* data files as value labeled data.
3764+
The exported data consists of the underlying category codes as integer data values
3765+
and the categories as value labels. *Stata* does not have an explicit equivalent
3766+
to a ``Categorical`` and information about *whether* the variable is ordered
3767+
is lost when exporting.
3768+
3769+
.. warning::
3770+
3771+
*Stata* only supports string value labels, and so ``str`` is called on the
3772+
categories when exporting data. Exporting ``Categorical`` variables with
3773+
non-string categories produces a warning, and can result in a loss of
3774+
information if the ``str`` representations of the categories are not unique.
3775+
3776+
Labeled data can similarly be imported from *Stata* data files as ``Categorical``
3777+
variables using the keyword argument ``convert_categoricals`` (``True`` by default).
3778+
By default, imported ``Categorical`` variables are ordered according to the
3779+
underlying numerical data. However, setting ``order_categoricals=False`` will
3780+
import labeled data as ``Categorical`` variables without an order.
3781+
3782+
.. note::
3783+
3784+
When importing categorical data, the values of the variables in the *Stata*
3785+
data file are not generally preserved since ``Categorical`` variables always
3786+
use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
3787+
of categories. If the original values in the *Stata* data file are required,
3788+
these can be imported by setting ``convert_categoricals=False``, which will
3789+
import original data (but not the variable labels). The original values can
3790+
be matched to the imported categorical data since there is a simple mapping
3791+
between the original *Stata* data values and the category codes of imported
3792+
Categorical variables: missing values are assigned code ``-1``, and the
3793+
smallest original value is assigned ``0``, the second smallest is assigned
3794+
``1`` and so on until the largest original value is assigned the code ``n-1``.
3795+
3796+
.. note::
3797+
3798+
*Stata* supports partially labeled series. These series have value labels for
3799+
some but not all data values. Importing a partially labeled series will produce
3800+
a ``Categorical`` with string categories for the values that are labeled and
3801+
numeric categories for values with no label.
3802+
37563803
.. _io.perf:
37573804

37583805
Performance Considerations
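
Note on the code mapping documented in the io.rst hunk above (missing values get code ``-1``, the smallest original value gets ``0``, and so on up to ``n-1``): the rule can be sketched in pure Python as follows. This is only an illustration of the documented mapping, not the pandas implementation; the helper name `stata_values_to_codes` and the sample values are hypothetical.

```python
import math


def stata_values_to_codes(values):
    """Map raw values to category codes.

    Hypothetical helper: missing values map to -1, and the sorted
    distinct non-missing values map to 0, 1, ..., n-1.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Distinct non-missing values in ascending order define the codes.
    distinct = sorted({v for v in values if not is_missing(v)})
    code_of = {value: code for code, value in enumerate(distinct)}
    codes = [-1 if is_missing(v) else code_of[v] for v in values]
    return codes, distinct


# Made-up raw data values with one missing observation.
codes, distinct = stata_values_to_codes([10, 30, None, 20, 10])
print(codes)     # [0, 2, -1, 1, 0]
print(distinct)  # [10, 20, 30]
```

Given the `distinct` list, an imported code `k >= 0` corresponds to the original value `distinct[k]`, which is the matching between original values and category codes that the note refers to.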

doc/source/whatsnew/v0.15.2.txt

+3-1
@@ -41,10 +41,11 @@ API changes
4141
Enhancements
4242
~~~~~~~~~~~~
4343

44-
- Added ability to export Categorical data to Stata (:issue:`8633`).
44+
- Added ability to export Categorical data to Stata (:issue:`8633`). See :ref:`here <io.stata-categorical>` for limitations of categorical variables exported to Stata data files.
4545
- Added ability to export Categorical data to/from HDF5 (:issue:`7621`). Queries work the same as if it was an object array. However, the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here <io.hdf5-categorical>` for an example and caveats w.r.t. prior versions of pandas.
4646
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
4747
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
48+
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files.
4849

4950
.. _whatsnew_0152.performance:
5051

@@ -73,6 +74,7 @@ Bug Fixes
7374

7475

7576

77+
- Imported categorical variables from Stata files retain the ordinal information in the underlying data (:issue:`8836`).
7678

7779

7880
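
Note on the ``order_categoricals`` flag listed in the enhancements above: a small round trip can show its effect. This is a sketch that assumes a recent pandas release, where `DataFrame.to_stata` and `pandas.read_stata` are both available; the column name, category values and file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up example data: one categorical column, every category observed.
df = pd.DataFrame({
    "grade": pd.Categorical(["low", "high", "mid", "low"],
                            categories=["low", "mid", "high"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grades.dta")
    # Categories are written as Stata value labels on integer codes.
    df.to_stata(path, write_index=False)

    # Default: labeled data comes back as an ordered Categorical.
    ordered = pd.read_stata(path)
    # order_categoricals=False: same labels, but no order is attached.
    unordered = pd.read_stata(path, order_categoricals=False)

print(ordered["grade"].cat.ordered)
print(unordered["grade"].cat.ordered)
print(list(ordered["grade"].cat.categories))
```

The ordering of the categories follows the underlying integer codes in the file, not the alphabetical order of the labels.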

pandas/io/stata.py

+30-17
@@ -29,7 +29,8 @@
2929

3030
def read_stata(filepath_or_buffer, convert_dates=True,
3131
convert_categoricals=True, encoding=None, index=None,
32-
convert_missing=False, preserve_dtypes=True, columns=None):
32+
convert_missing=False, preserve_dtypes=True, columns=None,
33+
order_categoricals=True):
3334
"""
3435
Read Stata file into DataFrame
3536
@@ -58,11 +59,14 @@ def read_stata(filepath_or_buffer, convert_dates=True,
5859
columns : list or None
5960
Columns to retain. Columns will be returned in the given order. None
6061
returns all columns
62+
order_categoricals : boolean, defaults to True
63+
Flag indicating whether converted categorical data are ordered.
6164
"""
6265
reader = StataReader(filepath_or_buffer, encoding)
6366

6467
return reader.data(convert_dates, convert_categoricals, index,
65-
convert_missing, preserve_dtypes, columns)
68+
convert_missing, preserve_dtypes, columns,
69+
order_categoricals)
6670

6771
_date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]
6872

@@ -1136,7 +1140,8 @@ def _read_strls(self):
11361140
self.path_or_buf.read(1) # zero-termination
11371141

11381142
def data(self, convert_dates=True, convert_categoricals=True, index=None,
1139-
convert_missing=False, preserve_dtypes=True, columns=None):
1143+
convert_missing=False, preserve_dtypes=True, columns=None,
1144+
order_categoricals=True):
11401145
"""
11411146
Reads observations from Stata file, converting them into a dataframe
11421147
@@ -1161,6 +1166,8 @@ def data(self, convert_dates=True, convert_categoricals=True, index=None,
11611166
columns : list or None
11621167
Columns to retain. Columns will be returned in the given order.
11631168
None returns all columns
1169+
order_categoricals : boolean, defaults to True
1170+
Flag indicating whether converted categorical data are ordered.
11641171
11651172
Returns
11661173
-------
@@ -1228,7 +1235,7 @@ def data(self, convert_dates=True, convert_categoricals=True, index=None,
12281235

12291236
for col, typ in zip(data, self.typlist):
12301237
if type(typ) is int:
1231-
data[col] = data[col].apply(self._null_terminate, convert_dtype=True,)
1238+
data[col] = data[col].apply(self._null_terminate, convert_dtype=True)
12321239

12331240
cols_ = np.where(self.dtyplist)[0]
12341241

@@ -1288,19 +1295,25 @@ def data(self, convert_dates=True, convert_categoricals=True, index=None,
12881295
col = data.columns[i]
12891296
data[col] = _stata_elapsed_date_to_datetime_vec(data[col], self.fmtlist[i])
12901297

1291-
if convert_categoricals:
1292-
cols = np.where(
1293-
lmap(lambda x: x in compat.iterkeys(self.value_label_dict),
1294-
self.lbllist)
1295-
)[0]
1296-
for i in cols:
1297-
col = data.columns[i]
1298-
labeled_data = np.copy(data[col])
1299-
labeled_data = labeled_data.astype(object)
1300-
for k, v in compat.iteritems(
1301-
self.value_label_dict[self.lbllist[i]]):
1302-
labeled_data[(data[col] == k).values] = v
1303-
data[col] = Categorical.from_array(labeled_data)
1298+
if convert_categoricals and self.value_label_dict:
1299+
value_labels = list(compat.iterkeys(self.value_label_dict))
1300+
cat_converted_data = []
1301+
for col, label in zip(data, self.lbllist):
1302+
if label in value_labels:
1303+
# Explicit call; ordering is controlled by order_categoricals
1304+
cat_data = Categorical(data[col], ordered=order_categoricals)
1305+
value_label_dict = self.value_label_dict[label]
1306+
categories = []
1307+
for category in cat_data.categories:
1308+
if category in value_label_dict:
1309+
categories.append(value_label_dict[category])
1310+
else:
1311+
categories.append(category) # Partially labeled
1312+
cat_data.categories = categories
1313+
cat_converted_data.append((col, cat_data))
1314+
else:
1315+
cat_converted_data.append((col, data[col]))
1316+
data = DataFrame.from_items(cat_converted_data)
13041317

13051318
if not preserve_dtypes:
13061319
retyped_data = []
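
Note on the relabeling loop added to ``StataReader.data`` above: each category is replaced by its value label when one exists and kept as the raw value otherwise, which is how partially labeled series end up with mixed string and numeric categories. A minimal pure-Python sketch of that step, with a hypothetical helper name and made-up values:

```python
def relabel_categories(categories, value_label_dict):
    """Replace each category with its value label, if it has one.

    Hypothetical helper mirroring the loop in the diff: categories
    without an entry in value_label_dict are kept unchanged, which
    is the partially labeled case.
    """
    return [value_label_dict.get(category, category)
            for category in categories]


# Made-up example: the value 2 has no label, so it stays numeric.
raw_categories = [1, 2, 3, 4, 5]
value_labels = {1: "d", 3: "e", 4: "b", 5: "a"}
relabeled = relabel_categories(raw_categories, value_labels)
print(relabeled)  # ['d', 2, 'e', 'b', 'a']
```

Because the categories are relabeled position by position, the category codes and their order stay tied to the original numeric values rather than to the label strings.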

pandas/io/tests/data/stata10_115.dta

2.24 KB
Binary file not shown.

pandas/io/tests/data/stata10_117.dta

2.24 KB
Binary file not shown.

pandas/io/tests/data/stata11_115.dta

810 Bytes
Binary file not shown.

pandas/io/tests/data/stata11_117.dta

1.24 KB
Binary file not shown.

pandas/io/tests/test_stata.py

+72
@@ -14,6 +14,7 @@
1414
import pandas as pd
1515
from pandas.compat import iterkeys
1616
from pandas.core.frame import DataFrame, Series
17+
from pandas.core.common import is_categorical_dtype
1718
from pandas.io.parsers import read_csv
1819
from pandas.io.stata import (read_stata, StataReader, InvalidColumnName,
1920
PossiblePrecisionLoss, StataMissingValue)
@@ -81,6 +82,11 @@ def setUp(self):
8182
self.dta18_115 = os.path.join(self.dirpath, 'stata9_115.dta')
8283
self.dta18_117 = os.path.join(self.dirpath, 'stata9_117.dta')
8384

85+
self.dta19_115 = os.path.join(self.dirpath, 'stata10_115.dta')
86+
self.dta19_117 = os.path.join(self.dirpath, 'stata10_117.dta')
87+
88+
self.dta20_115 = os.path.join(self.dirpath, 'stata11_115.dta')
89+
self.dta20_117 = os.path.join(self.dirpath, 'stata11_117.dta')
8490

8591
def read_dta(self, file):
8692
# Legacy default reader configuration
@@ -817,6 +823,72 @@ def test_categorical_with_stata_missing_values(self):
817823
written_and_read_again = self.read_dta(path)
818824
tm.assert_frame_equal(written_and_read_again.set_index('index'), original)
819825

826+
def test_categorical_order(self):
827+
# Directly construct using expected codes
828+
# Format is is_cat, col_name, labels (in order), underlying data
829+
expected = [(True, 'ordered', ['a', 'b', 'c', 'd', 'e'], np.arange(5)),
830+
(True, 'reverse', ['a', 'b', 'c', 'd', 'e'], np.arange(5)[::-1]),
831+
(True, 'noorder', ['a', 'b', 'c', 'd', 'e'], np.array([2, 1, 4, 0, 3])),
832+
(True, 'floating', ['a', 'b', 'c', 'd', 'e'], np.arange(0, 5)),
833+
(True, 'float_missing', ['a', 'd', 'e'], np.array([0, 1, 2, -1, -1])),
834+
(False, 'nolabel', [1.0, 2.0, 3.0, 4.0, 5.0], np.arange(5)),
835+
(True, 'int32_mixed', ['d', 2, 'e', 'b', 'a'], np.arange(5))]
836+
cols = []
837+
for is_cat, col, labels, codes in expected:
838+
if is_cat:
839+
cols.append((col, pd.Categorical.from_codes(codes, labels)))
840+
else:
841+
cols.append((col, pd.Series(labels, dtype=np.float32)))
842+
expected = DataFrame.from_items(cols)
843+
844+
# Read with and without categoricals, ensure order is identical
845+
parsed_115 = read_stata(self.dta19_115)
846+
parsed_117 = read_stata(self.dta19_117)
847+
tm.assert_frame_equal(expected, parsed_115)
848+
tm.assert_frame_equal(expected, parsed_117)
849+
850+
# Check identity of codes
851+
for col in expected:
852+
if is_categorical_dtype(expected[col]):
853+
print(col)
854+
tm.assert_series_equal(expected[col].cat.codes,
855+
parsed_115[col].cat.codes)
856+
tm.assert_index_equal(expected[col].cat.categories,
857+
parsed_115[col].cat.categories)
858+
859+
def test_categorical_sorting(self):
860+
parsed_115 = read_stata(self.dta20_115)
861+
parsed_117 = read_stata(self.dta20_117)
862+
# Sort based on codes, not strings
863+
parsed_115 = parsed_115.sort("srh")
864+
parsed_117 = parsed_117.sort("srh")
865+
# Don't sort index
866+
parsed_115.index = np.arange(parsed_115.shape[0])
867+
parsed_117.index = np.arange(parsed_117.shape[0])
868+
codes = [-1, -1, 0, 1, 1, 1, 2, 2, 3, 4]
869+
categories = ["Poor", "Fair", "Good", "Very good", "Excellent"]
870+
expected = pd.Series(pd.Categorical.from_codes(codes=codes,
871+
categories=categories))
872+
tm.assert_series_equal(expected, parsed_115["srh"])
873+
tm.assert_series_equal(expected, parsed_117["srh"])
874+
875+
def test_categorical_ordering(self):
876+
parsed_115 = read_stata(self.dta19_115)
877+
parsed_117 = read_stata(self.dta19_117)
878+
879+
parsed_115_unordered = read_stata(self.dta19_115,
880+
order_categoricals=False)
881+
parsed_117_unordered = read_stata(self.dta19_117,
882+
order_categoricals=False)
883+
for col in parsed_115:
884+
if not is_categorical_dtype(parsed_115[col]):
885+
continue
886+
tm.assert_equal(True, parsed_115[col].cat.ordered)
887+
tm.assert_equal(True, parsed_117[col].cat.ordered)
888+
tm.assert_equal(False, parsed_115_unordered[col].cat.ordered)
889+
tm.assert_equal(False, parsed_117_unordered[col].cat.ordered)
890+
891+
820892
if __name__ == '__main__':
821893
nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
822894
exit=False)
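
Note on ``test_categorical_sorting`` above: the point of "sort based on codes, not strings" can be reproduced without a Stata file. The sketch below assumes a recent pandas release and therefore uses `sort_values` rather than the `DataFrame.sort` call that appears in the test; the codes are made up, while the category names are taken from the test.

```python
import pandas as pd

categories = ["Poor", "Fair", "Good", "Very good", "Excellent"]
codes = [4, 0, 2, 1, 3]  # made-up observations, one per category

srh = pd.Series(pd.Categorical.from_codes(codes, categories=categories,
                                          ordered=True))

# Sorting a categorical follows the category order, i.e. the codes.
by_code = srh.sort_values().tolist()

# Sorting the plain strings gives alphabetical order instead.
by_string = sorted(srh.astype(str))

print(by_code)    # ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']
print(by_string)  # ['Excellent', 'Fair', 'Good', 'Poor', 'Very good']
```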
