BUG: Fixed incorrect type in integer conversion in to_stata #6335

Closed
wants to merge 71 commits

Commits
2448f48
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
eeded1f
Added test for integer conversion bug
bashtage Feb 12, 2014
fba4572
Removed unintended whitespace
bashtage Feb 12, 2014
50f579f
Fixed another typo
bashtage Feb 12, 2014
77939e0
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
fa7faff
Added test for integer conversion bug
bashtage Feb 12, 2014
26b5cdc
Removed unintended whitespace
bashtage Feb 12, 2014
a572356
Fixed another typo
bashtage Feb 12, 2014
467a84f
FIX: Corrected incorrect data type conversion between pandas and Stata
bashtage Feb 26, 2014
3915e07
Merge branch 'stata-export-datatype' of https://github.com/bashtage/p…
bashtage Feb 26, 2014
174aca3
Removed unintended branch merge
bashtage Feb 26, 2014
89fb3c0
Fixed formatting in comparison after casting
bashtage Feb 26, 2014
6308776
Added docstring for new function and warning class
bashtage Feb 26, 2014
4e65c25
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
6b99643
Added test for integer conversion bug
bashtage Feb 12, 2014
bfed97b
Removed unintended whitespace
bashtage Feb 12, 2014
144516a
Fixed another typo
bashtage Feb 12, 2014
f4eb138
FIX: Corrected incorrect data type conversion between pandas and Stata
bashtage Feb 26, 2014
faae4a0
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
de11ef9
Added test for integer conversion bug
bashtage Feb 12, 2014
4e0df96
Removed unintended whitespace
bashtage Feb 12, 2014
c4cff55
Fixed another typo
bashtage Feb 12, 2014
afee2dc
Removed unintended branch merge
bashtage Feb 26, 2014
4a96faa
Fixed formatting in comparison after casting
bashtage Feb 26, 2014
238bb93
Added docstring for new function and warning class
bashtage Feb 26, 2014
5c0f438
Merge branch 'stata-export-datatype' of https://github.com/bashtage/p…
bashtage Feb 26, 2014
13f56ee
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
d30b445
Added test for integer conversion bug
bashtage Feb 12, 2014
58bc8ce
Removed unintended whitespace
bashtage Feb 12, 2014
c925923
Fixed another typo
bashtage Feb 12, 2014
7fb4d1b
Added test for integer conversion bug
bashtage Feb 12, 2014
9163fe8
Fixed another typo
bashtage Feb 12, 2014
f329ed0
FIX: Corrected incorrect data type conversion between pandas and Stata
bashtage Feb 26, 2014
9e05c86
Fixed formatting in comparison after casting
bashtage Feb 26, 2014
4d21b71
Added docstring for new function and warning class
bashtage Feb 26, 2014
07b1885
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
a0a0cad
Added test for integer conversion bug
bashtage Feb 12, 2014
f7aaa9e
FIX: Corrected incorrect data type conversion between pandas and Stata
bashtage Feb 26, 2014
1661158
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
00232bc
Added test for integer conversion bug
bashtage Feb 12, 2014
f8de199
Fixed another typo
bashtage Feb 12, 2014
02e4472
Removed unintended branch merge
bashtage Feb 26, 2014
b3a3366
Merge branch 'stata-export-datatype' of https://github.com/bashtage/p…
bashtage Feb 26, 2014
9788ad1
PERF: optimize index.__getitem__ for slice & boolean mask indexers
immerrr Feb 22, 2014
cda4216
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
f149927
Merge pull request #6440 from immerrr/index-getitem-performance
jreback Feb 28, 2014
3bde9c9
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
016bbf0
Merge branch 'master' of git://github.com/pydata/pandas into stata-ex…
bashtage Feb 28, 2014
6efa4c1
Merge pull request #6506 from jreback/dup_loc
jreback Feb 28, 2014
20d6191
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
ca85da8
Disabled the big endian skips
bashtage Feb 28, 2014
a66ae27
Fixed legacy date issue with format 114 files
bashtage Feb 28, 2014
840efe6
Added format 114 (Stata 9/10/11) data file
bashtage Feb 28, 2014
661ab24
Add test for Stata data with file format 114
bashtage Feb 28, 2014
61b141b
ENH: add method='dense' to rank
dsm054 Mar 1, 2014
1dc157c
Added additional data files for testing alternative Stata file formats
bashtage Mar 1, 2014
38ddd91
Added expected result to test
bashtage Mar 2, 2014
530311c
BUG: Changes types used in packing structs
bashtage Feb 12, 2014
0f9ff84
Corrected incorrect data type conversion between pandas and Stata
bashtage Feb 26, 2014
0f36d6b
Added docstring for new function and warning class
bashtage Feb 26, 2014
7cb87f4
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
8ba7a35
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
d171d96
BUG: Fixes and tests for extreme values in all data types
bashtage Feb 28, 2014
adace0f
Disabled the big endian skips
bashtage Feb 28, 2014
040b736
Fixed legacy date issue with format 114 files
bashtage Feb 28, 2014
4f719ad
Added format 114 (Stata 9/10/11) data file
bashtage Feb 28, 2014
266983d
Add test for Stata data with file format 114
bashtage Feb 28, 2014
4ce821b
Added additional data files for testing alternative Stata file formats
bashtage Mar 1, 2014
ae34642
Added expected result to test
bashtage Mar 2, 2014
d83e902
Fixed final PEP8 issues in test_stata
bashtage Mar 2, 2014
27b5278
Added changes and enhancements to documentation
bashtage Mar 2, 2014
7 changes: 7 additions & 0 deletions doc/source/release.rst
@@ -105,6 +105,8 @@ API Changes
- ``NameResolutionError`` was removed because it isn't necessary anymore.
- ``concat`` will now concatenate mixed Series and DataFrames using the Series name
or numbering columns as needed (:issue:`2385`)
- Slicing and advanced/boolean indexing operations on ``Index`` classes will no
  longer change the type of the resulting index (:issue:`6440`).

Experimental Features
~~~~~~~~~~~~~~~~~~~~~
@@ -125,6 +127,7 @@ Improvements to existing features
- Performance improvement in indexing into a multi-indexed Series (:issue:`5567`)
- Testing statements updated to use specialized asserts (:issue:`6175`)
- ``Series.rank()`` now has a percentage rank option (:issue:`5971`)
- ``Series.rank()`` and ``DataFrame.rank()`` now accept ``method='dense'`` for ranks without gaps (:issue:`6514`)
- ``quotechar``, ``doublequote``, and ``escapechar`` can now be specified when
using ``DataFrame.to_csv`` (:issue:`5414`, :issue:`4528`)
- perf improvements in DataFrame construction with certain offsets, by removing faulty caching
@@ -191,6 +194,10 @@ Bug Fixes
- Bug in ``read_html`` tests where redirected invalid URLs would make one test
fail (:issue:`6445`).
- Bug in multi-axis indexing using ``.loc`` on non-unique indices (:issue:`6504`)
- Bug in ``pd.read_stata`` which would use the wrong data types and missing values (:issue:`6327`)
- Bug in ``DataFrame.to_stata`` that led to data loss in certain cases (:issue:`6335`)
- Bug in ``DataFrame.to_stata`` which exported using the wrong data types and missing values (:issue:`6335`)


pandas 0.13.1
-------------
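The `to_stata`/`read_stata` fixes noted above can be exercised with a small round-trip (a sketch against a current pandas; the buffer-based call assumes a pandas version whose `to_stata` accepts file-like objects):

```python
import io

import pandas as pd

# Round-trip a small integer column through the Stata writer and reader.
# Stata has no 64-bit integer type, so the writer stores values that fit
# in a 32-bit range using a narrower Stata integer type; the data itself
# survives unchanged.
df = pd.DataFrame({"x": [1, 2, 3]})

buf = io.BytesIO()
df.to_stata(buf, write_index=False)
buf.seek(0)

back = pd.read_stata(buf)
assert back["x"].tolist() == [1, 2, 3]
```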
18 changes: 18 additions & 0 deletions doc/source/v0.14.0.txt
@@ -78,6 +78,21 @@ These are out-of-bounds selections
- ``NameResolutionError`` was removed because it isn't necessary anymore.
- ``concat`` will now concatenate mixed Series and DataFrames using the Series name
or numbering columns as needed (:issue:`2385`). See :ref:`the docs <merging.mixed_ndims>`
- Slicing and advanced/boolean indexing operations on ``Index`` classes will no
  longer change the type of the resulting index (:issue:`6440`)

.. ipython:: python

i = pd.Index([1, 2, 3, 'a', 'b', 'c'])
i[[0,1,2]]

Previously, the above operation would return ``Int64Index``. If you'd like
to do this manually, use :meth:`Index.astype`

.. ipython:: python

i[[0,1,2]].astype(np.int_)


MultiIndexing Using Slicers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -233,6 +248,9 @@ Enhancements
using ``DataFrame.to_csv`` (:issue:`5414`, :issue:`4528`)
- Added a ``to_julian_date`` function to ``TimeStamp`` and ``DatetimeIndex``
to convert to the Julian Date used primarily in astronomy. (:issue:`4041`)
- ``DataFrame.to_stata`` will now check data for compatibility with Stata data types
  and will upcast when needed. When it isn't possible to losslessly upcast, a warning
is raised (:issue:`6327`)

Performance
~~~~~~~~~~~
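The compatibility check described in the enhancement note can be sketched as a plain dtype-mapping helper. This is a hypothetical illustration, not the actual pandas implementation, and the exact thresholds pandas uses may differ:

```python
import warnings

import numpy as np

# Stata has no unsigned and no 64-bit integer types, so unsigned columns
# are upcast to the next wider signed type, and int64 data that does not
# fit in 32 bits falls back to float64 with a warning, since that cast
# can lose precision.
_UNSIGNED_UPCASTS = {
    np.dtype(np.uint8): np.int16,
    np.dtype(np.uint16): np.int32,
    np.dtype(np.uint32): np.int64,
}


def stata_compatible(values):
    """Return `values` cast to a dtype a Stata file could hold."""
    if values.dtype in _UNSIGNED_UPCASTS:
        values = values.astype(_UNSIGNED_UPCASTS[values.dtype])
    if values.dtype == np.int64:
        info = np.iinfo(np.int32)
        if values.min() >= info.min and values.max() <= info.max:
            return values.astype(np.int32)
        warnings.warn("int64 values do not fit in Stata's 32-bit integer "
                      "type; casting to float64 may lose precision")
        return values.astype(np.float64)
    return values
```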
42 changes: 36 additions & 6 deletions pandas/algos.pyx
@@ -68,12 +68,14 @@ cdef:
int TIEBREAK_MAX = 2
int TIEBREAK_FIRST = 3
int TIEBREAK_FIRST_DESCENDING = 4
int TIEBREAK_DENSE = 5

tiebreakers = {
'average' : TIEBREAK_AVERAGE,
'min' : TIEBREAK_MIN,
'max' : TIEBREAK_MAX,
'first' : TIEBREAK_FIRST
'first' : TIEBREAK_FIRST,
'dense' : TIEBREAK_DENSE,
}


@@ -137,7 +139,7 @@ def rank_1d_float64(object in_arr, ties_method='average', ascending=True,
"""

cdef:
Py_ssize_t i, j, n, dups = 0
Py_ssize_t i, j, n, dups = 0, total_tie_count = 0
ndarray[float64_t] sorted_data, ranks, values
ndarray[int64_t] argsorted
float64_t val, nan_value
@@ -200,6 +202,10 @@
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = 2 * i - j - dups + 2
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = total_tie_count
sum_ranks = dups = 0
if pct:
return ranks / count
@@ -214,7 +220,7 @@ def rank_1d_int64(object in_arr, ties_method='average', ascending=True,
"""

cdef:
Py_ssize_t i, j, n, dups = 0
Py_ssize_t i, j, n, dups = 0, total_tie_count = 0
ndarray[int64_t] sorted_data, values
ndarray[float64_t] ranks
ndarray[int64_t] argsorted
@@ -265,6 +271,10 @@
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = 2 * i - j - dups + 2
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = total_tie_count
sum_ranks = dups = 0
if pct:
return ranks / count
@@ -279,7 +289,7 @@ def rank_2d_float64(object in_arr, axis=0, ties_method='average',
"""

cdef:
Py_ssize_t i, j, z, k, n, dups = 0
Py_ssize_t i, j, z, k, n, dups = 0, total_tie_count = 0
ndarray[float64_t, ndim=2] ranks, values
ndarray[int64_t, ndim=2] argsorted
float64_t val, nan_value
@@ -324,6 +334,7 @@

for i in range(n):
dups = sum_ranks = 0
total_tie_count = 0
for j in range(k):
sum_ranks += j + 1
dups += 1
@@ -347,6 +358,10 @@
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for z in range(j - dups + 1, j + 1):
ranks[i, argsorted[i, z]] = 2 * j - z - dups + 2
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for z in range(j - dups + 1, j + 1):
ranks[i, argsorted[i, z]] = total_tie_count
sum_ranks = dups = 0

if axis == 0:
@@ -362,7 +377,7 @@ def rank_2d_int64(object in_arr, axis=0, ties_method='average',
"""

cdef:
Py_ssize_t i, j, z, k, n, dups = 0
Py_ssize_t i, j, z, k, n, dups = 0, total_tie_count = 0
ndarray[float64_t, ndim=2] ranks
ndarray[int64_t, ndim=2] argsorted
ndarray[int64_t, ndim=2, cast=True] values
@@ -395,6 +410,7 @@

for i in range(n):
dups = sum_ranks = 0
total_tie_count = 0
for j in range(k):
sum_ranks += j + 1
dups += 1
@@ -415,6 +431,10 @@
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for z in range(j - dups + 1, j + 1):
ranks[i, argsorted[i, z]] = 2 * j - z - dups + 2
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for z in range(j - dups + 1, j + 1):
ranks[i, argsorted[i, z]] = total_tie_count
sum_ranks = dups = 0

if axis == 0:
@@ -430,7 +450,7 @@ def rank_1d_generic(object in_arr, bint retry=1, ties_method='average',
"""

cdef:
Py_ssize_t i, j, n, dups = 0
Py_ssize_t i, j, n, dups = 0, total_tie_count = 0
ndarray[float64_t] ranks
ndarray sorted_data, values
ndarray[int64_t] argsorted
@@ -502,6 +522,10 @@
ranks[argsorted[j]] = i + 1
elif tiebreak == TIEBREAK_FIRST:
raise ValueError('first not supported for non-numeric data')
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = total_tie_count
sum_ranks = dups = 0
if pct:
return ranks / count
@@ -545,6 +569,7 @@

cdef:
Py_ssize_t i, j, z, k, n, infs, dups = 0
Py_ssize_t total_tie_count = 0
ndarray[float64_t, ndim=2] ranks
ndarray[object, ndim=2] values
ndarray[int64_t, ndim=2] argsorted
@@ -600,6 +625,7 @@

for i in range(n):
dups = sum_ranks = infs = 0
total_tie_count = 0
for j in range(k):
val = values[i, j]
if val is nan_value and keep_na:
@@ -621,6 +647,10 @@
elif tiebreak == TIEBREAK_FIRST:
raise ValueError('first not supported for '
'non-numeric data')
elif tiebreak == TIEBREAK_DENSE:
total_tie_count += 1
for z in range(j - dups + 1, j + 1):
ranks[i, argsorted[i, z]] = total_tie_count
sum_ranks = dups = 0

if axis == 0:
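The `TIEBREAK_DENSE` branches added throughout these functions all follow the same pattern; a pure-Python sketch of the 1-d case makes it easier to read:

```python
# A pure-Python sketch of the TIEBREAK_DENSE logic: walk the values in
# sorted order, count duplicates, and bump a single tie counter once per
# group of equal values, so the rank always increases by exactly 1
# between distinct values.

def dense_rank(values):
    n = len(values)
    argsorted = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0] * n
    dups = 0                 # size of the current group of equal values
    total_tie_count = 0      # the dense rank assigned to the next group
    for i in range(n):
        dups += 1
        at_end = i == n - 1
        if at_end or values[argsorted[i + 1]] != values[argsorted[i]]:
            total_tie_count += 1
            for j in range(i - dups + 1, i + 1):
                ranks[argsorted[j]] = total_tie_count
            dups = 0
    return ranks
```

With `dense_rank([3, 1, 4, 1, 5])` the tied 1s share rank 1 and the remaining values get ranks 2, 3 and 4 with no gap.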
3 changes: 2 additions & 1 deletion pandas/core/frame.py
@@ -4182,11 +4182,12 @@ def rank(self, axis=0, numeric_only=None, method='average',
Ranks over columns (0) or rows (1)
numeric_only : boolean, default None
Include only float, int, boolean data
method : {'average', 'min', 'max', 'first'}
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
na_option : {'keep', 'top', 'bottom'}
* keep: leave NA values where they are
* top: smallest rank if ascending
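A short illustration of the difference between `'min'` and the new `'dense'` method documented above (run against a current pandas):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 20, 30]})

# 'min' leaves a gap in the ranks after the tied group; 'dense' does not.
min_ranks = df["x"].rank(method="min").tolist()      # [1.0, 2.0, 2.0, 4.0]
dense_ranks = df["x"].rank(method="dense").tolist()  # [1.0, 2.0, 2.0, 3.0]
```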
51 changes: 25 additions & 26 deletions pandas/core/index.py
@@ -631,34 +631,35 @@ def __hash__(self):
raise TypeError("unhashable type: %r" % type(self).__name__)

def __getitem__(self, key):
"""Override numpy.ndarray's __getitem__ method to work as desired"""
arr_idx = self.view(np.ndarray)
"""
Override numpy.ndarray's __getitem__ method to work as desired.

This function adds lists and Series as valid boolean indexers
(ndarrays only supports ndarray with dtype=bool).

If resulting ndim != 1, plain ndarray is returned instead of
corresponding `Index` subclass.

"""
# There's no custom logic to be implemented in __getslice__, so it's
# not overloaded intentionally.
__getitem__ = super(Index, self).__getitem__
if np.isscalar(key):
return arr_idx[key]
else:
if com._is_bool_indexer(key):
key = np.asarray(key)
return __getitem__(key)

try:
result = arr_idx[key]
if result.ndim > 1:
return result
except (IndexError):
if not len(key):
result = []
else:
raise
if isinstance(key, slice):
# This case is separated from the conditional above to avoid
# pessimization of basic indexing.
return __getitem__(key)

return Index(result, name=self.name)
if com._is_bool_indexer(key):
return __getitem__(np.asarray(key))

def _getitem_slice(self, key):
""" getitem for a bool/sliceable, fallback to standard getitem """
try:
arr_idx = self.view(np.ndarray)
result = arr_idx[key]
return self.__class__(result, name=self.name, fastpath=True)
except:
return self.__getitem__(key)
result = __getitem__(key)
if result.ndim > 1:
return result.view(np.ndarray)
else:
return result

def append(self, other):
"""
@@ -2800,8 +2801,6 @@ def __getitem__(self, key):

return result

_getitem_slice = __getitem__

def take(self, indexer, axis=None):
"""
Analogous to ndarray.take
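The rewritten `__getitem__` preserves the `Index` class for slices, list indexers, and boolean masks; a quick check against a current pandas, where this behavior still holds:

```python
import pandas as pd

i = pd.Index(["a", "b", "c", "d"])

# Slicing keeps the Index class rather than decaying to an ndarray.
assert type(i[1:3]) is type(i)

# Plain Python lists of booleans are accepted as boolean indexers.
masked = i[[True, False, True, False]]
assert type(masked) is type(i)
assert masked.tolist() == ["a", "c"]
```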
2 changes: 1 addition & 1 deletion pandas/core/internals.py
@@ -3737,7 +3737,7 @@ def get_slice(self, slobj, raise_on_error=False):
if raise_on_error:
_check_slice_bounds(slobj, self.index)
return self.__class__(self._block._slice(slobj),
self.index._getitem_slice(slobj), fastpath=True)
self.index[slobj], fastpath=True)

def set_axis(self, axis, value, maybe_rename=True, check_axis=True):
cur_axis, value = self._set_axis(axis, value, check_axis)
3 changes: 2 additions & 1 deletion pandas/core/series.py
@@ -1720,11 +1720,12 @@ def rank(self, method='average', na_option='keep', ascending=True,

Parameters
----------
method : {'average', 'min', 'max', 'first'}
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
na_option : {'keep'}
keep: leave NA values where they are
ascending : boolean, default True
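The `na_option` values listed in the `rank` docstring behave as follows (a quick example against a current pandas):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

# 'keep' (the default) leaves the NaN unranked.
assert s.rank().tolist()[0] == 3.0

# 'top' gives NaN the smallest rank; 'bottom' gives it the largest.
assert s.rank(na_option="top").tolist() == [4.0, 1.0, 2.0, 3.0]
assert s.rank(na_option="bottom").tolist() == [3.0, 4.0, 1.0, 2.0]
```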