
Commit c99dc49
Author: locojaydev

Merge branch 'master' into excelfancy

Conflicts:
	pandas/src/parse_helper.h
	pandas/src/parser/tokenizer.c

2 parents: d354267 + fbd77d5


71 files changed, +2246 −902 lines

.travis.yml (+12 −2)

@@ -7,8 +7,18 @@ python:
   - 3.2

 install:
-  - pip install --use-mirrors cython numpy nose pytz python-dateutil
+  - export PYTHONIOENCODING=utf8 # activate venv 1.8.4 "detach" fix
+  - virtualenv --version
+  - whoami
+  - pwd
+  # install 1.7.0b2 for 3.3, and pull a version of numpy git master
+  # with an alternate fix for the detach bug as a temporary workaround
+  # for the others.
+  - "if [ $TRAVIS_PYTHON_VERSION == '3.3' ]; then pip uninstall numpy; pip install http://downloads.sourceforge.net/project/numpy/NumPy/1.7.0b2/numpy-1.7.0b2.tar.gz; fi"
+  - "if [ $TRAVIS_PYTHON_VERSION == '3.2' ] || [ $TRAVIS_PYTHON_VERSION == '3.1' ]; then pip install --use-mirrors git+git://github.com/numpy/numpy.git@089bfa5865cd39e2b40099755e8563d8f0d04f5f#egg=numpy; fi"
+  - "if [ ${TRAVIS_PYTHON_VERSION:0:1} == '2' ]; then pip install numpy; fi" # should be a no-op if pre-installed
+  - pip install --use-mirrors cython nose pytz python-dateutil

 script:
   - python setup.py build_ext install
-  - nosetests --exe -w /tmp pandas.tests
+  - nosetests --exe -w /tmp -A "not slow" pandas

RELEASE.rst (+22 −1)

@@ -30,18 +30,31 @@ pandas 0.10.0
 **New features**

   - Add error handling to Series.str.encode/decode (#2276)
+  - Add ``where`` and ``mask`` to Series (#2337)
+  - Grouped histogram via `by` keyword in Series/DataFrame.hist (#2186)
+  - Support optional ``min_periods`` keyword in ``corr`` and ``cov``
+    for both Series and DataFrame (#2002)

 **API Changes**

   - ``names`` handling in file parsing: if explicit column `names` passed,
     `header` argument will be respected. If there is an existing header column,
     this can rename the columns. To fix legacy code, put ``header=None`` when
     passing ``names``
+  - DataFrame selection using a boolean frame now preserves input shape
+  - If function passed to Series.apply yields a Series, result will be a
+    DataFrame (#2316)

 **Improvements to existing features**

-  - Grouped histogram via `by` keyword in Series/DataFrame.hist (#2186)
   - Add ``nrows`` option to DataFrame.from_records for iterators (#1794)
+  - Unstack/reshape algorithm rewrite to avoid high memory use in cases where
+    the number of observed key-tuples is much smaller than the total possible
+    number that could occur (#2278). Also improves performance in most cases.
+  - Support duplicate columns in DataFrame.from_records (#2179)
+  - Add ``normalize`` option to Series/DataFrame.asfreq (#2137)
+  - SparseSeries and SparseDataFrame construction from empty and scalar
+    values now no longer create dense ndarrays unnecessarily (#2322)

 **Bug fixes**

@@ -51,6 +64,14 @@ pandas 0.10.0
   - Properly box datetime64 values when retrieving cross-section from
     mixed-dtype DataFrame (#2272)
   - Fix concatenation bug leading to #2057, #2257
+  - Fix regression in Index console formatting (#2319)
+  - Box Period data when assigning PeriodIndex to frame column (#2243, #2281)
+  - Raise exception on calling reset_index on Series with inplace=True (#2277)
+  - Enable setting multiple columns in DataFrame with hierarchical columns
+    (#2295)
+  - Respect dtype=object in DataFrame constructor (#2291)
+  - Fix DatetimeIndex.join bug with tz-aware indexes and how='outer' (#2317)
+  - pop(...) and del work with DataFrame with duplicate columns (#2349)

 pandas 0.9.1
 ============
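
As a quick orientation for readers of these notes, here is a minimal sketch of the headline 0.10.0 additions listed above (``Series.where``/``mask``, #2337, and the ``min_periods`` keyword, #2002). The data is made up for illustration and uses the era-appropriate ``.ix`` indexer:

    import numpy as np
    from numpy.random import randn
    from pandas import Series, DataFrame

    s = Series(randn(5))
    s.where(s > 0)  # same shape as s; entries failing the condition become NaN
    s.mask(s > 0)   # inverse of where: hide entries where the condition holds

    frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])
    frame.ix[:5, 'a'] = np.nan
    frame.corr(min_periods=12)  # pairs with fewer than 12 observations give NaN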

doc/source/computation.rst (+43 −15; several hunks below are whitespace-only cleanups, so the removed and added lines appear identical)

@@ -62,6 +62,21 @@ among the series in the DataFrame, also excluding NA/null values.
    frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
    frame.cov()

+``DataFrame.cov`` also supports an optional ``min_periods`` keyword that
+specifies the required minimum number of observations for each column pair
+in order to have a valid result.
+
+.. ipython:: python
+
+   frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])
+   frame.ix[:5, 'a'] = np.nan
+   frame.ix[5:10, 'b'] = np.nan
+
+   frame.cov()
+
+   frame.cov(min_periods=12)
+
 .. _computation.correlation:

 Correlation

@@ -97,6 +112,19 @@ All of these are currently computed using pairwise complete observations.
 Note that non-numeric columns will be automatically excluded from the
 correlation calculation.

+Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword:
+
+.. ipython:: python
+
+   frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])
+   frame.ix[:5, 'a'] = np.nan
+   frame.ix[5:10, 'b'] = np.nan
+
+   frame.corr()
+
+   frame.corr(min_periods=12)
+
 A related method ``corrwith`` is implemented on DataFrame to compute the
 correlation between like-labeled Series contained in different DataFrame
 objects.

@@ -290,9 +318,9 @@ columns using ``ix`` indexing:
 Expanding window moment functions
 ---------------------------------
-A common alternative to rolling statistics is to use an *expanding* window,
-which yields the value of the statistic with all the data available up to that
-point in time. As these calculations are a special case of rolling statistics,
+A common alternative to rolling statistics is to use an *expanding* window,
+which yields the value of the statistic with all the data available up to that
+point in time. As these calculations are a special case of rolling statistics,
 they are implemented in pandas such that the following two calls are equivalent:

 .. ipython:: python

@@ -301,7 +329,7 @@ they are implemented in pandas such that the following two calls are equivalent:
    expanding_mean(df)[:5]

-Like the ``rolling_`` functions, the following methods are included in the
+Like the ``rolling_`` functions, the following methods are included in the
 ``pandas`` namespace or can be located in ``pandas.stats.moments``.

 .. csv-table::

@@ -324,12 +352,12 @@ Like the ``rolling_`` functions, the following methods are included in the
    ``expanding_corr``, Correlation (binary)
    ``expanding_corr_pairwise``, Pairwise correlation of DataFrame columns

-Aside from not having a ``window`` parameter, these functions have the same
-interfaces as their ``rolling_`` counterpart. Like above, the parameters they
+Aside from not having a ``window`` parameter, these functions have the same
+interfaces as their ``rolling_`` counterpart. Like above, the parameters they
 all accept are:

-- ``min_periods``: threshold of non-null data points to require. Defaults to
-  minimum needed to compute statistic. No ``NaNs`` will be output once
+- ``min_periods``: threshold of non-null data points to require. Defaults to
+  minimum needed to compute statistic. No ``NaNs`` will be output once
   ``min_periods`` non-null data points have been seen.
 - ``freq``: optionally specify a :ref:`frequency string <timeseries.alias>`
   or :ref:`DateOffset <timeseries.offsets>` to pre-conform the data to.

@@ -338,15 +366,15 @@ all accept are:
 .. note::

-    The output of the ``rolling_`` and ``expanding_`` functions do not return a
-    ``NaN`` if there are at least ``min_periods`` non-null values in the current
-    window. This differs from ``cumsum``, ``cumprod``, ``cummax``, and
-    ``cummin``, which return ``NaN`` in the output wherever a ``NaN`` is
+    The output of the ``rolling_`` and ``expanding_`` functions do not return a
+    ``NaN`` if there are at least ``min_periods`` non-null values in the current
+    window. This differs from ``cumsum``, ``cumprod``, ``cummax``, and
+    ``cummin``, which return ``NaN`` in the output wherever a ``NaN`` is
     encountered in the input.

-    An expanding window statistic will be more stable (and less responsive) than
-    its rolling window counterpart as the increasing window size decreases the
-    relative impact of an individual data point. As an example, here is the
+    An expanding window statistic will be more stable (and less responsive) than
+    its rolling window counterpart as the increasing window size decreases the
+    relative impact of an individual data point. As an example, here is the
     ``expanding_mean`` output for the previous time series dataset:

 .. ipython:: python
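
The documentation above says the expanding and rolling calls are equivalent; here is a small sketch checking that against the 0.10-era moments API (names from ``pandas.stats.moments``; ``df`` is arbitrary float data):

    from numpy.random import randn
    from pandas import DataFrame
    from pandas.stats.moments import rolling_mean, expanding_mean

    df = DataFrame(randn(100, 3), columns=['A', 'B', 'C'])

    # an expanding window is a rolling window spanning all data seen so far
    a = rolling_mean(df, window=len(df), min_periods=1)
    b = expanding_mean(df)
    (a == b).all()  # True for each column (df is NaN-free here)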

doc/source/indexing.rst (+64 −8)

@@ -190,6 +190,7 @@ Using a boolean vector to index a Series works exactly as in a numpy ndarray:
   s[s > 0]
   s[(s < 0) & (s > -0.5)]
+  s[(s < -1) | (s > 1)]

 You may select rows from a DataFrame using a boolean vector the same length as
 the DataFrame's index (for example, something derived from one of the columns

@@ -231,22 +232,77 @@ Note, with the :ref:`advanced indexing <indexing.advanced>` ``ix`` method, you
 may select along more than one axis using boolean vectors combined with other
 indexing expressions.

-Indexing a DataFrame with a boolean DataFrame
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Where and Masking
+~~~~~~~~~~~~~~~~~

-You may wish to set values on a DataFrame based on some boolean criteria
-derived from itself or another DataFrame or set of DataFrames. This can be done
-intuitively like so:
+Selecting values from a Series with a boolean vector generally returns a subset of the data.
+To guarantee that the selection output has the same shape as the original data, you can use
+the ``where`` method in ``Series`` and ``DataFrame``.

 .. ipython:: python

+   # return only the selected rows
+   s[s > 0]
+
+   # return a Series of the same shape as the original
+   s.where(s > 0)
+
+Selecting values from a DataFrame with a boolean criterion now also preserves input data shape.
+``where`` is used under the hood as the implementation.
+
+.. ipython:: python
+
+   # return a DataFrame of the same shape as the original
+   # this is equivalent to ``df.where(df < 0)``
+   df[df < 0]
+
+In addition, ``where`` takes an optional ``other`` argument for replacement of values where the
+condition is False, in the returned copy.
+
+.. ipython:: python
+
+   df.where(df < 0, -df)
+
+You may wish to set values based on some boolean criteria.
+This can be done intuitively like so:
+
+.. ipython:: python
+
+   s2 = s.copy()
+   s2[s2 < 0] = 0
+   s2
+
    df2 = df.copy()
-   df2 < 0
    df2[df2 < 0] = 0
    df2

-Note that such an operation requires that the boolean DataFrame is indexed
-exactly the same.
+Furthermore, ``where`` aligns the input boolean condition (ndarray or DataFrame), so that
+partial selection with setting is possible. This is analogous to partial setting via ``.ix``
+(but on the contents rather than the axis labels).
+
+.. ipython:: python
+
+   df2 = df.copy()
+   df2[df2[1:4] > 0] = 3
+   df2
+
+By default, ``where`` returns a modified copy of the data. There is an optional parameter ``inplace``
+so that the original data can be modified without creating a copy:
+
+.. ipython:: python
+
+   df_orig = df.copy()
+
+   df_orig.where(df > 0, -df, inplace=True);
+
+   df_orig
+
+``mask`` is the inverse boolean operation of ``where``.
+
+.. ipython:: python
+
+   s.mask(s >= 0)
+
+   df.mask(df >= 0)


 Take Methods
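
One way to read the last two examples in this hunk: ``mask(cond)`` behaves like ``where`` with the condition inverted. A one-line sanity check, assuming the same ``s`` as above:

    s.mask(s >= 0)      # NaN wherever s >= 0
    s.where(~(s >= 0))  # equivalent result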

doc/source/r_interface.rst (+8 −4)

@@ -15,10 +15,14 @@ rpy2 / R interface
 If your computer has R and rpy2 (> 2.2) installed (which will be left to the
 reader), you will be able to leverage the below functionality. On Windows,
 doing this is quite an ordeal at the moment, but users on Unix-like systems
-should find it quite easy. rpy2 evolves in time and the current interface is
-designed for the 2.2.x series, and we recommend to use over other series
-unless you are prepared to fix parts of the code. Released packages are available
-in PyPi, but should the latest code in the 2.2.x series be wanted it can be obtained with:
+should find it quite easy. rpy2 evolves over time and is currently approaching
+its 2.3 release, while the current interface is designed for the 2.2.x series.
+We recommend using 2.2.x over other series unless you are prepared to fix parts
+of the code. That said, rpy2 2.3.0 introduces improvements such as a better
+R-Python bridge memory management layer, so it might be a good idea to bite
+the bullet and submit patches for the few minor differences that need to be
+fixed.
+

 ::

pandas/core/algorithms.py (+11 −11)

@@ -117,16 +117,17 @@ def factorize(values, sort=False, order=None, na_sentinel=-1):
     """
     values = np.asarray(values)
     is_datetime = com.is_datetime64_dtype(values)
-    hash_klass, values = _get_data_algo(values, _hashtables)
+    (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)

-    uniques = []
     table = hash_klass(len(values))
-    labels, counts = table.get_labels(values, uniques, 0, na_sentinel)
+    uniques = vec_klass()
+    labels = table.get_labels(values, uniques, 0, na_sentinel)

     labels = com._ensure_platform_int(labels)

-    uniques = com._asarray_tuplesafe(uniques)
-    if sort and len(counts) > 0:
+    uniques = uniques.to_array()
+
+    if sort and len(uniques) > 0:
         sorter = uniques.argsort()
         reverse_indexer = np.empty(len(sorter), dtype=np.int_)
         reverse_indexer.put(sorter, np.arange(len(sorter)))

@@ -136,12 +137,11 @@ def factorize(values, sort=False, order=None, na_sentinel=-1):
         np.putmask(labels, mask, -1)

         uniques = uniques.take(sorter)
-        counts = counts.take(sorter)

     if is_datetime:
-        uniques = np.array(uniques, dtype='M8[ns]')
+        uniques = uniques.view('M8[ns]')

-    return labels, uniques, counts
+    return labels, uniques


 def value_counts(values, sort=True, ascending=False):

@@ -325,7 +325,7 @@ def group_position(*args):
 }

 _hashtables = {
-    'float64': lib.Float64HashTable,
-    'int64': lib.Int64HashTable,
-    'generic': lib.PyObjectHashTable
+    'float64': (lib.Float64HashTable, lib.Float64Vector),
+    'int64': (lib.Int64HashTable, lib.Int64Vector),
+    'generic': (lib.PyObjectHashTable, lib.ObjectVector)
 }
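
With this change ``factorize`` returns a ``(labels, uniques)`` pair instead of the old ``(labels, uniques, counts)`` three-tuple, so callers drop the third value; the ``pandas/core/categorical.py`` change below is one such caller update. A minimal sketch of the updated call, on illustrative data:

    import numpy as np
    from pandas.core.algorithms import factorize

    values = np.array(['b', 'a', 'b', 'c'], dtype=object)
    labels, uniques = factorize(values, sort=True)  # counts no longer returned
    # labels  -> array([1, 0, 1, 2])
    # uniques -> array(['a', 'b', 'c'], dtype=object)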

pandas/core/categorical.py (+2 −2)

@@ -53,9 +53,9 @@ def from_array(cls, data):
             labels, levels = data.factorize()
         else:
             try:
-                labels, levels, _ = factorize(data, sort=True)
+                labels, levels = factorize(data, sort=True)
             except TypeError:
-                labels, levels, _ = factorize(data, sort=False)
+                labels, levels = factorize(data, sort=False)

         return Categorical(labels, levels,
                            name=getattr(data, 'name', None))
