Commit 77c017f

Merge branch 'master' into debian
* master: (313 commits)
  TST: more Python 2.5 sadness
  TST: Python 2.5 float formatting changed
  TST: cast to i8 when checking margins
  BUG: DataFrame.join on keys produce wrong result, does not preserve order
  DOC: release notes
  ENH: xs level can take multiple levels, pass multiple levels to MultiIndex.droplevel, GH pandas-dev#371
  BUG: fix bugs related to comments in pandas-dev#371
  BUG: fix TextParser with list buglet, enable parsing of DataFrame output with index names
  BUG: convert tuples in concat to MultiIndex
  BUG: don't lose index names when adding row margin
  ENH: add margins to crosstab
  ENH: add crosstab function and test
  ENH: crosstab prototype function, API needs fleshing out, GH pandas-dev#170
  BUG: fix buglet with xs with level, GH pandas-dev#371
  TST: add test_sql.py module
  TST: testing, cleanup of io.sql module
  TST: indexing testing with minor Series.__getitem__ refactoring
  ENH: hack toward pandas-dev#629
  BUG: check for non-contiguous memory in SeriesGrouper, causing segfault
  ENH: add ability to pass list of dicts to DataFrame.append (GH pandas-dev#464)
  ...
2 parents 2ca93a1 + 195ec30 commit 77c017f

139 files changed: +16081 -4680 lines changed

.gitignore

+4 -1
@@ -11,6 +11,9 @@ MANIFEST
 pandas/version.py
 doc/source/generated
 doc/source/_static
+doc/source/vbench
+doc/source/vbench.rst
 *flymake*
 scikits
-.coverage
+.coverage
+pandas.egg-info

RELEASE.rst

+219
@@ -22,6 +22,224 @@ Where to get it
 * Binary installers on PyPI: http://pypi.python.org/pypi/pandas
 * Documentation: http://pandas.sourceforge.net
 
+pandas 0.7.0
+============
+
+**Release date:** NOT YET RELEASED
+
+**New features / modules**
+
+- New ``merge`` function for efficiently performing full gamut of database /
+  relational-algebra operations. Refactored existing join methods to use the
+  new infrastructure, resulting in substantial performance gains (GH #220,
+  #249, #267)
+- New ``concat`` function for concatenating DataFrame or Panel objects along
+  an axis. Can form union or intersection of the other axes. Improves
+  performance of ``DataFrame.append`` (#468, #479, #273)
+- Handle differently-indexed output values in ``DataFrame.apply`` (GH #498)
+- Can pass list of dicts (e.g., a list of shallow JSON objects) to DataFrame
+  constructor (GH #526)
+- Add ``reorder_levels`` method to Series and DataFrame (PR #534)
+- Add dict-like ``get`` function to DataFrame and Panel (PR #521)
+- ``DataFrame.iterrows`` method for efficiently iterating through the rows of
+  a DataFrame
+- Added ``DataFrame.to_panel`` with code adapted from ``LongPanel.to_long``
+- ``reindex_axis`` method added to DataFrame
+- Add ``level`` option to binary arithmetic functions on ``DataFrame`` and
+  ``Series``
+- Add ``level`` option to the ``reindex`` and ``align`` methods on Series and
+  DataFrame for broadcasting values across a level (GH #542, PR #552, others)
+- Add attribute-based item access to ``Panel`` and add IPython completion (PR
+  #554)
+- Add ``logy`` option to ``Series.plot`` for log-scaling on the Y axis
+- Add ``index``, ``header``, and ``justify`` options to
+  ``DataFrame.to_string``. Add option to (GH #570, GH #571)
+- Can pass multiple DataFrames to ``DataFrame.join`` to join on index (GH #115)
+- Can pass multiple Panels to ``Panel.join`` (GH #115)
+- Can pass multiple DataFrames to `DataFrame.append` to concatenate (stack)
+  and multiple Series to ``Series.append`` too
+- Added ``justify`` argument to ``DataFrame.to_string`` to allow different
+  alignment of column headers
+- Add ``sort`` option to GroupBy to allow disabling sorting of the group keys
+  for potential speedups (GH #595)
+- Can pass MaskedArray to Series constructor (PR #563)
+- Add Panel item access via attributes and IPython completion (GH #554)
+- Implement ``DataFrame.lookup``, fancy-indexing analogue for retrieving
+  values given a sequence of row and column labels (GH #338)
+- Add ``verbose`` option to ``read_csv`` and ``read_table`` to show number of
+  NA values inserted in non-numeric columns (GH #614)
+- Can pass a list of dicts or Series to ``DataFrame.append`` to concatenate
+  multiple rows (GH #464)
+- Add ``level`` argument to ``DataFrame.xs`` for selecting data from other
+  MultiIndex levels. Can take one or more levels with potentially a tuple of
+  keys for flexible retrieval of data (GH #371, GH #629)
+- New ``crosstab`` function for easily computing frequency tables (GH #170)
+
+**API Changes**
+
+- Label-indexing with integer indexes now raises KeyError if a label is not
+  found instead of falling back on location-based indexing
+- Label-based slicing via ``ix`` or ``[]`` on Series will now only work if
+  exact matches for the labels are found or if the index is monotonic (for
+  range selections)
+- Label-based slicing and sequences of labels can be passed to ``[]`` on a
+  Series for both getting and setting (GH #86)
+- `[]` operator (``__getitem__`` and ``__setitem__``) will raise KeyError
+  with integer indexes when an index is not contained in the index. The prior
+  behavior would fall back on position-based indexing if a key was not found
+  in the index which would lead to subtle bugs. This is now consistent with
+  the behavior of ``.ix`` on DataFrame and friends (GH #328)
+- Rename ``DataFrame.delevel`` to ``DataFrame.reset_index`` and add
+  deprecation warning
+- `Series.sort` (an in-place operation) called on a Series which is a view on
+  a larger array (e.g. a column in a DataFrame) will generate an Exception to
+  prevent accidentally modifying the data source (GH #316)
+- Refactor to remove deprecated ``LongPanel`` class (PR #552)
+- Deprecated ``Panel.to_long``, renamed to ``to_frame``
+- Deprecated ``colSpace`` argument in ``DataFrame.to_string``, renamed to
+  ``col_space``
+- Rename ``precision`` to ``accuracy`` in engineering float formatter (GH
+  #395)
+
+**Improvements to existing features**
+
+- Better error message in DataFrame constructor when passed column labels
+  don't match data (GH #497)
+- Substantially improve performance of multi-GroupBy aggregation when a
+  Python function is passed, reuse ndarray object in Cython (GH #496)
+- Can store objects indexed by tuples and floats in HDFStore (GH #492)
+- Don't print length by default in Series.to_string, add `length` option (GH
+  #489)
+- Improve Cython code for multi-groupby to aggregate without having to sort
+  the data (GH #93)
+- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex,
+  test for backwards unpickling compatibility
+- Improve column reindexing performance by using specialized Cython take
+  function
+- Further performance tweaking of Series.__getitem__ for standard use cases
+- Avoid Index dict creation in some cases (i.e. when getting slices, etc.),
+  regression from prior versions
+- Friendlier error message in setup.py if NumPy not installed
+- Use common set of NA-handling operations (sum, mean, etc.) in Panel class
+  also (GH #536)
+- Default name assignment when calling ``reset_index`` on DataFrame with a
+  regular (non-hierarchical) index (GH #476)
+- Use Cythonized groupers when possible in Series/DataFrame stat ops with
+  ``level`` parameter passed (GH #545)
+- Ported skiplist data structure to C to speed up ``rolling_median`` by about
+  5-10x in most typical use cases (GH #374)
+- Some performance enhancements in constructing a Panel from a dict of
+  DataFrame objects
+- Made ``Index._get_duplicates`` a public method by removing the underscore
+- Prettier printing of floats, and column spacing fix (GH #395, GH #571)
+- Add ``bold_rows`` option to DataFrame.to_html (GH #586)
+- Improve the performance of ``DataFrame.sort_index`` by up to 5x or more
+  when sorting by multiple columns
+- Substantially improve performance of DataFrame and Series constructors when
+  passed a nested dict or dict, respectively (GH #540, GH #621)
+- Modified setup.py so that pip / setuptools will install dependencies (GH
+  #507, various pull requests)
+- Unstack called on DataFrame with non-MultiIndex will return Series (GH
+  #477)
+- Improve DataFrame.to_string and console formatting to be more consistent in
+  the number of displayed digits (GH #395)
+- Use bottleneck if available for performing NaN-friendly statistical
+  operations that it implemented (GH #91)
+- Can pass a list of functions to aggregate with groupby on a DataFrame,
+  yielding an aggregated result with hierarchical columns (GH #166)
+- Monkey-patch context to traceback in ``DataFrame.apply`` to indicate which
+  row/column the function application failed on (GH #614)
+- Improved ability of read_table and read_clipboard to parse
+  console-formatted DataFrames (can read the row of index names, etc.)
+
+**Bug fixes**
+
+- Raise exception in out-of-bounds indexing of Series instead of
+  seg-faulting, regression from earlier releases (GH #495)
+- Fix error when joining DataFrames of different dtypes within the same
+  typeclass (e.g. float32 and float64) (GH #486)
+- Fix bug in Series.min/Series.max on objects like datetime.datetime (GH
+  #487)
+- Preserve index names in Index.union (GH #501)
+- Fix bug in Index joining causing subclass information (like DateRange type)
+  to be lost in some cases (GH #500)
+- Accept empty list as input to DataFrame constructor, regression from 0.6.0
+  (GH #491)
+- Can output DataFrame and Series with ndarray objects in a dtype=object
+  array (GH #490)
+- Return empty string from Series.to_string when called on empty Series (GH
+  #488)
+- Fix exception passing empty list to DataFrame.from_records
+- Fix Index.format bug (excluding name field) with datetimes with time info
+- Fix scalar value access in Series to always return NumPy scalars,
+  regression from prior versions (GH #510)
+- Handle rows skipped at beginning of file in read_* functions (GH #505)
+- Handle improper dtype casting in ``set_value`` methods
+- Unary '-' / __neg__ operator on DataFrame was returning integer values
+- Unbox 0-dim ndarrays from certain operators like all, any in Series
+- Fix handling of missing columns (was combine_first-specific) in
+  DataFrame.combine for general case (GH #529)
+- Fix type inference logic with boolean lists and arrays in DataFrame indexing
+- Use centered sum of squares in R-square computation if entity_effects=True
+  in panel regression
+- Handle all NA case in Series.{corr, cov}, was raising exception (GH #548)
+- Aggregating by multiple levels with ``level`` argument to DataFrame, Series
+  stat method, was broken (GH #545)
+- Fix Cython buf when converter passed to read_csv produced a numeric array
+  (buffer dtype mismatch when passed to Cython type inference function) (GH
+  #546)
+- Fix exception when setting scalar value using .ix on a DataFrame with a
+  MultiIndex (GH #551)
+- Fix outer join between two DateRanges with different offsets that returned
+  an invalid DateRange
+- Cleanup DataFrame.from_records failure where index argument is an integer
+- Fix Data.from_records failure when passed a dictionary
+- Fix NA handling in {Series, DataFrame}.rank with non-floating point dtypes
+- Fix bug related to integer type-checking in .ix-based indexing
+- Handle non-string index name passed to DataFrame.from_records
+- DataFrame.insert caused the columns name(s) field to be discarded (GH #527)
+- Fix erroneous in monotonic many-to-one left joins
+- Fix DataFrame.to_string to remove extra column white space (GH #571)
+- Format floats to default to same number of digits (GH #395)
+- Added decorator to copy docstring from one function to another (GH #449)
+- Fix error in monotonic many-to-one left joins
+- Fix __eq__ comparison between DateOffsets with different relativedelta
+  keywords passed
+- Fix exception caused by parser converter returning strings (GH #583)
+- Fix MultiIndex formatting bug with integer names (GH #601)
+- Fix bug in handling of non-numeric aggregates in Series.groupby (GH #612)
+- Fix TypeError with tuple subclasses (e.g. namedtuple) in
+  DataFrame.from_records (GH #611)
+- Catch misreported console size when running IPython within Emacs
+- Fix minor bug in pivot table margins, loss of index names and length-1
+  'All' tuple in row labels
+
+Thanks
+------
+- Craig Austin
+- Marius Cobzarenco
+- Mario Gamboa-Cavazos
+- Arthur Gerigk
+- Yaroslav Halchenko
+- Jeff Hammerbacher
+- Matt Harrison
+- Andreas Hilboll
+- Luc Kesters
+- Adam Klein
+- Gregg Lind
+- Solomon Negusse
+- Wouter Overmeire
+- Christian Prinoth
+- Sam Reckoner
+- Craig Reeson
+- Jan Schulz
+- Ted Square
+- Graham Taylor
+- Chris Uga
+- Dieter Vandenbussche
+- Texas P.
+- Pinxing Ye
+
 pandas 0.6.1
 ============
 
@@ -85,6 +303,7 @@ pandas 0.6.1
 - MultiIndex.get_level_values can take the level name
 - More helpful error message when DataFrame.plot fails on one of the columns
   (GH #478)
+- Improve performance of DataFrame.{index, columns} attribute lookup
 
 **Bug fixes**
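
Reviewer note: the 0.7.0 notes above introduce ``merge``, ``concat`` and ``crosstab``. The following is a minimal sketch of how those three functions behave, written against the current public pandas API rather than the 0.7.0 code in this commit, so signatures and defaults may differ from what this merge shipped. The frames and column names (``left``, ``right``, ``key``, ``lval``, ``rval``, ``sex``, ``hand``) are made up for illustration.

```python
import pandas as pd

# Two small illustrative frames sharing a "key" column (hypothetical data).
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Database-style inner join: only keys present on both sides survive.
inner = pd.merge(left, right, on="key", how="inner")

# Outer join: keys from either side are kept, gaps are filled with NaN.
outer = pd.merge(left, right, on="key", how="outer")

# Concatenate along the row axis; join="inner" keeps only the columns
# common to every input (the "intersection of the other axes").
stacked = pd.concat([left, right], axis=0, join="inner", ignore_index=True)

# Frequency table of two categorical variables, with row/column totals.
sex = pd.Series(["m", "f", "m", "f", "m"], name="sex")
hand = pd.Series(["r", "r", "l", "r", "r"], name="hand")
table = pd.crosstab(sex, hand, margins=True)

print(inner)
print(outer)
print(stacked)
print(table)
```

With ``margins=True`` the table gains an ``All`` row and column holding the totals, which is the behaviour the "add margins to crosstab" commit in this merge refers to.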

TODO.rst

+7
@@ -1,3 +1,8 @@
+DOCS 0.7.0
+----------
+- no sort in groupby
+- concat with dict
+
 DONE
 ----
 - SparseSeries name integration + tests
@@ -49,3 +54,5 @@ Performance blog
 - Groupby
 - joining
 - Take
+
+git log v0.6.1..master --pretty=format:%aN | sort | uniq -c | sort -rn

bench/bench_groupby.py

+61
@@ -0,0 +1,61 @@
+from pandas import *
+from pandas.util.testing import rands
+
+import string
+import random
+
+k = 20000
+n = 10
+
+foo = np.tile(np.array([rands(10) for _ in xrange(k)], dtype='O'), n)
+foo2 = list(foo)
+random.shuffle(foo)
+random.shuffle(foo2)
+
+df = DataFrame({'A' : foo,
+                'B' : foo2,
+                'C' : np.random.randn(n * k)})
+
+import pandas._sandbox as sbx
+
+def f():
+    table = sbx.StringHashTable(len(df))
+    ret = table.factorize(df['A'])
+    return ret
+def g():
+    table = sbx.PyObjectHashTable(len(df))
+    ret = table.factorize(df['A'])
+    return ret
+
+ret = f()
+
+"""
+import pandas._tseries as lib
+
+f = np.std
+
+
+grouped = df.groupby(['A', 'B'])
+
+label_list = [ping.labels for ping in grouped.groupings]
+shape = [len(ping.ids) for ping in grouped.groupings]
+
+from pandas.core.groupby import get_group_index
+
+
+group_index = get_group_index(label_list, shape).astype('i4')
+
+ngroups = np.prod(shape)
+
+indexer = lib.groupsort_indexer(group_index, ngroups)
+
+values = df['C'].values.take(indexer)
+group_index = group_index.take(indexer)
+
+f = lambda x: x.std(ddof=1)
+
+grouper = lib.Grouper(df['C'], np.ndarray.std, group_index, ngroups)
+result = grouper.get_result()
+
+expected = grouped.std()
+"""
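
Reviewer note: the benchmark above times ``factorize`` on two internal hash tables from ``pandas._sandbox``, a private module that is specific to this commit. As a rough illustration of what factorization produces, here is a sketch using the public ``pandas.factorize`` function from current pandas; it is an analogue of the idea, not the hash-table code being benchmarked, and the sample arrays are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical string keys and matching numeric values.
keys = np.array(["b", "a", "b", "c", "a"], dtype=object)
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# factorize maps each distinct key to an integer code, numbering the
# keys in order of first appearance, and returns the distinct keys.
codes, uniques = pd.factorize(keys)

# The integer codes can drive a grouped reduction without touching
# the original strings again.
group_sums = np.bincount(codes, weights=vals, minlength=len(uniques))

# Indexing the uniques by the codes recovers the original keys.
restored = uniques[codes]

print(codes)
print(uniques)
print(group_sums)
```

Integer codes like these are what make the Cython group-by paths mentioned in the release notes cheap: the string comparison cost is paid once, during factorization.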
