
Commit 097756f

Merge remote-tracking branch 'origin/master' into debian-0.8

* origin/master:
  DOC: string args to how in resample pandas-dev#1355
  BLD: turning isreleased to False
  DOC: release notes to close pandas-dev#1349
  ENH: Cython nancorr speeds up DataFrame.corr with method='pearson' by > 100x
  DOC: new parser functionality pandas-dev#1347
  DOC: another pass at release notes
  DOC: what's new
  DOC: release notes, what's new
  DOC: release notes and what's new
  BUG: parser with multiple date col and multiple index col pandas-dev#1344
  TST: additional tests for parsers and minor code cleanup
  ENH: KdePlot with DataFrame pandas-dev#1342
  TST: frame kde and kde with logy pandas-dev#1341
  BUG: respect logy argument in KdePlot, close pandas-dev#1341
  BUG: set_xlim for time series plots pandas-dev#1339
  BUG: PeriodIndex.map tries to get super(DatetimeIndex, self)
  BUG: fixed doc bug that caused latex build to fail
  DOC: cleaned up parser doc string to stop sphinx from complaining
  ENH: better error msg for fillna() with invalid method

2 parents d19fa1c + 8948f49

20 files changed (+626 −156 lines)

RELEASE.rst

Lines changed: 53 additions & 3 deletions
@@ -29,10 +29,21 @@ pandas 0.8.0
 
 **New features**
 
+- New unified DatetimeIndex class for nanosecond-level timestamp data
+- New Timestamp datetime.datetime subclass with easy time zone conversions,
+  and support for nanoseconds
+- New PeriodIndex class for timespans, calendar logic, and Period scalar object
+- High performance resampling of timestamp and period data. New `resample`
+  method of all pandas data structures
+- New frequency names plus shortcut string aliases like '15h', '1h30min'
+- Time series string indexing shorthand (#222)
+- Add week, dayofyear array and other timestamp array-valued field accessor
+  functions to DatetimeIndex
 - Add GroupBy.prod optimized aggregation function and 'prod' fast time series
   conversion method (#1018)
 - Implement robust frequency inference function and `inferred_freq` attribute
   on DatetimeIndex (#391)
+- New ``tz_convert`` methods in Series / DataFrame
 - Convert DatetimeIndexes to UTC if time zones are different in join/setops
   (#864)
 - Add limit argument for forward/backward filling to reindex, fillna,
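For orientation, a minimal sketch of the new time series machinery listed in
this hunk (assuming the 0.8.0-era API; the data is made up):

    import numpy as np
    from pandas import Series, date_range

    # new DatetimeIndex plus shortcut string aliases like '15min'
    rng = date_range('2012-01-01', periods=96, freq='15min')
    ts = Series(np.random.randn(len(rng)), index=rng)

    # new resample method, available on all pandas data structures
    daily = ts.resample('D', how='mean')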
@@ -49,7 +60,7 @@ pandas 0.8.0
 - Can pass list of (name, function) to GroupBy.aggregate to get aggregates in
   a particular order (#610)
 - Can pass dicts with lists of functions or dicts to GroupBy aggregate to do
-  much more flexible multiple function aggregation (#642)
+  much more flexible multiple function aggregation (#642, #610)
 - New ordered_merge functions for merging DataFrames with ordered
   data. Also supports group-wise merging for panel data (#813)
 - Add keys() method to DataFrame
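A sketch of the (name, function) and dict forms of aggregation mentioned
above (#610, #642); the frame and column names are invented:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'val': [1., 2., 3., 4.]})

    # list of (name, function) pairs fixes the output column order
    df.groupby('key')['val'].agg([('total', np.sum), ('average', np.mean)])

    # dict with a list of functions per column
    df.groupby('key').agg({'val': [np.min, np.max]})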
@@ -59,6 +70,14 @@ pandas 0.8.0
 - More flexible multiple function aggregation with GroupBy
 - Add pct_change function to Series/DataFrame
 - Add option to interpolate by Index values in Series.interpolate (#1206)
+- Add ``max_colwidth`` option for DataFrame, defaulting to 50
+- Conversion of DataFrame through rpy2 to R data.frame (#1282)
+- Add keys() method on DataFrame (#1240)
+- Add new ``match`` function to API (similar to R) (#502)
+- Add dayfirst option to parsers (#854)
+- Add ``method`` argument to ``align`` method for forward/backward filling
+  (#216)
+- Add Panel.transpose method for rearranging axes (#695)
 
 **Improvements to existing features**
 
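To illustrate the new ``method`` argument to ``align`` (#216), a small
sketch (the Series.align signature here is inferred from the note above,
so treat it as an assumption):

    import numpy as np
    from pandas import Series, date_range

    idx = date_range('2012-01-01', periods=6, freq='D')
    s1 = Series(np.arange(6.), index=idx)
    s2 = Series([1., 2.], index=idx[[1, 4]])

    # forward-fill ('pad') s2 onto s1's dates while aligning the two
    left, right = s1.align(s2, join='left', method='pad')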
@@ -80,15 +99,29 @@ pandas 0.8.0
   DataFrame.drop_duplicates (#805, #207)
 - More helpful error message when nothing passed to Series.reindex (#1267)
 - Can mix array and scalars as dict-value inputs to DataFrame ctor (#1329)
+- Use DataFrame columns' name for legend title in plots
+- Preserve frequency in DatetimeIndex when possible in boolean indexing
+  operations
+- Promote datetime.date values in data alignment operations (#867)
+- Add ``order`` method to Index classes (#1028)
+- Avoid hash table creation in large monotonic hash table indexes (#1160)
+- Store time zones in HDFStore (#1232)
+- Enable storage of sparse data structures in HDFStore (#85)
+- Enable Series.asof to work with arrays of timestamp inputs
+- Cython implementation of DataFrame.corr speeds up by > 100x (#1349)
 
 **API Changes**
 
+- Frequency name overhaul; WEEKDAY/EOM and rules with @ are deprecated. A
+  get_legacy_offset_name backwards compatibility function has been added
 - Raise ValueError in DataFrame.__nonzero__, so "if df" no longer works
   (#1073)
-- Change BDay (business day) to not normalize dates by default
+- Change BDay (business day) to not normalize dates by default (#506)
 - Remove deprecated DataMatrix name
 - Default merge suffixes for overlap now have underscores instead of periods
   to facilitate tab completion, etc. (#1239)
+- Deprecation of the offset, time_rule, and timeRule parameters throughout the
+  codebase
 
 **Bug fixes**
 
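The Cython ``nancorr`` change (#1349) speeds up the default Pearson path of
``DataFrame.corr`` without changing its interface; a minimal sketch:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame(np.random.randn(1000, 50))
    df.corr()                    # method='pearson', now backed by Cython
    df.corr(method='spearman')   # rank-based methods remain available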
@@ -97,7 +130,7 @@ pandas 0.8.0
 - Fix logical error with February leap year end in YearEnd offset
 - Series([False, nan]) was getting cast to float64 (GH #1074)
 - Fix binary operations between boolean Series and object Series with
-  booleans and NAs (GH #1074)
+  booleans and NAs (GH #1074, #1079)
 - Couldn't assign whole array to column in mixed-type DataFrame via .ix
   (#1142)
 - Fix label slicing issues with float index values (#1167)
@@ -113,6 +146,23 @@ pandas 0.8.0
   to datetime64 representation (#1081, #809)
 - Fix DataFrame.duplicated/drop_duplicates NA value handling (#557)
 - Actually raise exceptions in fast reducer (#1243)
+- Fix various timezone-handling bugs from 0.7.3 (#969)
+- GroupBy on level=0 discarded index name (#1313)
+- Better error message with unmergeable DataFrames (#1307)
+- Series.__repr__ alignment fix with unicode index values (#1279)
+- Better error message if nothing passed to reindex (#1267)
+- More robust NA handling in DataFrame.drop_duplicates (#557)
+- Resolve locale-based and pre-epoch HDF5 timestamp deserialization issues
+  (#973, #1081, #179)
+- Implement Series.repeat (#1229)
+- Fix indexing with namedtuple and other tuple subclasses (#1026)
+- Fix float64 slicing bug (#1167)
+- Parsing integers with commas (#796)
+- Fix groupby improper data type when group consists of one value (#1065)
+- Fix negative variance possibility in nanvar resulting from floating point
+  error (#1090)
+- Consistently set name on groupby pieces (#184)
+- Treat dict return values as Series in GroupBy.apply (#823)
 
 pandas 0.7.3
 ============

doc/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -209,7 +209,7 @@
 latex_documents = [
   ('index', 'pandas.tex',
    u'pandas: powerful Python data analysis toolkit',
-   u'Wes McKinney\n& PyData Development Team', 'manual'),
+   u'Wes McKinney\n\& PyData Development Team', 'manual'),
 ]
 
 # The name of an image file (relative to this directory) to place at the top of

doc/source/io.rst

Lines changed: 178 additions & 17 deletions
@@ -68,35 +68,53 @@ data into a DataFrame object. They can take a number of arguments:
   whitespace.
 - ``header``: row number to use as the column names, and the start of the data.
   Defaults to 0 (first row); specify None if there is no header row.
-- ``names``: List of column names to use. If passed, header will be
-  implicitly set to None.
 - ``skiprows``: A collection of numbers for rows in the file to skip. Can
   also be an integer to skip the first ``n`` rows
-- ``index_col``: column number, or list of column numbers, to use as the
-  ``index`` (row labels) of the resulting DataFrame. By default, it will number
-  the rows without using any column, unless there is one more data column than
-  there are headers, in which case the first column is taken as the index.
-- ``parse_dates``: If True, attempt to parse the index column as dates. False
-  by default.
+- ``index_col``: column number, column name, or list of column numbers/names,
+  to use as the ``index`` (row labels) of the resulting DataFrame. By default,
+  it will number the rows without using any column, unless there is one more
+  data column than there are headers, in which case the first column is taken
+  as the index.
+- ``names``: List of column names to use. If passed, header will be
+  implicitly set to None.
+- ``na_values``: optional list of strings to recognize as NaN (missing values),
+  in addition to a default set.
+- ``parse_dates``: if True then the index will be parsed as dates
+  (False by default). You can specify more complicated options to parse
+  a subset of columns or a combination of columns into a single date column
+  (list of ints or names, list of lists, or dict):
+  [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
+  [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
+  {'foo' : [1, 3]} -> parse columns 1, 3 as a date and call the result 'foo'
+- ``keep_date_col``: if True, then date component columns passed into
+  ``parse_dates`` will be retained in the output (False by default).
 - ``date_parser``: function to use to parse strings into datetime
   objects. If ``parse_dates`` is True, it defaults to the very robust
   ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
-- ``na_values``: optional list of strings to recognize as NaN (missing values),
-  in addition to a default set.
+  You can also use functions from the community-supported date converters in
+  date_converters.py
+- ``dayfirst``: if True then uses the DD/MM international/European date format
+  (False by default)
+- ``thousands``: specifies the thousands separator. If not None, the parser
+  will try to look for it in the output and parse relevant data to integers.
+  Because it has to essentially scan through the data again, this causes a
+  significant performance hit so only use if necessary.
+- ``comment``: denotes the start of a comment and ignores the rest of the line.
+  Currently line commenting is not supported.
 - ``nrows``: Number of rows to read out of the file. Useful to only read a
   small portion of a large file
+- ``iterator``: If True, return a ``TextParser`` to enable reading a file
+  into memory piece by piece
 - ``chunksize``: A number of rows to be used to "chunk" a file into
   pieces. Will cause a ``TextParser`` object to be returned. More on this
   below in the section on :ref:`iterating and chunking <io.chunking>`
-- ``iterator``: If True, return a ``TextParser`` to enable reading a file
-  into memory piece by piece
 - ``skip_footer``: number of lines to skip at bottom of file (default 0)
 - ``converters``: a dictionary of functions for converting values in certain
   columns, where keys are either integers or column labels
 - ``encoding``: a string representing the encoding to use if the contents are
   non-ascii
-- ``verbose`` : show number of NA values inserted in non-numeric columns
-
+- ``verbose``: show number of NA values inserted in non-numeric columns
+- ``squeeze``: if True then output with only one column is turned into a Series
 
 .. ipython:: python
    :suppress:
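A short sketch combining several of the parser options listed above (the
file name, column name, and sentinel values are hypothetical):

    from pandas import read_csv

    df = read_csv('data.csv',
                  na_values=['NIL', '-999'],   # extra strings treated as NaN
                  converters={'price': lambda x: float(x.strip('$'))},
                  nrows=100)                   # only read the first 100 rows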
@@ -117,8 +135,22 @@ The default for `read_csv` is to create a DataFrame with simple numbered rows:
 
    read_csv('foo.csv')
 
-In the case of indexed data, you can pass the column number (or a list of
-column numbers, for a hierarchical index) you wish to use as the index.
+In the case of indexed data, you can pass the column number or column name you
+wish to use as the index:
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col=0)
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col='date')
+
+You can also use a list of columns to create a hierarchical index:
+
+.. ipython:: python
+
+   read_csv('foo.csv', index_col=[0, 'A'])
 
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -127,6 +159,9 @@ columns will come through as object dtype as with the rest of pandas objects.
 
 .. _io.parse_dates:
 
+Specifying Date Columns
+~~~~~~~~~~~~~~~~~~~~~~~
+
 To better facilitate working with datetime data, :func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`
 use the keyword arguments ``parse_dates`` and ``date_parser`` to allow users
 to specify a variety of columns and date/time formats to turn the input text
@@ -139,6 +174,7 @@ The simplest case is to just pass in ``parse_dates=True``:
 
    # Use a column as an index, and parse it as dates.
    df = read_csv('foo.csv', index_col=0, parse_dates=True)
    df
+
   # These are python datetime objects
   df.index
@@ -184,6 +220,12 @@ to retain them via the ``keep_date_col`` keyword:
                   keep_date_col=True)
    df
 
+Note that if you wish to combine multiple columns into a single date column, a
+nested list must be used. In other words, ``parse_dates=[1, 2]`` indicates that
+the second and third columns should each be parsed as separate date columns
+while ``parse_dates=[[1, 2]]`` means the two columns should be parsed into a
+single column.
187229
You can also use a dict to specify custom name columns:
188230

189231
.. ipython:: python
@@ -192,6 +234,8 @@ You can also use a dict to specify custom name columns:
 
    df = read_csv('tmp.csv', header=None, parse_dates=date_spec)
    df
 
+Date Parsing Functions
+~~~~~~~~~~~~~~~~~~~~~~
 Finally, the parser allows you to specify a custom ``date_parser`` function to
 take full advantage of the flexibility of the date parsing API:
 
@@ -204,7 +248,124 @@ take full advantage of the flexibility of the date parsing API:
 
 You can explore the date parsing functionality in ``date_converters.py`` and
 add your own. We would love to turn this module into a community supported set
-of date/time parsers.
+of date/time parsers. To get you started, ``date_converters.py`` contains
+functions to parse dual date and time columns, year/month/day columns,
+and year/month/day/hour/minute/second columns. It also contains a
+``generic_parser`` function so you can curry it with a function that deals with
+a single date rather than the entire array.
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+.. _io.convenience:
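As a sketch, one of the bundled converters can be plugged straight into
``date_parser`` (the function name is from pandas.io.date_converters as of
this release; treat the exact signature as an assumption):

    from pandas import read_csv
    import pandas.io.date_converters as conv

    # combine year/month/day columns 0, 1, 2 into a single 'date' column
    df = read_csv('tmp.csv', header=None,
                  parse_dates={'date': [0, 1, 2]},
                  date_parser=conv.parse_date_fields)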
+
+Thousand Separators
+~~~~~~~~~~~~~~~~~~~
+For large integers that have been written with a thousands separator, you can
+set the ``thousands`` keyword to the separator character so that integers will
+be parsed correctly:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("ID|level|category\n"
+           "Patient1|123,000|x\n"
+           "Patient2|23,000|y\n"
+           "Patient3|1,234,018|z")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+By default, integers with a thousands separator will be parsed as strings
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+   df = read_csv('tmp.csv', sep='|')
+   df
+
+   df.level.dtype
+
+The ``thousands`` keyword allows integers to be parsed correctly
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+   df = read_csv('tmp.csv', sep='|', thousands=',')
+   df
+
+   df.level.dtype
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+Comments
+~~~~~~~~
+Sometimes comments or meta data may be included in a file:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("ID,level,category\n"
+           "Patient1,123000,x # really unpleasant\n"
+           "Patient2,23000,y # wouldn't take his medicine\n"
+           "Patient3,1234018,z # awesome")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+
+By default, the parser includes the comments in the output:
+
+.. ipython:: python
+
+   df = read_csv('tmp.csv')
+   df
+
+We can suppress the comments using the ``comment`` keyword:
+
+.. ipython:: python
+
+   df = read_csv('tmp.csv', comment='#')
+   df
+
+.. ipython:: python
+   :suppress:
+
+   os.remove('tmp.csv')
+
+Returning Series
+~~~~~~~~~~~~~~~~
+
+Using the ``squeeze`` keyword, the parser will return output with a single
+column as a ``Series``:
+
+.. ipython:: python
+   :suppress:
+
+   data = ("level\n"
+           "Patient1,123000\n"
+           "Patient2,23000\n"
+           "Patient3,1234018")
+
+   with open('tmp.csv', 'w') as fh:
+       fh.write(data)
+
+.. ipython:: python
+
+   print open('tmp.csv').read()
+
+   output = read_csv('tmp.csv', squeeze=True)
+   output
+
+   type(output)
 
 .. ipython:: python
    :suppress:

doc/source/timeseries.rst

Lines changed: 4 additions & 0 deletions
@@ -596,6 +596,10 @@ an array and produces aggregated values:
 
    ts.resample('5Min', how=np.max)
 
+Any function available via :ref:`dispatching <groupby.dispatch>` can be given to
+the ``how`` parameter by name, including ``sum``, ``mean``, ``std``, ``max``,
+``min``, ``median``, ``first``, ``last``, ``ohlc``.
+
 For downsampling, ``closed`` can be set to 'left' or 'right' to specify which
 end of the interval is closed:
