continued on 0.6.0 docs

adamklein · wesm · commit 602bd19b4165 · 2011-12-22T16:31:46.000-05:00
diff --git a/doc/source/basics.rst b/doc/source/basics.rst
@@ -27,6 +27,21 @@ the previous section:
               major_axis=DateRange('1/1/2000', periods=5),
               minor_axis=['A', 'B', 'C', 'D'])
 
+.. _basics.head_tail:
+
+Head and Tail
+-------------
+
+To view a small sample of a Series or DataFrame object, use the ``head`` and
+``tail`` methods. The default number of elements to display is five, but you
+may pass a custom number.
+
+.. ipython:: python
+
+   long_series = Series(randn(1000))
+   long_series.head()
+   long_series.tail(3)
+
 .. _basics.attrs:
 
 Attributes and the raw ndarray(s)
@@ -76,15 +91,15 @@ unlike the axis labels, cannot be assigned to.
 Flexible binary operations
 --------------------------
 
-With binary operations between pandas data structures, we have a couple items
+With binary operations between pandas data structures, there are two key points
 of interest:
 
-  * How to describe broadcasting behavior between higher- (e.g. DataFrame) and
+  * Broadcasting behavior between higher- (e.g. DataFrame) and
     lower-dimensional (e.g. Series) objects.
-  * Behavior of missing data in computations
+  * Missing data in computations
 
-We will demonstrate the currently-available functions to illustrate these
-issues independently, though they can be performed simultaneously.
+We will demonstrate how to manage these issues independently, though they can
+be handled simultaneously.
 
 Matching / broadcasting behavior
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -179,6 +194,20 @@ function implementing this operation is ``combine_first``, which we illustrate:
    df2
    df1.combine_first(df2)
 
+General DataFrame Combine
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``combine_first`` method above calls the more general DataFrame method
+``combine``. This method takes another DataFrame and a combiner function,
+aligns the input DataFrame with the calling one, and then passes the combiner
+function pairs of Series (i.e., columns whose names are the same).
+
+So, for instance, to reproduce ``combine_first`` as above:
+
+.. ipython:: python
+
+   combiner = lambda x, y: np.where(isnull(x), y, x)
+   df1.combine(df2, combiner)
 
 .. _basics.stats:
 
diff --git a/doc/source/dsintro.rst b/doc/source/dsintro.rst
@@ -16,7 +16,7 @@ objects. To get started, import numpy and load pandas into your namespace:
    import numpy as np
    from pandas import *
    randn = np.random.randn
-   np.set_printoptions(precision=4, suppress=True)
+   np.set_printoptions(precision=4, suppress=True, max_columns=10)
 
 .. ipython:: python
 
@@ -455,6 +455,19 @@ Operations with scalars are just as you would expect:
    1 / df
    df ** 4
 
+.. _dsintro.boolean:
+
+As of 0.6, the boolean operators ``&``, ``|``, ``^``, and unary ``-`` work on
+DataFrame:
+
+.. ipython:: python
+
+   df1 = DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
+   df2 = DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
+   df1 & df2
+   df1 | df2
+   df1 ^ df2
+   -df1
+
 Transposing
 ~~~~~~~~~~~
 
diff --git a/doc/source/groupby.rst b/doc/source/groupby.rst
@@ -177,6 +177,13 @@ number:
 
    s.groupby(level='second').sum()
 
+As of v0.6, the aggregation functions such as ``sum`` will take the level
+parameter directly:
+
+.. ipython:: python
+
+   s.sum(level='second')
+
 More on the ``sum`` function and aggregation later. Grouping with multiple
 levels (as opposed to a single level) is not yet supported, though implementing
 it is not difficult.
diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst
@@ -22,12 +22,12 @@ The axis labeling information in pandas objects serves many purposes:
   - Enables automatic and explicit data alignment
   - Allows intuitive getting and setting of subsets of the data set
 
-In this section / chapter, we will focus on the latter set of functionality,
-namely how to slice, dice, and generally get and set subsets of pandas
-objects. The primary focus will be on Series and DataFrame as they have
-received more development attention in this area. More work will be invested in
-Panel and future higher-dimensional data structures in the future, especially
-in label-based advanced indexing.
+In this section / chapter, we will focus on the final point: namely, how to
+slice, dice, and generally get and set subsets of pandas objects. The primary
+focus will be on Series and DataFrame as they have received more development
+attention in this area. Expect more work to be invested in higher-dimensional
+data structures (including Panel) in the future, especially in label-based
+advanced indexing.
 
 .. _indexing.basics:
 
@@ -115,19 +115,16 @@ label, respectively.
    panel.major_xs(date)
    panel.minor_xs('A')
 
-.. note::
-
-   See :ref:`advanced indexing <indexing.advanced>` below for an alternate and
-   more concise way of doing the same thing.
-
 Slicing ranges
 ~~~~~~~~~~~~~~
 
-:ref:`Advanced indexing <indexing.advanced>` detailed below is the most robust
-and consistent way of slicing ranges, e.g. ``obj[5:10]``, across all of the data
-structures and their axes (except in the case of integer labels, more on that
-later). On Series, this syntax works exactly as expected as with an ndarray,
-returning a slice of the values and the corresponding labels:
+The most robust and consistent way of slicing ranges along arbitrary axes is
+described in the :ref:`Advanced indexing <indexing.advanced>` section detailing
+the ``.ix`` method. For now, we explain the semantics of slicing using the
+``[]`` operator.
+
+With Series, the syntax works exactly as with an ndarray, returning a slice of
+the values and the corresponding labels:
 
 .. ipython:: python
 
@@ -154,28 +151,37 @@ largely as a convenience since it is such a common operation.
 Boolean indexing
 ~~~~~~~~~~~~~~~~
 
-Using a boolean vector to index a Series works exactly like an ndarray:
+.. _indexing.boolean:
+
+Using a boolean vector to index a Series works exactly as in a numpy ndarray:
 
 .. ipython:: python
 
    s[s > 0]
    s[(s < 0) & (s > -0.5)]
 
-Again as a convenience, selecting rows from a DataFrame using a boolean vector
-the same length as the DataFrame's index (for example, something derived from
-one of the columns of the DataFrame) is supported:
+You may select rows from a DataFrame using a boolean vector the same length as
+the DataFrame's index (for example, something derived from one of the columns
+of the DataFrame):
 
 .. ipython:: python
 
    df[df['A'] > 0]
 
-As we will see later on, the same operation could be accomplished by
-reindexing. However, the syntax would be more verbose; hence, the inclusion of
-this indexing method.
+Consider the ``isin`` method of Series, which returns a boolean vector that is
+true wherever the Series elements exist in the passed list. This allows you to
+select out rows where one or more columns have values you want:
+
+.. ipython:: python
+
+   df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
+                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
+                    'c' : np.random.randn(7)})
+   df2[df2['a'].isin(['one', 'two'])]
 
-With the advanced indexing capabilities discussed later, you are able to do
-boolean indexing in any of axes or combine a boolean vector with an indexing
-expression on one of the other axes
+Note that with the :ref:`advanced indexing <indexing.advanced>` ``ix`` method,
+you may select along more than one axis using boolean vectors combined with
+other indexing expressions.
 
 Indexing a DataFrame with a boolean DataFrame
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -202,19 +208,32 @@ Take Methods
 
 TODO: Fill Me In
 
-
-Slicing ranges
+Duplicate Data
 ~~~~~~~~~~~~~~
 
-Similar to Python lists and ndarrays, for convenience DataFrame
-supports slicing:
+.. _indexing.duplicate:
 
-.. ipython:: python
+If you want to identify and remove duplicate rows in a DataFrame, there are
+two methods that will help: ``duplicated`` and ``drop_duplicates``. Each
+takes as an argument the columns to use to identify duplicated rows.
+
+``duplicated`` returns a boolean vector whose length is the number of rows, and
+which indicates whether a row is duplicated.
 
-    df[:2]
-    df[::-1]
-    df[-3:].T
+``drop_duplicates`` removes duplicate rows.
+
+By default, the first observed row of a duplicate set is considered unique, but
+each method has a ``take_last`` parameter that indicates the last observed row
+should be taken instead.
+
+.. ipython:: python
 
+   df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
+                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
+                    'c' : np.random.randn(7)})
+   df2.duplicated(['a','b'])
+   df2.drop_duplicates(['a','b'])
+   df2.drop_duplicates(['a','b'], take_last=True)
 
 .. _indexing.advanced:
 
diff --git a/doc/source/io.rst b/doc/source/io.rst
@@ -57,6 +57,9 @@ data into a DataFrame object. They can take a number of arguments:
     below in the section on :ref:`iterating and chunking <io.chunking>`
   - ``iterator``: If True, return a ``TextParser`` to enable reading a file
     into memory piece by piece
+  - ``skip_footer``: number of lines to skip at bottom of file (default 0)
+  - ``converters``: a dictionary of functions for converting values in certain
+    columns, where keys are either integers or column labels
 
 .. ipython:: python
    :suppress:
diff --git a/doc/source/whatsnew/v0.6.0.rst b/doc/source/whatsnew/v0.6.0.rst
@@ -6,18 +6,18 @@ v.0.6.0 (November 25, 2011)
 New Features
 ~~~~~~~~~~~~
 - Add ``melt`` function to ``pandas.core.reshape``
-- Add ``level`` parameter to group by level in Series and DataFrame descriptive statistics (PR313_)
-- Add ``head`` and ``tail`` methods to Series, analogous to to DataFrame (PR296_)
-- Add ``Series.isin`` function which checks if each value is contained in a passed sequence (GH289_)
+- :ref:`Added <groupby.multindex>` ``level`` parameter to group by level in Series and DataFrame descriptive statistics (PR313_)
+- :ref:`Added <basics.head_tail>` ``head`` and ``tail`` methods to Series, analogous to DataFrame (PR296_)
+- :ref:`Added <indexing.boolean>` ``Series.isin`` function which checks if each value is contained in a passed sequence (GH289_)
 - Add ``float_format`` option to ``Series.to_string``
-- MAYBE DOCUMENTED? Add ``skip_footer`` (GH291_) and ``converters`` (GH343_) options to ``read_csv`` and ``read_table``
-- Add proper, tested weighted least squares to standard and panel OLS (GH303_)
-- Add ``drop_duplicates`` and ``duplicated`` functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319_)
-- Implement logical (boolean) operators '&', '|', '^', '~' on DataFrame (GH347_)
+- :ref:`Added <io.parse_dates>` ``skip_footer`` (GH291_) and ``converters`` (GH343_) options to ``read_csv`` and ``read_table``
+- Added proper, tested weighted least squares to standard and panel OLS (GH303_)
+- :ref:`Added <indexing.duplicate>` ``drop_duplicates`` and ``duplicated`` functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319_)
+- :ref:`Implemented <dsintro.boolean>` operators '&', '|', '^', '-' on DataFrame (GH347_)
 - MAYBE ? Add ``Series.mad``, mean absolute deviation, matching DataFrame
 - MAYBE? Add ``QuarterEnd`` DateOffset (PR321_)
 - Add matrix multiplication function ``dot`` to DataFrame (GH65_)
-- Add ``orient``5 option to ``Panel.from_dict`` to ease creation of mixed-type Panels (GH359_, GH301_)
+- Add ``orient`` option to ``Panel.from_dict`` to ease creation of mixed-type Panels (GH359_, GH301_)
 - Add ``DataFrame.from_dict`` with similar ``orient`` option
 - Can now pass list of tuples or list of lists to ``DataFrame.from_records`` for fast conversion to DataFrame (GH357_)
 - Can pass multiple levels to groupby, e.g. ``df.groupby(level=[0, 1])`` (GH103_)