| 1 | +# pandas: powerful Python data analysis toolkit |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## What is it |
| 6 | +**pandas** is a Python package providing fast, flexible, and expressive data |
| 7 | +structures designed to make working with "relational" or "labeled" data both |
| 8 | +easy and intuitive. It aims to be the fundamental high-level building block for |
| 9 | +doing practical, **real world** data analysis in Python. Additionally, it has |
| 10 | +the broader goal of becoming **the most powerful and flexible open source data |
| 11 | +analysis / manipulation tool available in any language**. It is already well on |
| 12 | +its way toward this goal. |
| 13 | + |
| 14 | +## Main Features |
| 15 | +Here are just a few of the things that pandas does well: |
| 16 | + |
| 17 | + - Easy handling of [**missing data**][missing-data] (represented as |
| 18 | + `NaN`) in floating point as well as non-floating point data |
| 19 | + - Size mutability: columns can be [**inserted and |
| 20 | + deleted**][insertion-deletion] from DataFrame and higher dimensional |
| 21 | + objects |
| 22 | + - Automatic and explicit [**data alignment**][alignment]: objects can |
| 23 | + be explicitly aligned to a set of labels, or the user can simply |
| 24 | + ignore the labels and let `Series`, `DataFrame`, etc. automatically |
| 25 | + align the data for you in computations |
| 26 | + - Powerful, flexible [**group by**][groupby] functionality to perform |
| 27 | + split-apply-combine operations on data sets, for both aggregating |
| 28 | + and transforming data |
| 29 | + - [**Easy conversion**][conversion] of ragged, |
| 30 | + differently-indexed data in other Python and NumPy data structures |
| 31 | + into DataFrame objects |
| 32 | + - Intelligent label-based [**slicing**][slicing], [**fancy |
| 33 | + indexing**][fancy-indexing], and [**subsetting**][subsetting] of |
| 34 | + large data sets |
| 35 | + - Intuitive [**merging**][merging] and [**joining**][joining] of data |
| 36 | + sets |
| 37 | + - Flexible [**reshaping**][reshape] and [**pivoting**][pivot-table] of |
| 38 | + data sets |
| 39 | + - [**Hierarchical**][mi] labeling of axes (possible to have multiple |
| 40 | + labels per tick) |
| 41 | + - Robust IO tools for loading data from [**flat files**][flat-files] |
| 42 | + (CSV and delimited), [**Excel files**][excel], [**databases**][db], |
| 43 | + and for saving data to and loading it from the ultrafast [**HDF5 format**][hdfstore] |
| 44 | + - [**Time series**][timeseries]-specific functionality: date range |
| 45 | + generation and frequency conversion, moving window statistics, |
| 46 | + moving window linear regressions, date shifting and lagging, etc. |
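
As a rough illustration of the first few items above (missing data, automatic alignment, and group by), here is a minimal sketch. The sample labels, column names, and values are made up for the example, and it assumes a reasonably recent pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Missing data: NaN marks an absent value, and reductions skip it by default.
s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
n_missing = int(s.isnull().sum())
total = s.sum()

# Automatic alignment: arithmetic matches on labels, not on position.
# Labels present on only one side produce NaN in the result.
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
aligned = left + right

# Group by: split on a key, then aggregate or transform each group.
df = pd.DataFrame({"team": ["x", "y", "x", "y"], "points": [1, 2, 3, 4]})
totals = df.groupby("team")["points"].sum()
centered = df["points"] - df.groupby("team")["points"].transform("mean")

print(aligned)
print(totals)
```

Note that `aligned` carries the union of both indexes, which is why no explicit reindexing step is needed before the addition.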
| 47 | + |
| 48 | + |
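
The merging, reshaping, and time series items in the feature list can be sketched in a similar way. The frames below are invented for illustration, and the `rolling` and `resample` method chains assume a current pandas release; older releases exposed moving window statistics through different functions.

```python
import pandas as pd

# Merging: database-style join on a shared key column.
orders = pd.DataFrame({"customer": ["ann", "bob", "ann"], "amount": [10, 20, 30]})
regions = pd.DataFrame({"customer": ["ann", "bob"], "region": ["east", "west"]})
merged = pd.merge(orders, regions, on="customer")
by_region = merged.groupby("region")["amount"].sum()

# Reshaping: pivot long-format rows into a day-by-item table.
long_form = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "item": ["a", "b", "a", "b"],
    "qty": [1, 2, 3, 4],
})
wide = long_form.pivot(index="day", columns="item", values="qty")

# Time series: date range generation, lagging, moving windows,
# and frequency conversion.
idx = pd.date_range("2013-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)
lagged = ts.shift(1)
moving = ts.rolling(window=3).mean()
every_two_days = ts.resample("2D").sum()

print(wide)
print(every_two_days)
```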
| 49 | + [missing-data]: http://pandas.pydata.org/pandas-docs/stable/missing_data.html#working-with-missing-data |
| 50 | + [insertion-deletion]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion |
| 51 | + [alignment]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html?highlight=alignment#intro-to-data-structures |
| 52 | + [groupby]: http://pandas.pydata.org/pandas-docs/stable/groupby.html#group-by-split-apply-combine |
| 53 | + [conversion]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe |
| 54 | + [slicing]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges |
| 55 | + [fancy-indexing]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#advanced-indexing-with-ix |
| 56 | + [subsetting]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing |
| 57 | + [merging]: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging |
| 58 | + [joining]: http://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index |
| 59 | + [reshape]: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-and-pivot-tables |
| 60 | + [pivot-table]: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations |
| 61 | + [mi]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#hierarchical-indexing-multiindex |
| 62 | + [flat-files]: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files |
| 63 | + [excel]: http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files |
| 64 | + [db]: http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries |
| 65 | + [hdfstore]: http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables |
| 66 | + [timeseries]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-series-date-functionality |
| 67 | + |
| 68 | +## Where to get it |
| 69 | +The source code is currently hosted on GitHub at: |
| 70 | +http://github.com/pydata/pandas |
| 71 | + |
| 72 | +Binary installers for the latest released version are available at the Python |
| 73 | +Package Index: |
| 74 | + |
| 75 | + http://pypi.python.org/pypi/pandas/ |
| 76 | + |
| 77 | +And via `easy_install`: |
| 78 | + |
| 79 | +```sh |
| 80 | +easy_install pandas |
| 81 | +``` |
| 82 | + |
| 83 | +or `pip`: |
| 84 | + |
| 85 | +```sh |
| 86 | +pip install pandas |
| 87 | +``` |
| 88 | + |
| 89 | +## Dependencies |
| 90 | +- [NumPy](http://www.numpy.org): 1.6.1 or higher |
| 91 | +- [python-dateutil](http://labix.org/python-dateutil): 1.5 or higher |
| 92 | +- [pytz](http://pytz.sourceforge.net) |
| 93 | + - Needed for time zone support with ``pandas.date_range`` |
| 94 | + |
| 95 | +### Highly Recommended Dependencies |
| 96 | +- [numexpr](http://code.google.com/p/numexpr/) |
| 97 | + - Needed to accelerate some expression evaluation operations |
| 98 | + - Required by PyTables |
| 99 | +- [bottleneck](http://berkeleyanalytics.com/bottleneck) |
| 100 | + - Needed to accelerate certain numerical operations |
| 101 | + |
| 102 | +### Optional dependencies |
| 103 | +- [Cython](http://www.cython.org): only necessary to build the development version; version 0.17.1 or higher |
| 104 | +- [SciPy](http://www.scipy.org): miscellaneous statistical functions |
| 105 | +- [PyTables](http://www.pytables.org): necessary for HDF5-based storage |
| 106 | +- [matplotlib](http://matplotlib.sourceforge.net/): for plotting |
| 107 | +- [statsmodels](http://statsmodels.sourceforge.net/) |
| 108 | + - Needed for parts of `pandas.stats` |
| 109 | +- [openpyxl](http://packages.python.org/openpyxl/), [xlrd/xlwt](http://www.python-excel.org/) |
| 110 | + - openpyxl version 1.6.1 or higher, for writing .xlsx files |
| 111 | + - xlrd >= 0.9.0 |
| 112 | + - Needed for Excel I/O |
| 113 | +- [boto](https://pypi.python.org/pypi/boto): necessary for Amazon S3 access. |
| 114 | +- One of the following combinations of libraries is needed to use the |
| 115 | + top-level [`pandas.read_html`][read-html-docs] function: |
| 116 | + - [BeautifulSoup4][BeautifulSoup4] and [html5lib][html5lib] (Any |
| 117 | + recent version of [html5lib][html5lib] is okay.) |
| 118 | + - [BeautifulSoup4][BeautifulSoup4] and [lxml][lxml] |
| 119 | + - [BeautifulSoup4][BeautifulSoup4] and [html5lib][html5lib] and [lxml][lxml] |
| 120 | + - Only [lxml][lxml], although see [HTML reading gotchas][html-gotchas] |
| 121 | + for why you should probably **not** take this approach. |
| 122 | + |
| 123 | +#### Notes about HTML parsing libraries |
| 124 | +- If you install [BeautifulSoup4][BeautifulSoup4] you must install |
| 125 | + either [lxml][lxml] or [html5lib][html5lib] or both. |
| 126 | + `pandas.read_html` will **not** work with *only* `BeautifulSoup4` |
| 127 | + installed. |
| 128 | +- You are strongly encouraged to read [HTML reading |
| 129 | + gotchas][html-gotchas]. It explains issues surrounding the |
| 130 | + installation and usage of the above three libraries. |
| 131 | +- You may need to install an older version of |
| 132 | + [BeautifulSoup4][BeautifulSoup4]: |
| 133 | + - Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64 and |
| 134 | + 32-bit Ubuntu/Debian |
| 135 | +- Additionally, if you're using [Anaconda][Anaconda] you should |
| 136 | + definitely read [the gotchas about HTML parsing |
| 137 | + libraries][html-gotchas] |
| 138 | +- If you're on a system with `apt-get` you can do |
| 139 | + |
| 140 | + ```sh |
| 141 | + sudo apt-get build-dep python-lxml |
| 142 | + ``` |
| 143 | + |
| 144 | + to get the necessary dependencies for installation of [lxml][lxml]. |
| 145 | + This will prevent further headaches down the line. |
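
The combinations described above can be checked before calling `pandas.read_html`. The following is only a sketch: the helper names `installed_parsers` and `read_html_readiness` are hypothetical and not part of pandas, and the rules simply restate the combinations listed in this README.

```python
from importlib.util import find_spec

# Import names of the three parser libraries discussed above.
PARSER_MODULES = {"BeautifulSoup4": "bs4", "html5lib": "html5lib", "lxml": "lxml"}


def installed_parsers():
    """Return the subset of the three libraries importable in this environment."""
    return {
        name for name, module in PARSER_MODULES.items()
        if find_spec(module) is not None
    }


def read_html_readiness(installed):
    """Classify a set of installed libraries against the combinations listed above."""
    has_bs4 = "BeautifulSoup4" in installed
    has_backend = bool({"html5lib", "lxml"} & set(installed))
    if has_bs4 and has_backend:
        return "ok"
    if has_bs4:
        return "incomplete: BeautifulSoup4 also needs lxml or html5lib"
    if "lxml" in installed:
        return "lxml only: works, but see the HTML reading gotchas"
    return "missing: none of the listed combinations is installed"


print(read_html_readiness(installed_parsers()))
```

Passing the set explicitly keeps the classification separate from environment detection, so the rules can be checked without installing anything.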
| 146 | + |
| 147 | + [html5lib]: https://github.com/html5lib/html5lib-python "html5lib" |
| 148 | + [BeautifulSoup4]: http://www.crummy.com/software/BeautifulSoup "BeautifulSoup4" |
| 149 | + [lxml]: http://lxml.de |
| 150 | + [Anaconda]: https://store.continuum.io/cshop/anaconda |
| 151 | + [NumPy]: http://numpy.scipy.org/ |
| 152 | + [html-gotchas]: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-table-parsing |
| 153 | + [read-html-docs]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.html.read_html.html#pandas.io.html.read_html |
| 154 | + |
| 155 | +## Installation from sources |
| 156 | +To install pandas from source you need Cython in addition to the normal |
| 157 | +dependencies above. Cython can be installed from PyPI: |
| 158 | + |
| 159 | +```sh |
| 160 | +pip install cython |
| 161 | +``` |
| 162 | + |
| 163 | +In the `pandas` directory (the same one where you found this file after |
| 164 | +cloning the git repo), execute: |
| 165 | + |
| 166 | +```sh |
| 167 | +python setup.py install |
| 168 | +``` |
| 169 | + |
| 170 | +or for installing in [development mode](http://www.pip-installer.org/en/latest/usage.html): |
| 171 | + |
| 172 | +```sh |
| 173 | +python setup.py develop |
| 174 | +``` |
| 175 | + |
| 176 | +Alternatively, you can use `pip` if you want all the dependencies pulled |
| 177 | +in automatically (the `-e` option is for installing it in [development |
| 178 | +mode](http://www.pip-installer.org/en/latest/usage.html)): |
| 179 | + |
| 180 | +```sh |
| 181 | +pip install -e . |
| 182 | +``` |
| 183 | + |
| 184 | +On Windows, you will need to install MinGW and execute: |
| 185 | + |
| 186 | +```sh |
| 187 | +python setup.py build --compiler=mingw32 |
| 188 | +python setup.py install |
| 189 | +``` |
| 190 | + |
| 191 | +See http://pandas.pydata.org/ for more information. |
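
After any of the install routes above, a quick way to confirm that the build imports cleanly is a small smoke test along these lines. This is a sketch, not an official check; the `frame` contents and the `smoke_ok` name are arbitrary.

```python
import numpy as np
import pandas as pd

# Report the versions that were actually picked up on the import path.
print("pandas", pd.__version__)
print("NumPy", np.__version__)

# Exercise the compiled parts with a trivial DataFrame operation.
frame = pd.DataFrame({"a": [1, 2, 3]})
smoke_ok = int(frame["a"].sum()) == 6
print("smoke test passed:", smoke_ok)
```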
| 192 | + |
| 193 | +## License |
| 194 | +BSD |
| 195 | + |
| 196 | +## Documentation |
| 197 | +The official documentation is hosted on PyData.org: http://pandas.pydata.org/ |
| 198 | + |
| 199 | +The Sphinx documentation should provide a good starting point for learning how |
| 200 | +to use the library. Expect the docs to continue to expand as time goes on. |
| 201 | + |
| 202 | +## Background |
| 203 | +Work on ``pandas`` started at AQR (a quantitative hedge fund) in 2008 and |
| 204 | +has been under active development since then. |
| 205 | + |
| 206 | +## Discussion and Development |
| 207 | +Since pandas development is related to a number of other scientific |
| 208 | +Python projects, questions are welcome on the scipy-user mailing |
| 209 | +list. Specialized discussions or design issues should take place on |
| 210 | +the pystatsmodels mailing list / Google group, where |
| 211 | +``scikits.statsmodels`` and other libraries will also be discussed: |
| 212 | + |
| 213 | +http://groups.google.com/group/pystatsmodels |