Commit 49d1d75

WEB: Moving pandas blog to the website (#33178)
1 parent abf587c commit 49d1d75

16 files changed: +470 -4 lines changed
@@ -0,0 +1,172 @@
Title: 2019 pandas user survey
Date: 2019-08-22

<style type="text/css">
table td {
    background: none;
}

table tr.even td {
    background: none;
}

table {
    text-shadow: none;
}

</style>

# 2019 pandas user survey

Pandas recently conducted a user survey to help guide future development.
Thanks to everyone who participated! This post presents the high-level results.

This analysis and the raw data can be found [on GitHub](https://github.com/pandas-dev/pandas-user-surveys) and run on Binder.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/pandas-dev/pandas-user-surveys/master?filepath=2019.ipynb)

We had about 1250 responses over the 15 days we ran the survey in the summer of 2019.

## About the Respondents

There was a fair amount of representation across pandas experience and frequency of use, though the majority of respondents are on the more experienced side.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_4_0.png)

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_5_0.png)

We included a few questions that were also asked in the [Python Developers Survey](https://www.jetbrains.com/research/python-developers-survey-2018/) so we could compare Pandas' population to Python's.

90% of our respondents use Python as a primary language (compared with 84% from the PSF survey).

    Yes    90.67%
    No      9.33%
    Name: Is Python your main language?, dtype: object

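A summary like the one above can be produced with `value_counts`. The following is a minimal sketch, not the code from the survey notebook: the answers are made up for illustration, and formatting the shares as percent strings is an assumption about how the table was rendered.

```python
import pandas as pd

# Hypothetical answers to a single-choice question (not the survey's raw data).
answers = pd.Series(
    ["Yes", "Yes", "No", "Yes"],
    name="Is Python your main language?",
)

# normalize=True returns each answer's share of all responses (fractions of 1).
shares = answers.value_counts(normalize=True)

# Formatting the floats as percent strings leaves a Series of text values,
# which would be consistent with a non-numeric dtype in the printed summary.
formatted = shares.map("{:.2%}".format)
print(formatted)
```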
Windows users are well represented (see [Steve Dower's talk](https://www.youtube.com/watch?v=uoI57uMdDD4) on this topic).

    Linux      61.57%
    Windows    60.21%
    MacOS      42.75%
    Name: What Operating Systems do you use?, dtype: object

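The operating system shares add up to more than 100% because respondents could pick several answers. A rough sketch of how such a multi-select question can be tallied is shown below. The data and the `;` separator are assumptions made for illustration, not the survey's actual export format.

```python
import pandas as pd

# Hypothetical multi-select answers, one string per respondent,
# with choices separated by ";" (an assumed export format).
answers = pd.Series(["Linux;Windows", "Windows", "Linux;MacOS", "Linux"])

# One indicator column per choice: set when the respondent ticked it.
indicators = answers.str.get_dummies(sep=";")

# The column mean is the share of respondents who ticked each choice.
# Shares are computed independently, so they can sum to more than 1.
shares = indicators.mean().sort_values(ascending=False)
print(shares.map("{:.2%}".format))
```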
For environment isolation, [conda](https://conda.io/en/latest/) was the most popular.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_13_0.png)

Most respondents are Python 3 only.

    3        92.39%
    2 & 3     6.80%
    2         0.81%
    Name: Python 2 or 3?, dtype: object

## Pandas APIs

It can be hard for open source projects to know what features are actually used. We asked a few questions to get an idea.

CSV and Excel are (for better or worse) the most popular formats.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_18_0.png)

In preparation for a possible refactor of pandas internals, we wanted to get a sense for
how common wide (100s of columns or more) DataFrames are.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_20_0.png)

Pandas is slowly growing new extension types. Categoricals are the most popular,
and the nullable integer type is already almost as popular as datetime with timezone.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_22_0.png)

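For readers who have not used the three extension types named above, here is a small sketch that puts a categorical, a nullable integer, and a timezone-aware datetime column side by side. The column names and values are invented for illustration, and it assumes pandas 0.24 or later (where `pd.array` and the `"Int64"` dtype exist).

```python
import pandas as pd

# Illustrative data only; the column names are made up for this sketch.
df = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "blue", "red"]),
        "count": pd.array([1, None, 3], dtype="Int64"),  # nullable integer
        "when": pd.date_range("2019-08-01", periods=3, tz="UTC"),  # tz-aware
    }
)

# Each column reports its own extension dtype.
print(df.dtypes)

# The missing integer stays missing without turning the column into floats.
print(df["count"].isna().sum())
```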
More and better examples seem to be a high-priority development item.
Pandas recently received a NumFOCUS grant to improve our documentation.
We're using it to write tutorial-style documentation, which should help
meet this need.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_24_0.png)

We also asked about specific, commonly-requested features.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_26_0.png)

Of these, the clear standout is "scaling" to large datasets. A couple of observations:

1. Perhaps pandas' documentation should do a better job of promoting libraries that provide scalable dataframes (like [Dask](https://dask.org), [vaex](https://vaex.io), and [modin](https://modin.readthedocs.io/en/latest/)).
2. Memory efficiency (perhaps from a native string data type, fewer internal copies, etc.) is a valuable goal.

After that, the next-most critical improvement is integer missing values. Those were actually added in [Pandas 0.24](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#optional-integer-na-support), but they're not the default, and there are still some incompatibilities with the rest of pandas' API.

Pandas is a less conservative library than, say, NumPy. We're approaching 1.0, but on the way we've made many deprecations and some outright API-breaking changes. Fortunately, most people are OK with the tradeoff.

    Yes    94.89%
    No      5.11%
    Name: Is Pandas stable enough for you?, dtype: object

There's a perception (which is shared by many of the pandas maintainers) that the pandas API is too large. To measure that, we asked whether users thought that pandas' API was too large, too small, or just right.

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_31_0.png)

Finally, we asked for an overall satisfaction with the library, from 1 (very unsatisfied) to 5 (very satisfied).

![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_33_0.png)

Most people are very satisfied. The average response is 4.39. I look forward to tracking this number over time.

If you're analyzing the raw data, be sure to share the results with us [@pandas_dev](https://twitter.com/pandas_dev).
@@ -0,0 +1,218 @@
Title: pandas extension arrays
Date: 2019-01-04

# pandas extension arrays

Extensibility was a major theme in pandas development over the last couple of
releases. This post introduces the pandas extension array interface: the
motivation behind it and how it might affect you as a pandas user. Finally, we
look at how extension arrays may shape the future of pandas.

Extension Arrays are just one of the changes in pandas 0.24.0. See the
[whatsnew][whatsnew] for a full changelog.

## The Motivation

Pandas is built on top of NumPy. You could roughly define a Series as a wrapper
around a NumPy array, and a DataFrame as a collection of Series with a shared
index. That's not entirely correct for several reasons, but I want to focus on
the "wrapper around a NumPy array" part. It'd be more correct to say "wrapper
around an array-like object".

Pandas mostly uses NumPy's builtin data representation; we've restricted it in
places and extended it in others. For example, pandas' early users cared greatly
about timezone-aware datetimes, which NumPy doesn't support. So pandas
internally defined a `DatetimeTZ` dtype (which mimics a NumPy dtype), and
allowed you to use that dtype in `Index`, `Series`, and as a column in a
`DataFrame`. That dtype carried around the tzinfo, but wasn't itself a valid
NumPy dtype.

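The point that the timezone-aware dtype mimics a NumPy dtype without being one can be checked directly. This is a small sketch that assumes a pandas version exposing `pd.DatetimeTZDtype` at the top level (0.24 or later) and an installed timezone database; the dates and the timezone are arbitrary.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2019-01-01", periods=2, tz="America/New_York"))

# The dtype carries the timezone along with the data...
print(ser.dtype)
print(ser.dtype.tz)

# ...but it is a pandas object that mimics a NumPy dtype, not a real one.
print(isinstance(ser.dtype, pd.DatetimeTZDtype))
print(isinstance(ser.dtype, np.dtype))
```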
As another example, consider `Categorical`. This actually composes *two* arrays:
one for the `categories` and one for the `codes`. But it can be stored in a
`DataFrame` like any other column.

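The two arrays behind a `Categorical` are exposed as attributes, so the composition described above is easy to inspect. A brief sketch with made-up values:

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])

# The unique labels are stored once, in an Index...
print(cat.categories)
# ...and each element is a small integer position into that Index.
print(cat.codes)

# A Series stores the whole Categorical as a single column.
ser = pd.Series(cat)
print(ser.dtype)
```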
Each of these extension types pandas added is useful on its own, but carries a
high maintenance cost. Large sections of the codebase need to be aware of how to
handle a NumPy array or one of these other kinds of special arrays. This made
adding new extension types to pandas very difficult.

Anaconda, Inc. had a client who regularly dealt with datasets with IP addresses.
They wondered if it made sense to add an [IPArray][IPArray] to pandas. In the
end, we didn't think it passed the cost-benefit test for inclusion in pandas
*itself*, but we were interested in defining an interface for third-party
extensions to pandas. Any object implementing this interface would be allowed in
pandas. I was able to write [cyberpandas][cyberpandas] outside of pandas, but it
feels like using any other dtype built into pandas.

## The Current State

As of pandas 0.24.0, all of pandas' internal extension arrays (Categorical,
Datetime with Timezone, Period, Interval, and Sparse) are now built on top of
the ExtensionArray interface. Users shouldn't notice many changes. The main
thing you'll notice is that things are cast to `object` dtype in fewer places,
meaning your code will run faster and your types will be more stable. This
includes storing `Period` and `Interval` data in `Series` (which were previously
cast to object dtype).

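As a quick check of the `Period` and `Interval` behavior just described, the sketch below builds one Series of each kind and prints its dtype. It assumes pandas 0.24 or later; the exact spelling of the interval dtype varies between versions, so only its prefix is relied on here.

```python
import pandas as pd

periods = pd.Series(pd.period_range("2019-01", periods=3, freq="M"))
intervals = pd.Series(pd.interval_range(start=0, end=3))

# Neither column is cast to object dtype.
print(periods.dtype)
print(intervals.dtype)

# Dtype-specific accessors keep working on the Series.
print(periods.dt.year.tolist())
```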
Additionally, we'll be able to add *new* extension arrays with relative ease.
For example, 0.24.0 (optionally) solved one of pandas' longest-standing pain
points: missing values casting integer-dtype values to float.

```python
>>> import pandas as pd
>>> int_ser = pd.Series([1, 2], index=[0, 2])
>>> int_ser
0    1
2    2
dtype: int64

>>> int_ser.reindex([0, 1, 2])
0    1.0
1    NaN
2    2.0
dtype: float64
```

With the new [IntegerArray][IntegerArray] and nullable integer dtypes, we can
natively represent integer data with missing values.

```python
>>> int_ser = pd.Series([1, 2], index=[0, 2], dtype=pd.Int64Dtype())
>>> int_ser
0    1
2    2
dtype: Int64

>>> int_ser.reindex([0, 1, 2])
0      1
1    NaN
2      2
dtype: Int64
```

One thing this does slightly change is how you should access the raw (unlabeled)
arrays stored inside a Series or Index, which is occasionally useful. Perhaps
the method you're calling only works with NumPy arrays, or perhaps you want to
disable automatic alignment.

In the past, you'd hear things like "Use `.values` to extract the NumPy array
from a Series or DataFrame." If it were a good resource, they'd tell you that's
not *entirely* true, since there are some exceptions. I'd like to delve into
those exceptions.

The fundamental problem with `.values` is that it serves two purposes:

1. Extracting the array backing a Series, Index, or DataFrame
2. Converting the Series, Index, or DataFrame to a NumPy array

As we saw above, the "array" backing a Series or Index might not be a NumPy
array; it may instead be an extension array (from pandas or a third-party
library). For example, consider `Categorical`:

```python
>>> cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
>>> ser = pd.Series(cat)
>>> ser
0    a
1    b
2    a
dtype: category
Categories (3, object): [a, b, c]

>>> ser.values
[a, b, a]
Categories (3, object): [a, b, c]
```

In this case `.values` is a Categorical, not a NumPy array. For period-dtype
data, `.values` returns a NumPy array of `Period` objects, which is expensive to
create. For timezone-aware data, `.values` converts to UTC and *drops* the
timezone info. These kinds of surprises (different types, or expensive or lossy
conversions) stem from trying to shoehorn these extension arrays into a NumPy
array. But the entire point of an extension array is for representing data NumPy
*can't* natively represent.

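The three surprises listed above can be reproduced in a few lines. This is a sketch with arbitrary example data; the behavior shown matches the description in this post, but the exact dtypes and resolutions may differ between pandas versions, so treat the details as assumptions to verify locally.

```python
import pandas as pd

cat_ser = pd.Series(pd.Categorical(["a", "b", "a"]))
per_ser = pd.Series(pd.period_range("2019-01", periods=2, freq="M"))
tz_ser = pd.Series(pd.date_range("2019-01-01", periods=2, tz="America/New_York"))

# Different type: a Categorical rather than a NumPy array.
print(type(cat_ser.values).__name__)

# Expensive: an object-dtype NumPy array holding one Period object per element.
print(per_ser.values.dtype, type(per_ser.values[0]).__name__)

# Lossy: a plain datetime64 array in UTC, with the timezone gone.
print(tz_ser.values.dtype, tz_ser.values[0])
```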
To solve the `.values` problem, we've split its roles into two dedicated methods:

1. Use `.array` to get a zero-copy reference to the underlying data
2. Use `.to_numpy()` to get a (potentially expensive, lossy) NumPy array of the
   data.

So with our Categorical example,

```python
>>> ser.array
[a, b, a]
Categories (3, object): [a, b, c]

>>> ser.to_numpy()
array(['a', 'b', 'a'], dtype=object)
```

To summarize:

- `.array` will *always* be an ExtensionArray, and is always a zero-copy
  reference back to the data.
- `.to_numpy()` is *always* a NumPy array, so you can reliably call
  ndarray-specific methods on it.

You shouldn't ever need `.values` anymore.

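The two guarantees in the summary can be spot-checked across a few dtypes. The sketch below uses made-up data and assumes pandas 0.24 or later; note that even for plain NumPy-backed data, `.array` returns a thin extension array wrapping the ndarray rather than the ndarray itself.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

examples = {
    "int64": pd.Series([1, 2, 3]),
    "category": pd.Series(pd.Categorical(["a", "b", "a"])),
    "tz-aware": pd.Series(pd.date_range("2019-01-01", periods=3, tz="UTC")),
}

for label, ser in examples.items():
    # .array is an ExtensionArray, including for NumPy-backed data.
    assert isinstance(ser.array, ExtensionArray)
    # .to_numpy() always hands back a real ndarray.
    assert isinstance(ser.to_numpy(), np.ndarray)
    print(label, type(ser.array).__name__, ser.to_numpy().dtype)
```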
## Possible Future Paths

Extension Arrays open up quite a few exciting opportunities. Currently, pandas
represents string data using Python objects in a NumPy array, which is slow.
Libraries like [Apache Arrow][arrow] provide native support for variable-length
strings, and the [Fletcher][fletcher] library provides pandas extension arrays
for Arrow arrays. The interface will also allow [GeoPandas][geopandas] to store
geometry data more efficiently. Pandas (or third-party libraries) will be able
to support nested data, data with units, geo data, and GPU arrays. Keep an eye
on the [pandas ecosystem][eco] page, which will keep track of third-party
extension arrays. It's an exciting time for pandas development.

## Other Thoughts

I'd like to emphasize that this is an *interface*, and not a concrete array
implementation. We are *not* reimplementing NumPy here in pandas. Rather, this
is a way to take any array-like data structure (one or more NumPy arrays, an
Apache Arrow array, a CuPy array) and place it inside a DataFrame. I think
getting pandas out of the array business, and instead thinking about
higher-level tabular data things, is a healthy development for the project.

This works perfectly with NumPy's [`__array_ufunc__`][ufunc] protocol and
[NEP-18][nep18]. You'll be able to use the familiar NumPy API on objects that
aren't backed by NumPy memory.

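To make the `__array_ufunc__` point concrete, here is a small sketch that applies a NumPy ufunc to a nullable integer Series. It assumes a pandas release newer than the 0.24 discussed in this post, one in which the built-in nullable integer array implements the protocol; on older versions the result may differ.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, None, 3], dtype="Int64")

# The NumPy ufunc is dispatched to the extension array backing the Series,
# so the result keeps the nullable dtype and the missing value.
result = np.negative(ser)
print(result)
```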
## Upgrade

These new goodies are all available in the recently released pandas 0.24.

conda:

    conda install -c conda-forge pandas

pip:

    pip install --upgrade pandas

As always, we're happy to hear feedback on the [mailing list][ml],
[@pandas_dev][twitter], or [issue tracker][tracker].

Thanks to the many contributors, maintainers, and [institutional
partners][partners] involved in the pandas community.

[IPArray]: https://github.com/pandas-dev/pandas/issues/18767
[cyberpandas]: https://cyberpandas.readthedocs.io
[IntegerArray]: http://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.arrays.IntegerArray.html
[fletcher]: https://github.com/xhochy/fletcher
[arrow]: https://arrow.apache.org
[ufunc]: https://docs.scipy.org/doc/numpy-1.13.0/neps/ufunc-overrides.html
[nep18]: https://www.numpy.org/neps/nep-0018-array-function-protocol.html
[ml]: https://mail.python.org/mailman/listinfo/pandas-dev
[twitter]: https://twitter.com/pandas_dev
[tracker]: https://github.com/pandas-dev/pandas/issues
[partners]: https://github.com/pandas-dev/pandas-governance/blob/master/people.md
[eco]: http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#extension-data-types
[whatsnew]: http://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html
[geopandas]: https://github.com/geopandas/geopandas
File renamed without changes.
