Improve docs about filtering #746

Closed
wesm opened this issue Feb 4, 2012 · 2 comments
Comments

@wesm
Member

wesm commented Feb 4, 2012

From the mailing list thread "[pystatsmodels] read_csv + text file with one column":

Yes. Suppose that function f returns a boolean value given a column
value. To filter you would do:

df[df[column_to_filter_by].map(f)]

Oftentimes filtering uses vectorized NumPy operations like

df[df[column] == val]

but sometimes you have an element-wise Python function you want to apply.
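
The two styles above can be sketched side by side. This is only an illustration: the DataFrame, its column names ("name", "score"), and the function f are made up for the example and are not from the original thread.

```python
import pandas as pd

# Hypothetical data; "name" and "score" are made-up column names.
df = pd.DataFrame({"name": ["Alice", "Bob", "Anna", "Carl"],
                   "score": [90, 75, 90, 60]})

# Vectorized comparison: builds a boolean mask without a Python-level loop.
by_value = df[df["score"] == 90]

# Element-wise Python function: map calls f once per value in the column.
def f(value):
    return len(value) > 3

by_function = df[df["name"].map(f)]

print(by_value["name"].tolist())
print(by_function["name"].tolist())
```

In both cases the expression inside the outer brackets is a boolean mask with one entry per row, and rows where the mask is True are kept.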

You said you are a new Python programmer so I can understand lambdas
and regular expressions looking weird =) Lambda is just an alternative
to doing something like:

def condition(x):
    return x.startswith('A')

df[df[column].map(condition)]

you could also do:

df[[condition(x) for x in df[column]]]

The map method is just a faster way of doing the list comprehension
(and it returns a Series with the right data type, aligned with the
original index).
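
As a quick check that the three spellings are interchangeable, here is a small sketch. The data and the "city" and "population" column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"city": ["Austin", "Boston", "Albany"],
                   "population": [1, 2, 3]})

def condition(x):
    return x.startswith("A")

# Named function passed to map.
with_def = df[df["city"].map(condition)]

# The same test written inline as a lambda.
with_lambda = df[df["city"].map(lambda x: x.startswith("A"))]

# A plain list of booleans also works as a row mask.
with_listcomp = df[[condition(x) for x in df["city"]]]

print(with_def.equals(with_lambda) and with_def.equals(with_listcomp))
```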

The select method applies a function to the *axis labels*, directly
from the docstring:

In [1]: DataFrame.select?
Type:       instancemethod
Base Class: <type 'instancemethod'>
String Form:<unbound method DataFrame.select>
Namespace:  Interactive
File:       /home/wesm/code/pandas/pandas/core/generic.py
Definition: DataFrame.select(self, crit, axis=0)
Docstring:
Return data corresponding to axis labels matching criteria

Parameters
----------
crit : function
   To be called on each index (label). Should return True or False
axis : int

Returns
-------
selection : type of caller
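
Note that select is not available in recent pandas releases, so the sketch below does not call it. It shows the same idea, filtering on axis labels rather than values, using a boolean list with .loc. The data and the crit function are made up for the example.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"value": [1, 2, 3]},
                  index=["apple", "banana", "avocado"])

def crit(label):
    # Called on each axis label; should return True or False.
    return label.startswith("a")

# Row labels (the axis=0 case): keep rows whose label satisfies crit.
rows = df.loc[[crit(label) for label in df.index]]

# Column labels (the axis=1 case): the same pattern applied to df.columns.
cols = df.loc[:, [name == "value" for name in df.columns]]

print(rows.index.tolist())
```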

hope this helps,
Wes
@gdraps
Contributor

gdraps commented Feb 7, 2012

Hi Wes,

First off, thanks for pandas and your recent talk at the NYC Python meetup. On the topic of filtering, NumPy vector filters are awesome for numeric data, but I have found myself reaching for the following idioms when dealing with alpha-numeric columns:

df[df.method.contains('abc')]
df[df.method.startswith('ghi')]
df[df.method.endswith('xyz')]

Would you consider the addition of these methods to the Series class, not only to complement the existing isin() method, but to bridge the gap with SQL libraries, such as SQLAlchemy (http://docs.sqlalchemy.org/en/latest/core/expression_api.html#sqlalchemy.sql.operators.ColumnOperators), and improve conciseness of string queries in pandas?
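
For readers on later pandas versions: vectorized string methods of this kind are exposed through the Series .str accessor rather than directly on Series. A minimal sketch, assuming a made-up "method" column:

```python
import pandas as pd

# Hypothetical data; the "method" column mirrors the name used above.
df = pd.DataFrame({"method": ["abc_get", "ghi_post", "put_xyz"]})

# regex=False treats the pattern as a literal substring.
contains = df[df["method"].str.contains("abc", regex=False)]
starts = df[df["method"].str.startswith("ghi")]
ends = df[df["method"].str.endswith("xyz")]

print(contains["method"].tolist())
print(starts["method"].tolist())
print(ends["method"].tolist())
```

Bracket access (df["method"]) is used here instead of attribute access to avoid any clash between a column name and a DataFrame attribute.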


Update: in the thread referenced, I see you've already thought about similar methods (match()) and that handling exceptions due to NA values, among other details, is the tricky bit. On the surface, raising a TypeError when an NA is encountered seems acceptable because it feels consistent with other Python idioms (for better or worse). For example, ', '.join(x) raises a TypeError when x contains a non-string element.
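
To make the NA question concrete, here is a small sketch. The first part shows the str.join behavior mentioned above. The second part shows one alternative to raising, assuming a pandas version that has the .str accessor: its na argument sets the result for missing entries. The sample Series is made up.

```python
import pandas as pd

# Plain Python: joining a sequence with a non-string element raises TypeError.
try:
    ", ".join(["a", None])
    raised = False
except TypeError:
    raised = True

# Hypothetical column with one missing value.
s = pd.Series(["Apple", None, "Avocado", "Banana"])

# na=False treats the missing entry as "no match" instead of raising.
mask = s.str.startswith("A", na=False)
kept = s[mask].tolist()

print(raised)
print(kept)
```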

@wesm
Member Author

wesm commented Feb 7, 2012

Hi @gdraps, there is actually an open issue about this: #620.
