API: query / boolean selection with nullable dtypes with NAs #31503
Can you include the full tracebacks? That helps show where things are failing. I'm seeing different exceptions in an environment with numexpr.

In [17]: df.query('a > 2')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-e5d239635d7b> in <module>
----> 1 df.query('a > 2')
~/sandbox/pandas/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3223 kwargs["level"] = kwargs.pop("level", 0) + 1
3224 kwargs["target"] = None
-> 3225 res = self.eval(expr, **kwargs)
3226
3227 try:
~/sandbox/pandas/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3338 kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
3339
-> 3340 return _eval(expr, inplace=inplace, **kwargs)
3341
3342 def select_dtypes(self, include=None, exclude=None) -> "DataFrame":
~/sandbox/pandas/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
328 eng = _engines[engine]
329 eng_inst = eng(parsed_expr)
--> 330 ret = eng_inst.evaluate()
331
332 if parsed_expr.assigner is None:
~/sandbox/pandas/pandas/core/computation/engines.py in evaluate(self)
71
72 # make sure no names in resolvers and locals/globals clash
---> 73 res = self._evaluate()
74 return reconstruct_object(
75 self.result_type, res, self.aligned_axes, self.expr.terms.return_type
~/sandbox/pandas/pandas/core/computation/engines.py in _evaluate(self)
112 scope = env.full_scope
113 _check_ne_builtin_clash(self.expr)
--> 114 return ne.evaluate(s, local_dict=scope)
115
116
~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in evaluate(ex, local_dict, global_dict, out, order, casting, **kwargs)
820 # Create a signature
821 signature = [(name, getType(arg)) for (name, arg) in
--> 822 zip(names, arguments)]
823
824 # Look up numexpr if possible.
~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in <listcomp>(.0)
819
820 # Create a signature
--> 821 signature = [(name, getType(arg)) for (name, arg) in
822 zip(names, arguments)]
823
~/Envs/pandas-dev/lib/python3.7/site-packages/numexpr/necompiler.py in getType(a)
701 if kind == 'S':
702 return bytes
--> 703 raise ValueError("unknown type %s" % a.dtype.name)
704
705
ValueError: unknown type object
We've defined NA such that it propagates in comparison operations. There was a sizable discussion on it if you go back to the original issue, and we reached consensus that propagating was the best.

In [10]: 1 == pd.NA
Out[10]: <NA>

Not saying that can't change, since NA is experimental, but it'll be a high hurdle. |
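The same propagation happens element-wise in the nullable arrays, which is how masks end up containing NA in the first place. A minimal sketch (hypothetical data, nullable Int64 dtype assumed):

>>> import pandas as pd
>>> s = pd.Series([1, 2, None], dtype="Int64")
>>> s > 1   # the missing element propagates as <NA> in the boolean result
0    False
1     True
2     <NA>
dtype: boolean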
And in addition, there is the indexing behaviour with missing values. So the behaviour you see for boolean masking with missing values is the consequence of two things:

1. comparison operations propagate NA (as shown above), and
2. indexing raises when the boolean mask contains NA.

And both are open for discussion (as Tom noted, there is a reason this is experimental). But I personally think it is rather the second that we should maybe discuss. The alternative under consideration for that one was to not error on missing values in a boolean mask, but to interpret the boolean value as False (skip it in masking). That would mostly give the same behaviour as we had before. We decided to take the "raise on missing values" behaviour for now, as it is the most conservative: the user can always fill the missing values in the mask explicitly (with fillna, as sketched below). But I think in this area we explicitly said: let's get some more experience with it and re-evaluate later. So feedback on this is certainly welcome. See #28778 for some of the earlier discussion. |
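For illustration, the explicit opt-in described above might look like this (a sketch; df and an integer column a are assumed):

>>> mask = df['a'] > 2        # nullable comparison: mask contains <NA> where 'a' is missing
>>> df[mask.fillna(False)]    # treat missing as "do not select"
>>> df[mask.fillna(True)]     # or treat missing as "select"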
@TomAugspurger - Edited post to include traceback

This is going to be extraordinarily confusing for users and an absolute pain to have all these data types with all these different behaviors when filtering with missing values. How is anybody going to reasonably use nullable integers to filter their data? Convert back to float first? Users are going to be so confused that nullable integers filter differently than floats and strings only filter for equality, but object columns work (even with pd.NA in them).

>>> df1 = pd.DataFrame({'a': ['rain', 'shine', pd.NA]})  # dtype is object
>>> df1.query('a == "rain"')  # works
>>> df1.query('a <= "rain"')  # works |
This seems like a bug; it should be consistent with boolean selection.
There is no need to convert to a float. You can still filter with missing values, but for now you need to be explicit about how to treat them (with fillna), as I explained above. But to be clear, this is not set in stone. Feedback is very valuable, so let's discuss the specifics. |
And it's important to distinguish the two operations here. We have

mask = integer_array == 1   # comparison propagates NA
filtered = df[mask]         # indexing raises on NA

Changing the behavior of NA in comparisons just (or primarily) for indexing isn't warranted IMO. There are situations where I'm doing a comparison operation not followed by indexing, and I'd like the NA to propagate. So IMO the focus should be on the behavior of NA in indexing.

@tdpetrou would you be able to go through the NA indexing discussion and summarize the arguments for the various options (IIRC, the 3 were raise -- which we implemented, propagate NA, and treat NA as False)?

edit: Whoops, I didn't see #31503 (comment) until now, and realized I basically repeated Joris 😄 |
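Concretely, the two steps look like this (a sketch; the last line is the part that raises today):

>>> import pandas as pd
>>> mask = pd.array([1, 2, None], dtype="Int64") == 1
>>> mask
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean
>>> pd.Series([10, 20, 30])[mask]   # raises: cannot mask with array containing NA / NaN values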
My two cents: I think a sensible default would be to treat NA as False when filtering (so we only select values that are definitely true). This would be the same as a typical where clause in SQL, and in my opinion is generally what people want when they filter data. |
@TomAugspurger Agreed - NA comparisons should remain NA. For indexing, NA's should be treated as False. Problems solved. It would dramatically simplify things for the user. There are many reasons for this.

• Matches behavior of old pandas types

Right now, filtering depends on the data type, which to me is just absurd. Users are going to get so confused and frustrated. The workflow right now for filtering nullable ints is one of two things:

>>> df.fillna({'a': garbage_number}).query('a > 2')  # or
>>> df.astype({'a': 'float'}).query('a > 2')

There are so many instances where you just want to filter on the known data without having to go through these huge unnecessary headaches.
|
Try to keep things respectful... We put a good amount of discussion into this. My recollection is that raising seemed the safest thing to do for now, in part because it matches the behavior of NaN, and in part because it left open the door to specifying a behavior later on. |

> in part because it matches the behavior of NaN

Can you clarify that? When indexing with NaNs, NumPy (or pandas) will raise, right? |
This does make a ton of sense for boolean masking. Keep in mind, though, that our masks aren't always boolean: you can also index with an integer array that contains NA.

In [3]: s = pd.Series([1, 2, 3])
In [4]: m = pd.array([1, 1, None])
In [5]: s[m]

It's not at all clear to me what the right behavior is there. Would it be strange to treat NA in an IntegerArray differently than NA in a boolean array? We don't have a typed NA (there's just the one pd.NA singleton), so the dtype of the containing array would have to carry that distinction. |
@dsaxton Thanks for the input!
@TomAugspurger but you typically didn't get booleans with missing values, as comparisons on the old dtypes didn't produce them (NaN simply compares False). So for the combined workflow (comparison + boolean indexing), raising mostly matches the old behaviour.
@tdpetrou this is not absurd. We are adding those dtypes because we want to improve things, and yes, that means they don't behave exactly the same as the old ones. There is a reason they are labeled as experimental, and are not yet the default. |
@TomAugspurger I think @dsaxton was speaking about boolean indexing. In integer positional indexing, I think we should continue raising an error. It gives some inconsistency, but I personally think that would be fine. |
Here's my final feedback on this - For filtering, NAs should be equivalent to False; filtering would be very difficult if the data types did not all exhibit the same behavior. |
If people are OK with this, then I'm on board too. I think the inconsistency is OK, since we're saying the dtype of the array is what determines the behavior, rather than the values. cc @jbrockmendel @jreback for thoughts. I think the summary is that NA be treated as False in __getitem__ when doing a boolean mask. If we have consensus, are you interested in working on this @tdpetrou? |
I'm fine with this, but would propose going a half-step further and have NA.__bool__ return False instead of raising. This would keep the scalar behavior consistent with the array-element behavior. It would also prevent a ton of problems that will come up when users start getting pd.NA as the name of an Index/Series, xref #31227. |
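For reference, pd.NA currently raises in a boolean context, which is exactly what the half-step proposal would change (current behavior shown, pandas 1.0 assumed):

>>> import pandas as pd
>>> bool(pd.NA)
Traceback (most recent call last):
  ...
TypeError: boolean value of NA is ambiguous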
What kind of operation gets you NA as the name? |
If you can get NA in the index of a DataFrame, then selecting that label gives you a Series with pd.NA as its name. Is that even allowed? |
It's allowed right now, but the semantics of it aren't well fleshed out. Regardless, I think we have a proposal for boolean masking. We can discuss NA.__bool__ separately. |
For me, I think the proposal (treating NA as False in a boolean mask in __getitem__) sounds good.

But to be sure we are fully aware of this: it does not fully match the behaviour of the old pandas dtypes. With numpy object dtype, a missing value simply compares False and the row is silently dropped, while with the nullable string dtype the comparison propagates NA and the masking step currently raises (see the reconstructed example below). I personally think I am fine with this behaviour, but it is a change that is harder to detect when switching to the new nullable dtypes. |
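The original snippets were lost in the copy; this is a reconstruction of the contrast being described, reusing the 'rain'/'shine' values from earlier in the thread:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series(['rain', 'shine', np.nan], dtype=object)
>>> obj == 'rain'            # NaN compares unequal -> plain False
0     True
1    False
2    False
dtype: bool
>>> obj[obj == 'rain']       # the missing row is silently dropped
0    rain
dtype: object

>>> st = pd.Series(['rain', 'shine', None], dtype='string')
>>> st == 'rain'             # NA propagates into the mask
0     True
1    False
2     <NA>
dtype: boolean
>>> st[st == 'rain']         # currently raises: cannot mask with array containing NA / NaN values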
I'm also OK with that difference. |
Code Sample
Create a dataframe with nullable integer, string, and float data types.
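The rendered frame did not survive the copy into this issue; a minimal reconstruction of the setup described (column names a, b, c are assumed):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'a': pd.array([1, 2, None], dtype='Int64'),              # nullable integer
...     'b': pd.array(['rain', 'shine', None], dtype='string'),  # nullable string
...     'c': [1.0, 2.0, np.nan],                                 # plain float
... })
>>> df
      a      b    c
0     1   rain  1.0
1     2  shine  2.0
2  <NA>   <NA>  NaN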
Verify data types and attempt to use query.

Using query with strings works... but fails for boolean selection. Strings also fail for inequalities.
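A sketch of the behaviors just described, using the reconstructed frame above (the exact exception depends on the engine; with numexpr it is the ValueError shown in the traceback earlier in this thread):

>>> df.dtypes
a      Int64
b     string
c    float64
dtype: object
>>> df.query('b == "rain"')    # string equality: works
   a     b    c
0  1  rain  1.0
>>> df.query('a > 2')          # nullable integer: fails (ValueError: unknown type object, with numexpr)
>>> df.query('b > "rain"')     # string inequality: fails as well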
Problem description
The query method behaves differently for nullable integers, strings, and floats. Here's my summary of how I think they work with the query method, assuming there are missing values in the columns. I find it extremely difficult to use when all of these types behave differently for query and boolean selection.
Expected Output
I think I would prefer to have both query and boolean selection working like they do with floats, where missing values evaluate as False in a condition. And even if there are missing values in the boolean mask itself, treat those as False. This would harmonize the behavior for all data types.
This would leave it up to the user to check for missing values. I believe SQL where clauses work in such a manner (missing values in conditions evaluate as False).
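For comparison, the float column already behaves this way today, so under this proposal the nullable columns would simply match it (sketch, using the frame from the code sample above):

>>> df.query('c <= 1')   # the NaN row simply drops out, as in a SQL WHERE clause
   a     b    c
0  1  rain  1.0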
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0+untagged.1.gce8af21.dirty
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None