BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642
Comments
I suppose the behavior you are looking for is actually achieved at some point, though there is a coercion at the end to float. Using str.cat:

In [14]: df1['a'].str.cat(df1['b'])
Out[14]:
0    13
1    24
Name: a, dtype: object

I believe the auto cast to float for the numeric case is desired in other operations, so I'm not sure there's anything to be done here. |
@WillAyd Which operations do you have in mind, where automatic conversion is desired? |
@apnewberry when working with an object dtype that holds numeric data. Taking the example above:

In [12]: df1 = df1.astype(int)

In [13]: df1.sum(axis=1)
Out[13]:
0    4.0
1    6.0
dtype: float64 |
On pandas 0.23.4 I'm seeing a different behavior: the resulting dtype is int64, rather than float64:

In [8]: pd.__version__
Out[8]: '0.23.4'

In [9]: df1 = pd.DataFrame(data={'a': ['1', '2'], 'b': ['3', '4']})

In [10]: df1.astype(int).sum(axis=1)
Out[10]:
0    4
1    6
dtype: int64 |
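On recent pandas versions, the int64 behavior described here can be checked directly (a sketch, not a transcript of the 0.23.4 session above):

```python
import pandas as pd

# Same frame of numeric-looking strings as in the thread.
df1 = pd.DataFrame(data={'a': ['1', '2'], 'b': ['3', '4']})

# Cast explicitly to int, then sum across rows.
row_sums = df1.astype(int).sum(axis=1)
print(row_sums.tolist())  # [4, 6]
print(row_sums.dtype)     # int64
```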
Hmm, OK, that's what I have on master. Somewhat orthogonal to the original post, but if you want to submit a PR with a test case for that, we would certainly take it (if there's not one already). Regardless, going back to the OP: going from object to a numeric type where applicable is certainly desired, as it can yield memory savings and will definitely yield performance improvements. I don't see a way to reasonably account for what is being asked without impacting a larger use case, and there's already a more idiomatic approach to get what is desired, so I'm -1 on any change here. |
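For reference, the more idiomatic approach referred to here is the `Series.str.cat` call shown earlier in the thread; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame(data={'a': ['1', '2'], 'b': ['3', '4']})

# Explicit string concatenation, regardless of whether the
# strings happen to look numeric:
concatenated = df1['a'].str.cat(df1['b'])
print(concatenated.tolist())  # ['13', '24']
print(concatenated.dtype)     # object
```

This avoids relying on what `sum()` happens to do with object-dtype columns.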
Could you help me understand what you mean about performance improvements from casting? Here are a few timings in case it's helpful:

In [6]: n = 1_000_000
   ...: df = pd.DataFrame(data={'a': ['1' for _ in range(n)], 'b': ['2' for _ in range(n)]})

In [9]: %timeit df.sum(axis=1)
209 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit df.astype(str).sum(axis=1)
288 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.astype(float).sum(axis=1)
147 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.astype(object).sum(axis=1)
227 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit df.sum(axis=1).astype(float)
209 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.sum(axis=1).astype(str)
421 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit df.sum(axis=1).astype(object)
236 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
|
@apnewberry if you have usage questions you should ask on SO or Gitter - please don't continue the conversation here as it hijacks the thread. With that said, take the axis argument out of your first three calls and you'll see. |
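The point about dropping the axis argument can be illustrated with a sketch (the `n`, column values, and timing harness here are illustrative assumptions, not from the thread): with the default axis=0, an object-dtype column is reduced by repeated Python-level string concatenation, while a numeric column uses a vectorized reduction.

```python
import timeit

import pandas as pd

n = 10_000  # illustrative; the thread's timings used n = 1_000_000
str_df = pd.DataFrame({'a': ['x'] * n, 'b': ['y'] * n})
num_df = pd.DataFrame({'a': [1.0] * n, 'b': [2.0] * n})

# Default axis=0: object dtype reduces by repeated string
# concatenation, producing one n-character string per column...
col_sums_object = str_df.sum()
# ...while a float column performs a vectorized numeric sum.
col_sums_float = num_df.sum()

print(len(col_sums_object['a']))  # an n-character string
print(col_sums_float['a'])        # n * 1.0

t_object = timeit.timeit(lambda: str_df.sum(), number=3)
t_float = timeit.timeit(lambda: num_df.sum(), number=3)
print(f"object: {t_object:.4f}s  float: {t_float:.4f}s")
```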
Note that all the columns in this example are string columns:
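The original code block appears to be missing from this scrape; a reconstruction consistent with the rest of the thread (`df1` matches the frame used in the comments; `df2` and its values are inferred for illustration):

```python
import pandas as pd

# Strings that can be parsed as numbers:
df1 = pd.DataFrame(data={'a': ['1', '2'], 'b': ['3', '4']})
# Strings that cannot (inferred for illustration):
df2 = pd.DataFrame(data={'a': ['x', 'y'], 'b': ['z', 'w']})

# At the time of this report the two frames behaved differently:
# df1.sum(axis=1) coerced to numbers (4.0, 6.0, dtype float64),
# while df2.sum(axis=1) concatenated ('xz', 'yw', dtype object).
try:
    print(df1.sum(axis=1).tolist())
except TypeError:
    # newer pandas may refuse to coerce strings in reductions
    pass

row_concat = df2.sum(axis=1)
print(row_concat.tolist())  # ['xz', 'yw']
```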
Problem description
It would seem a bug, or at least very surprising behavior, that the sum() operation would depend on the contents of the strings when summing the columns.
Not sure if this has been reported before - I wasn't sure exactly what to search on, but wasn't able to find anything.
Expected Output
I would expect the columns to be concatenated as strings and then left as strings, just the same as if we did df1.a + df1.b.
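The comparison with + can be run directly, using the `df1` defined above:

```python
import pandas as pd

df1 = pd.DataFrame(data={'a': ['1', '2'], 'b': ['3', '4']})

# Element-wise + on string Series concatenates and stays object dtype:
result = df1.a + df1.b
print(result.tolist())  # ['13', '24']
print(result.dtype)     # object
```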
Output of pd.show_versions()
On current master as of this filing.
INSTALLED VERSIONS
commit: 996f361
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.24.0.dev0+562.g996f361
pytest: 3.7.4
pip: 10.0.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.9.0
xarray: 0.10.8
IPython: 6.5.0
sphinx: 1.7.8
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.1.2