Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

marcuscollins · 2015-05-07T18:22:18Z

I am working through logs of web requests, and when I want to find the most common, say, user agent string for a (disguised) user, I run something like the following:

from pandas import Series, DataFrame, Timestamp

tdf = DataFrame({'day': {0: Timestamp('2015-02-24 00:00:00'),  1: Timestamp('2015-02-24 00:00:00'),
                                      2: Timestamp('2015-02-24 00:00:00'), 3: Timestamp('2015-02-24 00:00:00'),
                                      4: Timestamp('2015-02-24 00:00:00')},
                            'userAgent': {0: 'some UA string', 1: 'some UA string', 2: 'some UA string',
                                                 3: 'another UA string', 4: 'some UA string'},
                             'userId': {0: '17661101',  1: '17661101', 2: '17661101', 3: '17661101', 4: '17661101'}})

def most_common_values(df):
    return Series({c: s.value_counts().index[0] for c,s in df.iteritems()})

tdf.groupby('day').apply(most_common_values)

Note that in this (admittedly unusual) example, all of the lines are identical. I'm not sure if that is necessary to recreate the issue. And, I'm obscuring the exact purpose of this code, but it reproduces the bug: The 'userId' comes back as a Timestamp, not a string. This happens after the function most_common_values returns, since that userId string is not returned as a timestamp. if we change the value of the userId to an int:

tdf['userId'] = tdf.userId.astype(int)

or if the value of the associated integer is small enough:

tdf['userId'] = '15320104`

then the results are what we'd expect (the most common value as its original type is returned.)

I imagine that for some reason something like a dateutil parser is being called on strings by default but that probably shoulnd't be happening...

The text was updated successfully, but these errors were encountered:

marcuscollins · 2015-05-07T18:24:47Z

version info:

INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-38-virtual
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: None
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.11.0
statsmodels: 0.5.0.1
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: 2.4
matplotlib: 1.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

jreback · 2015-05-07T18:34:00Z

rather than rolling your own, you prob want this.

In [14]: tdf.groupby('day').apply(lambda x: x.mode())
Out[14]: 
                    day       userAgent    userId
day                                              
2015-02-24 0 2015-02-24  some UA string  17661101

IIRC there is another issue about this outstanding. This inference is quite tricky and certainly could be a bit buggy.

Want to step thru things and see if you can narrow down whats going on?

marcuscollins · 2015-05-07T19:31:07Z

Good suggestion! Still, note how that returns a multiIndex? If you change to x.mode().iloc[0] so that you have a straight Timestamp index, you get back to the same weird result.

I'll dig as time allows. If you can remember the other open issue, please post. I was struggling to describe this, so it was tricky to search for open issues/SO posts/etc...

Thanks!

mroeschke · 2019-10-14T00:09:33Z

Looks like this work correctly on master. Suppose this could use a test.

In [66]: tdf.groupby('day').apply(most_common_values)['userId']
Out[66]:
day
2015-02-24    17661101
Name: userId, dtype: object

In [67]: pd.__version__
Out[67]: '0.26.0.dev0+555.gf7d162b18'

jreback added Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 7, 2015

jreback added this to the Next Major Release milestone May 7, 2015

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 14, 2019

mroeschke mentioned this issue Jan 3, 2020

TST: Regression testing for fixed issues #30646

Merged

14 tasks

simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 3, 2020

mroeschke closed this as completed in #30646 Jan 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

marcuscollins commented May 7, 2015

marcuscollins commented May 7, 2015

jreback commented May 7, 2015

marcuscollins commented May 7, 2015

mroeschke commented Oct 14, 2019

Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

Comments

marcuscollins commented May 7, 2015

marcuscollins commented May 7, 2015

INSTALLED VERSIONS

jreback commented May 7, 2015

marcuscollins commented May 7, 2015

mroeschke commented Oct 14, 2019