Skip to content

Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
marcuscollins opened this issue May 7, 2015 · 4 comments · Fixed by #30646
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@marcuscollins
Copy link

I am working through logs of web requests, and when I want to find the most common, say, user agent string for a (disguised) user, I run something like the following:

from pandas import Series, DataFrame, Timestamp

tdf = DataFrame({'day': {0: Timestamp('2015-02-24 00:00:00'),  1: Timestamp('2015-02-24 00:00:00'),
                                      2: Timestamp('2015-02-24 00:00:00'), 3: Timestamp('2015-02-24 00:00:00'),
                                      4: Timestamp('2015-02-24 00:00:00')},
                            'userAgent': {0: 'some UA string', 1: 'some UA string', 2: 'some UA string',
                                                 3: 'another UA string', 4: 'some UA string'},
                             'userId': {0: '17661101',  1: '17661101', 2: '17661101', 3: '17661101', 4: '17661101'}})

def most_common_values(df):
    return Series({c: s.value_counts().index[0] for c,s in df.iteritems()})

tdf.groupby('day').apply(most_common_values)

Note that in this (admittedly unusual) example, all of the lines are identical. I'm not sure if that is necessary to recreate the issue. And, I'm obscuring the exact purpose of this code, but it reproduces the bug: The 'userId' comes back as a Timestamp, not a string. This happens after the function most_common_values returns, since that userId string is not returned as a timestamp. if we change the value of the userId to an int:

tdf['userId'] = tdf.userId.astype(int)

or if the value of the associated integer is small enough:

tdf['userId'] = '15320104`

then the results are what we'd expect (the most common value as its original type is returned.)

I imagine that for some reason something like a dateutil parser is being called on strings by default but that probably shoulnd't be happening...

@marcuscollins
Copy link
Author

version info:

INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-38-virtual
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: None
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.11.0
statsmodels: 0.5.0.1
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: 2.4
matplotlib: 1.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback
Copy link
Contributor

jreback commented May 7, 2015

rather than rolling your own, you prob want this.

In [14]: tdf.groupby('day').apply(lambda x: x.mode())
Out[14]: 
                    day       userAgent    userId
day                                              
2015-02-24 0 2015-02-24  some UA string  17661101

IIRC there is another issue about this outstanding. This inference is quite tricky and certainly could be a bit buggy.

Want to step thru things and see if you can narrow down whats going on?

@jreback jreback added Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 7, 2015
@jreback jreback added this to the Next Major Release milestone May 7, 2015
@marcuscollins
Copy link
Author

Good suggestion! Still, note how that returns a multiIndex? If you change to x.mode().iloc[0] so that you have a straight Timestamp index, you get back to the same weird result.

I'll dig as time allows. If you can remember the other open issue, please post. I was struggling to describe this, so it was tricky to search for open issues/SO posts/etc...

Thanks!

@mroeschke
Copy link
Member

Looks like this work correctly on master. Suppose this could use a test.

In [66]: tdf.groupby('day').apply(most_common_values)['userId']
Out[66]:
day
2015-02-24    17661101
Name: userId, dtype: object

In [67]: pd.__version__
Out[67]: '0.26.0.dev0+555.gf7d162b18'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 14, 2019
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants