read_csv: date_parser called once with arrays and then many times with strings (from each single row) as arguments #9376

Closed
cmeeren opened this issue Jan 30, 2015 · 7 comments · Fixed by #9377

Comments

@cmeeren
Contributor

cmeeren commented Jan 30, 2015

Pandas version info:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: EN

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.0
statsmodels: 0.5.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

Consider the following script, containing data to be loaded, a date parsing function, and pd.read_csv():

from __future__ import print_function
import StringIO
import pandas as pd
import datetime as dt

text = '''
YYYY DOY HR MN  1  2   3      4      5       6     7
2013   1  0  0 71   1 100   5571      0 99999.9 999.9
2013   1  0  1 99 999 999 999999 999999 99999.9 999.9
2013   1  0  2 71   5 100   5654     19  -348.2   9.1
2013   1  0  3 71   1 100   5647      0  -350.6   9.5
2013   1  0  4 71   1 100   5693      0  -351.2   9.4
'''

def parse_date(year, doy, hour, minute):
    print(year, type(year))
    try:  # array arguments
        year = year.astype('datetime64[Y]')
        doy = doy.astype('datetime64[D]')
        hour = hour.astype('datetime64[h]')
        minute = minute.astype('datetime64[m]')
        return year + doy + hour + minute  # return None also gives the same final DataFrame
    except:  # string arguments
        year = int(year)
        doy = int(doy)
        hour = int(hour)
        minute = int(minute)
        return dt.datetime(year, 1, 1, hour, minute) + dt.timedelta(doy - 1)


data = pd.read_csv(StringIO.StringIO(text), delim_whitespace=True,
                   date_parser=parse_date, parse_dates=[[0, 1, 2, 3]], index_col=0)

The print statement in parse_date outputs

['2013' '2013' '2013' '2013' '2013'] <type 'numpy.ndarray'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>

With parse_date as written above, the DataFrame is loaded properly, even when the first return statement in parse_date is changed to return None:

>>> data
                      1    2    3       4       5        6      7
YYYY_DOY_HR_MN                                                   
2013-01-01 00:00:00  71    1  100    5571       0  99999.9  999.9
2013-01-01 00:01:00  99  999  999  999999  999999  99999.9  999.9
2013-01-01 00:02:00  71    5  100    5654      19   -348.2    9.1
2013-01-01 00:03:00  71    1  100    5647       0   -350.6    9.5
2013-01-01 00:04:00  71    1  100    5693       0   -351.2    9.4

It strikes me as odd that the date parser is called with both whole columns and individual rows. I could not find any documentation of this behaviour, and it immediately raises two questions:

  • Why is the date parser first called with whole columns as arguments, and then called again with strings from every single row? Is this a bug? It seems pointless to require the user to design for both kinds of arguments, especially since the return value from the array call is apparently not even used.
  • Wouldn't it be much better performance-wise to call the date parser only once, with the arrays (whole columns) as arguments, letting the user process the whole index in one go if possible?
@jorisvandenbossche
Member

The reason the date parser is called on both the whole column and the individual rows is that pandas first tries the provided function as a vectorized function (on the whole column at once), and if that fails (which is the case with your function), it applies it to each row instead. This way the user can provide either a vectorized or a scalar function.
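Roughly, that fallback can be sketched like this (a simplified illustration, not pandas' actual implementation; `convert` and the parsers are hypothetical names):

```python
def convert(parser, *cols):
    """Try the parser vectorized on whole columns; fall back to row-by-row."""
    try:
        result = parser(*cols)          # vectorized attempt: whole columns at once
        if result is not None:
            return result
    except Exception:
        pass                            # any failure here is silenced
    # scalar fallback: the parser now receives one string per column
    return [parser(*row) for row in zip(*cols)]

# a scalar-only parser fails on whole columns, so the fallback kicks in
print(convert(lambda y, d: int(y) * 1000 + int(d), ['2013', '2013'], ['1', '2']))
```

This also explains the observation above that returning `None` from the vectorized branch produces the same DataFrame: a `None` result triggers the scalar fallback as well.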

It could indeed be better performance-wise to call the date parser once on the whole array, but in that case you have to provide a function that can handle arrays yourself.
For example, in your function, int(year) will only work on scalar values, not on the whole array.
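A minimal illustration of that difference:

```python
import numpy as np

years = np.array(['2013', '2013'])

try:
    int(years)                      # int() only accepts scalars
except TypeError as exc:
    print('scalar-only:', exc)

# a vectorized conversion handles the whole column in one call
print(years.astype('datetime64[Y]'))
```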

@jorisvandenbossche
Member

BTW, the reason your function fails for vectorized input is that the addition of the different parts (return year + doy + hour + minute) fails with a TypeError: adding two datetimes is not allowed, and is also meaningless. Just try it out with some data in numpy.
You can convert the parts (apart from the year) to timedelta64 and then add them to the year; that would work.
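A sketch of that fix on a couple of rows from the data above (note that day-of-year is 1-based, so one day is subtracted):

```python
import numpy as np

year = np.array(['2013', '2013']).astype('datetime64[Y]')
# convert the remaining parts to timedelta64 and add them to the year
doy = np.array(['1', '1']).astype(int).astype('timedelta64[D]') - np.timedelta64(1, 'D')
hour = np.array(['0', '0']).astype(int).astype('timedelta64[h]')
minute = np.array(['2', '3']).astype(int).astype('timedelta64[m]')

print(year + doy + hour + minute)   # datetime64 + timedelta64 is allowed
```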

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

Thank you, I understand the problem now, and I think pandas could have handled this better by informing me of the error. I never knew something went wrong, because pandas caught the error silently. If this small fact could be added to the docs (for read_csv and any other similar functions), that would help immensely. Or even better, pandas could warn me (though I see the problem with implementing that, given the current usage).

@jorisvandenbossche
Member

@cmeeren In this case, I don't think it is entirely pandas' fault: you silenced the error yourself by adding the bare except and falling back to the string version of your function, so you would never have seen the numpy TypeError anyway. (Coincidentally, I just read a blog post about that yesterday :-) https://realpython.com/blog/python/the-most-diabolical-python-antipattern/ — but I agree, we still use it too much in pandas as well.)

Eg if you limit your function to only the first part like

def parse_date(year, doy, hour, minute):
    print(year, type(year))
    year = year.astype('datetime64[Y]')
    doy = doy.astype('datetime64[D]')
    hour = hour.astype('datetime64[h]')
    minute = minute.astype('datetime64[m]')
    return year + doy + hour + minute  # return None also gives the same final DataFrame

you would already get a more informative error ("AttributeError: 'str' object has no attribute 'astype'"). But I agree: the error you see comes from the scalar case, not from the vectorized try.

The easiest way, if you get an error, is to try the parse_date function on some arrays yourself; then you will get the correct error.
Because pandas tries vectorized first and then scalar, the error on the vectorized call is indeed silenced. That is the disadvantage of trying to be smart.
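For instance, reproducing just the failing addition on arrays immediately surfaces the silenced error (a minimal sketch):

```python
import numpy as np

year = np.array(['2013']).astype('datetime64[Y]')
doy = np.array(['2013-01-01']).astype('datetime64[D]')

try:
    year + doy                      # datetime64 + datetime64 is not allowed
except TypeError as exc:
    print('TypeError:', exc)
```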

But you are certainly welcome to add a clarification about the vectorized/scalar behaviour to the docs.

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

@jorisvandenbossche My original function did indeed look like what you suggest. However, I did not think it would be fruitful to try the function on some arrays manually, since I had no idea how pandas called the function and this was not mentioned in the docs. I have now submitted a pull request (#9377).

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

Small GitHub-related question: If this PR is merged, can I safely delete my fork?

@jreback jreback added Docs and removed Docs labels Jan 30, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 30, 2015
@jorisvandenbossche
Member

Indeed, once it is merged, you can safely delete it (there will even be a button on GitHub to do that for your remote origin).
