numpy error using read_csv with parse_dates=[...] and index_col=[...] #10245


Closed
cmeeren opened this issue Jun 1, 2015 · 31 comments · Fixed by #10249
Labels
Bug Datetime Datetime data dtype IO CSV read_csv, to_csv
Comments

@cmeeren
Contributor

cmeeren commented Jun 1, 2015

Consider a file of the following format:

week,sow,prn,rxstatus,az,elv,l1_cno,s4,s4_cor,secsigma1,secsigma3,secsigma10,secsigma30,secsigma60,code_carrier,c_cstdev,tec45,tecrate45,tec30,tecrate30,tec15,tecrate15,tec00,tecrate00,l1_loctime,chanstatus,l2_locktime,l2_cno
1765,68460.00,126,00E80000,0.00,0.00,39.38,0.118447,0.107595,0.252663,0.532384,0.600540,0.603073,0.603309,-13.255543,0.114,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1692.182,8C023D84,0.000,0.00
1765,68460.00,23,00E80000,0.00,0.00,53.48,0.034255,0.021177,0.035187,0.042985,0.061142,0.061738,0.061801,-22.760003,0.015,24.955111,0.112239,25.115330,-0.119774,25.146603,-0.065852,24.747576,-0.243804,10426.426,08109CC4,10409.660,44.52
1765,68460.00,13,00E80000,0.00,0.00,54.28,0.046218,0.019314,0.037818,0.056421,0.060602,0.060698,0.060735,-20.679035,0.090,25.670250,-0.070761,25.752224,-0.055089,26.045048,-0.180056,25.360369,-0.062119,7553.020,18109CA4,7202.660,47.27

I try to read that with the following code

data = pd.read_csv(FILE, date_parser=GPStime2datetime,
                   parse_dates={'datetime': ['week', 'sow']},
                   index_col=['datetime', 'prn'])

Here I'm parsing week and sow into a datetime column using a custom function (this works properly) and using datetime and the prn column as a MultiIndex. The file is read successfully when index_col='datetime', but not when trying to create the MultiIndex using index_col=['datetime', 'prn'] (or when using column numbers instead of names). I get the following traceback:

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 260, in _read
    return parser.read()

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 721, in read
    ret = self._engine.read(nrows)

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1223, in read
    index, names = self._make_index(data, alldata, names)

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 898, in _make_index
    index = self._agg_index(index, try_parse_dates=False)

  File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 984, in _agg_index
    index = MultiIndex.from_arrays(arrays, names=self.index_names)

  File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 4410, in from_arrays
    cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]

  File "C:\Anaconda\lib\site-packages\pandas\core\categorical.py", line 355, in from_array
    return Categorical(data, **kwargs)

  File "C:\Anaconda\lib\site-packages\pandas\core\categorical.py", line 271, in __init__
    codes, categories = factorize(values, sort=False)

  File "C:\Anaconda\lib\site-packages\pandas\core\algorithms.py", line 131, in factorize
    (hash_klass, vec_klass), vals = _get_data_algo(vals, _hashtables)

  File "C:\Anaconda\lib\site-packages\pandas\core\algorithms.py", line 412, in _get_data_algo
    mask = com.isnull(values)

  File "C:\Anaconda\lib\site-packages\pandas\core\common.py", line 230, in isnull
    return _isnull(obj)

  File "C:\Anaconda\lib\site-packages\pandas\core\common.py", line 240, in _isnull_new
    return _isnull_ndarraylike(obj)

  File "C:\Anaconda\lib\site-packages\pandas\core\common.py", line 330, in _isnull_ndarraylike
    result = np.isnan(values)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I am using Python 2.7, Pandas 0.16.1 and numpy 1.9.2.

@jreback
Contributor

jreback commented Jun 1, 2015

you need to show your parsing function.

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

import numpy as np
import datetime as dt

def GPStime2datetime(GPSweek, GPS_TOW, correctLeapSeconds=True,
                     correctRollover=True, pyDatetime=False):
    '''Converts integer GPS week (no 1024 rollover!) and sequence time of week
    (seconds) to a datetime object.

    Parameters
    ==========
    GPSweek : int
        GPS week for the whole sequence
    GPS_TOW : array_like
        seconds of GPS week
    correctLeapSeconds : bool
        Correct for leap seconds based on the first element in `GPS_TOW`
    correctRollover : bool
        correct for GPS week rollover in `GPS_TOW`
        (see :func:`correct_TOW_rollover`)
    pyDatetime : bool
        Force output to python's builtin :class:`~dt.datetime` instead of
        numpy's :class:`~np.datetime64`. Incurs a big performance hit for large
        arrays.
    '''

    # make sure we have a numpy array
    GPS_TOW = np.asarray(GPS_TOW)

    # convert to float in case a list of strings is passed
    GPS_TOW = GPS_TOW.astype(np.float64)

    # correct rollover
    if correctRollover:
        GPS_TOW = correct_TOW_rollover(GPS_TOW)

    msAfterEpoch = GPS_TOW*1000 + np.int64(GPSweek)*604800*1000

    # correct for leap seconds
    if correctLeapSeconds:
        firstDate = np.datetime64('1980-01-06') + (msAfterEpoch[0]).astype('timedelta64[ms]')
        secondsToSubtract = leapSecondsSinceGPSepoch(firstDate)
        np.subtract(msAfterEpoch, secondsToSubtract*1000, out=msAfterEpoch)

    # make into a list of datetime objects and return
    dates = np.datetime64('1980-01-06') + msAfterEpoch.astype('timedelta64[ms]')
    if pyDatetime:
        dates = dates.astype(dt.datetime)
    return dates

@jreback
Contributor

jreback commented Jun 1, 2015

and what does this produce when you don't specify the index columns? show a sample of the frame and df.dtypes.

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

I run this code:

df = pd.read_csv(FILE, date_parser=GPStime2datetime,
                 parse_dates={'datetime': ['week', 'sow']})

Output of df (most columns truncated here):

                datetime  prn  rxstatus  az  elv  l1_cno        s4    s4_cor  \
0    2013-11-03 19:00:44  126  00E80000   0    0   39.38  0.118447  0.107595   
1    2013-11-03 19:00:44   23  00E80000   0    0   53.48  0.034255  0.021177   
2    2013-11-03 19:00:44   13  00E80000   0    0   54.28  0.046218  0.019314   

Output of df.dtypes:

datetime        datetime64[ns]
prn                      int64
rxstatus                object
az                     float64
elv                    float64
l1_cno                 float64
s4                     float64
s4_cor                 float64
secsigma1              float64
secsigma3              float64
secsigma10             float64
secsigma30             float64
secsigma60             float64
code_carrier           float64
c_cstdev               float64
tec45                  float64
tecrate45              float64
tec30                  float64
tecrate30              float64
tec15                  float64
tecrate15              float64
tec00                  float64
tecrate00              float64
l1_loctime             float64
chanstatus              object
l2_locktime            float64
l2_cno                 float64
dtype: object

@jreback
Contributor

jreback commented Jun 1, 2015

can you show a simple example which doesn't involve this function as I cannot run it.

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

This should be an entirely self-contained example. It seems that when the date parser returns numpy's datetime64 type instead of Python's own datetime, this error occurs.

import numpy as np
import pandas as pd
import datetime as dt
from StringIO import StringIO

contents = r'''week,sow,prn,rxstatus,az,elv,l1_cno,s4,s4_cor,secsigma1,secsigma3,secsigma10,secsigma30,secsigma60,code_carrier,c_cstdev,tec45,tecrate45,tec30,tecrate30,tec15,tecrate15,tec00,tecrate00,l1_loctime,chanstatus,l2_locktime,l2_cno
2013-11-03,19:00:00,126,00E80000,0.00,0.00,39.38,0.118447,0.107595,0.252663,0.532384,0.600540,0.603073,0.603309,-13.255543,0.114,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1692.182,8C023D84,0.000,0.00
2013-11-03,19:00:00,23,00E80000,0.00,0.00,53.48,0.034255,0.021177,0.035187,0.042985,0.061142,0.061738,0.061801,-22.760003,0.015,24.955111,0.112239,25.115330,-0.119774,25.146603,-0.065852,24.747576,-0.243804,10426.426,08109CC4,10409.660,44.52
2013-11-03,19:00:00,13,00E80000,0.00,0.00,54.28,0.046218,0.019314,0.037818,0.056421,0.060602,0.060698,0.060735,-20.679035,0.090,25.670250,-0.070761,25.752224,-0.055089,26.045048,-0.180056,25.360369,-0.062119,7553.020,18109CA4,7202.660,47.27'''

def parse_np_datetime64(date, time):
    datetime = np.array([date + 'T' + time + 'Z'], dtype='datetime64[s]')
    return datetime

def parse_py_datetime(date, time):
    datetime = parse_np_datetime64(date, time).astype(dt.datetime).ravel()
    return datetime

# this will run
pd.read_csv(StringIO(contents), date_parser=parse_py_datetime,
            parse_dates={'datetime': ['week', 'sow']},
            index_col=['datetime', 'prn'])

# this will fail
pd.read_csv(StringIO(contents), date_parser=parse_np_datetime64,
            parse_dates={'datetime': ['week', 'sow']},
            index_col=['datetime', 'prn'])

EDIT after #10245 (comment): parse_np_datetime64 above is wrong; it should be:

def parse_np_datetime64(date, time):
    datetime = np.array(date + 'T' + time + 'Z', dtype='datetime64[s]')
    return datetime

@jreback
Contributor

jreback commented Jun 1, 2015

@cmeeren ok thanks.

The basic issue is that some of the inference in read_csv is not as general as to_datetime, which correctly handles all of these cases. So the output of the date_parser needs to be coerced to fix this.

pull-requests are welcome!

@jreback jreback added Bug Datetime Datetime data dtype IO CSV read_csv, to_csv labels Jun 1, 2015
@jreback jreback added this to the Next Major Release milestone Jun 1, 2015
@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

I've tried to find read_csv in the code and spent the last 45 minutes following a trail of functions, classes and methods until I completely lost my way around io.parsers.ParserBase._agg_index(), which to me seems to be the method creating the MultiIndex (it's hard to know without comments/docstrings). Could the problem lie here? Specifically, would it fix the problem to add a line or two after line 970 where arr is converted to Python's datetime if it's currently in some numpy datetime format?

I can't really test anything, because I'm on Windows and I've never gotten compiling to work reliably, which means python setup.py develop fails. So I don't think I'm the right one to fix this and submit a PR. (Also, as you probably can see, the codebase is rather opaque to me.)

@jreback
Contributor

jreback commented Jun 1, 2015

See the docs here for creating a development environment.

This is where the code goes:
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L2069

I think this could just call to_datetime, as lib.try_parse_dates is not sophisticated enough for this.

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

Re. the build environment, I still get the same errors. Something about query_vcvarsall. I've seen it before and followed some other instructions I found, and when I try to run it in the Visual Studio 2008 command prompt, I get a fatal error from python27.lib concerning 32 vs. 64 bit (I have a 64 bit python installation).

@jreback
Contributor

jreback commented Jun 1, 2015

you need to use conda, VS is not required, just install libpython as indicated.

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

Oh, I just forgot to activate pandas_dev before running python setup.py develop. Works now, thanks. Say, will installing libpython generally make compiling python packages on Windows a painless process, or is it just for pandas?

@jreback
Contributor

jreback commented Jun 1, 2015

yes, this will in general give you a nice environment for doing nice c-extensions on windows

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

Great, thanks. I've had a look at the problem, and I've tried adding result = tools.to_datetime(result) above this line: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L2066

This packs my datetime64 array from parse_np_datetime64() in #10245 (comment) into another array in this line https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L345 and it ends up being sent to tslib.array_to_datetime on this line: https://github.com/pydata/pandas//blob/master/pandas/tseries/tools.py#L320

After that I have no idea what happens, because that's a C library and I don't know any C.

@jreback
Contributor

jreback commented Jun 1, 2015

This works just fine (and is the point of to_datetime: it will handle almost anything)

In [6]: dt = np.array(['2013-01-01T01:23:45Z'], dtype='datetime64[s]')

In [7]: dt
Out[7]: array(['2012-12-31T20:23:45-0500'], dtype='datetime64[s]')

In [8]: pd.to_datetime(dt)
Out[8]: DatetimeIndex(['2013-01-01 01:23:45'], dtype='datetime64[ns]', freq=None, tz=None)

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

Yes, I tried that myself and can confirm that it works. I entirely forgot to mention that when I ran the test script in #10245 (comment) with the one-line edit I mentioned above, I get an error which I've no idea what to make of:

  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 260, in _read
    return parser.read()
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 1223, in read
    index, names = self._make_index(data, alldata, names)
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 898, in _make_index
    index = self._agg_index(index, try_parse_dates=False)
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 981, in _agg_index
    arr, _ = self._convert_types(arr, col_na_values | col_na_fvalues)
  File "c:\users\christer\code\python\pandas\pandas\io\parsers.py", line 1028, in _convert_types
    na_count = lib.sanitize_objects(result, na_values, False)
  File "pandas\src\inference.pyx", line 942, in pandas.lib.sanitize_objects (pandas\lib.c:56899)
TypeError: unhashable type: 'numpy.ndarray'
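That unhashable-type error is consistent with the parsed column holding ndarray elements: checking an element against a set of NA values requires the element to be hashable, and numpy arrays are not. A minimal sketch of the failure mode (the names below are illustrative, not the actual parser internals):

```python
import numpy as np

na_values = {'', 'NaN'}  # hypothetical NA sentinel set

# If an element of the parsed column is itself an ndarray, testing it
# against the set raises the same TypeError seen in the traceback:
element = np.array(['2013-11-03T19:00:00'], dtype='datetime64[s]')
try:
    element in na_values
except TypeError as exc:
    print(exc)  # unhashable type: 'numpy.ndarray'
```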

@jreback
Contributor

jreback commented Jun 1, 2015

yeh, you will have to step thru the code and see

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

Sure, but again, this happens inside a C extension and I don't know how to deal with that. As far as the entire traceback goes and whether the object should end up there in the first place, I have absolutely no idea - the codebase is vast and the style is rather opaque to me. Perhaps it's best if someone else tries to crack this nut. 😞

@jreback
Contributor

jreback commented Jun 1, 2015

no, the error is in the input to the extension. you can easily just look at the cython code, in lib.pyx, and see what the input should actually be (it might need to be an object type, e.g. try _ensure_object).

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

I'll assume you were talking about inference.pyx, not lib.pyx. The first argument is ndarray[object] values, so I assume this needs to be an object. When I put print(result.dtype) just before sanitize_objects is called, I get object, so this shouldn't be a problem, right? Just taking stabs in the dark here.

@jreback
Contributor

jreback commented Jun 1, 2015

yes I am talking about inference (it is actually compiled into lib though). no, it should be a numpy array of objects (and not just an object).

@cmeeren
Contributor Author

cmeeren commented Jun 1, 2015

It is indeed a numpy array of objects (the dtype, not type(), was object). See here, the print statements are just before lib.sanitize_objects is called:

>>> print result
[array(['2013-11-03T20:00:00+0100'], dtype='datetime64[s]')
 array(['2013-11-03T20:00:00+0100'], dtype='datetime64[s]')
 array(['2013-11-03T20:00:00+0100'], dtype='datetime64[s]')]

>>> print type(result)
<type 'numpy.ndarray'>

>>> print result.dtype
 object

It seems strange to me that result is an array of arrays where each sub-array is a single datetime, instead of result being a numpy array with a single sub-array containing all the dates. Is it correct as per the above?

@jreback
Contributor

jreback commented Jun 1, 2015

an array of arrays is not correct. Not sure what is happening, you'll have to step thru and see where that is generated.

@cmeeren
Contributor Author

cmeeren commented Jun 2, 2015

I found two separate causes for the error I experienced, and I have a suggestion as to the solution.

First, my example was wrong and that was the reason for the "array of arrays" problem. My parse_np_datetime64() function should contain

np.array(date + 'T' + time + 'Z', dtype='datetime64[s]')

and not

np.array([date + 'T' + time + 'Z'], dtype='datetime64[s]').
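A rough reconstruction of the difference, assuming pandas falls back to calling the parser row by row and collecting the results into an object array (as lib.try_parse_dates does):

```python
import numpy as np

rows = [('2013-11-03', '19:00:00')] * 3

# Parser that returns a 1-element array per row: collecting the results
# yields an object array whose elements are themselves arrays (the
# "array of arrays" seen above).
wrapped = np.empty(len(rows), dtype=object)
for i, (d, t) in enumerate(rows):
    wrapped[i] = np.array([d + 'T' + t], dtype='datetime64[s]')

# Parser that returns a scalar datetime64 per row: the results collapse
# into a flat datetime64 array instead.
flat = np.array([np.array(d + 'T' + t, dtype='datetime64[s]')
                 for d, t in rows])
```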

Secondly, everything works fine when I use datetime64[ns] instead of datetime64[s]. This is because only [ns] appears in _DATELIKE_DTYPES in core.common.py (https://github.com/pydata/pandas/blob/4fde9462bd53f5f6b446bdcc6f222199a3f11ca5/pandas/core/common.py#L61-62). I added M8[s] to that list and that made my example script work using datetime64[s]. And the original script (using GPStime2datetime()) works if I add M8[ms].

How do you propose we continue from here? Is there any reason why datetime64[s]/M8[s] and similar shouldn't be allowed? In other words, could we just add a lot more resolutions to _DATELIKE_DTYPES? Specifically, the lines I linked to above might look like:

_DATELIKE_DTYPES = set([np.dtype(t+r)
                        for t in ['M8', '<M8', '>M8', 'm8', '<m8', '>m8']
                        for r in ['[ns]', '[ms]', '[s]']])
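As a quick sanity check of that proposal (np.dtype normalizes the various string spellings, and equal dtypes hash equally, so set membership works):

```python
import numpy as np

# The proposed expanded set of date-like dtypes:
datelike = set(np.dtype(t + r)
               for t in ['M8', '<M8', '>M8', 'm8', '<m8', '>m8']
               for r in ['[ns]', '[ms]', '[s]'])

# A second-resolution datetime64 array now matches the set:
arr = np.array(['2013-11-03T19:00:00'], dtype='datetime64[s]')
assert arr.dtype in datelike
```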

Which resolutions should be allowed? Should I submit a pull request?

@shoyer
Member

shoyer commented Jun 2, 2015

Thanks for digging in here. Pandas should certainly accept any datetime64 array from numpy, but internally pandas only uses datetime64[ns] (this simplifies internal operations considerably). So the right solution is to add coercion in the right place in pandas, e.g., with astype('datetime64[ns]').
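A minimal sketch of that coercion in plain numpy/pandas (not the parser internals):

```python
import numpy as np
import pandas as pd

# A datetime64 array at second resolution, as a custom date_parser might return:
arr = np.array(['2013-11-03T19:00:00'], dtype='datetime64[s]')

# Coerce to pandas' internal nanosecond resolution before building an index:
coerced = arr.astype('datetime64[ns]')
idx = pd.DatetimeIndex(coerced)
```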

@cmeeren
Contributor Author

cmeeren commented Jun 2, 2015

Would replacing https://github.com/pydata/pandas/blob/4fde9462bd53f5f6b446bdcc6f222199a3f11ca5/pandas/io/parsers.py#L2157 with new_col = parser(*to_parse).astype('datetime64[ns]') break anything? (It fixes my problem.) I can't run the tests here because I get a "python.exe has stopped working" after around 300 tests (and a couple of them failed even before making any changes to the code).

@shoyer
Member

shoyer commented Jun 2, 2015

@cmeeren you can setup Travis-CI to run the pandas test suite on your fork: http://docs.travis-ci.com/user/getting-started/

Without running the test suite I honestly have no idea :)

@jreback
Contributor

jreback commented Jun 2, 2015

@cmeeren you need to make the change that I suggested above

run the result of try_parse_dates thru to_datetime which will coerce this.

@jreback
Contributor

jreback commented Jun 2, 2015

see the contributing guidelines here

@cmeeren
Contributor Author

cmeeren commented Jun 2, 2015

@jreback, try_parse_dates isn't the problem in my case, since my functions accept each column as an argument and are therefore processed using date_parser here: https://github.com/pydata/pandas/blob/4fde9462bd53f5f6b446bdcc6f222199a3f11ca5/pandas/io/parsers.py#L2063

I have however wrapped all the three cases in to_datetime and made a commit. See the diff here pydata:08d60e6...cmeeren:7c7355e (Travis currently running, it passed nosetests pandas/io/tests/test_date_converters.py locally). Also, just before you posted your suggestion, I had made another fix which passed the Travis build, see here:

pydata:08d60e6...cmeeren:7d6f7c4

I'll make a pull request for whichever you want (if Travis test passes for the one in progress).

@jreback
Contributor

jreback commented Jun 2, 2015

best to do a pull-request. you need to add your example as a test.

@jreback jreback modified the milestones: 0.16.2, Next Major Release Jun 7, 2015