OverflowError when loading uint64 from csv #11440

teto · 2015-10-27T14:15:25Z

Hi,
I am trying to plot a set of values that are all uint64 with pandas.
The CSV file is available here https://transfer.sh/Sy5DS/iperf-client-linux-2rtrs-f30b30-f30b30-w140k-lia-run1.csv and this is the command I used:

In [6]: df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-6-794bcaffa179> in <module>()
----> 1 df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})

/usr/lib/python3/dist-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    461                     skip_blank_lines=skip_blank_lines)
    462 
--> 463         return _read(filepath_or_buffer, kwds)
    464 
    465     parser_f.__name__ = name

/usr/lib/python3/dist-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    247         return parser
    248 
--> 249     return parser.read()
    250 
    251 _parser_defaults = {

/usr/lib/python3/dist-packages/pandas/io/parsers.py in read(self, nrows)
    704                 raise ValueError('skip_footer not supported for iteration')
    705 
--> 706         ret = self._engine.read(nrows)
    707 
    708         if self.options.get('as_recarray'):

/usr/lib/python3/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1148 
   1149         try:
-> 1150             data = self._reader.read(nrows)
   1151         except StopIteration:
   1152             if nrows is None:

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader.read (pandas/parser.c:7287)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7511)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._read_rows (pandas/parser.c:8336)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9544)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10106)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10503)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser._try_int64 (pandas/parser.c:18126)()

My version is the one from the Ubuntu 15.04 repository:
pd.version.version
Out[10]: '0.15.0'

I am trying to upgrade it, hoping that will solve this. Would it?
I believe the problem is related to:
#4471

I recently discovered pandas and so far it's the best tool I have found to plot and work with data (I tried R, gnuplot, etc.). Thanks a lot for the work.


teto · 2015-10-27T14:27:54Z

I just tried with 0.17 and the error is now "Kernel died, restarting", which in the plain interpreter shows up as a segfault:

Python 3.4.3 (default, Mar 26 2015, 22:03:40) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})
zsh: segmentation fault (core dumped)  python3

Winterflower · 2015-10-27T19:29:44Z

hi @teto,
I was trying to replicate, but it seems you have linked a .pcap file instead of the csv file you tried to create a DataFrame from.

teto · 2015-10-27T19:36:31Z

Indeed, sorry: the CSV is generated from the pcap, my mistake. Here is the correct file:
https://transfer.sh/Sy5DS/iperf-client-linux-2rtrs-f30b30-f30b30-w140k-lia-run1.csv

Winterflower · 2015-10-27T19:58:49Z

thanks @teto !
I think the problem here might be that the values in your file are separated by vertical bars ('|') instead of commas.

tcpstream|ipsrc|dss_length|dss_dsn|sport|packetid|ipdst|dss_rawack|dsn|time_delta|dport|tcpflags|dss_ssn|tcpseq|dack|subtype|mptcpstream|datafin|master|reltime
|||||1||||0.000000000||||||||||0.000000000
|||||2||||0.000000000||||||||||0.000000000
|||||3||||0.030046000||||||||||0.030046000

teto · 2015-10-27T20:03:04Z

All my TSV files use '|' as a separator. The crash only happens with very long numbers (if you remove the 'dtype' parameter from the read_csv call, the file loads just fine).

teto · 2015-10-27T20:41:06Z

My bad, I forgot to say in my description that by default the dsn column is loaded as "object", but I want to plot it, which is why I added dtype={'dsn': np.uint64}.

Thanks for your help :)

Winterflower · 2015-10-27T21:42:56Z

Can confirm that this happens in pandas 0.17.
I'm far from an expert, but I think it might be because some of the values in that column are NaN and, if I recall correctly, NaN is a floating-point value that integer arrays cannot hold.

I tried this with your datafile:

>>> pd.read_csv("iperf.csv", sep="|", dtype={'dsn':np.float64}).head()
   tcpstream  ipsrc  dss_length  dss_dsn  sport  packetid  ipdst  dss_rawack  \
0        NaN    NaN         NaN      NaN    NaN         1    NaN         NaN   
1        NaN    NaN         NaN      NaN    NaN         2    NaN         NaN   
2        NaN    NaN         NaN      NaN    NaN         3    NaN         NaN   
3        NaN    NaN         NaN      NaN    NaN         4    NaN         NaN   
4        NaN    NaN         NaN      NaN    NaN         5    NaN         NaN   

   dsn  time_delta  dport tcpflags  dss_ssn  tcpseq dack subtype  mptcpstream  \
0  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
1  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
2  NaN    0.030046    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
3  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
4  NaN    3.969954    NaN      NaN      NaN     NaN  NaN     NaN          NaN   

   datafin  master   reltime  
0      NaN     NaN  0.000000  
1      NaN     NaN  0.000000  
2      NaN     NaN  0.030046  
3      NaN     NaN  0.030046  
4      NaN     NaN  4.000000

and it seems to work, which would suggest that np.uint64 cannot handle the NaNs. I'll let someone more experienced weigh in, though.

Can you still make your plot when the datatype is float or do you need np.uint64 for something in particular?
(sorry about all the spam in this thread....)
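
A side note on the float64 workaround, as a small standard-library sketch: NaN is a float and has no integer equivalent, and a float64 can only represent integers exactly up to 2**53. Whether the dsn values in the file actually exceed that limit is an assumption here; the numbers below are illustrative and are not taken from the CSV.

```python
import math

# NaN is a floating-point value; Python refuses to turn it into an integer.
nan = float("nan")
try:
    int(nan)
    nan_converts = True
except ValueError:
    nan_converts = False

# float64 has a 53-bit significand, so larger integers get rounded.
big = 2**64 - 1                       # largest value a uint64 can hold
round_trips = int(float(big)) == big  # float(big) rounds up to 2**64
neighbours_collide = float(2**63) == float(2**63 + 1)  # both map to one double

print(math.isnan(nan), nan_converts, round_trips, neighbours_collide)
# prints: True False False True
```

So loading the column as float64 avoids the crash, but distinct large sequence numbers may end up plotted at the same value.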

jreback · 2015-10-27T21:49:24Z

@teto so the way to do this is to just pass dtype={'dsn' : object} instead of trying to convert these to an integer type.

teto · 2015-10-27T21:50:48Z

They are loaded as object by default, but then I can't plot the column, hence I force the conversion to uint64.
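
One possible way to get from the object column to something plottable, sketched under assumptions: the sample data below is made up (same '|' layout, reduced to three columns, with invented dsn values), and the conversion is done by hand with NumPy after dropping the rows where dsn is missing. This is a workaround sketch, not the fix tracked in #4471.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical sample: one missing dsn and two values above the int64 maximum.
data = """packetid|dsn|reltime
1||0.000000000
2|18446744073709551615|0.030046000
3|9223372036854775808|4.000000000
"""

# Read dsn as text so the parser never attempts an integer conversion.
df = pd.read_csv(StringIO(data), sep="|", dtype={"dsn": str})

# Keep only rows that actually carry a dsn, then convert the strings by hand.
valid = df.dropna(subset=["dsn"])
dsn = np.array([int(s) for s in valid["dsn"]], dtype=np.uint64)

print(dsn.dtype, len(valid))
```

The resulting `dsn` array lines up with `valid["reltime"]`, so the two can be passed together to a plotting call.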

jreback · 2015-10-27T21:50:53Z

In [1]: data = """tcpstream|ipsrc|dss_length|dss_dsn|sport|packetid|ipdst|dss_rawack|dsn|time_delta|dport|tcpflags|dss_ssn|tcpseq|dack|subtype|mptcpstream|datafin|master|reltime
   ...: |||||1||||0.000000000||||||||||0.000000000
   ...: |||||2||||0.000000000||||||||||0.000000000
   ...: |||||3||||0.030046000||||||||||0.030046000"""


In [3]: pd.read_csv(StringIO(data),sep='|',dtype={'dsn' : object})
Out[3]: 
   tcpstream  ipsrc  dss_length  dss_dsn  sport  packetid  ipdst  dss_rawack  dsn  time_delta  dport  tcpflags  dss_ssn  tcpseq  dack  subtype  mptcpstream  datafin  master   reltime
0        NaN    NaN         NaN      NaN    NaN         1    NaN         NaN  NaN    0.000000    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.000000
1        NaN    NaN         NaN      NaN    NaN         2    NaN         NaN  NaN    0.000000    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.000000
2        NaN    NaN         NaN      NaN    NaN         3    NaN         NaN  NaN    0.030046    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.030046

In [4]: pd.read_csv(StringIO(data),sep='|',dtype={'dsn' : object}).dtypes
Out[4]: 
tcpstream      float64
ipsrc          float64
dss_length     float64
dss_dsn        float64
sport          float64
packetid         int64
ipdst          float64
dss_rawack     float64
dsn             object
time_delta     float64
dport          float64
tcpflags       float64
dss_ssn        float64
tcpseq         float64
dack           float64
subtype        float64
mptcpstream    float64
datafin        float64
master         float64
reltime        float64
dtype: object

jreback · 2015-10-27T21:51:11Z

can you show what the values actually are (in a small copy-pastable example)?

jreback · 2015-10-27T21:51:59Z

fyi this is a manifestation of #4471
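
For readers unfamiliar with #4471 ("lib.maybe_convert_objects will fail on uint64 values that exceed int64 max"), the underlying limit can be illustrated with NumPy alone. This is a sketch of the range problem, not of the pandas internals; the value used is simply the first integer past the int64 maximum.

```python
import numpy as np

INT64_MAX = int(np.iinfo(np.int64).max)    # 2**63 - 1
UINT64_MAX = int(np.iinfo(np.uint64).max)  # 2**64 - 1

value = INT64_MAX + 1  # fits in uint64, but not in int64

# Forcing the value into a signed 64-bit array overflows.
try:
    np.array([value], dtype=np.int64)
    int64_ok = True
except OverflowError:
    int64_ok = False

# The same value is stored exactly in an unsigned 64-bit array.
as_uint64 = np.array([value], dtype=np.uint64)

print(int64_ok, int(as_uint64[0]) == value)
```

Any conversion path that tries int64 first, as the `_try_int64` frame in the traceback suggests, hits this limit before uint64 is ever considered.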

teto · 2015-10-27T21:55:28Z

I guessed so, but the mentioned branch (https://github.com/jtratner/pandas/tree/GH4471_fix_uint64_maybe_convert_objects) looks out of tree: it was scheduled for 0.14 and is not yet merged upstream. Is that correct?

jreback · 2015-10-27T22:05:07Z

that's correct. it's still an open bug.

jreback · 2015-10-27T22:05:24Z

closing as dupe.

thanks for the report.

jreback added the "Dtype Conversions" label (Unexpected or buggy dtype conversions) Oct 27, 2015

jreback closed this as completed Oct 27, 2015

jreback added the "IO CSV" label (read_csv, to_csv) Oct 27, 2015

wesm mentioned this issue Oct 27, 2015

lib.maybe_convert_objects will fail on uint64 values that exceed int64 max #4471

Closed
