OverflowError when loading uint64 from csv #11440

Closed
teto opened this issue Oct 27, 2015 · 15 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), IO CSV (read_csv, to_csv)

Comments

@teto

teto commented Oct 27, 2015

Hi,
I am trying to plot a set of values that are all uint64 with pandas.
The CSV file is available here https://transfer.sh/Sy5DS/iperf-client-linux-2rtrs-f30b30-f30b30-w140k-lia-run1.csv and this is the command I used:

In [6]: df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-6-794bcaffa179> in <module>()
----> 1 df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})

/usr/lib/python3/dist-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    461                     skip_blank_lines=skip_blank_lines)
    462 
--> 463         return _read(filepath_or_buffer, kwds)
    464 
    465     parser_f.__name__ = name

/usr/lib/python3/dist-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    247         return parser
    248 
--> 249     return parser.read()
    250 
    251 _parser_defaults = {

/usr/lib/python3/dist-packages/pandas/io/parsers.py in read(self, nrows)
    704                 raise ValueError('skip_footer not supported for iteration')
    705 
--> 706         ret = self._engine.read(nrows)
    707 
    708         if self.options.get('as_recarray'):

/usr/lib/python3/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1148 
   1149         try:
-> 1150             data = self._reader.read(nrows)
   1151         except StopIteration:
   1152             if nrows is None:

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader.read (pandas/parser.c:7287)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7511)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._read_rows (pandas/parser.c:8336)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9544)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10106)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10503)()

/usr/lib/python3/dist-packages/pandas/parser.cpython-34m-x86_64-linux-gnu.so in pandas.parser._try_int64 (pandas/parser.c:18126)()

My version is the one from Ubuntu 15.04's repository:
pd.version.version
Out[10]: '0.15.0'

I am trying to upgrade it, hoping that will solve this. Would it?
I believe the problem is related to:
#4471

I recently discovered pandas and so far it's the best tool I have found to plot and work with data (I tried R, gnuplot, etc.). Thanks a lot for the work.

@teto
Author

teto commented Oct 27, 2015

I just tried with 0.17 and the error is now "Kernel died, restarting", which shows up as a segfault when run from the plain interpreter:

Python 3.4.3 (default, Mar 26 2015, 22:03:40) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('/home/teto/ns3testing/iperf-client-linux_2rtrs_f30b30_f30b30_w140K_lia-run1.csv', sep='|', dtype={'dsn': np.uint64})
zsh: segmentation fault (core dumped)  python3

@Winterflower
Contributor

hi @teto,
I was trying to replicate, but it seems you have linked a .pcap file instead of the csv file you tried to create a DataFrame from.

@teto
Author

teto commented Oct 27, 2015

Indeed, sorry: the CSV is generated from the pcap, my mistake. Here is the correct file:
https://transfer.sh/Sy5DS/iperf-client-linux-2rtrs-f30b30-f30b30-w140k-lia-run1.csv

@Winterflower
Contributor

thanks @teto !
I think the problem here might be that values in your file are separated by vertical bars ('|') instead of commas.

tcpstream|ipsrc|dss_length|dss_dsn|sport|packetid|ipdst|dss_rawack|dsn|time_delta|dport|tcpflags|dss_ssn|tcpseq|dack|subtype|mptcpstream|datafin|master|reltime
|||||1||||0.000000000||||||||||0.000000000
|||||2||||0.000000000||||||||||0.000000000
|||||3||||0.030046000||||||||||0.030046000
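
For reference, the rows quoted above do split cleanly on '|'. The following is only a small sketch using Python's standard csv module (not pandas) to show that the header has 20 columns and that most fields, including dsn, are empty in these first rows. The helper name parse_pipe_separated is made up for this illustration.

```python
import csv
from io import StringIO

# The first rows of the sample quoted above, copied verbatim.
SAMPLE = (
    "tcpstream|ipsrc|dss_length|dss_dsn|sport|packetid|ipdst|dss_rawack|dsn|"
    "time_delta|dport|tcpflags|dss_ssn|tcpseq|dack|subtype|mptcpstream|"
    "datafin|master|reltime\n"
    "|||||1||||0.000000000||||||||||0.000000000\n"
    "|||||2||||0.000000000||||||||||0.000000000\n"
    "|||||3||||0.030046000||||||||||0.030046000\n"
)


def parse_pipe_separated(text):
    """Return (column names, list of row dicts) for '|'-delimited text.

    Hypothetical helper, written only for this example.
    """
    reader = csv.DictReader(StringIO(text), delimiter="|")
    rows = list(reader)
    return reader.fieldnames, rows


columns, rows = parse_pipe_separated(SAMPLE)
print(len(columns), "columns")
for row in rows:
    # Empty fields come back as empty strings; pandas would report them as NaN.
    print(row["packetid"], repr(row["dsn"]), row["reltime"])
```

So the separator itself is not an obstacle as long as it is passed explicitly, which the original read_csv call does with sep='|'.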

@teto
Author

teto commented Oct 27, 2015

All my TSV files use '|' as a separator; I only get this kind of crash with very long numbers (if you remove the 'dtype' parameter from the read_csv call, the file loads just fine).

@teto
Author

teto commented Oct 27, 2015

My bad, I forgot to say in my description that by default the dsn column is loaded as "object", but I want to plot it. That's why I added dtype={'dsn': np.uint64}.

Thanks for your help :)

@Winterflower
Contributor

Can confirm that this happens in pandas 0.17.
I'm by far not an expert, but I think it might have something to do with some of the values in that column being NaN. If I recall correctly, NaN is a floating-point value, so it can be stored in float arrays but not in integer arrays.

I tried this with your datafile:

>>> pd.read_csv("iperf.csv", sep="|", dtype={'dsn':np.float64}).head()
   tcpstream  ipsrc  dss_length  dss_dsn  sport  packetid  ipdst  dss_rawack  \
0        NaN    NaN         NaN      NaN    NaN         1    NaN         NaN   
1        NaN    NaN         NaN      NaN    NaN         2    NaN         NaN   
2        NaN    NaN         NaN      NaN    NaN         3    NaN         NaN   
3        NaN    NaN         NaN      NaN    NaN         4    NaN         NaN   
4        NaN    NaN         NaN      NaN    NaN         5    NaN         NaN   

   dsn  time_delta  dport tcpflags  dss_ssn  tcpseq dack subtype  mptcpstream  \
0  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
1  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
2  NaN    0.030046    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
3  NaN    0.000000    NaN      NaN      NaN     NaN  NaN     NaN          NaN   
4  NaN    3.969954    NaN      NaN      NaN     NaN  NaN     NaN          NaN   

   datafin  master   reltime  
0      NaN     NaN  0.000000  
1      NaN     NaN  0.000000  
2      NaN     NaN  0.030046  
3      NaN     NaN  0.030046  
4      NaN     NaN  4.000000 

and it seems to work, which would suggest that the np.uint64 is not liking the NaNs. I'll let someone more experienced weigh in, though.

Can you still make your plot when the datatype is float or do you need np.uint64 for something in particular?
(sorry about all the spam in this thread....)
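
One caveat with the float64 route, sketched below in plain Python: a float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable, and values above 2**63 - 1 do not fit a signed 64-bit integer at all (the traceback above ends in _try_int64). The test values are made up, since the thread does not show the actual dsn values, and the helper names are hypothetical.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value):
    """True if value is representable as a signed 64-bit integer."""
    return -(2**63) <= value <= INT64_MAX


def fits_uint64(value):
    """True if value is representable as an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


def exact_in_float64(value):
    """True if value survives a round trip through a Python float (IEEE 754 double)."""
    return int(float(value)) == value


# Made-up candidates standing in for large sequence numbers.
for candidate in (2**53, 2**53 + 1, INT64_MAX + 1, UINT64_MAX):
    print(
        candidate,
        "int64:", fits_int64(candidate),
        "uint64:", fits_uint64(candidate),
        "exact as float64:", exact_in_float64(candidate),
    )
```

If the dsn values stay below 2**53, plotting from float64 loses nothing; above that, neighbouring values can collapse onto the same float.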

@jreback
Contributor

jreback commented Oct 27, 2015

@teto so the way to do this is to just pass dtype={'dsn' : object} instead of trying to convert these to an integer type.

@teto
Author

teto commented Oct 27, 2015

They are by default loaded as object but then, I can't plot the column, hence I force the conversion to uint64.
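
A possible way to combine both needs, offered only as a sketch and not as the fix discussed in #4471: load dsn as object, drop the missing rows, and convert the remaining strings to np.uint64 after the fact. The data below is made up (the real dsn values are not shown in this thread), and load_dsn_as_uint64 is a hypothetical helper name.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Made-up data in the same '|'-separated layout; the first row has no dsn.
DATA = (
    "packetid|dsn|reltime\n"
    "1||0.000000000\n"
    "2|18446744073709551615|0.030046000\n"
    "3|9223372036854775808|0.060000000\n"
)


def load_dsn_as_uint64(buffer):
    """Read the file with dsn as object, then convert non-missing values to uint64."""
    df = pd.read_csv(buffer, sep="|", dtype={"dsn": object})
    # Missing fields are NaN; an integer array cannot hold them, so drop them first.
    present = df["dsn"].dropna()
    # Go through Python ints (arbitrary precision) so no value passes through float64.
    values = np.array([int(v) for v in present], dtype=np.uint64)
    return pd.Series(values, index=present.index, name="dsn")


dsn = load_dsn_as_uint64(StringIO(DATA))
print(dsn.dtype)
print(dsn)
```

The resulting Series keeps the original row labels, so it can still be aligned with other columns such as reltime for plotting.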

@jreback
Contributor

jreback commented Oct 27, 2015

In [1]: data = """tcpstream|ipsrc|dss_length|dss_dsn|sport|packetid|ipdst|dss_rawack|dsn|time_delta|dport|tcpflags|dss_ssn|tcpseq|dack|subtype|mptcpstream|datafin|master|reltime
   ...: |||||1||||0.000000000||||||||||0.000000000
   ...: |||||2||||0.000000000||||||||||0.000000000
   ...: |||||3||||0.030046000||||||||||0.030046000"""


In [3]: pd.read_csv(StringIO(data),sep='|',dtype={'dsn' : object})
Out[3]: 
   tcpstream  ipsrc  dss_length  dss_dsn  sport  packetid  ipdst  dss_rawack  dsn  time_delta  dport  tcpflags  dss_ssn  tcpseq  dack  subtype  mptcpstream  datafin  master   reltime
0        NaN    NaN         NaN      NaN    NaN         1    NaN         NaN  NaN    0.000000    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.000000
1        NaN    NaN         NaN      NaN    NaN         2    NaN         NaN  NaN    0.000000    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.000000
2        NaN    NaN         NaN      NaN    NaN         3    NaN         NaN  NaN    0.030046    NaN       NaN      NaN     NaN   NaN      NaN          NaN      NaN     NaN  0.030046

In [4]: pd.read_csv(StringIO(data),sep='|',dtype={'dsn' : object}).dtypes
Out[4]: 
tcpstream      float64
ipsrc          float64
dss_length     float64
dss_dsn        float64
sport          float64
packetid         int64
ipdst          float64
dss_rawack     float64
dsn             object
time_delta     float64
dport          float64
tcpflags       float64
dss_ssn        float64
tcpseq         float64
dack           float64
subtype        float64
mptcpstream    float64
datafin        float64
master         float64
reltime        float64
dtype: object

@jreback
Contributor

jreback commented Oct 27, 2015

Can you show what the values actually are (in a small copy-pastable example)?

jreback added the Dtype Conversions label Oct 27, 2015

@jreback
Contributor

jreback commented Oct 27, 2015

fyi this is a manifestation of #4471

@teto
Author

teto commented Oct 27, 2015

I guessed so, but the mentioned branch (https://github.com/jtratner/pandas/tree/GH4471_fix_uint64_maybe_convert_objects) looks out of tree: it was scheduled for 0.14 and is not yet merged upstream. Is that correct?

@jreback
Contributor

jreback commented Oct 27, 2015

That's correct. It's still an open bug.

@jreback
Contributor

jreback commented Oct 27, 2015

closing as dupe.

thanks for the report.


3 participants