to_csv issue #8621

rmorgans · 2014-10-24T04:05:07Z

I have an issue using to_csv on a DataFrame object. It has a large number of columns: d.shape = (3, 454731).

Python 3.4.2 (default, Oct  8 2014, 13:44:52)
[GCC 4.9.1 20140903 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.14.22-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8

pandas: 0.15.0-16-g7012d71
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.0
scipy: 0.14.0
statsmodels: None
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.7
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)
>>> d=pd.read_msgpack('test.mpk')
>>> d.shape
(3, 454731)
>>> d.to_csv('test.csv')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/util/decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.4/site-packages/pandas/core/frame.py", line 1154, in to_csv
    formatter.save()
  File "/usr/lib/python3.4/site-packages/pandas/core/format.py", line 1400, in save
    self._save()
  File "/usr/lib/python3.4/site-packages/pandas/core/format.py", line 1492, in _save
    chunks = int(nrows / chunksize) + 1
ZeroDivisionError: division by zero
>>> d.T.to_csv('test.csv')
>>>

Not sure what's going on here - I've written a nosetest here (any tips for improvements in my test?)

rmorgans@f3d0a9e

jreback · 2014-10-24T12:24:01Z

better just to create a test programmatically

eg

df = DataFrame(np.random....)

with the above shape (if this doesn't fail then you have something in your dtypes that causes this to fail)

rmorgans · 2014-10-24T14:24:01Z

pd.DataFrame(np.random.randn(3, 100000)).to_csv('test.csv')

works

pd.DataFrame(np.random.randn(3, 100001)).to_csv('test.csv')

doesn't work... (well, it does seem to write a usable csv out)

So it looks like it's a limit of 100000... possibly something to do with

% ag 100000 /usr/lib/python3.4/site-packages/pandas/core/format.py
1157:            chunksize = (100000 / (len(self.cols) or 1)) or 1

jreback · 2014-10-25T00:08:05Z

@rmorgans perfect.

Want to do a pull request with that test and a fix?

follow along the same format in tests/test_frame.py, search for where the other to_csv tests are

rmorgans · 2014-10-25T13:55:32Z

I'll see if I can work out what's going on (also my first ever attempt at a pull request)

Is something like this OK for the test inside tests/test_frame.py?

    def test_to_csv_wide_frame_formatting(self):

        with ensure_clean() as path:
            pd.DataFrame(np.random.randn(3, 100001)).to_csv(path)

jreback · 2014-10-25T14:05:19Z

close

create the frame
write it out
read it in
compare
(just follow the template of the other examples)
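
The four steps above (create, write, read, compare) could be sketched roughly as follows. This is a minimal standalone version, not the test that ended up in tests/test_frame.py: the helper name roundtrip_wide_frame is hypothetical, and it uses tempfile and pd.testing.assert_frame_equal from current pandas rather than the ensure_clean helper used in the pandas test suite.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def roundtrip_wide_frame(ncols=100001, nrows=3):
    """Hypothetical helper: write a wide frame to CSV, read it back, return both."""
    # create the frame
    expected = pd.DataFrame(np.random.randn(nrows, ncols))
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "wide.csv")
        # write it out
        expected.to_csv(path)
        # read it in, using the first CSV column as the index
        result = pd.read_csv(path, index_col=0)
    # read_csv returns the column labels as strings; restore the integer labels
    result.columns = result.columns.astype("int64")
    return expected, result
```

To compare, call `expected, result = roundtrip_wide_frame()` and then `pd.testing.assert_frame_equal(result, expected)`. The default width of 100001 columns is the smallest one reported as failing in this thread; reading such a file back can be slow, so a smaller ncols is handy for a quick check.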

rmorgans · 2014-11-03T06:50:54Z

Hi Folks

I've worked out a suitable test
(https://github.com/rmorgans/pandas/blob/GH8621/pandas/tests/test_frame.py#L6461-L6467)

Here's the offending part

        if chunksize is None:
            chunksize = (100000 / (len(self.cols) or 1)) or 1
        self.chunksize = int(chunksize)

What's happening is the chunksize, if None, is being guessed - but why is it being guessed as a function of the number of columns? From the docs chunksize should be related to number of rows surely? What was the original intent here, and what behaviour is desired?

jreback · 2014-11-03T07:25:49Z

great, you can do a pull request to submit the test/fix for this issue

the intent was to write a fixed number of total elements at a time to keep a constant memory usage
remember you can have very wide tables, thus it's roughly proportional to that

if you would like to provide a better guessing function, would welcome that
keep in mind the various cases

small rows, small columns
small rows, lots of columns (this case)
large rows, small columns
large rows, large columns

in my limited tests didn't find much reason to have a very large chunksize
e.g. should be a min of say 10k but a max of 100k (if not user specified)
but if you have a very, very wide table as in this example this breaks down
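
To make the "fixed number of total elements per chunk" idea concrete, here is a rough sketch that evaluates a budget-based guess for the four shapes listed above. The helper rows_per_chunk and the example shapes (other than the 3 x 454731 frame from this issue) are assumptions for illustration, not pandas code; the 100000 budget is the constant from the snippet quoted earlier.

```python
BUDGET = 100000  # target number of cells per chunk, taken from the quoted snippet


def rows_per_chunk(ncols, budget=BUDGET):
    """Hypothetical helper: rows to write per chunk for a frame with ncols columns."""
    # Integer division keeps the result an int; max(..., 1) guarantees progress
    # even when a single row already exceeds the budget.
    return max(1, budget // max(ncols, 1))


# (nrows, ncols) examples for each case; the shapes are illustrative assumptions
cases = {
    "small rows, small columns": (100, 10),
    "small rows, lots of columns": (3, 454731),
    "large rows, small columns": (5000000, 10),
    "large rows, large columns": (5000000, 20000),
}

for label, (nrows, ncols) in cases.items():
    chunk = rows_per_chunk(ncols)
    # cells actually held per chunk; exceeds BUDGET only when ncols > BUDGET
    print(label, "rows/chunk:", chunk, "cells/chunk:", chunk * ncols)
```

For frames wider than the budget, the guess bottoms out at one row per chunk, so memory use then grows with the column count instead of staying constant. That is the breakdown described above for very wide tables.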

jreback · 2014-11-03T07:28:15Z

a very similar chunking mechanism is in place for HDFStore and to_sql
so should make a general function for this
put it in pandas.io.common

dxe4 · 2014-11-29T13:15:47Z

will give this a try

dxe4 · 2014-11-29T13:26:13Z

python3 issue

# python3
>>> cols = 1000000000
>>> int((100000 / (cols or 1)) or 1)
0

# python2
>>> cols = 1000000000
>>> int((100000 / (cols or 1)) or 1)
1

@fvia mentioned changing / to // fixes it
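
A small side-by-side sketch of the two expressions, run under Python 3, shows why `//` helps. The function names are made up for this illustration, and this is not necessarily the exact patch that was merged. With `/`, the quotient is a float such as 0.99999, which is truthy, so `or 1` never applies and int() truncates it to 0. With `//`, the quotient is the integer 0, which is falsy, so `or 1` takes over.

```python
def chunksize_true_division(ncols, budget=100000):
    # Mirrors the expression quoted in the thread; under Python 3, `/` yields a float.
    return int((budget / (ncols or 1)) or 1)


def chunksize_floor_division(ncols, budget=100000):
    # Same expression with `//`: the quotient is an int, so `or 1` catches 0.
    return (budget // (ncols or 1)) or 1


for ncols in (10, 100000, 100001, 454731):
    print(ncols, chunksize_true_division(ncols), chunksize_floor_division(ncols))
```

Any column count above 100000 makes the first function return 0 under Python 3, which is what then triggers the ZeroDivisionError in `int(nrows / chunksize) + 1` from the traceback.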

fvia · 2014-11-29T13:42:22Z

@jreback, I feel the chunksize in HDFStore and to_sql, despite having the same name, has different logic.

jreback · 2014-11-29T13:44:50Z

if you wanted to refactor it out to a function in io.common, that would be ok

jreback added the Bug, IO CSV (read_csv, to_csv), and Output-Formatting (__repr__ of pandas objects, to_string) labels Oct 24, 2014

jreback added this to the 0.15.1 milestone Oct 24, 2014

jorisvandenbossche added the Good as first PR label Nov 28, 2014

jreback mentioned this issue Nov 28, 2014

Bloomberg Hackathon #8323

Closed

dxe4 mentioned this issue Nov 29, 2014

BUG: fixed chunksize guessed to 0 (py3 only). #8621 #8927

Merged

dxe4 added a commit to dxe4/pandas that referenced this issue Nov 30, 2014

BUG: fixed chunksize guessed to 0 (py3 only). pandas-dev#8621

549422f

jreback closed this as completed in #8927 Dec 3, 2014

jreback mentioned this issue Mar 28, 2022

Speed up a test #46547

Closed

4 tasks
