to_csv issue #8621

Closed
rmorgans opened this issue Oct 24, 2014 · 12 comments · Fixed by #8927
Labels
Bug IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string
Comments

@rmorgans

I have an issue using to_csv on a DataFrame object that has a very large number of columns: d.shape = (3, 454731).

Python 3.4.2 (default, Oct  8 2014, 13:44:52)
[GCC 4.9.1 20140903 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.14.22-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8

pandas: 0.15.0-16-g7012d71
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.0
scipy: 0.14.0
statsmodels: None
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.7
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)
>>> d=pd.read_msgpack('test.mpk')
>>> d.shape
(3, 454731)
>>> d.to_csv('test.csv')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/util/decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.4/site-packages/pandas/core/frame.py", line 1154, in to_csv
    formatter.save()
  File "/usr/lib/python3.4/site-packages/pandas/core/format.py", line 1400, in save
    self._save()
  File "/usr/lib/python3.4/site-packages/pandas/core/format.py", line 1492, in _save
    chunks = int(nrows / chunksize) + 1
ZeroDivisionError: division by zero
>>> d.T.to_csv('test.csv')
>>>

Not sure what's going on here - I've written a nosetest here (any tips for improvements in my test?)

rmorgans@f3d0a9e

@jreback
Contributor

jreback commented Oct 24, 2014

Better to just create the test programmatically,

e.g.

df = DataFrame(np.random....)

with the above shape (if this doesn't fail, then you have something in your dtypes that causes it to fail).

@jreback jreback added Bug IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string labels Oct 24, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 24, 2014
@rmorgans
Author

pd.DataFrame(np.random.randn(3, 100000)).to_csv('test.csv')

works

pd.DataFrame(np.random.randn(3, 100001)).to_csv('test.csv')

doesn't work... (well, it does seem to write a usable CSV out)

So it looks like it's a limit of 100000... possibly something to do with

% ag 100000 /usr/lib/python3.4/site-packages/pandas/core/format.py
1157:            chunksize = (100000 / (len(self.cols) or 1)) or 1

@jreback
Contributor

jreback commented Oct 25, 2014

@rmorgans perfect.

Want to do a pull request with that test and a fix?

Follow the same format as in tests/test_frame.py; search for where the other to_csv tests are.

@rmorgans
Author

I'll see if I can work out what's going on (also my first ever attempt at a pull request)

Is something like this OK for the test inside tests/test_frame.py?

    def test_to_csv_wide_frame_formatting(self):

        with ensure_clean() as path:
            pd.DataFrame(np.random.randn(3, 100001)).to_csv(path)

@jreback
Contributor

jreback commented Oct 25, 2014

close

create the frame
write it out
read it in
compare
(just follow the template of the other examples)
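
The four steps above can be sketched roughly as follows. This is only an illustration, not the test that was eventually merged: the function name `roundtrip_wide_frame` is hypothetical, an in-memory buffer stands in for the temporary file that `ensure_clean` provides in the pandas test suite, and `pd.testing.assert_frame_equal` is assumed to be available (it is in current pandas releases).

```python
import io

import numpy as np
import pandas as pd


def roundtrip_wide_frame(n_cols=100001, n_rows=3):
    """Write a wide frame to CSV, read it back, and compare (hypothetical helper)."""
    # 1. Create the frame.
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))

    # 2. Write it out (to an in-memory buffer instead of a temporary file).
    buf = io.StringIO()
    df.to_csv(buf)

    # 3. Read it back in; the first CSV column holds the index.
    buf.seek(0)
    result = pd.read_csv(buf, index_col=0)
    # Column labels come back as strings, so restore the integer labels.
    result.columns = result.columns.astype("int64")

    # 4. Compare; floats are compared with a tolerance by default.
    pd.testing.assert_frame_equal(df, result)
    return result.shape
```

Calling `roundtrip_wide_frame()` with the default width exercises the 100001-column case from this issue; a smaller `n_cols` keeps the check fast.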

@rmorgans
Author

rmorgans commented Nov 3, 2014

Hi Folks

I've worked out a suitable test
(https://github.com/rmorgans/pandas/blob/GH8621/pandas/tests/test_frame.py#L6461-L6467)

Here's the offending part

        if chunksize is None:
            chunksize = (100000 / (len(self.cols) or 1)) or 1
        self.chunksize = int(chunksize)

What's happening is the chunksize, if None, is being guessed - but why is it being guessed as a function of the number of columns? From the docs chunksize should be related to number of rows surely? What was the original intent here, and what behaviour is desired?
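
To make the failure concrete, here is a minimal sketch that replays the arithmetic from the snippet and the traceback above on Python 3, without pandas. The helper name `guess_chunksize_buggy` is made up for illustration; it only mirrors the two quoted lines.

```python
def guess_chunksize_buggy(n_cols):
    """Mirror the quoted default: true division, then truncation via int()."""
    # On Python 3 this is a float; a value such as 0.99999 is still truthy,
    # so the trailing `or 1` never replaces it.
    chunksize = (100000 / (n_cols or 1)) or 1
    return int(chunksize)


# With 100000 columns the quotient is exactly 1.0, so int() gives 1.
ok = guess_chunksize_buggy(100000)

# With one more column the quotient drops just below 1.0 and int() gives 0.
bad = guess_chunksize_buggy(100001)

# The traceback's line `chunks = int(nrows / chunksize) + 1` then divides by zero.
nrows = 3
try:
    chunks = int(nrows / bad) + 1
except ZeroDivisionError as exc:
    error_message = str(exc)
```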

@jreback
Contributor

jreback commented Nov 3, 2014

Great, you can do a pull request to submit the test/fix for this issue.

The intent was to write a fixed number of total elements at a time, to keep memory usage constant.
Remember you can have very wide tables, thus it's roughly proportional to that.

If you would like to provide a better guessing function, we would welcome that.
Keep in mind the various cases:

small rows, small columns
small rows, lots of columns (this case)
large rows, small columns
large rows, large columns

In my limited tests I didn't find much reason to have a very large chunksize,
e.g. it should be a min of say 10k but a max of 100k (if not user specified),
but if you have a very, very wide table, as in this example, this breaks down.
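
As a rough sketch of the "fixed number of total elements" idea across the four cases listed above, the following stdlib-only example derives rows per chunk from an element budget. The names `guess_chunksize`, `n_chunks`, and `target_elements` are hypothetical and not part of pandas. The 10k/100k bounds are left out here, on the assumption that a row floor would defeat the budget for a table as wide as the one in this issue.

```python
def guess_chunksize(n_cols, target_elements=100000):
    """Rows per chunk so that one chunk holds roughly `target_elements` cells."""
    n_cols = max(int(n_cols), 1)  # treat an empty frame as one column
    # Integer division, floored at one row so the result can never be zero.
    return max(target_elements // n_cols, 1)


def n_chunks(n_rows, chunksize):
    """Number of chunks needed to cover `n_rows` (ceiling division)."""
    return -(-n_rows // chunksize)


# (rows, columns) for the four cases in the comment above.
cases = {
    "small rows, small columns": (100, 10),
    "small rows, lots of columns": (3, 454731),
    "large rows, small columns": (1000000, 10),
    "large rows, large columns": (1000000, 1000),
}

# Map each case to (rows per chunk, number of chunks).
plan = {}
for name, (rows, cols) in cases.items():
    size = guess_chunksize(cols)
    plan[name] = (size, n_chunks(rows, size))
```

For the wide frame from this issue the guess bottoms out at one row per chunk, so each chunk still holds far more than the budgeted number of elements; that is the case where the heuristic breaks down.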

@jreback
Contributor

jreback commented Nov 3, 2014

A very similar chunking mechanism is in place for HDFStore and to_sql,
so we should make a general function for this
and put it in pandas.io.common.

@dxe4
Contributor

dxe4 commented Nov 29, 2014

will give this a try

@dxe4
Contributor

dxe4 commented Nov 29, 2014

python3 issue

# python3
>>> cols = 1000000000
>>> int((100000 / (cols or 1)) or 1)
0
# python2
>>> cols = 1000000000
>>> int((100000 / (cols or 1)) or 1)
1

@fvia mentioned changing / to // fixes it
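
A minimal sketch of that change, assuming the only edit is replacing `/` with `//` in the quoted expression; `guess_chunksize_fixed` is a hypothetical name used for illustration.

```python
def guess_chunksize_fixed(n_cols):
    """Default chunksize guess using floor division (behaves the same on Python 2 and 3)."""
    # Floor division yields an int; a quotient of 0 is falsy, so `or 1` applies.
    return (100000 // (n_cols or 1)) or 1


wide = guess_chunksize_fixed(100001)      # the failing width from this issue
huge = guess_chunksize_fixed(1000000000)  # the width used in the REPL session above
narrow = guess_chunksize_fixed(10)
```

With floor division the result is already an integer, so the later `int(chunksize)` call no longer truncates a fractional value to zero.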

@fvia
Contributor

fvia commented Nov 29, 2014

@jreback, I feel that chunksize in HDFStore and to_sql, despite having the same name, follows different logic.

@jreback
Contributor

jreback commented Nov 29, 2014

If you want to refactor it out into a function in io.common, that would be OK.

5 participants