BUG: to_csv requires escapechar unnecessarily when data contains null byte \x00 (Python 3.10+ only) #47871

Open
2 of 3 tasks
daviewales opened this issue Jul 27, 2022 · 8 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@daviewales
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

########### PowerShell ############
import pandas as pd
df = pd.DataFrame({'A': ['\x00']})
df.to_csv('null_byte.csv', index=False)

# Error: need to escape, but no escapechar set

import pandas as pd
df = pd.DataFrame({'A': ['\x00']})
df.to_csv('null_byte.csv', index=False, escapechar='\\')

# Works, BUT escapechar appears to be redundant, because there is no escapechar in the file on disk!
$ xxd null_byte.csv            # from WSL, because no xxd in PowerShell
00000000: 410d 0a00 0d0a                           A.....

########### Ubuntu/Bash ############
import pandas as pd
df = pd.DataFrame({'A': ['\x00']})
df.to_csv('null_byte.csv', index=False)

# WORKS, but output has different format:
$ xxd null_byte.csv
00000000: 410a 2200 220a                           A.".".

Issue Description

NOTE: I'm running this in PowerShell, but using xxd from Windows Subsystem for Linux.

If a dataframe contains a null byte \x00 as a value, to_csv requires escapechar to be set when run from PowerShell, but not when run from Ubuntu/Bash. However, the escapechar is not actually used, and does not appear in the output file.
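Since xxd is not available in PowerShell, the bytes on disk can also be inspected from Python itself. The following is a minimal sketch, not part of the original report; the helper name hexdump is hypothetical and it only approximates the xxd layout (offset, two-byte hex groups, ASCII column).

```python
def hexdump(data: bytes) -> str:
    """Format bytes roughly the way `xxd` does, 16 bytes per line."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        digits = chunk.hex()
        # Group hex digits four at a time (two bytes per group), as xxd does.
        groups = " ".join(digits[i:i + 4] for i in range(0, len(digits), 4))
        # Non-printable bytes are shown as "." in the ASCII column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # 8 groups of 4 digits plus 7 separators is 39 columns of hex.
        lines.append(f"{offset:08x}: {groups:<39}  {text}")
    return "\n".join(lines)


# Usage (assumes null_byte.csv was written by one of the examples above):
#   from pathlib import Path
#   print(hexdump(Path("null_byte.csv").read_bytes()))
```

Reading the file in binary mode matters here, because text mode on Windows would hide the 0d 0a line endings that show up in the PowerShell output.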

Expected Behavior

I expect that if the escape character never appears in the output, I shouldn't need to pass escapechar='\\'.
I also expect that the flags for to_csv should be the same for both PowerShell and Ubuntu/Bash.

i.e. I expect that the following should just work in both PowerShell and Bash, without needing to specify escapechar:

import pandas as pd
df = pd.DataFrame({'A': ['\x00']})
df.to_csv('null_byte.csv', index=False)

Installed Versions

Pandas in Powershell:

INSTALLED VERSIONS

commit : e8093ba
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.39
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None

Pandas in Ubuntu/Bash

INSTALLED VERSIONS

commit : e8093ba
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@daviewales daviewales added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2022
@simonjayhawkins
Member

Thanks @daviewales for the report.

I'm getting the "Error: need to escape, but no escapechar set" failure on 1.4.3 on Ubuntu (WSL), which is at odds with your output.

But I get the same output as you on main, so it may not be related to the PowerShell/Bash difference.

Further investigation required. (For instance, my 1.4.3 env is Python 3.10 and my dev env (main) is Python 3.8.)

I should probably do a bisect to confirm that the difference I'm seeing is not down to a code change.

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Jul 27, 2022
@daviewales
Contributor Author

Hmm... Interesting. I just tried on my Mac, and I'm getting the Error: need to escape, but no escapechar set there too.

INSTALLED VERSIONS

commit : e8093ba
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Wed Feb 27 00:43:23 PST 2019; root:xnu-4570.71.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@simonjayhawkins
Member

On my dev machine, an Ubuntu server (I used a laptop above), I'm seeing the same as before, but my envs are set up in a similar manner, so I expect the same differences between the envs exist.

(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ python bisect/47871.py
1.4.3
Traceback (most recent call last):
  File "/home/simon/pandas/bisect/47871.py", line 9, in <module>
    df.to_csv("null_byte.csv", index=False)
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/core/generic.py", line 3551, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
    csv_formatter.save()
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 261, in save
    self._save()
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 266, in _save
    self._save_body()
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 304, in _save_body
    self._save_chunk(start_i, end_i)
  File "/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 315, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
    writer.writerows(rows[:((j + 1) % N)])
_csv.Error: need to escape, but no escapechar set
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ uname -a
Linux stadia 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ 
(pandas-1.4.3) simon@stadia:~$ python
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas;pandas.show_versions()
/home/simon/miniconda3/envs/pandas-1.4.3/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python           : 3.10.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-122-generic
Version          : #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.4.3
numpy            : 1.23.1
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 63.2.0
pip              : 22.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
>>> 
(pandas-1.4.3) simon@stadia:~$ conda deactivate
(pandas-dev) simon@stadia:~$ cd pandas
(pandas-dev) simon@stadia:~/pandas (bisect)$ python bisect/47871.py
1.5.0.dev0+1604.g83f635268d
(pandas-dev) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv 
00000000: 410a 2200 220a                           A.".".
(pandas-dev) simon@stadia:~/pandas (bisect)$ python -c "import pandas;pandas.show_versions()
> 
> "
/home/simon/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 83f635268d959a18bfad984f8ce562d38b4be22e
python           : 3.8.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-122-generic
Version          : #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.5.0.dev0+1604.g83f635268d
numpy            : 1.22.4
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 61.2.0
pip              : 22.1.2
Cython           : 0.29.30
pytest           : 7.1.2
hypothesis       : 6.47.1
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.3
lxml.etree       : 4.9.1
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 3.1.2
IPython          : 8.4.0
pandas_datareader: 0.10.0
bs4              : 4.11.1
bottleneck       : 1.3.5
brotli           : 
fastparquet      : 0.8.1
fsspec           : 2022.5.0
gcsfs            : 2022.5.0
matplotlib       : 3.3.2
numba            : 0.53.1
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : 6.0.0
pyreadstat       : 1.1.9
pyxlsb           : 1.0.9
s3fs             : 0.6.0
scipy            : 1.8.1
snappy           : 
sqlalchemy       : None
tables           : None
tabulate         : 0.8.10
xarray           : 2022.3.0
xlrd             : 2.0.1
xlwt             : 1.3.0
zstandard        : 0.18.0
(pandas-dev) simon@stadia:~/pandas (bisect)$ 

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 27, 2022
@simonjayhawkins
Member

Looks like it could be the Python version. I'm seeing "need to escape, but no escapechar set" with Python 3.10 but not with Python 3.9 or 3.8 using the same version of pandas, either 1.3.5 or 1.4.3.

(pandas-1.4.3-test) simon@stadia:~/pandas (bisect)$ python --version
Python 3.8.13
(pandas-1.4.3-test) simon@stadia:~/pandas (bisect)$ python bisect/47871.py 
1.4.3
(pandas-1.4.3-test) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv
00000000: 410a 2200 220a                           A.".".
(pandas-1.4.3-test) simon@stadia:~/pandas (bisect)$ 
(pandas-1.4.3-test) simon@stadia:~/pandas (bisect)$ conda activate pandas-1.4.3
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ python bisect/47871.py 
1.4.3
need to escape, but no escapechar set
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv
00000000: 410a                                     A.
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ python --version
Python 3.10.5
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ 
(pandas-1.4.3) simon@stadia:~/pandas (bisect)$ conda activate pandas-1.4.3-py3.9
(pandas-1.4.3-py3.9) simon@stadia:~/pandas (bisect)$ python bisect/47871.py 
1.4.3
(pandas-1.4.3-py3.9) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv
00000000: 410a 2200 220a                           A.".".
(pandas-1.4.3-py3.9) simon@stadia:~/pandas (bisect)$ python --version
Python 3.9.13
(pandas-1.4.3-py3.9) simon@stadia:~/pandas (bisect)$ 


(pandas-dev) simon@stadia:~/pandas (bisect)$ conda activate pandas-1.3.5
(pandas-1.3.5) simon@stadia:~/pandas (bisect)$ python bisect/47871.py 
1.3.5
need to escape, but no escapechar set
(pandas-1.3.5) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv
00000000: 410a                                     A.
(pandas-1.3.5) simon@stadia:~/pandas (bisect)$ python --version
Python 3.10.1
(pandas-1.3.5) simon@stadia:~/pandas (bisect)$ 
(pandas-1.3.5) simon@stadia:~/pandas (bisect)$ conda activate pandas-1.3.5-py3.9
(pandas-1.3.5-py3.9) simon@stadia:~/pandas (bisect)$ python bisect/47871.py 
1.3.5
(pandas-1.3.5-py3.9) simon@stadia:~/pandas (bisect)$ xxd null_byte.csv
00000000: 410a 2200 220a                           A.".".
(pandas-1.3.5-py3.9) simon@stadia:~/pandas (bisect)$ python --version
Python 3.9.13
(pandas-1.3.5-py3.9) simon@stadia:~/pandas (bisect)$ 
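Because the traceback earlier in the thread ends in _csv.Error raised from writer.writerows, the per-interpreter comparison above can probably be repeated without pandas at all. This is a minimal sketch under that assumption; the function name is hypothetical, and the dialect options (minimal quoting, no escapechar) are only an approximation of what pandas passes to csv.writer.

```python
import csv
import io
import sys


def write_null_byte_row():
    """Write a header and one NUL-only field with the stdlib csv.writer.

    Returns ("ok", text) on success, or ("error", message) if csv.Error
    is raised, so the outcome can be compared across interpreters.
    """
    buf = io.StringIO()
    # Assumed approximation of the pandas defaults: QUOTE_MINIMAL, no escapechar.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    try:
        writer.writerow(["A"])
        writer.writerow(["\x00"])
    except csv.Error as exc:
        return "error", str(exc)
    return "ok", buf.getvalue()


status, detail = write_null_byte_row()
print(sys.version_info[:3], status, repr(detail))
```

Running the same file under each conda environment would show whether the stdlib alone accounts for the difference, independent of the pandas version.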

@LiamNiisan

I had the same issue as well and, as @simonjayhawkins said, the issue goes away when I change Python 3.10 to 3.9 (conda downgraded pandas from 1.4.4 to 1.2.4 after I changed Python's version).

@anubhav562

I had the same issue. I tried a lot of ways to fix it:

  1. Adding the escapechar argument
  2. Using different quoting modes
  3. Replacing the delimiter

None of the above worked!

What worked was changing Python version from 3.10 to 3.9.
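To see which of these combinations the current interpreter accepts, the quoting modes and the escapechar setting can be probed against the stdlib csv module directly. This is a rough sketch, assuming the failure originates in csv.writer as the traceback in this thread suggests; probe_null_byte is a hypothetical helper, and the results are expected to differ between Python versions.

```python
import csv
import io

QUOTING_MODES = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,
    "QUOTE_ALL": csv.QUOTE_ALL,
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,
    "QUOTE_NONE": csv.QUOTE_NONE,
}


def probe_null_byte(value="\x00"):
    """Try every quoting mode with and without an escapechar.

    Returns a dict keyed by (mode name, escapechar) whose values start
    with "ok: " (followed by the repr of the output) or "error: ".
    """
    results = {}
    for name, quoting in QUOTING_MODES.items():
        for escapechar in (None, "\\"):
            buf = io.StringIO()
            writer = csv.writer(
                buf, quoting=quoting, escapechar=escapechar, lineterminator="\n"
            )
            try:
                writer.writerow([value])
                results[(name, escapechar)] = "ok: " + repr(buf.getvalue())
            except csv.Error as exc:
                results[(name, escapechar)] = "error: " + str(exc)
    return results


for key, outcome in probe_null_byte().items():
    print(key, outcome)
```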

@daviewales daviewales changed the title BUG: to_csv requires escapechar unnecessarily when data contains null byte \x00 (PowerShell only) BUG: to_csv requires escapechar unnecessarily when data contains null byte \x00 (Python 3.10+ only) Dec 3, 2022
@roib20
Contributor

roib20 commented Feb 2, 2023

Decided to try the reproducible examples with the current versions of Python 3.11.1 and pandas 1.5.3.

With those versions on Fedora 37, I am getting the exact same output using both reproducible examples:

xxd null_byte.csv
00000000: 410a 000a                                A...

However, if I revert to Python 3.10.9 (still with pandas 1.5.3), I do get the dreaded "_csv.Error: need to escape, but no escapechar set" error when not including escapechar='\\'. If I do include it, I get the same output as in 3.11.1.

So is this the expected output? If so it appears that whatever caused this issue may have been fixed in Python 3.11.1 (or possibly 3.11.0 which I didn't test). I also did not test previous (or future) versions of pandas, just 1.5.3.
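Until the behaviour is consistent across interpreters, one possible stopgap is to retry with an escapechar only when the first attempt fails. The following is a sketch, not a recommended fix: to_csv_with_fallback is a hypothetical helper, frame is assumed to be a pandas DataFrame (any object with a compatible to_csv method works), and at least one commenter above reported that setting escapechar did not help in their case.

```python
import csv


def to_csv_with_fallback(frame, path, fallback_escapechar="\\", **kwargs):
    """Call frame.to_csv(path, **kwargs); on csv.Error, retry with an escapechar.

    Returns True if the retry was needed, False otherwise.
    """
    try:
        frame.to_csv(path, **kwargs)
        return False
    except csv.Error:
        # If the caller already chose an escapechar, there is nothing left to try.
        if "escapechar" in kwargs:
            raise
        # Reported in this thread on Python 3.10:
        # "need to escape, but no escapechar set".
        frame.to_csv(path, escapechar=fallback_escapechar, **kwargs)
        return True


# Usage with pandas (assumed to be installed):
#   import pandas as pd
#   df = pd.DataFrame({"A": ["\x00"]})
#   to_csv_with_fallback(df, "null_byte.csv", index=False)
```

Catching csv.Error relies on the traceback in this thread, which shows the exception propagating out of to_csv unchanged. The failed first attempt can leave a partial file behind (only the header was written in the transcripts above), which the retry then overwrites.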

@jaibalani

+1 facing the same issue

@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Feb 7, 2024
iainsgillis added a commit to iainsgillis/socrata-py that referenced this issue Mar 14, 2024
On Python 3.10 and 3.11, an upstream bug in pandas causes a failure when serializing a dataframe to CSV when there's a null byte in the dataframe. This pull request leaves the default behaviour alone, but gives users the option to modify to_csv behaviour, including fixing that issue with the `escapechar` parameter.

See pandas-dev/pandas#47871