BUG: read_csv does not take encoding into account for (e.g.) Django FieldFile #45488

Open
3 tasks done
vanschelven opened this issue Jan 20, 2022 · 2 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@vanschelven
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# models.py
from django.db import models


class MyModel(models.Model):
    f = models.FileField()

# views.py
from io import StringIO
from django.http import HttpResponse

import tempfile
from pandas import read_csv

from .models import MyModel


def try_it(path_or_filelike_object):
    try:
        print(read_csv(
            path_or_filelike_object,
            encoding="utf-16le",
            sep="|",
            ))
    except Exception as e:
        print(e)


def failme(request):
    utf_16le_encoded = b'\xff\xfef\x00o\x00o\x00|\x00b\x00a\x00r\x00\r\x00\n\x000\x00|\x001\x00\r\x00\n\x00'
    print(utf_16le_encoded.decode("utf-16le"))

    f = tempfile.NamedTemporaryFile(mode='w+b')
    f.write(utf_16le_encoded)
    f.flush()
    f.seek(0)

    print("\n## Just use path")
    try_it(f.name)

    print("\n## A python file object")
    try_it(f)

    MyModel.objects.all().delete()
    mm = MyModel.objects.create(f="filename")
    with mm.f.open('wb+') as destination:
        destination.write(utf_16le_encoded)

    print("\n## A Django FieldFile")
    try_it(mm.f.open('rb'))

    print("\n## A Django FieldFile, wrapped in StringIO")
    try_it(StringIO(mm.f.open('rb').read().decode("utf-16le")))

    return HttpResponse("look in your console")

What is printed:
foo|bar
0|1


## Just use path
   foo  bar
0    0    1

## A python file object
   foo  bar
0    0    1

## A Django FieldFile
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

## A Django FieldFile, wrapped in StringIO
   foo  bar
0    0    1

Issue Description

read_csv does not always take the provided encoding into account for file-like objects.
An example is given above for a Django FieldFile (FileField), but I suspect the issue is more general.
A small utf-16le file is fed to read_csv in four ways: via a path, a regular Python file object, a Django FieldFile, and a StringIO. It fails with an encoding error in the third case only.

This seems similar to #31819, although that bug is reportedly fixed.

Expected Behavior

read_csv should always take the provided encoding into account correctly.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.8-76051508-generic
Version : #202112141040163950527821.10~0ede46a SMP Tue Dec 14 22:38:29 U
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@vanschelven added the Bug and Needs Triage labels on Jan 20, 2022
@vanschelven
Author

The difference between Django's FieldFile (case 3, failing) and a Python temporary file (case 2, succeeding) seems to be that the former has no mode attribute, whereas the latter has one that marks it as binary ("rb+" in the example code, since the file was opened with mode 'w+b'). Indeed, monkey-patching like so fixes the problem:

    print("\n## A Django FieldFile")
    django_thing = mm.f.open('rb')
    django_thing.mode = 'rb'  # <= monkey patch
    try_it(django_thing)

I'll leave it up to others to decide whether this is a problem with Django or with Pandas, but if it is decided that this is a problem with Django, I suggest that the following piece of documentation is incorrect:

By file-like object, we refer to objects with a read() method, such as a file handle

since there is also a dependency on mode having been set correctly.
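
To illustrate the dependency described above, here is a minimal sketch that uses only the standard library. Both `ModelessBinaryFile` and `reports_binary_mode` are hypothetical names made up for this illustration; the check is an assumption about the kind of mode-based detection involved, not a copy of pandas internals.

```python
import io
import tempfile


class ModelessBinaryFile:
    """Hypothetical stand-in for a binary handle that has read() but no .mode."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)


def reports_binary_mode(handle) -> bool:
    """Hypothetical check: True only if the handle advertises 'b' in a .mode attribute."""
    return "b" in getattr(handle, "mode", "")


# A regular temporary file opened in binary mode exposes a .mode containing "b".
with tempfile.TemporaryFile(mode="w+b") as regular_file:
    regular_has_binary_mode = reports_binary_mode(regular_file)

# The stand-in returns bytes from read() too, but it has nothing to advertise that.
modeless_has_binary_mode = reports_binary_mode(ModelessBinaryFile(b"\xff\xfe"))

print(regular_has_binary_mode, modeless_has_binary_mode)
```

Any consumer that relies on a check of this shape would treat the second handle as text, even though its read() returns bytes, which is consistent with the monkey patch making the failure go away.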

@twoertwein
Member

We need some way to tell whether a file object is opened in binary mode or in text mode. Until 1.2(?), read_csv did not even support binary file handles (it seems your file handle is opened in binary mode). Adding a new keyword argument mode to read_csv would probably be a good solution: if the file object has no .mode, the user might need to provide the mode as an argument (to_csv already has this).

If you can open your file handle in text mode, that might be a workaround.
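
When the handle cannot be reopened in text mode, one possible variant of this workaround is to decode it before parsing. The sketch below is an assumption-laden illustration, not a confirmed pandas recommendation: `ModelessBinaryFile` is a hypothetical stand-in for a handle like the one in the report, and the standard-library `csv` module is used for parsing so that the example runs without pandas or Django.

```python
import codecs
import csv
import io


class ModelessBinaryFile:
    """Hypothetical stand-in for a binary handle that has read() but no .mode."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)


# Same bytes as in the report: a UTF-16 BOM followed by little-endian text.
raw = (
    b"\xff\xfef\x00o\x00o\x00|\x00b\x00a\x00r\x00\r\x00\n\x00"
    b"0\x00|\x001\x00\r\x00\n\x00"
)

# codecs.getreader only needs a read() method on the wrapped object.
# The "utf-16" codec uses the BOM to pick the byte order and then drops it.
text_handle = codecs.getreader("utf-16")(ModelessBinaryFile(raw))
text = text_handle.read()

# Parse the decoded text. With pandas, the corresponding call would be
# read_csv(io.StringIO(text), sep="|"), as in the fourth case of the report.
rows = list(csv.reader(io.StringIO(text), delimiter="|"))
print(rows)
```

Decoding up front reads the whole file into memory, so this suits small files like the one in the report better than large ones.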

@twoertwein added the IO CSV label and removed the Needs Triage label on Jan 20, 2022