read_csv from Google Cloud Storage ignores encoding #32392

EgorBEremeev · 2020-03-02T09:35:03Z

Code Sample, a copy-pastable example if possible

    import pandas as pd

    dataframe = pd.read_csv('gs://mybucket/my_file', encoding='cp1251')

Problem description

Reading CSV files whose encoding is not UTF-8 (for example cp1251) from Google Cloud Storage fails with the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

the stacktrace from the Google Cloud Function environment:

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 60, in load_csv_to_bq
        na_filter=False)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
        parser = TextFileReader(fp_or_buf, **kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
        self._make_engine(self.engine)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
      File "pandas/_libs/parsers.pyx", line 748, in pandas._libs.parsers.TextReader._get_header
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

It looks like pandas ignores the encoding parameter because pandas.io.gcs.get_filepath_or_buffer passes mode='rb' to GCSFileSystem.open(filepath_or_buffer, mode).

Tracing back to where the mode parameter is first set, we arrive at this signature in pandas/io/common.py:

    def get_filepath_or_buffer(
        filepath_or_buffer, encoding=None, compression=None, mode=None
    )

The default mode=None takes effect because the call to get_filepath_or_buffer() made here

pandas/pandas/io/parsers.py

Lines 430 to 432 in 29d6b02

    fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
        filepath_or_buffer, encoding, compression
    )

does not pass a value for mode.

However, in the current gcsfs master, GCSFileSystem.open() has been removed and fsspec.AbstractFileSystem.open() is used instead, which now applies the passed encoding for text reading and writing:

        if "b" not in mode:
            mode = mode.replace("t", "") + "b"

            text_kwargs = {
                k: kwargs.pop(k)
                for k in ["encoding", "errors", "newline"]
                if k in kwargs
            }
            return io.TextIOWrapper(
                self.open(path, mode, block_size, **kwargs), **text_kwargs
            )
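To make the effect of that branch concrete, here is a minimal sketch that does not touch GCS at all. The helper `open_stream` and the sample data are made up for illustration; the helper only imitates the text-mode branch quoted above using an in-memory `io.BytesIO`, and is not the fsspec or pandas implementation.

```python
import io


def open_stream(data: bytes, mode: str = "rb", **kwargs):
    """Hypothetical stand-in for a filesystem open() call.

    Text modes wrap the binary stream in a TextIOWrapper and forward
    encoding/errors/newline; binary modes return raw bytes and, in this
    sketch, silently ignore any encoding that was passed.
    """
    binary = io.BytesIO(data)
    if "b" not in mode:
        text_kwargs = {
            k: kwargs.pop(k)
            for k in ["encoding", "errors", "newline"]
            if k in kwargs
        }
        return io.TextIOWrapper(binary, **text_kwargs)
    return binary


# Made-up sample: a small CSV with Cyrillic text, stored as cp1251.
expected = "Дата,Город\n2020-03-02,Москва\n"
raw = expected.encode("cp1251")

# Binary mode: the encoding argument has no effect, the caller gets bytes.
binary_result = open_stream(raw, "rb", encoding="cp1251").read()

# Text mode: the encoding is applied by the wrapper.
text_result = open_stream(raw, "r", encoding="cp1251").read()

# Decoding the same bytes as UTF-8 fails, which is the symptom reported here.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

With mode 'rb' the bytes come back undecoded, so any later attempt to read them as UTF-8 raises UnicodeDecodeError. With mode 'r' the cp1251 text is decoded correctly.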

Expected Output

The encoding value passed to pd.read_csv() is applied when reading from GCS, and the CSV files are read successfully.

My suggestion is that read_csv() should pass mode='r' and to_csv() (see #26124) should pass mode='w' in the call to get_filepath_or_buffer(). However, I'm not sure where in the code this change is best implemented.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.13.1
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


TomAugspurger · 2020-03-02T20:13:24Z

Would adding GCSFile (or perhaps fsspec.file.AbstractFile) to

pandas/pandas/io/common.py

Lines 377 to 382 in f25ed6f

    
    try:
        from s3fs import S3File
        need_text_wrapping = (BufferedIOBase, RawIOBase, S3File)
    except ImportError:
        need_text_wrapping = (BufferedIOBase, RawIOBase)

help?

EgorBEremeev · 2020-03-02T23:27:36Z

Hi, @TomAugspurger
As I understand the code below

pandas/pandas/io/common.py

Lines 453 to 457 in 78c1a74

    
    # Convert BytesIO or file objects passed with an encoding
    if is_text and (compression or isinstance(f, need_text_wrapping)):
        from io import TextIOWrapper
        g = TextIOWrapper(f, encoding=encoding, newline="")

the intent is to check whether an encoding was passed and whether path_or_buf is an instance of one of the types in need_text_wrapping. If so, it is wrapped with TextIOWrapper().

I think adding GCSFile to need_text_wrapping is the right approach.

I'm just confused, because I don't see where get_handle() is called during read_csv().
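As a rough illustration of why adding a file type to need_text_wrapping would matter, the sketch below reduces the isinstance check to its core. `HypotheticalRemoteFile` and `wrap_if_needed` are made-up names, the compression branch is left out, and nothing here is the actual pandas or gcsfs code. The remote file class is an assumption standing in for something like GCSFile.

```python
import io
from io import BufferedIOBase, RawIOBase, TextIOWrapper


class HypotheticalRemoteFile(io.IOBase):
    """Made-up stand-in for a remote binary file object.

    It derives from IOBase only, so it is neither a BufferedIOBase
    nor a RawIOBase and does not match the default tuple.
    """

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def read(self, size=-1):
        return self._buffer.read(size)


def wrap_if_needed(f, encoding, need_text_wrapping):
    """Simplified form of the check: wrap only the listed types."""
    if isinstance(f, need_text_wrapping):
        return TextIOWrapper(f, encoding=encoding, newline="")
    return f


expected = "Дата\nМосква\n"
raw = expected.encode("cp1251")

# Default tuple: the remote file is returned unwrapped and yields bytes.
default_types = (BufferedIOBase, RawIOBase)
unwrapped_data = wrap_if_needed(
    HypotheticalRemoteFile(raw), "cp1251", default_types
).read()

# Extended tuple: the remote file is wrapped and yields decoded text.
extended_types = (BufferedIOBase, RawIOBase, HypotheticalRemoteFile)
wrapped_text = wrap_if_needed(
    HypotheticalRemoteFile(raw), "cp1251", extended_types
).read()
```

In this sketch the only difference between the two calls is the tuple, which is the change proposed above. Whether read_csv() actually reaches this check is the open question in this comment.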

TomAugspurger · 2020-03-03T15:19:21Z

I'm not sure either.


EgorBEremeev mentioned this issue Mar 2, 2020

pandas 1.0.1 read_csv() is broken for some file-like objects #31819

Closed

jbrockmendel added the IO CSV read_csv, to_csv label Jun 5, 2020

twoertwein mentioned this issue Aug 13, 2020

BUG/ENH: compression for google cloud storage in to_csv #35681

Merged

5 tasks

jreback added this to the 1.2 milestone Aug 14, 2020

jreback closed this as completed in #35681 Sep 3, 2020
