read_csv from Google Cloud Storage ignores encoding #32392
Comments
Would adding GCSFile (or perhaps fsspec.file.AbstractFile) to the list at Lines 377 to 382 in f25ed6f work?
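Hi, @TomAugspurger
As I understand the code at Lines 453 to 457 in 78c1a74, it is planned to check whether encoding is passed and path_or_buf is an instance of one of the types listed in need_text_wrapping. If so, it is wrapped with TextIOWrapper(). I think adding GCSFile to need_text_wrapping is the right approach. I am just confused, because I do not see where get_handle() is called during read_csv().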
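For illustration, the check described in this comment could be sketched roughly as below. This is not the pandas source: `FakeGCSFile`, `NEED_TEXT_WRAPPING` and `maybe_wrap_text` are hypothetical names, and an in-memory buffer stands in for a real GCS file object.

```python
import io


class FakeGCSFile(io.BytesIO):
    """Hypothetical stand-in for a binary GCS file object (not the real gcsfs class)."""


# Hypothetical analogue of need_text_wrapping, with the GCS-like type added.
NEED_TEXT_WRAPPING = (FakeGCSFile,)


def maybe_wrap_text(handle, encoding=None):
    """Wrap a binary handle in TextIOWrapper when an encoding is passed."""
    if encoding is not None and isinstance(handle, NEED_TEXT_WRAPPING):
        return io.TextIOWrapper(handle, encoding=encoding, newline="")
    return handle


# cp1251-encoded CSV bytes, as in the report.
raw = "имя,город\nЕгор,Москва\n".encode("cp1251")

text_handle = maybe_wrap_text(FakeGCSFile(raw), encoding="cp1251")
first_line = text_handle.readline()

# Without an encoding the handle is returned unchanged and still yields bytes.
binary_handle = maybe_wrap_text(FakeGCSFile(raw))
first_bytes = binary_handle.readline()
```

With the type present in the tuple, the text handle yields decoded `str` lines. Without an encoding, the caller still receives raw bytes.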
I'm not sure either.
On Mon, Mar 2, 2020 at 5:27 PM Egor Eremeev wrote:
Hi, @TomAugspurger
As I understand the code below, it is planned to check whether encoding is passed and path_or_buf is an instance of one of the types listed in need_text_wrapping. If so, it is wrapped with TextIOWrapper(). I think adding GCSFile to need_text_wrapping is the right approach. I am just confused, because I do not see where get_handle() is called during read_csv().
Code Sample, a copy-pastable example if possible
Problem description
Reading CSV files that have an encoding other than UTF-8, such as cp1251, from Google Cloud Storage fails with an error:
The stack trace from the Google Cloud Functions environment:
It looks like pandas ignores the encoding parameter because, in pandas.io.gcs.get_filepath_or_buffer, mode = 'rb' is passed to the call of GCSFileSystem.open(filepath_or_buffer, mode).
Tracing back to the point where the mode parameter is first actually set, we stop in pandas.io.common.py, because the call of get_filepath_or_buffer() performed from pandas/pandas/io/parsers.py (Lines 430 to 432 in 29d6b02) does not pass a value for mode, so the default mode=None applies.
But in the current gcsfs master, GCSFileSystem.open() has been removed and fsspec.AbstractFileSystem.open() is used instead, where applying the passed encoding for text reading and writing is now implemented.
Expected Output
The encoding value passed into pd.read_csv() is applied while reading from GCS, and the CSV files are read successfully.
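A minimal, standard-library-only sketch of the expected behaviour, assuming in-memory bytes stand in for a cp1251-encoded object in GCS (the payload and variable names are illustrative, not taken from the report):

```python
import csv
import io

# Assumption: these bytes stand in for a cp1251-encoded CSV file stored in GCS.
payload = "id,name\n1,Привет\n".encode("cp1251")

# Decoding with the default UTF-8 codec fails for these bytes.
try:
    payload.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the caller passed succeeds.
with io.TextIOWrapper(io.BytesIO(payload), encoding="cp1251", newline="") as handle:
    rows = list(csv.reader(handle))
```

The same bytes are unreadable as UTF-8 but parse cleanly once the caller's encoding is applied to the binary stream.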
My suggestion is that for read_csv() we need to pass mode=r, and for to_csv() (see #26124) we need to pass mode=w, in the call of get_filepath_or_buffer(). But I'm not sure where in the code it is best to implement this change.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.13.1
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None