read_csv fails to read file if there are cyrillic symbols in filename #17773

c-fos · 2017-10-04T06:09:45Z

Code Sample, a copy-pastable example if possible

import pandas
cyrillic_filename = "./файл_1.csv"
# 'c' engine fails:
df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-9cb08141730c> in <module>()
      2 
      3 cyrillic_filename = "./файл_1.csv"
----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251")

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604 
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606 
   1607         # XXX
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)()
OSError: Initializing from file failed

# 'python' engine works:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>>172440

# 'c' engine works if the filename contains only ASCII characters
latin_filename = "./file_1.csv"
df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251")
df.size
>>172440

Problem description

The 'c' engine should be able to read files whose names contain non-ASCII characters, such as Cyrillic.

Expected Output

File content read into a DataFrame

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.2
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.0.0
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
None

gfyoung · 2017-10-04T06:35:31Z

@c-fos : Thanks for reporting this! A couple of questions:

If you change the engine to Python, does it make a difference?
Can you open the file simply by calling open(filename)?

I'm trying to figure out if this is just a general Python issue of not handling Cyrillic characters well OR if this is a pandas-specific issue.

c-fos · 2017-10-04T08:05:29Z

cyrillic_filename = "./файл_1.csv"

# 'python' engine is working:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>>172440

# simple open is working
fd = open(cyrillic_filename)
fd
>><_io.TextIOWrapper name='./файл_1.csv' mode='r' encoding='cp1251'>

gfyoung · 2017-10-04T08:49:03Z

Alright, this is indeed a pandas-specific issue with the C engine. More than welcome to track this down and submit a patch for it!
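
Since the built-in `open` succeeds on the Cyrillic path while the C engine fails, one possible workaround is to open the file in Python and pass the file object to `read_csv`, so the C engine never has to resolve the path itself. A minimal sketch, assuming pandas is installed; the temporary directory and the sample rows are made up for illustration and are not the reporter's data:

```python
import os
import tempfile

import pandas

# Hypothetical setup: write a small cp1251-encoded CSV under a Cyrillic name.
tmp_dir = tempfile.mkdtemp()
cyrillic_filename = os.path.join(tmp_dir, "файл_1.csv")
with open(cyrillic_filename, "w", encoding="cp1251", newline="") as fd:
    fd.write("имя,значение\nа,1\nб,2\n")

# Python opens the path and decodes the contents; the C engine only sees
# the already-open file object and never has to open the path itself.
with open(cyrillic_filename, encoding="cp1251", newline="") as fd:
    df = pandas.read_csv(fd, engine="c")

print(df.shape)
```

Because the handle is opened in text mode with `encoding="cp1251"`, the `encoding` argument to `read_csv` is left out here.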

gfyoung · 2017-10-04T08:52:14Z

The error is raised here:

pandas/pandas/_libs/parsers.pyx

Lines 709 to 718 in def3bce

    else:
        ptr = new_file_source(source, self.parser.chunksize)
        self.parser.cb_io = &buffer_file_bytes
        self.parser.cb_cleanup = &del_file_source
    if ptr == NULL:
        if not os.path.exists(source):
            raise compat.FileNotFoundError(
                'File %s does not exist' % source)
        raise IOError('Initializing from file failed')

The culprit function I believe is here:

pandas/pandas/_libs/src/parser/io.c

Lines 24 to 49 in def3bce

    void *new_file_source(char *fname, size_t buffer_size) {
        file_source *fs = (file_source *)malloc(sizeof(file_source));
        if (fs == NULL) {
            return NULL;
        }
        fs->fd = open(fname, O_RDONLY | O_BINARY);
        if (fs->fd == -1) {
            free(fs);
            return NULL;
        }
        // Only allocate this heap memory if we are not memory-mapping the file
        fs->buffer = (char *)malloc((buffer_size + 1) * sizeof(char));
        if (fs->buffer == NULL) {
            close(fs->fd);
            free(fs);
            return NULL;
        }
        memset(fs->buffer, '\0', buffer_size + 1);
        fs->size = buffer_size;
        return (void *)fs;
    }

What worries me is that it might have to do with the open function, in which case we might have hit a dead end (and perhaps it would no longer be a pandas-issue).
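
The error branch quoted from parsers.pyx is consistent with the message in the original traceback: the file exists as far as Python's `os.path.exists` is concerned, so the "does not exist" error is skipped and the generic one is raised. Below is a rough Python paraphrase of that branch. The function name `classify_failure` and the flag `open_failed` (standing in for `ptr == NULL`) are hypothetical and do not appear in pandas:

```python
import os
import tempfile


def classify_failure(source, open_failed):
    """Mirror the quoted branch; return the error it would raise, or None."""
    if open_failed:
        if not os.path.exists(source):
            return FileNotFoundError("File %s does not exist" % source)
        return IOError("Initializing from file failed")
    return None


with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
    # The file exists, yet the C-level open is assumed to have failed.
    err = classify_failure(tmp.name, open_failed=True)
    print(type(err).__name__, err)
```

On Python 3, `IOError` is an alias of `OSError`, which is why the reporter's traceback shows `OSError: Initializing from file failed`.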

jreback · 2017-10-04T09:58:26Z

Duplicate of #15086. There was a PR to fix this, #15092, but it was erased somehow. This was caused by a change in the default file encoding on Windows in Python 3.6; there is a PEP reference in there. To solve it, we just treat the filename as bytes and decode as UTF-8. A patch would be welcome.
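
As background on why the filename's encoding matters: PEP 529 changed Python's filesystem encoding on Windows to UTF-8 in Python 3.6, whereas a C-level `open()` that receives a `char *` path typically interprets those bytes in the process's ANSI code page. The sketch below only manipulates strings and does not touch the file system. It assumes cp1251 as the ANSI code page, which is a guess based on the reporter's locale:

```python
name = "файл_1.csv"

utf8_bytes = name.encode("utf-8")   # bytes under a UTF-8 filesystem encoding
ansi_bytes = name.encode("cp1251")  # bytes a cp1251 code page would expect

# The two byte strings differ for the Cyrillic name.
print(utf8_bytes != ansi_bytes)

# Reading the UTF-8 bytes as cp1251 yields a different, garbled name,
# so a lookup by that name would not find the file.
misread = utf8_bytes.decode("cp1251")
print(misread == name)

# A pure-ASCII name is byte-identical in both encodings.
ascii_name = "file_1.csv"
print(ascii_name.encode("utf-8") == ascii_name.encode("cp1251"))
```

This matches the behaviour in the report, where the ASCII filename loads with the C engine and the Cyrillic one does not.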

fanguoguo · 2018-08-24T05:04:06Z

File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source

fingoldo · 2019-01-12T12:35:12Z

I cannot believe this. It's 2019 now, pandas is at v0.23.4, and this issue with Cyrillic paths and the C engine is STILL NOT FIXED, even after so many issue reports and questions on Stack Overflow. Open source community seems to be no better than Microsoft in this regard, where known bugs are not getting fixed for years.

gfyoung · 2019-01-12T18:23:29Z

Open source community seems to be no better than Microsoft in this regard, where known bugs are not getting fixed for years.

@fingoldo : Sorry about this! We do get a lot of issues every day, and unlike Microsoft, we have far fewer code maintainers to work on and address all of the issues that we receive.

That being said, if you would like to tackle the issue, that would be great! Part of the issue that we have right now is that it's hard for us to test and validate any fixes, so a community contribution would be most welcome for something like this.

xref #15086 (comment)

fingoldo · 2019-01-14T07:33:52Z

@gfyoung sorry for the harsh words, and thank you for your kind reply. I personally don't know how to fix that issue, but please, if one of the devs who actually created the relevant modules sees this thread, roll out the fix; it's really embarrassing to still have that error when engine is set to C...

gfyoung · 2019-01-14T08:03:16Z

it's really embarrassing to still have that error when engine is set to C

@fingoldo : Yeah, it's awkward, no doubt, but I hope you understand that there are many more of these types of "embarrassing errors" than there are man-hours (devs and contributors combined) to correct them, especially if people have a hard time reproducing them.

Lucky for you, I currently have in my possession a Windows machine, and I was able to patch this issue pretty quickly in a PR:

#24758

fingoldo · 2019-01-14T08:46:46Z

Amazing thank you so much!!!

gfyoung added the IO CSV (read_csv, to_csv) label Oct 4, 2017

gfyoung added the Bug label Oct 4, 2017

jreback closed this as completed Oct 4, 2017

jreback added the Unicode (Unicode strings) and Windows (Windows OS) labels Oct 4, 2017

jreback added this to the No action milestone Oct 4, 2017
