read_hdf crash python process when use it in multithread code #14263

Closed
xmseraph opened this issue Sep 20, 2016 · 6 comments
Labels: Duplicate Report (duplicate issue or pull request), IO HDF5 (read_hdf, HDFStore)

Comments

@xmseraph

xmseraph commented Sep 20, 2016

One of my folders contains multiple .h5 files. I tried to load them into DataFrames and then concatenate them into one.

The Python process crashes when num_tasks > 1. If I debug thread by thread, it works; in other words, it crashes only when two threads run at the same time, even though they read different files.

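from multiprocessing.pool import ThreadPool
import pandas as pd

num_tasks = 2

def readjob(x):
    # Read the table stored under the key "df" from one HDF5 file.
    path = x
    return pd.read_hdf(path, "df", mode='r')

# files: list of absolute paths to the .h5 files (not defined in this snippet)
pool = ThreadPool(num_tasks)
results = pool.map(readjob, files)
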
@TomAugspurger
Contributor

TomAugspurger commented Sep 20, 2016

Could you make a reproducible example? files is undefined. Also, please post the output of pd.show_versions().

@xmseraph
Author

xmseraph commented Sep 20, 2016

files is a list of strings containing the absolute paths of the .h5 files. You will need code like this:

from os import listdir
from os.path import isfile, join

# Renamed from `dir` to avoid shadowing the built-in dir().
h5_dir = 'where i store the h5 files'
files = [join(h5_dir, f) for f in listdir(h5_dir) if isfile(join(h5_dir, f))]

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 26.1.1
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@TomAugspurger
Contributor

Unfortunately, that still won't work for me since the directory 'where i store the h5 files' isn't on my computer. Can you make a small script to generate the HDF files needed? They shouldn't need to be large or that numerous.

@xmseraph
Author

xmseraph commented Sep 20, 2016

@TomAugspurger thank you for your reply. Actually, if I hadn't written code to generate small files for you, I wouldn't have noticed this problem with how I created the H5 files.

import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
path1 = 'test1.hdf'
files=[path,path1]
num_rows = 100000
num_tasks = 2

def make_df(num_rows=10000):

    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
# Note the difference: path1 is written with mode='a' instead of mode='w'.
df.to_hdf(path1, 'df', complib='zlib', complevel=9, append=False, mode='a', format='t')

def readjob(x):
    path = x
    return pd.read_hdf(path,"df",mode='r')

pool = ThreadPool(num_tasks)
results = pool.map(readjob,files)
print(results)

When I write to path1 with mode='a' (append), the code crashes as soon as the pool starts.
But if I write to path1 with mode='w', the code works.
Is this weird?

@jreback
Contributor

jreback commented Sep 20, 2016

duplicate of #12236

jreback closed this as completed Sep 20, 2016
jreback added the Duplicate Report and IO HDF5 labels Sep 20, 2016
jreback added this to the No action milestone Sep 20, 2016
@xmseraph
Author

The mode parameter doesn't fix the problem. After testing the code more times, I found that whether it runs through is just random.
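
A random failure that only appears when two threads read at the same time points toward a race between concurrent reads. The thread does not confirm the root cause or a fix, so the following is only a minimal sketch of one possible workaround: serializing the read_hdf calls behind a lock. The names _hdf_lock, serialized and read_all are hypothetical helpers introduced for illustration; they are not part of pandas.

```python
import functools
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical module-level lock shared by every reader thread.
_hdf_lock = threading.Lock()


def serialized(func):
    """Wrap func so that only one thread at a time can execute it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _hdf_lock:
            return func(*args, **kwargs)
    return wrapper


@serialized
def readjob(path):
    # Imported here so the helpers above carry no pandas dependency.
    import pandas as pd
    return pd.read_hdf(path, "df", mode="r")


def read_all(files, num_tasks=2, job=readjob):
    """Map job over files with a thread pool; results keep the input order."""
    pool = ThreadPool(num_tasks)
    try:
        return pool.map(job, files)
    finally:
        pool.close()
        pool.join()
```

With the files list from the script above, results = read_all(files, num_tasks) would replace the direct pool.map call. Because the lock lets only one read run at a time, the threads gain no parallelism for the HDF5 read itself; the sketch trades speed for avoiding concurrent reads, and whether that is enough to stop the crash is an assumption, not something tested in this issue.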
