
Segfault when writing data out of order to pd.HDFStore via append #10180

Closed

bastianlb opened this issue May 20, 2015 · 13 comments
Labels
IO HDF5 read_hdf, HDFStore

Comments

@bastianlb

I am trying to append chunks of data to an (initially empty) HDF5 table with pd.HDFStore. The chunks arrive out of order, and certain orderings produce segfaults. The script below consistently segfaults after loading file 31 (update: see the comments for a better self-contained example). Looking at the timestamps the script prints, the crash appears to happen when an append fills a time gap in the already-stored data. I can produce more files that trigger segfaults if necessary.

I've managed to narrow it down to the following line in pytables.py:

> /home/user/env/lib/python3.4/site-packages/pandas/io/pytables.py(3738)write_data_chunk()
   3737                 self.table.append(rows)
-> 3738                 self.table.flush()
   3739         except Exception as detail:
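
For reference, one lightweight way to localize a hard crash like this without attaching a debugger is the stdlib faulthandler module (available since Python 3.3, so it applies to the 3.4 interpreter here); it prints the executing Python stack when the process takes a fatal signal:

import faulthandler

# Dump the Python-level traceback on SIGSEGV/SIGABRT, pointing at the
# frame (here, write_data_chunk) that was running when HDF5 crashed.
faulthandler.enable()
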
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.3
Cython: 0.22
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4
matplotlib: None
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)

HDF5 Versions:
~/user$ dpkg -l | grep "hdf5"
ii  hdf5-helpers                   1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - Helper tools
ii  hdf5-tools                     1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - Runtime tools
rc  libhdf5-7:amd64                1.8.12+docs-1               amd64        Hierarchical Data Format 5 (HDF5) - runtime files - serial version
ii  libhdf5-8:amd64                1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - runtime files - serial version
rc  libhdf5-cpp-7:amd64            1.8.12+docs-1               amd64        Hierarchical Data Format 5 (HDF5) - C++ libraries
ii  libhdf5-cpp-8:amd64            1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - C++ libraries
ii  libhdf5-dev                    1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - development files - serial version
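
(Aside: the same version details can be captured from Python directly; tables.hdf5_version reports the HDF5 C library PyTables was compiled against.)

import pandas as pd
import tables

print(pd.__version__)       # pandas release (0.16.1 here)
print(tables.__version__)   # PyTables release (3.2.0 here)
print(tables.hdf5_version)  # HDF5 C library PyTables was built against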

A script to reproduce:

import os
import pandas as pd

_dir = 'test_files'
_file = './hdf5_test.h5'


def write_to_file(series):
    store = pd.HDFStore(_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()


if __name__ == "__main__":
    if os.path.isfile(_file):
        os.remove(_file)
    files = os.listdir(_dir)
    for f in files:
        series = pd.read_pickle(os.path.join(_dir, f))
        print("Writing data: %s" % f)
        write_to_file(series)

Zipped pickled files for the test: http://s000.tinyupload.com/?file_id=60238823358379433453

Here is a pastebin of the data where the segfault occurs, exported from the example as CSV:
http://pastebin.com/FRsygCUG
Note: you may need the actual pickle files to reproduce this, but as the pastebin shows, the data itself isn't malformed.

The append that segfaults is the one that fills the following gap in the original data:
2011-01-04 17:55:00 to 2011-01-05 22:15:00

Script output:

Writing data: 00
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Writing data: 01
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Writing data: 02
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Segmentation fault
@jreback
Contributor

jreback commented May 20, 2015

So you need to show how you are reading in the data, and what df.info() looks like.

Your data has some bad rows in it and is thus being read in as object. I converted it and it works just fine.

When storing in HDF5 you need to be especially cognizant of dtypes. Storage will be non-performant with the wrong dtypes (and you may not even be able to store the data at all).

In [34]: df = pd.read_csv('/Users/jreback/Downloads/FRsygCUG.txt',header=0,skiprows=2)

In [35]: df.dtypes
Out[35]: 
Unnamed: 0    object
normal        object
dtype: object

In [36]: df.columns=['date','value']

In [37]: df['date'] = pd.to_datetime(df['date'],coerce=True)

In [38]: df['value'] = df['value'].convert_objects(convert_numeric=True)

In [39]: df.dtypes
Out[39]: 
date     datetime64[ns]
value           float64
dtype: object

In [40]: store = pd.HDFStore('test.h5',mode='w')

In [41]: for dfi in np.array_split(df,20):
   ....:     store.append('df',dfi)
   ....:

In [42]: store.close()

In [43]: pd.read_hdf('test.h5','df').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1025 entries, 0 to 1024
Data columns (total 2 columns):
date     1020 non-null datetime64[ns]
value    1020 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 24.0 KB
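
(Aside for later readers, not part of the original session: convert_objects was subsequently deprecated; the equivalent coercions in newer pandas are pd.to_datetime and pd.to_numeric with errors='coerce'.)

# Equivalent coercion on newer pandas (same effect as In [37]-[38] above):
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')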

@jreback
Contributor

jreback commented May 20, 2015

@BastianL this also does not segfault for me (on macosx).

@jreback jreback added the IO HDF5 read_hdf, HDFStore label May 20, 2015
@bastianlb
Author

I'm pretty sure the order in which the appends occur matters here. If you just write the pastebin data as CSV in one go, you won't see the error. I saw something similar when I coerced the 3 files into a single dataframe in memory (or used store.overwrite) before appending. Also, I think the HDFStore needs to be opened in 'a' mode, or it will simply overwrite your data every time (see the sketch after the output below).

Writing data: 00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00

Writing data: 01
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00

Writing data: 02
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Segmentation fault

@jreback going to try on mac os x when I get a chance
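
A minimal sketch of the mode point above (hypothetical file and key names; this is standard HDFStore behavior, not specific to this bug):

import pandas as pd

df = pd.DataFrame({'x': range(3)})

# mode='w' truncates the file on open, discarding any previously stored keys:
with pd.HDFStore('demo.h5', mode='w') as store:
    store.append('test', df)

# mode='a' reopens the existing file, so append() keeps growing the same table:
with pd.HDFStore('demo.h5', mode='a') as store:
    store.append('test', df)

print(len(pd.read_hdf('demo.h5', 'test')))  # 6: both appends survived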

@jreback
Contributor

jreback commented May 20, 2015

@BastianL I opened the file in write mode only at the beginning. The appends happened in order. In any event, pls post your complete code.

@bastianlb
Author

I seem to have the same problem on OS X. A dialog pops up warning that python3 quit unexpectedly.

Writing data: 00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00

Writing data: 01
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00

Writing data: 02
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
[1]    55971 abort      python3.4 test_hdf.py

@bastianlb
Author

Here is the example code:

import os
import pandas as pd

_dir = 'test_files'
_file = './hdf5_test.h5'


def write_to_file(series):
    store = pd.HDFStore(_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()


if __name__ == "__main__":
    if os.path.isfile(_file):
        os.remove(_file)
    files = os.listdir(_dir)
    for f in files:
        series = pd.read_pickle(os.path.join(_dir, f))
        print("Writing data: %s" % f)
        write_to_file(series)

You'll need to unzip these pickled files:
http://s000.tinyupload.com/?file_id=60238823358379433453

@jreback
Contributor

jreback commented May 21, 2015

@BastianL ok, so it turns out this only core dumps on PyTables 3.2 (I have libhdf5 1.8.14).

Not really sure why. So the workaround is to use PyTables 3.1.1, which seems to work fine.

You can also report this to the PyTables issue tracker. This must be another edge case. Very odd.
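
Until a fixed release is out, a defensive check at import time is one option (a sketch; the equality test assumes only the 3.2.0 release is affected, per this thread):

import tables

# PyTables 3.2.0 is the release that segfaults on these out-of-order
# appends (per this thread); 3.1.1 works, and 3.2.1 carries the fix.
if tables.__version__ == '3.2.0':
    raise RuntimeError(
        "PyTables 3.2.0 can segfault on out-of-order HDFStore appends; "
        "pin tables==3.1.1 or upgrade once a fixed release is available.")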

@bastianlb
Author

ok, I can confirm that this works on 3.1.1 and not on 3.2.0. Here is a better self-contained example, for reference.

import os
import pandas as pd

output_file = './hdf5_test.h5'

def write_to_file(series):
    # Open in append mode so earlier chunks are preserved.
    store = pd.HDFStore(output_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()

def run():
    if os.path.isfile(output_file):
        os.remove(output_file)
    vals = range(340)
    # First chunk: 2011-01-03 13:40 through 2011-01-04 17:55.
    s1 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-03 13:40:00',
        end='2011-01-04 17:55:00')
    )
    write_to_file(s1)
    # Second chunk skips ahead, leaving a gap after s1.
    s2 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-05 22:20:00',
        end='2011-01-07 02:35:00')
    )
    write_to_file(s2)
    # Third chunk fills the gap between s1 and s2; this out-of-order
    # append is the one that segfaults on PyTables 3.2.0.
    s3 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-04 18:00:00',
        end='2011-01-05 22:15:00')
    )
    write_to_file(s3)

if __name__ == "__main__":
    run()
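
For reference, a quick read-back check after running the repro (assuming a PyTables build without the bug, e.g. 3.1.1): append stores rows in write order, not time order, so the index comes back non-monotonic until sorted.

df = pd.read_hdf(output_file, 'test')
print(len(df))                           # 1020 rows: three appends of 340
print(df.index.is_monotonic_increasing)  # False: rows are in write order
df = df.sort_index()                     # restore chronological order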

@andreabedini
Contributor

This should be fixed in PyTables/PyTables@5e2a63b. I will make a bug fix release soon-ish.

@jreback
Contributor

jreback commented Aug 4, 2015

ok, looks like 3.2.1 was just released, so pls test it out

@andreabedini
Contributor

thanks @jreback, I forgot to mention it here

@bastianlb
Author

@jreback I realized I never circled back to this, but PyTables/PyTables@5e2a63b fixes it. Thanks for helping get to the bottom of the issue.

@jreback
Contributor

jreback commented Nov 24, 2015

gr8!

@jreback jreback closed this as completed Nov 24, 2015