
Segfault when writing data out of order to pd.HDFStore via append #10180

Closed

bastianlb opened this issue May 20, 2015 · 13 comments
Labels
IO HDF5 read_hdf, HDFStore

Comments

@bastianlb

I am trying to append chunks of data to an (initially empty) HDF5 table with pd.HDFStore. The chunks arrive out of order, and certain orderings produce segfaults. The script below consistently segfaults after loading file 31 (update: see the comments for a better self-contained example). Looking at the timestamps the script prints, the crash appears to happen when an append fills a time gap in the already-stored data. I can produce more files that trigger segfaults if necessary.

I've managed to narrow it down to the following line in pytables.py:

> /home/user/env/lib/python3.4/site-packages/pandas/io/pytables.py(3738)write_data_chunk()
   3737                 self.table.append(rows)
-> 3738                 self.table.flush()
   3739         except Exception as detail:
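
For reference, one lightweight way to localize a hard crash like this without attaching a debugger is the stdlib faulthandler module (available since Python 3.3, so it applies to the 3.4 interpreter here); it prints the executing Python stack when the process takes a fatal signal:

import faulthandler

# Dump the Python-level traceback on SIGSEGV/SIGABRT, pointing at the
# frame (here, write_data_chunk) that was running when HDF5 crashed.
faulthandler.enable()
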
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.3
Cython: 0.22
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4
matplotlib: None
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)

HDF5 Versions:
~/user$ dpkg -l | grep "hdf5"
ii  hdf5-helpers                   1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - Helper tools
ii  hdf5-tools                     1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - Runtime tools
rc  libhdf5-7:amd64                1.8.12+docs-1               amd64        Hierarchical Data Format 5 (HDF5) - runtime files - serial version
ii  libhdf5-8:amd64                1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - runtime files - serial version
rc  libhdf5-cpp-7:amd64            1.8.12+docs-1               amd64        Hierarchical Data Format 5 (HDF5) - C++ libraries
ii  libhdf5-cpp-8:amd64            1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - C++ libraries
ii  libhdf5-dev                    1.8.13+docs-15              amd64        Hierarchical Data Format 5 (HDF5) - development files - serial version
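
(Aside: the same version details can be captured from Python directly; tables.hdf5_version reports the HDF5 C library PyTables was compiled against.)

import pandas as pd
import tables

print(pd.__version__)       # pandas release (0.16.1 here)
print(tables.__version__)   # PyTables release (3.2.0 here)
print(tables.hdf5_version)  # HDF5 C library PyTables was built against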

A script to reproduce:

import os
import pandas as pd

_dir = 'test_files'
_file = './hdf5_test.h5'


def write_to_file(series):
    store = pd.HDFStore(_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()


if __name__ == "__main__":
    if os.path.isfile(_file):
        os.remove(_file)
    files = os.listdir(_dir)
    for f in files:
        series = pd.read_pickle(os.path.join(_dir, f))
        print("Writing data: %s" % f)
        write_to_file(series)

Zipped pickled files for the test: http://s000.tinyupload.com/?file_id=60238823358379433453

Here is a pastebin of the data where the segfault occurs, exported from the example as CSV:
http://pastebin.com/FRsygCUG
Note: you may need the actual pickle files to reproduce this, but as the pastebin shows, the data itself isn't malformed.

The append that segfaults is the one that fills the following gap in the original data:
2011-01-04 17:55:00 to 2011-01-05 22:15:00

Script output:

Writing data: 00
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Writing data: 01
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Writing data: 02
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Segmentation fault
@jreback
Contributor

jreback commented May 20, 2015

So you need to show how you are reading in the data, and what df.info() looks like.

Your data has some bad rows in it and is thus being read in as object. I converted it and it works just fine.

When storing in HDF5 you need to be especially cognizant of dtypes. Storage will be non-performant with the wrong dtypes (and you may not even be able to store the data at all).

In [34]: df = pd.read_csv('/Users/jreback/Downloads/FRsygCUG.txt',header=0,skiprows=2)

In [35]: df.dtypes
Out[35]: 
Unnamed: 0    object
normal        object
dtype: object

In [36]: df.columns=['date','value']

In [37]: df['date'] = pd.to_datetime(df['date'],coerce=True)

In [38]: df['value'] = df['value'].convert_objects(convert_numeric=True)

In [39]: df.dtypes
Out[39]: 
date     datetime64[ns]
value           float64
dtype: object

In [40]: store = pd.HDFStore('test.h5',mode='w')

In [41]: for dfi in np.array_split(df,20):
   ....:     store.append('df',dfi)
   ....:

In [42]: store.close()

In [43]: pd.read_hdf('test.h5','df').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1025 entries, 0 to 1024
Data columns (total 2 columns):
date     1020 non-null datetime64[ns]
value    1020 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 24.0 KB
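
(Aside for later readers, not part of the original session: convert_objects was subsequently deprecated; the equivalent coercions in newer pandas are pd.to_datetime and pd.to_numeric with errors='coerce'.)

# Equivalent coercion on newer pandas (same effect as In [37]-[38] above):
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')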

@jreback
Contributor

jreback commented May 20, 2015

@BastianL this also does not segfault for me (on macosx).

@jreback jreback added the IO HDF5 read_hdf, HDFStore label May 20, 2015
@bastianlb
Author

I'm pretty sure the order in which the appends occur matters here. If you just write the pastebin data as CSV in one go, you won't see the error. I saw something similar when I coerced the 3 files into a single dataframe in memory (or used store.overwrite) before appending. Also, I think the HDFStore needs to be opened in 'a' mode, or it will simply overwrite your data every time (see the sketch after the output below).

Writing data: 00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00

Writing data: 01
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00

Writing data: 02
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Segmentation fault

@jreback going to try on mac os x when I get a chance
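
A minimal sketch of the mode point above (hypothetical file and key names; this is standard HDFStore behavior, not specific to this bug):

import pandas as pd

df = pd.DataFrame({'x': range(3)})

# mode='w' truncates the file on open, discarding any previously stored keys:
with pd.HDFStore('demo.h5', mode='w') as store:
    store.append('test', df)

# mode='a' reopens the existing file, so append() keeps growing the same table:
with pd.HDFStore('demo.h5', mode='a') as store:
    store.append('test', df)

print(len(pd.read_hdf('demo.h5', 'test')))  # 6: both appends survived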

@jreback
Contributor

jreback commented May 20, 2015

@BastianL I opened the file in write mode only at the beginning. The appends happened in order. In any event, pls post your complete code.

@bastianlb
Author

I seem to have the same problem on OS X. A dialog pops up warning that python3 quit unexpectedly.

Writing data: 00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-03 13:40:00 to 2011-01-04 17:55:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-03 13:40:00 to 2011-01-04 17:55:00

Writing data: 01
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-05 22:15:00 to 2011-01-07 02:30:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-05 22:15:00 to 2011-01-07 02:30:00

Writing data: 02
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 340 entries, 2011-01-04 18:00:00 to 2011-01-05 22:15:00
Freq: 5T
Data columns (total 1 columns):
normal    340 non-null float64
dtypes: float64(1)
memory usage: 5.3 KB
Appending data from 2011-01-04 18:00:00 to 2011-01-05 22:15:00
[1]    55971 abort      python3.4 test_hdf.py

@bastianlb
Author

Here is the example code:

import os
import pandas as pd

_dir = 'test_files'
_file = './hdf5_test.h5'


def write_to_file(series):
    store = pd.HDFStore(_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()


if __name__ == "__main__":
    if os.path.isfile(_file):
        os.remove(_file)
    files = os.listdir(_dir)
    for f in files:
        series = pd.read_pickle(os.path.join(_dir, f))
        print("Writing data: %s" % f)
        write_to_file(series)

You'll need to unzip these pickled files:
http://s000.tinyupload.com/?file_id=60238823358379433453

@jreback
Contributor

jreback commented May 21, 2015

@BastianL ok, so it turns out this only core dumps on PyTables 3.2 (I have libhdf5 1.8.14).

Not really sure why. So the workaround is to use PyTables 3.1.1, which seems to work fine.

You can also report this to the PyTables issue tracker. This must be another edge case. Very odd.
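
Until a fixed release is out, a defensive check at import time is one option (a sketch; the equality test assumes only the 3.2.0 release is affected, per this thread):

import tables

# PyTables 3.2.0 is the release that segfaults on these out-of-order
# appends (per this thread); 3.1.1 works, and 3.2.1 carries the fix.
if tables.__version__ == '3.2.0':
    raise RuntimeError(
        "PyTables 3.2.0 can segfault on out-of-order HDFStore appends; "
        "pin tables==3.1.1 or upgrade once a fixed release is available.")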

@bastianlb
Author

ok, I can confirm that this works on 3.1.1 and not on 3.2.0. Here is a better self-contained example, for reference.

import os
import pandas as pd

output_file = './hdf5_test.h5'

def write_to_file(series):
    # Open in append mode so earlier chunks are preserved.
    store = pd.HDFStore(output_file, 'a')
    frame = pd.DataFrame(series)
    print("Appending data from {0} to {1}".format(
        frame.index[0], frame.index[-1]))
    store.append('test', frame)
    store.close()

def run():
    if os.path.isfile(output_file):
        os.remove(output_file)
    vals = range(340)
    # First chunk: 2011-01-03 13:40 through 2011-01-04 17:55.
    s1 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-03 13:40:00',
        end='2011-01-04 17:55:00')
    )
    write_to_file(s1)
    # Second chunk skips ahead, leaving a gap after s1.
    s2 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-05 22:20:00',
        end='2011-01-07 02:35:00')
    )
    write_to_file(s2)
    # Third chunk fills the gap between s1 and s2; this out-of-order
    # append is the one that segfaults on PyTables 3.2.0.
    s3 = pd.Series(vals, pd.date_range(
        freq='5t',
        start='2011-01-04 18:00:00',
        end='2011-01-05 22:15:00')
    )
    write_to_file(s3)

if __name__ == "__main__":
    run()
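
For reference, a quick read-back check after running the repro (assuming a PyTables build without the bug, e.g. 3.1.1): append stores rows in write order, not time order, so the index comes back non-monotonic until sorted.

df = pd.read_hdf(output_file, 'test')
print(len(df))                           # 1020 rows: three appends of 340
print(df.index.is_monotonic_increasing)  # False: rows are in write order
df = df.sort_index()                     # restore chronological order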

@andreabedini
Contributor

This should be fixed in PyTables/PyTables@5e2a63b. I will make a bug fix release soon-ish.

@jreback
Contributor

jreback commented Aug 4, 2015

ok, looks like 3.2.1 was just released, so pls test it out

@andreabedini
Contributor

thanks @jreback, I forgot to mention it here

@bastianlb
Author

@jreback I realized I never circled back to this, but PyTables/PyTables@5e2a63b fixes it. Thanks for helping get to the bottom of the issue.

@jreback
Contributor

jreback commented Nov 24, 2015

gr8!

@jreback jreback closed this as completed Nov 24, 2015