HDF5.remove with where removing all rows #17567

Open
omytea opened this issue Sep 18, 2017 · 4 comments
Labels
Bug IO HDF5 read_hdf, HDFStore

Comments

@omytea

omytea commented Sep 18, 2017

When I use an 'in' condition in HDFStore.remove, it removes all data in the table instead of only the selected rows.

In [0]:
store = pd.HDFStore("d:/org.h5")
a = store.get(table["table_name"])
a.index

Out[0]:
Index(['{7D23701A-991B-11E7-977E-448A5B7647CF}',
       '{7D23701E-991B-11E7-977E-448A5B7647CF}',
       '{7D237020-991B-11E7-977E-448A5B7647CF}',
       ...
       '{9E1257B7-99E4-11E7-B36E-448A5B7647CF}',
       '{9E1257B9-99E4-11E7-B36E-448A5B7647CF}',
       '{9E1257CB-99E4-11E7-B36E-448A5B7647CF}'],
      dtype='object', name='object_id', length=6710)

In [1]:
a[a['trade_dt'] == '2017-09-14'].index

Out[1]:
Index(['{7D23701A-991B-11E7-977E-448A5B7647CF}',
       '{7D23701E-991B-11E7-977E-448A5B7647CF}',
       '{7D237020-991B-11E7-977E-448A5B7647CF}',
       ...
       '{7D236E8C-991B-11E7-977E-448A5B7647CF}',
       '{7D236E9A-991B-11E7-977E-448A5B7647CF}',
       '{7D236EA0-991B-11E7-977E-448A5B7647CF}'],
      dtype='object', name='object_id', length=3353)

In [2]:
store.select(table["table_name"], where="index in %s" % list(a[a['trade_dt'] == '2017-09-14'].index)).shape

Out[2]:
(3353, 22)

In [3]:
store.remove(table["table_name"], where="index in %s" % list(a[a['trade_dt'] == '2017-09-14'].index))

Out[3]:
6710

Problem description

The "where" condition in HDFStore.select returns the 3353 selected rows. I expected the same "where" condition in HDFStore.remove to remove those same 3353 rows. Instead it removes 6710 rows, which is every row in this table.

I also tested with a single row in the "index in [list]" where condition. That removes one row, as I expected.

In [4]:
[a[a['trade_dt'] == '2017-09-14'].index[0]]

Out[4]:
['{7D23701A-991B-11E7-977E-448A5B7647CF}']

In [5]:
store.select(table["table_name"], where="index in %s" % [a[a['trade_dt'] == '2017-09-14'].index[0]]).shape

Out[5]:
(1, 22)

In [6]:
store.remove(table["table_name"], where="index in %s" % [a[a['trade_dt'] == '2017-09-14'].index[0]])

Out[6]:
1

I also tested equivalent condition operators such as "=", and the result is the same.

Output of pd.show_versions()

'0.20.3'

@gfyoung gfyoung added the IO HDF5 read_hdf, HDFStore label Sep 18, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 18, 2017
@jreback
Contributor

jreback commented Sep 18, 2017

So you would have to show a reproducible example; the following works.

In [1]: df = pd.DataFrame({'A': np.arange(10), 'B': pd.date_range('20130101',periods=10)}).set_index('B')
   ...: store = pd.HDFStore('test.h5')
   ...: store.append('df', df)
   ...: assert len(store.select('df')) == 10
   ...: store.remove('df', where='index in ["20130102", "20130103"]')
   ...: assert len(store.select('df')) == 8
   ...: 

In [2]: store.select('df')
Out[2]: 
            A
B            
2013-01-01  0
2013-01-04  3
2013-01-05  4
2013-01-06  5
2013-01-07  6
2013-01-08  7
2013-01-09  8
2013-01-10  9

@jreback jreback removed this from the Next Major Release milestone Sep 18, 2017
@omytea
Author

omytea commented Sep 18, 2017

@jreback

I made a slight change to your test code, and it reproduces now.

In [1]:
df = pd.DataFrame({'A': np.arange(1000), 'B': pd.date_range('20130101',periods=1000)}).set_index('B')
select = list(df.reset_index().B.sample(100))
store = pd.HDFStore('test.h5')
store.put('df', df, format='table')
assert len(store.select('df')) == 1000
print('select len %d' % len(store.select('df', where='index in %s' % select)))
store.remove('df', where='index in %s' % select)
print('after remove %d' % len(store.select('df')))
assert len(store.select('df')) == 900

select len 100
after remove 0
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-136-cf9062a4d575> in <module>()
      7 store.remove('df', where='index in %s' % select)
      8 print('after remove %d' % len(store.select('df')))
----> 9 assert len(store.select('df')) == 900


AssertionError: 

In [2]:
store.select('df')

Out[2]:
A
B	

@jreback
Contributor

jreback commented Sep 18, 2017

OK, I would appreciate it if you can debug this. There are not very many test cases for this, so we can certainly add this one and fix it.

@jreback jreback added this to the Next Major Release milestone Sep 18, 2017
@jreback jreback changed the title "where" parameter in HDFStore.remove behaving incorrect HDF5.remove with where removing all rows Sep 18, 2017
@TomAugspurger TomAugspurger modified the milestones: Next Major Release, 0.21.0 Oct 12, 2017
@GuessWhoSamFoo
Contributor

Tangentially related to #15937. The remove method uses numexpr, so selecting as shown above works only as long as the list has <= 31 elements. A warning pointing to http://pandas.pydata.org/pandas-docs/stable/io.html#selecting-using-a-where-mask seems reasonable.

Swapping the commented lines here will return an empty dataframe.

In [2]:
df = pd.DataFrame({'A': np.arange(1000),
                   'B': pd.date_range('20130101', periods=1000)}
                  ).set_index('B')

select = list(df.reset_index().B.sample(32))
store = pd.HDFStore('temp.h5')
store.put('df', df, format='table')
where = df.index.isin(select)

store.select('df', where=where)

#store.remove('df', where='index in %s' % select)
store.remove('df', where=where)

Out[2]:
32

In [3]:
len(store.select('df'))

Out[3]:
968
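
If the 31-element limit described above is indeed what triggers the problem, another possible workaround is to split the list into batches of at most 31 values and call remove once per batch. The sketch below is not part of pandas: `batched_where_clauses` and `remove_in_batches` are hypothetical helper names, `store` is assumed to behave like an open `pd.HDFStore` holding a table-format object, and the values are assumed to have a `repr` that the where expression can parse, as they do in the examples in this thread.

```python
from typing import Iterator, Sequence


def batched_where_clauses(values: Sequence, column: str = "index",
                          batch_size: int = 31) -> Iterator[str]:
    """Yield "<column> in [...]" strings with at most batch_size values each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(values), batch_size):
        batch = list(values[start:start + batch_size])
        # Same string interpolation as the thread: "index in [v1, v2, ...]"
        yield "%s in %r" % (column, batch)


def remove_in_batches(store, key: str, values: Sequence,
                      column: str = "index", batch_size: int = 31) -> int:
    """Call store.remove once per batch and return the total rows removed.

    Assumption: store.remove(key, where=...) returns the number of rows it
    removed, as the outputs earlier in this thread suggest.
    """
    removed = 0
    for where in batched_where_clauses(values, column, batch_size):
        removed += store.remove(key, where=where)
    return removed
```

With the variables from the reproduction above, this would be called as `remove_in_batches(store, 'df', select)`. It issues one remove per batch, so for long lists the where-mask approach shown above is likely the simpler choice.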

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022