using Pandas DataFrame can corrupt data #182

rogpeppe · 2021-01-11T10:00:42Z

The code in dataframe_serializer.py could use some improvement.

The code is hard to follow. It builds up a string expression and then runs it with eval, which is error-prone and has potential for security flaws.
When there are null values in the data, it passes the encoded points through a regular-expression-based translation phase, which can corrupt data.
The line-protocol encoding code is a separate encoder from the one in influxdb_client/client/write/point.py, which leaves room for encoding bugs that exist in one encoder but not the other.
The code has undocumented (and probably unwanted) side effects on the data frame being encoded: default tags are added directly to the data_frame object rather than only to the encoded line-protocol points.
Any DataFrame index values are converted to timestamps regardless of whether that is appropriate. It would be better if that were something to opt into explicitly (converting the default RangeIndex index into time values is almost never going to be correct; a better default would probably be to omit the timestamps and let the server add them).

Here is some example code that demonstrates corruption of data:

import pandas as pd
from influxdb_client.client.write_api import PointSettings
from influxdb_client.client.write.dataframe_serializer import data_frame_to_list_of_points

frame = pd.DataFrame(
    data=[
        # String values chosen to collide with the regexp-based null cleanup.
        ["coyote_creek", 1.0, "comma, separated, string"],
        ["coyote_creek", None, "b"],
        ["coyote_creek", 3.0, "clocation=nanff"],
        ["coyote_creek", 4.0, "d"],
    ],
    index=[1, 2, 3, 4],
    columns=["location", "level water_level", "str"],
)
ps = data_frame_to_list_of_points(
    frame,
    PointSettings(),
    data_frame_measurement_name="h2o_feet",
    data_frame_tag_columns=["location"],
)
for p in ps:
    print(p)

This prints:

h2o_feet,location=coyote_creek level\ water_level=1.0,str="comma separated string" 1
h2o_feet,location=coyote_creek level\ water_level=nan,str="b" 2
h2o_feet,location=coyote_creek level\ water_level=3.0,str="cff" 3
h2o_feet,location=coyote_creek level\ water_level=4.0,str="d" 4

Note that two string values are corrupted, and the null value is still present (as nan) because the regexp code does not take the escaped field key into account.
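The failure mode can be reproduced in isolation with the standard library alone. The function below is a hypothetical simplification written for this report, not the library's actual code: it assumes the cleanup removes `name=nan` fragments using the unescaped column names and then tidies up leftover separators. The name `strip_nulls_with_regex` is made up for illustration.

```python
import re


def strip_nulls_with_regex(line, column_names):
    """Hypothetical cleanup pass over an already-encoded line-protocol line."""
    # Remove every `name=nan` fragment, matching on the *unescaped* column names.
    nan_pattern = "|".join(re.escape(name) + "=nan" for name in column_names)
    line = re.sub(nan_pattern, "", line)
    # Collapse runs of commas left behind by removed fields.
    line = re.sub(r",{2,}", ",", line)
    # Tidy comma/space leftovers; this also rewrites ", " inside string values.
    line = re.sub(r", ,|, | ,", " ", line)
    return line


columns = ["location", "level water_level", "str"]
encoded = [
    r'h2o_feet,location=coyote_creek level\ water_level=1.0,str="comma, separated, string" 1',
    r'h2o_feet,location=coyote_creek level\ water_level=nan,str="b" 2',
    r'h2o_feet,location=coyote_creek level\ water_level=3.0,str="clocation=nanff" 3',
]
cleaned = [strip_nulls_with_regex(line, columns) for line in encoded]
for line in cleaned:
    print(line)
```

Because the pass works on text that has already been encoded, it cannot tell field separators from the contents of a quoted string, and it misses the key once the space in it has been escaped.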

I would expect it to print this instead:

h2o_feet,location=coyote_creek level\ water_level=1.0,str="comma, separated, string" 1
h2o_feet,location=coyote_creek str="b" 2
h2o_feet,location=coyote_creek level\ water_level=3.0,str="clocation=nanff" 3
h2o_feet,location=coyote_creek level\ water_level=4.0,str="d" 4
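One way to get this output is to drop null fields before encoding each row, so the encoded text never needs post-processing. The following is a minimal sketch under stated assumptions, not the fix that was merged: the helpers `escape_key`, `format_field` and `encode_row` are hypothetical names, rows are plain dicts rather than a DataFrame, and the measurement name is assumed to need no escaping. The timestamp is optional, so it is only emitted when the caller opts in.

```python
import math


def _is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def escape_key(text):
    """Backslash-escape commas, equals signs and spaces in keys and tag values."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render one field value in line-protocol syntax."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def encode_row(measurement, tags, fields, timestamp=None):
    """Encode one row, skipping null tags and fields; return None if no fields remain."""
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(str(v))}"
        for k, v in sorted(tags.items())
        if not _is_null(v)
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}"
        for k, v in fields.items()
        if not _is_null(v)
    )
    if not field_part:
        return None  # a point without fields is not valid line protocol
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp is not None:  # timestamps are opt-in
        line += f" {timestamp}"
    return line


tags = {"location": "coyote_creek"}
print(encode_row("h2o_feet", tags, {"level water_level": None, "str": "b"}, 2))
print(encode_row("h2o_feet", tags, {"level water_level": 3.0, "str": "clocation=nanff"}, 3))
```

Since the null check happens on the Python values, the contents of string fields are never inspected or rewritten.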


rogpeppe mentioned this issue Jan 13, 2021

chore: influxdb_client/client/write: fix data_frame_to_list_of_points #183

Merged

5 tasks

bednar closed this as completed in #183 Jan 18, 2021

bednar added this to the 1.14.0 milestone Jan 28, 2021
