using Pandas DataFrame can corrupt data #182

Closed
rogpeppe opened this issue Jan 11, 2021 · 0 comments · Fixed by #183
rogpeppe commented Jan 11, 2021

The code in dataframe_serializer.py could use some improvement.

  • The code is hard to follow. It builds up a string expression and then runs it through eval, which is error-prone and a potential security risk.
  • When the data contains null values, the encoded points are passed through a regular-expression-based translation phase, which can corrupt data.
  • The line-protocol encoding is implemented independently of the encoder in influxdb_client/client/write/point.py, so the two can develop different encoding bugs.
  • The code has undocumented (and probably unwanted) side effects on the data frame being encoded: default tags are added directly to the data_frame object rather than only to the encoded line-protocol points.
  • Any DataFrame index values are converted to timestamps regardless of whether that is appropriate. It would be better if this were something to opt into explicitly: converting the default RangeIndex into time values is almost never going to be correct, and a better default would probably be to omit the timestamps and let the server add them.
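To make the points above concrete, here is a rough sketch of the direction I have in mind: drop null fields before encoding instead of patching the encoded text afterwards, build a new string per row without touching the input, and only emit a timestamp when the caller passes one. Everything here (encode_row, _escape, _is_null, _format_field) is a hypothetical helper written for this illustration, not part of influxdb_client, and the escaping is a simplified reading of the line-protocol rules.

```python
import math


def _escape(text, chars):
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _is_null(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _format_field(value):
    """Format a field value (simplified: bool, int, float, else string)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    # Backslash is escaped first so the quote escapes are not doubled.
    return '"' + _escape(str(value), '\\"') + '"'


def encode_row(measurement, row, tag_columns=(), timestamp=None):
    """Encode one row (a dict) as a line-protocol string.

    Null values are skipped before encoding, `row` is never modified,
    and a timestamp is appended only if one is passed explicitly.
    Returns None when the row has no non-null fields.
    """
    tags, fields = [], []
    for key, value in row.items():
        if _is_null(value):
            continue
        escaped_key = _escape(str(key), ",= ")
        if key in tag_columns:
            tags.append(f"{escaped_key}={_escape(str(value), ',= ')}")
        else:
            fields.append(f"{escaped_key}={_format_field(value)}")
    if not fields:
        return None
    line = ",".join([_escape(measurement, ", ")] + sorted(tags))
    line += " " + ",".join(fields)
    if timestamp is not None:
        line += f" {timestamp}"
    return line


rows = [
    {"location": "coyote_creek", "level water_level": 1.0, "str": "comma, separated, string"},
    {"location": "coyote_creek", "level water_level": None, "str": "b"},
]
for ts, row in enumerate(rows, start=1):
    print(encode_row("h2o_feet", row, tag_columns=["location"], timestamp=ts))
```

Because the null check happens on the Python value rather than on the encoded text, a string field that happens to contain "=nan" is never touched, and an escaped key makes no difference to whether a null is dropped.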

Here is some example code that demonstrates corruption of data:

import pandas as pd
from influxdb_client.client.write_api import PointSettings
from influxdb_client.client.write.dataframe_serializer import data_frame_to_list_of_points

# The second row has a null field value. The first and third rows carry
# string values that the regexp-based null handling should leave alone.
frame = pd.DataFrame(
    data=[
        ["coyote_creek", 1.0, "comma, separated, string"],
        ["coyote_creek", None, "b"],
        ["coyote_creek", 3.0, "clocation=nanff"],
        ["coyote_creek", 4.0, "d"],
    ],
    index=[1, 2, 3, 4],
    columns=["location", "level water_level", "str"],
)
ps = data_frame_to_list_of_points(
    frame,
    PointSettings(),
    data_frame_measurement_name='h2o_feet',
    data_frame_tag_columns=['location'],
)
for p in ps:
    print(p)

This prints:

h2o_feet,location=coyote_creek level\ water_level=1.0,str="comma separated string" 1
h2o_feet,location=coyote_creek level\ water_level=nan,str="b" 2
h2o_feet,location=coyote_creek level\ water_level=3.0,str="cff" 3
h2o_feet,location=coyote_creek level\ water_level=4.0,str="d" 4

Note that two of the string values are corrupted, and the null value is still present because the regexp code does not take the escaped key into account.
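The underlying problem is that a text-level pattern cannot tell a real key=nan field apart from the same characters inside a quoted string or after an escaped space. A minimal illustration follows; NAN_FIELD and strip_nan_fields are made up for this example and are not the expressions used in dataframe_serializer.py, so the exact damage differs from the output above.

```python
import re

# Hypothetical pattern for illustration only (NOT the library's regexp):
# remove anything that looks like "<key>=nan", plus a leading comma if any.
NAN_FIELD = re.compile(r",?\w+=nan")


def strip_nan_fields(line):
    """Naively strip nan-valued fields from an already-encoded line."""
    return NAN_FIELD.sub("", line)


# Case 1: the pattern matches inside a quoted string field value,
# so legitimate string data is silently altered.
in_string = 'h2o_feet,location=coyote_creek level=3.0,str="clocation=nanff" 3'
print(strip_nan_fields(in_string))

# Case 2: the field key contains an escaped space, so the pattern only
# sees the second half of the key and leaves a malformed line behind.
escaped_key = r'h2o_feet,location=coyote_creek level\ water_level=nan,str="b" 2'
print(strip_nan_fields(escaped_key))
```

Both failures go away if nulls are filtered out while the values are still Python objects, before any encoding happens.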

I would expect it to print this instead:

h2o_feet,location=coyote_creek level\ water_level=1.0,str="comma, separated, string" 1
h2o_feet,location=coyote_creek str="b" 2
h2o_feet,location=coyote_creek level\ water_level=3.0,str="clocation=nanff" 3
h2o_feet,location=coyote_creek level\ water_level=4.0,str="d" 4