You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am puzzled with the following problem. I have a set of csv files, which I parse iterativly. Before collecting the dataframes in a list, I apply some function (as simple as tmp_df*2) to each of the tmp_df. It all worked perfectly fine at first glance, until I've realized I have inconsistencies with the results from run to run. Strangly, I don't not observe inconsistencies like this one when I don't manipulate the parsed data (simply set with_function=False ). I have managed to re-create the problem, it should run on win and ux. I've tested on win8.1 facing the problem when with_function=True (typically after 1-5 runs), on ux it runs without problems. with_function=False runs without differences for win and ux. I can also reject the hypothesis that it is related to int or float issue as also the simulated int are different. Here is the discussion on SO http://stackoverflow.com/questions/35252460/inconsistent-results-when-concatenating-parsed-csv-files
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir
def simulate_csv_data(tmp_dir,num_files=5):
""" simulate a csv files
:param tmp_dir: Path, csv files are saved to
:param num_files: int, how many csv files to simulate
:return:
"""
rows = 20000
columns = 5
np.random.seed(1282)
for file_num in range(num_files):
file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
simulated_df['some_int'] = np.random.randint(0,100)
simulated_df.to_csv(str(file_path))
def get_csv_data(tmp_dir,num_files=5, with_function=True):
""" Collect various csv files and return a concatenated dfs
:param tmp_dir: Path, csv files are saved to
:param num_files: int, how many csv files to simulate
:param with_function: Bool, apply function to tmp_dataframe
:return:
"""
data_list = list()
for file_num in range(num_files):
# current file path
file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
# load csv_file
tmp_df = pd.read_csv(str(file_path), dtype=np.float64)
# replace infs by na
tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
# apply function to tmp_dataframe
if with_function:
tmp_df = tmp_df*2
data_list.append(tmp_df)
df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)
return df
def main():
# INPUT ----------------------------------------------
num_files = 5
with_function = True
max_comparisons = 50
# ----------------------------------------------------
tmp_dir = gettempdir()
# use temporary "non_existing" dir for new file
tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')
# if exists already don't simulate data/files again
if tmp_csv_folder.exists() is False:
tmp_csv_folder.mkdir()
print('Simulating temp files...')
simulate_csv_data(tmp_csv_folder, num_files)
print('Getting benchmark data frame...')
df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
df_is_same = True
count_runs = 0
# Run until different df is found or max runs exceeded
print('Comparing data frames...')
while df_is_same:
# get another data frame
df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
count_runs += 1
# compare data frames
if df1.equals(df2) is False:
df_is_same = False
print('Found unequal df after {} runs'.format(count_runs))
# print out a standard deviations (arbitrary example)
print('Std Run1: \n {}'.format(df1.std()))
print('Std Run2: \n {}'.format(df2.std()))
if count_runs > max_comparisons:
df_is_same = False
print('No unequal df found after {} runs'.format(count_runs))
print('Delete the following folder if no longer needed: "{}"'.format(
str(tmp_csv_folder)))
if __name__ == '__main__':
main()
The text was updated successfully, but these errors were encountered:
next time pls pd.show_versions() as well as a much shorter description - you are pasting code which cannot be reproduced without the source files and so is pretty much useless
I am puzzled with the following problem. I have a set of csv files, which I parse iterativly. Before collecting the dataframes in a list, I apply some function (as simple as
tmp_df*2
) to each of thetmp_df
. It all worked perfectly fine at first glance, until I've realized I have inconsistencies with the results from run to run. Strangly, I don't not observe inconsistencies like this one when I don't manipulate the parsed data (simply setwith_function=False
). I have managed to re-create the problem, it should run on win and ux. I've tested on win8.1 facing the problem when with_function=True (typically after 1-5 runs), on ux it runs without problems. with_function=False runs without differences for win and ux. I can also reject the hypothesis that it is related to int or float issue as also the simulated int are different. Here is the discussion on SO http://stackoverflow.com/questions/35252460/inconsistent-results-when-concatenating-parsed-csv-filesThe text was updated successfully, but these errors were encountered: