Inconsistent results when concatenating parsed csv files with function application on windows #12250

MMCMA · 2016-02-07T15:08:49Z

I am puzzled by the following problem. I have a set of csv files, which I parse iteratively. Before collecting the dataframes in a list, I apply a function (as simple as tmp_df*2) to each tmp_df. It all seemed to work fine at first glance, until I realized that the results are inconsistent from run to run. Strangely, I do not observe such inconsistencies when I don't manipulate the parsed data (simply set with_function=False).

I have managed to re-create the problem; the script below should run on Windows and Unix. On Windows 8.1 I hit the problem with with_function=True (typically after 1-5 runs); on Unix it runs without problems. with_function=False gives identical results on both Windows and Unix. I can also rule out an int-versus-float issue, since the simulated int column differs between runs as well.

Here is the discussion on SO: http://stackoverflow.com/questions/35252460/inconsistent-results-when-concatenating-parsed-csv-files

import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir


def simulate_csv_data(tmp_dir,num_files=5):
    """ simulate a csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """

    rows = 20000
    columns = 5
    np.random.seed(1282)

    for file_num in range(num_files):

        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0,100)
        simulated_df.to_csv(str(file_path))


def get_csv_data(tmp_dir,num_files=5, with_function=True):
    """ Collect various csv files and return a concatenated dfs
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :param with_function: Bool, apply function to tmp_dataframe
    :return:
    """

    data_list = list()

    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))

        #  load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)

        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)

        # apply function to tmp_dataframe
        if with_function:
            tmp_df = tmp_df*2

        data_list.append(tmp_df)

    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)

    return df

def main():

    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------

    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for new file
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')

    # if exists already don't simulate data/files again
    if not tmp_csv_folder.exists():
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)

    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0

    # Run until different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if not df1.equals(df2):
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out a standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))

        # stop once max_comparisons runs have found no difference
        elif count_runs >= max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))

    print('Delete the following folder if no longer needed: "{}"'.format(
            str(tmp_csv_folder)))


if __name__ == '__main__':
    main()


jreback · 2016-02-07T15:14:02Z

you need numexpr 2.4.6

see here: #12023
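A quick way to check whether an installation is affected is to compare the installed numexpr version against 2.4.6. The following is only a minimal sketch: the helper names (`parse_version`, `numexpr_needs_upgrade`, `installed_numexpr_version`) are made up for illustration and are not part of pandas or numexpr, and the version parsing is deliberately simplistic.

```python
import importlib


def parse_version(text):
    """Turn '2.4.6' or '2.4.6.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def numexpr_needs_upgrade(installed, minimum='2.4.6'):
    """Return True if the installed version string is older than minimum."""
    return parse_version(installed) < parse_version(minimum)


def installed_numexpr_version():
    """Return numexpr's version string, or None if numexpr is not installed."""
    try:
        numexpr = importlib.import_module('numexpr')
    except ImportError:
        return None
    return numexpr.__version__


version = installed_numexpr_version()
if version is None:
    print('numexpr is not installed')
elif numexpr_needs_upgrade(version):
    print('numexpr {} is older than 2.4.6, consider upgrading'.format(version))
else:
    print('numexpr {} is recent enough'.format(version))
```

If upgrading is not possible right away, pandas also has a `compute.use_numexpr` option; setting it with `pd.set_option('compute.use_numexpr', False)` should bypass numexpr for these operations, at some cost in speed on large frames.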

next time pls post the output of pd.show_versions() as well as a much shorter description - you are pasting code which cannot be reproduced without the source files, so it is pretty much useless
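For reference, a shorter report along these lines might look like the sketch below. It assumes the mismatch comes from the multiplication step itself (which the with_function=False observation suggests) rather than from csv parsing, so it skips the files entirely. The function names `repeated_products_agree` and `report` are made up for illustration.

```python
import numpy as np
import pandas as pd


def repeated_products_agree(rows=20000, columns=5, repeats=20, seed=1282):
    """Multiply the same frame by 2 repeatedly and compare with the first result."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame(rng.standard_normal((rows, columns)))
    baseline = df * 2
    # On a healthy installation every repeat equals the baseline.
    return all(baseline.equals(df * 2) for _ in range(repeats))


def report():
    """Print the environment details and the outcome of the check."""
    pd.show_versions()
    print('All repeated products equal: {}'.format(repeated_products_agree()))


if __name__ == '__main__':
    report()
```

Whether this reduced version actually triggers the mismatch on an affected Windows setup is untested here; it only shows the shape of a report that needs no external files.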

MMCMA · 2016-02-07T15:26:10Z

Thanks, this solves the problem. Sorry, I'll try to improve.

MMCMA closed this as completed Feb 7, 2016
