
Experimental Data Authenticity and Trustworthiness


index >> Experimental Data Authenticity and Trustworthiness


Progress in technology has enabled the creation of large and complex datasets. However, many researchers lack the expertise and resources to properly analyze, interpret, and store the data. Furthermore, uniform procedures for newer technological solutions and methodologies are still lacking, which can lead to inconsistencies that affect the ability to accurately reproduce an experiment and its data. One possible way to improve the trustworthiness of experimental data collected from local sensors is to use redundant datasets. Each smart DAQ connected in a swarm-like manner holds, in real time, a copy of the datasets of every other device it connects to; these copies are updated, verified, and replicated across the swarm on each new data measurement. This way, in case of external interference, it is possible to determine whether the interference was continuous or happened intermittently during the experimental campaign, by identifying mismatched data values between the current dataset and older datasets previously received from the other devices. Each smart DAQ has the task of comparing previously received datasets from another device with newer ones received from the same device, while at the same time verifying changes in the experimental data blockchain hashes.
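A minimal sketch of this cross-device comparison, assuming each dataset copy is held as a simple mapping from a record identifier to its value (both names are hypothetical, standing in for the firmware's actual storage format):

```python
def find_mismatches(cached_copy, received_copy):
    """Compare a previously received copy of another device's dataset with a
    newer copy from the same device and report records whose values changed.
    Both arguments are hypothetical {record_id: value} mappings."""
    mismatches = []
    for record_id, old_value in cached_copy.items():
        new_value = received_copy.get(record_id)
        if new_value is not None and new_value != old_value:
            mismatches.append((record_id, old_value, new_value))
    return mismatches
```

A single mismatched record points to interference of an intermittent nature, while mismatches appearing on every exchange suggest interference of a continuous nature.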

Another way to improve the trustworthiness of the experimental data collected is to upload to the data repository all copies of a dataset, including the ones collected from all connected smart DAQ devices. In this case, the data repository needs to verify incoming datasets by comparing previously stored ones with the newer one. The data repository must be able to verify changes in the experimental data blockchain hashes among all dataset files previously downloaded and notify a smart DAQ when a mismatch is found. Finally, it is also possible to apply an MD5 checksum, or another file authenticity verification algorithm, to each dataset file saved over time during an ongoing experiment. This allows each dataset file to be linked to the next, similar to a blockchain, meaning any attempt to change a dataset is easily identified when performing experimental data validation tasks.
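As an illustration of this file-level linkage, the Python sketch below feeds the previous dataset file's checksum into the MD5 digest of the next file; the function name and the exact chaining rule are assumptions for illustration, not the scheme implemented by the firmware or the data repository.

```python
import hashlib

def chained_md5(dataset_path, previous_checksum=""):
    """Return an MD5 checksum that covers both the dataset file and the
    checksum of the previously saved file, linking the two like blocks."""
    md5 = hashlib.md5()
    md5.update(previous_checksum.encode("utf-8"))      # link to the previous dataset file
    with open(dataset_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # hash the file in chunks
            md5.update(chunk)
    return md5.hexdigest()

# Example: checksum_2 depends on dataset_2.csv and on checksum_1, so editing
# dataset_1.csv after the fact breaks the rest of the chain.
# checksum_1 = chained_md5("dataset_1.csv")
# checksum_2 = chained_md5("dataset_2.csv", checksum_1)
```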

The proposed SDAD has an onboard cryptographic integrated circuit, the ATSHA204. This IC implements the SHA-256 hash algorithm and has a unique 72-bit serial number that is used to identify an SDAD in a swarm network during an experimental campaign, and later when performing data validation tasks. This is necessary so that any sensor data measurement made with this open hardware electronics can be traced back to its origin, the laboratory where the experimental sensor data was created, by combining it with the unique 72-bit serial number onboard. This kind of experimental data validation is unique to each data measurement made and uploaded to a data repository in near real-time.

3.1 - Universal Numeric Fingerprint

The Universal Numeric Fingerprint (UNF) is an algorithm that ensures a data matrix or digital object produced in one format is correctly interpreted when moved to a different environment and format. This algorithm is described in detail in the work by Micah Altman, "A Fingerprint Method for Verification of Scientific Data" [13], and summarized in Algorithm 1 below.


Algorithm 1 - Pseudocode (simplified) for calculation of a single UNF signature for a data record

For every $V_i$ in $V$
  Round the vector element $V_i$ to $N$ significant digits (IEEE 754)
    $V_{i,N} = round(V_i, N)$
  Convert $V_{i,N}$ into a character string in exponential notation, with no trailing zeros
    $VCS_i$ = [+-][d{1}][.][d{N-1}][e][+-][d{N}][\n][\000]
  Concatenate all the individual character strings
    $V_e = V_e + VCS_i$

Compute the SHA-256 hash of $V_e$
  $V_{e,hash} = SHA256(V_e)$

Truncate the resulting hash to 128 bits
  $V_{e,hash128} = Trunc(V_{e,hash}, 128)$

Encode the resulting string in Base64
  $V_{e,b64} = Base64(V_{e,hash128})$

Add the signature header
  $UNFID$ = "UNF:6:N[N]:" + $V_{e,b64}$


To determine the UNF of a vector, the first step consists of rounding each sensor data value to a precision of N significant digits, converting it to a character string, VCSi, and terminating it with a POSIX end-of-line character, \n, plus a null byte, \000. Zero values use the notation +0.e+ for a positive zero and -0.e+ for a negative zero. When a data value is NULL, it is added as a missing value represented by three null bytes, \000\000\000. Boolean values are represented as the integers 0 and 1. If an element is an IEEE 754 non-finite special floating-point value, the corresponding representation uses the lowercase strings +inf, -inf, or +nan. When a value Vi is a character string, it must be in Unicode UTF-8 encoding and truncated to the number of bytes used for hashing; the default is 128. Regarding date and time values, dates follow the zero-padded forms YYYY-MM-DD, YYYY, or YYYY-MM. ISO 8601 is used for the representation of time, using the format hh:mm:ss.fffff with two-digit, zero-padded numbers, except for fffff, the fraction of a second, which must contain no non-significant zeros. A combined date and time value is formatted by concatenating a date representation, the single letter "T", and a time representation into a single character string (partial date representations are not allowed). Finally, a date-time interval is represented by concatenating two date-time values separated by a slash, "/".


Example 1 – examples of value normalization

$1$                      →  +1.e+
$\pi$ at 5 digits        →  +3.1415e+
$-300$                   →  -3.e+2
$0.00073$                →  +7.3e-4
Positive infinity        →  +inf
$1.23456789$ for $N=7$   →  +1.234568e+
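A minimal Python sketch of this value normalization, covering only finite and non-finite numbers (strings, dates, and missing values follow the additional rules described above); it is an illustration of the rules, not the reference UNF implementation:

```python
def normalize_number(x, n=7):
    """Normalize a number per the rules above: round to n significant digits,
    then format as [sign][digit].[fraction]e[sign][exponent digits], dropping
    trailing zeros in the fraction and leading zeros in the exponent."""
    if x != x:                                    # IEEE 754 NaN
        return "+nan"
    if x in (float("inf"), float("-inf")):        # IEEE 754 infinities
        return "+inf" if x > 0 else "-inf"
    mantissa, exponent = f"{x:.{n - 1}e}".split("e")
    sign = "-" if mantissa.startswith("-") else "+"
    digits = mantissa.lstrip("+-")
    whole, _, frac = digits.partition(".")
    exp_sign, exp_digits = exponent[0], exponent[1:].lstrip("0")
    return f"{sign}{whole}.{frac.rstrip('0')}e{exp_sign}{exp_digits}"

# normalize_number(1)          -> "+1.e+"
# normalize_number(-300)       -> "-3.e+2"
# normalize_number(0.00073)    -> "+7.3e-4"
# normalize_number(1.23456789) -> "+1.234568e+"
```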

The next step consists of concatenating all the individual character strings, VCSi, into Ve, computing the SHA-256 hash of the combined string, and truncating the hash to the number of bytes used (128 by default). The final step consists of encoding the truncated hash using a Base64 algorithm, for improved readability, and prepending the signature header "UNF:Y:", where Y indicates the version of the algorithm in use (see www.dataverse.org). Example 2 shows this procedure in a more visual manner for the numeric vector {1.23456789, <missing>, 0} and the default normalization of N=7.


Example 2 – example of a vector normalization

Normalized elements
“+1.234568e+”, “\000\000\000”, “+0.e+”

Combined string
“+1.234568e+\n\000\000\000\000+0.e+\n\000”

SHA-256 hash, truncated to the default 128 bits
Do5dfAoOOFt4FSj0JcByEw==

Printable UNF
UNF:6:Do5dfAoOOFt4FSj0JcByEw==
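Under the same simplifying assumptions, the normalization sketch above can be extended to reproduce Example 2: each normalized element is terminated with \n\000, a missing value becomes three null bytes, the combined string is hashed with SHA-256, truncated to 128 bits, and Base64-encoded behind the "UNF:6:" header. This is a sketch for numeric vectors only, not a complete UNF v6 implementation.

```python
import base64
import hashlib

def unf_v6_numeric(values, n=7, trunc_bits=128):
    """Compute a UNF signature for a vector of numbers, where None marks a
    missing value; reuses normalize_number() from the previous sketch."""
    combined = b""
    for v in values:
        if v is None:
            combined += b"\x00\x00\x00"                        # missing value
        else:
            combined += normalize_number(v, n).encode("utf-8") + b"\n\x00"
    digest = hashlib.sha256(combined).digest()[: trunc_bits // 8]
    return "UNF:6:" + base64.b64encode(digest).decode("ascii")

# For the vector {1.23456789, <missing>, 0} with the default N = 7 this should
# yield the printable UNF shown in Example 2: UNF:6:Do5dfAoOOFt4FSj0JcByEw==
print(unf_v6_numeric([1.23456789, None, 0]))
```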


3.2 - Unique Fingerprint Identification

Micah Altman's [13] UNF is a step forward in the verification of scientific data; however, an additional requirement is needed for data measurements themselves to be traced back to their origin, in particular the physical location where measurements are collected and transformed into an experimental data record.

The proposed Unique Fingerprint Identification (UFPID) uses Micah Altman's [13] algorithm with specific data values that are unique to the specimen being tested. Each SDAD has two static serial numbers that can only change with the replacement of hardware components; this means that, regardless of the firmware code installed and running on the proposed open hardware electronics, the specific values appended to a UNF signature do not change. The OEM firmware code proposed in this work includes, in the generation of a UFPID, sensor data values describing the conditions under which each experimental data record was created. To the sensor data measurements collected at any given point in time during an experiment, two environmental data values characteristic of the environment where the experiment is running are added: the onboard temperature and the onboard humidity. Motion and yaw values are added for the purpose of detecting motion of the specimen itself. Finally, the proposed firmware is able to include geo-location data obtained from IP geo-tracing over the internet: the latitude, the longitude, and the time of the request. This is a configurable and optional part of the UFPID; when disabled, it is represented by three null bytes, \000\000\000.

The changes proposed to Micah Altman's [13] UNF algorithm consist of adding the following data record to the data vector to be normalized (the number of data values is given between { }):

  • local date & time {1};
  • experimental data array {1...n};
  • onboard sensor measurements:
    • temperature {1};
    • humidity {1};
    • motion vector {3};
    • yaw vector {3};
    • I.A.Q. sensor data (T.V.O.C., $eCO_2$, A.Q.I.) {3};
    • luminosity {1};
    • atmospheric pressure {1};
  • Geo-location data and external IP address resolution {3}:
    • with hardware based on the ESP32-S3 MCU, the geo-location coordinates are obtained from an external public website;
    • with hardware based on the Nordic nRF9160 MCU, the geo-location coordinates are obtained directly from the hardware electronics;
  • 72-bit hardware ID {1};
  • The Unique ID of a scientific researcher {1...n};

This additional data string added to the data vector holding a single experimental data record is then used to generate a UNF that can be linked to a specific SDAD, by adding specific time-dependent environmental sensor data that vary over time during an experiment, reducing even further the possibility of data forgery. Furthermore, it includes information about the motion/displacement of the SDAD itself, which can be used to detect unwanted handling when the experiment is left unattended in a laboratory. Finally, the unique identification of the scientific researchers responsible for the experimental data collection during an experiment is added to the data vector. Well-accepted and well-known IDs include, for instance, the ORCID, the Web of Science ID, the Scopus ID, or any other identifier with the same purpose and intent.
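As an illustration of how such a data vector could be assembled before applying the UNF algorithm of section 3.1, the sketch below builds the record in the order listed above; all field names are hypothetical and only stand in for the values produced by the firmware.

```python
def build_ufpid_vector(record, previous_ufpid=None):
    """Assemble the data vector for one experimental data record, following
    the field order listed above; 'record' is a hypothetical dict produced by
    the firmware, and the resulting vector is normalized and hashed as in
    section 3.1."""
    return [
        record["timestamp"],                     # local date & time {1}
        *record["measurements"],                 # experimental data array {1..n}
        record["temperature"],                   # onboard temperature {1}
        record["humidity"],                      # onboard humidity {1}
        *record["motion"],                       # motion vector {3}
        *record["yaw"],                          # yaw vector {3}
        *record["iaq"],                          # T.V.O.C., eCO2, A.Q.I. {3}
        record["luminosity"],                    # luminosity {1}
        record["pressure"],                      # atmospheric pressure {1}
        *record.get("geo", [None, None, None]),  # latitude, longitude, time of request {3}; missing when disabled
        record["hardware_id"],                   # 72-bit hardware ID {1}
        *record["researcher_ids"],               # researcher IDs, e.g. ORCID {1..n}
        previous_ufpid,                          # UFPID of the previous record (NULL for the first block, section 3.3)
    ]
```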

3.2.1 - Weather Data Validation

The Citizen Weather Observer Program (CWOP) is part of the USA's National Oceanic and Atmospheric Administration (NOAA) and is composed of a network of privately owned electronic weather stations concentrated in the United States but also present in over 150 countries. The CWOP was originally set up by amateur radio operators experimenting with packet radio, but now consists mostly of Internet-only connected stations, with more than 10,000 stations worldwide reporting regularly to the CWOP network (July 2015). This network allows volunteers with computerized weather stations to send automated surface weather observations to the National Weather Service in the USA. These data are then used by the Rapid Refresh forecast model to produce short-term forecasts (3 to 12 hours into the future) of conditions across the United States' lower 48 states or in any other country. CWOP observations are also redistributed to the public. Before the data are used, an extensive set of quality control verifications is performed, a data quality rating is assigned, and suggestions are made before the data are considered for modeling and forecasting the weather. The data are quality controlled through the Meteorological Assimilation Data Ingest System (MADIS).

MADIS observations are quality-controlled during data processing and these results are made available to users. Observations in the MADIS database are stored with a series of flags indicating the quality of the observation from a variety of perspectives (e.g. temporal consistency and spatial consistency), or more precisely, a series of flags indicating the results of various quality control (QC) checks. MADIS users and their applications can then inspect the flags and decide whether or not to use the observation.

The QC procedures are, for the most part, provided by the NWS Techniques Specification Package (TSP) 88-21-R2 (1994). Two categories of QC checks, static and dynamic, are described in the TSP for a variety of observation types, including most of the observations available in the different MADIS datasets. The static QC checks are single-station, single-time checks which, as such, are unaware of the previous and current meteorological or hydrologic situation described by other observations and grids. Checks falling into this category include validity, internal consistency, and vertical consistency checks. Although useful for locating extreme outliers in the observational database, the static checks can have difficulty with statistically reasonable but invalid data. To address these difficulties, the TSP also describes dynamic checks that refine the QC information by taking advantage of other available hydrometeorological information. Examples of dynamic QC checks include position consistency, temporal consistency, and spatial consistency. The TSP also describes single-character "data descriptors" for each observation, which are intended to provide an overall opinion of the quality of the observation by combining the information from the various QC checks. The algorithms used to compute the data descriptor are a function of the types of QC checks applied to the observation and the sophistication of those checks. Level 1 QC checks are considered the least sophisticated and level 3 checks the most sophisticated.

3.3 - Blockchain

The proposed SDAD is an "Internet of Everything" (IoE) device, able to connect with others using swarm intelligence. The experimental data collected is stored in a block format, meaning that a single block stores an individual experimental data record written to it, the UFPID of the previous block, and its own UFPID.

The main principle of operation behind blockchain technology is to make any modification to a data record difficult once it has been written to a block, and consists in connecting all data blocks to each other since the creation of the first one, at the beginning of an experiment, experimental campaign, or research project. Every block of data created references the UFPID of the previous block. This way, any modification to the data record stored in a block changes the corresponding UFPID, forcing the following blocks to be modified accordingly. To modify a single block, a rewrite of all subsequent blocks is needed.

With the changes proposed in section 3.2, one final data value is added to the data vector to be normalized for computing the UNF: the UFPID of the previous data record. On creation of the first block, this value is NULL.
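A minimal sketch of this block structure, reusing the hypothetical build_ufpid_vector() helper above and assuming a unf() helper that implements the algorithm of section 3.1 extended to character strings:

```python
def make_block(record, previous_block=None):
    """Create a new data block: it stores the experimental data record, the
    UFPID of the previous block (None for the first block), and its own UFPID
    computed over both; unf() and build_ufpid_vector() are the sketches above."""
    previous_ufpid = previous_block["ufpid"] if previous_block else None
    return {
        "record": record,
        "previous_ufpid": previous_ufpid,
        "ufpid": unf(build_ufpid_vector(record, previous_ufpid)),
    }
```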

Verification of the blockchain integrity

Verification of the blockchain integrity is performed in parallel during an experimental campaign, using separate computing resources. It can be done on any laptop/computer by the scientific researcher responsible for the experimental data measurements and collection, by any of the team members, or by any other person who wants to perform a data audit, for instance the editorial team of a journal when a communication related to existing datasets in a data repository is submitted. The task consists of generating the UFPID for each experimental data record stored in a dataset file or database and comparing it with the existing UFPID for the same data record. Both must match exactly, for all data records stored in the dataset file or database.
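Such an audit could be scripted along the following lines on any computer, again assuming the hypothetical unf() and build_ufpid_vector() helpers used above:

```python
def verify_blockchain(blocks):
    """Recompute the UFPID of every block in a dataset and compare it with the
    stored one, also checking the link to the previous block; returns the index
    of the first mismatching block, or None if the whole chain is intact."""
    previous_ufpid = None
    for index, block in enumerate(blocks):
        expected = unf(build_ufpid_vector(block["record"], previous_ufpid))
        if block["ufpid"] != expected or block["previous_ufpid"] != previous_ufpid:
            return index                 # mismatch: a record was altered or re-ordered
        previous_ufpid = block["ufpid"]
    return None                          # every stored UFPID matches its recomputed value
```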


Open Scientific Data << Experimental Data Authenticity and Trustworthiness >> Swarm Intelligence

