Data Provenance

The objective of the Data Provenance tools, from an end-user perspective, is to give data providers stronger data ownership guarantees, since sharing their datasets through the PIMCity platform will also discourage the illegal copying or reselling of those datasets. This is because our module provides a watermarking algorithm that lets a data buyer or data owner verify data ownership offline, or lets a third-party verifier do so online on behalf of the data owner by reading the data owner’s secret information, subject to prior agreement with that owner.

The Data Provenance tool is an OpenAPI-formatted framework for interoperable transactions with other components in the PIMCity platform. Responsibility for trading the datasets falls outside the scope of the Data Provenance tool, but in future releases we plan to provide metadata that helps data traders exchange such datasets with data buyers and similar parties with confidence.

In particular, we provide the following capabilities to the Trading Engine of PIMCity:

1. Insert a watermark in a dataset to assure data providers that ownership of their data can be tied to their secret information, even if that information is not stored in the PIMCity platform (offline). Data buyers could additionally be given a hint that a particular dataset is legally sourced from the PIMCity platform and belongs to a specific data provider, without the data buyer having to know who that data provider is.
2. Verify the watermark of a dataset, given a data provider’s secret information, to reassure the data provider that a piece of data found in the wild outside PIMCity belongs to them. The secret input is known only to the data provider (offline) and/or to PIMCity (online).
In the first case, the data provider or the PIMCity platform provides or generates the secret information from which a securely watermarked dataset is derived.
In the second case, the secret information used to verify a dataset can be held strictly by the data provider that owns the data, but it must at the very least be read by the Data Provenance component so that it can return True or False as the result of verifying a watermarked dataset.
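
To make the insert and verify operations concrete, here is a minimal Python sketch of a keyed watermark for URL strings. It is an illustration only and not the algorithm implemented in the Data Provenance component. It assumes that every record has a stable identifier, that every URL is a non-empty absolute URL (so it starts with its scheme), and that a mark bit can be stored in the case of the scheme's first letter, which is case-insensitive. The names insert_watermark, verify_watermark and GAMMA, and the 0.9 threshold, are hypothetical.

```python
import hashlib
import hmac

GAMMA = 2  # hypothetical parameter: roughly 1 in GAMMA records carries a mark bit


def _keyed_digest(secret: bytes, record_id: str) -> bytes:
    """HMAC-SHA256 of the record identifier under the owner's secret."""
    return hmac.new(secret, record_id.encode("utf-8"), hashlib.sha256).digest()


def _set_mark(url: str, bit: int) -> str:
    """Encode one bit in the case of the first letter of the URL scheme."""
    head = url[0].upper() if bit else url[0].lower()
    return head + url[1:]


def insert_watermark(records: dict, secret: bytes) -> dict:
    """Return a copy of {record_id: url} with mark bits embedded."""
    marked = {}
    for record_id, url in records.items():
        digest = _keyed_digest(secret, record_id)
        if digest[0] % GAMMA == 0:               # keyed selection of records
            url = _set_mark(url, digest[1] & 1)  # keyed choice of the bit value
        marked[record_id] = url
    return marked


def verify_watermark(records: dict, secret: bytes, threshold: float = 0.9) -> bool:
    """Return True if enough of the selected records carry the expected bit."""
    selected = matches = 0
    for record_id, url in records.items():
        digest = _keyed_digest(secret, record_id)
        if digest[0] % GAMMA == 0:
            selected += 1
            expected = digest[1] & 1
            matches += int(url[0].isupper() == bool(expected))
    return selected > 0 and matches / selected >= threshold
```

Calling insert_watermark(records, secret) and then verify_watermark(marked, secret) with the same secret returns True. With a different secret, or on data that was never marked, only about half of the selected records match by chance, so the check is expected to return False for datasets of reasonable size. Only a party that can read the secret can run the check, which mirrors the offline and online verification roles described above.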

Benefits
The Data Provenance tools in WP4 give data providers greater reassurance about sharing their private data, while discouraging abusive reuse, reselling or simple copying of their datasets in the wild without permission. We have implemented a first watermarking algorithm for strings (browsing-history URLs) that follows an approach similar to the state of the art presented at VLDB ’02 [1].
Regarding permissions, together with the Data Trading Engine we plan to keep a record of the transactions generated in the data trading platform, storing metadata such as the source, the destination and the number of times each dataset has been shared. This will allow us to:

1. Identify whether a dataset located in the ‘wild’ belongs to a given data owner.
2. Trace, or fingerprint, the data buyer of a dataset found in the ‘wild’, without disclosing that buyer’s identity to the public Internet; only metadata is served to the PIMCity platform, which processes it in a secure manner.
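
The planned record keeping could look roughly like the following Python sketch. It only illustrates the metadata named above (source, destination and share count). The class and method names are hypothetical and do not describe the Data Trading Engine's actual data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    dataset_id: str
    source: str       # identifier of the data provider (seller)
    destination: str  # identifier of the data buyer


class TransactionLedger:
    """In-memory record of dataset transactions (illustrative only)."""

    def __init__(self):
        self._transactions = []
        self._share_counts = Counter()

    def record(self, dataset_id: str, source: str, destination: str) -> Transaction:
        """Store one transaction and update the dataset's share count."""
        tx = Transaction(dataset_id, source, destination)
        self._transactions.append(tx)
        self._share_counts[dataset_id] += 1
        return tx

    def times_shared(self, dataset_id: str) -> int:
        """Number of recorded transactions for this dataset (0 if unknown)."""
        return self._share_counts[dataset_id]

    def buyers_of(self, dataset_id: str) -> list:
        """Destinations that received this dataset, in recording order."""
        return [tx.destination for tx in self._transactions
                if tx.dataset_id == dataset_id]
```

Because buyer identifiers stay inside the ledger, a lookup such as buyers_of can narrow down who received a dataset found in the wild without publishing any buyer's identity.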
Moreover, through a data marketplace in PIMCity, the Data Trading Engine provides buyers and sellers with information about every dataset and possibly about the user’s personal transactions. The aim is to offer transparency and trust in data transactions in PIMCity.