How a Data Hub can help recover memory to tackle the GDPR right to be forgotten
The challenge
As part of the GDPR, a client can ask for their records to be deleted. In a typical large organization, client records can be spread across many different systems, from CRM, to eCommerce, to campaign management solutions. Finding these records can quickly become a nightmare. In this article we will describe how MarkLogic and the Data Hub pattern can help find customer records across these systems.
The Operational Data Hub pattern
The Operational Data Hub (ODH) pattern, powered by MarkLogic, is designed to ease the integration of data from silos, link and enrich that data, and create an actionable 360° view that can serve all demands.
The Data Hub consolidates data from silos, so if we enrich the process with lineage, we can then trace every piece of data in the 360° view back to its source records.
Data dictionaries meet the triple store
Organizations usually maintain data dictionaries for their databases, so it is quite easy to get a CSV or an Excel file representing the table definitions. It could look like this:
With MarkLogic, we can load it as-is using MarkLogic Content Pump (MLCP), which by default would load each individual line as an XML or JSON document. However, that is not exactly what we want: it would make more sense to have one document per table. The easiest way to achieve that is to load the CSV as a text file and apply a transform on the fly that processes the CSV content and generates the expected documents.
The file could have the following structure:
- Table Name
- Description
- System
- Fields {Array}
- Field
- Field Name
- Description
- Data Type
- ...
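To make the transform concrete, here is a minimal Python sketch of the grouping logic it performs (the column names and sample rows are hypothetical, and in a real deployment this would be an MLCP transform rather than a standalone script):

```python
import csv
import io
import json
from collections import OrderedDict

# Hypothetical data-dictionary CSV: one row per field, tagged with its table.
DICTIONARY_CSV = """\
System,Table Name,Table Description,Field Name,Field Description,Data Type
CRM,CUSTOMER,Customer master data,CUST_ID,Primary key,INTEGER
CRM,CUSTOMER,Customer master data,EMAIL,Contact email,VARCHAR
ECOM,ORDERS,Online orders,ORDER_ID,Primary key,INTEGER
"""

def csv_to_table_documents(csv_text):
    """Group the per-field CSV rows into one document per table,
    mirroring the structure listed above."""
    tables = OrderedDict()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["System"], row["Table Name"])
        doc = tables.setdefault(key, {
            "tableName": row["Table Name"],
            "description": row["Table Description"],
            "system": row["System"],
            "fields": [],
        })
        doc["fields"].append({
            "fieldName": row["Field Name"],
            "description": row["Field Description"],
            "dataType": row["Data Type"],
        })
    return list(tables.values())

docs = csv_to_table_documents(DICTIONARY_CSV)
print(json.dumps(docs[0], indent=2))
```

The same grouping applied on the fly at load time yields one JSON document per table instead of one per CSV line.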
Now that we have all table definitions loaded as documents, it would be even better to have them as a graph:
To do that, we just have to create a simple template for Template Driven Extraction (TDE) to lift the data from the documents into the triple store.
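To illustrate what the template extracts (this Python sketch mimics the output of a TDE template, not its actual JSON syntax, and the IRIs are hypothetical), lifting one table document produces triples linking the system, the table, and its fields:

```python
# Sketch of the triples a TDE template would emit for a table document.
def lift_triples(table_doc, base="http://example.org/dictionary/"):
    table_iri = base + "table/" + table_doc["tableName"]
    system_iri = base + "system/" + table_doc["system"]
    triples = [
        (table_iri, "rdf:type", base + "Table"),
        (table_iri, base + "belongsToSystem", system_iri),
    ]
    for field in table_doc["fields"]:
        field_iri = table_iri + "/" + field["fieldName"]
        triples.append((table_iri, base + "hasField", field_iri))
    return triples

doc = {"tableName": "CUSTOMER", "system": "CRM",
       "fields": [{"fieldName": "CUST_ID"}, {"fieldName": "EMAIL"}]}
for t in lift_triples(doc):
    print(t)
```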
Lineage at loading
At loading time, we can now tag each individual record coming from a source with a triple that links it to the source table it comes from. Using the envelope pattern, this relation can be stored in the triples section of the envelope.
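A minimal sketch of such an envelope, assuming a hypothetical `comesFromTable` predicate and record IRI scheme:

```python
def wrap_in_envelope(instance, source_table_iri):
    """Envelope pattern sketch: the raw record goes in 'instance',
    and a lineage triple pointing at its source table goes in 'triples'."""
    record_iri = "http://example.org/record/" + str(instance["CUST_ID"])
    return {
        "envelope": {
            "headers": {},
            "triples": [
                {"subject": record_iri,
                 "predicate": "http://example.org/dictionary/comesFromTable",
                 "object": source_table_iri}
            ],
            "instance": instance,
        }
    }

env = wrap_in_envelope(
    {"CUST_ID": 42, "EMAIL": "jane@example.com"},
    "http://example.org/dictionary/table/CUSTOMER")
print(env["envelope"]["triples"][0])
```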
Lineage from business entity to its sources
In the Data Hub pattern, we apply searches for related records and harmonization in order to create business entities, which are exposed by the Data Hub to the consumer systems. The harmonization usually performs merging and deduplication tasks to generate the business entity. We can thus easily add tracking logic to collect all the source records used to create an entity. The relation with the sources can then be materialized using PROV-O. A simple representation could look like this:
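In PROV-O terms, the natural predicate for this relation is `prov:wasDerivedFrom`. A sketch of the triples the harmonization step would record (entity and record IRIs are hypothetical):

```python
PROV = "http://www.w3.org/ns/prov#"

def lineage_triples(entity_iri, source_record_iris):
    """Materialize the entity-to-source relation with prov:wasDerivedFrom."""
    return [(entity_iri, PROV + "wasDerivedFrom", src)
            for src in source_record_iris]

triples = lineage_triples(
    "http://example.org/entity/customer-42",
    ["http://example.org/record/crm-1",
     "http://example.org/record/ecom-7"])
for t in triples:
    print(t)
```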
It's time to recover the memory
Now we have everything in place. We can run a query on the graph that starts from a business entity, finds its source records (those used by the harmonization process), and then identifies the tables and systems associated with these records. The data dictionary can also help identify table primary keys, so the query can even output the actual source-system query with primary key constraints.
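The traversal chains the three relations introduced above. Here it is sketched in plain Python over an in-memory triple list (the prefixed predicate names and IRIs are hypothetical; in MarkLogic this would be a SPARQL query over the triple index):

```python
def find_sources(triples, entity_iri):
    """Walk entity -> source record -> source table -> source system."""
    def objects(s, p):
        return [o for (s2, p2, o) in triples if s2 == s and p2 == p]
    results = []
    for record in objects(entity_iri, "prov:wasDerivedFrom"):
        for table in objects(record, "dict:comesFromTable"):
            for system in objects(table, "dict:belongsToSystem"):
                results.append((record, table, system))
    return results

triples = [
    ("entity/customer-42", "prov:wasDerivedFrom", "record/crm-1"),
    ("record/crm-1", "dict:comesFromTable", "table/CUSTOMER"),
    ("table/CUSTOMER", "dict:belongsToSystem", "system/CRM"),
]
print(find_sources(triples, "entity/customer-42"))
```

Each result row tells the operator exactly which table in which system still holds a record to be deleted.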
Take away
We recently implemented this logic in the insurance industry. In such a context, the data hub contains multiple business entities, from customers, to contracts, claims, and third parties.
The other opportunity is related to customer data portability. As MarkLogic is multi-model, the same query can be used not only to find the records in the graph but also to export the content itself from the business entities (stored in JSON or XML). The query can indeed perform document access to return all the data related to a customer.
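A sketch of that portability step, combining the lineage traversal with a document lookup (here the document store is simulated as a dict keyed by record IRI; in MarkLogic the documents would be fetched by URI):

```python
def export_customer_data(triples, documents, entity_iri):
    """Follow lineage from the entity to its source records,
    then pull the matching documents for export."""
    records = [o for (s, p, o) in triples
               if s == entity_iri and p == "prov:wasDerivedFrom"]
    return {uri: documents[uri] for uri in records if uri in documents}

triples = [("entity/customer-42", "prov:wasDerivedFrom", "record/crm-1"),
           ("entity/customer-42", "prov:wasDerivedFrom", "record/ecom-7")]
documents = {"record/crm-1": {"CUST_ID": 42, "EMAIL": "jane@example.com"},
             "record/ecom-7": {"ORDER_ID": 7, "CUST_ID": 42}}

export = export_customer_data(triples, documents, "entity/customer-42")
print(export)
```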
The overall logic was implemented in less than two days on top of the Data Hub's data integration logic.