How a Data Hub can help you recover your memory and tackle the GDPR right to be forgotten

The challenge

As part of the GDPR, a client can ask for their records to be deleted. In a typical large organization, client records can be spread across many different systems, from CRM to eCommerce to campaign management solutions. Finding these records can quickly become a nightmare.

In this article, we describe how MarkLogic and the Data Hub pattern can help find customer records across these systems.

The Operational Data Hub pattern

The Operational Data Hub (ODH) pattern, powered by MarkLogic, is designed to ease the integration of data from silos, to link and enrich that data, and to create an actionable 360° view that can serve any demand.


The Data Hub consolidates data from silos, so if we enrich the process with lineage, we can track all data in the 360° view back to their source records.

Data dictionaries meet the triple store

Organizations usually maintain data dictionaries for their databases, so it is quite easy to get a CSV or Excel file representing the table definitions. It could look like this:


With MarkLogic, we could load it as-is using MarkLogic Content Pump (mlcp), which by default loads each individual line as an XML or JSON document. That is not quite what we want: it makes more sense to have one document per table. The easiest way is therefore to load the CSV as a text file and apply an on-the-fly transform that processes the CSV content and generates the expected documents (a sketch of such a transform follows the structure outline below).

Each table document could have the following structure:
  • Table Name
  • Description
  • System
  • Fields {Array}
    • Field
      • Field Name
      • Description
      • Data Type
      • ...
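
Below is a minimal sketch of such an mlcp transform, written as a MarkLogic Server-Side JavaScript module. The module path, URI scheme and CSV column order are assumptions made for illustration, and it relies on mlcp's ability to return several content objects from a single transform call.

// /transforms/dictionary-to-tables.sjs -- hypothetical module path
'use strict';

// Receives the whole CSV (loaded as a single text document) and returns one
// JSON content object per table; mlcp inserts each as a separate document.
function transform(content, context) {
  const text = xdmp.quote(content.value);                       // raw CSV content
  const lines = text.split(/\r?\n/).filter(l => l.trim().length > 0);
  lines.shift();                                                 // drop the header row
  // Assumed column order: Table;System;Description;Field;Field description;Data type
  const tables = {};
  for (const line of lines) {
    const c = line.split(';');
    if (!tables[c[0]]) {
      tables[c[0]] = { tableName: c[0], system: c[1], description: c[2], fields: [] };
    }
    tables[c[0]].fields.push({ fieldName: c[3], description: c[4], dataType: c[5] });
  }
  return Object.keys(tables).map(name => ({
    uri: '/dictionary/' + tables[name].system + '/' + name + '.json',
    value: xdmp.toJSON(tables[name])
  }));
}

exports.transform = transform;

The CSV can then be loaded with an mlcp import using -input_file_type documents, -document_type text, -transform_module pointing at this module, and -output_collections data-dictionary, so that the generated table documents land in a dedicated collection.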


Now that we have all the table definitions loaded as documents, it would be even better to have them as a graph:


To do that, we just have to create a simple template for Template Driven Extraction (TDE) in order to lift the data from the documents into the triple store.
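
As an illustration, here is a hedged sketch of such a template, installed with Server-Side JavaScript. The ontology IRIs, the collection name and the template URI are assumptions; only the table-to-system relation is shown, and similar triples could be added for fields.

'use strict';
declareUpdate();
const tde = require('/MarkLogic/tde.xqy');

// A TDE template generating one triple per table document:
//   <table IRI> belongsToSystem <system IRI>
const template = xdmp.toJSON({
  template: {
    context: '/',
    collections: ['data-dictionary'],
    triples: [{
      subject:   { val: 'sem:iri(fn:concat("http://example.org/table/", tableName))' },
      predicate: { val: 'sem:iri("http://example.org/ontology/belongsToSystem")' },
      object:    { val: 'sem:iri(fn:concat("http://example.org/system/", system))' }
    }]
  }
});

// Once installed, MarkLogic maintains these triples automatically for every
// matching document, at load time and on updates.
tde.templateInsert('/templates/table-to-system.json', template);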

Lineage at loading

At load time, we can now tag each individual record coming from a source with a triple that links it to the source table it comes from.


Using the envelope pattern, the relation can be stored in the triples section of the envelope.
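
A minimal sketch of such an ingest step is shown below. The record, URIs and IRIs are hypothetical, and we assume that sem.triple values placed in the envelope's triples array end up serialized as MarkLogic embedded triples when the JSON document is written.

'use strict';
declareUpdate();

// Hypothetical CRM row being ingested into the hub.
const sourceRecord = { customerId: 'C-123', firstName: 'Jane', lastName: 'Doe' };
const recordUri = '/raw/crm/customer/C-123.json';

const envelope = {
  envelope: {
    headers: { sourceSystem: 'CRM', ingestedAt: fn.currentDateTime().toString() },
    // Lineage triple: this record comes from the CRM CUSTOMER table.
    triples: [
      sem.triple(
        sem.iri('http://example.org/record' + recordUri),
        sem.iri('http://example.org/ontology/comesFromTable'),
        sem.iri('http://example.org/table/CUSTOMER'))
    ],
    instance: sourceRecord
  }
};

xdmp.documentInsert(recordUri, envelope, { collections: ['raw', 'crm'] });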

Lineage from business entity to its sources

In the Data Hub pattern, we search for related records and harmonize them in order to create business entities, which the Data Hub exposes to consumer systems. Harmonization usually performs some merging and deduplication tasks in order to generate the business entity.

We can therefore easily add tracking logic to collect all the source records used to create the entity. The relation with the sources can then be materialized using PROV-O. A simple representation could look like this:




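As an illustration, here is a hedged sketch of a harmonization step that records this PROV-O lineage. The merge logic is deliberately naive, and the URIs, IRIs and collection names are assumptions; a real Data Hub flow would use its own harmonization plugins.

'use strict';
declareUpdate();

// Source records (from different systems) that the harmonization merges into one entity.
const sourceUris = ['/raw/crm/customer/C-123.json', '/raw/ecommerce/account/9876.json'];
const entityUri = '/entities/customer/JANE-DOE.json';

// Naive merge: later sources overwrite earlier ones.
const merged = sourceUris
  .map(uri => cts.doc(uri).toObject().envelope.instance)
  .reduce((acc, instance) => Object.assign(acc, instance), {});

// One prov:wasDerivedFrom triple per source record used to build the entity.
const lineage = sourceUris.map(uri =>
  sem.triple(
    sem.iri('http://example.org/entity' + entityUri),
    sem.iri('http://www.w3.org/ns/prov#wasDerivedFrom'),
    sem.iri('http://example.org/record' + uri)));

xdmp.documentInsert(entityUri, {
  envelope: { headers: {}, triples: lineage, instance: merged }
}, { collections: ['business-entity', 'customer'] });
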
It's time to recover the memory

Now we have everything in place. We can run a query on the graph that starts from a business entity, finds its source records (used by the harmonization process) and then identifies the tables and systems associated with those records.
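
A hedged sketch of that query, using the example IRIs introduced above (all of them assumptions), could look like this:

'use strict';

const query = `
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX ex:   <http://example.org/ontology/>

  SELECT ?record ?table ?system
  WHERE {
    ?entity  prov:wasDerivedFrom ?record .
    ?record  ex:comesFromTable   ?table .
    ?table   ex:belongsToSystem  ?system .
  }`;

const results = sem.sparql(query, {
  entity: sem.iri('http://example.org/entity/entities/customer/JANE-DOE.json')
});

// Each row tells us which source record, table and system holds data for this entity.
for (const row of results) {
  xdmp.log(`record=${row.record} table=${row.table} system=${row.system}`);
}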



The data dictionary can also help identify table primary keys, so the query can also output the actual source-system query with primary key constraints.
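
For example, assuming the dictionary documents flag key columns with a (hypothetical) isPrimaryKey property, a small helper could turn a lineage hit into a ready-to-run source-system query:

'use strict';

// Builds a source-system query for one record, assuming the record's property
// names match the dictionary field names (an assumption for this sketch).
function buildSourceQuery(tableDoc, record) {
  const table = tableDoc.toObject();
  const where = table.fields
    .filter(f => f.isPrimaryKey)
    .map(f => `${f.fieldName} = '${record[f.fieldName]}'`)
    .join(' AND ');
  return `SELECT * FROM ${table.tableName} WHERE ${where}`;
}

// e.g. buildSourceQuery(cts.doc('/dictionary/CRM/CUSTOMER.json'),
//        cts.doc('/raw/crm/customer/C-123.json').toObject().envelope.instance)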

Takeaway

We recently implemented this logic in the insurance industry. In such a context, the data hub contains multiple business entities, from customers to contracts, claims and third parties.



In MarkLogic, all these entities are linked using triples, so by extension you can create a query that starts from a client, finds all client-related business entities, and from there identifies the source records, tables and systems.
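
Such a query is a small extension of the earlier one; in this hedged sketch, the ex:relatesToClient predicate and the client IRI are assumptions standing in for whatever links entities to a client in a given model:

'use strict';

const results = sem.sparql(`
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX ex:   <http://example.org/ontology/>

  SELECT DISTINCT ?entity ?record ?table ?system
  WHERE {
    ?entity  ex:relatesToClient  ?client .
    ?entity  prov:wasDerivedFrom ?record .
    ?record  ex:comesFromTable   ?table .
    ?table   ex:belongsToSystem  ?system .
  }`,
  { client: sem.iri('http://example.org/client/C-123') });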

The other opportunity is related to customer data portability. As MarkLogic is multi-model, the same query can be used not only to find the records in the graph but also to export the content itself from the business entities (stored as JSON or XML). The query can indeed perform document access and return all the data related to a customer.
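
Continuing with the same assumed IRI scheme (which embeds the document URI), a hedged sketch of that export could dereference each record IRI back to its document:

'use strict';

const prefix = 'http://example.org/record';
const rows = sem.sparql(`
  PREFIX prov: <http://www.w3.org/ns/prov#>
  SELECT ?record WHERE { ?entity prov:wasDerivedFrom ?record . }`,
  { entity: sem.iri('http://example.org/entity/entities/customer/JANE-DOE.json') });

// Collect the full source content wrapped in each envelope for the export bundle.
const exported = [];
for (const row of rows) {
  const uri = String(row.record).replace(prefix, '');
  exported.push(cts.doc(uri).toObject().envelope.instance);
}
exported;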


The overall logic was implemented in less than two days on top of the Data Hub's data integration logic.


