Metadata in operational systems to enforce compliance (GDPR is not far away)

This week, I'm running out of time for a solution deep dive, so what better than a 45-minute data integration challenge? Here is the objective:

The GDPR (General Data Protection Regulation) will take effect in May 2018 and will affect every organisation that collects or handles data relating to EU citizens.


The first step towards compliance for most organisations is to be able to audit the data to understand where it comes from, how it was processed, with whom it is shared and under what consent.


The ability to manage metadata alongside the data in an operational system is a key enabler for the compliance of the solution.


This week, we will illustrate how MarkLogic can easily manage data and metadata in order to create an operational Datahub able to enforce compliance rules. Here is the scenario:



  1. We'll process 3 sources of customer data coming in JSON and XML formats
  2. We'll ingest the sources as-is AND the provenance data generated by upstream systems
  3. We'll generate a "Golden Record" merging the 3 sources and keeping provenance in metadata attached to the fields
    • Note: the objective here is not to implement matching and merging rules as we usually do in our operational datahubs... we only have 45min in total. But it could be a topic for an upcoming post.
  4. We'll create a semantic layer to track provenance and consent, and we'll go as far as possible in implementing the GDPRov ontology.






Let's start, we have 45 minutes...



T00:00:00

NiFi flows configuration


In order to create our scenario, we first generated 3 fake customer datasets using an online faker tool:



  • CRM: customer identity data (XML format)
  • eCommerce: data related to customer's delivery address (JSON format)
  • Campaign: data related to customer's car (XML format)

We will create NiFi flows to process these inputs.

In the background, NiFi collects all events related to the data flow and generates provenance and lineage information. This information can be pushed to a dedicated NiFi flow (called the Provenance processing flow below). We will use this capability to insert provenance information into MarkLogic.


UPDATE: MarkLogic now has an official NiFi connector, available here: https://github.com/marklogic/nifi-nars

Data processing flow

In order to process the 3 inputs, we create a flow based on standard NiFi components. First we have 3 folder "watchers"; the input from each watcher is first tagged with its source, then split into individual objects (we receive batches of objects), and for each individual object we add some configuration info as metadata in order to build the correct MarkLogic REST calls.
At the end of the flow, we insert the object (XML or JSON) into MarkLogic and finally we call MarkLogic again to trigger Golden Record (re)generation.
The flows are partially shared between the inputs.
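For reference, the two MarkLogic calls at the end of the flow are plain REST extension invocations. With the extension created later in this post (insertXMLSource) and a hypothetical generateGoldenRecord extension for the second step, they look roughly like this (host and port are those of your REST app server):

PUT  http://<host>:<rest-port>/v1/resources/insertXMLSource?rs:source=CRM
POST http://<host>:<rest-port>/v1/resources/generateGoldenRecord?rs:customer-id=<uuid>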




In the controller settings of this instance, you need to start the SiteToSiteProvenanceReportingTask (Google this term for detailed guidelines). It forwards the provenance events to a remote NiFi input port; you essentially point it at the destination NiFi URL and the name of the input port of the provenance processing flow.



T00:13:30

Provenance processing flow

The provenance events are captured by a dedicated NiFi flow. The events are received in batches, so the flow splits each batch into individual events before ingestion into MarkLogic.







Inside MarkLogic

T00:25:00 

Bootstrap a new project

When you develop with MarkLogic, ml-gradle is usually the way to go. In 2 command lines you can bootstrap a project and deploy initial configuration.
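If you have never used ml-gradle, the two commands in question are typically the following (assuming Gradle is installed and the ml-gradle plugin is declared in your build.gradle):

gradle mlNewProject
gradle mlDeploy

mlNewProject scaffolds the project structure and the gradle.properties file (host, ports, credentials), and mlDeploy creates the databases, app servers and security objects described in that configuration.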

T00:30:00 

REST Services for Ingestion

For production projects, there is a solid Data Hub Framework to industrialise data ingestion, harmonisation and distribution. Today, as we have some time constraints... we'll keep the ingestion simple:
MarkLogic is a multi-model operational database. We will leverage this capability to first store the data and metadata as-is using the envelope pattern. To be fast, we create 3 REST extension endpoints to ingest the XML sources, the JSON sources and the provenance source, and apply the envelope pattern. So we ingest the sources as-is (XML or JSON) and wrap them in an envelope: the source data is stored in a "content" entry and we add some metadata in the header.

In MarkLogic, you can create REST extensions on top of what the product already provides out-of-the-box. The code runs in the database, close to the data itself.

Here, for our use case, the logic is extremely simple (a single API call and about 12 lines to keep the code readable...).
Gradle can create new REST extensions for you in 1 line:
gradle mlCreateResource -PresourceName=insertXMLSource -PresourceType=xqy
or, for the JavaScript version:
gradle mlCreateResource -PresourceName=insertXMLSource -PresourceType=sjs
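
To give an idea of what goes into the generated stub, here is a minimal sketch of the XQuery version. The rs:source parameter, the URI scheme and the header fields are illustrative choices, not necessarily the ones used in the real flow:

xquery version "1.0-ml";
module namespace ext = "http://marklogic.com/rest-api/resource/insertXMLSource";

import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: called via PUT /v1/resources/insertXMLSource?rs:source=CRM with the XML object as body :)
declare function ext:put(
  $context as map:map,
  $params  as map:map,
  $input   as document-node()*
) as document-node()?
{
  let $source := map:get($params, "source")
  let $uuid   := sem:uuid-string()
  let $uri    := "/sources/" || $source || "/" || $uuid || ".xml"
  let $envelope :=
    <envelope>
      <headers>
        <uuid>{$uuid}</uuid>
        <source>{$source}</source>
        <ingested>{fn:current-dateTime()}</ingested>
      </headers>
      <content>{$input/node()}</content>
    </envelope>
  return (
    xdmp:document-insert($uri, $envelope, xdmp:default-permissions(), ("sources", $source)),
    map:put($context, "output-types", "application/xml"),
    document { <inserted>{$uri}</inserted> }
  )
};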


Once ingested, the input data looks like this for the XML inputs:





or like this for the JSON:







And here is a NiFi provenance event, ingested as-is since we don't really need to enrich it:





T00:38:30 

REST service for Golden Record generation

As mentioned initially, we are not going to implement complex logic for deduplication and merging. In this scenario, whenever we ingest data related to a customer from one source, we regenerate the Golden Record for this customer right after ingestion by merging the new source data into it.

Of course, what is important here is to keep track of the provenance. This is where XML is quite interesting for data storage: you can easily ingest data and decorate it with metadata stored as XML attributes. So today, to illustrate this, we will store on each Golden Record field, as an attribute, the UUID of the source data it came from.

But wait, MarkLogic is also a bi-temporal database. This means you can query the database in its latest state but also as it was in the past. For compliance this is quite interesting, because you can query a previous state of the database to see what the result of a query would have been at the time it was run. For our current use case, we will use uni-temporal versioning (along a single temporal axis) in order to know what the state of a Golden Record was at any given date in the past. All this at minimal cost (as time is counted today...); have a look here for the temporal collection configuration steps.
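
As a hint of what this changes in the code, once a uni-temporal collection has been configured (the collection name and URI below are illustrative), the Golden Record is written with the temporal API instead of a plain xdmp:document-insert:

xquery version "1.0-ml";
import module namespace temporal = "http://marklogic.com/xdmp/temporal"
  at "/MarkLogic/temporal.xqy";

(: MarkLogic stamps the system start/end times itself and keeps the
   superseded versions of the document queryable :)
temporal:document-insert(
  "golden-records",                      (: hypothetical temporal collection name :)
  "/golden-records/customer-1234.xml",   (: hypothetical URI :)
  <goldenRecord>
    <!-- merged fields go here -->
  </goldenRecord>
)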


For each field of the Golden Record, we keep the freshest value between the new data and the existing Golden Record. Thanks to XQuery, all this can be done in a few lines (we don't have time anyway...).
The good news is that these few lines of code are able to merge customer data coming from XML and JSON sources in a totally generic way.
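
This is not the real code, but a rough sketch of the idea in XQuery, assuming the source envelopes carry an <ingested> timestamp and a <uuid> in their headers (as in the ingestion sketch above) and that the JSON content has been normalised to XML inside the envelope:

xquery version "1.0-ml";

(: for every field name found across the source envelopes, keep the most
   recently ingested value and record the UUID of its source document
   as an attribute on the Golden Record field :)
declare function local:merge-golden-record(
  $envelopes as element(envelope)*
) as element(goldenRecord)
{
  <goldenRecord>{
    let $fields :=
      for $field in $envelopes/content/*
      order by xs:dateTime($field/ancestor::envelope/headers/ingested) descending
      return $field
    for $name in fn:distinct-values($fields/fn:local-name(.))
    let $freshest := $fields[fn:local-name(.) eq $name][1]
    return
      element { $name } {
        attribute source-uuid { fn:string($freshest/ancestor::envelope/headers/uuid) },
        fn:string($freshest)
      }
  }</goldenRecord>
};

Applied to the existing Golden Record envelope plus a newly ingested one, each field of the output ends up decorated with the source-uuid of the envelope it was taken from, which is exactly the provenance decoration described above.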







And so here is a Golden Record generated after the ingestion of 2 sources for a single customer. 










and the database now has multiple (timestamped) versions of the Golden Record


T00:47:30 

Hmm, it seems time is over... Let's do one more thing for today.


Let's just add some Template Driven Extraction (TDE) templates to lift the currently loaded data into the triple store.
We create templates that generate triples representing the relation between a Golden Record and its aggregated fields, the relation between the fields and their data source, and the relation between the data source and all the events that consumed or generated it.
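
As an illustration, one of these templates could look roughly like the following sketch. The element names, attribute names and IRIs are the ones assumed in the earlier sketches (goldenRecord, source-uuid, a customerId element) and are therefore illustrative; TDE val expressions only accept a restricted subset of XPath, so check the TDE documentation before reusing this:

xquery version "1.0-ml";
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: one triple per Golden Record field: the source document "contributed to" this customer :)
tde:template-insert(
  "/templates/golden-record-provenance.xml",
  <template xmlns="http://marklogic.com/xdmp/tde">
    <context>/goldenRecord/*[@source-uuid]</context>
    <triples>
      <triple>
        <subject><val>sem:iri(fn:concat("http://example.org/source/", ./@source-uuid))</val></subject>
        <predicate><val>sem:iri("http://example.org/gdpr#contributedTo")</val></predicate>
        <object><val>sem:iri(fn:concat("http://example.org/customer/", ../customerId))</val></object>
      </triple>
    </triples>
  </template>
)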


and in the end we can create a query:


which lists all these relations (we recognise the NiFi components we used in the ingestion flow).
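
The query is SPARQL run through MarkLogic's semantics API. With the illustrative IRIs from the TDE sketch above, and assuming a second template that links provenance events to the sources they generated, it would look something like this:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: walk from a given customer's Golden Record back to its sources and
   to the NiFi provenance events that produced them :)
sem:sparql('
  PREFIX gdpr: <http://example.org/gdpr#>
  SELECT ?source ?event
  WHERE {
    ?source gdpr:contributedTo <http://example.org/customer/1234> .
    ?event  gdpr:generated     ?source .
  }
')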


T01:15:30 

Well we missed the objective, but 45min was probably a bit optimistic!

But let's consider the result:
  • We loaded 3 heterogeneous sources and merged them into a "Golden Record" which is versioned temporally (using MarkLogic's uni-temporal capability),
  • We ingested the provenance metadata attached to all the sources,
  • We lifted all the necessary data into the triple store,
  • and then ran a query which reconstructs the provenance of all the fields of a given Golden Record....

not so bad...



Addendum

Wait, let's spend 5 more seconds to add encryption at rest. Since we started with the GDPR topic, this is very important.
This time we use the internal key management mechanism, but of course MarkLogic also supports an external KMS.

