Testing Datahub for Certification/Safety/Fraud management - Time meets semantics (1/2)

We are going to take a deep dive into a solution built to explore temporal patterns in large datasets of structured and unstructured data.





Why does it matter?


Testing is a key activity in the product lifecycle.

Tests help product owners and engineers design better products and optimise their characteristics, but also reduce risks and improve safety. This is particularly true for the biotechnology and transport industries.
Product testing and certification require multiple strict tests throughout the product lifecycle, from pre-production to, sometimes, live operations. For complex products, the challenge is to analyse the product's behaviour for a particular configuration and set of test conditions, using measurements as well as expert feedback. These dimensions are different types of data: some are structured, others unstructured, and they also carry temporal and sometimes geospatial dimensions.

The challenge is to query all these dimensions using industry-standard capabilities (advanced geospatial, semantics, thesaurus, stemming) and at scale. Moreover, all these data live in silos.

The Datasets

The main objective of the solution is to answer questions based on temporal patterns that mix any type of data. The patterns are actually networks of relations between observations, the observations being either subsequent or concomitant.

What are we going to consider as an observation?

An observation is a timestamped event (point-in-time or period) which can have attached data, metadata and/or content. Depending on the context, the observations could be a mix of:


Sensor measures:
- IoT devices
- Sensors
- Blood tests
- Any other similar measures




Expert reports: 
- They are unstructured documents (PDF, Word, etc.) containing domain terms, descriptions, feedback, analyses, etc.

KPI / Processed data:
- It can be any result obtained by processing raw data (prediction from an AI model, regression, etc.) and linked to a period in time


Geolocation: 
- They come from GPS or other devices (points, surfaces, trajectories)
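
To make the definition of an observation concrete, here is a minimal sketch of how one could be modelled. The class and field names (`Observation`, `kind`, `data`, `metadata`, `content`) are assumptions chosen for illustration; they are not the data model of the solution described in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


@dataclass
class Observation:
    """A timestamped event; all names here are illustrative assumptions."""
    kind: str                                   # e.g. "sensor", "report", "kpi", "geo"
    start: datetime                             # when the event starts
    end: Optional[datetime] = None              # None means a point-in-time event
    data: Dict[str, Any] = field(default_factory=dict)      # structured values
    metadata: Dict[str, Any] = field(default_factory=dict)  # e.g. author, device id
    content: Optional[str] = None               # unstructured text, e.g. a report body

    def period(self) -> Tuple[datetime, datetime]:
        """Return (start, end); a point-in-time event has start == end."""
        return (self.start, self.end if self.end is not None else self.start)


# A point-in-time sensor measure
speed = Observation(
    kind="sensor",
    start=datetime(2020, 1, 1, 10, 0, 0),
    data={"measure": "SPEED", "value": 520},
)

# An expert report covering a one-hour period
report = Observation(
    kind="report",
    start=datetime(2020, 1, 1, 10, 0, 0),
    end=datetime(2020, 1, 1, 11, 0, 0),
    metadata={"author": "test pilot"},
    content="Strong vibrations felt during the climb.",
)
```

Giving every observation a period, even a zero-length one, lets sensor measures, reports, KPIs and geolocations be compared on the same time axis.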





The contextual Data

The contextual data is usually shared by all the observations.

Product related data
- It can be the product configuration or the patient profile (yes, the patient looks like a product here)

Test related data
- The objectives of the overall test, its assumptions and initial setup

Contextual data
- Weather, county stats, etc.

What questions can we ask?

We want to find all the periods that match a specific temporal pattern of observations.
Applied to industries, it could be:

  • Transportation: sequence of events during a product test
  • Clinical trial: evolution of a patient state
  • Intelligence: behaviour of a person collected via intelligence reports or captures

The temporal pattern

The question we want to ask is primarily based on a temporal pattern.
The temporal pattern is a graph of relations between groups of observations. As mentioned before, the relations can be:

  • Temporal sequence (subsequent): one (or more) group(s) after the other(s)
  • Temporal intersection (concomitant): N observations happen (at least partially) at the same time

It actually corresponds to a temporal query such as:
"I want GroupA to happen XX seconds before the shared period between (GroupB and GroupC), which happens YY seconds before GroupD, etc."



Translated into industry domain terms, it could look like:

  • Transportation: a sequence of events such as (AUTOPILOT=ON) then (ROLL>60 while SPEED>500) then (ALT>10000)
  • Clinical trial: the sequence could be based on (treatment taken) then (temperature<40) then (doctor report says xxxx)
  • Fraud: (user connects) then (performs XXX) then (trace YYY)
  • Intelligence: (person calls phone #) then (moves to area while other person in area) then (switches off phone)
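
The temporal query quoted earlier (GroupA, then the shared period of GroupB and GroupC, then GroupD) can be sketched with plain interval arithmetic. This is only an illustration under stated assumptions: periods are `(start, end)` pairs in seconds, "XX seconds before" is read as "ends at most XX seconds before the next period starts", and the helper names (`intersect`, `shared_periods`, `follows`, `match_sequence`) are hypothetical. It says nothing about how the actual solution evaluates patterns.

```python
from typing import List, Optional, Tuple

Period = Tuple[float, float]  # (start, end) in seconds; assumed representation


def intersect(a: Period, b: Period) -> Optional[Period]:
    """Shared period of two concomitant periods, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def shared_periods(xs: List[Period], ys: List[Period]) -> List[Period]:
    """All pairwise intersections between two groups of periods."""
    result = []
    for x in xs:
        for y in ys:
            shared = intersect(x, y)
            if shared is not None:
                result.append(shared)
    return result


def follows(prev: Period, nxt: Period, max_gap: float) -> bool:
    """True if nxt starts after prev ends, by at most max_gap seconds."""
    gap = nxt[0] - prev[1]
    return 0 <= gap <= max_gap


def match_sequence(groups: List[List[Period]], max_gaps: List[float]) -> List[Period]:
    """Return the overall periods that chain one period from each group, in order."""
    matches = list(groups[0])
    for group, max_gap in zip(groups[1:], max_gaps):
        # Extend each partial match with every period of the next group that
        # follows it closely enough; a partial match ends where its last period ends.
        matches = [
            (m[0], p[1])
            for m in matches
            for p in group
            if follows(m, p, max_gap)
        ]
    return matches


group_a = [(0.0, 10.0)]
group_b = [(12.0, 30.0)]
group_c = [(15.0, 40.0)]
group_d = [(35.0, 50.0)]

shared_bc = shared_periods(group_b, group_c)   # [(15.0, 30.0)]
result = match_sequence([group_a, shared_bc, group_d], max_gaps=[5.0, 10.0])
print(result)                                  # [(0.0, 50.0)]
```

The nested loops are quadratic in the number of periods per group, which is fine for a sketch but would need indexing to work at scale.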

The observation filter

Each group of observations in the pattern is filtered based on conditions (the ones inside the parentheses in the examples above).


Unstructured content conditions
We consider reports that are related to a timestamped observation. The objective is to select reports, and therefore their related time periods, based on content conditions. With a search engine you can perform full-text search with stemming and, usually, term expansion using a thesaurus. MarkLogic can of course do all that, but as we are dealing with expert domains, semantics is highly relevant. Using a domain ontology, it is possible to enrich the content with key concepts and/or expand queries. It is then possible to find reports (and so the related periods of time) even if the search terms are not present in the document.
Moreover, sentiment analysis can also be applied to select documents based on their tone.
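
As a rough illustration of query expansion, the sketch below selects reports whose text contains any term related to the searched concept. The `THESAURUS` dictionary, the report layout and the function names are hypothetical, and the matching is a naive word lookup without stemming. It is not MarkLogic code and only mimics the idea in plain Python.

```python
import re
from typing import Dict, List, Set

# Hypothetical mini-thesaurus: concept -> related terms (an assumption for this example)
THESAURUS: Dict[str, Set[str]] = {
    "vibration": {"vibration", "vibrations", "shaking", "buffeting", "oscillation"},
}


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand(term: str) -> Set[str]:
    """Return the term plus its related terms; unknown terms expand to themselves."""
    return THESAURUS.get(term, {term})


def matching_reports(reports: List[dict], term: str) -> List[dict]:
    """Select reports whose text contains the term or any related term."""
    wanted = expand(term)
    return [r for r in reports if tokenize(r["text"]) & wanted]


reports = [
    {"id": "r1", "start": 0, "end": 3600, "text": "Heavy buffeting during descent."},
    {"id": "r2", "start": 3600, "end": 7200, "text": "Nominal flight, nothing to report."},
]

hits = matching_reports(reports, "vibration")
periods = [(r["start"], r["end"]) for r in hits]
print(periods)  # [(0, 3600)]
```

Report `r1` is selected although the word "vibration" never appears in it, which is the behaviour described above: the period is found through a related term.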

Structured content conditions
If the observation contains structured data (measure/value), the conditions to select the related period are based on a "standard" boolean expression with conditions on values and the typical operators (>=, <=, >, <, =).
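
A condition such as ROLL>60 can be turned into time periods by scanning the samples and grouping consecutive ones that satisfy it. The sketch below assumes samples are `(timestamp, value)` pairs; the `OPS` table and `matching_periods` function are names invented for this example.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Map the operators listed above to Python comparison functions
OPS: Dict[str, Callable[[float, float], bool]] = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

Sample = Tuple[float, float]  # (timestamp in seconds, value); assumed layout


def matching_periods(samples: List[Sample], op: str, threshold: float) -> List[Tuple[float, float]]:
    """Return (start, end) periods made of consecutive samples satisfying the condition."""
    test = OPS[op]
    periods = []
    start = None
    last = None
    for ts, value in sorted(samples):
        if test(value, threshold):
            if start is None:
                start = ts          # a new matching period begins
            last = ts
        elif start is not None:
            periods.append((start, last))   # the condition stopped holding
            start = None
    if start is not None:
        periods.append((start, last))       # close a period still open at the end
    return periods


roll = [(0, 10), (1, 65), (2, 70), (3, 40), (4, 61)]
print(matching_periods(roll, ">", 60))  # [(1, 2), (4, 4)]
```

Several conditions (for example ROLL>60 while SPEED>500) would then be combined by intersecting the periods each one produces.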

Geospatial conditions
We want to consider only the time periods during which the observed product is at a particular location. The condition can be based on proximity to a point, surface or trajectory.
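
The simplest of these conditions, proximity to a point, can be sketched with the haversine great-circle distance. The track layout `(timestamp, latitude, longitude)`, the function names and the 10 km radius are assumptions for illustration; surface and trajectory queries need a real geospatial engine.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

Fix = Tuple[float, float, float]  # (timestamp, latitude, longitude); assumed layout


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def timestamps_near(track: List[Fix], center: Tuple[float, float], radius_km: float) -> List[float]:
    """Timestamps of the fixes located within radius_km of the center point."""
    return [
        ts
        for ts, lat, lon in track
        if haversine_km(lat, lon, center[0], center[1]) <= radius_km
    ]


paris = (48.8566, 2.3522)
track = [
    (0, 48.8566, 2.3522),     # at the center point
    (60, 48.90, 2.40),        # a few kilometres away
    (120, 51.5074, -0.1278),  # London, far outside the radius
]
print(timestamps_near(track, paris, radius_km=10.0))  # [0, 60]
```

The selected timestamps can then be grouped into periods and combined with the other conditions.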

Actually, a single observation can be selected by combining any of the above conditions.
Example: a doctor's report containing unstructured text but also "structured" metadata.

Coming soon

Now that we have set the scene (the datasets and the questions we want to ask), we can detail how MarkLogic can manage such complex requirements by leveraging its multi-model database, semantic capabilities, geospatial features and more.

This is what we will present in the next part of this post.

