Snowflake Data Sharing is a game changer: Be ready to connect the dots (with a click)

In a previous blog I talked about the extended enterprise and how Snowflake can provide the data architecture to support it. Today I will try to illustrate it with a more realistic scenario. The scenario comes from my previous experience in manufacturing but can be adapted to any industry: several parties need to share their data to get the full picture of a product and then estimate its production costs. This could be part of a Design-to-Cost business process aiming at optimising product design and fabrication in order to:

  • Use as many off-the-shelf components as possible
  • Reduce raw material usage
  • Reduce the product complexity
  • Mutualise (share) components across products
  • Reduce waste
  • Eliminate undesired features

To keep things easy to understand, I will simplify the data processing part of the scenario and focus on the most valuable dimension: the sharing.

A composite product

Let's imagine a company producing a final product and working with several third parties and plants (subsidiaries or partners), each in charge of some components of the product. The objective is to connect the dots.



We will consider the product structure of the final product, which represents the product's components and their relationships (https://en.wikipedia.org/wiki/Product_structure_modeling).

The final product is represented by a graph of components, some of them designed and produced by third parties: components A and B in the diagram below.



The product structure contains not only the items of the product but also metadata and the quantity of each child within its parent (the total quantity of a single item is the product of the quantities along all the parent branches above it: for example, 2 assemblies each containing 3 brackets yield 6 brackets in total).

OK, now what about the data, and why Snowflake?

To analyse the total cost of the product, usually done as part of the Design-to-Cost business process, you need to get the overall product structure, calculate all part quantities and find methods to estimate the cost of these parts. We will drastically simplify the workflow here but we will keep the following:

  • The core product structure is managed by the main organisation
  • The component A and B product structures are managed by subsidiaries/partners
  • The main organisation and the subsidiaries/partners have their own supply chains and buy the parts required to build the core product or the individual components.

These are the datasets that have to be merged to get the total estimated cost of the product.

But instead of moving the data around, as almost everyone in the industry does, we will share it from where it lives (in each producer's own account) thanks to the Snowflake Data Sharing capability.

What is the impact?
  • Immediate time to value
  • Data always fresh
  • Data always at producer quality (no degradation, since there is no export or reprocessing)
  • End-to-end governance of the data
  • No data export, ingestion, processing or validation == faster and cheaper

Generating the datasets

In order to simulate this scenario we will:
  • Create one main product structure (for the final product)
  • Create individual component product structures
  • Create purchase order datasets for the main site and the subsidiary/partner sites

Product structures (main and third parties)

We will generate a fake product structure: a huge graph starting from a root node, each node having a random number of children. To generate the plant A and plant B components, we arbitrarily decide that, starting from a specific level, a subtree becomes a separate component owned by a different plant (with a random probability).
Below is the code to generate these datasets. At the end, we have the main product structure associated with the main plant (PLANT0) and some sub product structures associated with plant A and plant B (each with its own file naming convention).
To make it more realistic, we generate many product structures: 1000 different product structures, each with 30K to 200K nodes.


const fs = require('fs');

let rows = [];

// Recursively generate the children of a node in the product structure.
// Each generated row is: [plant, product, parent, child, partRef, qty]
function genChildren(plant, product, parent, level) {

  // Random number of children, between 1 and level + 3, capped at 9
  let nbChildren = Math.floor(Math.random() * (level + 3)) + 1;
  if (nbChildren >= 10) nbChildren = 9;

  // Randomly stop expanding beyond level 2 (40% chance per node), never go beyond level 10
  if (level > 2 && Math.floor(Math.random() * 10) >= 6) return;
  if (level > 10) return;

  for (let i = 1; i <= nbChildren; i++) {

    // Child identifier built from the parent identifier, the level and the child index
    let child = parent + Math.pow(10, level) * i;
    let partRef = "P" + Math.floor(Math.random() * 10000);
    let qty = Math.floor(Math.random() * 5 + 1);

    // At level 4, hand roughly 20% of the subtrees over to PLANT1 or PLANT2
    if (level == 4) {
      if (Math.floor(Math.random() * 10) > 7) {
        plant = "PLANT" + Math.floor(Math.random() * 2 + 1);
      }
    }

    rows.push([plant, product, parent, child, partRef, qty]);
    genChildren(plant, product, child, level + 1);
  }
}

// Generate the products in 10 batches of 10, then write each batch
// to one CSV file per plant, sorted by parent identifier
for (let i = 0; i < 10; i++) {
  for (let j = 0; j < 10; j++) {
    genChildren("PLANT0", "prod" + (j + i * 10), 1, 1);
    console.log(rows.length);
  }
  fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/boms/plant0-bom' + i + ".csv",
    rows.filter(item => item.includes("PLANT0")).sort((a, b) => a[2] - b[2]).join("\n"));
  fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/boms/plant1-bom' + i + ".csv",
    rows.filter(item => item.includes("PLANT1")).sort((a, b) => a[2] - b[2]).join("\n"));
  fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/boms/plant2-bom' + i + ".csv",
    rows.filter(item => item.includes("PLANT2")).sort((a, b) => a[2] - b[2]).join("\n"));
  rows = [];
}
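
Each generated row has the form plant,product,parent,child,partRef,qty, so a line of plant0-bom0.csv would look like the following (the values here are purely illustrative since everything is random):

PLANT0,prod0,111,1111,P4821,3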


Generating fake POs

Since we generated a well-known range of part references in the product structure, we can simply generate, for each plant, a full dataset of fake POs per part reference.


const fs = require('fs');

let plant1 = [];
let plant2 = [];
let plant3 = [];

// Return a random date between start and end, formatted as YYYY-MM-DD
function randomDate(start, end) {
  const d = new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
  let month = '' + (d.getMonth() + 1);
  let day = '' + d.getDate();
  const year = d.getFullYear();

  if (month.length < 2) month = '0' + month;
  if (day.length < 2) day = '0' + day;

  return [year, month, day].join('-');
}

// For each part reference P0..P10000, generate 3 to 17 purchase orders per plant,
// dated between 2015-01-01 and 2020-10-01 (JavaScript months are 0-based),
// with prices varying around a shared base price
for (let i = 0; i < 10001; i++) {

  const nbPO = Math.floor(Math.random() * 15 + 3);
  for (let j = 0; j < nbPO; j++) {
    const basePrice = Math.floor(Math.random() * 1000 + 10);
    plant1.push(["P" + i, randomDate(new Date(2015, 0, 1), new Date(2020, 9, 1)),
      Math.floor(basePrice + basePrice / 4 * Math.random())]);
    plant2.push(["P" + i, randomDate(new Date(2015, 0, 1), new Date(2020, 9, 1)),
      Math.floor(basePrice + basePrice / 4 * Math.random())]);
    plant3.push(["P" + i, randomDate(new Date(2015, 0, 1), new Date(2020, 9, 1)),
      Math.floor(basePrice + basePrice / 4 * Math.random())]);
  }
}

fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/po/plant0-po.csv', plant1.join("\n"));
fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/po/plant1-po.csv', plant2.join("\n"));
fs.writeFileSync('/Users/USERNAME/Documents/dev/BOM/po/plant2-po.csv', plant3.join("\n"));


We are ready to go!

Different Snowflake accounts to simulate third parties

The objective is to simulate third parties sharing data, so to create the environment you will need several Snowflake accounts (in theory one per party).
In each account we will create the tables required to load its source data:

The BOM table (the links of the product structure):
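
A minimal DDL matching the columns of the generated BOM CSV files could look like this (the column names and types here are assumptions):

CREATE OR REPLACE TABLE PLANTX_BOM (
  PLANT    VARCHAR,   -- owning plant (PLANT0, PLANT1 or PLANT2)
  PRODUCT  VARCHAR,   -- product identifier (prod0, prod1, ...)
  PARENT   NUMBER,    -- parent node identifier
  CHILD    NUMBER,    -- child node identifier
  PARTREF  VARCHAR,   -- part reference
  QUANTITY NUMBER     -- quantity of the child within its parent
);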


The PO Table:
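
Similarly, a minimal DDL matching the generated PO CSV files could be (column names are assumptions, except PART and PRICE which are referenced by the final query):

CREATE OR REPLACE TABLE PLANTX_PO (
  PART       VARCHAR,  -- part reference
  ORDER_DATE DATE,     -- purchase order date
  PRICE      NUMBER    -- price paid for this purchase order
);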


Loading the data

Using the generated naming convention, the data can be loaded with these simple commands:


Load the files from the local machine to the staging area:

PUT file:///Users/USERNAME/Documents/dev/bom/po/plantX** @%PLANTX_PO

Copy data from staging to the table:

COPY INTO PLANTX_PO
  file_format = (format_name = csv);



PUT file:///Users/USERNAME/Documents/dev/bom/boms/plantX** @%PLANTX_BOM
COPY INTO PLANTX_BOM
  file_format = (format_name = csv);
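
These commands assume a named file format called csv already exists in each account; if it does not, a minimal definition could be:

CREATE OR REPLACE FILE FORMAT csv
  TYPE = 'CSV'
  FIELD_DELIMITER = ',';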

Sharing is as simple as a click

In order to create a share, you just have to go to the Shares section, click Outbound and then Create.


Then give it a name and select the tables you want to share:


and add the main Snowflake account which will consume the data:

That is it: the dataset provider is now sharing its data with a third party.
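
For those who prefer SQL to the UI, the equivalent on the provider side is a handful of statements (the share, database and account names below are illustrative):

CREATE SHARE plant1_share;
GRANT USAGE ON DATABASE manuf_plant1 TO SHARE plant1_share;
GRANT USAGE ON SCHEMA manuf_plant1.public TO SHARE plant1_share;
GRANT SELECT ON TABLE manuf_plant1.public.plant1_bom TO SHARE plant1_share;
ALTER SHARE plant1_share ADD ACCOUNTS = <main_account_locator>;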


Now if we go to the main account, we can see the two shares available in the Inbound list of the "Shares" section:



Just click Create to map the share to a database name and grant access to specific roles:
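
Equivalently, in SQL, the consumer maps each inbound share to a local database and grants access with two statements per share (the account locators, share names and target role below are illustrative):

CREATE DATABASE manuf_plant1 FROM SHARE <plant1_account_locator>.plant1_share;
GRANT IMPORTED PRIVILEGES ON DATABASE manuf_plant1 TO ROLE sysadmin;

CREATE DATABASE manuf_plant2 FROM SHARE <plant2_account_locator>.plant2_share;
GRANT IMPORTED PRIVILEGES ON DATABASE manuf_plant2 TO ROLE sysadmin;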



Now if you go to the Databases section, you can see the two new shares mapped to local databases:



And now we can build very interesting queries.

We can first create a view that is the union of all the product structures: the main one as well as the ones from the two plants:

CREATE OR REPLACE VIEW PLANT_BOMS AS
SELECT * FROM plant0_bom
UNION
SELECT * FROM "MANUF_PLANT1"."PUBLIC"."PLANT1_BOM"
UNION
SELECT * FROM "MANUF_PLANT2"."PUBLIC"."PLANT2_BOM";

And then we can in a single query:

Calculate the total quantity of each individual part in each individual product and, in front of each result, get the average price at which the production site ordered that part.

Snowflake provides recursive queries that allow us to process the product structure from the root of each product, calculate the quantities on the fly and finally join each individual result with the prices coming from the purchase orders. All this without moving a single piece of data from the source providers: the query consumes the data as it sits in the provider accounts.

with recursive allboms
-- Column names for the CTE
(product, path, parent, child, partRef, levelQty, qty)
as
(
  -- Anchor clause: the direct children of each product root
  select product, '/' || parent || '/' || child, parent, child, partRef, quantity, quantity
  from plant_boms
  where parent = '1'

  union all

  -- Recursive clause: walk down the tree, multiplying quantities along the path
  select plant_boms.product, path || '/' || plant_boms.child, plant_boms.parent,
         plant_boms.child, plant_boms.partRef, plant_boms.quantity, plant_boms.quantity * qty
  from plant_boms join allboms
    on plant_boms.parent = allboms.child and allboms.product = plant_boms.product
)

-- Total quantity per product and part, joined with the average purchase price
SELECT product, partRef, sumQty, AVG(po.price)
FROM (
  SELECT product, partRef, SUM(qty) AS sumQty
  FROM allboms
  GROUP BY product, partRef
) vol
JOIN PLANT0_PO po ON po.part = vol.partRef
GROUP BY product, partRef, sumQty
ORDER BY 1, 2;


Below is an overview of the generated paths:



and some metrics:

The dataset contains:

  • 500 generated products
  • 18 million links split between the main site and the two component-related sites

Performance on a Large virtual warehouse (Snowflake can go up to 4X-Large...):

  • Snowflake generates the 13.4 million unique paths for the 500 products in 10.22s
  • Snowflake calculates the 875K aggregates (average cost per product and part reference) in 7.18s


In short

This is really a game changer for data-intensive business processes.

First, there is no longer any need to move the data from the producer to the final consumer (no export, no ETL, no reload, no duplication, no transformation), all of which requires a lot of resources ($$) and time. Results are always based on a fresh dataset that is constantly available.

Second, the performance is astonishing! Not only is the source data immediately available, but the processing itself is extremely fast. This provides the opportunity to iterate rapidly on new concepts and deliver continuous improvements to the business.

What's next?

Now that the data is available, you can leverage all of Snowflake's capabilities to build an end-to-end solution combining IoT, machine learning, data exploration and visualisation to deliver business value.




 
