Process XML, JSON and other sources with XQuery at scale in the Snowflake Data Cloud

In the past few years, I worked with XML and Json technologies in Database. There are still tons of applications especially in the Manufacturing or Finance where XML is leveraged as a data exchange standard.

In this new short blog post, the objective is to create a function in Snowflake able to take data from XML, JSON (coming from a VARIANT type) or even from a standard table, view or Snowflake Share and apply an arbitrary XQuery to generate an output in XML (or any other output format thanks to XQuery 3).

This function is of course running in the Snowflake compute layer and take advantage of all Snowflake capabilities.



The dependencies

First we have to load the JAVA dependencies into the Snowflake stage. For this illustration, we again use the user stage.

put file:///Users/xxxxx/Downloads/saxon-he-test-10.5.jar @~/javalib/ 

    auto_compress = false

    overwrite=true;

                                          

put file:///Users/xxxxx/Downloads/saxon-xqj-10.5.jar  @~/javalib/
    auto_compress = false

    overwrite=true;

  

put file:///Users/xxxxx/Downloads/saxon-he-10.5.jar  @~/javalib/
    auto_compress 
= false 

    overwrite=true;


The JAVA UDF

Let's do a simple function that takes 3 parameters:
  • xq: the XQuery to be executed
  • v1 and v2, sources that can be XML or JSON file. 
For now, the function will just check if they XML or JSON and add the correct XQuery declaration to make them available for the script (as $v1 and $v2 documents).

create or replace function execXQuery(xq varchar, v1 varchar, v2 varchar)
returns string
language java
imports = (
'@~/javalib/saxon-he-test-10.5.jar',
'@~/javalib/saxon-xqj-10.5.jar',
'@~/javalib/saxon-he-10.5.jar')
handler = 'XQuery.exec'
as
$$


import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XdmAtomicValue;

import java.io.StringWriter;
import java.io.PrintWriter;

public class XQuery {
public static String exec(String xq, String v1, String v2) {
String result = "";
String decVars = "";
String setVars = "";
boolean isV1Value = false;
boolean isV2Value = false;

v1 = v1.replaceAll("^\\s+", "");
v2 = v2.replaceAll("^\\s+", "");

if (v1.startsWith("<")) {
decVars += "declare variable $vv1 as xs:string external; \n";
setVars += "let $v1 := fn:parse-xml($vv1)";
isV1Value = true;
} else if (v1.startsWith("{") || v1.startsWith("[")) {
decVars += "declare variable $vv1 as xs:string external; \n";
setVars += "let $v1 := fn:json-to-xml($vv1)\n";
isV1Value = true;
}
if (v2.startsWith("<")) {
decVars += "declare variable $vv2 as xs:string external; \n";
setVars += "let $v2 := fn:parse-xml($vv2)";
isV2Value = true;
} else if (v2.startsWith("{") || v2.startsWith("[")) {
decVars += "declare variable $vv2 as xs:string external; \n";
setVars += "let $v2 := fn:json-to-xml($vv2)\n";
isV2Value = true;
}

xq = decVars + setVars + "\nreturn\n" + xq;

try {
Processor processor = new Processor(true);
String xquery = xq;
final XQueryEvaluator ev = processor.newXQueryCompiler().compile(xquery).load();

if (isV1Value)
ev.setExternalVariable(new net.sf.saxon.s9api.QName("vv1"),                         new XdmAtomicValue(v1));
if (isV2Value)
ev.setExternalVariable(new net.sf.saxon.s9api.QName("vv2"),                         new XdmAtomicValue(v2));

return ev.evaluate().toString();
} catch (SaxonApiException nsme) {
StringWriter errors = new StringWriter();
nsme.printStackTrace(new PrintWriter(errors));
return errors.toString();
}
}
}
$$;

We are ready to run XQuery in snowflake

Executing XQuery on XML

Let's first generate some fake XML:

INSERT INTO "DEMO"."STAGING"."XML_DEMO"
SELECT to_variant(parse_xml('<courses>     
   <course category="JAVA">  
      <title lang="en">Learn Java in 3 Months.</title>  
      <trainer>Sonoo Jaiswal</trainer>  
      <year>2008</year>  
      <fees>10000.00</fees>  
   </course>    
    <course category="Dot Net">  
      <title lang="en">Learn Dot Net in 3 Months.</title>  
      <trainer>Vicky Kaushal</trainer>  
      <year>2008</year>  
      <fees>10000.00</fees>  
   </course>  
    <course category="C">  
      <title lang="en">Learn C in 2 Months.</title>  
      <trainer>Ramesh Kumar</trainer>  
      <year>2014</year>  
      <fees>3000.00</fees>  
   </course>  
    <course category="XML">  
      <title lang="en">Learn XML in 2 Months.</title>  
      <trainer>Ajeet Kumar</trainer>  
      <year>2015</year>  
      <fees>4000.00</fees>  
   </course>    
</courses>')) from table(generator(rowcount => 200000));

For the illustration, We created a table called XQuery which stores the code:

INSERT INTO "DEMO"."STAGING"."XQUERY" VALUES ('template1','
<titles>
{for $x in  $v1/courses/course
return
<book>
    <title>{$x/title/string()}</title>
    <year>{$x/year/string()}</year>
</book>
}
</titles> 
');

And now we can run the XQuery on our sample dataset (200K XML)
SELECT execXQuery(
                (SELECT QUERY FROM "DEMO"."STAGING"."XQUERY" WHERE NAME='template1'),
                TO_XML(V),'')  
FROM "DEMO"."STAGING"."XML_DEMO";


And here we are the results in about 1min:



Executing XQuery on data coming from a table (actually a Share).

Here we will generate XML from data coming from a table. We will use a sample dataset coming from "SAMPLE_DATA"."TPCH_SF10"."CUSTOMER" which has 1 500 000 records.

Again we insert the xquery code in our XQUERY table:

INSERT INTO "DEMO"."STAGING"."XQUERY" VALUES ('template2','
<result>                                   
   <segment>{$v1//*:string[@key="C_MKTSEGMENT"]/string()}</segment>
   <reference>{$v1//*:string[@key="C_NAME"]/string()}</reference>                                     
</result>


Let's run the query:
We convert on the fly the source data into a json to be passed to the xquery. In the Java function if the source is in Json we call the new fn:json-to-xml to then be able to process it with XQuery.

SELECT execXQuery(
    (SELECT QUERY FROM "DEMO"."STAGING"."XQUERY" WHERE NAME='template2'),
    to_json(OBJECT_CONSTRUCT('C_MKTSEGMENT',C_MKTSEGMENT,'C_NAME',C_NAME)) ,''))  
FROM "SAMPLE_DATA"."TPCH_SF10"."CUSTOMER";



This simple example runs in 44s on a Large virtual warehouse, so more than 34K XML per second.

In short, lot of new opportunities to process input and output XML and JSON at scale. With a bonus, as the function takes multiple inputs you can combine data coming from different sources (and different format).








Popular posts from this blog

Snowflake Data sharing is a game changer : Be ready to connect the dots (with a click)

Domain centric architecture : Data driven business process powered by Snowflake Data Sharing