Let's create dataset for demonstrations in few lines of SQL and Snowflake Java UDF

In the presales activities, a recurring topic is how I could get a dataset to illustrate a scenario. With Snowflake, a first answer is of course "let's see what I can find on the Marketplace". However it's also useful to have fake customer records, fake transaction records and other datasets that can be used to illustrate a retail, insurance and other industry scenario.

For that you can leverage fake data generator, generate datasets and then import them into Snowflake.

With Snowflake Java UDF, it's now possible to have a generator embedded into Snowflake to generate any volume of data at any time with few lines of SQL.

Here are the 3 simple steps to get your brand new fake datasets ready to go:

Loading JAR into Snowflake stage

First we will load the faker library into Snowflake. For today, we don't want to build a new library based on Faker but just load the library as is and leverage it into Snowflake. You can get the JAR from MAVEN.

So we load Faker (the generator selected today) and 2 of its dependencies into a Snowflake Stage. You'll need snowSQL configured and running and then just type the following command: (For this illustration we load in user stage ~ into javalib folder)

put file:///Users/xxxxx/Downloads/javafaker-1.0.2.jar @~/javalib/ 

    auto_compress = false

    overwrite=true;

                                          

put file:///Users/xxxxx/Downloads/snakeyaml-1.29.jar  @~/javalib/
    auto_compress = false

    overwrite=true;

  

put file:///Users/xxxxx/Downloads/commons-lang3-3.12.0.jar  @~/javalib/
    auto_compress
= false 

    overwrite=true;



Create JAVA UDF in Snowflake

Now that we have the JAR loaded we will declare a JAVA UDF that will expose Faker generators for the end user. As we don't want to create 10s of functions to cover all the Faker functions, we will use introspection to call the methods based on user's parameters and so expose a single method (fakeValue) that takes 2 parameters: the domain and the property we want to get.


Snowflake supports inline Java function creation and this is what we declare here. In a worksheet you can just run the following command to create the JAVA UDF:



create or replace function fakeValue(domain varchar,prop varchar)
returns string
language java
imports = ( '@~/javalib/commons-lang3-3.12.0.jar',                 '@~/javalib/snakeyaml-1.29.jar',                 '@~/javalib/javafaker-1.0.2.jar')
handler = 'myFaker.fakeValue'
as
$$
import com.github.javafaker.*;
class myFaker {
static Faker faker= new Faker();
static Address address=faker.address();
static Name name=faker.name();
static Company company=faker.company();
static Internet internet=faker.internet();
static Job job=faker.job();
static Finance finance=faker.finance();
static PhoneNumber phoneNumber=faker.phoneNumber();
static Commerce commerce=faker.commerce();


private static String invokeFunction(Object o, String prop){
java.lang.reflect.Method method =null;
String result ="";
try {method = o.getClass().getMethod(prop); }
catch(NoSuchMethodException nsme) { nsme.printStackTrace(); }
catch(SecurityException nsme) { nsme.printStackTrace(); }

if(method!=null)
try { result = (String) method.invoke(o);}
catch(IllegalAccessException nsme) { nsme.printStackTrace(); }
catch(IllegalArgumentException nsme) { nsme.printStackTrace(); }
catch(java.lang.reflect.InvocationTargetException nsme)                         { nsme.printStackTrace(); }
return result;
}

public static String fakeValue(String domain, String prop){
java.lang.reflect.Method method;
String result="";

switch(domain) {
case "address":
result = invokeFunction(address,prop);
break;
case "name":
result = invokeFunction(name,prop);
break;
case "company":
result = invokeFunction(company,prop);
break;
case "internet":
result = invokeFunction(internet,prop);
break;
case "job":
result = invokeFunction(job,prop);
break;
case "finance":
result = invokeFunction(finance,prop);
break;
case "phoneNumber":
result = invokeFunction(phoneNumber,prop);
break;
case "commerce":
result = invokeFunction(commerce,prop);
break;
default:
// code block
}
return result;
}
}
$$;




Let's generate data

We are now ready to create dataset in snowflake with few lines of SQL

For exemple we can generate fake users profiles:

    SELECT 
        fakeValue('name','firstName') as firstname,
        fakeValue('name','lastName') as lastname,
        fakeValue('name','title') as title,
        fakeValue('address','streetName') as streetName,
        fakeValue('address','streetAddress') as streetAddress,
        fakeValue('address','zipCode') as streetAddress,
        fakeValue('address','city') as streetAddress
    from table(generator(rowcount => 100000));




Let's create fake transactions for specific products: (Of course we could do it smarter to create a product catalog first and then transactions on to of it):

    SELECT 
     fakeValue('commerce','department') as department,
     fakeValue('commerce','productName') as productName,
     fakeValue('company','name') as productName,
     fakeValue('commerce','price') as price,
     fakeValue('commerce','promotionCode') as promotionCode
      from table(generator(rowcount => 1000000));



If you want to play with other generator you can check the fields which are available here :
https://java-faker.herokuapp.com/

Popular posts from this blog

Snowflake Data sharing is a game changer : Be ready to connect the dots (with a click)

Domain centric architecture : Data driven business process powered by Snowflake Data Sharing

Process XML, JSON and other sources with XQuery at scale in the Snowflake Data Cloud