From one house to several districts, geospatial illustration with Snowflake




We are going to illustrate how to use Snowflake to process structured data, semi-structured data (json) and  geospatial data (geojson) to produce some analytics. The exemple is quite simple and only aims at illustrating some of the Snowflake capabilities related to geospatial and semi-structured data. 

For this illustration we will use the cadastre, an official record of the owners of land used for calculating the amount of tax owed  and property sales datasets. It's a breakdown of the territory up to the individual parcels.

Loading the cadastre dataset

The french cadastre is available here: https://www.data.gouv.fr/fr/datasets/cadastre/

The cadastre source is a huge geojson file with a root "FeatureCollection". In order to optimise the processing, I first split the geojson into individual feature with a simple loop in JavaScript and node.js:

fs = require('fs')
fs.readFile('/Users/USERNAME/Documents/dev/immo/cadastre-35-parcelles.json', 'utf8',   function (err,data) {
if (err) {
return console.log(err);
}
data = JSON.parse(data)
let nb=0
let cache = []
for (let o of data.features){
cache.push(o)
if(cache.length>=10000){
fs.writeFileSync('/Users/USERNAME/Documents/dev/immo/cadastre/' + nb++ + '.json',                         JSON.stringify(cache));
cache=[]
}
}
});


This is what I'll load into Snowflake.

I can now create a table that will store these json files. For this we can use VARIANT as the column type. Snowflake will store the data "as-is" in json format and we'll be able to query it in SQL from there.


From snowSQL, you can transfer local files to the Snowflake Stage. In this case we will use the Stage of the table itself (@% references table stage):


PUT file:///Users/USERNAME/Documents/dev/immo/cadastre/* @%CADASTRE;

In my case 128 files (splits) are loaded from my local laptop in 50s for a total of 1.280 millions records (and about 1GB of data) on a small warehouse.

We can then copy the data into the target table using the following command. the command will consider individual items inside the main json array as individual records thanks to the STRIP_OUTER_ARRAY parameter. 


copy into CADASTRE
file_format = (type = 'JSON'
STRIP_OUTER_ARRAY = TRUE )


I can now browse the data from the CADASTRE table using Snowflake UI for exemple. We see the FEATURE column contains the json as-is. 


We could query the data directly from the FEATURE column but in order to make it more readable, we will create a view on top of this table with the extraction of the interesting properties (the keys used to match the other table).

So we create a view, the SQL below accesses the json properties and map them to dedicated columns. The geojson of the cadastre entry is cast (as-is) to a geometry type in order to apply geospatial functions on it.

create or replace view cadastre_v as
SELECT TRY_TO_GEOGRAPHY(FEATURE:geometry) as geo,
FEATURE:properties:commune as commune,
FEATURE:properties:contenance as contenance,
FEATURE:properties:created as created,
FEATURE:properties:id as id,
FEATURE:properties:numero as numero,
FEATURE:properties:prefixe as prefixe,
FEATURE:properties:section as section,
FEATURE:properties:updated as updated
FROM CADASTRE;

 The new view looks like this:



Enriching with property sales dataset 

Now we will load the property sales dataset, 2.5 millions records with price, surface and cadastre keys to join it with previously loaded dataset.

The dataset I used is available here :

https://www.data.gouv.fr/en/datasets/demandes-de-valeurs-foncieres/#:~:text=Conform%C3%A9ment%20au%20d%C3%A9cret%20n%C2%B0,d%C3%A9sormais%20disponible%20en%20open%20data.

The dataset is is not that clean with coma as digit separator. 

First we will load the data as-is in a table and then transform them a bit.


create or replace TABLE VALEURS_FONCIERES (
CodeServiceCH string,
ReferenceDocument string,
ArticlesCGI1 string,
ArticlesCGI2 string,
ArticlesCGI3 string,
ArticlesCGI4 string,
ArticlesCGI5 string,
NoDisposition number,
DateMutation STRING,
NatureMutation string,
ValeurFonciere string,
NoVoie string,
B_T_Q string,
TypeDeVoie string,
CodeVoie string,
Voie string,
CodePostal string,
Commune string,
CodeDepartement string,
CodeCommune integer,
PrefixeDeSection string,
Section string,
NoPlan integer,
NoVolume string,
Lot1 string,
SurfaceCarrezDu1erLot string,
Lot2 string,
SurfaceCarrezDu2emeLot string,
Lot3 string,
SurfaceCarrezDu3emeLot string,
Lot4 string,
SurfaceCarrezDu4emeLot string,
Lot5 string,
SurfaceCarrezDu5emeLot string,
NombreDeLots integer,
CodeTypeLocal integer,
TypeLocal string,
IdentifiantLocal string,
SurfaceReelleBati integer,
NombrePiecesPrincipales integer,
NatureCulture string,
NatureCultureSpeciale string,
SurfaceTerrain string
);


we will also create a table for the prepared data with only the fields we want to keep:


create or replace TABLE VALEURS_FONCIERES_PROD (
DateMutation date,
NatureMutation string,
ValeurFonciere double,
NoVoie string,
B_T_Q string,
TypeDeVoie string,
CodeVoie string,
Voie string,
CodePostal string,
Commune string,
CodeDepartement string,
CodeCommune string,
PrefixeDeSection string,
Section string,
NoPlan integer,
NoVolume string,
CodeTypeLocal integer,
TypeLocal string,
SurfaceReelleBati integer,
NombrePiecesPrincipales integer,
SurfaceTerrain string
);

and here is the transformation where we keep only the houses and the flats.


INSERT INTO VALEURS_FONCIERES_PROD
SELECT
TO_DATE(DateMutation, 'DD/MM/YYYY'),
NatureMutation ,
TO_DOUBLE(replace(ValeurFonciere,',','.')) ,
NoVoie ,
B_T_Q ,
TypeDeVoie ,
CodeVoie ,
Voie ,
CodePostal ,
Commune ,
CodeDepartement ,
CodeCommune ,
PrefixeDeSection ,
Section ,
NoPlan ,
NoVolume ,
CodeTypeLocal ,
TypeLocal ,
SurfaceReelleBati ,
NombrePiecesPrincipales ,
SurfaceTerrain
FROM VALEURS_FONCIERES
WHERE TypeLocal='Maison' OR TypeLocal='Appartement';

We can now start joining the 2: we will calculate the price per m2 and associate the geospatial area to each property sale.


SELECT ID, TypeLocal,NATUREMUTATION, VALEURFONCIERE / SURFACEREELLEBATI as prixm2, c.GEO FROM VALEURS_FONCIERES_PROD vf
JOIN CADASTRE_V c ON c.section=vf.section and TO_NUMERIC(c.numero)=vf.NoPlan

We can already render something cool. Below I'm using kepler.gl project to render the data. We can visualise in 3D each individual property and the height and color represents the price.



Adding some geospatial functions

Now we would like to calculate the average price per m2 for specific districts. For this we will load one additional dataset :

Here are from open data the districts of the city of Rennes: https://data.rennesmetropole.fr/explore/dataset/perimetres-des-12-quartiers-de-la-ville-de-rennes/export/

Let's do the same again, upload the file, load the table and create a view to simply the queries:

from SnowSQL:


PUT file:///Users/USERNAME/Documents/dev/immo/districts/* @%QUARTIER;

And then the SQL loading and view creation:


copy into QUARTIER
file_format = (type = 'JSON'
STRIP_OUTER_ARRAY = TRUE );
create or replace view quartier_v as
SELECT RECORD:fields:nom as nom,
TRY_TO_GEOGRAPHY(RECORD:fields:geo_shape) as geo
FROM DISTRICT;

Your view will look like this:


Now we can join all 3 sources in one query in order to 
  • Calculate average price of property
  • Associate district name by matching properties area included district area
  • Aggregate average price at district level
  • Associate the geospatial area to the results 


SELECT quartierNom,avgPrice, qv.geo as quartierGeo FROM (
    SELECT quartierNom, avg(prixm2) as avgPrice from (
        SELECT ID, TypeLocal,NATUREMUTATION, VALEURFONCIERE / SURFACEREELLEBATI as prixm2,
            c.GEO as cadastreGeo,
            q.nom as quartierNom
        FROM VALEURS_FONCIERES_PROD vf
    JOIN CADASTRE_V c ON c.section=vf.section and TO_NUMERIC(c.numero)=vf.NoPlan
    JOIN QUARTIER_V q ON ST_WITHIN(c.geo,q.geo)
    GROUP BY quartierNom)
JOIN QUARTIER_V qv ON qv.nom = quartierNom


And now we can visualise the final results:



We are done !

This is it, we illustrated in this simple exemple how in Snowflake we can load as-is different sources, structured and semi-structured (json), store and query geospatial data and visualise the results with a nice geospatial visualisation client.




Popular posts from this blog

Domain centric architecture : Data driven business process powered by Snowflake Data Sharing

Snowflake Data sharing is a game changer : Be ready to connect the dots (with a click)

Process XML, JSON and other sources with XQuery at scale in the Snowflake Data Cloud