Snowpark is, loosely speaking, a Spark-like wrapper for Snowflake. It provides an intuitive API for querying and processing data, letting you build applications and data pipelines that run in Snowflake right from your Jupyter notebook, and it is fast and easy to use. Now that it is out of public preview, there is no setup needed beyond installing the Snowpark library. Many of its features are similar to Spark's, with a few differences. Some of them are:
The Snowpark API provides programming language constructs for building SQL statements.
Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database.
You can create your own user-defined functions (UDFs) in your code, and Snowpark can push your code to the server, where the code can operate on the data just like you do with PySpark.
Snowpark automatically pushes the custom code for UDFs to the Snowflake database. When you call the UDF in your client code, your custom code is executed on the server (where the data is). You don’t need to transfer the data to your client in order to execute the function on the data.
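As a quick sketch of those last two points, assuming the test_session created in the next section is your only (and therefore active) session, and a hypothetical table table1 with an integer column col1:
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import IntegerType
# registering the UDF pushes the lambda to Snowflake
add_one = udf(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])
# the function then runs server-side, next to the data
test_session.table("table1").select(add_one(col("col1"))).show()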
Setup:
You can install the package directly from pypi.org using pip install snowflake-snowpark-python
Or you can download the wheel file from https://pypi.org/project/snowflake-snowpark-python/#files and install it directly using pip install <your_path_to_wheel_file_here.whl>
Starting and working with Snowpark
# import snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
# connection parameters for the snowpark session
connection_parameters = {
    "account": account,
    "user": user,
    "password": pwd,
    "role": role,
    "warehouse": warehouse,
    "database": db,
    "schema": schema
}
# start session
test_session = Session.builder.configs(connection_parameters).create()
# return 1 row from snowflake table using sql
test_session.sql("SELECT * FROM Table1 LIMIT 1").collect()
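Note that sql() by itself only builds a dataframe and does not execute anything; as mentioned above, execution is lazy and happens when you call an action such as collect() or show(). A minimal sketch:
# sql() only builds a dataframe; nothing runs yet
lazy_df = test_session.sql("SELECT * FROM Table1")
# execution happens on the server when an action is called
lazy_df.show(5)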
# make a Snowpark dataframe from the table on Snowflake
# you can also have multiple sessions defined in parallel; with multiple sessions you need to specify the session when declaring UDFs (see the sketch after this code)
table1 = test_session.table("table1")
# select multiple columns from the snowpark dataframe
table1_df = table1.select("Col1", "Col2")
# creating a new column (Snowpark Python uses with_column rather than PySpark's withColumn)
df = table1.with_column("COL3", col("col1"))
# take a substring of one column and put it in a new column
df = df.with_column("SNO_SHORT", col("sno").substr(1, 4))
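Tying back to the note on multiple sessions above: when more than one session is open, register the UDF through the session it should run in. A minimal sketch, assuming test_session and a hypothetical add_one function:
from snowflake.snowpark.types import IntegerType
# registering through a specific session pins the UDF to that session
add_one = test_session.udf.register(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])
df = df.with_column("COL1_PLUS_ONE", add_one(col("col1")))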
…See you in part 2
More to come this week. Please subscribe, and let me know if you have questions.