Snowpark is, loosely speaking, a Spark-like wrapper for Snowflake. It provides an intuitive API for querying and processing data, letting you build applications and data pipelines that run in Snowflake right from your Jupyter notebook, and it is fast and easy to use. Now that it is out of public preview, there is no setup needed beyond installing the Snowpark library. Many of its features are similar to Spark's, with a few differences. Some of them are:
The Snowpark API provides programming language constructs for building SQL statements.
Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database.
You can create your own user-defined functions (UDFs) in your code, and Snowpark can push your code to the server, where the code can operate on the data just like you do with PySpark.
Snowpark automatically pushes the custom code for UDFs to the Snowflake database. When you call the UDF in your client code, your custom code is executed on the server (where the data is). You don’t need to transfer the data to your client in order to execute the function on the data.
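As a quick sketch of those last two points, assuming the test_session created in the next section is your only (and therefore active) session, and a hypothetical table table1 with an integer column col1:
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import IntegerType
# registering the UDF pushes the lambda to Snowflake
add_one = udf(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])
# the function then runs server-side, next to the data
test_session.table("table1").select(add_one(col("col1"))).show()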
Setup:
You can install the package directly from pypi.org using pip install snowflake-snowpark-python
Or you can download the wheel file from https://pypi.org/project/snowflake-snowpark-python/#files and install it directly using pip install <your_path_to_wheel_file_here.whl>
Starting and working with Snowpark
# import snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
# connection parameters for the snowpark session
connection_parameters = {
    "account": account,
    "user": user,
    "password": pwd,
    "role": role,
    "warehouse": warehouse,
    "database": db,
    "schema": schema
}
# start session
test_session = Session.builder.configs(connection_parameters).create()
# return 1 row from snowflake table using sql
test_session.sql("SELECT * FROM Table1 LIMIT 1").collect()
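Note that sql() by itself only builds a dataframe and does not execute anything; as mentioned above, execution is lazy and happens when you call an action such as collect() or show(). A minimal sketch:
# sql() only builds a dataframe; nothing runs yet
lazy_df = test_session.sql("SELECT * FROM Table1")
# execution happens on the server when an action is called
lazy_df.show(5)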
# make a Snowpark dataframe from the table on Snowflake
# you can also have multiple sessions defined in parallel; with multiple sessions you need to specify the session when declaring UDFs (see the sketch after this code)
table1 = test_session.table("table1")
# select multiple columns from the snowpark dataframe
table1_df = table1.select("Col1", "Col2")
# creating a new column (Snowpark Python uses with_column rather than PySpark's withColumn)
df = table1.with_column("COL3", col("col1"))
# take a substring of one column and put it in a new column
df = df.with_column("SNO_SHORT", col("sno").substr(1, 4))
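Tying back to the note on multiple sessions above: when more than one session is open, register the UDF through the session it should run in. A minimal sketch, assuming test_session and a hypothetical add_one function:
from snowflake.snowpark.types import IntegerType
# registering through a specific session pins the UDF to that session
add_one = test_session.udf.register(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])
df = df.with_column("COL1_PLUS_ONE", add_one(col("col1")))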
…See you in part 2
More to come this week. Please subscribe, and let me know if you have questions.