
Build a FastAPI on the Databricks Lakehouse | Steve Russo


Create a FastAPI on the Databricks Lakehouse with full CRUD capabilities

FastAPI integrated with the Databricks Lakehouse

The team at Databricks has been pushing a new data management architecture called the ‘Lakehouse.’ At its core, a Data Lakehouse enables efficient and secure access to data stored in a Data Lake. This architecture promises to blend all the great qualities of a data lake (object storage, unstructured data, etc.) with those of a traditional data warehouse (ACID transactions, performance, etc.). Hence, Data Lake + house.

Developers in the Lakehouse can leverage various standard tools for data analytics, engineering, and ML workloads. So, can an application developer also leverage Lakehouse to build a scalable data application? Short answer: Yes!

In this article, I’ll show you how to create a simple FastAPI server with a backend database on a Databricks Lakehouse that supports full create, read, update, and delete (CRUD) operations with SQLAlchemy.

All the code in this article is available in the repo here.

Background

For the example in this article, I created a table named users in a Databricks workspace with 1,000,000 records of fake user data. I leveraged the Databricks Labs library dbldatagen to generate all the data. The users table looks like this:

Output of select * from the fastapi.users table in Databricks SQL
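
The generation notebook isn’t included in this article, but a minimal dbldatagen sketch along these lines can produce similar fake records (the value lists below are placeholders, not the actual generation spec):

# Runs in a Databricks notebook, where a SparkSession named spark already exists
import dbldatagen as dg

# Sketch only: builds ~1M fake user rows and saves them as the fastapi.users table.
# The value lists are placeholders; the real spec used richer fake data.
spec = (
    dg.DataGenerator(spark, name="users", rows=1_000_000, partitions=8)
    .withIdOutput()  # expose the generated row number as an "id" column
    .withColumn("first_name", "string", values=["Sierra", "Lake", "Ada"], random=True)
    .withColumn("last_name", "string", values=["Keller", "House", "Lovelace"], random=True)
    .withColumn("email", "string", values=["sbartlett@sanchez-zimmerman.com", "user@example.com"], random=True)
    .withColumn("address", "string", values=["29032 Montes Village", "1 Main St"], random=True)
    .withColumn("ip_address", "string", values=["172.26.121.155", "10.1.2.3"], random=True)
)
spec.build().write.mode("overwrite").saveAsTable("fastapi.users")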

The final API will be able to do the following actions on the users table:

  • Get a user by a specific ID (GET)
  • Create a new user (POST)
  • Update an existing user (PUT)
  • Delete a user (DELETE)

Setup

For the API to work correctly, you need the following:

  1. Access to a Databricks Workspace
  2. Access to use or create a cluster
  3. Permission to generate a Personal Access Token (PAT) for Databricks

Follow this Databricks guide to collect the following cluster connection details:

  • Server Hostname
  • HTTP Path
  • Personal Access Token (PAT)

Add those values and a target database name into a .env file, or create them as environment variables. Once you complete all of the above, you can start to make your connection to Databricks Lakehouse from the API.
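
For reference, a .env file along these lines would work (the variable names are my own convention, and the values are placeholders; copy the real details from your cluster’s JDBC/ODBC tab):

# Hypothetical variable names with placeholder values
DATABRICKS_SERVER_HOSTNAME=dbc-a1b2c3d4-e5f6.cloud.databricks.com
DATABRICKS_HTTP_PATH=sql/protocolv1/o/0/0123-456789-abcdefgh
DATABRICKS_ACCESS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXX
DATABRICKS_DATABASE=fastapi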

Apart from installing the standard FastAPI libraries, you’ll also need to install the Python library sqlalchemy-databricks. This library is a thin wrapper on the databricks-sql-connector and the PyHive SQLAlchemy libraries.

The code below establishes the connection to a Databricks cluster with the ability to use SQLAlchemy commands on defined data models.
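
The full version is in the repo; a minimal sketch of that connection code, assuming the environment variable names from the .env example above and a declarative User model that mirrors the table, could look like this:

import os

from sqlalchemy import BigInteger, Column, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# Assumed environment variable names; adjust to match your .env file
host = os.getenv("DATABRICKS_SERVER_HOSTNAME")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
token = os.getenv("DATABRICKS_ACCESS_TOKEN")
database = os.getenv("DATABRICKS_DATABASE", "fastapi")

# sqlalchemy-databricks registers the "databricks+connector" dialect
engine = create_engine(
    f"databricks+connector://token:{token}@{host}:443/{database}",
    connect_args={"http_path": http_path},
)
SessionLocal = sessionmaker(bind=engine)
Base = declarative_base()

class User(Base):
    __tablename__ = "users"

    id = Column(BigInteger, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    email = Column(String)
    address = Column(String)
    ip_address = Column(String)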

With the connection established, you can start to define all the HTTP methods you want to perform using your API.

API Routes

All of the API’s routes live in the main.py file.
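
The full main.py is in the repo; a simplified sketch of those routes, reusing SessionLocal and the User model from the connection code above (the app.database module path and helper names are assumptions), could look like this:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sqlalchemy import func

from app.database import SessionLocal, User  # hypothetical module holding the connection code

app = FastAPI()

class UserBody(BaseModel):
    first_name: str
    last_name: str

def to_dict(user: User) -> dict:
    return {
        "id": user.id,
        "first_name": user.first_name,
        "last_name": user.last_name,
        "email": user.email,
        "address": user.address,
        "ip_address": user.ip_address,
    }

@app.get("/")
def root():
    return {"message": "Hello Lakehouse"}

@app.get("/user")
def read_user(id: int):
    # Get a user by a specific ID
    with SessionLocal() as session:
        users = session.query(User).filter(User.id == id).all()
        if not users:
            raise HTTPException(status_code=404, detail="user not found")
        return [to_dict(u) for u in users]

@app.post("/user")
def create_user(body: UserBody):
    # Create a new user; max(id) + 1 is for illustration only and is not concurrency-safe
    with SessionLocal() as session:
        new_id = (session.query(func.max(User.id)).scalar() or 0) + 1
        session.add(User(id=new_id, first_name=body.first_name, last_name=body.last_name))
        session.commit()
        return {"user_id": new_id}

@app.put("/user/{id}")
def update_user(id: int, body: UserBody):
    # Update an existing user
    with SessionLocal() as session:
        session.query(User).filter(User.id == id).update(
            {"first_name": body.first_name, "last_name": body.last_name},
            synchronize_session=False,
        )
        session.commit()
        return {"user_id": id}

@app.delete("/user/{id}")
def delete_user(id: int):
    # Delete a user
    with SessionLocal() as session:
        session.query(User).filter(User.id == id).delete(synchronize_session=False)
        session.commit()
        return {"user_id": id, "msg": "record deleted"}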

With the API code in place, run the following command to start the FastAPI server.

uvicorn app.main:app --reload

You can send a request to the root endpoint to confirm the server is running locally.

 curl -X GET 'http://localhost:8000/'

This should return the JSON message {"message": "Hello Lakehouse"}. The /user routes provide full CRUD capabilities on the data in the Lakehouse.

A GET request will return a JSON message with details for the user with id=1.

--- Request
curl -X GET 'http://localhost:8000/user?id=1'

--- Response: 200
[
  {
    "last_name": "Keller",
    "email": "sbartlett@sanchez-zimmerman.com",
    "first_name": "Sierra",
    "address": "29032 Montes Village\nLake Josechester, NJ 42870",
    "id": 1,
    "ip_address": "172.26.121.155"
  }
]

Create a user by sending a POST request. The response will be the id of the created user and a 200 status code.

--- Request
curl -X POST 'http://localhost:8000/user' \
--header 'Content-Type: application/json' \
--data-raw '{
  "first_name": "Lake",
  "last_name": "House"
}'

--- Response: 200
{
  "user_id": 3363124
}

Update the newly created user by sending a PUT request to /user/3363124.

--- Request
curl -X PUT 'http://localhost:8000/user/3363124' \
--header 'Content-Type: application/json' \
--data-raw '{
  "first_name": "Databricks",
  "last_name": "LakeHouse"
}'

--- Response: 200
{
  "user_id": 3363124
}

Lastly, delete the user by sending the DELETE request to /user/3363124.

--- Request
curl -X DELETE 'http://localhost:8000/user/3363124'

--- Response: 200
{
  "user_id": 3363124,
  "msg": "record deleted"
}

And voila! You have now read data from an existing user, created a new user, updated that same user, and deleted that user from the Lakehouse.

While this method is possible, I do want to throw out a few things to keep in mind if you choose to expand on this example.

  1. If your target cluster is off when the API makes a request, it can take 3–10 minutes for the cluster to spin up before you get a response
  2. Be aware of the data size and the cost to query the data
  3. Be mindful of cluster costs

You can also access the data with the already mentioned databricks-sql-connector or the new Delta Sharing capability.
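
For comparison, a minimal databricks-sql-connector sketch, reusing the same assumed environment variables, looks like this:

import os

from databricks import sql

# Same assumed environment variable names as the earlier connection sketch
with sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM fastapi.users LIMIT 5")
        for row in cursor.fetchall():
            print(row)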

With this setup, you can quickly create a full CRUD API that leverages data from the Lakehouse architecture on Databricks. This baseline example can open up more API possibilities with integrations into ML models deployed in Databricks, real-time datasets, and big data applications.

Again, all the code in this article is available in the repo here.


