Databricks SQL

Knowi facilitates data discovery, query, aggregation, visualization, and reporting automation from Databricks along with other unstructured and structured datasources.

Overview

  1. Connect, extract and transform data from your Databricks SQL database, using one of the following options:

    a. Through our UI to connect directly.

    b. Using our Cloud9Agent. This can securely pull data inside your network. See agent configuration for more details.

  2. Visualize and Automate your Reporting instantly.

UI Based Approach

Connecting

Step 1: Log in to Knowi and select Queries from the left sidebar.

Step 2: Click the New Datasource + button. A new page listing the available datasources will open.

Step 3: Select Databricks under Data Warehouses.

A new datasource page will open.

Step 4: Configure the following details to set up connectivity to your Databricks SQL database:

a. Datasource Name: Enter a name for your datasource
b. Host: Enter the Databricks SQL host to connect to. For example: https://dbc-123ab4c5-d67f.cloud.databricks.com
c. Warehouse ID: Enter the Databricks SQL warehouse ID. For example: 9c7235e6e6e49147
d. Auth Token: Enter the personal access token
e. Schema Name: Enter the schema name
f. Catalog: Enter the name of the catalog
g. Connection String: Optional. Additional connection properties or URL parameters, written as key=value pairs separated by &. For example, with timeouts given in seconds: readTimeout=1800&connectTimeout=30
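
The optional Connection String in (g) follows the usual URL query-string format. As a minimal sketch, assuming you want to assemble that value programmatically before pasting it into the field, the Python standard library produces the same format. The helper name build_connection_string is hypothetical and not part of Knowi.

```python
from urllib.parse import urlencode


def build_connection_string(properties):
    """Hypothetical helper: join properties as key=value pairs separated by '&'."""
    return urlencode(properties)


# Timeouts are in seconds, matching the example in step (g).
print(build_connection_string({"readTimeout": 1800, "connectTimeout": 30}))
```

Using urlencode also percent-encodes any special characters in the values, which avoids a malformed string when a property contains spaces or symbols.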

Step 5: Click the Test Connection button and a connection successful pop-up message will appear.

For more information, please refer to the documentation on Connectivity & Datasources.

Step 6: Click on Save and start Querying.

Query

Set up a query using the Visual Builder or the Query Editor.

Visual Builder

Step 1: After connecting to the Databricks SQL datasource, Knowi will pull out a list of tables along with field samples. Using these tables, you can automatically generate queries through our visual builder in a no-code environment by either dragging and dropping fields or making your selections through the drop-down.

Tip: You can also write queries directly in the Query Editor, a versatile text editor that offers more advanced editing functionality, such as Databricks SQL queries, support for multiple language modes, Cloud9QL, and more.

Step 2: Define the data execution strategy by choosing one of the following two options:

  • Direct Execution: Executes the query directly on the original datasource, without any storage in between. In this case, when a widget is displayed, it fetches the data in real time from the underlying datasource.

  • Non-Direct Execution: Stores the query results in Knowi's data store. Benefits include support for long-running queries, reduced load on your database, and more. Non-direct execution applies when you choose to run the query once or at scheduled intervals.

For more information, check out the documentation on Defining Data Execution Strategy.

Step 3: Click the Preview button to analyze the results of your query and fine-tune the output, if required. The result of your query is called a Dataset.

Step 4: After reviewing the results, name your dataset and hit the Save & Run button.

Query Editor

The Query Editor is a versatile text editor designed for editing code. It comes with a number of language modes, including Databricks SQL, and add-ons such as Cloud9QL and the AI Assistant, which provide transformation and analysis capabilities like prediction modeling and cohort analysis when you need them.

Use External Links to fetch data: Select this option to use External Links disposition when fetching data from Databricks SQL. This allows fetching larger data but uses cloud storage.
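
For context, the Databricks SQL Statement Execution API exposes a disposition setting whose EXTERNAL_LINKS value returns results as links to files in cloud storage instead of inline in the response. Whether Knowi calls this API internally is an assumption. The sketch below only builds a request body to illustrate what such an option toggles; nothing is sent, and the function name and sample values are hypothetical.

```python
import json


def build_statement_request(warehouse_id, statement, use_external_links):
    """Hypothetical helper: build a body for POST /api/2.0/sql/statements (not sent)."""
    body = {"warehouse_id": warehouse_id, "statement": statement}
    if use_external_links:
        # Results come back as links to result files held in cloud storage.
        body["disposition"] = "EXTERNAL_LINKS"
        body["format"] = "ARROW_STREAM"
    else:
        # Results are embedded directly in the API response.
        body["disposition"] = "INLINE"
        body["format"] = "JSON_ARRAY"
    return body


request = build_statement_request(
    "9c7235e6e6e49147",
    "SELECT l_orderkey, l_shipdate FROM samples.tpch.lineitem LIMIT 10000",
    use_external_links=True,
)
print(json.dumps(request, indent=2))
```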

AI Assistant

The AI Assistant query generator produces queries from plain English statements, which you can use to search the connected databases and retrieve information. The goal is to simplify and speed up querying by generating relevant, specific queries automatically, reducing manual input and improving the chance of finding the information you need.

Step 1: Select Generate Query from the AI Assistant dropdown and describe the query you'd like to generate in plain English. Details can include table or collection names, fields, filters, etc.
Example: "Show the shipdate status"

Note: The AI Assistant uses OpenAI to generate the query. Only the question is sent to the OpenAI APIs, not the data.

Step 2: Define the data execution strategy by choosing one of the following two options:

  • Direct Execution: Executes the query directly on the original datasource, without any storage in between. In this case, when a widget is displayed, it fetches the data in real time from the underlying datasource.

  • Non-Direct Execution: Stores the query results in Knowi's data store. Benefits include support for long-running queries, reduced load on your database, and more. Non-direct execution applies when you choose to run the query once or at scheduled intervals.

For more information, check out the documentation on Defining Data Execution Strategy.

Step 3: Click the Preview button to analyze the results of your query and fine-tune the output, if required.

Note 1: An admin must enable the OpenAI integration before the AI Query Generator can be used (Account Settings > Customer Settings > OpenAI Integration).

Note 2: You can either copy the API key from your personal OpenAI account and use it, or use the default key provided by Knowi.

The AI Assistant also offers additional actions that you can run on the generated query:

  • Explain Query
  • Find Issues
  • Syntax Help

Explain Query

Provides explanations for your existing query. For example, requesting an explanation of a generated query returned the following description:
This query is selecting the l_orderkey, l_extendedprice, and l_shipdate columns from the lineitem table in the samples.tpch database, and limiting the results to the first 10,000 rows.

Find Issues

Helps in debugging and troubleshooting the query. For example, running Find Issues on a query with a misspelled keyword returned the following response: The query should use the SELECT keyword instead of SELCT. The correct query is: SELECT l_orderkey, l_extendedprice, l_shipdate FROM samples.tpch.lineitem LIMIT 10000.

Syntax Help

Ask questions about query syntax for this datasource. For example, a syntax request returned the following response: SELECT * FROM Prices

Cloud9Agent Configuration

As an alternative to the UI-based connectivity above, you can use Cloud9Agent inside your network to pull from Databricks SQL securely. See Cloud9Agent to download your agent along with instructions to run it.

Highlights:

  • Pull data using SQL.
  • Execute queries on a schedule or one time.

The agent contains a datasource_example_databricks.json and a query_example_databricks.json under the examples folder of the agent installation to get you started.

Edit those files to point to your database and modify the queries to pull your data.

Move them into the config directory (datasource_XXX.json files first if the Agent is running).

Datasource Configuration:

Parameter | Comments
name | Unique Datasource Name.
datasource | Set value to databricks
url | URL with host, port and database name to connect to. Example for Databricks: localhost:5432/cloud9demo
userId | User id to connect, where applicable
Password | Password, where applicable

Query Configuration:

Query Config Params | Comments
entityName | Dataset Name Identifier
identifier | A unique identifier for the dataset. Either identifier or entityName must be specified.
dsName | Name of the datasource configured in the datasource_XXX.json file to execute the query against. Required.
queryStr | Databricks SQL query to execute. Required.
frequencyType | One of minutes, hours, days, weeks, months. If this is not specified, the query is treated as a one-time query, executed upon Cloud9Agent startup (or when the query is first saved).
frequency | Indicates the frequency, if frequencyType is defined. For example, if this value is 10 and the frequencyType is minutes, the query will be executed every 10 minutes.
startTime | Optional. Specifies when the query should run for the first time. If set, the frequency is calculated from that time onwards. For example, if a weekly run is scheduled to start at 07/01/2014 13:30, the first run will be on 07/01/2014 at 13:30, with the next run at the same time on 07/08/2014. The time is based on the local time of the machine running the Agent. Supported date formats: MM/dd/yyyy HH:mm, MM/dd/yy HH:mm, MM/dd/yyyy, MM/dd/yy, HH:mm:ss, HH:mm, mm
c9QLFilter | Optional post-processing of the results using Cloud9QL. Rarely needed against SQL-based datastores.
overrideVals | Specifies the data storage strategy. If this is not defined, the results of the query are added to the existing dataset. To replace all data for this dataset within Knowi, specify {"replaceAll":true}. To upsert data, specify "replaceValuesForKey":["fieldA","fieldB"]. This replaces all existing records in Knowi that have the same fieldA and fieldB with the current data, and inserts records where they are not present.
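
The interaction between startTime, frequencyType and frequency can be sketched as follows. This is only an illustration of the schedule described in the table, not the Agent's implementation: the function name run_times is hypothetical, only the MM/dd/yyyy HH:mm date format is handled, and months is left out because months vary in length.

```python
from datetime import datetime, timedelta

# Fixed-length units only; "months" is deliberately excluded from this sketch.
SUPPORTED_UNITS = {"minutes", "hours", "days", "weeks"}


def run_times(start_time, frequency_type, frequency, count):
    """Hypothetical helper: return the first `count` run times, beginning at start_time."""
    if frequency_type not in SUPPORTED_UNITS:
        raise ValueError(f"unsupported frequencyType in this sketch: {frequency_type}")
    first = datetime.strptime(start_time, "%m/%d/%Y %H:%M")  # MM/dd/yyyy HH:mm
    step = timedelta(**{frequency_type: frequency})
    return [first + i * step for i in range(count)]


# Weekly run starting 07/01/2014 13:30, as in the startTime description.
for run in run_times("07/01/2014 13:30", "weeks", 1, 2):
    print(run.strftime("%m/%d/%Y %H:%M"))
```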

Datasource Example:

[
  {
     "name":"demoDatabricks",
     "host":"localhost:5432",
     "datasource":"databrickssql",
     "dbName":"cloud9demo",
     "authToken": "sampleToken",
     "schema": "demo",
     "bucket": "catalog name"
  }
]

Query Example:

[
  {
     "entityName":"Errors",
     "dsName":"demoDatabricks",
     "queryStr":"select error_condition as `Error`, count as `Count` from errors",
     "frequencyType":"minutes",
     "frequency":10,
     "overrideVals":{
        "replaceAll":true
     }
  },
  {
     "entityName":"Queues",
     "dsName":"demoDatabricks",
     "queryStr":"select Name, size as `Queue Size`, Type from queue",
     "overrideVals":{
        "replaceValuesForKey":["Type"]
     },
     "startTime":"07:20",
     "frequencyType":"days",
     "frequency":1
  }
]
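
The upsert behavior described for replaceValuesForKey can be pictured with a small in-memory sketch. This assumes records are plain dictionaries and is not how Knowi stores data; the function name upsert and the sample rows are hypothetical.

```python
def upsert(existing, incoming, key_fields):
    """Hypothetical helper: replace stored records whose key matches an incoming
    record, keep the rest, and insert incoming records with new keys."""
    def key(record):
        return tuple(record.get(field) for field in key_fields)

    incoming_keys = {key(record) for record in incoming}
    kept = [record for record in existing if key(record) not in incoming_keys]
    return kept + incoming


stored = [
    {"Type": "email", "Name": "outbound", "Queue Size": 5},
    {"Type": "sms", "Name": "alerts", "Queue Size": 7},
]
fresh = [
    {"Type": "email", "Name": "outbound", "Queue Size": 9},
    {"Type": "push", "Name": "mobile", "Queue Size": 2},
]

# Keyed on Type, as in the Queues query above.
for row in upsert(stored, fresh, ["Type"]):
    print(row)
```

With Type as the key, the stored email record is replaced by the incoming one, the sms record is left untouched, and the push record is inserted.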