Thursday, February 15, 2024

Databricks - How to create a user-defined function (UDF)

A user-defined function (UDF) lets a user extend the native capabilities of Apache Spark™ SQL. SQL on Databricks has supported external user-defined functions written in Scala, Java, Python, and R since Spark 1.3.0. While external UDFs are very powerful, they also come with a few caveats:

  • Security. A UDF written in an external language can execute dangerous or even malicious code. This requires tight control over who can create UDFs.
  • Performance. UDFs are black boxes to the Catalyst Optimizer. Because Catalyst is not aware of the inner workings of a UDF, it cannot do any work to improve the performance of the UDF within the context of a SQL query.
  • SQL Usability. For a SQL user it can be cumbersome to write UDFs in a host language and then register them in Spark. Also, many of the extensions users want to make to SQL are simple enough that developing an external UDF is overkill.
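
The "write in a host language, then register" workflow described above can be sketched in a few lines. This is only an illustration, not Spark code: to stay runnable without a cluster it uses Python's built-in sqlite3 module as a stand-in SQL engine, and the function name to_fahrenheit and the readings table are made up for the example. In PySpark the analogous registration step is spark.udf.register.

```python
import sqlite3

def to_fahrenheit(celsius):
    """Host-language logic; the SQL engine only sees an opaque callable."""
    if celsius is None:
        return None  # propagate SQL NULL
    return celsius * 9.0 / 5.0 + 32.0

conn = sqlite3.connect(":memory:")

# Registration step: expose the Python function to SQL under a name.
# (sqlite3 is a stand-in here; in PySpark this would be spark.udf.register.)
conn.create_function("to_fahrenheit", 1, to_fahrenheit)

# Hypothetical sample data for the illustration.
conn.execute("CREATE TABLE readings (city TEXT, celsius REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Oslo", 0.0), ("Cairo", 35.0), ("Unknown", None)],
)

# The UDF is now callable from SQL like a built-in function.
rows = conn.execute(
    "SELECT city, to_fahrenheit(celsius) FROM readings ORDER BY city"
).fetchall()
print(rows)
```

Even in this small form the caveats are visible: the engine runs whatever code the function contains, cannot inspect or optimize it, and a one-line conversion needed a separate host-language definition plus a registration call.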

 https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html

Thursday, January 18, 2024

OpenAI with Databricks SQL for queries in natural language

Modern data platforms collect and store an enormous amount of both data and metadata. However, even the metadata may be of little use to end users who lack experience with the classical components of a relational data model. The challenge is not only writing correct SQL statements to select the relevant information, but also understanding what needs to be joined, and how, to get even the simplest insights (e.g. the top 5 customers from a given region by number of orders).
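
One way to picture how table metadata could help with such a question is to fold it into the text sent to a language model. The sketch below only builds that text and makes no model call. The table and column names and the helpers describe_table and build_prompt are hypothetical, and the linked article's actual approach may differ.

```python
def describe_table(name, columns):
    """Render one table's metadata as a single compact line."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    return f"TABLE {name} ({cols})"

def build_prompt(schema, question):
    """Combine schema metadata and a natural-language question into one prompt."""
    lines = ["Given the following tables:"]
    lines += [describe_table(name, cols) for name, cols in schema.items()]
    lines.append(f"Write a SQL query that answers: {question}")
    return "\n".join(lines)

# Hypothetical schema, for illustration only.
schema = {
    "customers": [("customer_id", "INT"), ("name", "STRING"), ("region", "STRING")],
    "orders": [("order_id", "INT"), ("customer_id", "INT")],
}

prompt = build_prompt(
    schema,
    "top 5 customers from a given region by number of orders",
)
print(prompt)
```

Listing the shared customer_id column in both tables is what gives a model a chance to infer the join that an inexperienced user would otherwise have to work out by hand.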

 https://polarpersonal.medium.com/using-openai-with-databricks-sql-for-queries-in-natural-language-cf6521e88148