Thursday, February 15, 2024

Databricks - How to create a user-defined function (UDF)

A user-defined function (UDF) lets a user extend the native capabilities of Apache Spark™ SQL. SQL on Databricks has supported external user-defined functions written in Scala, Java, Python, and R since 1.3.0. While external UDFs are very powerful, they also come with a few caveats:

  • Security. A UDF written in an external language can execute dangerous or even malicious code. This requires tight control over who can create UDFs.
  • Performance. UDFs are black boxes to the Catalyst Optimizer. Because Catalyst is not aware of the inner workings of a UDF, it cannot optimize the UDF within the context of a SQL query.
  • SQL Usability. For a SQL user, it can be cumbersome to write UDFs in a host language and then register them in Spark. Also, many of the extensions users want to make to SQL are simple enough that developing an external UDF is overkill.
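
To make the "write it in a host language, then register it" step concrete, here is a minimal sketch of an external Python UDF. The function name fahrenheit_to_celsius, the SQL name to_celsius, and the helper register_udf are hypothetical, chosen only for illustration. The sketch assumes PySpark is installed and that an existing SparkSession is passed in.

```python
def fahrenheit_to_celsius(temp_f):
    """Plain Python logic. Once wrapped as a UDF, Spark only sees an opaque call."""
    if temp_f is None:
        # SQL NULLs arrive as None, so pass them through unchanged.
        return None
    return (temp_f - 32.0) * 5.0 / 9.0


def register_udf(spark):
    """Register the function under a SQL-callable name (hypothetical helper).

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported lazily so the pure-Python function above works without PySpark.
    from pyspark.sql.types import DoubleType

    spark.udf.register("to_celsius", fahrenheit_to_celsius, DoubleType())
```

After calling register_udf(spark), a query such as spark.sql("SELECT to_celsius(98.6)") can call the function by name. The conversion itself is a one-line expression, which is the kind of simple extension the usability caveat describes, and because the body is ordinary Python, the optimizer has no view into what it computes.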

 https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html