Knowledge Base

Databricks and OOP, do they match?

September 2, 2025
Development, Architecture
Databricks, Spark, Oop, Software-Engineering

Context #

Databricks and Apache Spark are often used in data engineering, data science, and machine learning workflows. Their APIs are designed around distributed data processing (RDDs, DataFrames, Datasets). The question arises: does Object-Oriented Programming (OOP) fit into this paradigm, or do we need a different style?


Databricks Programming Model vs OOP #

  • Spark API: functional and declarative. You express transformations (map, filter, select) on immutable distributed datasets.
  • OOP style: encapsulates data + behaviour inside classes, often with mutable state.

Where They Match #

  • Encapsulation of business logic: Wrapping Spark transformations inside reusable classes (e.g., DataCleaner, FeatureEngineer) helps modularize pipelines (see the sketch after this list).
  • Abstractions for teams: Teams can expose high-level methods (.transform(df)) instead of low-level Spark calls.
  • Testing & reusability: OOP structures allow dependency injection, mock data, and unit testing.
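
As a concrete illustration of the encapsulation pattern above, here is a minimal PySpark sketch (the class, column, and variable names are illustrative, not from an actual pipeline):

# Encapsulating stateless Spark transformations behind a .transform() method
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class DataCleaner:
    """Wraps a set of stateless Spark transformations behind one method."""

    def __init__(self, required_columns):
        self.required_columns = list(required_columns)

    def transform(self, df: DataFrame) -> DataFrame:
        # Pure function of the input DataFrame: no mutable state is kept.
        return (
            df.dropna(subset=self.required_columns)
              .withColumn("amount", F.col("amount").cast("double"))
        )

# Usage (raw_df is an existing DataFrame with "id" and "amount" columns):
# cleaned = DataCleaner(["id", "amount"]).transform(raw_df)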

Where They Clash #

  • Statefulness: Spark’s lazy evaluation and immutable DataFrames do not align with mutable OOP state.
  • Serialization: Classes with methods that capture external state may not serialize well when Spark ships code to executors (see the sketch after this list).
  • Functional preference: Many Spark best practices push towards functional patterns (pure functions, stateless transformations).
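
A minimal sketch of the serialization pitfall mentioned above, with illustrative names (Enricher and code_id are examples, not from an actual pipeline):

from pyspark.sql import functions as F

class Enricher:
    def __init__(self):
        self.lookup = {"A": 1, "B": 2}      # plain, picklable state
        # self.conn = open_db_connection() # a live handle like this would not pickle

    def enrich(self, df):
        # Anti-pattern: referencing self inside the lambda captures the whole
        # object in the closure, so Spark must pickle it to ship it to executors.
        bad_udf = F.udf(lambda code: self.lookup.get(code, 0), "int")

        # Safer: copy what is needed into a local variable first, so only that
        # small, picklable value travels with the closure.
        lookup = dict(self.lookup)
        good_udf = F.udf(lambda code: lookup.get(code, 0), "int")
        return df.withColumn("code_id", good_udf("code"))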

Note on statefulness: In Learning Spark, Holden Karau draws a clear distinction between stateless and stateful processing and emphasizes it. Stateless transformations are preferred, but Spark also provides patterns for stateful processing, particularly in streaming contexts, e.g., updateStateByKey, windowing, watermarking, and event-time state management.
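
A minimal sketch of the streaming, stateful side of that note, assuming a running Spark session and using the built-in rate source purely for demonstration:

from pyspark.sql import functions as F

# Demo source: emits rows with columns (timestamp, value)
events = (spark.readStream.format("rate").load()
          .withColumn("key", (F.col("value") % 3).cast("string")))

# Windowing + watermarking: the watermark bounds how long Spark keeps
# event-time state per key before dropping late data.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "key")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()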

...

Databricks Naming Conventions

September 2, 2025
Development, Data Platforms
Databricks, Best-Practices, Naming, Data-Engineering, Environment

Introduction #

Consistent naming across environments (dev, test, prod), layers (bronze/silver/gold), and domains is critical in Databricks: it prevents confusion, enforces governance, and supports automation with Unity Catalog and Delta Lake.


General Best Practices #

  • Separate dev / test / prod workspaces.
  • Apply RBAC + Unity Catalog.
  • Use modular notebooks; reuse with %run.
  • Version control all code.
  • Prefer job clusters; auto-terminate.
  • Vacuum Delta tables; use OPTIMIZE and Z-ORDER (see the sketch after this list).
  • Allow schema evolution only when intentional.
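
To illustrate the maintenance bullet above, a minimal sketch using Spark SQL on a Delta table (table and column names are examples):

# OPTIMIZE compacts small files; ZORDER BY co-locates rows by a frequent filter column;
# VACUUM removes unreferenced files older than the retention period (7 days by default).
spark.sql("OPTIMIZE dev_sales.silver.orders ZORDER BY (customer_id)")
spark.sql("VACUUM dev_sales.silver.orders RETAIN 168 HOURS")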

Environment‑Aware Medallion Naming #

Unity Catalog is the governance backbone: inconsistent names break access policies and automation. Use environment prefixes, clear domain names, and snake_case; cf. the Unity Catalog docs.
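
One possible layout, assuming env-prefixed catalogs, the business domain in the catalog name, and the medallion layer as the schema (all names below are illustrative):

# catalog = <env>_<domain>, schema = medallion layer, table = snake_case entity
env = "dev"                 # dev | test | prod
domain = "sales"            # business domain
layer = "silver"            # bronze | silver | gold
table = "orders_cleaned"

full_name = f"{env}_{domain}.{layer}.{table}"
print(full_name)            # dev_sales.silver.orders_cleaned

# df = spark.table(full_name)   # the same code then resolves per environment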

...

Multi-module build based on sbt

March 15, 2024
Development, Documentation
Scala, Templates, Development

import sbt._
import Keys._

// sbt.version = 1.6.2
ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing

lazy val welcome = taskKey[Unit]("welcome")

val sparkVersion = "2.4.0-cdh6.2.1"
val hiveVersion = "2.1.1-cdh6.2.1"

lazy val commonSettings = Seq(
  //organization := "com.nnz",
  version := "0.1.0-SNAPSHOT",
  welcome := { println("Welcome!") },
  scalaVersion := "2.11.12",
  // Spark 2.4 on CDH 6 runs on Java 8, so target 1.8 rather than a JDK patch version
  javacOptions ++= Seq("-source", "1.8", "-target", "1.8"),
  libraryDependencies ++= sparkDependencies,
  resolvers ++= Seq(
    "Cloudera Versions" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
  )
)

lazy val root = (project in file("."))
  .settings(
    name := "multimodule-project",
    commonSettings,
    update / aggregate :=  true,
  )
  .aggregate(warehouse, ingestion, processing)

lazy val warehouse = (project in file("warehouse"))
  .settings(
    name := "warehouse",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

lazy val ingestion = (project in file("ingestion"))
  .dependsOn(warehouse)
  .settings(
    name := "ingestion",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

lazy val processing = (project in file("processing"))
  .dependsOn(warehouse, ingestion)
  .settings(
    name := "processing",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

/**
 * Spark Dependencies
 */
val sparkCore = "org.apache.spark" %% "spark-core" % sparkVersion
val sparkSQL = "org.apache.spark" %% "spark-sql" % sparkVersion
val sparkHive = "org.apache.spark" %% "spark-hive" %  sparkVersion

lazy val sparkDependencies = Seq(sparkCore, sparkSQL, sparkHive)

https://gist.github.com/Non-NeutralZero/d5be154ee38962176bcc0bf49182c691

...

Jupyter

November 20, 2023
Utils
Jupyter, Python

Memory Usage #

def memory():
    """Return total, free, and used memory (in kB) parsed from /proc/meminfo."""
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        free = 0
        for line in mem:
            fields = line.split()
            if fields[0] == 'MemTotal:':
                ret['total'] = int(fields[1])
            elif fields[0] in ('MemFree:', 'Buffers:', 'Cached:'):
                free += int(fields[1])
        ret['free'] = free
        ret['used'] = ret['total'] - ret['free']
    return ret

No Hang Up #

nohup jupyter notebook --no-browser > notebook.log 2>&1 &

Workaround: no cell output #

import time

start = time.time()
print(train.rdd.getNumPartitions())
print(test.rdd.getNumPartitions())
end = time.time()
elapsed = end - start
print("Training time = {}".format(elapsed))

comment = "Training time for getNumPartitions:"

# Open the file in append mode and write the comment and variable,
# so the result is kept even when the notebook shows no cell output
with open('output.txt', 'a') as f:
    f.write(f"{comment} {elapsed}\n")

VS Code Configuration & Set-up

November 17, 2023
Utils, Tutorials
Git, Ssh

Configuration #

Remote SSH #

Add host entries to your SSH config file (typically ~/.ssh/config):

Host machine
    Hostname machine.com
    User user_name
    IdentityFile path/to/ssh/key

Remote SSH - SSH Tunnel #

Host tunnel_machine
    Hostname machine.com
    User user_name
    IdentityFile path/to/ssh/key

Host machine_after_tunnel
    Hostname machine_after_tunnel.com
    User user_name
    IdentityFile path/to/ssh/key
    ForwardAgent yes
    ProxyJump tunnel_machine

PC Configuration #

Authorize your Windows local machine to connect to the remote machine.

$USER_AT_HOST="your-user-name-on-host@hostname"
$PUBKEYPATH="$HOME\.ssh\id_ed25519.pub"

$pubKey=(Get-Content "$PUBKEYPATH" | Out-String); ssh "$USER_AT_HOST" "mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo '${pubKey}' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

Verify that the authorized_keys file in the .ssh folder for your remote user on the SSH host is owned by you and no other user has permission to access it.

...

Building a website using Hugo and Hosting it on GitHub Pages

October 26, 2023
Development, Tutorials
Markdown, Development

Installations #

Configuration #

  • To create a new Hugo website, run:
hugo new site mynewsite
  • then cd to the directory
cd mynewsite
  • Initialize the site as a git repository
git init
  • Choose the hugo theme that suits you.
    Hugo offers a selection of themes developed by the community. This site, for example, was built using Hugo-Book.
  • Add the theme as a submodule
# For example:
git submodule add https://github.com/alex-shpak/hugo-book themes/hugo-book
  • Add the theme to your site configuration file
# Could be config.toml OR config.yaml OR hugo.toml OR hugo.yaml
echo "theme = 'hugo-book'" >> config.toml
  • You will be able to see a first version of your website locally by running:
hugo server --minify 
  • Edit your configuration file
baseURL = 'http://example.org/'
languageCode = 'en-us'
title = 'My New Hugo Site'
Theme Configuration Guidelines
Theme publishers offer guidelines for configuring your website in accordance with the theme. Check your theme's page on Hugo Themes or its GitHub repo for guidance and help.

Hosting on GitHub Pages #

  • In your repository settings, go to Pages. You’ll be able to see your site’s link there.
  • Choose a build and deployment source (GitHub Actions OR deploy from branch).
  • You can also choose to publish it on a custom domain.
  • Edit your configuration file
baseURL = 'https://username.github.io/repository'
languageCode = 'en-us'
title = 'My New Hugo Site'
theme = 'hugo-book'

Other Great Tools For Building Static Websites #

Run plotly in JupyterLab

October 24, 2023
Utils, Tutorials
Jupyter, Python-Libraries

pip uninstall plotly
jupyter labextension uninstall @jupyterlab/plotly-extension
jupyter labextension uninstall jupyterlab-plotly
jupyter labextension uninstall plotlywidget
jupyter labextension update --all
pip install plotly==5.17.0
pip install "jupyterlab>=3" "ipywidgets>=7.6"
pip install jupyter-dash
jupyter labextension list

Install Python packages offline

June 20, 2023
Utils, Tutorials
Pip, Python

1- Download packages locally using a requirements file or download a single package

pip download -r requirements.txt
## Example - single package
python -m pip download \
--only-binary=:all: \
--platform manylinux1_x86_64 --platform linux_x86_64 --platform any \
--python-version 39 \
--implementation cp \
--abi cp39m --abi cp39 --abi abi3 --abi none \
scipy

2- Copy them to a temporary folder on your remote machine.

3- On the remote machine, activate conda and then install them using pip, specifying the installation options.
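
For step 3, one common form of the offline install, assuming the packages were copied to a folder such as /tmp/wheels (path and file names are illustrative):

## Install strictly from the local folder, without reaching out to PyPI
python -m pip install --no-index --find-links /tmp/wheels -r requirements.txt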

...