Databricks

Databricks and OOP, do they match?

September 2, 2025
Development, Architecture
Databricks, Spark, Oop, Software-Engineering

Context #

Databricks and Apache Spark are often used in data engineering, data science, and machine learning workflows. Their APIs are designed around distributed data processing (RDDs, DataFrames, Datasets). The question arises: does Object-Oriented Programming (OOP) fit into this paradigm, or do we need a different style?


Databricks Programming Model vs OOP #

  • Spark API: functional and declarative. You express transformations (map, filter, select) on immutable distributed datasets (see the sketch after this list).
  • OOP style: encapsulates data + behaviour inside classes, often with mutable state.
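
To make the contrast concrete, here is a minimal sketch of the declarative Spark style, assuming a hypothetical orders table and columns: every call returns a new, immutable DataFrame, and nothing runs until an action is triggered.

```python
# Minimal sketch of the declarative/functional Spark style.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("sales.orders")             # hypothetical table
high_value = (
    orders
    .filter(F.col("amount") > 100)                     # transformation (lazy)
    .select("order_id", "customer_id", "amount")       # transformation (lazy)
    .withColumn("amount_eur", F.col("amount") * 0.92)  # still lazy
)
high_value.show()                                      # action triggers execution
```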

Where They Match #

  • Encapsulation of business logic: Wrapping Spark transformations inside reusable classes (e.g., DataCleaner, FeatureEngineer) helps modularize pipelines, as sketched after this list.
  • Abstractions for teams: Teams can expose high-level methods (.transform(df)) instead of low-level Spark calls.
  • Testing & reusability: OOP structures allow dependency injection, mock data, and unit testing.
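
As an illustration of that encapsulation, here is a minimal sketch, assuming a hypothetical DataCleaner class and column names; the class holds only configuration, while the actual transformations stay on immutable DataFrames.

```python
# Minimal sketch: business logic wrapped in a reusable class with .transform(df).
# DataCleaner, its constructor arguments, and the column names are hypothetical.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


class DataCleaner:
    """Exposes a high-level .transform(df) instead of low-level Spark calls."""

    def __init__(self, required_columns: list[str]):
        # Plain configuration, not mutable Spark state.
        self.required_columns = required_columns

    def transform(self, df: DataFrame) -> DataFrame:
        # Drop rows missing required columns, then normalise a text column.
        return (
            df.dropna(subset=self.required_columns)
              .withColumn("customer_name", F.trim(F.lower(F.col("customer_name"))))
        )


# Usage: cleaned_df = DataCleaner(["order_id", "customer_name"]).transform(raw_df)
# Being a plain class, it is also easy to unit test with small local DataFrames.
```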

Where They Clash #

  • Statefulness: Spark’s lazy evaluation and immutable DataFrames do not align with mutable OOP state.
  • Serialization: Classes with methods that capture external state may not serialize well when Spark ships code to executors.
  • Functional preference: Many Spark best practices push towards functional patterns (pure functions, stateless transformations); see the sketch below.
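
To illustrate that functional preference, here is a minimal sketch using pure, stateless functions composed with DataFrame.transform (column names are hypothetical); because the functions capture no external objects, they sidestep the serialization issues noted above.

```python
# Minimal sketch of the functional pattern: pure functions composed with
# DataFrame.transform. Column names are hypothetical.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def drop_cancelled(df: DataFrame) -> DataFrame:
    # Pure and stateless: no external state is captured, so nothing
    # problematic is serialized when Spark ships the code to executors.
    return df.filter(F.col("status") != "cancelled")


def add_total(df: DataFrame) -> DataFrame:
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


# Composition stays declarative; each step returns a new immutable DataFrame:
# result = orders_df.transform(drop_cancelled).transform(add_total)
```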

Note on statefulness: In Learning Spark, Holden Karau emphasizes the distinction between stateless and stateful processing. Stateless transformations are preferred, but Spark also provides patterns for stateful processing, particularly in streaming contexts: updateStateByKey, windowing, watermarking, and event-time state management (see the sketch below).
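
As a minimal sketch of that streaming side, assuming a hypothetical Kafka source and column names, the Structured Streaming example below keeps per-window state and uses a watermark to bound how long that state is retained.

```python
# Minimal sketch of stateful processing in Structured Streaming: event-time
# windowing plus a watermark. Source options and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "clicks")
         .load()
         .selectExpr("CAST(value AS STRING) AS user_id",
                     "timestamp AS event_time")
)

# Spark maintains aggregation state per window; the watermark lets it drop
# state for windows older than 10 minutes of event time.
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .count()
)

# query = counts.writeStream.outputMode("update").format("console").start()
```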

...

Databricks Naming Conventions

September 2, 2025
Development, Data Platforms
Databricks, Best-Practices, Naming, Data-Engineering, Environment

Introduction #

Consistent naming across environments (dev, test, prod), layers (bronze/silver/gold), and domains is critical in Databricks. It prevents confusion, enforces governance, and supports automation with Unity Catalog and Delta Lake.


General Best Practices #

  • Separate dev / test / prod workspaces.
  • Apply RBAC + Unity Catalog.
  • Use modular notebooks; reuse with %run.
  • Version control all code.
  • Prefer job clusters; auto-terminate.
  • Vacuum Delta tables; use OPTIMIZE + Z-ORDER (see the maintenance sketch after this list).
  • Allow schema evolution only when intentional.
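
As a minimal maintenance sketch, assuming a hypothetical Unity Catalog table name and a 7-day retention that suits your time-travel needs, the routine below could run from a scheduled job:

```python
# Minimal sketch of routine Delta maintenance; table name and retention
# window are hypothetical and must match your own requirements.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = "prod.silver_sales.orders"  # hypothetical Unity Catalog table

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id)")

# Remove files no longer referenced, keeping 7 days (168 hours) of history.
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
```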

Environment‑Aware Medallion Naming #

Unity Catalog is the governance backbone. Inconsistent names break access policies and automation. Use environment prefixes, clear domains, and snake_case; cf. the Unity Catalog docs.
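
As an illustration only, here is one hypothetical way to assemble environment-aware, snake_case names for Unity Catalog's three-level namespace (catalog.schema.table); the exact placement of the environment prefix and medallion layer should follow your own convention.

```python
# Hypothetical naming scheme: env prefix + medallion layer + domain, in snake_case.
env = "dev"          # dev | test | prod
layer = "silver"     # bronze | silver | gold
domain = "sales"

table_name = f"{env}.{layer}_{domain}.orders_cleaned"
# -> "dev.silver_sales.orders_cleaned"

# Consistent names keep grants and automation predictable, e.g.:
# GRANT SELECT ON TABLE dev.silver_sales.orders_cleaned TO `data_analysts`;
```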

...