Knowledge Base

Databricks and OOP, do they match?

September 2, 2025
Development, Architecture
Databricks, Spark, Oop, Software-Engineering

Context #

Databricks and Apache Spark are often used in data engineering, data science, and machine learning workflows. Their APIs are designed around distributed data processing (RDDs, DataFrames, Datasets). The question arises: does Object-Oriented Programming (OOP) fit into this paradigm, or do we need a different style?


Databricks Programming Model vs OOP #

  • Spark API: functional and declarative. You express transformations (map, filter, select) on immutable distributed datasets.
  • OOP style: encapsulates data + behaviour inside classes, often with mutable state.

Where They Match #

  • Encapsulation of business logic: Wrapping Spark transformations inside reusable classes (e.g., DataCleaner, FeatureEngineer) helps modularize pipelines (see the sketch after this list).
  • Abstractions for teams: Teams can expose high-level methods (.transform(df)) instead of low-level Spark calls.
  • Testing & reusability: OOP structures allow dependency injection, mock data, and unit testing.
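
As a concrete illustration of the encapsulation pattern above, here is a minimal PySpark sketch (the class, column, and variable names are illustrative, not from an actual pipeline):

# Encapsulating stateless Spark transformations behind a .transform() method
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class DataCleaner:
    """Wraps a set of stateless Spark transformations behind one method."""

    def __init__(self, required_columns):
        self.required_columns = list(required_columns)

    def transform(self, df: DataFrame) -> DataFrame:
        # Pure function of the input DataFrame: no mutable state is kept.
        return (
            df.dropna(subset=self.required_columns)
              .withColumn("amount", F.col("amount").cast("double"))
        )

# Usage (raw_df is an existing DataFrame with "id" and "amount" columns):
# cleaned = DataCleaner(["id", "amount"]).transform(raw_df)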

Where They Clash #

  • Statefulness: Spark’s lazy evaluation and immutable DataFrames do not align with mutable OOP state.
  • Serialization: Classes with methods that capture external state may not serialize well when Spark ships code to executors (see the sketch after this list).
  • Functional preference: Many Spark best practices push towards functional patterns (pure functions, stateless transformations).
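
A minimal sketch of the serialization pitfall mentioned above, with illustrative names (Enricher and code_id are examples, not from an actual pipeline):

from pyspark.sql import functions as F

class Enricher:
    def __init__(self):
        self.lookup = {"A": 1, "B": 2}      # plain, picklable state
        # self.conn = open_db_connection() # a live handle like this would not pickle

    def enrich(self, df):
        # Anti-pattern: referencing self inside the lambda captures the whole
        # object in the closure, so Spark must pickle it to ship it to executors.
        bad_udf = F.udf(lambda code: self.lookup.get(code, 0), "int")

        # Safer: copy what is needed into a local variable first, so only that
        # small, picklable value travels with the closure.
        lookup = dict(self.lookup)
        good_udf = F.udf(lambda code: lookup.get(code, 0), "int")
        return df.withColumn("code_id", good_udf("code"))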

Note on statefulness: In Learning Spark, Holden Karau draws a clear distinction between stateless and stateful processing and emphasizes it. Stateless transformations are preferred, but Spark also provides patterns for stateful processing, particularly in streaming contexts, e.g., updateStateByKey, windowing, watermarking, and event-time state management.
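
A minimal sketch of the streaming, stateful side of that note, assuming a running Spark session and using the built-in rate source purely for demonstration:

from pyspark.sql import functions as F

# Demo source: emits rows with columns (timestamp, value)
events = (spark.readStream.format("rate").load()
          .withColumn("key", (F.col("value") % 3).cast("string")))

# Windowing + watermarking: the watermark bounds how long Spark keeps
# event-time state per key before dropping late data.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "key")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()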

...

Databricks Naming Conventions

September 2, 2025
Development, Data Platforms
Databricks, Best-Practices, Naming, Data-Engineering, Environment

Introduction #

Consistent naming across environments (dev, test, prod), layers (bronze/silver/gold), and domains is critical in Databricks: it prevents confusion, enforces governance, and supports automation with Unity Catalog and Delta Lake.


General Best Practices #

  • Separate dev / test / prod workspaces.
  • Apply RBAC + Unity Catalog.
  • Use modular notebooks; reuse with %run.
  • Version control all code.
  • Prefer job clusters; auto-terminate.
  • Vacuum Delta tables; use OPTIMIZE and Z-ORDER (see the sketch after this list).
  • Allow schema evolution only when intentional.
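
To illustrate the maintenance bullet above, a minimal sketch using Spark SQL on a Delta table (table and column names are examples):

# OPTIMIZE compacts small files; ZORDER BY co-locates rows by a frequent filter column;
# VACUUM removes unreferenced files older than the retention period (7 days by default).
spark.sql("OPTIMIZE dev_sales.silver.orders ZORDER BY (customer_id)")
spark.sql("VACUUM dev_sales.silver.orders RETAIN 168 HOURS")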

Environment‑Aware Medallion Naming #

Unity Catalog is the governance backbone: inconsistent names break access policies and automation. Use environment prefixes, clear domain names, and snake_case; cf. the Unity Catalog docs.
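
One possible layout, assuming env-prefixed catalogs, the business domain in the catalog name, and the medallion layer as the schema (all names below are illustrative):

# catalog = <env>_<domain>, schema = medallion layer, table = snake_case entity
env = "dev"                 # dev | test | prod
domain = "sales"            # business domain
layer = "silver"            # bronze | silver | gold
table = "orders_cleaned"

full_name = f"{env}_{domain}.{layer}.{table}"
print(full_name)            # dev_sales.silver.orders_cleaned

# df = spark.table(full_name)   # the same code then resolves per environment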

...

Multi-module build based on sbt

March 15, 2024
Development, Documentation
Scala, Templates, Development

import sbt._
import Keys._

// sbt.version = 1.6.2
ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing

lazy val welcome = taskKey[Unit]("welcome")

val sparkVersion = "2.4.0-cdh6.2.1"
val hiveVersion = "2.1.1-cdh6.2.1"

lazy val commonSettings = Seq(
  //organization := "com.nnz",
  version := "0.1.0-SNAPSHOT",
  welcome := { println("Welcome!") },
  scalaVersion := "2.11.12",
  // Spark 2.4 on CDH 6 runs on Java 8, so target 1.8 rather than a JDK patch version
  javacOptions ++= Seq("-source", "1.8", "-target", "1.8"),
  libraryDependencies ++= sparkDependencies,
  resolvers ++= Seq(
    "Cloudera Versions" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
  )
)

lazy val root = (project in file("."))
  .settings(
    name := "multimodule-project",
    commonSettings,
    update / aggregate :=  true,
  )
  .aggregate(warehouse, ingestion, processing)

lazy val warehouse = (project in file("warehouse"))
  .settings(
    name := "warehouse",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

lazy val ingestion = (project in file("ingestion"))
  .dependsOn(warehouse)
  .settings(
    name := "ingestion",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

lazy val processing = (project in file("processing"))
  .dependsOn(warehouse, ingestion)
  .settings(
    name := "processing",
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "src" / "main" / "scala",
    Test / scalaSource := baseDirectory.value / "src" / "test" / "scala",
  )

/**
 * Spark Dependencies
 */
val sparkCore = "org.apache.spark" %% "spark-core" % sparkVersion
val sparkSQL = "org.apache.spark" %% "spark-sql" % sparkVersion
val sparkHive = "org.apache.spark" %% "spark-hive" %  sparkVersion

lazy val sparkDependencies = Seq(sparkCore, sparkSQL, sparkHive)

https://gist.github.com/Non-NeutralZero/d5be154ee38962176bcc0bf49182c691

...

Jupyter

November 20, 2023
Utils
Jupyter, Python

Memory Usage #

def memory():
    """Return total, free, and used memory (in kB) parsed from /proc/meminfo."""
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        free = 0
        for line in mem:
            fields = line.split()
            if fields[0] == 'MemTotal:':
                ret['total'] = int(fields[1])
            elif fields[0] in ('MemFree:', 'Buffers:', 'Cached:'):
                free += int(fields[1])
        ret['free'] = free
        ret['used'] = ret['total'] - ret['free']
    return ret

No Hang Up #

nohup jupyter notebook --no-browser > notebook.log 2>&1 &

Workaround: no cell output #

import time

start = time.time()
print(train.rdd.getNumPartitions())
print(test.rdd.getNumPartitions())
end = time.time()
elapsed = end - start
print("Training time = {}".format(elapsed))

comment = "Training time for getNumPartitions:"

# Open the file in append mode and write the comment and variable,
# so the result is kept even when the notebook shows no cell output
with open('output.txt', 'a') as f:
    f.write(f"{comment} {elapsed}\n")

VS Code Configuration & Set-up

November 17, 2023
Utils, Tutorials
Git, Ssh

Configuration #

Remote SSH #

Add host entries to your SSH config file (typically ~/.ssh/config):

Host machine
    Hostname machine.com
    User user_name
    IdentityFile path/to/ssh/key

Remote SSH - SSH Tunnel #

Host tunnel_machine
    Hostname machine.com
    User user_name
    IdentityFile path/to/ssh/key

Host machine_after_tunnel
    Hostname machine_after_tunnel.com
    User user_name
    IdentityFile path/to/ssh/key
    ForwardAgent yes
    ProxyJump tunnel_machine

PC Configuration #

Authorize your Windows local machine to connect to the remote machine.

$USER_AT_HOST="your-user-name-on-host@hostname"
$PUBKEYPATH="$HOME\.ssh\id_ed25519.pub"

$pubKey=(Get-Content "$PUBKEYPATH" | Out-String); ssh "$USER_AT_HOST" "mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo '${pubKey}' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

Verify that the authorized_keys file in the .ssh folder for your remote user on the SSH host is owned by you and no other user has permission to access it.

...

Building a website using Hugo and Hosting it on GitHub Pages

October 26, 2023
Development, Tutorials
Markdown, Development

Installations #

Configuration #

  • To create a new Hugo website, run:
hugo new site mynewsite
  • then cd to the directory
cd mynewsite
  • Initialize the site as a git repository
git init
  • Choose the hugo theme that suits you.
    Hugo offers a selection of themes developed by the community. This site, for example, was built using Hugo-Book.
  • Add the theme as a submodule
# For example:
git submodule add https://github.com/alex-shpak/hugo-book themes/hugo-book
  • Add the theme to your site configuration file
# Could be config.toml OR config.yaml OR hugo.toml OR hugo.yaml
echo "theme = 'hugo-book'" >> config.toml
  • You will be able to see a first version of your website locally by running:
hugo server --minify 
  • Edit your configuration file
baseURL = 'http://example.org/'
languageCode = 'en-us'
title = 'My New Hugo Site'
Theme Configuration Guidelines
Theme publishers offer guidelines for configuring your website in accordance with the theme. Check your theme's page on Hugo Themes or its GitHub repo for guidance and help.

Hosting on GitHub Pages #

  • In your repository settings, go to Pages. You’ll be able to see your site’s link there.
  • Choose a build and deployment source (GitHub Actions OR deploy from branch).
  • You can also choose to publish it on a custom domain.
  • Edit your configuration file
baseURL = 'https://username.github.io/repository'
languageCode = 'en-us'
title = 'My New Hugo Site'
theme = 'hugo-book'

Other Great Tools For Building Static Websites #

Run plotly in JupyterLab

October 24, 2023
Utils, Tutorials
Jupyter, Python-Libraries

pip uninstall plotly
jupyter labextension uninstall @jupyterlab/plotly-extension
jupyter labextension uninstall jupyterlab-plotly
jupyter labextension uninstall plotlywidget
jupyter labextension update --all
pip install plotly==5.17.0
pip install "jupyterlab>=3" "ipywidgets>=7.6"
pip install jupyter-dash
jupyter labextension list

Install Python packages offline

June 20, 2023
Utils, Tutorials
Pip, Python

1- Download packages locally using a requirements file or download a single package

pip download -r requirements.txt
## Example - single package
python -m pip download \
--only-binary=:all: \
--platform manylinux1_x86_64 --platform linux_x86_64 --platform any \
--python-version 39 \
--implementation cp \
--abi cp39m --abi cp39 --abi abi3 --abi none \
scipy

2- Copy them to a temporary folder on your remote machine.

3- On the remote machine, activate conda and then install them using pip, specifying the installation options.
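
For step 3, one common form of the offline install, assuming the packages were copied to a folder such as /tmp/wheels (path and file names are illustrative):

## Install strictly from the local folder, without reaching out to PyPI
python -m pip install --no-index --find-links /tmp/wheels -r requirements.txt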

...