Spark & Databricks Q&A

1. What is the difference between managed and external tables?

Managed Tables:

In managed tables, the data and the metadata are both managed by the database or data warehouse system.
When a managed table is dropped, both the table metadata and the underlying data are deleted.
These tables are typically stored in the default database storage location.

External Tables:

In external tables, only the metadata is managed by the database system, while the data remains in an external location (e.g., cloud storage like ADLS).
Dropping an external table only removes the metadata; the actual data remains intact in its original location.
Useful when data needs to be shared across different systems.

2. What is the use of the `collect` command?

The collect command is used to bring the data from all the worker nodes back to the driver node as a list.
When it’s used: It is commonly used when small datasets need to be brought into the driver for further local operations or analysis.
Driver vs Worker Node: The collect command transfers data from each worker node to the driver node.
Scenarios: It is used when:
- Summarizing data.
- Performing local aggregations.
- Collecting small datasets for visualization or reporting.

Out-of-Memory Exceptions: These occur when:

The driver tries to collect too much data into memory, exceeding its available heap space.
Workers run out of memory while performing operations (e.g., shuffling large datasets).

3. Explain your project in detail about landing, bronze, silver, and gold layers:

Landing Layer:
- Raw data is ingested from various source systems.
- No transformations are applied at this stage.
- Data is stored in its original format (e.g., CSV, JSON).
Bronze Layer:
- Contains raw data with basic ingestion transformations (e.g., deduplication, adding metadata like ingestion timestamp).
- Data is stored in delta tables for efficient reads and writes.
Silver Layer:
- Semi-processed data with cleansing, standardization, and applying business logic.
- Joins, filtering, and aggregations are performed.
- This layer is ready for further transformations or reporting.
Gold Layer:
- Fully processed data that is ready for consumption by analytics or reporting tools.
- Data is aggregated and denormalized for performance.

Data Movement:

Data is ingested into the landing layer, processed into bronze, transformed in silver, and finally stored in gold for analytics.

4. How to convert managed into external tables?

To convert a managed table to an external table:

Create an external location where you want to store the data.
Use the CREATE TABLE ... AS SELECT syntax: CREATE TABLE external_table USING DELTA LOCATION '<external-location>' AS SELECT * FROM managed_table;

5. How do you connect Key Vault with Databricks?

Use Azure Key Vault-backed secret scope: databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name <key-vault-dns>
Access secrets using: secret_value = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")

6. What is the `VACUUM` command?

The VACUUM command removes old files from Delta tables that are no longer in use.
Syntax: VACUUM delta_table RETAIN 168 HOURS;
It helps reduce storage costs by cleaning up stale data files.

7. `OPTIMIZE` vs `ZORDER` command:

OPTIMIZE: Used to compact small files into larger ones, improving read performance.
ZORDER: Used in combination with OPTIMIZE to colocate related data on disk, improving query performance for specific columns. OPTIMIZE delta_table ZORDER BY (column_name);

8. Spark Internals:

Spark Driver:

The driver is the master node that maintains information about the Spark application and manages the execution of tasks.
It coordinates the distribution of data and tasks across the worker nodes.

Spark Context vs Spark Session:

Spark Context: Entry point to the low-level Spark functionality.
Spark Session: Unified entry point for both Spark SQL and Spark Core. It provides better integration with DataFrames and Datasets.

9. What are Delta Tables?

Delta tables are an extension of Parquet tables with ACID transactions, versioning, and schema enforcement.
Layer: Delta tables are typically used in the bronze, silver, and gold layers.
Benefits:
- ACID compliance.
- Time travel.
- Schema evolution.

10. Parquet Format:

Parquet is a columnar storage format.
It is fast because:
- It supports efficient compression and encoding.
- Only the required columns are read.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Breaking

Spark & Databricks Q&A

1. What is the difference between managed and external tables?

2. What is the use of the `collect` command?

3. Explain your project in detail about landing, bronze, silver, and gold layers:

4. How to convert managed into external tables?

5. How do you connect Key Vault with Databricks?

6. What is the `VACUUM` command?

7. `OPTIMIZE` vs `ZORDER` command:

8. Spark Internals:

9. What are Delta Tables?

10. Parquet Format:

By Apurva Suke