
1. What is the difference between managed and external tables?
Managed Tables:
- In managed tables, the data and the metadata are both managed by the database or data warehouse system.
- When a managed table is dropped, both the table metadata and the underlying data are deleted.
- These tables are typically stored in the default database storage location.
External Tables:
- In external tables, only the metadata is managed by the database system, while the data remains in an external location (e.g., cloud storage like ADLS).
- Dropping an external table only removes the metadata; the actual data remains intact in its original location.
- Useful when data needs to be shared across different systems.
2. What is the use of the collect
command?
- The
collect
command is used to bring the data from all the worker nodes back to the driver node as a list. - When itโs used: It is commonly used when small datasets need to be brought into the driver for further local operations or analysis.
- Driver vs Worker Node: The
collect
command transfers data from each worker node to the driver node. - Scenarios: It is used when:
- Summarizing data.
- Performing local aggregations.
- Collecting small datasets for visualization or reporting.
Out-of-Memory Exceptions: These occur when:
- The driver tries to collect too much data into memory, exceeding its available heap space.
- Workers run out of memory while performing operations (e.g., shuffling large datasets).
3. Explain your project in detail about landing, bronze, silver, and gold layers:
- Landing Layer:
- Raw data is ingested from various source systems.
- No transformations are applied at this stage.
- Data is stored in its original format (e.g., CSV, JSON).
- Bronze Layer:
- Contains raw data with basic ingestion transformations (e.g., deduplication, adding metadata like ingestion timestamp).
- Data is stored in delta tables for efficient reads and writes.
- Silver Layer:
- Semi-processed data with cleansing, standardization, and applying business logic.
- Joins, filtering, and aggregations are performed.
- This layer is ready for further transformations or reporting.
- Gold Layer:
- Fully processed data that is ready for consumption by analytics or reporting tools.
- Data is aggregated and denormalized for performance.
Data Movement:
- Data is ingested into the landing layer, processed into bronze, transformed in silver, and finally stored in gold for analytics.
4. How to convert managed into external tables?
To convert a managed table to an external table:
- Create an external location where you want to store the data.
- Use the
CREATE TABLE ... AS SELECT
syntax:CREATE TABLE external_table USING DELTA LOCATION '<external-location>' AS SELECT * FROM managed_table;
5. How do you connect Key Vault with Databricks?
- Use Azure Key Vault-backed secret scope:
databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name <key-vault-dns>
- Access secrets using:
secret_value = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")
6. What is the VACUUM
command?
- The
VACUUM
command removes old files from Delta tables that are no longer in use. - Syntax:
VACUUM delta_table RETAIN 168 HOURS;
- It helps reduce storage costs by cleaning up stale data files.
7. OPTIMIZE
vs ZORDER
command:
OPTIMIZE
: Used to compact small files into larger ones, improving read performance.ZORDER
: Used in combination withOPTIMIZE
to colocate related data on disk, improving query performance for specific columns.OPTIMIZE delta_table ZORDER BY (column_name);
8. Spark Internals:
Spark Driver:
- The driver is the master node that maintains information about the Spark application and manages the execution of tasks.
- It coordinates the distribution of data and tasks across the worker nodes.
Spark Context vs Spark Session:
- Spark Context: Entry point to the low-level Spark functionality.
- Spark Session: Unified entry point for both Spark SQL and Spark Core. It provides better integration with DataFrames and Datasets.
9. What are Delta Tables?
- Delta tables are an extension of Parquet tables with ACID transactions, versioning, and schema enforcement.
- Layer: Delta tables are typically used in the bronze, silver, and gold layers.
- Benefits:
- ACID compliance.
- Time travel.
- Schema evolution.
10. Parquet Format:
- Parquet is a columnar storage format.
- It is fast because:
- It supports efficient compression and encoding.
- Only the required columns are read.