1. What is the difference between managed and external tables?

Managed Tables:

  • In managed tables, the data and the metadata are both managed by the database or data warehouse system.
  • When a managed table is dropped, both the table metadata and the underlying data are deleted.
  • These tables are typically stored in the default database storage location.

External Tables:

  • In external tables, only the metadata is managed by the database system, while the data remains in an external location (e.g., cloud storage like ADLS).
  • Dropping an external table only removes the metadata; the actual data remains intact in its original location.
  • Useful when data needs to be shared across different systems.
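
A minimal sketch of the drop behaviour described above, assuming a Databricks notebook where spark is the pre-created SparkSession; the table names and the ADLS path are hypothetical:

    # Managed table: Databricks controls both the metadata and the data files.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)
        USING DELTA
    """)

    # External table: only the metadata is registered; the data lives at a path we own.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
        USING DELTA
        LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales'
    """)

    # Dropping removes both metadata and files for the managed table,
    # but only the metastore entry for the external one; its files remain in ADLS.
    spark.sql("DROP TABLE sales_managed")
    spark.sql("DROP TABLE sales_external")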

2. What is the use of the collect command?

  • The collect command is used to bring the data from all the worker nodes back to the driver node as a list.
  • When it's used: It is commonly used when small datasets need to be brought into the driver for further local operations or analysis.
  • Driver vs Worker Node: collect transfers each worker node's partitions of the data to the driver node, where they are assembled into a single local list.
  • Scenarios: It is used when:
    • Summarizing data.
    • Performing local aggregations.
    • Collecting small datasets for visualization or reporting.

Out-of-Memory Exceptions: These occur when:

  • The driver tries to collect too much data into memory, exceeding its available heap space.
  • Workers run out of memory while performing operations (e.g., shuffling large datasets).
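
A small illustration of both points, with a hypothetical table and columns; the aggregation runs on the workers and only the compact result is collected to the driver:

    # Aggregate on the cluster first, then collect only the small result.
    summary = (
        spark.table("silver.orders")
             .groupBy("country")
             .count()
    )
    rows = summary.collect()          # list of Row objects on the driver
    for r in rows:
        print(r["country"], r["count"])

    # Anti-pattern: collecting a large table pulls every partition onto the driver
    # and can raise an out-of-memory error there.
    # all_rows = spark.table("silver.orders").collect()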

3. Explain your project in detail, covering the landing, bronze, silver, and gold layers:

  • Landing Layer:
    • Raw data is ingested from various source systems.
    • No transformations are applied at this stage.
    • Data is stored in its original format (e.g., CSV, JSON).
  • Bronze Layer:
    • Contains raw data with basic ingestion transformations (e.g., deduplication, adding metadata like ingestion timestamp).
    • Data is stored in delta tables for efficient reads and writes.
  • Silver Layer:
    • Semi-processed data that has been cleansed, standardized, and enriched with business logic.
    • Joins, filtering, and aggregations are performed.
    • This layer is ready for further transformations or reporting.
  • Gold Layer:
    • Fully processed data that is ready for consumption by analytics or reporting tools.
    • Data is aggregated and denormalized for performance.

Data Movement:

  • Data is ingested into the landing layer, processed into bronze, transformed in silver, and finally stored in gold for analytics.
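
A compressed PySpark sketch of that movement; the paths, table names, and the dedup/cleansing rules are placeholders rather than the actual project code:

    from pyspark.sql import functions as F

    # Landing -> Bronze: ingest raw files as-is, add ingestion metadata, deduplicate.
    raw = spark.read.json("abfss://landing@account.dfs.core.windows.net/orders/")
    bronze = (raw.dropDuplicates(["order_id"])
                 .withColumn("ingest_ts", F.current_timestamp()))
    bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")

    # Bronze -> Silver: cleanse, standardize, apply business rules.
    silver = (spark.table("bronze.orders")
                   .filter(F.col("status").isNotNull())
                   .withColumn("order_date", F.to_date("order_date")))
    silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

    # Silver -> Gold: aggregate and denormalize for reporting.
    gold = (spark.table("silver.orders")
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("daily_revenue")))
    gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")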

4. How do you convert a managed table into an external table?

To convert a managed table to an external table:

  1. Create an external location where you want to store the data.
  2. Use the CREATE TABLE ... AS SELECT syntax: CREATE TABLE external_table USING DELTA LOCATION '<external-location>' AS SELECT * FROM managed_table;
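
Spelled out as runnable statements with a hypothetical ADLS path; note that this creates a copy of the data at the external location rather than moving the original files:

    # 1. Register an external Delta table over a location we control,
    #    populated from the managed table.
    spark.sql("""
        CREATE TABLE sales_external
        USING DELTA
        LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales_external'
        AS SELECT * FROM sales_managed
    """)

    # 2. Optionally drop the managed table once the copy is verified;
    #    the external copy and its files are unaffected.
    spark.sql("DROP TABLE sales_managed")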

5. How do you connect Key Vault with Databricks?

  • Use Azure Key Vault-backed secret scope: databricks secrets create-scope --scope <scope-name> --scope-backend-type AZURE_KEYVAULT --resource-id <key-vault-resource-id> --dns-name <key-vault-dns>
  • Access secrets using: secret_value = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")
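
A typical follow-up, shown only as a sketch: feed the retrieved secret into a storage configuration so the cluster can read ADLS. The scope, secret key, and storage-account names are placeholders:

    # Read a storage account key from the Key Vault-backed scope
    # and use it for ADLS access via the account-key mechanism.
    storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-account-key")
    spark.conf.set(
        "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
        storage_key,
    )
    # Secret values are redacted if printed in a notebook.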

6. What is the VACUUM command?

  • The VACUUM command removes data files that are no longer referenced by a Delta table and are older than the retention threshold.
  • Syntax: VACUUM delta_table RETAIN 168 HOURS;
  • It helps reduce storage costs by cleaning up stale data files.
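
The same operation from a notebook, against a hypothetical table; 168 hours matches the default 7-day retention:

    # Remove data files no longer referenced by the Delta log
    # and older than the retention window.
    spark.sql("VACUUM silver.orders RETAIN 168 HOURS")

    # Equivalent via the Delta Lake Python API.
    from delta.tables import DeltaTable
    DeltaTable.forName(spark, "silver.orders").vacuum(retentionHours=168)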

7. OPTIMIZE vs ZORDER command:

  • OPTIMIZE: Used to compact small files into larger ones, improving read performance.
  • ZORDER: Used in combination with OPTIMIZE to colocate related data on disk, improving query performance for specific columns. OPTIMIZE delta_table ZORDER BY (column_name);
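
As a runnable pair of statements against a hypothetical table whose queries usually filter on customer_id:

    # Compact small files into larger ones.
    spark.sql("OPTIMIZE silver.orders")

    # Compact and co-locate rows with similar customer_id values,
    # so filters on that column can skip more files.
    spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id)")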

8. Spark Internals:

Spark Driver:

  • The driver is the master node that maintains information about the Spark application and manages the execution of tasks.
  • It coordinates the distribution of data and tasks across the worker nodes.

Spark Context vs Spark Session:

  • Spark Context: Entry point to the low-level Spark functionality.
  • Spark Session: Unified entry point for both Spark SQL and Spark Core. It provides better integration with DataFrames and Datasets.
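
A short sketch of the relationship (in a Databricks notebook both objects already exist as spark and sc):

    from pyspark.sql import SparkSession

    # SparkSession: unified entry point for DataFrames and Spark SQL.
    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.range(10)

    # SparkContext: lower-level entry point (RDDs, broadcasts, accumulators),
    # still reachable from the session.
    sc = spark.sparkContext
    rdd = sc.parallelize([1, 2, 3])
    print(df.count(), rdd.sum())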

9. What are Delta Tables?

  • Delta tables store data as Parquet files plus a transaction log, which adds ACID transactions, versioning, and schema enforcement on top of plain Parquet.
  • Layer: Delta tables are typically used in the bronze, silver, and gold layers.
  • Benefits:
    • ACID compliance.
    • Time travel.
    • Schema evolution.
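
Two of these benefits shown briefly, against a hypothetical table:

    # Time travel: query an earlier version of the table.
    v0 = spark.sql("SELECT * FROM silver.orders VERSION AS OF 0")

    # Schema evolution: append data that carries an extra column.
    new_data = spark.createDataFrame([(1, 9.99, "web")],
                                     ["order_id", "amount", "channel"])
    (new_data.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("silver.orders"))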

10. Parquet Format:

  • Parquet is a columnar storage format.
  • It is fast because:
    • It supports efficient compression and encoding.
    • Only the required columns are read.
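
A quick illustration of column pruning with throwaway data and a local path:

    # Write a small DataFrame as Parquet.
    df = spark.createDataFrame(
        [(1, "alice", 120.0), (2, "bob", 75.5)],
        ["id", "name", "amount"],
    )
    df.write.mode("overwrite").parquet("/tmp/demo_parquet")

    # Because Parquet is columnar, selecting a subset of columns
    # reads only those column chunks from disk.
    spark.read.parquet("/tmp/demo_parquet").select("id", "amount").show()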

