Support CREATE EXTERNAL TABLE backed by a Catalog with DataFusion #2021

@CTTY

Background

DataFusion already supports CREATE EXTERNAL TABLE ... STORED AS ICEBERG. Today, iceberg-rust integrates via IcebergTableProviderFactory, but the factory primarily supports registering a static table (e.g., created from a metadata JSON path). That works for:

-- Static table (existing, backward compatible)
CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG
LOCATION '/path/to/metadata.json';

However, we also want CREATE EXTERNAL TABLE to create a normal IcebergTableProvider backed by a Catalog, so users can define the catalog via SQL OPTIONS (and then resolve tables by identifier through that catalog).

Dumping my thoughts here; feedback is welcome!

Option A: Build Catalog inside the ProviderFactory using OPTIONS

IcebergTableProviderFactory parses OPTIONS and uses a CatalogBuilder to construct the Catalog internally, then creates a normal IcebergTableProvider backed by that catalog.

CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG
LOCATION 'ignored_or_optional' -- ignored if a catalog is configured
OPTIONS (
  -- if the catalog type is not configured, fall back to creating a static table
  'datafusion.iceberg.catalog.type' = 'rest',
  'datafusion.iceberg.catalog.uri' = 'http://localhost:8181',
  'datafusion.iceberg.catalog.warehouse' = 's3://bucket/warehouse'
);
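
To make the fallback rule in Option A concrete, here is a minimal sketch of how the factory might inspect the OPTIONS map. It uses only the standard library; `TableSource` and `resolve_source` are hypothetical names, not existing iceberg-rust APIs, and the sketch assumes the option keys reach the factory exactly as written in the SQL above (DataFusion may normalize them differently).

```rust
use std::collections::HashMap;

/// Key prefix taken from the SQL example above.
const CATALOG_PREFIX: &str = "datafusion.iceberg.catalog.";

/// Hypothetical outcome of inspecting the OPTIONS of a CREATE EXTERNAL TABLE.
#[derive(Debug, PartialEq)]
pub enum TableSource {
    /// No catalog type configured: fall back to a static table at LOCATION.
    Static { metadata_location: String },
    /// Catalog configured: LOCATION is ignored and the remaining properties
    /// would be handed to a catalog builder.
    Catalog {
        catalog_type: String,
        props: HashMap<String, String>,
    },
}

/// Decide between the static-table path and the catalog-backed path.
pub fn resolve_source(location: &str, options: &HashMap<String, String>) -> TableSource {
    // Keep only options under the catalog prefix, with the prefix stripped.
    let mut props: HashMap<String, String> = options
        .iter()
        .filter_map(|(k, v)| {
            k.strip_prefix(CATALOG_PREFIX)
                .map(|short| (short.to_string(), v.clone()))
        })
        .collect();

    // The presence of `type` is what switches the factory into catalog mode.
    match props.remove("type") {
        Some(catalog_type) => TableSource::Catalog { catalog_type, props },
        None => TableSource::Static {
            metadata_location: location.to_string(),
        },
    }
}
```

With the options from the example, this would yield `TableSource::Catalog` with type `rest` and the `uri` and `warehouse` properties; with no catalog options it yields `TableSource::Static`, which keeps the existing behavior.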

Option B: Allow injecting a pre-built Catalog into the factory

Essentially, we would have:

pub struct IcebergTableProviderFactory {
    catalog: Option<Arc<dyn Catalog>>, // when it is None, fall back to a static table
}
...
IcebergTableProviderFactory::new_with_catalog(catalog: Arc<dyn Catalog>)
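
A self-contained sketch of how the two construction paths in Option B could dispatch. The `Catalog` trait here is a simplified stand-in for the iceberg-rust trait, `NamedCatalog` and `describe` are hypothetical and exist only for illustration, and the real factory would build a table provider instead of returning a string.

```rust
use std::sync::Arc;

/// Simplified stand-in for the iceberg-rust `Catalog` trait (illustration only).
pub trait Catalog: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical catalog used only to exercise the sketch.
pub struct NamedCatalog(pub String);

impl Catalog for NamedCatalog {
    fn name(&self) -> &str {
        &self.0
    }
}

pub struct IcebergTableProviderFactory {
    /// When this is `None`, the factory falls back to a static table.
    catalog: Option<Arc<dyn Catalog>>,
}

impl IcebergTableProviderFactory {
    /// Existing behavior: no catalog, tables come from a metadata location.
    pub fn new() -> Self {
        Self { catalog: None }
    }

    /// Proposed constructor: inject a pre-built catalog.
    pub fn new_with_catalog(catalog: Arc<dyn Catalog>) -> Self {
        Self {
            catalog: Some(catalog),
        }
    }

    /// Report which path table creation would take for the given table.
    pub fn describe(&self, table_name: &str, location: &str) -> String {
        match &self.catalog {
            Some(catalog) => format!("load {} via catalog {}", table_name, catalog.name()),
            None => format!("static table {} from {}", table_name, location),
        }
    }
}
```

Keeping the catalog as an `Option` means existing callers of the no-argument constructor see no change, which matches the backward-compatibility goal stated in the background.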

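I prefer this option because it is much more straightforward. One drawback I can think of is that users cannot easily use multiple catalogs at the same time, since each factory holds a single catalog. A workaround is to register one factory per catalog under a distinct format name: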

state
        .table_factories_mut()
        .insert("ICEBERG_REST_A".to_string(), Arc::new(IcebergTableProviderFactory::new_with_catalog(rest_catalog_a)));

state
        .table_factories_mut()
        .insert("ICEBERG_REST_B".to_string(), Arc::new(IcebergTableProviderFactory::new_with_catalog(rest_catalog_b)));

and then, when creating the table using SQL:

CREATE EXTERNAL TABLE my_table
STORED AS ICEBERG_REST_A
...
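
The multi-catalog workaround amounts to a lookup from the STORED AS name to a catalog-backed factory. A minimal sketch of that mapping, using a plain `HashMap` as a stand-in for the map returned by `table_factories_mut()`; `CatalogFactory`, `FactoryRegistry`, `register_catalogs`, and `resolve` are hypothetical names, and the exact-match lookup is an assumption (DataFusion may normalize the format name before looking it up).

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a catalog-backed factory; it only records
/// which catalog it wraps.
pub struct CatalogFactory {
    pub catalog_name: String,
}

/// Hypothetical stand-in for the factory map held by the session state.
pub type FactoryRegistry = HashMap<String, Arc<CatalogFactory>>;

/// Register one factory per (format name, catalog name) pair.
pub fn register_catalogs(registry: &mut FactoryRegistry, catalogs: &[(&str, &str)]) {
    for (format_name, catalog_name) in catalogs {
        registry.insert(
            format_name.to_string(),
            Arc::new(CatalogFactory {
                catalog_name: catalog_name.to_string(),
            }),
        );
    }
}

/// Look up the factory for a given STORED AS name (exact match assumed).
pub fn resolve<'a>(
    registry: &'a FactoryRegistry,
    stored_as: &str,
) -> Option<&'a Arc<CatalogFactory>> {
    registry.get(stored_as)
}
```

The cost of this approach is that the catalog choice is encoded in the format name, so each additional catalog needs its own registration and its own STORED AS keyword.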

Willingness to contribute

I can contribute to this feature independently

Metadata

Labels: enhancement (New feature or request)
