Improve REST Support for lazy snapshot loading#16207

Open
grantatspothero wants to merge 1 commit into apache:main from grantatspothero:gn/lazySnapshotLog

Conversation

@grantatspothero
Contributor

@grantatspothero grantatspothero commented May 4, 2026

An earlier PR added support for lazy snapshot loading: https://github.com/apache/iceberg/pull/6850/changes

This PR improves on that by also lazily loading the snapshotLog.

For tables with high numbers of snapshots (e.g. tables with low-latency commits) this can result in significant memory savings.

Considerations:

  • To maintain backwards compatibility, setSnapshotsSupplier is kept but deprecated
  • I don't believe this requires a REST catalog spec change; the description of REFS mode is generic enough that this shouldn't be a change in behavior
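The lazy-loading pattern being proposed can be sketched as a memoized supplier. This is an illustrative Python sketch of the idea only; the names `TableMetadataSketch` and `log_supplier` are hypothetical and do not reflect the actual Iceberg Java API:

```python
from typing import Callable, List, Optional, Tuple

# A HistoryEntry is essentially two longs: (timestamp_ms, snapshot_id).
HistoryEntry = Tuple[int, int]

class TableMetadataSketch:
    """Hypothetical sketch: the snapshotLog is materialized on first
    access via a supplier (e.g. a follow-up full-metadata fetch) instead
    of being parsed eagerly from the initial REST response."""

    def __init__(self, log_supplier: Callable[[], List[HistoryEntry]]):
        self._log_supplier = log_supplier
        self._snapshot_log: Optional[List[HistoryEntry]] = None

    def snapshot_log(self) -> List[HistoryEntry]:
        if self._snapshot_log is None:
            # First access: pay the loading cost once, then memoize.
            self._snapshot_log = self._log_supplier()
        return self._snapshot_log
```

Under this pattern, clients that never touch the log (e.g. read-only queries over the current snapshot) never pay its memory cost.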

@github-actions github-actions Bot added the core label May 4, 2026
@grantatspothero grantatspothero changed the title Gn/lazy snapshot log Improve REST Support for lazy snapshot loading May 4, 2026
@grantatspothero grantatspothero force-pushed the gn/lazySnapshotLog branch 2 times, most recently from c9bc366 to c756769 Compare May 4, 2026 17:11
@gaborkaszab
Contributor

Hi @grantatspothero ,
I see this is a draft PR, but it grabbed my attention as I was investigating the lazy snapshot loading area recently. Could you help me understand what exactly it is that worries you w.r.t. the snapshot log? Is it network traffic, memory consumption on the client side, or something else?
The reason I ask is that initially I'd think there isn't much we can gain by lazily loading the snapshot log, because each log entry is just 2 longs. So with thousands of snapshots we are still in low-kilobytes territory in terms of memory usage. Above that amount of snapshots you're doomed anyway :)

@grantatspothero
Contributor Author

grantatspothero commented May 4, 2026

Our problem was excessive memory usage due to caching TableMetadata on the client side.

Storing a List<HistoryEntry> in memory is fine for small numbers of snapshots, but each entry takes ~32 bytes, and this grows quickly when a single coordinator service caches Iceberg metadata in memory.

Example:

  • 1000 table metadata objects cached in memory
  • each table commits every 30 s with 30 days of snapshot retention: 2*60*24*30 ≈ 86K, roughly 100K snapshots in the Iceberg metadata
  • 32 bytes * 100K ≈ 3.2 MB of snapshotLog per table
  • 3.2 MB/table * 1000 tables ≈ 3.2 GB

Note: this is "resident set size", not "total allocations", which tends to be significantly higher due to intermediate allocations while parsing JSON.

For multi-tenant coordinator services (e.g. query engines, cache services) this memory usage is a problem. The biggest memory hog is by far the snapshots array, but snapshotLog is the next biggest. Since Iceberg already defers loading snapshots, it seemed reasonable to defer snapshotLog as well.
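The estimate above can be reproduced with a quick back-of-envelope calculation. All figures are the assumptions from this comment (~32 bytes per HistoryEntry, one commit every 30 s, 30 days of retention, 1000 cached tables), not measured values:

```python
BYTES_PER_ENTRY = 32    # assumed resident size of one HistoryEntry (2 longs + overhead)
COMMITS_PER_MINUTE = 2  # one commit every 30 s
RETENTION_DAYS = 30
CACHED_TABLES = 1000

snapshots_per_table = COMMITS_PER_MINUTE * 60 * 24 * RETENTION_DAYS  # 86,400, i.e. ~100K
log_bytes_per_table = BYTES_PER_ENTRY * snapshots_per_table          # bytes of snapshotLog per table
total_bytes = log_bytes_per_table * CACHED_TABLES                    # across the whole cache

print(f"{snapshots_per_table:,} snapshots/table, "
      f"{log_bytes_per_table / 1e6:.1f} MB/table, "
      f"{total_bytes / 1e9:.1f} GB total")
```

With the rounded ~100K figure this gives the ~3.2 MB/table quoted above; the exact numbers land slightly lower (~2.8 MB/table, ~2.8 GB total).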

@grantatspothero
Contributor Author

grantatspothero commented May 4, 2026

Above that amount of snapshots you're doomed anyway :)

It is becoming more common to have large numbers of snapshots in Iceberg due to the prevalence of streaming ingestion and low-latency commits.

See mailing list discussions: https://www.mail-archive.com/dev@iceberg.apache.org/msg12764.html
Examples: kafka-connect iceberg sink, Confluent Tableflow, Starburst streaming ingestion.

This doesn't solve the full problem mentioned in that mailing list thread (writes still pay the full cost of writing snapshots/snapshotLog), but it does solve the problem for readers. And for query engine/caching use cases, reads >> writes, so this could be beneficial.

Previously only lazily loaded snapshots
@grantatspothero grantatspothero marked this pull request as ready for review May 4, 2026 22:03
@gaborkaszab
Contributor

Thank you for the explanation, @grantatspothero !
I feel that ~100K-snapshot tables are at the very extreme end of use cases. I'm also wondering, if the table changes every 30 seconds, whether there is any point in caching it at all. We'd need to reload it frequently anyway.
I wanted to advise you to reach out to the dev@ list to get wider community feedback on this. I see you've already done so, thanks!

I can take a look at the code; if the improvement is simple enough, I don't see why not to include it. If it's messy or complicated, we might need some community support to get it through.

@grantatspothero
Contributor Author

grantatspothero commented May 5, 2026

I'm also wondering, if the table changes every 30 seconds, whether there is any point in caching it at all.

Two different definitions of cache:

  1. "Within-query metadata caching". Within a single query's lifetime, TableMetadata must live in coordinator memory. Queries are usually short but can sometimes take hours, wasting coordinator memory for the duration of long-running queries. This wasted memory scales with the number of concurrent queries and the number of tables per query. Compare this to the Hive table model, where coordinator memory is mostly bounded.
  2. "Cross-query metadata caching". I believe this is what you are talking about. Trino does not support cross-query table metadata caching today, but some engines do and have problems. With a cross-query cache it is difficult to control caching at a fine granularity: "cache these long-lived table metadatas but not these constantly changing ones".

@gaborkaszab
Contributor

Thanks again for the explanation @grantatspothero !
Let's see if @amogh-jahagirdar also chimes in, as mentioned on the dev@ list. I'm not saying that this enhancement isn't needed, but it would be great to see wider community consensus.

In the meantime, I think the fix logically has two parts:

  1. Don't cache the snapshotLog on the client side if we are in REFS mode
  2. Lazily load the snapshotLog on request, similarly to snapshots
    I think this PR only takes care of the second part, but we still put the entire snapshotLog into TableMetadata if the server sends it. Or in other words, if the REST server returns a reduced snapshotLog in REFS mode, then your problem is already solved without this PR; if it sends the full snapshotLog, then this PR won't solve the caching overhead issues for you.
    Take a look at this change I've put on top of yours: I changed the reference REST catalog that we use for REST catalog tests to still return a full snapshotLog, as before this PR, and kept all the other changes you had. The test I introduced to check the length of the snapshotLog expects a reduced length, but fails.
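The server-side half of the fix amounts to filtering the log down to entries whose snapshots are still reachable from a ref, analogous to what suppressHistoricalSnapshots does for the snapshots list. A hypothetical Python sketch of that filtering (illustrative only, not the actual Iceberg implementation; real ref handling also involves ref retention settings):

```python
from typing import Dict, List, Tuple

HistoryEntry = Tuple[int, int]  # (timestamp_ms, snapshot_id)

def suppress_historical_log_entries(
    snapshot_log: List[HistoryEntry],
    refs: Dict[str, int],  # branch/tag name -> snapshot id it points at
) -> List[HistoryEntry]:
    """Keep only log entries whose snapshot is referenced by a branch or
    tag; a sketch of what a REFS-mode server response could contain."""
    referenced = set(refs.values())
    return [entry for entry in snapshot_log if entry[1] in referenced]
```

For example, a log `[(0, 1), (1, 2), (2, 3)]` with refs `{"main": 3, "v1": 1}` would reduce to `[(0, 1), (2, 3)]`.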

@amogh-jahagirdar
Contributor

Big +1 to everything @gaborkaszab said in #16207 (comment); that is also where I landed in my reply on the dev list. But thank you @grantatspothero for the explanation of the use case. I'll take a look at this PR today.

@grantatspothero
Contributor Author

Or in other words, if the REST server returns a reduced snapshotLog in REFS mode, then your problem is already solved without this PR; if it sends the full snapshotLog, then this PR won't solve the caching overhead issues for you.

Walking through the two points you mentioned above:

  1. "REST server returns a reduced snapshotLog in REFS mode": This improves client-side memory usage, but without snapshot lazy loading the Iceberg TableMetadata interface is broken. Snapshot lazy loading is needed to maintain correctness, and it is why lazy loading already exists in Iceberg.
  2. "If it sends the full snapshotLog, then this PR won't solve the caching overhead": Yes, that is why this PR also modified suppressHistoricalSnapshots to remove snapshot log entries not referenced by refs. This is called server-side by the REST catalog in REFS mode.

Summary: we need both the client-side and the server-side change: server-side suppression of historical snapshots in REFS mode, and client-side lazy loading of historical snapshots when requested.

Can you help me understand what your commit above is trying to test? It removes the server-side change to suppress historical snapshots in REFS mode, and as mentioned above we need both parts.

@gaborkaszab
Contributor

@grantatspothero
I agree with 1). In fact I've done some research, and Polaris, for instance, sends a reduced snapshotLog, not just a reduced list of snapshots, in REFS mode. At the moment this is a broken contract between Polaris and Iceberg, IMO. You're right: if we don't include the snapshotLog in the lazy loading mechanism, then the internal state of TableMetadata will be broken after we lazily load the snapshots.
The reason we haven't caught this is that in the reference IRC we always populated the full snapshotLog even in REFS mode (see suppressHistoricalSnapshots), so no tests actually covered a reduced snapshotLog.
cc @amogh-jahagirdar to double-check, but I think we have an issue here between Polaris and Iceberg that this PR can fix.

I disagree with 2) though, or I misunderstand the motivation :) Let me explain:
That change to suppressHistoricalSnapshots on the REST server side is for testing, to simulate the behavior of a potential REST server implementation (unless you use the reference IRC as your production REST catalog, which I wouldn't advise). So what you did with that change is simulate a REST catalog server that reduces the snapshotLog, not just the list of snapshots.
This doesn't mean that all REST server implementations will work like this once the PR is merged; there could be ones that still send the full snapshotLog. That's why I said that for REST servers that reduce the snapshotLog, your memory concerns are already gone (we'd still need the lazy loading, though), but for REST servers that send the full snapshotLog, we still load the full log on the client side. This is what I tried to show with my commit: if a server still sends the full list, then this PR doesn't solve the caching issue.

I might be overthinking this. If you have full control over the REST catalog server implementation (i.e. your own proprietary implementation), then you're fine here. If you don't, and you connect to arbitrary REST catalogs out there, then you have no control over whether they send a filtered or full snapshotLog in REFS mode.

Sorry if I got this too long or complicated, hope it makes sense!

@grantatspothero
Contributor Author

grantatspothero commented May 6, 2026

Re: Polaris, good to know. Sounds like an example of where this PR could be valuable in standardizing the approach to handling the snapshotLog.

Re: other REST catalog implementations, thank you for the explanation, that makes sense. For arbitrary REST catalog implementations there is no guarantee on the performance of using REFS mode: a catalog could choose to send the whole snapshotLog, or only the referenced entries from the log. Hopefully over time more REST catalog implementations opt in to the performance benefit. If catalog implementations in REFS mode are already calling suppressHistoricalSnapshots(), then they will opt in automatically.

@nastra
Contributor

nastra commented May 7, 2026

Just wanted to mention that I attempted essentially the same thing in #8173 a while ago but there were some concerns about increased complexity.

@gaborkaszab
Contributor

Hey @grantatspothero ,
I had a discussion with @nastra about the original intention of the REFS snapshot mode. The REST spec is slightly ambiguous to me here (though whoever I asked understood it clearly :) ), but the intention is to have effect on the list of snapshots returned by the REST catalog and not on the snapshot log. This is kind of bad news for your approach of reducing the memory footprint by keeping a reduced snapshot log, I'm afraid, because all REST catalog servers that comply with the spec have to return a full snapshot log regardless of the snapshot mode.

Now, this PR might still make sense, IMO, if we want to be foolproof on the client side and not break the internal state of the TableMetadata if a catalog returns a reduced snapshot log in REFS mode for some reason (maybe misinterpreting the spec), and we then load the snapshots lazily (but not the log). But then we would be guarding against behavior in other projects that in fact breaks the contract of the REST spec. I'm somewhat hesitant to do that, but lean towards having this PR anyway. I'm curious about others' opinions here.
Note, we also discussed with @nastra that we probably want to reach out to Polaris to fix this on their side too, i.e. to return the full snapshot log.

Taking one step back, what I still don't entirely get is this: @grantatspothero you mentioned that you'd like to reduce memory pressure by not keeping the entire snapshot log in REFS mode. Could you help me understand the full picture of how you're trying to achieve that? This PR doesn't help you with that.

@grantatspothero
Contributor Author

that the intention is to have effect on the list of snapshots returned by the REST catalog and not on the snapshot log

Under this definition I agree this PR changes the behavior of REFS mode, and that is not good. That said, the spec is ambiguous as written, given that different catalog implementations are already interpreting it differently. Taking a step back, what was the purpose of introducing REFS mode? I thought it was introduced as a performance optimization to allow clients to have a slim representation of TableMetadata for common operations while falling back to the full representation for less common ones. This PR seems to fit that goal of REFS mode, but maybe REFS mode was introduced for a different reason?

Taking one step back, what I still don't entirely get is this: @grantatspothero you mentioned that you'd like to reduce memory pressure by not keeping the entire snapshot log in REFS mode. Could you help me understand the full picture of how you're trying to achieve that? This PR doesn't help you with that.

Consider a multi-tenant coordinator service (i.e. a query engine like Trino). If the coordinator service runs a read-only query that doesn't access historical snapshots, then loading a slim version of the table metadata in the client is enough to satisfy query planning entirely. The full table metadata must only be pulled into the client for read queries using time travel or for write queries. Both of these query types happen significantly less often than read-only queries over the current snapshot.

@grantatspothero
Contributor Author

Just wanted to mention that I attempted essentially the same thing in #8173 a while ago but there were some concerns about increased complexity.

Thank you for the context, I missed that PR. Given it is a few years later and low-latency commits into Iceberg are becoming more prevalent, it might be worth another round of feedback from @rdblue

@gaborkaszab
Contributor

Thanks for the explanation @grantatspothero !

Consider a multi-tenant coordinator service (i.e. a query engine like Trino). If the coordinator service runs a read-only query that doesn't access historical snapshots, then loading a slim version of the table metadata in the client is enough to satisfy query planning entirely. The full table metadata must only be pulled into the client for read queries using time travel or for write queries.

I get this; what I don't get is how this PR helps with achieving this. If a REST server sends the full snapshot log we still put that full log into the TableMetadata, so this PR doesn't help you with reducing the memory pressure, right?

this PR changes the behavior of REFS mode

I think this is partially correct: this PR changes the behavior of REFS mode for the reference REST catalog we use for tests.

@grantatspothero
Contributor Author

I get this; what I don't get is how this PR helps with achieving this. If a REST server sends the full snapshot log we still put that full log into the TableMetadata, so this PR doesn't help you with reducing the memory pressure, right?

Yes, agreed: this PR is only helpful if REFS mode is defined per spec as "truncate both snapshots and snapshotLog to only the snapshot entries referenced by refs".

You are suggesting that is not the current definition of REFS mode per the spec. I'm suggesting the spec is ambiguous as currently written, and that to meet the goals of REFS mode (slim metadata loading on clients) it might be useful to be more explicit in the spec.

From the spec, for REFS mode:

          description:
            The snapshots to return in the body of the metadata. Setting the value to `all` would
            return the full set of snapshots currently valid for the table. Setting the value to
            `refs` would load all snapshots referenced by branches or tags.

            Default if no param is provided is `all`.
          required: false
          schema:
            type: string
            enum: [ all, refs ]

where "The snapshots" could refer to just the snapshots array, or to both the snapshots array and the snapshotLog. Right now it is ambiguous, and some implementations have interpreted it differently.

@gaborkaszab
Contributor

FYI, I opened a PR just to clarify the original intent of REFS mode. cc @nastra
In case we want to change the expected behavior, I think it should start as a conversation on dev@.

Now with this PR, the question is whether we want to be robust enough on the client side not to have an inconsistent snapshotLog w.r.t. lazy snapshot loading for server implementations that interpreted the spec differently than intended. cc @amogh-jahagirdar @nastra WDYT?

@nastra
Contributor

nastra commented May 8, 2026

Now with this PR, the question is whether we want to be robust enough on the client side not to have an inconsistent snapshotLog w.r.t. lazy snapshot loading for server implementations that interpreted the spec differently than intended.

IMO the spec is pretty clear, because it mentions snapshots specifically and doesn't mention the snapshotLog. We also have a reference server implementation that only truncates the snapshots, and if a server implementation decides to truncate the snapshotLog as well, then that is a bug IMO. I don't think we should be fixing the client implementation in such a case. And even if we did, you'd have to apply the same fix across multiple client implementations just because a server returns incorrect table metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants