You can compute the difference between two versions of the table, but as you guessed it’s expensive to do. It’s also tricky to compute the actual difference when the Delta table has changes other than appends.
Usually when people ask about this, they’re trying to design their own system that gives them exactly-once processing of data from Delta to somewhere else; Spark Structured Streaming with the Delta source already exists to do this.
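For example, a minimal sketch of the streaming route (the output and checkpoint paths here are placeholders):

val changes = spark.readStream
  .format("delta")
  .load("/tmp/delta/t2") // the Delta table acts as a streaming source

changes.writeStream
  .format("parquet") // any sink works; the checkpoint is what gives you exactly-once bookkeeping
  .option("checkpointLocation", "/tmp/delta/t2-checkpoint")
  .start("/tmp/delta/t2-out")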
If you do want to write your own, you can read the transaction log directly (the protocol spec is at https://github.com/delta-io/delta/blob/master/PROTOCOL.md) and use the actions in the versions between the two you’re comparing to figure out which files have changes to read.
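As a rough sketch of what that looks like (assuming the table from the snippets below lives at /tmp/delta/t2 and you want the changes that took it from version 0 to version 1):

// Commit N is the JSON file _delta_log/<N zero-padded to 20 digits>.json and
// holds the actions that produced version N. Going from version 0 to 1 only
// needs commit 1.
val from = 0L
val to   = 1L
val commitFiles = ((from + 1) to to).map(v => f"/tmp/delta/t2/_delta_log/$v%020d.json")
val actions = spark.read.json(commitFiles: _*)
// For an append, each line has an `add` action with the file path; upserts and
// deletes also produce `remove` actions (select "remove.path" the same way).
actions.where("add IS NOT NULL").select("add.path").show(truncate = false)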
Please note that versions of a Delta table are cached (persisted by Spark), so loading the two versions as DataFrames is fairly cheap:
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/t2")
val v1 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta/t2")
// v0 and v1 are persisted - see Storage tab in web UI
Getting those v0 and v1 isn’t expensive; comparing the two can be both expensive and tricky. If the table is append-only, the difference is just (v1 - v0); if it has seen upserts, you have to handle (v0 - v1) as well; and if there have been metadata or protocol changes, it gets even trickier.
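In DataFrame terms, a sketch using the v0 and v1 loaded above (each except triggers a full scan and shuffle of both versions, which is where the expense comes from):

val appended = v1.except(v0) // rows present at v1 but not v0 - the whole diff if append-only
val removed  = v0.except(v1) // rows dropped or rewritten by upserts/deletes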
And when you do all that logic yourself, it’s suspiciously similar to re-implementing DeltaSource.
You may then consider the following:
import org.apache.spark.sql.delta.DeltaLog

val log = DeltaLog.forTable(spark, "/tmp/delta/t2")
val v0 = log.getSnapshotAt(0)  // Snapshot of the table at version 0
val actionsAtV0 = v0.state     // Dataset of the actions that make up that snapshot
val v1 = log.getSnapshotAt(1)
val actionsAtV1 = v1.state
actionsAtV0 and actionsAtV1 are all the actions that brought the Delta table to versions 0 and 1, respectively, and can be considered a CDC (change data capture) of the Delta table.
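From there you could, for instance, diff the file lists of the two snapshots (a sketch; the schema of state is an internal API and may change between Delta releases):

val filesAtV0 = actionsAtV0.where("add IS NOT NULL").select("add.path")
val filesAtV1 = actionsAtV1.where("add IS NOT NULL").select("add.path")
// Files in v1 but not v0 carry the new or changed data; the reverse set covers removals.
val addedFiles   = filesAtV1.except(filesAtV0)
val removedFiles = filesAtV0.except(filesAtV1)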
That's basically reading the transaction log, except using some of Delta’s internal APIs to make it easier.