It's funny seeing this as the top story.
I'm in the middle of putting together a realtime strategy game using Differential Datalog[1] and Rust, with DDlog managing the game's logic. Mostly as an excuse to expose myself to new ideas and engage in a whole lot of yak shaving.
[1] https://github.com/vmware-archive/differential-datalog
That's a cool demo you're building using ddlog! FWIW, we, the ddlog team, have moved on to found Feldera (https://github.com/feldera/feldera). You could consider using DBSP directly through Rust.
I wonder if you could make a frankenstein version of differential datalog by combining the OP repo with salsa[1] (the crate that powers rust-analyzer)
[1] https://github.com/salsa-rs/salsa
Oh, nice!
I'll be interested in reading how this goes!
Very cool, I'm curious to see what the state of that implementation is and how far you get, since DDLog is not being actively maintained anymore.
I had some light contact with Prolog long ago during my studies - I have a rough idea how it is used and what it can be useful for, but only at the surface level, not deep at all. Since then I keep hearing about Datalog as some amazing thing, but I don't seem able to understand what it is - i.e. to grasp an answer to a simple question:
what is it that Datalog improves over Prolog?
Just now I tried to skim the Wikipedia page on Datalog; the vague theory I'm getting from it is that maybe Prolog has relatively poor performance, whereas Datalog dramatically improves performance (presumably allowing much bigger datasets and much more parallelized processing), at the cost of reduced expressiveness and features in some other important ways (including making it no longer Turing-complete)? Is that what it's about, or am I completely missing the mark?
From what I know, Prolog looked declarative, in that you just encode relations and it figures out the answers, but the answers really depended on the order of the rules, and on additional instructions like "cut", which not only prevented wasted computation but could affect the results.
Datalog, on the other hand, is more or less a relational DB with a different syntax.
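To make that concrete, here's a minimal sketch (using a hypothetical parent(person, child) table) of how a single non-recursive Datalog rule is really just a relational view; the rule is shown as a comment for contrast. Unlike Prolog, neither the order of the body atoms nor the order of rules changes the answer:

    -- Datalog:  grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    -- The same derived relation as a plain SQL view: a self-join plus projection.
    CREATE VIEW grandparent AS
    SELECT p1.person AS grandparent, p2.child AS grandchild
    FROM parent AS p1
    JOIN parent AS p2 ON p2.person = p1.child;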
I made some progress porting mangle datalog to Rust https://github.com/google/mangle/tree/main/rust - it is in the same repo as the golang implementation.
It is slow going, partly because it is not a priority, partly because I suffer from second-system syndrome. Mangle Rust should handle data of any size by reading and writing facts to disk via memory mapping. The golang implementation is in-memory.
This post is nice because it parses Datalog and mentions the LSM tree, and it is much easier to follow than the datafrog stuff.
There are many Datalog implementations in Rust (ascent, crepe) that use proc macros. The downside is that they can't accept queries at runtime. For the static analysis use case, where queries/programs are fixed, the proc macro approach might be better.
"I, a notorious villain, was invited for what I was half sure was my long-due comeuppance." -- Best opening line of a technical blog post I've read all year.
The narrator's interjections were a great touch. It's rare to see a post that is this technically deep but also so fun to read. The journey through optimizing the aliasing query felt like a detective story. We, the readers, were right there with you, groaning at the 50GB memory usage and cheering when you got it down to 5GB.
Fantastic work, both on the code and the prose.
[flagged]
No offence, but your comment violates the HN guidelines: https://news.ycombinator.com/newsguidelines.html
It is nice to see a core group of Datalog enthusiasts persist, even though the current Datalog revival seems to be on the decline. The recent Datalog 2.0 conference was quite small compared to previous years, and the second HYTRADBOI conference was very light on Datalog as well, while a quarter of the first one's submissions had a Datalog connection.
I'm encouraged by the other commenters sharing their recent Datalog projects. I am currently building a set of data quality pipelines for a legacy SQL database in preparation for a huge software migration.
We find Datalog much more useful than SQL for identifying and hunting down data quality issues, as the queries can be incredibly readable when well-structured.
No offense, but I wouldn't take Datalog 2.0's small attendance as an exemplar of Datalog's decline, even if I agree with that high-level point. Datalog 2.0 is a satellite workshop of LPNMR, a relatively unknown European conference that was, somewhat randomly, held in Dallas this year. I attended Datalog 2.0 myself and also felt the event was relatively sparse. I also had a paper (not my primary work, the first author is the real wizard of course :-) at the workshop. I saw relatively few folks in that space even attending the event--with the notable exception of some European folks (e.g., introducing the Nemo solver).
All of this is to say, I think Datalog 2.0's sparse attendance this year may be more indicative of the fact that it is a satellite workshop of an already less prestigious conference (itself not even the main event! That was ICLP!) rather than of a lack of Datalog implementation excitement.
For what it's worth, none of what I'm saying is meant to rebut your high-level point that there is little novelty left in implementing raw Datalog engines. Of course I agree, the research space has moved far beyond that (arguably it did a while ago) and into more exotic problems involving things like streaming (HydroFlow), choice (Dusa), things that get closer to the general chase (e.g., Egglog's chase engine), etc. I don't think anyone disagrees that vanilla Datalog is boring; it's just that monotonic, forward-chaining saturation (Horn clauses!) is a rich baseline with a well-understood engineering landscape (especially in the high-performance space) on which to build out more interesting theories (semirings, Z-sets, etc.).
The USA is a hostile environment for Europeans to visit. Avoiding it is a theme all over western and northern Europe.
If you were expecting European attendance, that would explain why it's different from the past.
It's the same in science circles, tourism and even business travel.
As a European, I'm definitely avoiding the US. I was asked whether I wanted to visit our US branch and I declined - which was simply accepted and probably expected in advance. Almost everyone I know does the same.
I like the author's datalog work generally, but I really wish his introductory material did not teach using binary join, which I found to get very messy internally as soon as you get away from the ideal case. I found the generic join style methods to be much, much simpler to generalize in one's head (see https://en.wikipedia.org/wiki/Worst-case_optimal_join_algori...).
related: McSherry's preceding blog post was all about demonstrating how binary joins can achieve worst-case optimal runtime, given suitable adjustments to the query plan.
- https://github.com/frankmcsherry/blog/blob/master/posts/2025...
For materialization-heavy workloads (program analysis, etc.), we often find that optimized binary join plans (e.g., profile-optimized, hand-optimized, etc.) beat worst-case optimal plans due to the ability to get better scalability (less locking) without the need to use a trie-based representation. Within the space of worst-case optimal plans, there are still lots of choices: but a bad worst-case optimal plan can often beat a bad (randomly-chosen) binary plan. And of course (the whole point of this exercise), there are some queries where every binary plan explodes and you do need WCOJ. There's also some work on making more traditional binary joins robust (https://db.in.tum.de/people/sites/birler/papers/diamond.pdf), among other interesting work (https://arxiv.org/html/2502.15181v1). Effectively parallelizing WCOJs is still an open problem as far as I am aware (at least, that is what folks working on it tell me), but I believe there are some exciting potential directions that several folks are working on.
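For anyone wondering what "a query where every binary plan explodes" looks like, the textbook case is the triangle query. A sketch over hypothetical relations r(a, b), s(b, c), t(a, c):

    -- Triangle query. Any binary plan must first join two of the three
    -- relations, and that intermediate result can be far larger than the
    -- final set of triangles; a worst-case optimal join instead binds one
    -- variable (a, then b, then c) against all three relations at once.
    SELECT r.a, r.b, s.c
    FROM r
    JOIN s ON s.b = r.b
    JOIN t ON t.a = r.a AND t.c = s.c;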
Some Clojure fans once told me they thought datalog was better than SQL and it was a shame that the relational DBs all used SQL. I never dug into it enough to find out why they thought that way.
Basically Datalog is much less verbose than SQL, imposes much lighter tariffs on factoring out views, and supports transitive closure enormously better. I started http://canonical.org/~kragen/binary-relations off with a simple nonrecursive query for which the SQL translation (given below) is already criminal and whose properly factored SQL solution merits the death penalty.
Recent additions to ANSI SQL have added the capacity for recursion, so it's no longer completely impossible. But they have three big disadvantages:
1. They accidentally made SQL Turing-complete. Datalog queries, by contrast, are guaranteed to terminate.
2. They're still extremely clumsy to use (see the sketch after this list).
3. Because of #1, they're often not implemented fully, so they are hard to rely on.
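To illustrate point 2, here is roughly what simple transitive closure over a hypothetical edge(src, dst) table looks like with a standard recursive CTE; the equivalent Datalog is the two one-line rules in the comment:

    -- Datalog:  reach(X, Y) :- edge(X, Y).
    --           reach(X, Z) :- edge(X, Y), reach(Y, Z).
    WITH RECURSIVE reach(src, dst) AS (
        SELECT src, dst FROM edge
        UNION              -- UNION (not UNION ALL) so cycles terminate
        SELECT e.src, r.dst
        FROM edge AS e
        JOIN reach AS r ON r.src = e.dst
    )
    SELECT * FROM reach;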
Yes, #1 basically means that they screwed up the design from the get go, since it is impossible to reap the actual benefits of Datalog when the language you evaluate is not, in fact, Datalog. Recursive queries have the ability to perform arbitrary computation in projections, so for starters any top-down evaluation strategy or hybrid evaluation such as magic sets is ruled out.
I struggle to understand the Clojure/Datomic dialect, but I agree generally. I recommend Percival for playing around with Datalog in a friendly notebook environment online: https://percival.ink/
Although there’s no “ANSI SQL”-equivalent standard across Datalog implementations, once you get the hang of the core idea it’s not too hard to understand another Datalog.
I started a Percival fork that compiles the Datalog to SQLite, if you want to check out how the two can express the same thing: https://percival.jake.tl/ (unfinished when it comes to aggregates and more advanced joins, but the basic forms work okay). Logica is a much more serious / complete Datalog->SQL compiler written by a Google researcher that compiles to BigQuery, DuckDB, and a few other SQL dialects (https://logica.dev/).
One area where Datalog is an order of magnitude easier is recursive queries / rules; this is possible in SQL but feels a bit like drinking playdough through a straw. Frank’s Materialize.com has a “WITH MUTUALLY RECURSIVE” SQL form (https://materialize.com/blog/recursion-in-materialize/) that’s much nicer than the ancient ANSI SQL recursive approach; we’re evaluating it for page load queries & data sync at Notion.
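Going by the linked blog post, the Materialize form looks roughly like the sketch below (bindings declare their column types, and multiple bindings can reference each other freely); exact details may have drifted since I last looked:

    WITH MUTUALLY RECURSIVE
        reach (src int, dst int) AS (
            SELECT src, dst FROM edge
            UNION
            SELECT e.src, r.dst
            FROM edge AS e
            JOIN reach AS r ON r.src = e.dst
        )
    SELECT * FROM reach;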
Feldera has a similar form for recursive views as well (https://www.feldera.com/blog/recursive-sql-queries-in-felder...). I like that Feldera lets you make each “rule” or subview its own statement rather than needing to pack everything into a single huge statement. Main downside I found when testing Feldera is that their SQL dialect has a bunch of limitations inherited from Apache Calcite, whereas the Materialize SQL dialect tries very hard to be PostgreSQL-compatible.
> Main downside I found when testing Feldera is that their SQL dialect has a bunch of limitations inherited from Apache Calcite
At Feldera, we're adding features to our SQL over time, by contributing them upstream to Calcite, making it better for everyone. Mihai Budiu, who is the author of the Feldera SQL compiler, is a Calcite committer.
Thanks for contributing. I see Mihai implemented the UUID type in Calcite (https://issues.apache.org/jira/browse/CALCITE-6738) back in January which is one of the issues I hit, so for sure my experience with Feldera is 6 months out of date and y'all move pretty quick.
Most of what I mean is places where Feldera/Calcite has slightly different syntax from Postgres for things. For example, the Postgres syntax for a cast to bigint is `some_expression::bigint`; although Postgres also supports the ANSI SQL `CAST(some_expression AS bigint)`, most examples I find in the wild, and my own Postgres SQL, use the Postgres-specific syntax. JSON syntax also differs: Feldera uses its own pretty elegant `VARIANT` type and `some_expression[key_expression]` to access properties, whereas Postgres calls this `json` or `jsonb` and uses `some_expression->key_expression` to access properties. In those cases it's not like Feldera is wrong or lacks some support, but it's a bit harder to work with for me because I'm so used to Postgres syntax and I need to do some regex replace whenever I bring a query from Postgres over to Feldera.
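Concretely, paraphrasing the differences above with a hypothetical events(payload) table, the same projection ends up looking like this (the Feldera side is from memory, so treat it as approximate):

    -- Postgres: jsonb column, arrow operators, :: cast
    SELECT (payload ->> 'user_id')::bigint AS user_id
    FROM events;

    -- Feldera/Calcite: VARIANT column, bracket access, ANSI CAST
    SELECT CAST(payload['user_id'] AS BIGINT) AS user_id
    FROM events;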
Definitely not a deal-breaker, I am a Feldera enjoyer, but it does add some friction.
Thanks for the kind words. :) We hear you on the dialect differences.
An interesting case of a user dealing with this problem: they use LLMs to mass-migrate SparkSQL code over to Feldera (the friction is often in JSON-related constructs, like the ones you ran into). They then verify that their original warehouse and Feldera compute the same results for the same inputs to ensure correctness.
A new McSherry post! Excellent.
Last I checked, VMware had moved away from Differential Datalog?
The Differential Datalog team founded Feldera: https://www.feldera.com/
They switched from differential Datalog to differential SQL, I think because they realized Datalog is a really tough sell.
They did, and their product is great.
It is the only database/query engine that allows you to use the same SQL for both batch and streaming (with UDFs).
I have made an accessible version of a subset of Differential Dataflow (DBSP) in Python right here: https://github.com/brurucy/pydbsp
DBSP is so expressive that I have implemented a fully incremental dynamic datalog engine as a DBSP program.
Think of SQL/Datalog where the query can change at runtime, and the changes themselves (program diffs) are incrementally computed: https://github.com/brurucy/pydbsp/blob/master/notebooks/data...
> It is the only database/query engine that allows you to use the same SQL for both batch and streaming (with UDFs).
Flink SQL also checks that box.
Flink SQL is quite limited compared to Feldera/DBSP or Frank’s Materialize.com, and has some correctness limitations: it’s “eventually consistent”, but until you stop the incoming data it’s unlikely to ever be actually correct when working with streaming joins. https://www.scattered-thoughts.net/writing/internal-consiste...
Not true.
There has to be some change in the code, and they will not share the same semantics (and they perhaps won't work when retractions/deletions also appear whilst streaming). And let's not even get into the leaky abstractions needed for good performance (watermarks et al.).
If you wish to use Datalog and Rust, cozodb is written in Rust and has a Datalog query syntax.
Cozodb seems cool but also inactive. I poked around in November 2024 and found some low-hanging fruit in the sqlite storage backend: https://github.com/cozodb/cozo/issues/285
It's not a lot of code so it's easy to tinker with.
Posted 1 day ago
https://news.ycombinator.com/item?id=44274592