#pyconus next up is duckdb.
Fits a role in between small data and huge data.
pandas > DuckDB > distributed systems (Spark etc.)
Runs in-process like SQLite does (not only in-memory; a database can be persisted to a file / paged out)
MIT license
Uses modern tricks to run very fast
Easy enough for non-specialists
me: sounds like it is for supporting "data frame oriented programming"
---
SQL (postgres) poorly optimized for big data
Analytic dbs poorly optimized for installing the whole thing on your laptop
---
DuckDB tries to be an analytics DB that works well on a workstation (local processing instead of sending data across the network for a SQL server to work with it)
DuckDB syntax is Postgres-like
Pivot/unpivot features
... many features new to me...
GROUP BY ALL, ORDER BY ALL, syntactic sugar to avoid listing lots of columns
Duckdb will query a dataframe in the current variable scope with SQL as if it was a table.
Whoa....
Duckdb can load/export data fast...
Has ecosystem integrations (all the dataframe libraries, not just pandas)
Supports "Relational API" (what?) and SQL.
ah Relational API https://duckdb.org/docs/api/python/relational_api.html
You can simulate working with pyspark using duckdb on your workstation.... I wonder if this will simulate AWS Glue close enough?