#pyconus next up is duckdb.
Fits a role in between small data and huge data.
pandas > DuckDB > distributed systems (Spark etc.)
Runs in-process like SQLite does (not only in-memory; a database can be persisted to a file / paged out)
MIT license
Uses modern tricks to run very fast
Easy enough for non-specialists
me: sounds like it is for supporting "data frame oriented programming"
---
SQL (postgres) poorly optimized for big data
Analytic dbs poorly optimized for installing the whole thing on your laptop
---
DuckDB tries to be an analytics DB that works well on a workstation (local processing instead of sending data across the network for a SQL server to work with it)
DuckDB syntax is Postgres-like
Pivot/unpivot features
... many features new to me...
GROUP BY ALL, ORDER BY ALL, syntactic sugar to avoid listing lots of columns
Duckdb will query a dataframe in the current variable scope with SQL as if it was a table.
Whoa....
Duckdb can load/export data fast...
Has ecosystem integrations (all the dataframe libraries, not just pandas)
Supports "Relational API" (what?) and SQL.
ah Relational API https://duckdb.org/docs/api/python/relational_api.html
You can simulate working with pyspark using duckdb on your workstation.... I wonder if this will simulate AWS Glue close enough?