NHacker Next
Optimizing Top K in Postgres (paradedb.com)
bob1029 2 hours ago [-]
Lucene really does feel like magic sometimes. It was designed expressly to solve the top K problem at hyper scale. It's incredibly mature technology. You can go from zero to a billion documents without thinking too much about anything other than the amount of mass storage you have available.

Every time I've used Lucene I have combined it with a SQL provider. It's not necessarily about one or the other. The FTS facilities within the various SQL providers are convenient, but not as capable by comparison. I don't think mixing these into the same thing makes sense. They are two very different animals that are better joined by way of the document ids.

jmgimeno 3 hours ago [-]
Maybe I'm wrong, but for this query:

SELECT * FROM benchmark_logs WHERE severity < 3 ORDER BY timestamp DESC LIMIT 10;

this index

CREATE INDEX ON benchmark_logs (severity, timestamp);

cannot be used as proposed: "Postgres can jump directly to the portion of the tree matching severity < 3 and then walk the timestamps in descending order to get the top K rows."

Postgres with this index can walk to a part of the tree with severity < 3, but timestamps are sorted only for the same severity.
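This is easy to see by simulating the B-tree leaf order with a small Python sketch (the rows here are made up for illustration): entries matching severity < 3 are contiguous in the index, but their timestamps are only sorted within each severity value, not across the whole range.

```python
# Hypothetical (severity, timestamp) rows; a composite B-tree index
# stores them sorted by (severity, timestamp).
rows = [(0, 5), (2, 1), (1, 3), (0, 9), (2, 8), (1, 7)]
index_order = sorted(rows)  # B-tree leaf order

# Range scan over the severity < 3 prefix of the index:
matching = [ts for sev, ts in index_order if sev < 3]

# Timestamps are ordered only within each severity group,
# not globally, so the scan output still needs a sort:
print(matching)                      # [5, 9, 3, 7, 1, 8]
print(matching == sorted(matching))  # False
```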

Cervisia 1 hour ago [-]
The SQLite documentation explains how (and how well) this works: https://www.sqlite.org/optoverview.html#the_skip_scan_optimi...
igorw 39 minutes ago [-]
While Postgres did introduce skip scan in 18, it only works for equality matching: https://www.crunchydata.com/blog/get-excited-about-postgres-...
dragon96 2 hours ago [-]
If severity is a low-cardinality enum, it still seems acceptable.
mattashii 3 minutes ago [-]
The order returned from the index scan is not the ordering the user requested, so there would still have to be a full (or top-K) sort over the rows the index scan returns, which could negate the gains of using an index scan in the first place. PostgreSQL itself does not produce merge-join plans that merge a spread of index scans to get suffix-ordered data out of an index.
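The "merge a spread of index scans" idea can be sketched in Python (hypothetical data, not how Postgres plans queries): run one descending scan per severity value below the cutoff, then merge the already-sorted streams with a heap, so the first K items of the merged stream are the top K without a full sort.

```python
import heapq

# Hypothetical per-severity "index scans": for each severity value below
# the cutoff, the index yields timestamps in descending order.
scans = {
    0: [9, 5, 2],
    1: [8, 7, 1],
    2: [6, 3],
}

def top_k(scans, k):
    # heapq.merge with reverse=True merges descending streams into one
    # descending stream, touching only as many items as it needs to.
    merged = heapq.merge(*scans.values(), reverse=True)
    return [ts for _, ts in zip(range(k), merged)]

print(top_k(scans, 4))  # [9, 8, 7, 6]
```

The merge only pops one item per output row, so a LIMIT 10 would read roughly 10 + (number of streams) entries rather than every matching row.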
h1fra 27 minutes ago [-]
Postgres is really good at a lot of things, but it's very unfortunate that it's really bad at simple analytics. I wish there were a plugin instead of having to run N databases.
davidelettieri 4 hours ago [-]
The "But Wait, We Need Filters Too" paragraph mentions "US" filter which is introduced only later on.
GrayShade 3 hours ago [-]
And footnote 3 is unreferenced.
Vadim_samokhin 2 hours ago [-]
Just in case: there is a btree_gin extension that can be used in queries combining a GIN-indexable column and a btree-indexable column. It doesn't solve the top-K ordering problem, though.
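As a sketch of what this looks like (the table and column names here are hypothetical, borrowed from the query upthread), btree_gin lets a single GIN index cover both a full-text column and a scalar column:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- One GIN index covering a tsvector column and a scalar column:
CREATE INDEX ON benchmark_logs USING gin (message_tsv, severity);

-- Supports combined filtering in one index scan, but GIN indexes
-- return rows in no useful order, so ORDER BY ... LIMIT K still sorts:
SELECT * FROM benchmark_logs
WHERE message_tsv @@ to_tsquery('error') AND severity < 3
ORDER BY timestamp DESC LIMIT 10;
```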
tacone 3 hours ago [-]
The issue here is the row-based storage format. You simply can't filter efficiently on arbitrary columns with that. Either use an external warehouse or a columnar plug-in like Timescale.
hrmtst93837 55 seconds ago [-]
The catch is that Timescale isn't really a true columnar store, it's still fundamentally row-based under the hood, just with hypertables and some compression tricks. If you actually need warehouse-style columnar performance, Citus might get you further, but then you lose a lot of what makes Postgres sane for transactional workloads.

Warehouses are great until you need fine-grained updates or strong consistency; then the cracks show quickly. Most of these warehouse or columnar plug-in suggestions gloss over ETL friction and the pain of maintaining two systems in production.

If you try to optimize for both OLAP and OLTP in the same DB, you end up with a system that's mediocre at both. Pick your poison, or budget for dedicated infra.
