Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The important factor is # of pages on disk that need to be retrieved. Secondary indexes (aka every index in Postgres) have to lookup the primary storage as well. If you use a primary key index to pull 100 rows, if the rows aren't clustered, then you're looking at ~300 pages needing to be pulled. That's roughly 2 per index traversal (the first level of the index is generally cached) plus one per row to pull from primary storage.

This can be improved in two ways. One, if you add a second index which gives you better locality in the index. For example (customer_id, product_id) will group up all the rows by customer id. This can reduce the # of pages for index traversals down to <5 as long as each customer doesn't have a lot of rows. And in many cases this makes the primary index on just id useless. This brings the total down to 105 pages give or take. (depends on how many products each customer has)

The other way is to use an actual primary index, or use a covering index so that the data you retrieve is already in the index. For example if you're just pulling product_name from your table, you can use covering index on (customer_id, id, product_name) so that the product_name has locality with the customer's product IDs. This would bring down the total pages to be retrieved down to maybe ~20, since product_name tends to be larger data. It's a question of how many (customer_id, product_id, product_name) tuples can fit on one 8KB page and how many products the customer has.

If you use a primary index, the whole rows are on pages. This lets you run queries that pull lots of data (or different data) and have good data locality, but it means less tuples per row so you need more rows. So you'd access maybe ~50 rows but this index could cover a lot of queries unlike the covering index which only works for product_name.

These days SSDs are much faster than hard drives, so # of rows pulled off disk is still important but not as much so. Another thing this buys you is that you don't pollute the in memory cache by evicting pages just to load a new page that doesn't get utilized well. For instance original the index leaf nodes that are just 4 bytes (product_id, rowid) so every one you throw away 99% of the data on that page.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: