3 min read

Database Indexing

An index is a data structure created on top of an existing table within a database. Its purpose is to improve the speed of data retrieval by providing a way to access quickly (fast lookups) the desired data without having to scan the entire table.

A real-world example can be found in a glossary organised in alphabetical order from A to Z. If you need details about a topic starting with N, you can directly refer to the N section instead of searching sequentially from the beginning.

Where are indexes stored?

The indexes are stored in memory (RAM) as well as on disk, it depends on the configuration. It's important to note that memory(RAM) is finite, so not all indexes or data can be cached. The decision of what indexes to cache in memory is managed by the database's query optimiser and caching algorithms. The DBMS makes choices based on usage patterns, query optimisation, and available memory resources.

We know indexes help us with fast lookups, but indexes are also used to police the database constraints https://www.postgresql.org/docs/current/sql-createtable.html.

  • Primary Key Constraint: ensures that the specified column or combination of columns contains unique values and enforces that the values are not null.
  • Foreign Key Constraint: Ensures referential integrity by enforcing that values in the referencing column(s) match values in the referenced column(s) in another table.
  • Unique Constraint: Ensures that values in a specified column or combination of columns are unique across all rows in a table.
  • Exclude Constraint: This type of constraint is often used with range types to ensure that ranges of values do not overlap. I have not used it.

The indexes are mainly built on two architectures Clustered and Non- Clustered indexes. PostgreSQL indexes are built on Non-Clustered index architecture.

Clustered Index

A clustered index determines the physical order of data rows in a table. The rows are stored in the same order as the clustered index's key values (physically orders the data on disk based on the index key). Each table can have only one cluster index. This can improve range-based queries since adjacent rows are stored together on a disk.

Non-Clustered Index

A non-clustered index is separate from the data storage and contains a copy of the indexed column data along with pointers to the actual rows. This type of index is suitable for improving search performance without changing the physical order of the data.

Types of indexes

B-tree Index is the default type of index in PostgreSQL. They work well for equality and range queries.
Example: Creating a B-tree index on a column named "age" in a table called "people"

CREATE INDEX btree_age_idx ON people (age);

Hash Index is useful for exact-match queries, but they are less effective for range queries or inequality searches.
Example: Creating a hash index on a column named "zipcode" in a table named "addresses"

CREATE INDEX hash_zipcode_idx ON addresses USING hash (zipcode);

GIN (Generalised Inverted Index) indexes are designed for complex data types like arrays, JSON, and full-text search.
Example: Creating a GIN index on a column named "tags" of type text[] in a table named "posts"

CREATE INDEX gin_tags_idx ON posts USING gin (tags);

GiST (Generalised Search Tree) indexes are suitable for more complex geometric and text search queries.
Example: Creating a GiST index on a column named "geometry" of type geometry in a table named "shapes"

CREATE INDEX gist_geometry_idx ON shapes USING gist (geometry);

SP-GiST (Space-Partitioned Generalised Search Tree) indexes are similar to GiST but optimised for space-partitioning problems.
Example: Creating an SP-GiST index on a column named "network" of type inet in a table named "devices"

CREATE INDEX spgist_network_idx ON devices USING spgist (network);

BRIN (Block Range Index) indexes are useful for large tables with naturally ordered data.
Example: Creating a BRIN index on a column named "timestamp" in a table named "logs"

CREATE INDEX brin_timestamp_idx ON logs USING brin (timestamp);

Partial Index is created based on a specific condition, allowing optimisation of queries on a subset of the data.
Example: Creating a partial index on a column named "score" in a table named "students" for rows where the score is greater than 90.

CREATE INDEX partial_score_idx ON students (score) WHERE score > 90;

B-Tree

In most cases, a data structure used to create the index is a variation of the B-Tree (Balanced Trees). They are particularly well-suited for organising and managing large amounts of data that need to be efficiently accessed and updated. They provide fast search, insertion, and deletion operations (O(log n)), this is achieved because it maintains a balanced structure.

Indexing Columns with Low Cardinality

Columns that are boolean, enum or, with a small number of distinct values are not good candidates for indexing. Adding an index on these types of columns may not provide significant performance improvements, as it would still result in scanning a large portion of the table.

Too Many Indexes

Adding too many indexes on multiple columns might give us a faster lookup time when retrieving data from the database. However, when performing operations such as adding, updating, or deleting records these can become significantly costlier both in terms of time and space.