data/warehousing · PostgreSQL · Skill Leaf

The database that does everything.

PostgreSQL 18: the relational database underneath Frappe, Superset, half the 2nth stack, and most of the internet. ACID-compliant, extensible, open-source, and the rare database that handles OLTP and analytical workloads in the same engine.


Not just storage. A programmable data platform.

PostgreSQL is an object-relational database that's been in active development since 1986. The "object-relational" part matters: it supports user-defined types, functions, operators, and aggregates. But the real story in 2026 isn't the core engine; it's the extension ecosystem that sits on top of it.

Extensions turn PostgreSQL into a vector database (pgvector), a time-series database (TimescaleDB), a geospatial database (PostGIS), a graph database (Apache AGE). The Frappe Framework runs on it. Apache Superset stores its metadata in it. Cloudflare Hyperdrive proxies connections to it from the edge. It is the substrate that the rest of the data stack builds on.

The core commitments: ACID transactions, row-level security, logical replication, MVCC concurrency (readers don't block writers), full-text search, JSONB document storage with indexing, and window functions for analytics. All built in, no add-ons required.

Why this is the 2nth default

PostgreSQL is the default because it handles the first three years of most builds without requiring a database migration. Start on Postgres, add materialized views for dashboard queries, add pgvector for RAG, add PostGIS for geospatial. By the time you outgrow it, if you ever do, you'll know exactly which specialized database you need and why.

It's also the database that Frappe's ERPNext, HRMS, and CRM run on (v15+), which means a single Postgres instance can serve both your ERP and your analytics metadata store. One operational target, not two.

The capabilities that make it more than a row store.

PostgreSQL ships with a set of features that remove entire categories of external tooling from the architecture. Each one is worth knowing before you reach for a specialist service.

Feature · What it does
JSONB · Document storage with full GIN indexing, for NoSQL flexibility inside SQL structure. Supports path queries, containment checks, and array operations.
CTEs (WITH) · Common Table Expressions for readable complex logic. Break a multi-step query into named intermediate results. Makes recursive queries tractable.
Window functions · Analytics without leaving the database. Running totals, rank, lag/lead, percentile, computed over a defined partition and order without collapsing rows.
Materialized views · Pre-computed query results stored as a table, refreshable on demand or concurrently. The first tool to reach for when dashboard queries are slow.
Full-text search · tsvector/tsquery with GIN indexes. Fast enough to skip Elasticsearch for most document search use cases.
LISTEN/NOTIFY · Pub/sub built into the database. Notify channels on row changes; listeners receive events in real time. No message queue required for simple event fan-out.
Logical replication · Stream changes from a publication to one or more subscribers. Source for CDC pipelines, read replicas, and cross-database migrations.
Row-level security · Policies on tables that control which rows a session can read or write, evaluated per user based on session variables. POPIA data-subject access control without application logic.
-- JSONB: document storage with index
CREATE INDEX idx_orders_meta ON orders USING gin (metadata jsonb_path_ops);

-- Containment query (uses the index)
SELECT * FROM orders WHERE metadata @> '{"region": "ZA"}';

-- Path query
SELECT metadata ->> 'region' AS region, count(*) FROM orders GROUP BY 1;
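The CTE row in the feature table has no example above; a minimal sketch against the same orders table used in the other examples, splitting a rollup into named steps:

```sql
-- CTE: name each intermediate step of a multi-part query
WITH completed AS (
  SELECT date_trunc('day', created_at) AS day, total
  FROM orders
  WHERE status = 'completed'
),
daily AS (
  SELECT day, sum(total) AS revenue
  FROM completed
  GROUP BY day
)
SELECT day, revenue
FROM daily
ORDER BY day;
```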
-- Materialized view for a dashboard, refreshable without blocking reads
CREATE MATERIALIZED VIEW daily_sales AS
SELECT
  date_trunc('day', created_at) AS day,
  sum(total)                     AS revenue,
  count(*)                       AS orders
FROM orders
WHERE status = 'completed'
GROUP BY 1;

-- CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_daily_sales_day ON daily_sales (day);

-- Refresh without blocking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
-- LISTEN/NOTIFY: real-time events without a message queue
LISTEN order_updates;
NOTIFY order_updates, '{"order_id": 42, "status": "shipped"}';
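Logical replication from the same feature list can be sketched as a publication/subscription pair; the host and role names here are illustrative:

```sql
-- Logical replication: publish selected tables from the source database
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber (the table must already exist with a matching schema)
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=source-host dbname=app user=replicator'
  PUBLICATION orders_pub;
```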

-- Window function: running total by day
SELECT
  day,
  revenue,
  sum(revenue) OVER (ORDER BY day) AS running_total
FROM daily_sales
ORDER BY day;
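Full-text search, sketched on a hypothetical articles table (generated tsvector columns need PostgreSQL 12 or later):

```sql
-- Full-text search: generated tsvector column + GIN index
CREATE TABLE articles (
  id     bigserial PRIMARY KEY,
  body   text,
  search tsvector GENERATED ALWAYS AS
         (to_tsvector('english', coalesce(body, ''))) STORED
);
CREATE INDEX idx_articles_search ON articles USING gin (search);

-- Query with web-style search syntax; uses the GIN index
SELECT id
FROM articles
WHERE search @@ websearch_to_tsquery('english', 'invoice overdue');
```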

The killer feature is what it becomes.

PostgreSQL's extension system lets the database grow into specialized roles without replacing the core. The extensions most relevant to the 2nth stack cover vectors, geospatial, time-series, scheduling, and graph queries.

Extension · What it adds · Where it fits
pgvector · Vector similarity search (cosine, dot product, L2 distance). Store embeddings, run nearest-neighbour queries. · RAG pipelines, semantic search, recommendation
PostGIS · Geospatial geometry and geography types, spatial indexes, ST_Distance, ST_Within, routing. · Field service, logistics, mapping, delivery zones
TimescaleDB · Hypertables that auto-partition time-series data, continuous aggregates, compression. · IoT, metrics, event streams, financial tick data
pg_cron · Cron-scheduled SQL jobs inside the database. No external scheduler needed for routine maintenance or aggregation tasks. · Nightly rollups, cleanup jobs, data expiry
pg_stat_statements · Query performance monitoring: total time, calls, mean time, rows per query. Essential for tuning. · Performance investigation, index planning
Apache AGE · Graph queries using the Cypher query language on top of relational data. No separate graph database required. · Knowledge graphs, fraud detection, org charts
-- Install pgvector (once, per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with a vector column (1536-dim for text-embedding-3-small)
CREATE TABLE documents (
  id        bigserial  primary key,
  content   text,
  embedding vector(1536)
);

-- HNSW index for fast approximate nearest-neighbour search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Semantic search: find the 5 most similar documents
SELECT id, content, embedding <=> '[0.1, 0.2, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;
-- TimescaleDB: hypertable for time-series data
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE sensor_readings (
  time        timestamptz  NOT NULL,
  sensor_id   int,
  temperature numeric(5,2)
);

-- Convert to hypertable (auto-partitions by time)
SELECT create_hypertable('sensor_readings', 'time');
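pg_cron from the extension table pairs naturally with the materialized view defined earlier; a sketch that schedules a nightly refresh (the job name and schedule are illustrative):

```sql
-- pg_cron: refresh the dashboard view every night at 02:00
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule(
  'refresh-daily-sales',
  '0 2 * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales$$
);
```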

Where to run it and how to connect.

PostgreSQL runs everywhere. The choice between self-hosted and managed turns on operational overhead vs. control, and on where the data needs to live for compliance.

Option · Region · Good for · Tradeoff
Cloud SQL (Google) · africa-south1 (JHB) · SA data residency, POPIA, GCP stack · Vendor lock-in, cost at scale
Amazon RDS · af-south-1 (Cape Town) · AWS stack, multi-AZ HA · Cape Town region only for SA
Supabase · Closest AWS region · Rapid dev, built-in auth + storage + REST · No SA region; EU/US data only
Self-hosted · Your infra · Full control, POPIA on-prem compliance · You own backups, HA, patching
# Connection string pattern
postgresql://user:password@host:5432/dbname?sslmode=require

# With PgBouncer (connection pooling) in front
postgresql://user:password@pgbouncer-host:6432/dbname?sslmode=require

# Cloudflare Hyperdrive connection string (from a Worker)
# env.HYPERDRIVE.connectionString returns a pooled string automatically

Connection pooling is not optional in production

PostgreSQL's default max_connections is 100. Each connection holds memory (roughly 5-10 MB). A web application with 20 pods, each opening 10 connections, hits the limit before it gets interesting. Use PgBouncer for self-hosted deployments; it pools connections in transaction mode and makes the limit effectively irrelevant. Cloudflare Hyperdrive does the same thing for Workers: it maintains a warm pool at the edge so each Worker invocation doesn't pay the TCP + TLS handshake cost.
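A minimal PgBouncer setup in transaction mode might look like the following; every value here is illustrative, not a tuned recommendation:

```ini
; pgbouncer.ini (illustrative values)
[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; release server conn at transaction end
default_pool_size = 20         ; server connections per user/database pair
max_client_conn = 1000         ; client side can far exceed max_connections
```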

# Backup strategy

# Small databases (dev, staging): logical backup
pg_dump -Fc -h host -U user dbname > backup_$(date +%Y%m%d).dump
pg_restore -d newdb backup_20260416.dump

# Production: physical backup + WAL archiving
# pg_basebackup takes a binary snapshot
pg_basebackup -h host -U replicator -Ft -z -Xs -P -D /backups/base

# WAL archiving in postgresql.conf
archive_mode = on
archive_command = 'gsutil cp %p gs://my-bucket/wal/%f'

# Point-in-time recovery: restore base + replay WAL to a specific timestamp
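That last step can be sketched as postgresql.conf additions (PostgreSQL 12+ recovery style; the bucket path and timestamp are illustrative):

```
# postgresql.conf additions for point-in-time recovery
restore_command      = 'gsutil cp gs://my-bucket/wal/%f %p'
recovery_target_time = '2026-04-16 03:00:00+02'
# Then create an empty recovery.signal file in the restored data
# directory and start the server; WAL replays up to the target time.
```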

Running PostgreSQL in a South African context.

South Africa has specific infrastructure constraints and regulatory requirements that shape how PostgreSQL gets deployed and configured here.

Cloud SQL in africa-south1. Google Cloud SQL has a PostgreSQL offering in the Johannesburg region (africa-south1). This is the right default for POPIA-sensitive builds where data residency in South Africa is a requirement. Latency from most SA cities to Johannesburg is low enough that application-level connection pooling (PgBouncer or Hyperdrive) makes the round-trip negligible.

POPIA and row-level security. The Protection of Personal Information Act requires that data subjects can access and request deletion of their personal data. PostgreSQL's row-level security (RLS) makes this straightforward: define a policy that restricts row access based on a session variable set to the current data subject's ID. An agent or API can query as a specific data subject and automatically get only their rows โ€” no application-level filtering required.

Money: never use float. For ZAR amounts, use NUMERIC(12,2) or store as integer cents. Floating-point arithmetic introduces rounding errors that compound across aggregations. A sum of 10,000 float-stored transactions may differ from the correct value by a few cents, which is a financial reporting error, not just a rounding quirk.

Load shedding resilience. During load shedding, database connections are dropped when the server restarts. The defence: connection pooling (PgBouncer reconnects automatically), application-level retry logic with exponential backoff, and WAL archiving to cloud storage so point-in-time recovery is available if a hard restart corrupts a checkpoint. Do not rely on scheduled pg_dump jobs alone; WAL archiving is the safety net.

-- Row-level security for POPIA data-subject access control
ALTER TABLE customer_data ENABLE ROW LEVEL SECURITY;

-- Policy: each session can only see rows where customer_id matches the session variable
CREATE POLICY customer_isolation ON customer_data
  USING (customer_id = current_setting('app.current_customer_id')::bigint);

-- Set before each query in the application layer
SET app.current_customer_id = '42';
SELECT * FROM customer_data; -- returns only customer 42's rows

-- ZAR money: always NUMERIC, never FLOAT
CREATE TABLE invoices (
  id         bigserial primary key,
  amount_zar numeric(12,2) NOT NULL,  -- correct
  amount_bad float                    -- never do this
);

Things that only bite once you ship.

Six failure modes worth internalising before a PostgreSQL instance goes in front of production traffic.

Never use float for money

Use NUMERIC(12,2) or integer cents. Floating-point arithmetic accumulates rounding errors that show up in financial reports. This catches every new engineer exactly once, usually in production.

Connection limits hit fast

Default max_connections=100. A 20-pod app with 5 connections each is already at the limit. Use PgBouncer in transaction mode for self-hosted, or Cloudflare Hyperdrive for Workers. Raising max_connections without pooling just moves the pain.

VACUUM doesn't run itself well enough

Autovacuum is on by default but tuned conservatively. High-write tables accumulate dead tuple bloat. Monitor with pg_stat_user_tables, check n_dead_tup, and tune autovacuum thresholds for hot tables. Table bloat hurts query performance and index scans.
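A quick way to spot bloat candidates, using the pg_stat_user_tables view mentioned above:

```sql
-- Which tables carry the most dead tuples (autovacuum candidates)
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```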

SELECT * in production

Always list the columns you actually need, especially when JSONB columns are present. SELECT * on a table with a 10 MB JSONB column deserialises and transmits all of it; inside a loop, that's a disaster.

Missing indexes on foreign keys

PostgreSQL does not automatically create indexes on foreign key columns. Every unindexed FK column becomes a sequential scan on the referenced table during JOINs and cascade deletes. Check pg_constraint after schema changes and index FKs manually.
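A rough catalog check for the single-column case (multi-column foreign keys need a fuller comparison of index columns):

```sql
-- Single-column foreign keys with no index starting on that column
SELECT c.conrelid::regclass AS table_name,
       c.conname            AS fk_name
FROM pg_constraint c
WHERE c.contype = 'f'
  AND array_length(c.conkey, 1) = 1
  AND NOT EXISTS (
    SELECT 1
    FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND i.indkey[0] = c.conkey[1]  -- indkey is 0-based, conkey 1-based
  );
```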

Locale matters for sort order

The database locale controls collation โ€” how strings are ordered and compared. Set LC_COLLATE=en_ZA.UTF-8 at database creation for correct South African sort order. Changing locale after the fact requires a full dump-and-restore.
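Setting the locale at creation time might look like this; the database name is illustrative, and the en_ZA.UTF-8 locale must exist on the host OS:

```sql
-- Locale is fixed at creation; a non-default locale needs template0
CREATE DATABASE app_za
  WITH TEMPLATE   = template0
       ENCODING   = 'UTF8'
       LC_COLLATE = 'en_ZA.UTF-8'
       LC_CTYPE   = 'en_ZA.UTF-8';
```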

Use it for, skip it for.

PostgreSQL is the right default for most builds. The list below is where it earns that position; workloads that fall outside it are the signal to reach for a specialist database.

Use PostgreSQL when

  • You need ACID transactions โ€” financial records, inventory, order management.
  • You want extensions to grow into specialized roles: pgvector for RAG, PostGIS for geospatial, TimescaleDB for time-series.
  • The team knows SQL โ€” Postgres rewards SQL fluency with enormous capability.
  • Regulatory compliance (POPIA) requires self-hosted data residency in South Africa.
  • You're running Frappe/ERPNext โ€” it's the native database for v15+.
  • You want one database that serves both operational (OLTP) and light analytical (OLAP) workloads.

How this leaf compounds in the tree.

PostgreSQL is not one tool in the stack; it is the substrate that multiple other skill nodes depend on. Understanding it once unlocks the operational model for a large part of the 2nth build surface.

The compounding logic: every node that talks to a PostgreSQL instance shares the same connection model, the same auth patterns, the same query language, and the same extension surface. An agent that knows how to operate a Postgres instance knows how to read Superset's metadata, back up a Frappe site's database, and query an ERPNext report, because they're all the same database.

The extension story matters for agents too. pgvector turns a Postgres instance into a vector store, with no separate Pinecone or Weaviate to provision. PostGIS turns it into a geospatial engine. An agent operating against a Postgres instance with these extensions can do work that used to require three separate services.

tech/frappe/framework
Frappe Framework
Frappe v15 runs on PostgreSQL natively. Every Frappe doctype is a Postgres table. The Frappe leaf assumes this one as the backing store.
data/analytics/superset
Apache Superset
Superset stores its own metadata (dashboards, charts, datasets, users) in PostgreSQL. It also queries Postgres directly as a datasource โ€” the same instance can play both roles.
tech/cloudflare/hyperdrive
Cloudflare Hyperdrive
Hyperdrive pools and caches Postgres connections from Cloudflare Workers. When D1 (SQLite-at-edge) isn't enough, Hyperdrive brings a real Postgres instance to the edge with pooling built in.
tech/cloudflare/d1
Cloudflare D1
D1 is SQLite at the edge, the right tool when you need a lightweight embedded database in a Worker. When you need the real thing (ACID, extensions, large datasets), the path is D1 → Hyperdrive → Postgres.
data/warehousing
Warehousing sub-hub
The parent sub-hub. PostgreSQL covers the OLTP-to-light-OLAP spectrum; the other nodes in the hub fill the shapes Postgres doesn't cover.
data/engineering
Data Engineering
dbt, Airbyte, and the transformation layer all sit on top of the warehouse. Postgres is the most common target for these tools in the 2nth stack.

Go deeper.

The official documentation is excellent, one of the best in the open-source world. The wiki has the operational patterns. pgvector and Supabase docs are the fastest path to the extension ecosystem.