data/analytics · Superset · Skill Leaf

Open-source BI.
SQL-native.

Apache Superset 6.0: open-source BI for data exploration, SQL Lab, 40+ chart types, and embedded analytics. The 2nth default over Power BI and Tableau for new builds.


Open-source BI that speaks SQL first.

Apache Superset is an open-source business intelligence platform built on Python and React. Its core loop: connect a database, define a dataset, build charts, assemble dashboards. The whole thing is driven by SQL: analysts write queries in SQL Lab, datasets are either SQL queries or direct table references, and every chart ultimately resolves to a SQL statement you can inspect.

SQL Lab is the ad-hoc query interface: syntax highlighting, schema browser, query history, CSV export, and the ability to promote any result set into a named dataset for charting. The 40+ chart types (bar, line, area, scatter, heatmap, Sankey, sunburst, pivot, big number tiles, geospatial maps) cover most reporting needs without custom code.

Embedded analytics ships as a first-class feature: guest tokens let you embed dashboards into external applications with row-level security applied per viewer. SQLAlchemy backs every connection, which means any database with a SQLAlchemy dialect (and there are over 30) works without additional drivers on the Superset side.

Why Superset is the 2nth default

Free and self-hosted. No per-seat licensing. Runs on your infrastructure, so data never leaves your VPC, which is relevant for POPIA compliance and any client with data residency requirements.

SQL-native. Every analyst already knows SQL. There's no proprietary query language to learn, no drag-and-drop abstraction that breaks on edge cases. Write SQL, see charts.

Embeddable. Guest token auth plus row-level security means you can surface dashboards inside client portals, ERPNext views, or custom applications without shipping a separate BI seat to every user.
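The embedding handshake is two REST calls: authenticate against /api/v1/security/login to obtain a bearer token, then POST to /api/v1/security/guest_token/ with the viewer's scope. A minimal sketch of the second step, building only the request body; the username, dashboard UUID, and RLS clause below are illustrative placeholders, not values from this document:

```python
def build_guest_token_payload(username, dashboard_uuid, rls_clause=None):
    """Body for POST /api/v1/security/guest_token/ (sent by your backend,
    authenticated with an admin bearer token -- never from the browser)."""
    return {
        "user": {"username": username},
        "resources": [{"type": "dashboard", "id": dashboard_uuid}],
        # Each RLS clause here is appended to every query the guest runs.
        "rls": [{"clause": rls_clause}] if rls_clause else [],
    }

payload = build_guest_token_payload(
    "portal-viewer",
    "d1b4c5e6-1234-5678-9abc-def012345678",  # embedded dashboard UUID
    rls_clause="client_id = 42",
)
```

The returned guest token is handed to the front-end embedding SDK; keeping this exchange server-side is what keeps admin credentials off the browser.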

Connect anything with a SQLAlchemy dialect.

Superset treats every data source as a SQLAlchemy URI. If the database has a dialect, it works. Datasets are either physical (point at a table) or virtual (wrap a SQL query).

Database SQLAlchemy URI pattern Notes
PostgreSQL postgresql://user:pass@host/db 2nth default for OLTP + reporting. Cloud SQL in africa-south1.
ClickHouse clickhousedb://user:pass@host/db For high-volume event data. Sub-second aggregations on billions of rows.
DuckDB duckdb:////path/to/file.db Local analytics on Parquet/CSV. Lightweight; no server needed.
BigQuery bigquery://project/dataset Service account JSON via credentials_base64 extra.
MySQL mysql://user:pass@host/db Works. MariaDB too. Not the preferred path for new builds.
Snowflake snowflake://user:pass@account/db/schema Add ?warehouse=WH&role=ROLE to the URI.
Trino trino://user@host:port/catalog Federation layer. Query Hive, Iceberg, Delta from one surface.
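One practical wrinkle with these URI patterns: a password containing @, /, or : breaks the URI parse unless it is percent-encoded first. A minimal standard-library sketch (credentials are placeholders):

```python
from urllib.parse import quote_plus

raw_password = "p@ss/word:1"  # contains URI-reserved characters
uri = (
    "postgresql://superset:"
    + quote_plus(raw_password)   # '@' -> %40, '/' -> %2F, ':' -> %3A
    + "@127.0.0.1:5432/analytics"
)
```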

Virtual vs physical datasets. A physical dataset points at a table; Superset generates the SQL from chart configuration. A virtual dataset is a saved SQL query that becomes the source table for one or more charts. Virtual datasets are how you hide complexity: pre-join, pre-filter, pre-aggregate in the dataset definition, then let analysts build charts on the clean surface.

# SQLAlchemy URI โ€” PostgreSQL with Cloud SQL Auth Proxy
postgresql+psycopg2://superset:password@127.0.0.1:5432/superset_meta

# ClickHouse via HTTP interface
clickhousedb://default:@clickhouse-host:8123/analytics

# DuckDB โ€” absolute path required
duckdb:////data/warehouse.db

# BigQuery โ€” credentials in database extra JSON
bigquery://my-gcp-project/my_dataset
# extra: {"credentials_base64": "<base64-encoded-service-account-json>"}

# Virtual dataset SQL โ€” hide the join complexity from chart authors
SELECT
  o.order_date,
  o.region,
  c.segment,
  SUM(o.revenue) AS revenue,
  COUNT(DISTINCT o.customer_id) AS customers
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY 1, 2, 3

Datasets → Charts → Dashboards → Filters.

Superset's mental model is a four-layer stack. Each layer is reusable: one dataset feeds many charts; one chart can appear on many dashboards; filters apply cross-chart via the filter bar.

Datasets are the base. Each dataset has columns (with types, descriptions, and certified status), metrics (named SQL aggregations like SUM(revenue)), and optional time grain configuration. Get the dataset right and chart creation becomes drag-and-drop.

SQL Lab is the power-user escape hatch. Full editor with autocomplete, multi-tab sessions, query history, explain plan, and CSV/JSON export. Any query result can be saved as a virtual dataset in one click.

Jinja templating in SQL Lab and virtual datasets lets you write dynamic SQL that responds to dashboard filters, the current user, and time range selections, without any custom code in the Superset application layer.

-- Jinja template in a virtual dataset or SQL Lab query
-- {{ filter_values("region")[0] }} pulls the active dashboard filter
-- {{ current_username() }} is the logged-in Superset user
-- {{ from_dttm }} and {{ to_dttm }} are the dashboard time range

SELECT *
FROM orders
WHERE region = '{{ filter_values("region")[0] }}'
  AND created_by = '{{ current_username() }}'
  AND order_date >= '{{ from_dttm }}'

Native filters vs Jinja filters

Native filters (the filter bar on the right of a dashboard) apply to all charts on the dashboard that share the same column name, with zero SQL required. Jinja template filters are for queries that need server-side parameterisation: row-level security, user-scoped views, or dynamic WHERE clauses that the native filter system can't express. Use native filters first; reach for Jinja only when you need something native filters can't do.
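One gotcha with the earlier Jinja example: filter_values("region") returns an empty list when no dashboard filter is set, so indexing [0] raises. A hedged sketch of the defensive pattern, assuming Superset's built-in where_in Jinja filter (table and column names are illustrative):

```sql
SELECT *
FROM orders
WHERE 1 = 1
  -- Only add the predicate when the dashboard filter is actually set.
  {% if filter_values("region") %}
  AND region IN {{ filter_values("region") | where_in }}
  {% endif %}
```

The WHERE 1 = 1 anchor keeps the query valid whether or not the conditional branch renders.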

RBAC plus row-level security. POPIA-relevant.

Superset ships with four built-in roles that cover most deployment shapes. Row-level security adds data-subject access control at the dataset level: every query on the dataset gets the RLS clause appended automatically.

Role What it can do Typical user
Admin Everything: database connections, user management, all datasets and dashboards Platform operator, BI team lead
Alpha Create datasets, charts, dashboards. Cannot manage database connections or users. Senior analyst, BI developer
Gamma View dashboards and charts they've been granted access to. Cannot create. Business stakeholder, read-only consumer
sql_lab Access SQL Lab and run ad-hoc queries on permitted databases. Analyst who needs raw SQL access

Row-level security is defined per dataset as a SQL clause. Superset appends the clause to every query executed against that dataset: in SQL Lab, in charts, and in embedded dashboards via guest tokens. The clause can reference current_username() to scope results to the authenticated user.

# superset_config.py โ€” RLS clause example
# Applied automatically to all queries on the "sales" dataset.
# Users only see rows where region_code matches their username prefix.
# Defined in Security > Row Level Security in the Superset UI,
# or seeded via the Superset CLI / API during provisioning.

"region_code = '{{ current_username().split('@')[0] }}'"

# The resulting query Superset executes becomes:
# SELECT ... FROM sales WHERE region_code = 'alice'
# ...where 'alice' is the prefix of alice@company.com

POPIA relevance

RLS enforces data-subject access at the query layer: analysts cannot retrieve rows they're not authorised to see, even in SQL Lab (if RLS is enabled for SQL Lab via the ENABLE_ROW_LEVEL_SECURITY feature flag). Dataset permissions gate which analysts can see which datasets. Combined, these two mechanisms let you run a single Superset instance serving multiple business units or clients without cross-contamination of data.

Self-hosted on GCP africa-south1 or a local K8s cluster means data never transits through foreign infrastructure, which is a cleaner answer to POPIA data residency than any SaaS BI tool.

Helm on K8s for production. Docker Compose for dev.

The official Helm chart is the production path. One critical rule: all superset_config.py overrides go in a single configOverrides key. Split them and the Helm chart merges them unpredictably.

# values.yaml โ€” Helm chart configuration for GKE + Cloud SQL
# Repo: helm.superset.apache.org

configOverrides:
  my_override: |
    FEATURE_FLAGS = {
        "DASHBOARD_CROSS_FILTERS": True,
        "ENABLE_TEMPLATE_PROCESSING": True,
        "ALERT_REPORTS": True,
        "ENABLE_ROW_LEVEL_SECURITY": True,
    }
    SQLALCHEMY_DATABASE_URI = "postgresql://superset:pass@cloudsql/superset"
    SQLLAB_TIMEOUT = 300         # seconds; the default 30 is too short for warehouses
    SUPERSET_WEBSERVER_TIMEOUT = 300
    GUEST_TOKEN_JWT_EXP_SECONDS = 3600  # embedded dashboards stay live for 1 hour
    from cachelib.redis import RedisCache
    RESULTS_BACKEND = RedisCache(host="redis", key_prefix="superset_results_")  # or S3/R2 for large CSV exports

    # ONE BLOCK RULE: do not split configOverrides into multiple keys.
    # The Helm chart string-merges them and the result is unpredictable.
    # Everything goes in my_override (or whatever you name the single key).

supersetNode:
  connections:
    db_host: "127.0.0.1"    # Cloud SQL Auth Proxy sidecar
    db_port: "5432"
    db_user: "superset"
    db_name: "superset"

redis:
  enabled: true

celery:
  enabled: true

Docker Compose for local dev. The repo ships a docker-compose.yml that spins up Superset, PostgreSQL, Redis, and Celery workers. Run docker compose up in the repo root, wait two minutes, and you have a full instance at localhost:8088 with the admin/admin default credentials. Change those immediately if you expose it to a network.

Redis doubles as cache and Celery broker. Superset uses Redis for two separate jobs: caching query results (so repeated dashboard loads don't re-hit the database) and as the Celery task broker (for async queries, alert checks, and scheduled reports). One Redis instance handles both, but configure CACHE_CONFIG and DATA_CACHE_CONFIG separately in superset_config.py if you want different TTLs for UI metadata vs query results.
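Splitting the two TTLs can be sketched in superset_config.py like this; the Redis host, DB indexes, and timeout values are illustrative, and both dicts follow Flask-Caching's configuration shape:

```python
# superset_config.py -- separate metadata cache from query-result cache
CACHE_CONFIG = {
    # UI/metadata cache: short TTL is fine
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,        # 5 minutes
    "CACHE_KEY_PREFIX": "superset_meta_",
    "CACHE_REDIS_URL": "redis://redis:6379/0",
}
DATA_CACHE_CONFIG = {
    # chart/query results: longer TTL, separate Redis DB
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 3600,       # 1 hour
    "CACHE_KEY_PREFIX": "superset_data_",
    "CACHE_REDIS_URL": "redis://redis:6379/1",
}
```

Distinct key prefixes and Redis DB indexes keep the two caches independently flushable.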

Six things that only bite once you ship.

Most of these are not obvious from the docs. They surface in production, not in the local Docker Compose setup.

configOverrides one-block rule

Splitting configOverrides into multiple keys causes the Helm chart to merge them unpredictably: settings from one block override another without warning. Put everything in a single named key. Always.

SQL Lab timeout defaults to 30 seconds

SQLLAB_TIMEOUT defaults to 30 seconds. Any warehouse query that scans more than a few million rows will hit it. Set it to 300 seconds minimum for analytical workloads. Match SUPERSET_WEBSERVER_TIMEOUT too, or the web process times out before Celery returns the result.

Guest token expiry goes blank silently

GUEST_TOKEN_JWT_EXP_SECONDS defaults to 300 (5 minutes). Embedded dashboards with longer session lifetimes go blank when the token expires: no error, just a blank iframe. Users have no idea why. Set it to at least 3600 for any embedded use case.

Celery worker OOMs on large CSV exports

Large CSV exports load the entire result set into Celery worker memory. Workers OOM and die silently on result sets above ~500k rows. Set RESULTS_BACKEND to a remote store (Redis with a high maxmemory, S3, or Cloudflare R2) so results stream to storage rather than living in the worker process.

RLS is not applied in SQL Lab by default

Row-level security applies to chart queries automatically. It does NOT apply in SQL Lab unless you enable the ENABLE_ROW_LEVEL_SECURITY feature flag AND restrict SQL Lab access to sql_lab role only. Without this, an analyst with SQL Lab access can bypass RLS by writing their own SELECT.

FEATURE_FLAGS keys are UPPER_SNAKE_CASE

Not camelCase. Not kebab-case. DASHBOARD_CROSS_FILTERS, not dashboardCrossFilters or dashboard-cross-filters. Wrong case silently has no effect: the flag stays at its default. Check the source for the canonical key names before assuming a flag isn't working.

Use it for, skip it for.

Superset is the right call for a specific shape of BI problem. When the shape changes, so does the answer.

Use Superset when

  • You need self-hosted BI with no per-seat licensing: open-source, Apache 2.0.
  • Your team already knows SQL: the entire product is SQL-first, with no proprietary language.
  • POPIA data sovereignty matters: data stays on your infrastructure, in your region.
  • Budget is a real constraint: Superset vs Tableau is a five-figure annual saving at mid-market scale.
  • You need embedded analytics: guest tokens and RLS make dashboard embedding straightforward.
  • You're connecting to ClickHouse, DuckDB, or Trino: first-class SQLAlchemy support, no workarounds.

How this leaf compounds in the tree.

Superset is the read layer in the 2nth data stack. Every other data skill feeds it; it surfaces the results to humans and agents.

The compounding play is this: Superset doesn't produce data; it reads it. That means every upstream skill that produces clean, queryable data increases what Superset can surface. dbt models land in PostgreSQL or BigQuery, and Superset charts update automatically at the next dashboard refresh. ERPNext operational data flows into a reporting replica, and Superset becomes the analytical front-end for the ERP without touching the ERPNext application layer. Cloudflare R2 becomes the export backend so large analyst queries don't crash the Celery worker.

The agent angle: Superset has a REST API. An agent can query the /api/v1/chart/data endpoint, pass a dataset ID and time range, and get chart data back as JSON, without building a separate analytics pipeline. That's the pattern for agent-generated business summaries: pull structured data from Superset, format it, push it into a report or a Slack message.
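A minimal sketch of that agent pattern, building the query_context body for POST /api/v1/chart/data. The dataset ID, metric name, and time range are placeholders; the payload shape (datasource / queries / result_format) follows Superset's chart-data API, and authentication with a bearer token is assumed:

```python
import json

def build_chart_data_request(dataset_id, metric, time_range="Last quarter"):
    """Body for POST /api/v1/chart/data (send with an Authorization:
    Bearer header obtained from /api/v1/security/login)."""
    return {
        "datasource": {"id": dataset_id, "type": "table"},
        "queries": [{
            "metrics": [metric],
            "time_range": time_range,
            "row_limit": 1000,
        }],
        "result_format": "json",
        "result_type": "full",
    }

body = json.dumps(build_chart_data_request(42, "sum__revenue"))
```

The agent then reshapes the JSON rows into whatever summary format the downstream report or Slack message needs.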

data/warehouse/dbt
dbt
dbt models are the cleanest Superset datasets. Model in dbt, certify in Superset. The two tools are designed to work together.
data/warehouse/bigquery
BigQuery
BigQuery as the warehouse, Superset as the BI layer. Standard GCP data stack pattern โ€” no additional infrastructure between them.
biz/erp/erpnext
ERPNext
ERPNext operational data feeds a reporting replica; Superset reads that replica. Operational reporting without touching ERPNext's application layer.
biz/oracle-scm
Oracle SCM / Fusion Analytics
Oracle Fusion exports land in BigQuery or PostgreSQL; Superset reads them. A Superset layer avoids Oracle Fusion Analytics licensing for read-only reporting.
tech/cloudflare/workers
Cloudflare Workers
Workers proxy the Superset guest token API for embedded dashboards, keeping the Superset admin credentials server-side and off the browser.
tech/cloudflare/r2
Cloudflare R2
R2 as the RESULTS_BACKEND for large Celery exports. Analysts download large CSVs from R2 directly, putting zero load on the Superset web process.

Go deeper.

The official docs are thorough but occasionally lag behind the codebase. The GitHub repo and the Preset blog are the most current references for configuration and feature flags.