Glossary·34 terms

Data Glossary

Words people throw around. Translated with zero respect.

If you landed on a rawquery page and did not recognise a term, it is here. Short, honest, sometimes snarky. If something is missing, email us.

Storage and formats

S3: A giant storage system on the internet, made by Amazon in 2006. You upload files, they stay there, you pay per gigabyte. Cheap. Almost never goes down. Became the universal place to dump any kind of data. Every modern data platform stores there, or on a clone (MinIO, Cloudflare R2, Wasabi). Unkillable.
Apache Iceberg: A way to take a pile of files sitting on cloud storage (see S3) and make them behave like a proper database table. Schema, history, the works. What they did right: open format, read by every serious engine (Snowflake, Spark, Databricks). Netflix built it, everyone adopted it. If your data lives here, you can walk away anytime.
Parquet: A file format for data. Like a CSV but stores each column together instead of each row, which makes it smaller and faster to scan. What it did right: became the default for analytics in 2013 and never let go. If you see a .parquet file, it is someone's data.
CSV: The oldest data format still in use. A text file with commas between values, one row per line. Everyone's first spreadsheet export. Still how most business data moves around, in 2026, somehow.
JSON: A text format for structured data. Looks like {"name": "alice", "age": 30}. Every programming language reads and writes it. Has been the default for apps talking to apps since ~2010. Replaced XML, which everyone hated.
Lakehouse: Marketing word. Means “data files sitting on cloud storage (S3), with metadata on top so they behave like a database”. Invented by Databricks, copied by everyone. What the idea did right: decoupled storage from compute, so you stopped paying Snowflake just to hold your files. rawquery is a lakehouse.
Data Warehouse: A database specifically designed for asking questions about lots of data at once. Snowflake, BigQuery, Redshift. What they did right: scaled analytics to billions of rows without the user thinking about infrastructure. What they did wrong: pricing that invoices you for breathing.
Schema: The structure of a table: what columns exist and what type of data goes in each. Also used to mean “a group of tables sharing a namespace”. Software naming is like that. Live with it.
Table: A grid. Columns are fields, rows are records. A Stripe invoice is a row in the invoices table. You already got this.
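
If "stores each column together" in the Parquet entry sounded abstract, here is a toy sketch of the idea in plain Python. The records and field names are made up for illustration, and this shows only the layout: real Parquet files also add row groups, encoding and compression.

```python
# Toy data: the same three records, laid out two ways (all values are made up).
rows = [
    {"email": "alice@example.com", "country": "France", "amount": 30},
    {"email": "bob@example.com", "country": "Spain", "amount": 45},
    {"email": "carol@example.com", "country": "France", "amount": 25},
]

# Row layout (CSV-style): each record is stored together.
# Summing one field still means walking every record in full.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout (Parquet-style): each column is stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# Summing one field now touches a single list and skips the others.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same. The difference is how much data you had to read to get there, which is the whole point once the table has a few hundred columns.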

Engines and languages

Postgres: A database. A database is a thing that stores information in tables (spreadsheets on steroids) and lets you ask questions about it. Postgres is the most popular open source one, from 1986. What they did right: free, reliable, handles almost anything, no corporate drama. 40 years later it still eats newer databases alive. You've used it without knowing.
DuckDB: A database that runs on one computer and chews through millions of rows in seconds. Free. Made by a team in Amsterdam. What they did right: realized most “big data” fits in memory on a laptop and skipped the whole distributed cluster circus. If your data fits on an SSD, DuckDB beats the expensive warehouse.
SQL: The language you type to ask a database questions. Looks like English (SELECT email FROM customers WHERE country = 'France'), does not quite read like it. Invented in 1974. What it did right: you declare what you want, the database figures out how to get it. 50 years later, nothing has replaced it. LLMs now write it for you, and that is fine.
Query: A SQL statement. SELECT this FROM that WHERE condition. The thing you type to get an answer. Now your agent types it for you.
OLAP vs OLTP: Two kinds of database workloads. OLTP runs your app, one row at a time (Stripe processing a payment). OLAP answers your boss, millions of rows at once (how much revenue last quarter). Postgres does both, badly at OLAP. DuckDB does OLAP brilliantly.
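
To make the SQL and OLAP vs OLTP entries concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The customers table, its columns and its three rows are assumptions for illustration. SQLite is neither Postgres nor DuckDB, but the queries have the same shape.

```python
import sqlite3

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, email TEXT, country TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO customers (email, country, revenue) VALUES (?, ?, ?)",
    [
        ("alice@example.com", "France", 120.0),
        ("bob@example.com", "Spain", 80.0),
        ("carol@example.com", "France", 200.0),
    ],
)

# The query from the SQL entry: declare what you want, not how to get it.
french_emails = sorted(
    row[0]
    for row in conn.execute("SELECT email FROM customers WHERE country = 'France'")
)

# OLTP-style: fetch one row by its key, the kind of query an app runs all day.
one_customer = conn.execute(
    "SELECT email FROM customers WHERE id = ?", (2,)
).fetchone()

# OLAP-style: scan every row and summarise, the kind of query your boss asks for.
revenue_by_country = dict(
    conn.execute("SELECT country, SUM(revenue) FROM customers GROUP BY country")
)

print(french_emails, one_customer, revenue_by_country)
```

With three rows nobody can tell the two workloads apart. With three billion, the lookup stays fast and the summary is where the engine choice starts to matter.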

Interfaces and plumbing

CLI: Command Line Interface. A black terminal where you type commands and stuff happens. No buttons, no menus. What it does right: fast, scriptable, works over SSH, agents can use it directly. Most serious dev tools ship one. We built our product around it.
API: Application Programming Interface. A door into a system. Your app sends a message, the system sends one back. Usually in JSON. What it did right: commoditized integration. No SDK required, curl works. Bad APIs are still a plague.
Wire Protocol: The exact bytes two programs exchange to understand each other. A secret handshake. Postgres has one, used since 1996. What it did right: became the universal language for database clients. If your tool speaks Postgres wire protocol, it can point at any database that also speaks it. psql, Metabase, Grafana, your Python script.
Connector: A little program that pulls data from a SaaS product (Stripe, HubSpot) into your warehouse. What connectors did right: save engineers from writing the same ingestion code a thousand times. What vendors did wrong: gatekeep behind enterprise pricing, and the one you need is always missing.
Sync: Running a connector to fetch fresh data. “Sync my Stripe” means “go get everything new since last time”. Should handle pagination, rate limits, schema changes, resuming after a crash. If it does not, you find out at 3am.
Transform: A SQL query that reshapes raw data into something useful. Usually a join, a filter, a summary. Runs on a schedule. What the idea did right: keep business logic in SQL, versioned in git. What the ecosystem did wrong: turned it into a 400-page docs website and a career path.
ETL and ELT: Three letters for “take data from one system, reshape it, put it in another”. ETL reshapes before landing. ELT reshapes after. What ELT did right: cheap storage made it possible to load raw data first, so you never lose the original. Which order you pick is a religious debate.
DAG: Directed Acyclic Graph. A flowchart that does not loop back on itself. Every data tool draws one for “step A runs before step B runs before step C” and calls it revolutionary. It is a flowchart.
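
The "step A runs before step B" idea from the DAG entry can be sketched with Python's standard graphlib module (Python 3.9+). The step names below are hypothetical, and this is not how any particular data tool schedules its work.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "sync_stripe": set(),
    "transform_revenue": {"sync_stripe"},
    "chart_revenue": {"transform_revenue"},
}

# A valid run order puts every step after its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# "Acyclic" is the part that matters: a loop has no valid run order.
looped = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(looped).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The first graph is a straight chain, so there is exactly one valid order. The second one loops back on itself and gets rejected, which is the only thing separating a DAG from a plain flowchart.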

What you show people

Chart: A SQL query with a rendering on top. Bars, lines, numbers. The thing your boss asks for every Monday.
Dashboard: Multiple charts on one screen. Usually titled “KPIs”. Often abandoned three weeks after being built because nobody set up the alerts.
BI Tool: Business Intelligence tool. Looker, Metabase, Tableau. You point it at a database, it draws charts. What they did right: let non-engineers build dashboards. What they did wrong: turned into multi-year implementation projects staffed by consultants.
Attribution Model: A method to decide which marketing touchpoint gets credit for a sale. Every company fights about it. Nobody agrees. Mostly wankery. Pick one, document it, move on.
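
For the Attribution Model entry, here is a minimal sketch of three common rules (first touch, last touch, linear) applied to one sale. The function name, the channel names and the amount are all made up for illustration. Real attribution setups add time windows, weighting and a great deal of arguing.

```python
def attribute(touchpoints, sale_amount, model):
    """Split sale_amount across touchpoints under a simple attribution rule."""
    credit = {channel: 0.0 for channel in touchpoints}
    if model == "first_touch":
        # All credit goes to the first thing the customer saw.
        credit[touchpoints[0]] += sale_amount
    elif model == "last_touch":
        # All credit goes to the last thing before the sale.
        credit[touchpoints[-1]] += sale_amount
    elif model == "linear":
        # Every touchpoint gets an equal share.
        share = sale_amount / len(touchpoints)
        for channel in touchpoints:
            credit[channel] += share
    else:
        raise ValueError(f"unknown model: {model}")
    return credit


# Hypothetical customer journey ending in a sale of 100.
journey = ["newsletter", "search_ad", "webinar", "sales_email"]

for model in ("first_touch", "last_touch", "linear"):
    print(model, attribute(journey, 100.0, model))
```

Same sale, three different answers about who earned it. That is why the entry says to pick one and document it.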

Vendors you keep hearing about

Fivetran: A company that sells connectors. What they did right: proved people will pay a lot to not maintain sync code. What they did wrong: priced themselves out of every team under 100 people. Announced a merger with dbt Labs, the company behind dbt, in 2025.
Snowflake: A data warehouse. Big. What they did right: made warehousing feel like a SaaS, separated storage and compute, sold hard to every Fortune 500. What they did wrong: the bill. Most companies use 5% of it and pay for 100%.
Databricks: Spark (a distributed compute engine) in a tuxedo. What they did right: packaged distributed compute for the Fortune 500, bought their way into every data category (lakehouse, ML, BI, Postgres). What they did wrong: overkill for your last quarter's revenue.
dbt: A tool to run SQL transforms on a schedule. What it did right: versioned SQL in git, added tests, made lineage visible. What the ecosystem did wrong: convinced 3-person teams they needed 400 models and a data mesh.

AI

LLM: Large Language Model. Claude, GPT, Gemini. A neural network trained on most of the written internet, now capable of reading and writing text almost like a human. What they did right: made average SQL free. You describe what you want in English, the LLM writes the query.
Agent: An LLM with access to tools. It runs commands, reads files, calls APIs. You say what you want in English, it figures out the steps. If your data platform has a CLI, the agent operates it directly.

People

Data Engineer: The person who writes the pipelines, fixes the broken syncs, keeps the warehouse running. What they do well: the stuff that breaks silently at 3am. Underpaid for how much falls over without them.
Data Analyst: The person who writes the SQL and makes the charts. Explains to your boss why the number went up. What they do well: ask the right question, which is harder than writing the query.

rawquery replaces Snowflake, Fivetran, and dbt with one product. EU-hosted, open format, no sales call.