Full-Text Search
Search any text column in your lakehouse. Built on Tantivy (Rust port of Lucene), queryable from any SQL via the fts_search() table function. Joins cleanly with the rest of your data and works through the wire protocol.
How It Works
- Build an index with
rq fts createon a text column. - Query with
fts_search('index_name', 'your query')from any SQL: interactive editor, saved queries, wire-protocol clients, public charts. - Rebuild when the source table changes via
rq fts rebuild.
The index lives as a single archive on your storage (counted toward your storage quota). It's separate from your data. Dropping it doesn't touch your tables, building it doesn't add columns.
Quick Start
# Build an index over a text columnrq fts create hn_body --table hackernews.comments --column body
# Use it in any SQLrq query " SELECT c.id, c.author, h.score FROM hackernews.comments c JOIN fts_search('hn_body', 'rust async runtime') h ON c.id = h.doc_id ORDER BY h.score DESC LIMIT 20"# Build an index over a text columnrq fts create hn_body --table hackernews.comments --column body
# Use it in any SQLrq query " SELECT c.id, c.author, h.score FROM hackernews.comments c JOIN fts_search('hn_body', 'rust async runtime') h ON c.id = h.doc_id ORDER BY h.score DESC LIMIT 20"That's it. The query runs in milliseconds even on tens of millions of rows.
The fts_search Function
fts_search(index_name, query [, limit [, fuzzy]])fts_search(index_name, query [, limit [, fuzzy]])Returns a relation with two columns:
| Column | Type | Meaning |
|---|---|---|
doc_id | BIGINT | Row position in the source table |
score | FLOAT | BM25 relevance score (higher = better) |
Arguments:
| Arg | Default | Notes |
|---|---|---|
index_name | required | The name you gave the index when creating it |
query | required | Lucene-style query string (see below) |
limit | 10 | Top-N hits to return (1..1000) |
fuzzy | 0 | Levenshtein edit distance for single-term fuzzy match (0, 1, or 2) |
You join the result on doc_id to recover the rest of the row's columns.
Query Syntax
The default analyzer (bm25) lowercases tokens and applies Snowball English stemming, so rust matches rust, rusty, rusted. The query string supports:
-- Single term (matches stemmed forms)fts_search('hn_body', 'rust')
-- Multiple terms, implicit OR, scored by BM25fts_search('hn_body', 'rust async runtime')
-- AND requirementfts_search('hn_body', '+rust +async')
-- Phrasefts_search('hn_body', '"async runtime"')
-- Exclusionfts_search('hn_body', 'rust -tokio')
-- Top 50 hitsfts_search('hn_body', 'rust async', 50)-- Single term (matches stemmed forms)fts_search('hn_body', 'rust')
-- Multiple terms, implicit OR, scored by BM25fts_search('hn_body', 'rust async runtime')
-- AND requirementfts_search('hn_body', '+rust +async')
-- Phrasefts_search('hn_body', '"async runtime"')
-- Exclusionfts_search('hn_body', 'rust -tokio')
-- Top 50 hitsfts_search('hn_body', 'rust async', 50)For a typo-tolerant search, set fuzzy to 1 or 2 (Levenshtein edit distance). Only the first whitespace-separated token is used for fuzzy matching:
-- Matches 'rust', 'rusty', 'rusted', and typos within 2 editsfts_search('hn_body', 'rsut', 50, 2)-- Matches 'rust', 'rusty', 'rusted', and typos within 2 editsfts_search('hn_body', 'rsut', 50, 2)Multiple Columns
You can index several columns in a single index. They're concatenated with a space at index time and treated as one searchable field:
rq fts create products --table catalog.items \ --column title --column description --column tagsrq fts create products --table catalog.items \ --column title --column description --column tagsLifecycle
rq fts # list indexes in the workspacerq fts show hn_body # status, size, snapshot, freshnessrq fts rebuild hn_body # rebuild against the current snapshotrq fts drop hn_body # delete the index and its archiverq fts # list indexes in the workspacerq fts show hn_body # status, size, snapshot, freshnessrq fts rebuild hn_body # rebuild against the current snapshotrq fts drop hn_body # delete the index and its archiveWhen the source table changes (sync, transform, push), the index is marked STALE. It still works, queries just surface a warning so you know to rebuild. There's no auto-rebuild yet, you control when the index refreshes.
Saved query + REST API
fts_search() works inside a saved query. Bind the search text as a :param
and call /run.
Save:
rq queries create job-search \ --sql "SELECT t.title, t.author, t.url, h.score FROM hackernews.jobs t JOIN fts_search('jobs_fts', :q, :limit) h ON t.id = h.doc_id ORDER BY h.score DESC"rq queries create job-search \ --sql "SELECT t.title, t.author, t.url, h.score FROM hackernews.jobs t JOIN fts_search('jobs_fts', :q, :limit) h ON t.id = h.doc_id ORDER BY h.score DESC"Run:
curl -X POST https://api.rawquery.dev/api/v1/workspaces/<workspace-id>/saved-queries/job-search/run \ -H "Authorization: Bearer rq_xxx" \ -H "Content-Type: application/json" \ -d '{"parameters": {"q": "rust async runtime", "limit": "20"}, "format": "objects"}'curl -X POST https://api.rawquery.dev/api/v1/workspaces/<workspace-id>/saved-queries/job-search/run \ -H "Authorization: Bearer rq_xxx" \ -H "Content-Type: application/json" \ -d '{"parameters": {"q": "rust async runtime", "limit": "20"}, "format": "objects"}'Response:
{ "data": [ {"title": "...", "author": "...", "score": 11.31}, {"title": "...", "author": "...", "score": 8.14} ], "row_count": 2, "execution_time_ms": 187.9}{ "data": [ {"title": "...", "author": "...", "score": 11.31}, {"title": "...", "author": "...", "score": 8.14} ], "row_count": 2, "execution_time_ms": 187.9}The :q parameter accepts the full query syntax (single term, +AND, "phrase",
-exclusion). For fuzzy matching expose a fourth parameter:
JOIN fts_search('jobs_fts', :q, :limit, :fuzzy) h ON t.id = h.doc_idJOIN fts_search('jobs_fts', :q, :limit, :fuzzy) h ON t.id = h.doc_idHybrid Search (BM25 + Vector)
You can combine FTS and vector search in pure SQL. Run both, fuse with Reciprocal Rank Fusion or convex combination:
-- Top BM25 hitsWITH lex AS ( SELECT doc_id, score AS bm25 FROM fts_search('hn_body', 'rust async', 100)),-- Top vector hits (assumes a hackernews.comments_embedded table from `rq embed`)vec AS ( SELECT id AS doc_id, array_distance(embedding, :query_vec::FLOAT[384]) AS dist FROM hackernews.comments_embedded ORDER BY dist LIMIT 100)SELECT COALESCE(lex.doc_id, vec.doc_id) AS doc_id, -- Convex combination: tune alpha (0..1) for your dataset COALESCE(0.6 * (lex.bm25 / (SELECT MAX(bm25) FROM lex)), 0) + COALESCE(0.4 * (1 - vec.dist / (SELECT MAX(dist) FROM vec)), 0) AS hybrid_scoreFROM lexFULL OUTER JOIN vec USING (doc_id)ORDER BY hybrid_score DESCLIMIT 20;-- Top BM25 hitsWITH lex AS ( SELECT doc_id, score AS bm25 FROM fts_search('hn_body', 'rust async', 100)),-- Top vector hits (assumes a hackernews.comments_embedded table from `rq embed`)vec AS ( SELECT id AS doc_id, array_distance(embedding, :query_vec::FLOAT[384]) AS dist FROM hackernews.comments_embedded ORDER BY dist LIMIT 100)SELECT COALESCE(lex.doc_id, vec.doc_id) AS doc_id, -- Convex combination: tune alpha (0..1) for your dataset COALESCE(0.6 * (lex.bm25 / (SELECT MAX(bm25) FROM lex)), 0) + COALESCE(0.4 * (1 - vec.dist / (SELECT MAX(dist) FROM vec)), 0) AS hybrid_scoreFROM lexFULL OUTER JOIN vec USING (doc_id)ORDER BY hybrid_score DESCLIMIT 20;Quotas
Each plan includes a monthly allotment of rows indexed. Rebuilds count toward this quota.
| Plan | Rows / month | Overage |
|---|---|---|
| Free | 100k | hard cap |
| Team | 2M | €0.50 / 1M |
| Business | 20M | €0.30 / 1M |
Storage of the index archive itself counts toward your normal storage quota.
The index archive size is typically 20–35% of the indexed text column's compressed size, with ZSTD applied on top.
Tips
- Project early. When joining with the source table, select only the columns you need:
SELECT c.id, c.author, h.score, notSELECT *. - Tighten
limit. Fewer hits = less work. The default of 10 is for "show me the most relevant"; raise it only when you need ranking far down the list. - Index narrow, not wide. A separate
fts createper column gives you more control than one giant multi-column index. - Rebuild on a cadence that matches your data. Static corpora (HN dump, product catalogue): build once. Live data (support tickets, logs): rebuild after each sync.
Limits
- Index name: 1–100 chars, alphanumeric +
_+- - Up to 5 columns per index
- Single build job: hard-capped at 5× your monthly included rows on overage plans (mid-job safety)
- Concurrent builds platform-wide: 3 (additional submissions wait in
QUEUED) fts_searcharguments must be string literals in the SQL. No column references in the index name or queryfts_searchlimitargument: 1..1000fuzzy: 0, 1, or 2