SQL Reference¶
Auto-synced from source
This page is auto-synced from the SynapCores AIDB engine
repository on every release tag. The canonical source is
AIDB_SQL_MANUAL.md in the engine repo — do not edit this
page directly; your change will be overwritten on the next release.
Last synced from: v1.8.1-ce on 2026-06-11
AIDB is an AI-native SQL database with first-class support for vector embeddings, AutoML, Cypher graph queries, and LLM functions. This manual is the authoritative reference for AIDB SQL features (v1.6.0 through v1.6.5.1). Use ONLY features documented here.
Quick reference — AIDB extensions at a glance¶
The features below are AIDB-specific extensions that distinguish AIDB SQL from generic SQL. Recipes for AI / analytics intents should prefer these over hand-rolled equivalents.
| Feature | Form |
|---|---|
| Vector column type | col_name VECTOR(N) where N is the embedding dimension |
| Text embedding | EMBED(text_expr) -> VECTOR |
| Cosine similarity | COSINE_SIMILARITY(vec_a, vec_b) -> float in [-1, 1] |
| Euclidean distance | EUCLIDEAN_DISTANCE(vec_a, vec_b) -> float >= 0 |
| Train AutoML model | CREATE EXPERIMENT name AS SELECT ... WITH (task_type=..., ...) |
| Predict with AutoML model | SELECT AUTOML.PREDICT('model', col1, col2, ...) AS risk FROM t |
| List/describe models | SHOW MODELS, DESCRIBE MODEL name |
| LLM text generation | GENERATE(prompt_text) -> TEXT |
| Native-inference model pull | PULL_MODEL('qwen2.5-coder:7b') -> TEXT (v1.8.0+) |
| Native-inference model list | LIST_MODELS() -> table (v1.8.0+) |
| Native-inference model drop | DELETE_MODEL('model_name') -> TEXT (v1.8.0+) |
| Cypher graph pattern | MATCH (n:Label) RETURN n (per-tenant graph) |
| Cypher graph write | CREATE (n:Label {prop: value}), MERGE, DETACH DELETE n |
Data Definition Language (DDL)¶
CREATE DATABASE / DROP DATABASE / USE / SHOW DATABASES¶
CREATE DATABASE [IF NOT EXISTS] db_name;
DROP DATABASE [IF EXISTS] db_name [CASCADE];
USE db_name;
SHOW DATABASES [LIKE 'pattern'];
CREATE TABLE¶
CREATE TABLE [IF NOT EXISTS] table_name (
column_name data_type [column_constraint],
...
[table_constraint]
);
Data Types (full list):
Scalar: BOOLEAN, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE, DECIMAL(p, s),
TEXT, VARCHAR(n), CHAR(n), BYTEA, JSON, JSONB, UUID,
TIMESTAMP, DATE, TIME.
AI-native: VECTOR(N) where N is the embedding dimension (must match the configured embedding model — default MiniLM is 384).
Multimedia: AUDIO, VIDEO, IMAGE, PDF.
Column constraints: PRIMARY KEY, UNIQUE, NOT NULL, CHECK (expr), DEFAULT expr, REFERENCES other_table(other_col).
Worked example — table with a vector column for semantic search:
CREATE TABLE products (
id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
category TEXT,
price DECIMAL(10, 2),
description TEXT,
description_vec VECTOR(384)
);
ALTER TABLE / DROP TABLE / CREATE INDEX / DROP INDEX¶
ALTER TABLE t ADD COLUMN c data_type [constraint];
ALTER TABLE t DROP COLUMN c;
ALTER TABLE t RENAME COLUMN old TO new;
ALTER TABLE t ALTER COLUMN c TYPE new_type;
DROP TABLE [IF EXISTS] t [CASCADE];
CREATE [UNIQUE] INDEX [IF NOT EXISTS] idx_name ON t (col [ASC|DESC], ...);
DROP INDEX [IF EXISTS] idx_name;
Data Manipulation Language (DML)¶
INSERT INTO t [(c1, c2, ...)] VALUES (v1, v2, ...), ...;
UPDATE t SET c1 = v1, c2 = v2 [WHERE condition];
DELETE FROM t [WHERE condition];
Worked example — populate a vector column from text using EMBED:
Query Language¶
SELECT¶
SELECT [ALL | DISTINCT] expr [AS alias], ...
FROM table_name
[WHERE condition]
[GROUP BY expr, ...]
[HAVING condition]
[ORDER BY expr [ASC | DESC], ...]
[LIMIT n] [OFFSET k];
ORDER BY may reference projection aliases directly:
SELECT id,
COSINE_SIMILARITY(description_vec, EMBED('wireless headphones')) AS similarity
FROM products
ORDER BY similarity DESC
LIMIT 10;
(If the alias is misspelled, the query returns a clear unknown column error rather than silently returning empty — fixed in v1.6.5.1.)
Joins, CTEs, subqueries¶
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= NOW() - INTERVAL '30 days';
WITH recent_orders AS (
SELECT * FROM orders WHERE created_at >= NOW() - INTERVAL '30 days'
)
SELECT customer_id, COUNT(*) AS n_orders
FROM recent_orders
GROUP BY customer_id;
Transaction Control¶
Built-in Functions¶
Math¶
ABS, CEIL/CEILING, FLOOR, ROUND, MOD, POWER/POW, SQRT, EXP,
LOG/LN, LOG10, SIGN, TRUNCATE/TRUNC, PI, RAND/RANDOM,
SIN, COS, TAN, ASIN, ACOS, ATAN, DEGREES, RADIANS.
String¶
UPPER, LOWER, LENGTH, SUBSTRING, CONCAT, TRIM, LTRIM, RTRIM,
REPLACE, LEFT, RIGHT, LPAD, RPAD, REPEAT, REVERSE,
INSTR/POSITION, ASCII, CHAR/CHR, INITCAP, MD5, SHA1, SHA256.
Date / Time¶
NOW/CURRENT_TIMESTAMP, CURRENT_DATE, CURRENT_TIME,
YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP,
DATE_FORMAT(date, fmt), STR_TO_DATE(s, fmt),
DATE_ADD(date, n, unit), DATE_SUB(date, n, unit), DATEDIFF(d1, d2),
LAST_DAY, DAYNAME, MONTHNAME, QUARTER, WEEK/WEEKOFYEAR,
DAYOFWEEK, DAYOFYEAR.
Format examples:
DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:%s') -- '2026-05-17 14:30:45'
DATE_ADD('2026-01-15', 30, 'DAY') -- 2026-02-14
DATEDIFF('2026-12-31', NOW()) -- days until year end
Conditional / null¶
GREATEST(a, b, ...), LEAST(a, b, ...), IF(cond, then, else)/IIF,
IFNULL(expr, alt)/ISNULL, COALESCE(...), NULLIF(a, b),
CASE WHEN ... THEN ... ELSE ... END.
Vector & AI Functions¶
AIDB exposes vector and LLM operations as first-class SQL functions. They compose normally with WHERE, ORDER BY, joins, and CTEs.
EMBED(text)¶
Computes an embedding for the given text using the configured embedding model.
- Argument: any
TEXTexpression. - Returns:
VECTOR(N)matching the configured model dimension (default 384 for MiniLM). - The column you store the result in must use the matching dimension.
SELECT EMBED('wireless noise cancelling headphones');
UPDATE products SET description_vec = EMBED(description);
COSINE_SIMILARITY(vec_a, vec_b)¶
Returns a DOUBLE in [-1, 1]. 1 = identical direction, 0 = orthogonal, -1 = opposite.
SELECT id, name,
COSINE_SIMILARITY(description_vec, EMBED('running shoes')) AS similarity
FROM products
ORDER BY similarity DESC
LIMIT 10;
EUCLIDEAN_DISTANCE(vec_a, vec_b)¶
Returns a DOUBLE >= 0. Smaller = more similar.
SELECT id, EUCLIDEAN_DISTANCE(description_vec, EMBED('hiking boots')) AS dist
FROM products
ORDER BY dist ASC
LIMIT 5;
GENERATE(prompt)¶
Calls the configured completion model and returns the generated text.
- Argument:
TEXTprompt. - Returns:
TEXT. - Cached on identical prompt within a session (call latency drops to ~0 after first call).
- Local LLMs are slow per-row — use
GENERATEfor small result sets, not full-table scans.
SELECT id,
GENERATE('Summarize this customer review in one sentence: ' || review_text) AS summary
FROM reviews
WHERE rating <= 2
LIMIT 50;
SEMANTIC_MATCH, MULTI_MODAL_SIMILARITY, CROSS_MODAL_SEARCH¶
Higher-level helpers used inside SEMANTIC JOIN and multi-modal queries. Prefer EMBED + COSINE_SIMILARITY for explicit similarity, and use SEMANTIC_MATCH only in SEMANTIC JOIN clauses.
Other built-in AI functions¶
CLASSIFY(text, categories), EXTRACT_ENTITIES(text), SENTIMENT(text), SUMMARIZE(text), TRANSLATE(text, target_lang).
Native-inference model lifecycle (v1.8.0+)¶
v1.8.0-ce ships an in-process OCI v2 model registry: the gateway can
pull GGUF models from registry.ollama.ai (or any Docker Distribution
v2 registry) and serve them via the embedded local provider — no
external Ollama daemon, no separate process. The three functions below
expose that registry as SQL, alongside the equivalent synapcores pull
/ synapcores models list CLI commands.
These functions are active when the gateway is running with
[query.ai_service].provider = "local" (the v1.8 default — set
automatically when [query.ai_service] is omitted from gateway.toml).
The model store lives under data_dir/models/ with sha256-addressed
blobs and JSON manifest sidecars.
PULL_MODEL(name) — fetch a model into the local store¶
- Argument:
TEXT— model reference. Acceptsname,name:tag,namespace/name[:tag], orregistry/namespace/name[:tag]. Defaults: registry=registry.ollama.ai, namespace=library, tag=latest. - Returns:
TEXT— the resolved manifest digest. - Idempotent: a second pull with the same name short-circuits when the local manifest digest matches the registry's current digest; no blob bytes are re-fetched.
- Resume: interrupted pulls leave a
.partialfile; re-runningPULL_MODELresumes from the byte offset already on disk. - Min engine version: 1.8.0.
-- Pull the default 7B chat model (the v1.8 install-script default).
SELECT PULL_MODEL('qwen2.5-coder:7b');
-- Pull an embedding model (used by EMBED + AGENT_RUN's memory layer).
SELECT PULL_MODEL('library/all-minilm:latest');
-- Pull from a third-party namespace.
SELECT PULL_MODEL('bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M');
LIST_MODELS() — inventory the local model store¶
- Arguments: none.
- Returns: a table with columns
(name TEXT, architecture TEXT, size_bytes BIGINT, digest TEXT, pulled_at TIMESTAMP, last_used_at TIMESTAMP). - Reads on-disk manifest sidecars only — never touches the network.
- Min engine version: 1.8.0.
-- Inventory installed models.
SELECT * FROM LIST_MODELS();
-- Total disk used by the model store.
SELECT SUM(size_bytes) AS total_bytes FROM LIST_MODELS();
DELETE_MODEL(name) — remove a model from the local store¶
- Argument:
TEXT— model reference (same name forms asPULL_MODEL). - Returns:
TEXT— the digest of the removed manifest. - Reference-counts content-addressed blobs: a blob shared with another tag is kept on disk until the last reference is dropped.
- Errors if the model is currently loaded in the LRU; unload it first
by pointing
[query.ai_service].modelelsewhere, or restart the gateway. - Min engine version: 1.8.0.
AutoML¶
AutoML trains real models from a SQL SELECT and exposes the trained model as an in-SQL function AUTOML.PREDICT. Training and prediction are first-class SQL — you do not call out to Python.
CREATE EXPERIMENT — train a model¶
Syntax (current — v1.6.5):
CREATE EXPERIMENT model_name AS
SELECT feature_1, feature_2, ..., label_column AS target
FROM training_table
[WHERE ...]
WITH (
task_type = 'binary_classification' | 'multi_classification' | 'regression'
| 'clustering' | 'time_series',
-- 'anomaly_detection' task_type — coming in v1.8 (Algorithm::IsolationForest + ANOMALY_SCORE())
target_column = 'target',
[optimization_metric = 'auc' | 'accuracy' | 'f1' | 'rmse' | 'mae' | ...,]
[max_trials = 50,]
[algorithms = ['logistic_regression', 'random_forest', 'gradient_boosting']]
);
- The
targetcolumn from the SELECT becomes the label. By convention, for binary classificationtarget = 1is the positive class (e.g. fraud, churn). - Without
algorithms, AutoML runs in Auto mode and explores a sensible default set. - Validation predictions are calibrated with isotonic regression for binary tasks, so
AUTOML.PREDICTreturns a well-calibratedP(class=1). CREATE EXPERIMENT ASYNC name AS ...schedules training in the background; poll withSHOW MODELS/DESCRIBE MODEL.
Algorithm options:
| Algorithm | Best for | Speed | Accuracy |
|---|---|---|---|
| logistic_regression | Binary classification, interpretable | Fast | Good |
| linear_regression | Simple regression, interpretable | Fast | Good for linear |
| random_forest | General purpose, robust | Medium | High |
| gradient_boosting | High accuracy, competitions | Slow | Very High |
| neural_network | Complex patterns, large data | Slow | High |
| knn | Simple, local patterns | Fast | Medium |
| svm | Binary classification, kernels | Medium | High |
| naive_bayes | Text classification, simple | Very fast | Medium |
Worked example — train a churn model:
CREATE EXPERIMENT churn_model_v1 AS
SELECT tenure_months,
monthly_charges,
total_charges,
visits_30d,
churned AS target
FROM customers
WITH (
task_type = 'binary_classification',
target_column = 'target',
optimization_metric = 'auc',
max_trials = 30,
algorithms = ['logistic_regression', 'random_forest', 'gradient_boosting']
);
AUTOML.PREDICT(...) — predict with a trained model¶
Syntax:
SELECT pass_through_col_1, pass_through_col_2, ...,
AUTOML.PREDICT('model_name', feature_1, feature_2, ...) [AS alias]
FROM scoring_table
[WHERE ...]
[ORDER BY alias DESC|ASC]
[LIMIT n];
- First argument is the model name as a quoted string.
- Remaining arguments are the feature columns, in any order — they are matched by name to the model's feature schema.
- Returns:
- Binary classification: calibrated
P(target = 1)asDOUBLE. - Multiclass: probability of the predicted (top) class.
- Regression: the raw numeric prediction.
- Default alias is
predictionifAS aliasis omitted. - You may sort or filter on the alias (
ORDER BY alias DESC,WHERE alias > 0.8) — the planner pushes the prediction down so the alias is in scope.
Worked example — rank customers by churn risk:
SELECT id, name, tier,
AUTOML.PREDICT('churn_model_v1',
tenure_months, monthly_charges, total_charges, visits_30d) AS risk
FROM customers
WHERE tier = 'Gold'
ORDER BY risk DESC
LIMIT 50;
Important — feature columns also in the projection (v1.6.5.1): If a feature column is also a pass-through column, the planner dedupes it automatically; you do NOT need to list it twice:
-- OK: tenure_months in BOTH pass-through and features — dedupe handles it.
SELECT id, tenure_months,
AUTOML.PREDICT('churn_model_v1', tenure_months, monthly_charges) AS risk
FROM customers
ORDER BY risk DESC;
Model lifecycle¶
SHOW MODELS; -- list all models in current tenant
DESCRIBE MODEL churn_model_v1; -- schema, algorithm, metrics, training time
DROP MODEL churn_model_v1; -- delete model artifacts
SHOW EXPERIMENTS; -- list (legacy) experiments
DESCRIBE EXPERIMENT name;
Known limitations:
- Model artifacts live under
<data_dir>/models/and are not portable across binaries (the in-memory representation may change with version upgrades). - Models are tenant-prefixed; a tenant cannot use another tenant's models.
AUTOML.PREDICTis supported in the SELECT projection. Wrapping it inside aFROM (...) AS subsubquery requires the v1.6.5.1 outer-subquery-wrap fix and works forORDER BY alias/LIMITbut not for arbitrary outer projection rewrites.
Cypher Graph Queries¶
AIDB ships a per-tenant property graph engine with a Cypher subset. Cypher statements are routed automatically by /v1/query/execute (and by the SQL /v2/query/execute path) — you do not need a separate endpoint.
Read patterns¶
-- Find all nodes with a given label
MATCH (n:Person) RETURN n LIMIT 100;
-- Filter on properties
MATCH (n:Person) WHERE n.age >= 18 RETURN n.name, n.age;
-- Traverse a relationship
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person)
RETURN b.name;
-- Variable-length pattern + filter on the path
MATCH (a:Account)-[:TRANSFERRED*1..3]->(b:Account)
WHERE a.owner = 'alice@example.com'
RETURN a.id, b.id;
Write patterns¶
-- Create a node
CREATE (p:Person {name: 'Bob', age: 30});
-- Create a relationship between two existing nodes
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
CREATE (a)-[:KNOWS {since: 2026}]->(b);
-- MERGE = match-or-create (good for ingest pipelines)
MERGE (p:Patient {mrn: 'MRN-101'})
MERGE (drug:Drug {name: 'Warfarin'})
MERGE (p)-[:PRESCRIBED]->(drug);
-- Delete a node and ALL its relationships
MATCH (n:Person {name: 'Charlie'}) DETACH DELETE n;
-- UNWIND a list to bulk-create
UNWIND [{name: 'Alice', age: 30}, {name: 'Bob', age: 25}] AS row
CREATE (:Person {name: row.name, age: row.age});
When to use graph vs SQL JOIN¶
- Use a SQL JOIN for tabular, fixed-depth relationships you already model in tables.
- Use Cypher when you need multi-hop traversals, variable-length paths, or to express "find me everyone reachable from X via these edge types" succinctly. Cypher beats N-way SQL self-joins on graph-shaped data.
Discovery¶
SHOW PROPERTY GRAPHS; -- list graphs in this tenant
CALL db.labels(); -- list all node labels in the active graph
Multi-modal SQL¶
For images, audio, video, and PDF stored in IMAGE/AUDIO/VIDEO/PDF columns, AIDB exposes:
-- Embed any modality and search across modalities
SELECT id,
MULTI_MODAL_SIMILARITY(text := description,
image := cover_image,
weights := '{"text":0.6,"image":0.4}') AS score
FROM products
ORDER BY score DESC LIMIT 10;
-- Semantic JOIN: match rows by semantic similarity instead of equality
SELECT a.id, b.id
FROM articles a
SEMANTIC JOIN reference_docs b
ON SEMANTIC_MATCH(a.body, b.text, threshold := 0.75);
MULTI_MODAL_SIMILARITY(...), CROSS_MODAL_SEARCH(...), and SEMANTIC_MATCH(...) accept named arguments using the name := value syntax.
Triggers and Procedures¶
CREATE [OR REPLACE] TRIGGER trg_name
{BEFORE | AFTER} {INSERT | UPDATE | DELETE} ON table_name
[FOR EACH ROW]
[WHEN (condition)]
EXECUTE PROCEDURE proc_name(args);
DROP TRIGGER [IF EXISTS] trg_name ON table_name;
CREATE [OR REPLACE] PROCEDURE proc_name(args) AS $$
BEGIN
-- procedure body
END;
$$ LANGUAGE plpgsql;
DROP PROCEDURE [IF EXISTS] proc_name;
CALL proc_name(args);
SHOW PROCEDURES [LIKE 'pattern'];
SHOW TRIGGERS [FROM table_name] [LIKE 'pattern'];
Natural Language¶
ASK '<natural language question>'; -- run a natural language query
EXPLAIN NATURAL '<natural language question>'; -- show the SQL plan
Critical "do's and don'ts" for AIDB recipes¶
DO prefer the AIDB-native extension when the intent matches:
- Need text/image similarity? ->
EMBED+COSINE_SIMILARITY(orMULTI_MODAL_SIMILARITY). - Need a trained model? ->
CREATE EXPERIMENT ... WITH (...)thenAUTOML.PREDICT(...). - Need risk-ranked output? ->
ORDER BY <prediction_alias> DESC LIMIT Ndirectly. - Need multi-hop relationships? -> Cypher
MATCH, not N-way self-joins. - Need LLM-generated text per row? ->
GENERATE(prompt)in the SELECT.
DON'T:
- Don't invent functions or syntax not in this manual.
- Don't list a feature column twice when it's also in the pass-through projection —
AUTOML.PREDICTdedupes. - Don't use placeholder comments like
-- your SQL hereor[bracket placeholders]— every recipe must have actual, runnable SQL. - Don't include
DROP TABLE/DROP INDEXcleanup steps in recipes (they delete user data). - Don't store an embedding in a column whose declared dimension differs from the model's output dim — that is a runtime error.