Filesystem Collections¶

Available in v1.2.0+

This feature is available in v1.2.0 and later. If you're on an earlier version, upgrade with:

curl -fsSL https://get.synapcores.com/install.sh | sh

A Filesystem Collection is a folder on disk that the gateway watches. Drop a file in, and SynapCores ingests it: CSVs become SQL tables, PDFs and text files get chunked and embedded for RAG, images get OCR'd and CLIP-embedded, and audio/video gets transcribed via the embedded inference engine. Everything is queryable from SQL the moment it lands.

It's a thin operator-friendly path to the same primitives that already power the Web UI's media gallery and document collections — except the upload step is "save the file."

Use cases¶

Knowledge bases — point a collection at the team's research folder; every PDF, Markdown note, and meeting transcript becomes vector-searchable from SQL or AI Chat.
Document search over a shared drive — sync an SMB or NFS mount into a watched path; ingest is automatic as the drive updates.
Log ingestion — point a collection at a directory of CSV log exports; each file becomes a SQL table you can query and join immediately.
Image gallery from a synced folder — drop screenshots or product photos in; CLIP embeddings make them visually searchable, OCR captures any embedded text.
Podcast or meeting archive — drop .mp3 or .mp4 files in; Whisper transcribes the audio track, the transcript is embedded, and the original is linked from the Web UI media gallery.

How files get processed¶

The processor dispatches on file extension. Each extension has a single deterministic path; you don't configure it per-file.

Extension	What happens
`.csv`	Schema-inferred from the header row, becomes a SQL table you can query directly.
`.pdf`, `.txt`, `.md`	Text extracted, chunked (default 512 tokens), embedded; queryable via `SELECT ... WHERE COSINE_SIMILARITY(...)`.
`.jpg`, `.png`, `.webp`	OCR run for any text content (Tesseract), CLIP embedding for visual search, indexed in the gallery.
`.mp3`, `.wav`, `.m4a`	Whisper transcription, transcript chunked and embedded.
`.mp4`, `.mov`, `.webm`	Frames sampled at fixed intervals (CLIP-embedded), audio track Whisper-transcribed; both go into the same collection index.
anything else	Metadata-only entry — filename + path + size embedded so you can still find it by name.

Files outside the collection's extensions allowlist are ignored entirely — no metadata entry, no log line beyond a debug-level skip.

Setting up a watched folder¶

Prerequisites¶

The folder must exist and be readable by the synapcores system user (the user the gateway runs as).
The path must be canonical — symlinks pointing outside the path are rejected. This is a hard security guard, not a config option.
For first ingest with embeddings, the gateway needs outbound network access to Hugging Face (one-time model download — see First-run AI setup).

Option A — From the Web UI¶

Log in (see Web UI).
Open Filesystem Collections in the left nav.
Click Create, fill in:
- Name — your handle for the collection (e.g. team_kb)
- Path — server-side absolute path (e.g. /data/incoming)
- Extensions — multi-select from the supported list
- Recursive — watch subdirectories too
- Chunk size — for text files, tokens per chunk (default 512)
- Embed model — minilm (default), bert-base, or bert-large
Click Create. The collection enters active state immediately and a startup scan kicks off.

Screenshots coming with v1.2.0 release.

Option B — From SQL¶

CREATE COLLECTION my_docs
  FROM FILESYSTEM '/data/incoming'
  WITH (
      extensions   = 'pdf,csv,jpg,mp4,txt',
      recursive    = true,
      auto_index   = true,
      chunk_size   = 512,
      embed_model  = 'minilm'
  );

CREATE COLLECTION ... FROM FILESYSTEM is a SynapCores extension to SQL. It's parsed through the same dual-path parser routing as CREATE IMMUTABLE TABLE, so it works identically from the CLI, the REST API, and the Web UI's query editor.

What you'll see during the first ingest¶

In the gateway logs:

filesystem_collections: created collection "my_docs" id=fc_01HAB...
filesystem_collections: scanning "/data/incoming" (recursive=true, 47 candidate files)
filesystem_collections: processed report-q3.pdf (12 chunks embedded, 0.84s)
filesystem_collections: processed sales.csv (1,203 rows → table sales, 0.31s)
filesystem_collections: processed launch.mp4 (transcribed 8m12s, 14 chunks, 22.4s)
filesystem_collections: collection "my_docs" caught up — 47 files, 0 failed

After that, the watcher takes over. Drop another file in and you'll see a single processed line within a second or two of the OS event.

Querying the results¶

Text RAG (PDFs, .txt, .md)¶

SELECT name, content_chunk
FROM   my_docs
WHERE  COSINE_SIMILARITY(embedding, EMBED('quarterly revenue trends')) > 0.7
ORDER  BY similarity DESC
LIMIT  5;

CSV-as-table¶

When a CSV is ingested, the collection registers a SQL table with the file's basename. Query it like any other table:

SELECT region, SUM(revenue) AS total
FROM   sales
GROUP  BY region
ORDER  BY total DESC;

If you want the full join syntax across collections, the table is addressable as my_docs.sales too.

Large CSV files (streaming mode)¶

CSV files with more than 10 000 rows are processed in streaming mode: the processor reads the file in 1 000-row batches, infers the column schema from the first 1 000 rows, then refines it as later batches arrive (always widening — never tightening — the inferred type). This caps peak memory regardless of file size: a 100 MB / 1 M-row CSV ingests in well under 200 MB of RSS, where the previous fully- buffered code path would OOM the gateway.

Progress for streaming CSVs is reported per-batch on the same WebSocket progress feed used by the rest of the pipeline (see the /v1/filesystem-collections/:id/progress endpoint). Each batch event carries the running row count and elapsed milliseconds so the UI can show a live 42 000 / 1 000 000 rows indicator during a long ingest.

Tunables:

chunk_size (per-collection, see below) — only affects the text preview generated from the first ~200 CSV rows for full-text RAG. It does not affect the SQL ingest path or the streaming threshold.
The streaming threshold itself is currently a fixed gateway default (10 000 rows). On-the-fly tuning via WITH (...) options is on the v1.3 roadmap.

Image search¶

SELECT name, path
FROM   my_docs
WHERE  file_type = 'image'
  AND  COSINE_SIMILARITY(embedding, EMBED('whiteboard photo with handwritten diagrams')) > 0.6
LIMIT  10;

OCR text is in the ocr_text column if you'd rather grep it directly:

SELECT name, ocr_text
FROM   my_docs
WHERE  file_type = 'image'
  AND  ocr_text ILIKE '%invoice%';

Audio / video transcripts¶

SELECT name, content_chunk, start_time, end_time
FROM   my_docs
WHERE  file_type IN ('audio', 'video')
  AND  COSINE_SIMILARITY(embedding, EMBED('roadmap commitments for Q4')) > 0.7
ORDER  BY similarity DESC
LIMIT  10;

start_time / end_time are in seconds within the source media — useful for deep-linking back into the original.

Cross-collection queries (within a tenant)¶

A tenant can query across all of its collections in a single statement:

SELECT collection_name, name, content_chunk
FROM   filesystem_collections.documents
WHERE  COSINE_SIMILARITY(embedding, EMBED('open security incidents')) > 0.7
ORDER  BY similarity DESC
LIMIT  20;

Configuration reference¶

`CREATE COLLECTION` options¶

Option	Type	Default	Description
`extensions`	comma-separated string	`'pdf,csv,txt,md,jpg,png,mp3,mp4'`	Allowlist of file extensions to ingest.
`recursive`	bool	`true`	Watch subdirectories of the root path.
`auto_index`	bool	`true`	Build the vector index as files arrive. Set `false` to ingest now and build the index later via `REINDEX COLLECTION`.
`chunk_size`	int	`512`	Tokens per chunk for text files.
`embed_model`	string	`'minilm'`	Embedding model: `minilm` (384-d), `bert-base` (768-d), `bert-large` (1024-d).

REST endpoints¶

All endpoints require Authorization: Bearer <JWT> and are tenant-scoped automatically.

Method	Path	Description
`POST`	`/v1/filesystem-collections`	Create a new collection.
`GET`	`/v1/filesystem-collections`	List collections for the authenticated tenant.
`GET`	`/v1/filesystem-collections/:id`	Detail view with recent ingest events.
`PATCH`	`/v1/filesystem-collections/:id`	Pause / resume — body `{"paused": true}` or `{"paused": false}`.
`DELETE`	`/v1/filesystem-collections/:id`	Delete. Optional query string `?retain_data=true` keeps the ingested rows / vectors.
`GET`	`/v1/filesystem-collections/:id/documents`	List ingested files with status, size, and last-processed timestamp.
`GET`	`/v1/filesystem-collections/:id/documents/:doc_id/file`	Download the original file binary. Append `?thumbnail=true` for a 256x256 JPEG preview. Streams the response and supports HTTP range requests for video.
`POST`	`/v1/filesystem-collections/:id/documents/:doc_id/reprocess`	Re-queue one document for ingest. Returns `202 Accepted` with the enqueue timestamp. Idempotent: repeated clicks while the doc is still pending are deduped.

Fetching file binaries¶

The file endpoint streams the original bytes back with the appropriate Content-Type (PNG, MP4, PDF, …). HTTP range requests are honored — every full-file response sets Accept-Ranges: bytes, and clients can issue Range: bytes=N-M to seek without re-downloading earlier bytes. HTML5 <video> does this automatically for scrubbing.

# Full file
curl -O -H "Authorization: Bearer $TOKEN" \
     "http://localhost:8080/v1/filesystem-collections/$CID/documents/$DOC_ID/file"

# 256x256 JPEG thumbnail (browser-cacheable for an hour)
curl -H "Authorization: Bearer $TOKEN" \
     "http://localhost:8080/v1/filesystem-collections/$CID/documents/$DOC_ID/file?thumbnail=true" \
     -o thumb.jpg

# Range request — first 1 MB only (HTML5 video does this implicitly)
curl -H "Authorization: Bearer $TOKEN" \
     -H "Range: bytes=0-1048575" \
     "http://localhost:8080/v1/filesystem-collections/$CID/documents/$DOC_ID/file" \
     -o head.bin

Thumbnails are generated on demand: images and PDFs render to JPEG via the in-process decoders, videos extract a frame at 1.0s via FFmpeg when present, and audio / SVG / unknown formats fall back to a generic SVG icon. Thumbnail responses ship with Cache-Control: private, max-age=3600; full-file responses use no-cache because the watcher may have re-ingested the file between requests.

If the source file was deleted from disk after ingest, the endpoint returns 410 Gone with a "file removed from disk; re-process to update" message — trigger the reprocess endpoint above to clean up. Cross-tenant requests collapse to 404 Not Found. Malformed Range headers return 416 Range Not Satisfiable.

Example — create from curl:

TOKEN="<JWT from /v1/auth/login>"
curl -X POST http://localhost:8080/v1/filesystem-collections \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
           "name": "my_docs",
           "path": "/data/incoming",
           "extensions": ["pdf", "csv", "txt"],
           "recursive": true,
           "auto_index": true,
           "chunk_size": 512,
           "embed_model": "minilm"
         }'

WebSocket progress¶

Live progress events for a single collection:

GET /ws/filesystem-collections/:id/progress

The server emits one JSON message per ingest lifecycle event (scan_started, file_processing, file_done, file_failed, scan_complete, paused, resumed). The Web UI uses this stream to power its real-time event log.

Tenant isolation¶

Filesystem Collections are tenant-scoped end to end:

Collection metadata is keyed by tenant in RocksDB.
The vector index is created per-collection-per-tenant, with no shared name space.
Two tenants can legitimately watch overlapping paths on the host. Each gets its own ingestion, its own index, its own results — they don't see each other's collections, documents, or queries.
The gateway optionally enforces a per-tenant allowed-roots list in gateway.toml; if set, collection paths must be under one of those roots.

Limits & quotas¶

Max file size: 100 MB per file by default. Files above the cap are skipped with a logged warning. Override per-collection in v1.3.
Max collections per tenant: subject to the standard CE caps — see Limits & quotas.
Embedding throughput: bounded by the embedding model's per-second rate. The pipeline backpressures gracefully; under sustained overload the queue policy is drop-oldest-pending with a warning.
Vectors per collection: same 50,000,000 ceiling as any other vector collection in CE.

Troubleshooting¶

"I dropped a file but it doesn't appear"¶

Walk the checklist:

Tail the gateway log during the drop:
```
sudo journalctl -u synapcores -f | grep filesystem_collections
```
No log line at all? The watcher didn't see the OS event.
Confirm the extension is in the collection's allowlist. A .docx file will not be ingested if the collection only watches pdf,csv,txt.
Check file permissions. The synapcores user must be able to read the file:
```
sudo -u synapcores cat /data/incoming/report.pdf > /dev/null
```
Permission denied? Fix the owner or group on the file.
Confirm the watched folder still exists. If the directory was deleted and recreated under the same path, the kernel inode changed and the old watch may be dead. Pause and resume the collection to re-arm it.

"Embedding model download is slow / hangs"¶

The first call to EMBED() triggers a Hugging Face download into ~/.cache/huggingface/. On a fresh install, expect ~90 MB for the default MiniLM model. Subsequent calls are cached and run offline.

If the download fails, the gateway logs a clear error and the file stays in pending state — no data loss, just retry once connectivity is back. To force a retry without waiting for the next OS event, pause and resume the collection.

"A single document failed — how do I retry it?"¶

If a document fails to ingest (transient OCR error, network blip during a Hugging Face model download, encoder crash on a corrupt video), click Re-process in the document grid (UI) or POST /v1/filesystem-collections/:id/documents/:doc_id/reprocess (API). Recovery is at-most-once per click; the orchestrator dedupes if the doc is still queued, so a double-click does not create duplicate work. The doc transitions back to pending immediately and to indexed once a worker drains it. You can re-process a document in any state (pending / processing / indexed / failed); this is also the right tool to re-embed a single file after changing the embedding model. If the underlying source file has been deleted from disk, the API returns 410 Gone — delete the document record instead.

"Watched folder must exist when collection is created"¶

Yes — by design. Creating a collection over a path that doesn't exist returns a 400 Bad Request. If the folder gets deleted out from under a live collection, the next scan logs a clear error and the collection moves to failed state until the path is restored.

What's not supported in Community Edition¶

Cloud storage watching (S3, GCS, Azure Blob) — Enterprise feature.
Cross-tenant collections — collections are strictly tenant-scoped; there is no "shared" mode in CE.
OCR for handwritten content — the embedded Tesseract pipeline handles printed text only.
Real-time collaborative editing of ingested files — ingest is read-only. Edit the source file at the OS layer and the watcher will re-ingest it; SynapCores doesn't write back to the watched directory.

Where to next¶

Web UI — full UI tour, including the media gallery and vector search surfaces that Filesystem Collections feed into.
First-run AI setup — wire up an LLM provider so the ingested content is reachable from AI Chat and NL2SQL.
Limits & quotas — the per-tenant caps that apply.