# Dataset Management

Create, upload, and bind datasets to your Thotis RAG agents.

## What is a dataset?
A dataset is a collection of documents that the agent uses as its knowledge base. When a user asks a question, the RAG engine retrieves relevant chunks from bound datasets and includes them in the LLM context.
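The retrieval-then-prompt flow can be sketched as follows. This is a toy illustration only: term-overlap scoring stands in for the embedding-based similarity search RagFlow actually performs, and all function names are illustrative.

```python
def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by term overlap with the question (a toy stand-in for
    the vector similarity search the real engine performs)."""
    q_terms = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_terms & set(c.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Assemble the LLM context from the retrieved chunks."""
    context = "\n---\n".join(retrieved)
    return (
        f"Use the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```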
## Creating a dataset
Datasets are created in the RagFlow instance and managed from the Thotis console.
1. Go to Datasets in the RAG sidebar.
2. Click New Dataset and provide a name and an optional description.
3. The dataset is created in RagFlow and registered in the Thotis control plane.
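For scripted setups, the same create-then-register step amounts to a single HTTP call to the console. The base URL, endpoint path, and payload shape below are assumptions for illustration; this page does not specify the Thotis API surface, so check your deployment's API reference.

```python
import json

# Hypothetical base URL -- substitute your deployment's actual value.
THOTIS_API = "https://thotis.example.com/api"

def new_dataset_request(name: str, description: str = "") -> dict:
    """Build the HTTP request that creates and registers a dataset.
    The endpoint path and payload shape are assumptions, shown only to
    illustrate the flow."""
    return {
        "method": "POST",
        "url": f"{THOTIS_API}/rag/datasets",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"name": name, "description": description}),
    }
```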
## Uploading documents
Supported formats include Markdown, PDF, DOCX, TXT, and HTML.
Upload documents through the dataset detail page. Each document passes through three stages:

- Parsing: the document is split into chunks using the configured chunking method.
- Indexing: the chunks are embedded and stored in the vector index.
- Ready: the document's chunks are available for retrieval.
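Because the stages above run asynchronously, automation that uploads documents usually polls until processing finishes. A minimal polling helper is sketched below; the status names (`parsing`, `indexing`, `ready`, `failed`) mirror the stages on this page but are assumptions — check the actual API for the real status values.

```python
import time

def wait_until_ready(get_status, poll_interval: float = 2.0, timeout: float = 300.0) -> bool:
    """Poll a status callable until the document reaches 'ready'.
    get_status() is expected to return one of 'parsing', 'indexing',
    'ready', or 'failed' (hypothetical status names)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError("document processing failed")
        time.sleep(poll_interval)
    raise TimeoutError("document not ready within timeout")
```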
## Chunking configuration
The default chunking method is Markdown with these parameters:
| Parameter | Default | Description |
|---|---|---|
| Chunk size | 500-800 tokens | Target chunk size |
| Overlap | 100-150 tokens | Overlap between consecutive chunks |
| Delimiter | `##` | Section delimiter for splitting |
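The defaults above can be approximated with a simple splitter: break on `##` section headings first, then window any oversized section with overlap. This is a sketch only — word counts stand in for the embedding model's tokenizer tokens, and the exact splitting rules belong to RagFlow, not this example.

```python
def chunk_markdown(text: str, delimiter: str = "## ",
                   size: int = 650, overlap: int = 120) -> list[str]:
    """Split markdown into chunks: first on section headings, then by
    sliding a window (with overlap) over sections larger than `size`.
    Sizes are in whitespace-separated words, an approximation of tokens."""
    # Re-attach the heading marker to every section after the first split.
    sections = [(delimiter + part) if i else part
                for i, part in enumerate(text.split("\n" + delimiter))]
    chunks = []
    step = size - overlap
    for section in sections:
        words = section.split()
        if not words:
            continue
        if len(words) <= size:
            chunks.append(" ".join(words))
            continue
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + size]))
            if start + size >= len(words):
                break
    return chunks
```

With the defaults (size 650, overlap 120), a long section yields consecutive chunks that share roughly 120 words of context, which keeps sentences that straddle a boundary retrievable from either side.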
## Binding datasets to agents
1. Open the agent's Sources tab.
2. Click Add Source and select one or more datasets.
3. The binding is stored locally and pushed to RagFlow on publish.
An agent can be bound to multiple datasets. The retrieval engine searches across all bound datasets simultaneously.
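Searching several datasets at once amounts to collecting per-dataset hits and ranking them into one list. The sketch below shows that merge step in isolation; in practice the ranking happens inside RagFlow, and the `(score, chunk)` shape is an assumption for illustration.

```python
import heapq

def merge_results(per_dataset_hits: list[list[tuple[float, str]]],
                  top_k: int = 5) -> list[tuple[float, str]]:
    """Combine (score, chunk) hits from every bound dataset into one
    globally ranked list, highest score first."""
    all_hits = (hit for hits in per_dataset_hits for hit in hits)
    return heapq.nlargest(top_k, all_hits, key=lambda hit: hit[0])
```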
## Data sources
In addition to manual uploads, you can configure automated data sources:
- Web scraping via Firecrawl for crawling websites
- ONISEP XML integration for structured education data
- Custom uploads via the RagFlow API
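As an example of the web-scraping source, a crawl can be kicked off with a single request to Firecrawl. The endpoint and payload below follow Firecrawl's v1 crawl API as of this writing — verify them against the current Firecrawl documentation, and note that how Thotis ingests the crawl results is not shown here.

```python
def firecrawl_crawl_request(site_url: str, api_key: str, limit: int = 100) -> dict:
    """Build a Firecrawl crawl request (shape based on Firecrawl's v1 API;
    verify against the current Firecrawl docs before use)."""
    return {
        "method": "POST",
        "url": "https://api.firecrawl.dev/v1/crawl",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": {"url": site_url, "limit": limit},
    }
```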
## Updating datasets

When you update documents in a dataset, the changes take effect immediately for every agent bound to that dataset; no republish is required. RagFlow updates the retrieval index in real time.