Dataset Management

Create datasets, upload documents to them, and bind them to your Thotis RAG agents.

What is a dataset?

A dataset is a collection of documents that the agent uses as its knowledge base. When a user asks a question, the RAG engine retrieves relevant chunks from bound datasets and includes them in the LLM context.
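The last step, placing retrieved chunks into the LLM context, can be pictured with a minimal sketch. The function name `build_prompt`, the prompt wording, and the character budget are illustrative assumptions, not the prompt Thotis or RagFlow actually uses.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt from retrieved chunks (hypothetical helper).

    Chunks are assumed to arrive ordered from most to least relevant and
    are added until the character budget is exhausted.
    """
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks lets the model cite which passage supports an answer, and the budget check shows why chunk size matters: larger chunks mean fewer passages fit into the context.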

Creating a dataset

Datasets are created in the RagFlow instance and managed from the Thotis console.

  1. Go to Datasets in the RAG sidebar.
  2. Click New Dataset and provide a name and optional description.
  3. The dataset is created in RagFlow and registered in the Thotis control plane.

Uploading documents

Supported formats include Markdown, PDF, DOCX, TXT, and HTML.

Upload documents through the dataset detail page. Each document goes through:

  1. Parsing: the document is split into chunks using the configured method.
  2. Indexing: chunks are embedded and stored in the vector index.
  3. Ready: the document's chunks are available for retrieval.
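The three stages above behave like a simple linear state machine. The sketch below models them that way; the `Stage` enum and helper names are hypothetical and are not status values returned by RagFlow or the Thotis console.

```python
from enum import Enum


class Stage(Enum):
    """Processing stages of an uploaded document (illustrative names)."""
    PARSING = "parsing"
    INDEXING = "indexing"
    READY = "ready"


# Each stage has exactly one successor; READY is terminal.
NEXT_STAGE = {Stage.PARSING: Stage.INDEXING, Stage.INDEXING: Stage.READY}


def advance(stage: Stage) -> Stage:
    """Return the stage that follows `stage`, or READY if already terminal."""
    return NEXT_STAGE.get(stage, Stage.READY)


def is_retrievable(stage: Stage) -> bool:
    """Only documents that finished indexing can be returned by retrieval."""
    return stage is Stage.READY
```

The practical consequence is that a freshly uploaded document does not influence answers until it has passed both parsing and indexing.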

Chunking configuration

The default chunking method is Markdown with these parameters:

  Parameter    Default           Description
  Chunk size   500-800 tokens    Target chunk size
  Overlap      100-150 tokens    Overlap between consecutive chunks
  Delimiter    ##                Section delimiter for splitting
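To make the interaction of these three parameters concrete, here is a minimal sketch of delimiter-based chunking with overlap. It is not RagFlow's implementation: the function names are hypothetical, and whitespace-separated words stand in for model tokens, which a real tokenizer would count differently.

```python
def split_sections(text: str, delimiter: str = "## ") -> list[str]:
    """Split Markdown into sections, starting a new one at each delimiter line."""
    sections: list[list[str]] = [[]]
    for line in text.splitlines():
        if line.startswith(delimiter) and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    return ["\n".join(s) for s in sections if s]


def chunk_section(section: str, chunk_size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Cut one section into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = section.split()  # assumption: words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the section
    return chunks


def chunk_markdown(text: str, chunk_size: int = 800, overlap: int = 100,
                   delimiter: str = "## ") -> list[list[str]]:
    """Chunk a whole document section by section, so chunks never span a heading."""
    return [
        chunk
        for section in split_sections(text, delimiter)
        for chunk in chunk_section(section, chunk_size, overlap)
    ]
```

Splitting on the delimiter first keeps each chunk inside a single section, and the overlap means a sentence that straddles a window boundary still appears whole in at least one chunk when it is shorter than the overlap.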

Binding datasets to agents

  1. Open the agent's Sources tab.
  2. Click Add Source and select one or more datasets.
  3. The binding is stored locally and pushed to RagFlow on publish.

An agent can be bound to multiple datasets. The retrieval engine searches across all bound datasets simultaneously.
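One way to picture a search across several bound datasets is as a merge of per-dataset results into a single ranked list. The sketch below assumes each dataset returns scored hits on a comparable scale; the `Hit` type and `merge_hits` function are hypothetical and do not describe RagFlow's internal ranking.

```python
import heapq
from typing import NamedTuple


class Hit(NamedTuple):
    dataset: str
    chunk: str
    score: float  # higher means more relevant (assumed comparable across datasets)


def merge_hits(hits_by_dataset: dict[str, list[Hit]], top_k: int = 3) -> list[Hit]:
    """Return the `top_k` best hits overall, regardless of source dataset."""
    all_hits = (hit for hits in hits_by_dataset.values() for hit in hits)
    return heapq.nlargest(top_k, all_hits, key=lambda h: h.score)
```

Under this model a dataset with weak matches contributes nothing to the context, so binding an extra dataset does not dilute answers unless its chunks actually outrank the others.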

Data sources

In addition to manual uploads, you can configure automated data sources:

  • Web scraping via Firecrawl for crawling websites
  • ONISEP XML integration for structured education data
  • Custom uploads via the RagFlow API

Updating datasets

When you update documents in a dataset, the changes take effect immediately for all agents bound to that dataset, with no republish required. RagFlow updates the retrieval index in real time.