An AI chatbot without your data is just a general-purpose assistant. It can hold a conversation, but it can't answer questions about your products, your policies, your pricing, or your processes. That's where training on custom data changes everything.
In LoopReply, "training" doesn't mean fine-tuning a language model or writing training datasets. It means feeding your knowledge base with your own content — documents, web pages, databases, spreadsheets — and letting the AI reference that content when answering questions. This technique is called Retrieval-Augmented Generation (RAG), and it's the most practical way to make a chatbot actually useful for your business.
This guide walks you through every data source LoopReply supports, how to set each one up, and how to tune your knowledge base for accuracy and freshness. No machine learning expertise required.
Table of Contents
- How RAG Works (In Plain English)
- Step 1: Navigate to Your Knowledge Base
- Step 2: Add Your Data Sources
- Step 3: Choose Your Embedding Model
- Step 4: Configure Auto-Refresh
- Step 5: Test Retrieval Quality
- Step 6: Connect Knowledge Base to Your Workflow
- Optimizing for Accuracy
- Troubleshooting Poor Answers
- Frequently Asked Questions
- Next Steps
How RAG Works (In Plain English)
Before diving into setup, it helps to understand what's happening behind the scenes — no computer science degree needed.
RAG works in three steps:
- Indexing — When you upload a document or connect a data source, LoopReply breaks the content into small chunks and converts each chunk into a mathematical representation called an "embedding." These embeddings are stored in a vector database (Pinecone) where they can be searched quickly.
- Retrieval — When a visitor asks a question, their question is also converted into an embedding. The system then finds the chunks in your knowledge base that are most similar to the question — like finding the pages in a book that are most relevant to a specific query.
- Generation — The retrieved chunks are passed to the AI model as context, along with the visitor's question. The AI generates a response that's grounded in your actual content, not just its general knowledge.
The result: your chatbot answers questions using your data, with dramatically fewer hallucinations and much higher accuracy on company-specific topics.
Think of it like giving an employee your company handbook before their first day. They still use their own judgment and communication skills (the AI model), but their answers are based on your actual information (the knowledge base).
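The three steps above can be sketched in miniature. This toy example uses simple word overlap in place of real neural embeddings and a vector database, so it only illustrates the shape of the pipeline — the sample chunks and question are made up:

```python
# Toy sketch of the three RAG steps. Word overlap stands in for real
# neural embeddings; LoopReply's actual pipeline uses learned embedding
# models and a vector database.
import re

def embed(text):
    # Indexing analogue: reduce text to a set of lowercase words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    # Jaccard overlap stands in for cosine similarity on embeddings.
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(question, chunks, top_k=2):
    # Retrieval: rank indexed chunks by similarity to the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "Our return policy allows returns within 30 days of purchase.",
    "The Pro plan costs $49 per month.",
    "Support is available by email and live chat.",
]

# Generation: the retrieved chunks plus the question would be sent to the
# AI model as grounding context.
context = retrieve("What is your return policy?", chunks)
print(context[0])  # Our return policy allows returns within 30 days of purchase.
```

The retrieval step never generates text itself; it only selects which of your chunks the model gets to see.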
Step 1: Navigate to Your Knowledge Base
From your LoopReply dashboard:
- Click on the bot you want to train.
- Navigate to the Knowledge Base tab.
- You'll see the knowledge base manager — an empty canvas ready for your data sources.
Each bot has its own knowledge base, so the data you add here is specific to this bot's domain. If you have multiple bots, each can have different knowledge sources tailored to its purpose.
Step 2: Add Your Data Sources
Click Add Source to start connecting your data. LoopReply supports six source types, each suited for different kinds of content.
PDF Documents
Best for: Product manuals, policy documents, whitepapers, contracts, research papers, compliance documentation.
- Click Add Source → PDF Upload.
- Drag and drop your PDF files or click to browse. You can upload multiple files at once.
- LoopReply extracts the text content and handles formatting, tables, and even scanned documents (via OCR).
- Click Process to start indexing.
Tips:
- Clean, text-based PDFs index faster and more accurately than scanned images.
- If your PDF has complex tables, consider also uploading the data as a CSV for better structured retrieval.
- Large PDFs (100+ pages) work fine but take a few minutes to process.
Excel and CSV Files
Best for: Product catalogs, pricing tables, FAQ lists, feature comparison matrices, inventory data, customer data.
- Click Add Source → Excel/CSV Upload.
- Upload your `.xlsx`, `.xls`, or `.csv` file.
- LoopReply parses the spreadsheet and indexes each row as a searchable chunk.
- For multi-sheet Excel files, all sheets are processed.
Tips:
- Use clear column headers — they help the AI understand the structure of your data.
- Keep data clean: remove empty rows, unmerge cells that span multiple rows, and use consistent formatting.
- Spreadsheets with 10,000+ rows work, but consider whether all that data is relevant to chatbot conversations.
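To see why clear column headers matter, here is one plausible way a row could be flattened into a searchable chunk — the "Header: value" format and sample data are illustrative, not LoopReply's exact internal format:

```python
# Sketch: each spreadsheet row becomes one text chunk of "header: value"
# pairs, so the headers carry meaning into the search index.
import csv
import io

data = """product,price,availability
Widget A,$19,In stock
Widget B,$29,Backordered
"""

rows = csv.DictReader(io.StringIO(data))
chunks = [" | ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]
print(chunks[0])  # product: Widget A | price: $19 | availability: In stock
```

A vague header like `col3` would end up verbatim in the chunk, giving the retriever and the AI nothing to match against.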
Web URLs
Best for: Help center articles, blog posts, product pages, documentation sites, landing pages, any public web content.
- Click Add Source → Web URL.
- Enter the URL you want to index. You can enter individual pages or a sitemap URL.
- If you enter a top-level URL (like `https://docs.yoursite.com`), LoopReply will offer to crawl linked pages within the same domain.
- Set the crawl depth — how many levels deep to follow links (1-3 is typical).
- Click Start Crawl.
Tips:
- Start with your most important pages: FAQ, pricing, documentation, and product descriptions.
- For large sites (1,000+ pages), use specific section URLs rather than the root domain to keep the knowledge base focused.
- Web content changes frequently — set up auto-refresh (covered in Step 4) to keep your knowledge base current.
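"Crawl depth" is just a breadth-first walk over same-domain links that stops after N levels. This sketch hard-codes a link map for illustration; a real crawler would fetch each page and parse its links:

```python
# Sketch of crawl depth: breadth-first traversal of a site's link graph,
# stopping after the configured number of levels. The link map below is
# made up for illustration.
from collections import deque

links = {
    "/docs": ["/docs/setup", "/docs/api"],
    "/docs/setup": ["/docs/setup/install"],
    "/docs/api": [],
    "/docs/setup/install": ["/docs/setup/install/advanced"],
}

def crawl(start, depth):
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # don't follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return sorted(seen)

print(crawl("/docs", depth=2))
# ['/docs', '/docs/api', '/docs/setup', '/docs/setup/install']
```

At depth 2, the depth-3 page (`/docs/setup/install/advanced`) is never indexed — which is exactly why a shallow depth keeps a large site's crawl focused.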
SQL Databases
Best for: Dynamic data like order statuses, customer records, product availability, pricing that changes frequently, internal documentation stored in databases.
- Click Add Source → Database.
- Enter your database connection details: host, port, database name, username, password.
- LoopReply supports PostgreSQL, MySQL, and SQL Server.
- Write a `SELECT` query that pulls the data you want the chatbot to access. For example:

```sql
SELECT product_name, description, price, availability
FROM products
WHERE active = true
```

- Click Test Connection to verify, then Process to index the results.
Tips:
- Use read-only database credentials. The chatbot only needs to read data, never write it.
- Be selective with your query — don't dump entire tables. Pull only the columns and rows relevant to customer conversations.
- For sensitive data, make sure your query excludes PII or anything you don't want the chatbot referencing.
- Combine with auto-refresh to keep the chatbot's answers current as your database changes.
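The tips above — selective columns, filtered rows — boil down to turning query results into clean text chunks. This self-contained sketch uses SQLite so it runs anywhere; LoopReply itself connects to PostgreSQL, MySQL, or SQL Server, and the table and data here are invented for illustration:

```python
# Sketch: a selective SELECT becomes indexable text chunks. SQLite is
# used only so the demo is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(product_name TEXT, description TEXT, price REAL, availability TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        ("Widget A", "Entry-level widget", 19.0, "In stock", 1),
        ("Widget B", "Legacy model", 29.0, "Discontinued", 0),
    ],
)

# Pull only active rows and only customer-relevant columns -- never the
# whole table, and never PII.
rows = conn.execute(
    "SELECT product_name, description, price, availability "
    "FROM products WHERE active = 1"
).fetchall()

chunks = [f"{name}: {desc}. Price: ${price}. {avail}." for name, desc, price, avail in rows]
print(chunks)
```

Note that the inactive "Widget B" row never reaches the index — the filtering happens in the query, before indexing, not in the chatbot.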
Amazon S3 Buckets
Best for: Large document libraries, archived files, media assets with metadata, versioned documentation, regulatory filings.
- Click Add Source → S3 Bucket.
- Enter your AWS credentials: access key ID, secret access key, bucket name, and region.
- Optionally set a prefix to limit which files in the bucket are indexed (e.g., `docs/` to only index files in the docs folder).
- LoopReply will list the files in the bucket. Select the ones you want to process.
- Click Process to start indexing.
Tips:
- Supported file types in S3: PDF, DOCX, TXT, CSV, MD, HTML.
- Use IAM roles with minimal permissions — read-only access to the specific bucket or prefix.
- S3 is ideal for organizations with large document repositories that are already organized in cloud storage.
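To illustrate the minimal-permissions tip, an AWS IAM policy scoped to read-only access on a single prefix might look like the following — the bucket name and `docs/` prefix are placeholders for your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::your-bucket/docs/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket",
      "Condition": {"StringLike": {"s3:prefix": ["docs/*"]}}
    }
  ]
}
```

This grants exactly what indexing needs — listing the prefix and reading its objects — and nothing else.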
Markdown and HTML Files
Best for: Technical documentation, developer guides, internal wikis, static site content, README files.
- Click Add Source → File Upload.
- Upload your `.md` or `.html` files.
- LoopReply parses the content, preserving heading structure and formatting.
- Markdown files with clear heading hierarchies (`#`, `##`, `###`) are chunked by section, which improves retrieval accuracy.
Tips:
- If your documentation is in a Git repository, export the Markdown files and upload them. You can automate this with a CI/CD pipeline that pushes updated docs to LoopReply via API.
- HTML files are stripped of tags and indexed as clean text. JavaScript-rendered content won't be captured — use the Web URL source for SPAs instead.
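Heading-based chunking is simple to picture: split the document immediately before each heading line, so every section becomes one coherent, self-contained chunk. This sketch shows the idea; LoopReply's actual chunker may differ in details, and the sample document is invented:

```python
# Sketch of heading-based chunking: one chunk per Markdown section, so
# retrieval returns coherent passages instead of arbitrary text windows.
import re

doc = """# Returns
Items can be returned within 30 days.

## Exceptions
Clearance items are final sale.

# Shipping
Orders ship within 2 business days.
"""

# Split immediately before each line that starts with one or more '#'.
chunks = [c.strip() for c in re.split(r"\n(?=#+ )", doc) if c.strip()]
print(len(chunks))  # 3
```

A document with no headings would come out as a single chunk, which is why clear heading hierarchies improve retrieval.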
Step 3: Choose Your Embedding Model
After adding your data sources, you'll need to select an embedding model. This is the model that converts your text into the mathematical representations (embeddings) used for search.
LoopReply offers two options:
| Model | Best For | Characteristics |
|---|---|---|
| text-embedding-3-small | Most use cases, cost-effective | Fast processing, good accuracy, lower cost per token |
| text-embedding-3-large | Maximum accuracy, complex content | Higher dimensional embeddings, better for nuanced technical content |
Our recommendation: Start with text-embedding-3-small. It handles the vast majority of use cases excellently, processes faster, and costs less. Switch to text-embedding-3-large only if you notice the bot struggling with nuanced queries on technical or specialized content.
To change your embedding model, go to Knowledge Base → Settings → Embedding Model.
Note: Changing the embedding model requires re-indexing all your data sources. The existing index will be replaced. This process runs in the background and your bot continues working with the old index until the new one is ready.
Step 4: Configure Auto-Refresh
Your data isn't static. Product prices change, help articles get updated, inventory shifts. Auto-refresh keeps your knowledge base current by re-indexing data sources on a schedule.
Go to Knowledge Base → Settings → Auto-Refresh and choose an interval:
| Interval | Best For |
|---|---|
| 5 minutes | Rapidly changing data (inventory, pricing, status pages) |
| 15 minutes | Frequently updated content (support articles, feature docs) |
| 1 hour | Moderately changing content (blog posts, knowledge bases) |
| Daily | Mostly static content (policies, manuals, guides) |
You can set different refresh intervals for different data sources. For example, your product database might refresh every 15 minutes while your PDF policies refresh daily.
Tips:
- More frequent refreshes use more processing resources. Only use 5-minute intervals for data that genuinely changes that often.
- Web URL sources benefit most from auto-refresh since web content changes without you manually re-uploading anything.
- Database sources also benefit heavily — the chatbot always has the latest product availability, pricing, or order data.
Step 5: Test Retrieval Quality
Before connecting your knowledge base to your chatbot's workflow, test whether the retrieval is actually working well.
In the Knowledge Base tab, use the Test Search feature:
- Type a question a real customer might ask — for example, "What's your return policy?" or "How much does the Pro plan cost?"
- The system shows you the chunks it retrieved, ranked by relevance score.
- Review the results:
- Are the right chunks showing up? The most relevant content should be in the top 3 results.
- Is the content accurate? Make sure the retrieved text actually answers the question.
- Are there gaps? If a question returns no relevant results, you need to add more data covering that topic.
Run 10-15 test queries covering common customer questions. This gives you confidence that the knowledge base is well-configured before going live.
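If you want to make this check repeatable, the same idea works offline: keep a list of test questions paired with the chunk you expect to rank first, and fail loudly on any gap. The word-overlap scorer and sample chunks below are stand-ins for the real embedding search:

```python
# Sketch of a repeatable retrieval check: each test question must retrieve
# its expected chunk as the top result. Word overlap is only a stand-in
# for the real embedding search.
import re

chunks = [
    "Our return policy allows returns within 30 days of purchase.",
    "The Pro plan costs $49 per month.",
    "We ship worldwide within 5 business days.",
    "Support is available by email and live chat.",
]

def words(text):
    return set(re.findall(r"[a-z0-9$]+", text.lower()))

def best(question):
    # Return the chunk sharing the most words with the question.
    q = words(question)
    return max(chunks, key=lambda c: len(q & words(c)))

tests = {
    "What's your return policy?": chunks[0],
    "How much does the Pro plan cost?": chunks[1],
}

for question, expected in tests.items():
    assert best(question) == expected, f"retrieval gap for: {question}"
print("all retrieval tests passed")
```

Rerunning a harness like this after every content change catches regressions before your customers do.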
Step 6: Connect Knowledge Base to Your Workflow
The knowledge base doesn't do anything on its own — you need to connect it to your chatbot's conversation flow using the visual workflow builder.
- Open your bot's Workflow tab.
- Drag a Knowledge Search node onto the canvas.
- Position it before your AI Response node.
- Connect them: ... → Knowledge Search → AI Response → ...
When a conversation reaches the Knowledge Search node, it takes the visitor's latest message, searches your knowledge base, and passes the relevant chunks to the AI Response node as context. The AI then generates a response that's grounded in your data.
For a detailed walkthrough of building workflows, see our no-code chatbot building guide.
Optimizing for Accuracy
Once your knowledge base is running, here are proven ways to improve the quality of answers:
Write for the Chatbot, Not Just Humans
Your existing documentation might be optimized for humans browsing a help center. But the chatbot finds content through semantic search, not navigation menus. Consider:
- Use clear, descriptive headings. "Return Policy for Physical Products" is better than "Section 4.2".
- Front-load key information. Put the answer in the first sentence of each section, then elaborate.
- Be explicit. Instead of "Contact us for pricing", include actual pricing. The chatbot can only reference what's in the knowledge base.
Deduplicate Your Sources
If the same information exists in multiple documents, the knowledge base might retrieve conflicting versions. Audit your sources for duplicates and keep the most authoritative version.
Use Structured Data for Structured Questions
For questions like "What's the price of Plan X?" or "Is Product Y available in size Z?", structured data (spreadsheets, databases) often works better than unstructured documents. Chunking and retrieval are more precise with tabular data.
Create a Chatbot FAQ Document
Write a document specifically for the chatbot that covers your most common questions in a Q&A format. Each question-answer pair becomes a highly retrievable chunk. This is often the single most impactful thing you can do for answer quality.
Monitor and Iterate
Review conversations in your dashboard regularly. When the bot gives a wrong or incomplete answer, trace it back to the knowledge base:
- Was the relevant content in the knowledge base? If not, add it.
- Was the content retrieved but the AI misinterpreted it? Rewrite the content for clarity.
- Was irrelevant content retrieved instead? Improve your chunking or remove noisy data.
Troubleshooting Poor Answers
The bot says "I don't have information about that"
- The topic isn't in your knowledge base. Add content covering that topic.
- The content exists but isn't being retrieved. Test the search with the exact question. If the relevant chunk doesn't appear, the content might need to be reworded to be more semantically similar to how customers ask the question.
- The similarity threshold is too high. If you've customized the threshold, try lowering it slightly to allow more results through.
The bot gives outdated information
- Auto-refresh isn't configured, or the interval is too long for how often that source changes. Check your refresh settings.
- The source data itself is outdated. Re-upload the latest version of static files (PDFs, spreadsheets).
- Cached embeddings. After significant content changes, manually trigger a re-index from the Knowledge Base settings.
The bot mixes up information from different sources
- Too many overlapping sources. If you have the same topic covered in a PDF, a web page, and a spreadsheet, the retrieval might pull conflicting chunks. Consolidate to one authoritative source per topic.
- Chunks are too small. If the chunking splits related information across multiple chunks, the AI might get incomplete context. This is less common with default settings but can happen with very dense documents.
The bot hallucinates despite having a knowledge base
- The Knowledge Search node isn't in the workflow. Make sure you've added the node and connected it before the AI Response node.
- Low-quality source content. If your documents contain vague or ambiguous language, the AI might fill in gaps with incorrect information. Write clear, specific content.
- Model choice matters. Some models are better at staying grounded in provided context. GPT-5 and Claude Opus 4.6 are particularly strong at this.
Frequently Asked Questions
How much data can I add to the knowledge base?
The free tier supports a generous amount of content for small businesses. Pro and Scale plans support significantly more data. In practical terms, most businesses use between 50 and 500 documents or equivalent web pages. If you have enterprise-scale documentation needs, contact us about custom limits.
Does training the chatbot require machine learning knowledge?
Not at all. "Training" in LoopReply means uploading your documents and connecting data sources. There's no model training, no dataset preparation, no hyperparameter tuning. The RAG system handles everything automatically. For more context on how AI chatbots work, see our guide on what is an AI chatbot.
How long does indexing take?
It depends on the volume: a few PDFs process in seconds, a website crawl of 100 pages takes a few minutes, and a large database query might take 5-10 minutes. Indexing happens in the background — your bot continues working while new data is being processed.
Can I remove or update specific data sources?
Yes. In the Knowledge Base manager, you can delete any source, re-upload updated files, or re-crawl URLs. Removing a source also removes its embeddings from the vector store, so the bot immediately stops referencing that content.
Is my data secure?
Your data is encrypted at rest (AES-256) and in transit (TLS 1.3). Embeddings are stored in isolated Pinecone namespaces per bot. Your data is never used to train AI models and is never accessible to other customers. LoopReply is SOC 2 compliant and HIPAA-ready.
Can I use multiple data sources together?
Yes, and this is encouraged. Most effective setups combine several source types — for example, web URLs for help articles, PDFs for product manuals, a database for real-time pricing, and a CSV for FAQ content. The Knowledge Search node searches across all sources simultaneously and returns the most relevant results regardless of source type.
What file formats are supported?
LoopReply supports PDF, DOCX, XLSX, XLS, CSV, Markdown (.md), and HTML (.html) for direct file uploads. For connected sources, you can use web URLs (any public page), SQL databases (PostgreSQL, MySQL, SQL Server), and Amazon S3 buckets (which can contain any of the supported file formats).
Next Steps
Your chatbot now has the knowledge it needs to give accurate, company-specific answers. Here's what to do next:
- Build a complete workflow — If you haven't already, create a conversation flow that uses the Knowledge Search node effectively. Follow our no-code chatbot building tutorial.
- Set up human handover — Configure escalation for questions the bot can't answer. See our human handover best practices guide.
- Install the widget — Get the chatbot live on your website. Our installation guide covers every platform.
- Create a chatbot FAQ document — Write a dedicated Q&A document covering your 50 most common questions. This single addition often improves answer quality by 30-40%.
- Set a review cadence — Check your conversations weekly, identify gaps, and add content to fill them. The best knowledge bases are continuously maintained.
The more relevant data your chatbot has access to, the more valuable it becomes. Start with your most critical content, get it live, and expand from there.
