How to create a RAG engine with a local LLM using Ollama

Retrieval-Augmented Generation (RAG) is a powerful technique for making Large Language Models (LLMs) more effective, accurate, and context-aware by enhancing their responses with external knowledge sources. It combines retrieval (finding relevant text in documents) with generation (LLM text creation), enabling models to respond with up-to-date, accurate information that was not contained in their original training data.

What is Ollama:

Ollama is an open-source platform designed to simplify the process of running and managing large language models (LLMs) directly on your local machine. It enables developers, researchers, and enthusiasts to deploy AI models without relying on cloud services, thereby enhancing data privacy, reducing latency, and offering greater control over AI workflows.

Benefits of Using Ollama

  • Enhanced Privacy: By running models locally, users maintain full ownership of their data, avoiding the security risks that come with sending it to cloud-based services.

  • Offline Capability: Ollama enables AI model usage without an internet connection, making it ideal for environments with limited or no connectivity.

  • Cost Efficiency: Operating models locally reduces dependency on cloud infrastructure, potentially lowering operational costs.

  • Customization and Flexibility: Users can fine-tune models, adjust parameters, and tailor AI functionalities to meet specific project requirements.

How to set up Ollama:

To get started with Ollama, the first step is to ensure your system meets the basic requirements. Ollama supports macOS, Linux, and Windows (currently in preview). It runs efficiently on machines with at least 8GB of RAM, though 16GB or more is recommended for handling larger language models. One of the biggest advantages of Ollama is that it doesn't require a GPU, making it accessible for a wide range of users who want to run LLMs locally.

Installation is straightforward. On macOS, you can use Homebrew to install Ollama with a single command. For Linux users, Ollama provides a shell script that automates the installation process. Windows users can download a preview installer directly from Ollama’s official website. Once installed, running Ollama is as simple as entering a command like ollama run llama3, which will download and launch the Llama 3 model on your local system. This process initializes a local server that lets you interact with the model in real time, without relying on any cloud services.

After setup, Ollama automatically runs a local API server at http://localhost:11434, enabling developers to connect it to custom applications using standard HTTP requests. For those who want to go a step further, Ollama also offers support for creating custom models using its Modelfile system. This allows users to fine-tune or personalize models by defining their own prompts and behaviors. Managing models is simple, too—you can pull, list, or remove models using the Ollama command-line interface.
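As a quick illustration, the sketch below calls this local API from Python using the requests library. It assumes Ollama is running with its default settings and that a model such as llama3 has already been pulled; the model name and prompt are placeholders, not part of the RAG pipeline built later in this article.

import requests

# Ask a locally running Ollama model a question over its HTTP API.
# Assumes the default port (11434) and an already-pulled model.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                        # any model you have pulled locally
        "prompt": "Explain RAG in one sentence.",
        "stream": False,                          # return one complete JSON object
    },
    timeout=120,
)
print(response.json()["response"])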

In essence, Ollama brings powerful LLM capabilities directly to your device, offering a private, offline, and customizable experience that’s ideal for developers, researchers, and enthusiasts alike. Whether you're building AI applications or simply exploring large language models, Ollama offers a lightweight and efficient environment to do so locally.

Building a Local RAG Chatbot with Ollama and a PDF Survival Guide

Retrieval-Augmented Generation (RAG) is a powerful architecture that enhances the capabilities of language models by grounding their responses in real-world documents. Instead of relying purely on what the model was trained on, RAG systems actively retrieve relevant information from a knowledge base at the time of answering a user’s query. With tools like Ollama, it's now possible to implement such systems entirely on your local machine, ensuring privacy, speed, and complete control over the models being used. In this setup, we’ll be using two models locally: qwen3:8b-q4_K_M as the large language model for generation, and nomic-embed-text:137m-v1.5-fp16 for creating embeddings that power the retrieval process.

The core use case here is a chatbot that can answer questions from a PDF survival guide—a document that might contain structured instructions, emergency scenarios, lists, and essential tips. The goal is to allow users to query the document conversationally, getting clear and accurate answers grounded in the actual content of the guide.

The process begins with setting up Ollama, a tool that simplifies the installation and management of open-source LLMs. With one command per model, both the language model and the embedding model can be downloaded and run locally. For example, ollama pull qwen3:8b-q4_K_M and ollama pull nomic-embed-text:137m-v1.5-fp16 will make these models available on your system. These models don't require an internet connection after installation, and they run entirely on CPU or, optionally, on GPU, depending on your setup.

Once the models are in place, the PDF survival guide is processed. The first step is to extract text from the PDF. This is typically done using Python libraries such as PyMuPDF, pdfminer.six, or pdfplumber, which convert the document into plain text while preserving sections and formatting as much as possible. The extracted text is then chunked, i.e. split into smaller segments such as paragraphs or few-sentence blocks. This is a critical step, as the embedding model works best when operating on concise, self-contained pieces of information.
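A minimal sketch of this step is shown below, assuming PyMuPDF is installed (imported as fitz) and the guide is saved locally as survival_guide.pdf. The file name, the chunk size, and the helper names are illustrative, and the paragraph-based splitter is just one simple strategy.

import fitz  # PyMuPDF

def load_pdf_text(path: str) -> str:
    """Extract plain text from every page of the PDF."""
    doc = fitz.open(path)
    return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

chunks = chunk_text(load_pdf_text("survival_guide.pdf"))
print(f"Extracted {len(chunks)} chunks")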

Each chunk is then passed through the embedding model (nomic-embed-text:137m-v1.5-fp16), which transforms it into a numerical vector, a mathematical representation of the chunk’s meaning. These vectors are stored in a vector database such as FAISS or Chroma, which allows for fast similarity search. When a user types a query like “How to signal for rescue in the wild?”, the same embedding model converts this question into a query vector. The system then compares this vector against the stored document vectors to find the most relevant chunks from the survival guide.
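The sketch below shows one way to wire this up, assuming the chunks list from the previous snippet, the faiss-cpu and numpy packages, and Ollama’s /api/embeddings endpoint; the helper name and the choice of retrieving three chunks are illustrative.

import faiss
import numpy as np
import requests

EMBED_MODEL = "nomic-embed-text:137m-v1.5-fp16"

def embed(text: str) -> list[float]:
    """Request an embedding vector for one piece of text from the local Ollama server."""
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text},
        timeout=60,
    )
    return r.json()["embedding"]

# Embed every chunk once and store the vectors in an exact (brute-force) FAISS index.
vectors = np.array([embed(c) for c in chunks], dtype="float32")
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# At query time, embed the question and fetch the closest chunks.
question = "How to signal for rescue in the wild?"
query_vec = np.array([embed(question)], dtype="float32")
_, ids = index.search(query_vec, 3)  # retrieve the 3 most similar chunks
retrieved = [chunks[i] for i in ids[0]]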

The selected chunks are concatenated and prepended as context to the user’s question and sent to the LLM, in this case qwen3:8b-q4_K_M. This model, now grounded with relevant knowledge, generates a response that is not only fluent and intelligent but also accurate and traceable to the document source. Importantly, all of this happens offline and locally, without any dependence on cloud APIs or remote servers.
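Continuing the sketch above, the retrieved chunks can be stitched into a prompt and sent to qwen3:8b-q4_K_M through the same generate endpoint; the prompt template below is an assumption, not a required format.

import requests

LLM_MODEL = "qwen3:8b-q4_K_M"

# Build a context-grounded prompt from the retrieved chunks and the user's question.
context = "\n\n".join(retrieved)
prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": LLM_MODEL, "prompt": prompt, "stream": False},
    timeout=300,
)
print(r.json()["response"])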

Running a RAG pipeline locally with Ollama also enables customization. For instance, you can define your own system prompts, adjust how many document chunks are retrieved, or modify the way documents are split and cleaned. This makes it an excellent choice for developers building private AI assistants for enterprise knowledge bases, educational materials, or any sensitive documentation.
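For example, a few of these knobs might look like the following; the system prompt wording, the value of TOP_K, and the temperature are arbitrary examples rather than recommended settings.

TOP_K = 5  # how many document chunks to retrieve for each question

SYSTEM_PROMPT = (
    "You are a survival-guide assistant. Answer concisely and, when possible, "
    "mention which part of the guide your answer comes from."
)

# The same /api/generate endpoint accepts a system prompt and model options.
payload = {
    "model": "qwen3:8b-q4_K_M",
    "system": SYSTEM_PROMPT,
    "prompt": prompt,                     # built as in the previous snippet
    "stream": False,
    "options": {"temperature": 0.2},      # lower temperature for more literal answers
}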

In summary, by combining Ollama’s local model management with embedding and retrieval logic, you can build a robust RAG chatbot that runs entirely offline. Using qwen3:8b-q4_K_M for answering and nomic-embed-text:137m-v1.5-fp16 for understanding content, the system can transform any PDF—like a survival guide—into a dynamic, interactive, and intelligent assistant. Whether you’re building internal tools or just exploring the capabilities of local AI, this approach offers a powerful, private, and production-ready solution.

To learn more about RAG, please read our article at the link below:

https://www.lotuslabs.ai/post/retrieve-augment-and-generate-rag-a-technique-to-hack-llms-to-your-command

To work with us on these and other AI use cases, connect with us at

https://www.lotuslabs.ai/

To work on computer vision use cases, get to know our product Padmé

https://www.padme.ai/
