How to create a local RAG-enabled LLM server that provides safe access to your documents

This tutorial explains how to set up a headless RAG-enabled large language model (LLM) on an Ubuntu server. By the end, you will be able to chat with your LLM running locally on your own infrastructure. Your conversations will not be sent to third-party servers, ensuring that your prompts, responses, and any uploaded files remain private and under your control.

A quick word on security

Effective network security begins with a strong perimeter defence. One of the most critical steps is configuring a firewall on your router and hosts to explicitly deny all unsolicited inbound traffic from the internet. This “default deny” posture ensures that no internal services are accidentally exposed, creating a controlled environment where all communication is initiated from within your trusted network.

A common and potentially dangerous misconception is that Network Address Translation (NAT) alone provides sufficient security. While NAT obscures internal devices, it is a routing mechanism rather than a security control. Meaningful protection comes only from a deliberately configured firewall. Relying solely on NAT can leave your network vulnerable to determined attacks that bypass this superficial barrier.

Index

This tutorial is divided into the following sections:

  1. The Ubuntu Server
  2. (Optional) Python scripts
  3. Install and configure Ollama
  4. Choosing your model
  5. Install Docker
  6. Install and configure Open WebUI
  7. Install and configure Chroma DB
  8. Reconfigure Open WebUI to use the Chroma DB vector database
  9. Create a Workspace Knowledge Collection with your RAG content
  10. systemctl Cheat Sheet
  11. ufw Cheat Sheet
  12. Ollama Cheat Sheet
  13. Docker Cheat Sheet
  14. (Optional) Working with Chroma DB

Each topic is substantial, and this tutorial focuses on the essentials required to achieve the desired outcome. Along the way, we validate progress at each stage. You are encouraged to comment on and extend this work to help others. If you find it useful, please like, share, and subscribe.

 

1. The Ubuntu Server

All examples in this guide assume a CPU-only setup. If you have a GPU available—either on bare metal or passed through to a virtual machine—you will need to adjust parameters accordingly to take advantage of that hardware.

The server used in this tutorial is a virtual machine with the following specifications:

  • 32 GB RAM
  • 128 GB disk
  • 16 CPU cores
  • No GPU
  • OS: Ubuntu 24.04.3 LTS

During installation, the OpenSSH server was installed (this can also be done post-install). All system packages are fully up to date.

The firewall is enabled, with SSH access allowed on port 22 from the local LAN only.

sudo ufw allow from 192.168.16.0/24 to any port 22

The server has a static IP address of 192.168.16.7.

 

2. (Optional) Python scripts

Throughout the tutorial, optional Python scripts are provided. These are not required to follow the core setup steps, but they may be useful if you intend to build custom applications or integrations around your locally hosted LLM.

All scripts were tested using Visual Studio Code. You should create a Python virtual environment. A requirements.txt file is provided alongside each example to document the required dependencies.

All code has been tested. If you encounter problems when installing from the requirements.txt file, the most likely cause is an incompatibility with your Python version. Watch this video to understand how to manage multiple Python versions: https://youtu.be/v5hipvLeo2E

 

3. Install and configure Ollama

Ollama website: https://ollama.com/

Download and install Ollama as described in the official documentation:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and configures it as a systemd service (ollama.service).

Verify the installation

ollama --version

Check the Ollama service status

systemctl status ollama.service

or

systemctl is-active ollama.service

Test Ollama locally

curl -w "\n" http://127.0.0.1:11434/api/tags

As no models are installed, the output should be:

{"models":[]}

Configure Ollama

By default, Ollama listens on 127.0.0.1:11434. To allow connections from other machines on the local network, the service configuration must be modified.

sudo systemctl edit ollama.service

Add the following:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=-1"
  • OLLAMA_KEEP_ALIVE=-1 keeps loaded models in memory indefinitely.
  • OLLAMA_HOST=0.0.0.0:11434 allows connections from other hosts on the LAN.

Restart the service:

sudo systemctl restart ollama.service
systemctl status ollama.service

Enable the firewall

sudo ufw enable

(Optional) Disable IPv6

sudo nano /etc/default/ufw

Change IPV6=yes to IPV6=no.

Allow access to Ollama

sudo ufw allow 11434

Verify firewall rules

sudo ufw status

Test from a browser

Navigate to:

http://192.168.16.7:11434/api/tags

You should see the same JSON response indicating that no models are installed.

4. Choosing your model

Ollama provides access to a wide range of open-source LLMs, including families such as Llama (Meta), Mistral, and Gemma (Google). These models vary significantly in size, from compact 3B parameter models to very large 70B+ parameter models. Model size directly impacts hardware requirements and performance.

Smaller models are faster, require less memory, and can run acceptably on CPU-only systems, making them suitable for laptops, desktops, or lightweight servers. Larger models offer improved reasoning and language capabilities but require significantly more memory and typically benefit from a GPU with ample VRAM.

Selecting a model is a balance between task complexity and available hardware. Models also differ by specialization: some are optimized for general conversation, others for coding, mathematics, or domain-specific tasks. Quantized variants (for example, Q4 or Q5) reduce memory usage at some cost to accuracy.
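As a rough rule of thumb, a model's weight footprint is approximately its parameter count multiplied by the bytes per parameter of the chosen quantization (about 0.5 bytes for Q4), plus runtime overhead for the context window and KV cache. The sketch below only illustrates this back-of-the-envelope arithmetic; treat the numbers as estimates, not exact requirements.

# Rough estimate of model weight memory for different quantization levels.
# This ignores the KV cache and runtime overhead, so real usage will be higher.
def approx_weight_gib(params_billion: float, bits_per_param: float) -> float:
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / (1024 ** 3)

for label, bits in [("FP16", 16), ("Q5", 5), ("Q4", 4)]:
    print(f"4B parameters at {label}: ~{approx_weight_gib(4, bits):.1f} GiB")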

For experimentation or CPU-only systems with limited memory, a 3–4B model is a practical starting point. In this tutorial, we use Google’s gemma3:4b model.

Download the model

ollama pull gemma3:4b

List installed and running models:

ollama list # lists installed models
ollama ps # lists running models

Preload the model into memory

curl http://localhost:11434/api/generate -d '{"model": "gemma3:4b", "keep_alive": -1}'

At this point, the model is loaded and ready to accept requests.

(Optional) Chat with your model

The code below demonstrates how to interact with your model programmatically. This approach allows you to embed AI capabilities directly into custom applications or services.

requirements.txt

ollama>=0.6.1

Python Code: ollama_query_client.py

from ollama import Client

HOST = "http://192.168.16.7:11434"  # default port 11434
MODEL = "gemma3:4b"                 # change to your pulled model
PROMPT = "Why is the sky blue?"

client = Client(host=HOST)

# Chat-style request
resp = client.chat(model=MODEL, messages=[{"role": "user", "content": PROMPT}], stream=False)
print(resp["message"]["content"])

# OR a simple generate request
resp2 = client.generate(model=MODEL, prompt=PROMPT, stream=False)
print(resp2["response"])
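The calls above return the complete reply in a single response. The ollama package also supports streaming, which yields the reply in chunks as it is generated; a minimal sketch reusing the same HOST and MODEL values as above:

# Streaming variant: print tokens as the model generates them.
from ollama import Client

client = Client(host="http://192.168.16.7:11434")
stream = client.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()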
 

5. Install Docker

Docker website: https://www.docker.com/

We will use Open WebUI and Chroma DB, both of which run inside Docker containers. As a result, Docker must be installed on the host system.

Set up Docker’s apt repository.

# Add Docker's official GPG key:
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update

Install the latest version of Docker

sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Verify the Docker installation:

docker --version
 

6. Install and Configure Open WebUI

Run the Open WebUI container

sudo docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e CORS_ALLOW_ORIGIN="http://localhost:3000;http://192.168.16.7:3000" \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:latest # add --gpus=all for GPU support

Notes:

  • **CORS_ALLOW_ORIGIN** restricts web access to the specified origins and prevents CORS-related errors.
  • **-v ollama:/root/.ollama** persists Ollama models and data in a Docker volume named ollama.
  • **-v open-webui:/app/backend/data** persists Open WebUI application data.
  • **--restart always** ensures the container restarts automatically after crashes or system reboots.

These settings ensure that application state and models are preserved across restarts.

sudo docker ps

Verify Open WebUI responsiveness

curl http://localhost:3000

You should see the HTML output for the Open WebUI web interface.
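If you prefer to script the same check, here is a minimal Python sketch using the requests package (run it on the server itself, since the firewall port for LAN access is only opened in the next step):

# Minimal reachability check: HTTP 200 means Open WebUI is serving its interface.
import requests

resp = requests.get("http://localhost:3000", timeout=5)
print("Open WebUI reachable:", resp.status_code == 200)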

Allow LAN access through the firewall

sudo ufw allow 3000/tcp

Log in to Open WebUI

Navigate to http://192.168.16.7:3000 in your browser. The Open WebUI login page should appear.

Create an administrator account.

Open WebUI should automatically detect the local Ollama instance. If only one model is installed (for example, gemma3), it will be selected as the default. If multiple models are available, you can choose a default model from the settings.

Open WebUI ships with reasonable default settings. Depending on the model in use, these may be tuned for improved performance or accuracy. For example, when using gemma3, increasing the Context Length to more than 8192 tokens helps prevent RAG truncation.

(Optional) Python API Example

The following Python script demonstrates how to query Open WebUI using its API. In production environments, API keys should be issued to users with a User role rather than an Admin role.

Enable API access in Open WebUI

  1. In the admin panel, open Settings → General and enable API Keys. Save the changes.
  2. Navigate to Users → Groups and create a group (for example, Python). Enable API access for this group.
  3. Add users to the group.
  4. Each user must log in and generate an API key under Settings → Account.

Requirements

requests>=2.32.4

Example Python client: ollama_query_client.py

import requests

API_KEY = "sk-2a3*****"  # paste the API key you generated
API_URL = "http://192.168.16.7:3000/api/chat/completions"  # your Open WebUI URL
MODEL_NAME = "gemma3:4b"  # your model name (as shown in Open WebUI or /api/models)

# Prepare the authorization header with the Bearer token
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Define your prompt message and target model
prompt = "Why is the sun yellow?"
payload = {
    "model": MODEL_NAME,
    "messages": [
        {"role": "user", "content": prompt}
    ]
}

# Send the POST request to the chat completions endpoint
response = requests.post(API_URL, headers=headers, json=payload)

# Print the assistant's reply (the response is in JSON format)
result = response.json()
print(result)
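The script above prints the raw JSON response. Open WebUI's /api/chat/completions endpoint follows the OpenAI chat-completions response shape, so the reply text itself should be available under choices[0].message.content; a hedged continuation you can append after result = response.json():

# Extract just the assistant's reply (assumes an OpenAI-style response shape).
choices = result.get("choices", [])
if choices:
    print(choices[0]["message"]["content"])
else:
    print("Unexpected response shape:", result)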
 

7. Install and configure Chroma DB

Chroma DB website: https://www.trychroma.com/

Pull the latest Chroma DB image and start the container:

sudo docker pull chromadb/chroma:latest
sudo docker run -d \
  -p 8000:8000 \
  -v /llm/pdf_index/chroma_db:/data \
  -e IS_PERSISTENT=TRUE \
  -e PERSIST_DIRECTORY=/data \
  -e ALLOW_RESET=TRUE \
  -e ANONYMIZED_TELEMETRY=FALSE \
  --name chromadb_server \
  --restart always \
  chromadb/chroma:latest

Configuration notes

  • **IS_PERSISTENT=TRUE** enables disk-backed storage instead of in-memory operation.
  • **PERSIST_DIRECTORY=/data** defines the internal storage path, mapped to /llm/pdf_index/chroma_db on the host.
  • **ALLOW_RESET=TRUE** permits database resets via the API. This should typically be FALSE in production.
  • **ANONYMIZED_TELEMETRY=FALSE** disables usage telemetry.

Allow firewall access

sudo ufw allow 8000

Verify Chroma DB

Open the following URLs in a browser:

http://192.168.16.7:8000/api/v2/heartbeat
http://192.168.16.7:8000/docs
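The same check can be made from Python with the chromadb client package (a minimal sketch, assuming chromadb is installed in your virtual environment):

# Minimal sketch: confirm the Chroma DB server responds over HTTP.
import chromadb

client = chromadb.HttpClient(host="192.168.16.7", port=8000)
print("Chroma heartbeat (ns):", client.heartbeat())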

 

8. Reconfigure Open WebUI to use Chroma DB

Stop and remove the existing Open WebUI container:

sudo docker stop open-webui
sudo docker rm open-webui

Recreate the container with Chroma DB configuration:

sudo docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e VECTOR_DB=chroma \
  -e CHROMA_HTTP_HOST=host.docker.internal \
  -e CHROMA_HTTP_PORT=8000 \
  -e CORS_ALLOW_ORIGIN="http://localhost:3000;http://192.168.16.7:3000" \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:latest

Monitor startup logs:

sudo docker logs open-webui -f
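After you upload documents to a knowledge collection in the next section, the corresponding collections should appear in the external Chroma DB. Here is a minimal sketch to list them from Python, mirroring the client usage in section 14:

# List the collections that Open WebUI has created in the external Chroma DB.
import chromadb

client = chromadb.HttpClient(host="192.168.16.7", port=8000)
for collection in client.list_collections():
    print(collection.name)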

 

9. Create a Workspace Knowledge Collection with your RAG content

In Open WebUI, select Workspace from the sidebar, then choose Knowledge.

Create a new knowledge collection by providing a name and description:

  • Click the Workspace tab in the left-hand navigation sidebar.
  • Select Knowledge from the menu.
  • Name the collection, add a description, and save it.
  • Upload the files you want to add to your collection.

Test

After processing completes and Chroma DB is updated, start a new chat and attach the knowledge collection to it so that the model answers using your documents.

(Optional) Python solution

The Python script below lists all Knowledge collections, prompts you to select one, and then asks for a prompt to run against the selected RAG collection.

requirements.txt

requests>=2.32.4

Python Code: ollama_open-webui_ws.py

import requests
import json

API_KEY = "sk-2a3*****"  # paste the API key you generated
BASE_URL = "http://192.168.16.7:3000"  # e.g. your Open WebUI URL
KNOWLEDGE_URL = f"{BASE_URL}/api/v1/knowledge/"
CHAT_COMPLETIONS_URL = f"{BASE_URL}/api/chat/completions"
MODEL_NAME = "gemma3:4b"  # e.g. your model name (as shown in Open WebUI or /api/models)

def get_headers():
    """
    Returns the authorization headers for API requests.
    """
    return {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

def list_knowledge_bases():
    """
    Lists all available knowledge bases from Open WebUI and returns them.
    """
    try:
        response = requests.get(KNOWLEDGE_URL, headers=get_headers())
        if response.status_code == 200:
            knowledge_bases = response.json()
            items = knowledge_bases.get('items', [])
            if not items:
                print("No knowledge bases found.")
                return None
            print(f"\n{'#':<3} | {'Id':<40} | {'Name':<20} | {'Description'}")
            print("-" * 110)
            for i, kb in enumerate(items, 1):
                id = kb.get('id', 'N/A')
                name = kb.get('name', 'Unknown')
                description = kb.get('description', '') or "No description"
                print(f"{i:<3} | {id:<40} | {name:<20} | {description}")
            return items
        elif response.status_code == 401:
            print("Error: Unauthorized. Please check your API Key.")
        else:
            print(f"Error: Received status code {response.status_code}")
            print("Response:", response.text)
        return None
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to Open WebUI at {BASE_URL}. Is the server running?")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return None

def select_knowledge_base(knowledge_bases):
    """
    Prompts the user to select a knowledge base from the list.
    """
    while True:
        try:
            selection = int(input(f"\nSelect a knowledge base (1-{len(knowledge_bases)}): "))
            if 1 <= selection <= len(knowledge_bases):
                selected_kb = knowledge_bases[selection - 1]
                print(f"\nSelected: {selected_kb.get('id', None)}")
                return selected_kb
            else:
                print(f"Invalid selection. Please enter a number between 1 and {len(knowledge_bases)}.")
        except ValueError:
            print("Invalid input. Please enter a valid number.")

def query_knowledge_base(kb_id, prompt):
    """
    Executes a query against the specified knowledge base.
    """
    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "files": [
            {"type": "collection", "id": kb_id}
        ]
    }
    try:
        response = requests.post(CHAT_COMPLETIONS_URL, headers=get_headers(), json=payload)
        if response.status_code == 200:
            result = response.json()
            print("\nQuery Result:")
            print(json.dumps(result, indent=2))
            return result
        else:
            print(f"Error: Received status code {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to Open WebUI at {BASE_URL}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return None

if __name__ == "__main__":
    print("Connecting to Open WebUI...")

    # List and get knowledge bases
    knowledge_bases = list_knowledge_bases()
    if not knowledge_bases:
        exit()

    # Let user select a knowledge base
    selected_kb = select_knowledge_base(knowledge_bases)
    if not selected_kb:
        exit()

    # Prompt for query
    user_prompt = input("\nEnter your prompt: ")
    if not user_prompt.strip():
        print("No prompt provided.")
        exit()

    # Execute query
    print("\nProcessing query...")
    kb_id = selected_kb.get('id')
    query_knowledge_base(kb_id, user_prompt)

 

10. systemctl Cheat Sheet

| Command | Description |
| --- | --- |
| systemctl start <unit> | Starts (activates) a unit (e.g., service, socket). |
| systemctl stop <unit> | Stops (deactivates) a unit. |
| systemctl restart <unit> | Stops and then starts a unit. |
| systemctl reload <unit> | Reloads the configuration of a unit without restarting it (if supported). |
| systemctl status <unit> | Shows the current status of a unit (active/inactive, recent logs, etc.). |
| systemctl enable <unit> | Enables a unit to start automatically at boot. |
| systemctl disable <unit> | Disables a unit from starting automatically at boot. |
| systemctl is-enabled <unit> | Checks whether a unit is enabled to start at boot. |
| systemctl is-active <unit> | Checks whether a unit is currently active (running). |
| systemctl list-units \| grep <unit> | Filters the full unit list for a specific unit. |

  • Replace <unit> with the name of the service or unit (e.g., ollama.service or just ollama).

 

11. UFW Cheat Sheet

| Command | Description |
| --- | --- |
| sudo ufw enable | Enable UFW |
| sudo ufw disable | Disable UFW |
| sudo ufw allow 22/tcp | Allow a port |
| sudo ufw allow 'OpenSSH' | Allow a service by name |
| sudo ufw deny 23 | Deny a port |
| sudo ufw deny from 192.168.1.100 | Deny an IP address |
| sudo ufw delete allow 80 | Delete a rule |
| sudo ufw status verbose | Check status |
| sudo ufw default deny incoming | Set default inbound policy |
| sudo ufw default allow outgoing | Set default outbound policy |

 

12. Ollama Cheat Sheet

| Command | Description |
| --- | --- |
| ollama serve | Starts Ollama on your local system. |
| ollama create <new_model> | Creates a new model from an existing one for customization or training. |
| ollama show <model> | Displays details about a specific model, such as its configuration and release date. |
| ollama run <model> | Runs the specified model, making it ready for interaction. |
| ollama pull <model> | Downloads the specified model to your system. |
| ollama list | Lists all the downloaded models. The same as ollama ls. |
| ollama ps | Shows the currently running models. |
| ollama stop <model> | Stops the specified running model. |
| ollama rm <model> | Removes the specified model from your system. |
| ollama help | Provides help about any command. |

  • Replace <model> with the name of the model (e.g., gemma3:4b).

 

13. Docker Cheat Sheet

| Command | Description |
| --- | --- |
| docker --help | Help |
| docker ps (--all) | List running containers (add --all to include stopped ones) |
| docker info | Display system information |
| docker pull <image_name> | Pull an image from Docker Hub |
| docker search <image_name> | Search Docker Hub for an image |
| docker run --name <container_name> <image_name> | Create and run a container from an image, with a custom name |
| docker start <container_name> | Start a container |
| docker stop <container_name> | Stop a container |
| docker rm <container_name> | Remove a stopped container |
| docker logs -f <container_name> | Follow the logs of a container |
| docker inspect <container_name> | Inspect a running container |

 

14. (Optional) Working with Chroma DB

This proof‑of‑concept demonstrates how to read a Chroma DB directly. We copied all files from the Linux directory /llm/pdf_index/chroma_db to a Windows folder ./chromadb inside a VS Code project.

The Python script below lists collections and allows you to query them.

requirements.txt

chromadb>=1.4.1

Python Code: chromadb_query.py

Note that at the time of writing, Python 3.14 has package compatibility issues; this script was tested with Python 3.13.5 (see https://youtu.be/v5hipvLeo2E for how to work with multiple Python versions on Windows).


#####################################################################
# Python version used: 3.13.5
# See: https://github.com/chroma-core/chroma/issues/5996
#####################################################################
import chromadb
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
import os

# Connect to the persistent Chroma DB
db_path = "./chromadb"
if not os.path.exists(db_path):
    print(f"Error: Database path {db_path} does not exist.")
    exit(1)

try:
    client = chromadb.PersistentClient(
        path=db_path,
        settings=Settings(),
        tenant=DEFAULT_TENANT,
        database=DEFAULT_DATABASE,
    )
except Exception as e:
    print(f"Error connecting to ChromaDB: {e}")
    exit(1)

# List collections
try:
    collections = client.list_collections()
except Exception as e:
    print(f"Error listing collections: {e}")
    exit(1)

if not collections:
    print("No collections found in the database.")
    exit(1)

print("Available collections:")
for idx, col in enumerate(collections):
    print(f"{idx}: {col.name}")

# Let user choose a collection
try:
    choice = int(input("\nSelect a collection by number: "))
    if choice < 0 or choice >= len(collections):
        print("Invalid selection.")
        exit(1)
    collection_name = collections[choice].name
except ValueError:
    print("Invalid input. Please enter a number.")
    exit(1)

# Get the collection
try:
    collection = client.get_collection(name=collection_name)
    print(f"\nUsing collection: {collection_name}")
except Exception as e:
    print(f"Error getting collection: {e}")
    exit(1)

# Query text
query = input("\nEnter your query: ")
try:
    results = collection.query(
        query_texts=[query],
        n_results=5
    )
except Exception as e:
    print(f"Error querying collection: {e}")
    exit(1)

# Display results
if results and results.get("documents"):
    docs = results["documents"][0]
    if docs:
        print(f"\nFound {len(docs)} results:")
        for i, doc in enumerate(docs, 1):
            print(f"\nResult {i}:")
            print(doc)
    else:
        print("No documents found matching your query.")
else:
    print("No results returned.")
 
 