How to create a local RAG-enabled LLM server that provides safe access to your documents
This tutorial explains how to set up a headless RAG-enabled large language model (LLM) on an Ubuntu server. By the end, you will be able to chat with your LLM running locally on your own infrastructure. Your conversations will not be sent to third-party servers, ensuring that your prompts, responses, and any uploaded files remain private and under your control.
A quick word on security
Effective network security begins with a strong perimeter defence. One of the most critical steps is configuring a firewall on your router and hosts to explicitly deny all unsolicited inbound traffic from the internet. This “default deny” posture ensures that no internal services are accidentally exposed, creating a controlled environment where all communication is initiated from within your trusted network.
A common and potentially dangerous misconception is that Network Address Translation (NAT) alone provides sufficient security. While NAT obscures internal devices, it is a routing mechanism rather than a security control. Meaningful protection comes only from a deliberately configured firewall. Relying solely on NAT can leave your network vulnerable to determined attacks that bypass this superficial barrier.
Index
This tutorial is divided into the following sections:
- The Ubuntu Server
- (Optional) Python scripts
- Install and configure Ollama
- Choosing your model
- Install Docker
- Install and configure Open WebUI
- Install and configure Chroma DB
- Reconfigure Open WebUI to use the Chroma DB vector database
- Create a Workspace Knowledge Collection with your RAG content
- systemctl Cheat Sheet
- ufw Cheat Sheet
- Ollama Cheat Sheet
- Docker Cheat Sheet
- (Optional) Working with Chroma DB
Each topic is substantial, and this tutorial focuses on the essentials required to achieve the desired outcome. Along the way, we validate progress at each stage. You are encouraged to comment on and extend this work to help others. If you find it useful, please like, share, and subscribe.
1. The Ubuntu Server
All examples in this guide assume a CPU-only setup. If you have a GPU available—either on bare metal or passed through to a virtual machine—you will need to adjust parameters accordingly to take advantage of that hardware.
The server used in this tutorial is a virtual machine with the following specifications:
- 32 GB RAM
- 128 GB disk
- 16 CPU cores
- No GPU
- OS: Ubuntu 24.04.3 LTS
During installation, the OpenSSH server was installed (this can also be done post-install). All system packages are fully up to date.
The firewall is enabled, with SSH access allowed on port 22 from the local LAN only.
sudo ufw allow from 192.168.16.0/24 to any port 22
The server has a static IP address of 192.168.16.7.
2. (Optional) Python scripts
Throughout the tutorial, optional Python scripts are provided. These are not required to follow the core setup steps, but they may be useful if you intend to build custom applications or integrations around your locally hosted LLM.
All scripts were tested using Visual Studio Code. You should create a Python virtual environment. A requirements.txt file is provided alongside each example to document the required dependencies.
All code has been tested. If you encounter problems when applying the requirements.txt file, the most likely cause is an incompatibility with your Python version. Watch this video to understand how to manage this: https://youtu.be/v5hipvLeo2E.
3. Install and configure Ollama
Ollama website: https://ollama.com/
Download and install Ollama as described in the official documentation:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama and configures it as a systemd service (ollama.service).
Verify the installation
ollama --version
Check the Ollama service status
systemctl status ollama.service
or
systemctl is-active ollama.service
Test Ollama locally
curl -w "\n" http://127.0.0.1:11434/api/tags
As no models are installed, the output should be:
{"models":[]}
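If you prefer to script this check, here is a minimal Python sketch (assuming the requests package is installed; any HTTP client would do). Run it on the server itself, since at this point Ollama only listens on 127.0.0.1.

import requests

# Query the local Ollama API for the list of installed models (same as the curl check above).
resp = requests.get("http://127.0.0.1:11434/api/tags", timeout=5)
resp.raise_for_status()
models = resp.json().get("models", [])
print(f"Ollama is reachable; {len(models)} model(s) installed.")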
Configure Ollama
By default, Ollama listens on 127.0.0.1:11434. To allow connections from other machines on the local network, the service configuration must be modified.
sudo systemctl edit ollama.service
Add the following:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=-1"
OLLAMA_KEEP_ALIVE=-1 keeps loaded models in memory indefinitely. OLLAMA_HOST=0.0.0.0:11434 allows connections from other hosts on the LAN.
Restart the service:
sudo systemctl restart ollama.service
systemctl status ollama.service
Enable the firewall
sudo ufw enable
(Optional) Disable IPv6
sudo nano /etc/default/ufw
Change IPV6=yes to IPV6=no.
Allow access to Ollama
sudo ufw allow 11434
Verify firewall rules
sudo ufw status
Test from a browser
Navigate to:
http://192.168.16.7:11434/api/tags
You should see the same JSON response indicating that no models are installed.
4. Choosing your model
Ollama provides access to a wide range of open-source LLMs, including families such as Llama (Meta), Mistral, and Gemma (Google). These models vary significantly in size, from compact 3B parameter models to very large 70B+ parameter models. Model size directly impacts hardware requirements and performance.
Smaller models are faster, require less memory, and can run acceptably on CPU-only systems, making them suitable for laptops, desktops, or lightweight servers. Larger models offer improved reasoning and language capabilities but require significantly more memory and typically benefit from a GPU with ample VRAM.
Selecting a model is a balance between task complexity and available hardware. Models also differ by specialization: some are optimized for general conversation, others for coding, mathematics, or domain-specific tasks. Quantized variants (for example, Q4 or Q5) reduce memory usage at some cost to accuracy.
For experimentation or CPU-only systems with limited memory, a 3–4B model is a practical starting point. In this tutorial, we use Google’s gemma3:4b model.
Download the model
ollama pull gemma3:4b
List installed and running models:
ollama list # lists installed models
ollama ps # lists running models
Preload the model into memory
curl http://localhost:11434/api/generate -d '{"model": "gemma3:4b", "keep_alive": -1}'
At this point, the model is loaded and ready to accept requests.
(Optional) Chat with your model
The code below demonstrates how to interact with your model programmatically. This approach allows you to embed AI capabilities directly into custom applications or services.
requirements.txt
ollama>=0.6.1
Python Code: ollama_query_client.py
from ollama import Client

HOST = "http://192.168.16.7:11434"  # default port 11434
MODEL = "gemma3:4b"                 # change to your pulled model
PROMPT = "Why is the sky blue?"

client = Client(host=HOST)

# Chat-style
resp = client.chat(model=MODEL, messages=[{"role": "user", "content": PROMPT}], stream=False)
print(resp["message"]["content"])

# OR simple generate
resp2 = client.generate(model=MODEL, prompt=PROMPT, stream=False)
print(resp2["response"])
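If you would rather see tokens appear as they are generated instead of waiting for the complete answer, the ollama client also supports streaming. The sketch below is an optional extension of the script above and reuses the same client, MODEL, and PROMPT.

# Streaming variant: print tokens as they arrive instead of waiting for the full response.
for chunk in client.chat(model=MODEL, messages=[{"role": "user", "content": PROMPT}], stream=True):
    print(chunk["message"]["content"], end="", flush=True)
print()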
5. Install Docker
Docker website: https://www.docker.com/
We will use Open WebUI and Chroma DB, both of which run inside Docker containers. As a result, Docker must be installed on the host system.
Set up Docker’s apt repository.
# Add Docker's official GPG key:
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
Install the latest version of Docker
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Verify the Docker installation:
docker --version
6. Install and Configure Open WebUI
Run the Open WebUI container
sudo docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e CORS_ALLOW_ORIGIN="http://localhost:3000;http://192.168.16.7:3000" \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:latest # --gpus=all # for GPU support
Notes:
- **CORS_ALLOW_ORIGIN** restricts web access to the specified origins and prevents CORS-related errors.
- **-v ollama:/root/.ollama** persists Ollama models and data in a Docker volume named ollama.
- **-v open-webui:/app/backend/data** persists Open WebUI application data.
- **--restart always** ensures the container restarts automatically after crashes or system reboots.
These settings ensure that application state and models are preserved across restarts.
sudo docker ps
Verify Open WebUI responsiveness
curl http://localhost:3000
You should see the HTML output for the Open WebUI web interface.
Allow LAN access through the firewall
sudo ufw allow 3000/tcp
Log in to Open WebUI
Navigate to http://192.168.16.7:3000 in your browser. The Open WebUI login page should appear.
Create an administrator account.
Open WebUI should automatically detect the local Ollama instance. If only one model is installed (for example, gemma3), it will be selected as the default. If multiple models are available, you can choose a default model from the settings.
Open WebUI ships with reasonable default settings. Depending on the model in use, these may be tuned for improved performance or accuracy. For example, when using gemma3, increasing the Context Length to more than 8192 tokens helps prevent RAG truncation.
(Optional) Python API Example
The following Python script demonstrates how to query Open WebUI using its API. In production environments, API keys should be issued to users with a User role rather than an Admin role.
Enable API access in Open WebUI
- In the admin panel, open Settings → General and enable API Keys. Save the changes.
- Navigate to Users → Groups and create a group (for example, Python). Enable API access for this group.
- Add users to the group.
- Each user must log in and generate an API key under Settings → Account.
Requirements
requests>=2.32.4
Example Python client: ollama_query_client.py
import requests

API_KEY = "sk-2a3*****"  # paste the API key you generated
API_URL = "http://192.168.16.7:3000/api/chat/completions"  # your Open WebUI URL
MODEL_NAME = "gemma3:4b"  # your model name (as shown in Open WebUI or /api/models)

# Prepare the authorization header with the Bearer token
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Define your prompt message and target model
prompt = "Why is the sun yellow?"
payload = {
    "model": MODEL_NAME,
    "messages": [
        {"role": "user", "content": prompt}
    ]
}

# Send the POST request to the chat completions endpoint
response = requests.post(API_URL, headers=headers, json=payload)

# Print the assistant's reply (the response is in JSON format)
result = response.json()
print(result)
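To confirm which model names the API accepts, you can also query the /api/models endpoint mentioned above with the same key. A minimal sketch; the exact JSON layout may differ between Open WebUI versions, so it falls back to printing the raw payload.

# List the models available through Open WebUI using the same API key.
models_response = requests.get("http://192.168.16.7:3000/api/models", headers=headers)
models_response.raise_for_status()
data = models_response.json()
if isinstance(data, dict) and "data" in data:  # OpenAI-style listing (assumed)
    for entry in data["data"]:
        print(entry.get("id"))
else:
    print(data)  # fall back to printing the raw payload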
7. Install and configure Chroma DB
Chroma DB website: https://www.trychroma.com/
Pull the latest Chroma DB image and start the container:
sudo docker pull chromadb/chroma:latest
sudo docker run -d \
-p 8000:8000 \
-v /llm/pdf_index/chroma_db:/data \
-e IS_PERSISTENT=TRUE \
-e PERSIST_DIRECTORY=/data \
-e ALLOW_RESET=TRUE \
-e ANONYMIZED_TELEMETRY=FALSE \
--name chromadb_server \
--restart always \
chromadb/chroma:latest
Configuration notes
- **IS_PERSISTENT=TRUE** enables disk-backed storage instead of in-memory operation.
- **PERSIST_DIRECTORY=/data** defines the internal storage path, mapped to /llm/pdf_index/chroma_db on the host.
- **ALLOW_RESET=TRUE** permits database resets via the API. This should typically be FALSE in production.
- **ANONYMIZED_TELEMETRY=FALSE** disables usage telemetry.
Allow firewall access
sudo ufw allow 8000
Verify Chroma DB
Open the following URLs in a browser:
http://192.168.16.7:8000/api/v2/heartbeat
http://192.168.16.7:8000/docs
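You can also verify the server from Python using the chromadb package's HTTP client (the same package used in section 14). A minimal sketch, assuming the host and port configured above:

import chromadb

# Connect to the Chroma DB container over HTTP and check that it responds.
client = chromadb.HttpClient(host="192.168.16.7", port=8000)
print("Heartbeat:", client.heartbeat())  # returns a timestamp in nanoseconds when the server is healthy
print("Collections:", client.list_collections())  # empty at this point, before any RAG content is added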
8. Reconfigure Open WebUI to use Chroma DB
Stop and remove the existing Open WebUI container:
sudo docker stop open-webui
sudo docker rm open-webui
Recreate the container with Chroma DB configuration:
sudo docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e VECTOR_DB=chroma \
-e CHROMA_HTTP_HOST=host.docker.internal \
-e CHROMA_HTTP_PORT=8000 \
-e CORS_ALLOW_ORIGIN="http://localhost:3000;http://192.168.16.7:3000" \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:latest
Monitor startup logs:
sudo docker logs open-webui -f
9. Create a Workspace Knowledge Collection with your RAG content
In Open WebUI, select Workspace from the sidebar, then choose Knowledge.
Create a new knowledge collection by providing a name and description:
- Click the Workspace tab in the left-hand navigation sidebar.
- Select Knowledge from the menu.
- Name the collection and save it.
- Upload the files you want to add to your collection.
Test
After processing completes and Chroma DB is updated, start a new chat. You will need to attach the knowledge collection.
(Optional) Python solution
The Python script below lists all Knowledge collections, prompts you to select one, and then asks for a prompt to run against the selected RAG collection.
requirements.txt
requests>=2.32.4
Python Code: ollama_open-webui_ws.py
import requests
import json

API_KEY = "sk-2a34ed40135f47d1aa42e6f82e8667b1"  # paste the API key you generated
BASE_URL = "http://192.168.16.7:3000"  # e.g. your Open WebUI URL
KNOWLEDGE_URL = f"{BASE_URL}/api/v1/knowledge/"
CHAT_COMPLETIONS_URL = f"{BASE_URL}/api/chat/completions"
MODEL_NAME = "gemma3:4b"  # e.g. your model name (as shown in Open WebUI or /api/models)

def get_headers():
    """
    Returns the authorization headers for API requests.
    """
    return {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

def list_knowledge_bases():
    """
    Lists all available knowledge bases from Open WebUI and returns them.
    """
    try:
        response = requests.get(KNOWLEDGE_URL, headers=get_headers())
        if response.status_code == 200:
            knowledge_bases = response.json()
            items = knowledge_bases.get('items', [])
            if not items:
                print("No knowledge bases found.")
                return None
            print(f"\n{'#':<3} | {'Id':<40} | {'Name':<20} | {'Description'}")
            print("-" * 110)
            for i, kb in enumerate(items, 1):
                id = kb.get('id', 'N/A')
                name = kb.get('name', 'Unknown')
                description = kb.get('description', '') or "No description"
                print(f"{i:<3} | {id:<40} | {name:<20} | {description}")
            return items
        elif response.status_code == 401:
            print("Error: Unauthorized. Please check your API Key.")
        else:
            print(f"Error: Received status code {response.status_code}")
            print("Response:", response.text)
        return None
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to Open WebUI at {BASE_URL}. Is the server running?")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return None

def select_knowledge_base(knowledge_bases):
    """
    Prompts the user to select a knowledge base from the list.
    """
    while True:
        try:
            selection = int(input(f"\nSelect a knowledge base (1-{len(knowledge_bases)}): "))
            if 1 <= selection <= len(knowledge_bases):
                selected_kb = knowledge_bases[selection - 1]
                print(f"\nSelected: {selected_kb.get('id', None)}")
                return selected_kb
            else:
                print(f"Invalid selection. Please enter a number between 1 and {len(knowledge_bases)}.")
        except ValueError:
            print("Invalid input. Please enter a valid number.")

def query_knowledge_base(kb_id, prompt):
    """
    Executes a query against the specified knowledge base.
    """
    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "files": [
            {"type": "collection", "id": kb_id}
        ]
    }
    try:
        response = requests.post(CHAT_COMPLETIONS_URL, headers=get_headers(), json=payload)
        if response.status_code == 200:
            result = response.json()
            print("\nQuery Result:")
            print(json.dumps(result, indent=2))
            return result
        else:
            print(f"Error: Received status code {response.status_code}")
            print("Response:", response.text)
            return None
    except requests.exceptions.ConnectionError:
        print(f"Error: Could not connect to Open WebUI at {BASE_URL}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return None

if __name__ == "__main__":
    print("Connecting to Open WebUI...")

    # List and get knowledge bases
    knowledge_bases = list_knowledge_bases()
    if not knowledge_bases:
        exit()

    # Let user select a knowledge base
    selected_kb = select_knowledge_base(knowledge_bases)
    if not selected_kb:
        exit()

    # Prompt for query
    user_prompt = input("\nEnter your prompt: ")
    if not user_prompt.strip():
        print("No prompt provided.")
        exit()

    # Execute query
    print("\nProcessing query...")
    kb_id = selected_kb.get('id')
    query_knowledge_base(kb_id, user_prompt)
10. systemctl Cheat Sheet
| Command | Description |
|---|---|
| systemctl start <unit> | Starts (activates) a unit (e.g., service, socket). |
| systemctl stop <unit> | Stops (deactivates) a unit. |
| systemctl restart <unit> | Stops and then starts a unit. |
| systemctl reload <unit> | Reloads the configuration of a unit without restarting it (if supported). |
| systemctl status <unit> | Shows the current status of a unit (active/inactive, logs, etc.). |
| systemctl enable <unit> | Enables a unit to start automatically at boot. |
| systemctl disable <unit> | Disables a unit from starting automatically at boot. |
| systemctl is-enabled <unit> | Checks whether a unit is enabled to start at boot. |
| systemctl is-active <unit> | Checks whether a unit is currently active (running). |
| systemctl list-units \| grep <unit> | Filters the full list of units for the specified unit. |

- Replace <unit> with the name of the service or unit (e.g., ollama.service or just ollama).
11. UFW Cheat Sheet
| Command | Description |
|---|---|
| sudo ufw enable | Enable UFW |
| sudo ufw disable | Disable UFW |
| sudo ufw allow 22/tcp or sudo ufw allow 'OpenSSH' | Allow a port/service |
| sudo ufw deny from 192.168.1.100 or sudo ufw deny 23 | Deny a port or IP |
| sudo ufw delete allow 80 | Delete a rule |
| sudo ufw status verbose | Check status |
| sudo ufw default deny incoming and sudo ufw default allow outgoing | Set default policy |
12. Ollama Cheat Sheet
| Command | Description |
|---|---|
| ollama serve | Starts Ollama on your local system. |
| ollama create <new_model> | Creates a new model from an existing one for customization or training. |
| ollama show <model> | Displays details about a specific model, such as its configuration and release date. |
| ollama run <model> | Runs the specified model, making it ready for interaction. |
| ollama pull <model> | Downloads the specified model to your system. |
| ollama list | Lists all the downloaded models. The same as ollama ls. |
| ollama ps | Shows the currently running models. |
| ollama stop <model> | Stops the specified running model. |
| ollama rm <model> | Removes the specified model from your system. |
| ollama help | Provides help about any command. |

- Replace <model> with the name of the model (e.g., gemma3:4b).
13. Docker Cheat Sheet
| Command | Description |
|---|---|
| docker --help | Help |
| docker ps (--all) | List running containers (add --all to include stopped ones) |
| docker info | Display system information |
| docker pull <image_name> | Pull an image from Docker Hub |
| docker search <image_name> | Search Docker Hub for an image |
| docker run --name <container_name> <image_name> | Create and run a container from an image, with a custom name |
| docker start <container_name> | Start a container |
| docker stop <container_name> | Stop a container |
| docker rm <container_name> | Remove a stopped container |
| docker logs -f <container_name> | Output logs related to a container |
| docker inspect <container_name> | Inspect a running container |
14. (Optional) Working with Chroma DB
This proof‑of‑concept demonstrates how to read a Chroma DB directly. We copied all files from the Linux directory /llm/pdf_index/chroma_db to a Windows folder ./chromadb inside a VS Code project.
The Python script below lists collections and allows you to query them.
requirements.txt
chromadb>=1.4.1
Python Code: chromadb_query.py
Note that at the time of writing, Python 3.14 has package compatibility issues (see https://youtu.be/v5hipvLeo2E for information on how to work with multiple Python versions on Windows).
#####################################################################
# Python version used: 3.13.5
# See: https://github.com/chroma-core/chroma/issues/5996
#####################################################################
import chromadb
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
import os

# Connect to the persistent Chroma DB
db_path = "./chromadb"
if not os.path.exists(db_path):
    print(f"Error: Database path {db_path} does not exist.")
    exit(1)

try:
    client = chromadb.PersistentClient(
        path=db_path,
        settings=Settings(),
        tenant=DEFAULT_TENANT,
        database=DEFAULT_DATABASE,
    )
except Exception as e:
    print(f"Error connecting to ChromaDB: {e}")
    exit(1)

# List collections
try:
    collections = client.list_collections()
except Exception as e:
    print(f"Error listing collections: {e}")
    exit(1)

if not collections:
    print("No collections found in the database.")
    exit(1)

print("Available collections:")
for idx, col in enumerate(collections):
    print(f"{idx}: {col.name}")

# Let user choose a collection
try:
    choice = int(input("\nSelect a collection by number: "))
    if choice < 0 or choice >= len(collections):
        print("Invalid selection.")
        exit(1)
    collection_name = collections[choice].name
except ValueError:
    print("Invalid input. Please enter a number.")
    exit(1)

# Get the collection
try:
    collection = client.get_collection(name=collection_name)
    print(f"\nUsing collection: {collection_name}")
except Exception as e:
    print(f"Error getting collection: {e}")
    exit(1)

# Query text
query = input("\nEnter your query: ")
try:
    results = collection.query(
        query_texts=[query],
        n_results=5
    )
except Exception as e:
    print(f"Error querying collection: {e}")
    exit(1)

# Display results
if results and results.get("documents"):
    docs = results["documents"][0]
    if docs:
        print(f"\nFound {len(docs)} results:")
        for i, doc in enumerate(docs, 1):
            print(f"\nResult {i}:")
            print(doc)
    else:
        print("No documents found matching your query.")
else:
    print("No results returned.")
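As an optional extension, Chroma can return metadata and similarity distances alongside each matching chunk, which helps trace results back to their source documents. A minimal sketch that builds on the query above:

# Re-run the query, explicitly asking for metadata and distances alongside the documents.
detailed = collection.query(
    query_texts=[query],
    n_results=5,
    include=["documents", "metadatas", "distances"],
)
for doc, meta, dist in zip(detailed["documents"][0], detailed["metadatas"][0], detailed["distances"][0]):
    print(f"\ndistance={dist:.4f} metadata={meta}")
    print(doc[:200])  # print the first 200 characters of each chunk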