I have a local server with NVIDIA GPUs that I bought off eBay for $800. The GPUs are consumer RTX cards, but there are four of them in the server. I run Ollama on it and have downloaded a few models that I use mainly to ask how to configure other applications. On my laptops I run Docker containers with Open-WebUI and Ollama-WebUI, roughly as shown below. Open-WebUI allows saving chats, while Ollama-WebUI does not.
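For reference, starting the Open-WebUI container on a laptop looks roughly like this (a minimal sketch; `<server_ip>` is a placeholder for the GPU server's address, and 11434 is Ollama's default port):
```bash
# Run Open-WebUI in Docker and point it at the Ollama server on the LAN.
# Replace <server_ip> with your server's address; adjust ports to taste.
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://<server_ip>:11434 \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
The UI is then available in a browser at http://localhost:3000.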
The performance is about as good as any online LLM service. I routinely get 20+ tokens/sec, so there is no long wait to get a query answered, stop it, and add more information. In other words, it is usable.
The answers are accurate and not all that different from using an online version. Give it a try! It is fun and really, really cheap.
Appendix
Once I had the setup working, I asked it to write up all the necessary steps in blog fashion. I haven't validated them, but they look roughly like what I did.
Title: Setting Up OLLAMA and Open-WebUI Across Server and Windows Laptop
In this blog post, we'll walk through the process of setting up OLLAMA, a runtime for running large language models locally, on a server equipped with four NVIDIA GPUs, and connecting it to Open-WebUI, an intuitive user interface for large language models, on a Windows laptop using WSL (Windows Subsystem for Linux).
**Part 1: Server Setup**
First, let's set up OLLAMA on the server. Begin by updating the package lists, then install OLLAMA using its official install script:
```bash
sudo apt-get update && sudo apt-get upgrade -y
curl -fsSL https://ollama.com/install.sh | sh
```
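You can confirm the install succeeded by checking the version:
```bash
ollama --version
```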
Now, we'll pull a few models from the Ollama model library. Here, we'll use DeepSeek, Mistral, and Qwen:
```bash
ollama pull deepseek-r1
ollama pull mistral
ollama pull qwen
```
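As a quick sanity check, you can list the downloaded models and run a one-off prompt straight from the command line:
```bash
# List the models available locally
ollama list
# Run a single prompt against one of them
ollama run mistral "Say hello in one sentence."
```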
To configure OLLAMA as a service, create a systemd unit file in `/etc/systemd/system`:
```bash
sudo nano /etc/systemd/system/ollama.service
```
Add the following content and save the file:
```ini
[Unit]
Description=OLLAMA Service
After=network-online.target

[Service]
User=<username>
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```
Next, enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```
OLLAMA is configured through environment variables rather than a config file. To make it listen on all network interfaces (so the laptop can reach it) and use all four GPU devices, add the following lines to the `[Service]` section of the unit file:
```ini
# In the [Service] section
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
```
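After a `sudo systemctl restart ollama` to pick up the new environment, you can verify from another machine on the network that the API is reachable (again, `<server_ip>` is a placeholder for the server's address):
```bash
# List the models the server exposes over the network
curl http://<server_ip>:11434/api/tags
```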
**Part 2: Windows Laptop Setup**
Install WSL and Ubuntu if not already done. Open an Ubuntu terminal, update the package lists, and install Open-WebUI:
```bash
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y python3-pip
pip install open-webui
```
Point Open-WebUI at the OLLAMA service on the server using its IP address and port, then start it:
```bash
# Tell Open-WebUI where the Ollama API lives, then launch the web interface
export OLLAMA_BASE_URL=http://<server_ip>:11434
open-webui serve
```
Once it starts, Open-WebUI is available in a browser at `http://localhost:8080`, providing a user-friendly interface for interacting with your language models.
By following this guide, you've successfully set up OLLAMA on a server with multiple GPUs and connected it to Open-WebUI on a Windows laptop using WSL, enabling seamless access to powerful AI models from anywhere.