I have a local server with Nvidia GPUs that I bought off eBay for $800. The GPUs are RTX cards, and there are four of them in the server. I run Ollama on it and have downloaded a few models that I use mainly to ask how to configure other applications. I have multiple laptops where I run Docker containers with open-webui and ollama-webui, as shown below. Open-webui allows saving chats, while ollama-webui does not.
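For reference, the Open-WebUI container on a laptop can be started with something like this (a rough sketch based on the Open-WebUI quick-start; the ports and volume name are defaults, and `<server_ip>` is whatever your server's LAN address is):
```bash
# Run Open-WebUI in Docker, pointing it at the Ollama server on the LAN
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://<server_ip>:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
# The UI is then available at http://localhost:3000
```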
The performance is about as good as any online LLM service. I routinely get 20+ tokens/sec, i.e. there is no long wait to get a query answered, stop, and add more information. In other words, it is usable.
All the models are free and open source, so I pay nothing to use them. The only out-of-pocket expense was the server itself. I could have built one around a $99 two-socket Intel motherboard from AliExpress, but I did not have the time, so I bought this one from a student on eBay.
The answers are accurate and not all that different from what the online versions give. Give it a try! It is fun and really, really cheap.
Appendix
Once I had the setup working, I asked it to write up all the necessary steps in blog fashion. I haven't validated them, but they look roughly like what I did.
Title: Setting Up Ollama and Open-WebUI Across a Server and a Windows Laptop
In this blog post, we'll walk through the process of setting up Ollama, a local runtime for large language models, on a server equipped with four NVIDIA GPUs, and connecting it to Open-WebUI, an intuitive user interface for LLMs, on a Windows laptop using WSL (Windows Subsystem for Linux).
**Part 1: Server Setup**
First, let's set up Ollama on the server. Begin by updating the system, then install Ollama with the official install script:
```bash
sudo apt-get update && sudo apt-get upgrade -y
curl -fsSL https://ollama.com/install.sh | sh
```
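Before going further, it's worth a quick sanity check that the binary installed and the GPUs are visible (the exact output will differ per machine):
```bash
ollama --version   # prints the installed Ollama version
nvidia-smi         # should list all four RTX GPUs
```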
Now, we'll pull your favorite models from the Ollama model library. Here, we'll use DeepSeek, Mistral, and Qwen:
```bash
ollama pull deepseek-r1   # or another DeepSeek tag from the library
ollama pull mistral
ollama pull qwen
```
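To confirm the pulls succeeded, list the local models and try a one-off prompt (the model tag here is just an example; use any model you pulled):
```bash
ollama list                        # shows downloaded models and their sizes
ollama run mistral "Say hello"     # quick one-off prompt to verify inference works
```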
To configure Ollama as a service, create a systemd unit file in `/etc/systemd/system` (the install script may already have created one, in which case just edit it):
```bash
sudo nano /etc/systemd/system/ollama.service
```
Add the following content and save the file:
```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
User=<username>
ExecStart=/usr/local/bin/ollama serve
Restart=always
EnvironmentFile=-/home/<username>/.ollama/config

[Install]
WantedBy=multi-user.target
```
Next, enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```
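The service should now be answering on Ollama's default port, 11434; a quick check:
```bash
systemctl status ollama --no-pager     # confirm the service is active
curl http://localhost:11434/api/tags   # JSON list of locally available models
```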
Ollama is configured through environment variables, so to expose it on the network (set up port forwarding on your router if you need access from outside the LAN) and make all four GPUs available, update the `/home/<username>/.ollama/config` file referenced by the unit above:
```bash
# Listen on all interfaces; 11434 is Ollama's default port
OLLAMA_HOST=0.0.0.0:11434
# GPU IDs Ollama may use
CUDA_VISIBLE_DEVICES=0,1,2,3
```
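After editing the file, restart the service so the new environment takes effect, and verify it is reachable from another machine on the network (substitute your server's actual IP for the placeholder):
```bash
sudo systemctl restart ollama
# From a laptop on the same network:
curl http://<server_ip>:11434/api/version
# On the server, watch GPU utilization while a model is answering:
watch -n 1 nvidia-smi
```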
**Part 2: Windows Laptop Setup**
Install WSL and Ubuntu if not already done. Open an Ubuntu terminal, update the package lists, and install Open-WebUI:
```bash
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y python3-pip
pip3 install open-webui
```
Point Open-WebUI at the Ollama service on the server using its IP and port, then start it:
```bash
OLLAMA_BASE_URL=http://<server_public_ip>:11434 open-webui serve
```
Once it is running, open http://localhost:8080 in a browser; Open-WebUI provides a user-friendly interface for interacting with your language models.
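If the UI cannot see any models, it usually means it can't reach the server; calling the Ollama API directly from the WSL terminal helps narrow it down (the model name here is just an example):
```bash
# Query the server's API directly, bypassing the UI
curl http://<server_public_ip>:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Hello", "stream": false}'
```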
By following this guide, you've successfully set up Ollama on a server with multiple GPUs and connected it to Open-WebUI on a Windows laptop using WSL, enabling seamless access to powerful AI models from anywhere on your network.