I have a local server with Nvidia GPUs that I bought off eBay for $800. The GPUs are RTX cards, and there are four of them in the server. I run Ollama on it and have downloaded a few models that I use mainly to ask how to configure other applications. I have multiple laptops where I run Docker containers with open-webui and ollama-webui, as shown below. Open-webui allows saving chats, while ollama-webui does not.
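For reference, the Open-WebUI container on a laptop can be started with something like this (a rough sketch based on the Open-WebUI quick-start; the ports and volume name are defaults, and `<server_ip>` is whatever your server's LAN address is):
```bash
# Run Open-WebUI in Docker, pointing it at the Ollama server on the LAN
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://<server_ip>:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
# The UI is then available at http://localhost:3000
```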
The performance is about as good as any online LLM service. I routinely get 20+ tokens/sec, i.e. there is no long wait to get a query answered, stop, and add more information. In other words, it is usable.
All the models are free and open source, so I pay nothing to use them. The only out-of-pocket expense was the server itself. I could have built one around a $99 two-socket Intel motherboard from AliExpress, but I did not have the time, so I bought this one from a student on eBay.
The answers are accurate and not all that different from what the online versions give. Give it a try! It is fun and really, really cheap.
Appendix
Once I had the setup working, I asked it to write up all the necessary steps in blog fashion. I haven't validated them, but they look roughly like what I did.
Title: Setting Up Ollama and Open-WebUI Across a Server and a Windows Laptop
In this blog post, we'll walk through the process of setting up Ollama, a local runtime for large language models, on a server equipped with four NVIDIA GPUs, and connecting it to Open-WebUI, an intuitive user interface for LLMs, on a Windows laptop using WSL (Windows Subsystem for Linux).
**Part 1: Server Setup**
First, let's set up Ollama on the server. Begin by updating the system, then install Ollama with the official install script:
```bash
sudo apt-get update && sudo apt-get upgrade -y
curl -fsSL https://ollama.com/install.sh | sh
```
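Before going further, it's worth a quick sanity check that the binary installed and the GPUs are visible (the exact output will differ per machine):
```bash
ollama --version   # prints the installed Ollama version
nvidia-smi         # should list all four RTX GPUs
```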
Now, we'll pull your favorite models from the Ollama model library. Here, we'll use DeepSeek, Mistral, and Qwen:
```bash
ollama pull deepseek-r1   # or another DeepSeek tag from the library
ollama pull mistral
ollama pull qwen
```
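To confirm the pulls succeeded, list the local models and try a one-off prompt (the model tag here is just an example; use any model you pulled):
```bash
ollama list                        # shows downloaded models and their sizes
ollama run mistral "Say hello"     # quick one-off prompt to verify inference works
```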
To configure Ollama as a service, create a systemd unit file in `/etc/systemd/system` (the install script may already have created one, in which case just edit it):
```bash
sudo nano /etc/systemd/system/ollama.service
```
Add the following content and save the file:
```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
User=<username>
ExecStart=/usr/local/bin/ollama serve
Restart=always
EnvironmentFile=-/home/<username>/.ollama/config

[Install]
WantedBy=multi-user.target
```
Next, enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```
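The service should now be answering on Ollama's default port, 11434; a quick check:
```bash
systemctl status ollama --no-pager     # confirm the service is active
curl http://localhost:11434/api/tags   # JSON list of locally available models
```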
Ollama is configured through environment variables, so to expose it on the network (set up port forwarding on your router if you need access from outside the LAN) and make all four GPUs available, update the `/home/<username>/.ollama/config` file referenced by the unit above:
```bash
# Listen on all interfaces; 11434 is Ollama's default port
OLLAMA_HOST=0.0.0.0:11434
# GPU IDs Ollama may use
CUDA_VISIBLE_DEVICES=0,1,2,3
```
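After editing the file, restart the service so the new environment takes effect, and verify it is reachable from another machine on the network (substitute your server's actual IP for the placeholder):
```bash
sudo systemctl restart ollama
# From a laptop on the same network:
curl http://<server_ip>:11434/api/version
# On the server, watch GPU utilization while a model is answering:
watch -n 1 nvidia-smi
```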
**Part 2: Windows Laptop Setup**
Install WSL and Ubuntu if not already done. Open an Ubuntu terminal, update the package lists, and install Open-WebUI:
```bash
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y python3-pip
pip3 install open-webui
```
Point Open-WebUI at the Ollama service on the server using its IP and port, then start it:
```bash
OLLAMA_BASE_URL=http://<server_public_ip>:11434 open-webui serve
```
Once it is running, open http://localhost:8080 in a browser; Open-WebUI provides a user-friendly interface for interacting with your language models.
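If the UI cannot see any models, it usually means it can't reach the server; calling the Ollama API directly from the WSL terminal helps narrow it down (the model name here is just an example):
```bash
# Query the server's API directly, bypassing the UI
curl http://<server_public_ip>:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Hello", "stream": false}'
```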
By following this guide, you've successfully set up Ollama on a server with multiple GPUs and connected it to Open-WebUI on a Windows laptop using WSL, enabling seamless access to powerful AI models from anywhere on your network.