How to set up an inference server to serve AI models

In this post I'll walk you through how the community's inference servers are set up: the hardware we use, the stack we run, and the models we serve.

Just a couple of weeks ago I put out this tweet as a sort of "crazy idea" to see if anyone was interested in setting up an inference server to run local models with no token limits.

I guess the increasingly low and increasingly expensive limits from private model providers (Claude, GPT, etc.) have pushed people to take the need to try alternatives more and more seriously.

That's how NaN was born: a community of builders who genuinely need to burn through tokens, where the limits of private models were starting to fall short or get too expensive. In this post I'll tell you how the community's inference servers are set up.

Hardware

For now, all of our servers will have this configuration:

Hardware per server

To access and control the GPU, since it belongs to the Blackwell family, we installed the drivers from the nvidia-driver-570-open package and the cuda-toolkit

Models

To keep things short, we tested several models but in the end we're serving the following ones, although every 3 months we'll put them up for a vote to either rotate in newer/more powerful ones or keep them:

Server models

Inference stack

All right, with that context out of the way, let's get to the good part. This is the architecture the server has:

Inference server architecture
⚠️ This diagram shows the architecture of a single server, but using this same foundation you can easily scale it horizontally. That's exactly what we're doing at NaN

The server has 4 layers, from lowest to highest level:

Models: the downloaded models that will run on the server. In our case only the chosen LLM is loaded into vRAM and runs on the GPU. The rest, since they have much lower demand and we don't need huge speed, run on the CPU and load into system RAM.

Model Serving: Depending on the model, we use a specific server. I'll focus on vLLM since it's the key piece that lets us parallelize requests, handles memory management, and helps us with the actual inference of the LLM. On Hugging Face you can find the recommended parameters for serving the model, for example, for our current model: link. So as not to drag the post out, I'll mention the most important parameters we configured to improve the model's inference:

--max-model-len 131072: Qwen 3.6 has a native context window of 128K tokens.

--reasoning-parser qwen3: We set this one, but you need to make sure you have the right parsers so the model can correctly interpret everything it receives. Otherwise you can end up with clients seeing the chain of thought mixed in with the response, along with other problems.

--mm-encoder-attn-backend TORCH_SDPA: Required for Qwen3.6's multimodal (vision) encoder. Without it, the encoder uses an attention backend that may not be available on all GPU architectures.

--chat-template chat_template_fixed.jinja: This template is used by the model to know what's happening during a chat based on the XML tags it receives. For example <think> With this, the model tags its chain of thought. This template has a bug that affects earlier versions of Qwen but also models from other families like Kimi. The fix applied is the same one that's waiting to be merged into vLLM: bug

--speculative-config '{"method":"mtp","num_speculative_tokens": 2}': This parameter is pure fine-tuning. To sum it up, this parameter lets the model "guess" 2 extra tokens at each generation step. If the prediction is correct (which happens 80-92% of the time), they're all accepted at once, doubling the token generation speed with no additional memory cost. Thanks to this parameter we've achieved:

  • ~2x throughput boost
  • 80-92% acceptance rate for speculated tokens
  • Average acceptance length of ~2.6 tokens

API GW: LiteLLM is the element that lets us serve all the models under a single API. It also gives us the authentication/authorization layer so we can provide an API Key to each community member.

Not only that, it also lets us measure and add some hardening at the inference level to prevent DDoS, abuse, or someone trying to hog the GPU for themselves and affect everyone else. Some of the most important per-API-Key limits we have in place are (we have users who burn through +300 million tokens a day and still don't hit these limits):

  • 100 RPM
  • 5 parallel requests max per minute

There are NO token limits, nor usage limits like the private providers have

Reverse Proxy: In short, this web server lets us expose to the internet the endpoint that community members need in order to consume the models, while blocking access to all the endpoints that aren't essential for providing the inference service.

The diagram doesn't show three important security layers:

  1. Cloudflare: where we have strict SSL with the server and tunnels (this one isn't directly for inference, but I'm working on it so that soon the server won't be inference only, I'll tell you more later).
  2. VPN: access to all the dashboards, charts, and private endpoints is done over VPN. There's nothing more than the inference endpoint exposed to the internet.
  3. Hardening: like any other server, you have to apply the basic security rules (blocking password access, disabling root access, etc.) I'll skip this part to keep the post short, and also because there are 400 posts out there doing this for any Linux server.

Monitoring

The server is fully monitored, from the most basic in LiteLLM (chart of tokens burned over 5 days):

Chart of total tokens processed over 5 days

all the way to the status and usage of the GPU:

GPU Metrics

and even usage metrics by model, users, clients, etc.

Usage metrics by user, models, clients, etc.

Each dashboard has a lot more information, but the screenshots are just a small glimpse of everything that's being monitored. The stack we use is:

  • Prometheus (metrics)
  • Grafana (dashboards and alerts)
  • LiteLLM itself with its own metrics

We have alerts of all kinds to detect errors and act proactively.

Privacy

In LiteLLM we have any input/output tracking disabled beyond basic metrics to monitor the server's state (tokens/s, requests, most called models, etc.), but under NO circumstances is any kind of data recorded, neither on the server nor in logs. It's all pure inference, and the context lives only while you keep an active session. After that it's gone.

Since there's no data, nothing that happens on the server is used for training or any later processing. Which is something that does happen when you use private model clients or providers.

Final thoughts

As you can see, setting up an inference server is nothing out of this world. The "difficulty", however, lies in knowing how to spot errors. Whether it's at the "Model Serving" layer, which is usually the most common case with parsers and the like, or it's also very common for something specific to fail in the client/agent/IDE of the day and you have to track down a particular configuration to avoid these problems.

In the community, different members have been sharing their configurations, and we're trying to gather most of them in our documentation:

https://nan.builders/docs/examples

On Twitter , I post every day about the different tests I run and their results, the good ones and the bad ones alike. I'd be happy to hear your feedback or ideas about all of this.

Remember that we'll keep adding more and more servers to the community so we can all burn through tokens without fear of hitting limits. At NaN we're already +50 members, with +150 people on the waiting list. Things are about to get interesting!

Until next time! 🖖