Hardware Resource Guide

For Userful On-Premise

Introduction

Userful on-premise servers are versatile, and their flexibility is unmatched by any other solution on the market. Because of this flexibility, you need to know the capabilities of your system during project planning. Server deployments can be tailored to fit your setup and requirements. Userful specialists can help you determine the level and number of Userful servers you need for your project.

  • All servers deployed on-premises must have adequate cooling and accessibility to maintain expected behaviour. The average operating temperature of any server must not exceed 80 degrees Celsius; higher temperatures result in automatic shutdowns and degraded performance. Userful recommends following data-center best practices for deployment and operation. Please speak with your Sales Engineer during deployment if you have further questions.

  • Server capabilities cannot be defined by listing specifications because they are heavily content-dependent. A single server may support more than 100 screens, or the same server may be overwhelmed by multiple sources running on a single screen.

  • Please consult Userful specialists before implementation, as software updates significantly improve the application's functionality and help remove limitations.

Certified Systems

The information here can be interpreted in many different ways and translated into all kinds of server and computer-building ideas. If you plan to go your own way and order a Userful server from your own channels, the Self-Sourcing a Userful Server article is a great place to start.

GPUs

Userful has used relatively inexpensive NVIDIA GTX consumer GPUs in conjunction with Zero Client devices. Since uClients are expected to gradually replace Zero Clients, this guide omits non-Quadro GPUs to simplify planning and for brevity.

Certified GPU models Userful uses as a part of its solution are as follows:

  • RTX 4000

  • RTX 5000

  • RTX 6000

  • RTX A6000

  • RTX A5000

  • RTX A4000

As of Userful 10.6, only one NVIDIA GPU can be used in a Userful system, although there are plans for an architectural change in the near future.

Ingestion, Processing, and Output

Resource usage must be considered in relation to the three main tasks that a Userful server must perform.

  • Ingesting: Content must be sent to the GPU for processing, either directly from a GPU decoder (which receives and decodes video content) or from an application running as an interactive source. This is the most technically complex part of the process.

  • Processing: The content is transformed and scaled from its raw state, or the data stream is changed to fit the screen. Userful uses NVIDIA CUDA for this.

  • Encoding: The transformed data is encoded and sent to the display.

Data Flow Diagram

This is a graphical representation of the Userful server data flow. Content is taken from various parts of the system, passed through the PCIe bus, transformed on the NVIDIA GPU, and then encoded and sent to the endpoint. Each source type has its own path and a different impact on resources.

Video Files, Streams, and the GPU Decoder

A direct source is a video file or stream: media content that is decoded directly by the GPU decoder.

All supported GPU decoders can handle an 8K resolution workload at 60 frames per second. You can think of this 8K60 limit as a canvas that is divided among streams according to their resolution and frame rate. The canvas is meant to illustrate the available bandwidth, and the possible combinations are not limited to the example shown.
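The canvas idea can be sketched as simple pixel-throughput arithmetic. This is a minimal sketch, assuming that decoder capacity scales linearly with pixels per second; the function names `canvas_share` and `max_streams` are hypothetical and are not part of any Userful tool.

```python
# Assumed model: the decoder budget is the pixel throughput of one 8K60 stream.
CANVAS_PIXELS_PER_SECOND = 7680 * 4320 * 60


def canvas_share(width, height, fps):
    """Fraction of the 8K60 canvas that one stream would occupy."""
    return (width * height * fps) / CANVAS_PIXELS_PER_SECOND


def max_streams(width, height, fps):
    """Number of identical streams that fit in the canvas under this model."""
    return int(CANVAS_PIXELS_PER_SECOND // (width * height * fps))


print(canvas_share(3840, 2160, 60))  # one 4K60 stream uses a quarter of the canvas
print(max_streams(1920, 1080, 60))   # 1080p60 streams that fit
print(max_streams(1920, 1080, 30))   # 1080p30 streams that fit
```

Real decoder throughput also depends on codec and bitrate, so treat these numbers as a planning aid rather than a guarantee.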

However, this model has the following three limitations:

  • The first limitation is that Userful is configured by default to use hardware decoding for only four direct sources at once. So even if your only direct sources are five 1080p30 streams, the fifth will be routed to the CPU instead of the GPU. This can be changed by a Userful engineer in a configuration file on the system.

  • The second limitation is load, and the simple reality is that if you design a system to run at 100% capacity 24/7, there is no room for flexibility or growth.

  • The third limitation is the file format. NVIDIA's hardware decoder supports only a limited number of formats. Of all the major components, the GPU decoder is the main limitation for direct sources. However, Userful's architecture remains very flexible as long as you don't exceed the GPU's capabilities.
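The default four-source limit described above can be illustrated with a short sketch. It assumes that streams are assigned to the hardware decoder in arrival order, which this guide does not confirm; `assign_decoders` and `DEFAULT_HW_DECODE_LIMIT` are hypothetical names used only for illustration.

```python
# Default from this guide; a Userful engineer can change it in a configuration file.
DEFAULT_HW_DECODE_LIMIT = 4


def assign_decoders(streams, hw_limit=DEFAULT_HW_DECODE_LIMIT):
    """Split streams into (hardware, software) lists.

    Assumption: streams are served in arrival order, and everything past
    the hardware limit falls back to CPU (software) decoding.
    """
    hardware = list(streams[:hw_limit])
    software = list(streams[hw_limit:])
    return hardware, software


streams = ["1080p30"] * 5
hardware, software = assign_decoders(streams)
print(len(hardware), "hardware-decoded;", len(software), "software-decoded")
```

With five 1080p30 streams, the fifth lands on the CPU even though the GPU canvas is far from full, which is why the limit is worth checking during planning.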

Overload Conditions

If the above limits are exceeded, the Userful application attempts to decode all remaining streams with software decoding, which is a tactful way of saying that it uses the CPU to decode the video. This is undesirable because the CPU should be reserved for system operations and interactive sources. In addition, software decoding is less efficient than GPU decoding and cannot match its quality or latency, and overloading the CPU can make the system unresponsive and potentially unstable.

Exceptions

Blackmagic HDMI Capture sources do not need to be decoded; they are passed directly from the Blackmagic Capture driver to the GPU. Using Forward and Store with Signage Player also eliminates the need for live decoding. However, this only applies to Signage Player, and this delivery model is fundamentally incompatible with live streaming content.

Web Browsers and Applications

Interactive sources are instances of applications that run directly from the Userful server. They are powered by the server's operating system and do not consume decoder resources.

Interactive sources consume CPU, RAM, and PCI Express bandwidth.

CPU and RAM: Userful servers are equipped with powerful i7 and Xeon processors and 32GB or 64GB of user-expandable RAM to handle multiple instances of most applications (web browsers, VNC, and RDP clients) under normal circumstances. The exact number of these sources is largely limited by the applications themselves; a web browser displaying a KPI dashboard uses far fewer CPU cycles than a browser playing a YouTube video. For this reason, we do not recommend using a web browser or any other interactive source to play video. Unlike web browsers running on desktop PCs, these sources do not have GPU acceleration available. The only exception is Session Acceleration, which can only accelerate one session at a time.

There is no fixed or inherent limit to the number of interactive sources that can run on a Userful server. What you do need to be aware of are the limitations of PCI Express and the encoder.

PCI Express Upload to the GPU

Frames from the interactive source are sent to the GPU for processing and encoding. The amount of data passing over the PCI Express bus is calculated using the following formula:

Bandwidth (bytes per second) = Width * Height * bpp * FPS / 8

Width: Resolution width

Height: Resolution height

bpp: Bits Per Pixel (color depth)

FPS: Frames Per Second

/8: 8 bits to a Byte

For example, a Web Browser source running at 4K 30FPS has the following specifications:

Width=3840, Height=2160, bpp=32, FPS=30.

Bandwidth = 3840 * 2160 * 32 * 30 / 8 = 995,328,000 bytes per second, i.e., about 1GB/s.

In other words, for a Userful system with current specifications (as of February 2021), a GPU running at full PCIe 3.0 x16 bandwidth can support about 12 of these sources. In future systems with PCI Express 4.0 and 5.0 motherboards and GPUs, this limit will increase.
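The bandwidth formula above translates directly into code. This is a minimal sketch; the function names are hypothetical, and the 12 GB/s usable budget in the usage lines is an assumed practical figure chosen to match the guide's estimate of about 12 sources, not a measured value.

```python
def upload_bandwidth_bytes(width, height, bpp, fps):
    """Bytes per second sent over PCI Express for one interactive source."""
    return width * height * bpp * fps / 8


def sources_supported(usable_bytes_per_second, width, height, bpp, fps):
    """Whole number of identical sources that fit in a given PCIe budget."""
    per_source = upload_bandwidth_bytes(width, height, bpp, fps)
    return int(usable_bytes_per_second // per_source)


# 4K 30FPS web browser source at 32 bits per pixel
print(upload_bandwidth_bytes(3840, 2160, 32, 30))  # 995328000.0

# Assumed usable budget of 12 GB/s on a PCIe 3.0 x16 link (hypothetical figure)
print(sources_supported(12e9, 3840, 2160, 32, 30))  # 12
```

Passing the usable budget as a parameter keeps the sketch valid for PCI Express 4.0 and 5.0 systems, where only that one number changes.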

Different sources vary in their color depth. These are as follows:

Here is a handy fact sheet with the most commonly used and recommended values for interactive sources.

At this point, you may be wondering why we haven't applied this calculation to direct sources and decoders. The reason is that it is not applicable to them: the bandwidth used to load media resources onto the GPU is negligible compared to system operations.

CUDA Processing

When content is decoded or loaded, it temporarily resides in the GPU's VRAM. It doesn't take up much space, but this is where the real magic of Userful happens. We use the NVIDIA CUDA platform to run a series of tasks on your content. In order of execution, these are:

  • Color Transformation: The content is converted to the BGRA color space.

  • Compositing: If you're using multi-window, picture-in-picture, or Command & Control, this is where all the stitching, scaling, and compositing happens. These are the most computationally intensive operations.

  • Cropping and scaling: If you are using a zero client, this is where the content is scaled up or down to fit the required workspace.

  • JPEG encoding: If you're using a zero client, the content is directly encoded and loaded as JPEGs.

It is very difficult to specify in writing how much of the CUDA compute engine these tasks consume, because each application has a different impact on these metrics. CUDA performance is the main difference between the certified cards, which is why higher tiers of Userful servers are equipped with higher-performance GPUs with a proportionally larger number of CUDA cores (all of the listed cards have identical encoding/decoding capabilities and PCI Express buses).

You may have noticed that the Quadro RTX 8000 is not included in the list of certified GPUs. This GPU has significantly more VRAM than the RTX 6000, but is otherwise the same in CUDA cores and encoding/decoding capabilities. Real-time operations do not need that much VRAM.

Encoding

Once the content has been loaded and processed, the next step is to send it to the client for display. There are two ways to do this, depending on the type of client you are using.

Zero Clients

The Zero client receives the content as a JPEG file. Converting the content to a JPEG file is not computationally expensive, so this step is done in the CUDA processing stage. All that remains is to send the data from the GPU to the zero client driver and then to the Zero client via the network card.

The network bandwidth this process consumes is highly dependent on the content being transferred. For example, video at 60 frames per second can generate more than 100 Mbps of traffic for a single Zero Client. On the other hand, static images and slideshows that are updated only a few times per minute generate little traffic. That's why the Zero Client is equipped with a Gigabit Ethernet port, and why the network design requirements for the Zero Client are so stringent.

In addition, Zero Clients have more stringent latency requirements than uClients because they communicate with the server in both directions.

uClients

uClients are a more traditional video endpoint. They receive traffic via a dedicated RTSP stream that carries video in H.264 or H.265 format. The creation of this stream is done by a GPU encoder.

The GPU encoder has the same capacity as the decoder, so the same 8K60 canvas described for ingestion applies. What is different is that all sources draw on this 8K60 canvas, including direct video, web browsers, and HDMI captures. Each video stream that is sent consumes a portion of the encoder.

Why highlight the streams? Because encoder usage is per stream, not per screen. For example, if you are playing a 4K60 video on a 3x3 video wall, it will only consume the resources needed to play a 4K60 video, not a 5K60 video. The scaling for the wall is done on the uClient side. This means that it is theoretically possible to power three identical walls, or to play the same 4K60 video on a 4x4 or 5x5 wall.

As another example, playing a 1080p60 video on a single display consumes the expected amount of encoder resources. However, playing the same video on 16 Mirror Group displays would fill up the encoder because each stream would need to be generated independently.

We are working on a solution to this issue for a future release.
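The per-stream accounting in the examples above can be checked with a short calculation. This is a minimal sketch, assuming the encoder budget is the pixel throughput of one 8K60 stream and that load adds up linearly across streams; `encoder_load` is a hypothetical helper, not a Userful API.

```python
# Assumed model: the encoder budget equals the pixel rate of one 8K60 stream.
ENCODER_CANVAS = 7680 * 4320 * 60


def encoder_load(streams):
    """Fraction of the encoder used by a list of (width, height, fps) streams.

    One entry per encoded stream, not per screen: a video wall driven by a
    single stream counts once, while each mirrored display counts separately.
    """
    used = sum(width * height * fps for width, height, fps in streams)
    return used / ENCODER_CANVAS


# One 4K60 stream feeding a 3x3 wall (scaling happens on the uClients)
print(encoder_load([(3840, 2160, 60)]))        # 0.25

# The same 1080p60 video mirrored to 16 displays, one stream each
print(encoder_load([(1920, 1080, 60)] * 16))   # 1.0
```

Under this model, the 16 mirrored 1080p60 streams consume exactly the whole canvas, which matches the guide's statement that such a Mirror Group fills the encoder.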

Transmission

When the content is ready to be transmitted, it will be sent as a standard RTSP video stream. The bandwidth of this stream depends on the options set in Zone Settings.

When the Optimized for Video option is selected, Zero Client JPEG files are compressed to 87% (more compression, less detail) with 4:2:0 color. In this configuration, the Zero Client can generate up to 128 Mbps of traffic.

When the Optimized for Text option is selected, the JPEG file is compressed to 92% (less compression, more detail) with the full 4:4:4 color space. In this configuration, the Zero Client can generate up to 480 Mbps of traffic (not recommended unless used with very low FPS applications).

Note that the Optimized for Video and Optimized for Text settings do not affect uClients. A uClient generates about 12Mbps of traffic for a 1080p60 video stream and up to 40Mbps for a 4K60 stream, so plan and monitor your network bandwidth accordingly.
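The traffic figures in this section can be combined into a rough worst-case network estimate. This is a minimal sketch that uses only the numbers quoted in this guide; the dictionary keys and the function name `peak_network_mbps` are hypothetical, and real traffic depends heavily on content.

```python
# Peak per-endpoint figures quoted in this guide, in Mbps.
ZERO_CLIENT_PEAK_MBPS = {"video": 128, "text": 480}
UCLIENT_STREAM_MBPS = {"1080p60": 12, "4k60": 40}


def peak_network_mbps(zero_clients, zero_client_mode, uclient_streams):
    """Worst-case aggregate traffic for a mix of Zero Clients and uClients.

    zero_client_mode: "video" (Optimized for Video) or "text" (Optimized for Text).
    uclient_streams: list of stream labels, e.g. ["1080p60", "4k60"].
    """
    zero_total = zero_clients * ZERO_CLIENT_PEAK_MBPS[zero_client_mode]
    uclient_total = sum(UCLIENT_STREAM_MBPS[label] for label in uclient_streams)
    return zero_total + uclient_total


# Four Zero Clients in Optimized for Video plus two uClient streams
print(peak_network_mbps(4, "video", ["1080p60", "4k60"]))  # 564
```

Two Zero Clients in Optimized for Text mode already approach the capacity of a single Gigabit link under this estimate, which illustrates why that mode is recommended only for very low FPS applications.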

Tips for Managing Resources

There are several ways to optimize Userful server performance.

Use Monitoring to track system resource usage and establish a baseline. This is necessary to see what is happening on the server at any given time, and whether any of the various components (CPU, RAM, GPU, PCI Express, network, etc.) are reaching capacity.

By default, the Userful server limits video encoding to 30FPS. You can manually enable 60FPS in Performance Settings in the Control Center. Please note that this doubles the load on the encoding side, so check Monitoring before enabling this option.

The Web Browser source has a built-in frame rate control feature. For dashboards and metrics that are updated very infrequently, you can set the FPS down to 0.1, which can save a lot of resources.
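To see how much a lower frame rate saves, the PCI Express formula from earlier in this guide can be evaluated at different FPS values. This is a minimal sketch; the function name is hypothetical, and it only models upload bandwidth, not CPU or encoder savings.

```python
def browser_upload_bytes_per_second(width, height, bpp, fps):
    """PCI Express upload bandwidth for one Web Browser source (bytes per second)."""
    return width * height * bpp * fps / 8


full_rate = browser_upload_bytes_per_second(3840, 2160, 32, 30)
dashboard = browser_upload_bytes_per_second(3840, 2160, 32, 0.1)

print(f"30 FPS:  {full_rate / 1e6:.1f} MB/s")
print(f"0.1 FPS: {dashboard / 1e6:.1f} MB/s")
print(f"Reduction factor: {full_rate / dashboard:.0f}x")
```

For a 4K dashboard at 32 bits per pixel, dropping from 30 FPS to 0.1 FPS reduces the upload from roughly 1 GB/s to about 3.3 MB/s.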

Conclusion

This guide has covered the basics of how Userful On-Premise captures, processes, and distributes content. These tips are intended to help Userful partners and customers plan a successful implementation.

We strongly recommend consulting with a Userful account manager or sales engineer when planning your implementation. Our experience and expertise are always available to our customers as a second set of eyes.
