### Research that led to the problem
While running measurement experiments to evaluate performance on a GPU
(`NVIDIA Tesla T4`), we noticed that the GPU temperature affects the final
performance numbers.
Below is a demonstration of this effect using
[GPT-2](https://github.com/onnx/models/tree/main/text/machine_comprehension/gpt-2)
as an example.
GPU Temperature (°C) | Inference time (ms)
-- | --
~35 | 3.5713
~45 | 3.6958
~60 | 3.7856
~70 | 3.8195
~80 | 3.8904
_* This table is not meant to report exact values; it only illustrates the
general trend._
The table shows that when the GPU goes from a cold state before the run (~35°C)
to the state reached after long operation (~80°C), performance drops by ~9%
(the average inference time grows from ~3.57 ms to ~3.89 ms).
Our main hypothesis was that this behavior is caused by the GPU clock frequency
dropping as the GPU heats up.
Using the official NVIDIA tool (`nvidia-smi -q -d CLOCK`), we can find out the
maximum GPU frequency:
```Shell
Max Customer Boost Clocks
Graphics : 1590 MHz
```
Also, using the same tool (`nvidia-smi dmon`), we can trace the relationship
between the GPU temperature and its frequency (the third column, `gtemp`, and
the last column, `pclk`):
```Shell
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    69    63     -    89    56     0     0  5000  1545
    0    70    63     -    89    55     0     0  5000  1545
    0    69    64     -    89    55     0     0  5000  1530
    0    69    64     -    89    55     0     0  5000  1515
```
As the temperature rises from 35°C to 80°C, the GPU frequency drops from
1590 MHz to ~1430 MHz (~10%).
Based on this, we can conclude that the performance drop observed for `GPT-2`
is caused by the GPU frequency decreasing as the GPU heats up.
### Description of the problem
To reduce the effect of heating on performance, the GPU must be given time to
cool down whenever its frequency starts to drop because of the temperature. The
simplest approach is to sleep after an inference completes once the temperature
reaches a certain threshold.
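As an illustration, here is a minimal sketch of that naive cooldown approach
using the NVML C API; the 50°C threshold, the polling interval, and the
`run_inference()` placeholder are illustrative assumptions, not part of the
actual experiment:
```cpp
#include <nvml.h>   // link with -lnvidia-ml
#include <chrono>
#include <thread>

// Block until the GPU core temperature drops below `threshold_c`.
static void wait_until_cool(nvmlDevice_t dev, unsigned int threshold_c) {
  unsigned int temp = 0;
  while (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) ==
             NVML_SUCCESS &&
         temp > threshold_c) {
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }
}

int main() {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);
  for (int i = 0; i < 10; ++i) {
    // run_inference();          // hypothetical measurement step
    wait_until_cool(dev, 50);    // let the GPU cool down between runs
  }
  nvmlShutdown();
  return 0;
}
```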
However, this naive solution does not work, because TVM does not release all
GPU memory and does not leave the maximum-performance mode after inference has
completed. As a result, the GPU keeps heating up even while no inference is
running. The resources are released only when the main Python process in which
the measurement experiment was launched terminates.
### Expected behavior
After inference of the topology has finished and all objects that hold GPU
resources have been destructed, TVM releases all GPU resources and the GPU
returns to its low-power mode.
### Actual behavior
TVM does not release all GPU memory and does not leave the maximum-performance
mode after inference has completed. The resources are released only when the
main Python process in which the measurement experiment was launched
terminates.
### Primary investigations
It should be said right away that this is not a memory leak in TVM; it is a
consequence of how the CUDA Runtime API works: a CUDA Runtime API call on any
thread that requires an active context triggers the initialization of that
device's primary context.
The primary
[context](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context)
remains active until it is explicitly deinitialized with `cudaDeviceReset()`.
This function immediately deinitializes the primary context of the calling
thread's current device.
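A small standalone sketch (not TVM code) of this behavior: the first runtime
call implicitly creates the primary context, and only `cudaDeviceReset()`
releases it:
```cpp
#include <cuda_runtime.h>

int main() {
  void* buf = nullptr;
  // The first CUDA Runtime call implicitly initializes the primary context
  // of the current device; from this point on the process holds GPU memory
  // and is visible in nvidia-smi.
  cudaMalloc(&buf, 64 << 20);
  cudaFree(buf);
  // Even after cudaFree() the primary context stays alive, so the device
  // is not yet back in its initial state.
  cudaDeviceReset();  // explicitly deinitializes the primary context
  return 0;
}
```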
In view of this, to return the device to its initial state, `cudaDeviceReset()`
must be called after inference has finished. However, it seems that this cannot
be done automatically (inside TVM) because of the following problems that may
arise:
* In multi-threaded mode, calling this function resets the device for all
threads, so all GPU data owned by other threads is destroyed (see the sketch
after this list);
* Even in single-threaded applications, resetting the GPU can destroy data that
is still needed after the inference;
* There may also be problems with delayed destruction of the object whose
destructor would call this function: if the garbage collector destroys an old
GPU-using object only after a new GPU-using object has been created, resetting
the device from the first object's destructor also affects the second object.
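A contrived sketch of the first problem: a reset issued from one thread
invalidates allocations that belong to the whole process, including those
logically owned by other threads (the exact error code may differ):
```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

int main() {
  void* data = nullptr;
  cudaMalloc(&data, 1 << 20);  // allocation logically owned by "thread A"

  // "Thread B" decides its own work is done and resets the device.
  std::thread b([] { cudaDeviceReset(); });
  b.join();

  // The primary context was destroyed, so the old pointer is now invalid.
  cudaError_t err = cudaMemset(data, 0, 1 << 20);
  printf("cudaMemset after reset: %s\n", cudaGetErrorString(err));
  return 0;
}
```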
Based on the problems described above, we conclude that this call should not be
hidden inside TVM. However, it is possible to add an additional global function
(via `TVM_REGISTER_GLOBAL`) to `cuda_device_api.cc` that resets the GPU. This
function would be called only when the user/programmer explicitly requests it.
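A minimal sketch of what such a registration could look like; the
packed-function name `device_api.cuda.reset` is only an illustration, and
`CUDA_CALL` is the error-checking helper already used in TVM's CUDA runtime:
```cpp
// Sketch for src/runtime/cuda/cuda_device_api.cc; the global name is
// illustrative, not an existing TVM symbol.
TVM_REGISTER_GLOBAL("device_api.cuda.reset").set_body_typed([]() {
  CUDA_CALL(cudaDeviceReset());
});
```
The user would then invoke it explicitly from Python, e.g. via
`tvm.get_global_func("device_api.cuda.reset")()`, once all GPU-related objects
have been destroyed.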
The implement