Errors Encountered When Running Yolov10 Model with Rust Candle Framework on GPU

kingzcheung

Previously, I implemented the Yolov10 model from scratch and ran it successfully on the CPU. With CUDA acceleration enabled, however, the following error occurred:

DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")

By following the error stack trace, I eventually traced the issue to the topk function.

Specifically, Yolov10 uses the topk function twice in the v10postprocess module.

The first topk is primarily used to select detection results with the highest confidence:

max_scores, index = torch.topk(max_scores, max_det, dim=-1)

The second topk then selects again by score among the max_det prediction boxes already chosen:

scores, index = torch.topk(scores.flatten(1), max_det, dim=-1)

In the candle framework, the topk function is roughly implemented as follows:


pub trait TopKLastDimOp {
    /// Note: this implements torch.topk with sorted=True.
    fn topk(&self, topk: usize) -> Result<TopKOutput>;

    /// Note: this implements torch.topk with sorted=False.
    fn topk_unsorted(&self, topk: usize) -> Result<TopKOutput>;
}

impl TopKLastDimOp for Tensor {
    fn topk(&self, topk: usize) -> Result<TopKOutput> {
        // Sorted descending
        let sorted_indices = self.arg_sort_last_dim(false)?;
        let topk_indices = sorted_indices.narrow(D::Minus1, 0, topk)?.contiguous()?;
        Ok(TopKOutput {
            values: self.gather(&topk_indices, D::Minus1)?,
            indices: topk_indices,
        })
    }

    fn topk_unsorted(&self, topk: usize) -> Result<TopKOutput> {
        // Sorted descending
        let sorted_indices_all = self.arg_sort_last_dim(false)?;
        let topk_indices_sorted = sorted_indices_all
            .narrow(D::Minus1, 0, topk)?
            .contiguous()?;
        let topk_values_sorted = self.gather(&topk_indices_sorted, D::Minus1)?;

        // Reorder the indices ascending
        let reorder_indices = topk_indices_sorted.arg_sort_last_dim(true)?;
        let topk_indices_unsorted = topk_indices_sorted.gather(&reorder_indices, D::Minus1)?;
        let topk_values_unsorted = topk_values_sorted.gather(&reorder_indices, D::Minus1)?;
        Ok(TopKOutput {
            values: topk_values_unsorted,
            indices: topk_indices_unsorted,
        })
    }
}
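To make the argsort, narrow, gather sequence above concrete, here is a minimal sketch of the same logic on a single row in plain Rust, without candle. The function names arg_sort_desc, topk, and topk_unsorted are illustrative stand-ins operating on slices, not candle's API, and the tie-breaking behavior shown here is an assumption of this sketch rather than something candle guarantees.

```rust
/// Indices that sort `row` in descending order.
/// The sort is stable, so equal values keep their original relative order.
fn arg_sort_desc(row: &[f32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..row.len()).collect();
    // total_cmp defines a total order on f32, so NaN values cannot break the sort.
    idx.sort_by(|&a, &b| row[b].total_cmp(&row[a]));
    idx
}

/// Sorted top-k: argsort descending, keep the first k indices, then gather the values.
fn topk(row: &[f32], k: usize) -> (Vec<f32>, Vec<usize>) {
    let indices: Vec<usize> = arg_sort_desc(row).into_iter().take(k).collect();
    let values: Vec<f32> = indices.iter().map(|&i| row[i]).collect();
    (values, indices)
}

/// Unsorted top-k: the same k elements, reordered by ascending index
/// so that they appear in their original positional order.
fn topk_unsorted(row: &[f32], k: usize) -> (Vec<f32>, Vec<usize>) {
    let (_, mut indices) = topk(row, k);
    indices.sort_unstable();
    let values: Vec<f32> = indices.iter().map(|&i| row[i]).collect();
    (values, indices)
}
```

For the row [0.7, 0.1, 0.9, 0.4] and k = 2, topk returns indices [2, 0], while topk_unsorted returns indices [0, 2]. Both variants start with a full sort of the row, which is why a failure in the sort breaks both.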

The topk function itself is not the problem; the issue lies in the arg_sort_last_dim call inside it. In the first topk, the tensor being sorted (max_scores) has shape [1, 8400], so its last dimension holds 8400 elements. On CUDA, arg_sort_last_dim does not support a last dimension of this size.

The following test shows that on CUDA, arg_sort_last_dim already fails once the last dimension exceeds 1024 elements:

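use candle_core::{DType, Device, Tensor};

fn main() {
    // Falls back to the CPU when no CUDA device is available.
    let device = Device::cuda_if_available(0).unwrap();
    // 1025 elements: one more than the 1024 that can still be sorted.
    let a = Tensor::zeros(1025, DType::F32, &device).unwrap();
    // On CUDA this prints an Err instead of the sorted indices.
    dbg!(&a.arg_sort_last_dim(true));
}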
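The 1024 threshold in the test above coincides with CUDA's limit of 1024 threads per block. One plausible explanation, which is an assumption here and not something verified against candle's source, is that the sort kernel handles each row inside a single block, using one thread per element after padding the row length to a power of two. Under that assumption, the sketch below shows which row lengths would fit. The helper names padded_block_size and fits_in_one_block are hypothetical.

```rust
/// CUDA's upper bound on threads per block on current GPUs.
const MAX_THREADS_PER_BLOCK: usize = 1024;

/// Hypothetical helper: threads needed for a row of `ncols` elements,
/// ASSUMING one thread per element after padding to a power of two.
fn padded_block_size(ncols: usize) -> usize {
    ncols.next_power_of_two()
}

/// Hypothetical check: would a single-block launch for this row be valid?
fn fits_in_one_block(ncols: usize) -> bool {
    padded_block_size(ncols) <= MAX_THREADS_PER_BLOCK
}
```

With these assumptions, a row of 1024 elements fits, a row of 1025 would need 2048 threads, and the 8400-element row from Yolov10 would need 16384. A launch that exceeds the limit would be rejected by the driver as an invalid argument, which would be consistent with the CUDA_ERROR_INVALID_VALUE seen earlier.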

In fact, this issue with arg_sort_last_dim was reported quite a while ago, but it still has not been fixed.

So for now, the problem remains unsolved.
