Using a BVH in Taichi to detect mesh clipping in character animation

johnsonlai · April 27, 2022, 12:58

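I am trying to use Taichi to accelerate the following problem:
for each character in a batch, detect clipping (mesh interpenetration) between the hands and the other body parts, where body parts are distinguished by joint.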

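Can a Taichi kernel parallelize only the outermost for-loop of a computation? (According to the documentation, a kernel cannot call another kernel.)
In the pseudocode below, a call to detect_batch_collision goes through several levels of for-loops, and in principle every level could benefit from parallelization (for example, the inner node_to_node_overlap_test could be parallelized too).
If Taichi cannot parallelize multiple levels of for-loops, are there other tools or approaches you would suggest (CUDA, a different way of writing the code, etc.)?
Even if only the outermost loop is parallelized, do I need to convert all the computation code called from the kernel into Taichi functions to keep it fast (that is, to reduce CPU-GPU overhead)?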

Code:

def detect_batch_collision(batch_character_bvh):
    for hand_bvh, body_bvh in batch_character_bvh:
        detect_character_collision(hand_bvh, body_bvh)


def detect_character_collision(hand_bvh, body_bvh):
    for body_part_bvh in body_bvh:
        detect_bvh_node_collision(hand_bvh, body_part_bvh)


def detect_bvh_node_collision(b1, b2):
    # TODO: use stack to replace recursion
    if not aabb_to_aabb_overlap_test(b1.aabb, b2.aabb):
        return False

    if b1.is_leaf and b2.is_leaf:
        return node_to_node_overlap_test(b1, b2)
    elif b1.is_leaf:
        return detect_bvh_node_collision(b1, b2.left) or \
               detect_bvh_node_collision(b1, b2.right)
    elif b2.is_leaf:
        return detect_bvh_node_collision(b1.left, b2) or \
               detect_bvh_node_collision(b1.right, b2)
    else:
        for b1_child in (b1.left, b1.right):
            for b2_child in (b2.left, b2.right):
                if detect_bvh_node_collision(b1_child, b2_child):
                    return True

    return False


def node_to_node_overlap_test(a, b):
    for a_tri in a.triangles:
        for b_tri in b.triangles:
            if tri_tri_overlap_test_3d(a_tri, b_tri):
                return True

    return False

YuPeng · April 28, 2022, 04:59

Hi @johnsonlai, welcome to the Taichi forum.

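1. Yes, inside a ti.kernel only the outermost for-loop is parallelized. A ti.kernel cannot call another ti.kernel, but it can call a ti.func.
2. My suggestion is to check whether you can move every parallelizable for-loop to the outermost level. For example,

for i in range(10):
    for j in range(20):
        ...

can be converted to:

for i, j in ti.ndrange(10, 20):
    ...

3. Yes, functions called from a kernel need to be converted to ti.func.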
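
To make the loop-flattening suggestion concrete, here is a minimal plain-Python sketch (no Taichi required) of the idea behind it. It only illustrates that the nested 10 x 20 loop and a single loop over index pairs cover the same iteration space; the names N_I, N_J, nested, flat, and manual are made up for this illustration. In Taichi itself the flat loop would be written as the outermost loop of a kernel, and iterations would not run in any guaranteed order.

```python
from itertools import product

N_I, N_J = 10, 20  # loop bounds taken from the example in the reply

# Nested form: if this were the outermost loop of a kernel, only the
# `i` loop would be a candidate for parallel execution.
nested = [(i, j) for i in range(N_I) for j in range(N_J)]

# Flattened form: one loop over all N_I * N_J index pairs, i.e. the
# iteration space that `for i, j in ti.ndrange(10, 20)` describes.
flat = list(product(range(N_I), range(N_J)))

# Equivalent manual decomposition of a single flat index k into (i, j).
manual = [divmod(k, N_J) for k in range(N_I * N_J)]

# All three enumerate exactly the same (i, j) pairs.
assert nested == flat == manual
print(len(flat))  # 200 independent iterations
```

The point of the flat form is that every (i, j) pair becomes its own unit of work, so the parallel width is 200 instead of 10.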

johnsonlai · April 29, 2022, 02:40

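Thanks for the reply, @YuPeng. There are five loops involved here. With some effort, the first two can probably be merged. detect_bvh_node_collision actually contains an update that follows the tree hierarchy, so it cannot be fully parallelized. But I would also like the two loops in node_to_node_overlap_test to run in parallel.
I see that CUDA supports kernels launching kernels. Why was Taichi not designed this way? Is there any chance it will be supported in the future?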

YuPeng · April 29, 2022, 02:56

CUDA also distinguishes between __global__ functions and __device__ functions: cuda - Difference between global and device functions - Stack Overflow

Why can't you turn the functions you want to call from the kernel into ti.func functions?

johnsonlai · April 29, 2022, 03:18

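Maybe I did not describe this clearly. All the functions are already set up as ti.func. The issue is that only the outermost for-loop of the kernel (the kernel function is detect_batch_collision) can be parallelized. In this example, the two innermost loops (in node_to_node_overlap_test) cannot be hoisted to the outermost level together with the others, and parallelizing them as well would greatly improve performance. So my question is: with a kernel-calls-kernel approach (even though it is not currently available), would the inner loops be parallelized?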
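
One possible way to picture parallelizing the triangle-pair loops without nested kernels is a two-pass layout: a first pass collects the overlapping leaf pairs, and a second pass runs one flat loop over every triangle pair of every collected leaf pair. The plain-Python sketch below shows only the index bookkeeping for that second pass. It is an illustration under assumptions, not something proposed in this thread: build_offsets, decode, flat_overlap, and tri_test are hypothetical names, and tri_test stands in for tri_tri_overlap_test_3d.

```python
from bisect import bisect_right


def build_offsets(leaf_pairs):
    """Prefix sums of per-pair work: offsets[p] is the first flat index of pair p."""
    offsets = [0]
    for tris_a, tris_b in leaf_pairs:
        offsets.append(offsets[-1] + len(tris_a) * len(tris_b))
    return offsets


def decode(k, offsets, leaf_pairs):
    """Map flat work index k to (pair index, index into tris_a, index into tris_b)."""
    p = bisect_right(offsets, k) - 1  # last pair whose first index is <= k
    _, tris_b = leaf_pairs[p]
    ia, ib = divmod(k - offsets[p], len(tris_b))
    return p, ia, ib


def flat_overlap(leaf_pairs, tri_test):
    """Return one hit flag per leaf pair using a single flat loop."""
    offsets = build_offsets(leaf_pairs)
    hits = [False] * len(leaf_pairs)
    # This single loop is the candidate for an outermost parallel loop:
    # each k is an independent triangle-vs-triangle test.
    for k in range(offsets[-1]):
        p, ia, ib = decode(k, offsets, leaf_pairs)
        if tri_test(leaf_pairs[p][0][ia], leaf_pairs[p][1][ib]):
            hits[p] = True
    return hits


# Toy data: integers stand in for triangles, equality stands in for overlap.
pairs = [([1, 2, 3], [4, 5]), ([], [7]), ([8, 9], [9, 10])]
result = flat_overlap(pairs, lambda a, b: a == b)
print(result)
```

The trade-off is that the early exit of the original nested loops is lost: every triangle pair is tested, in exchange for a loop whose iterations are all independent.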

johnsonlai · April 29, 2022, 03:29

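Since I am not very familiar with either CUDA or Taichi, I may not have described the problem clearly. What I want is something like CUDA Dynamic Parallelism. Here is a passage from this blog that describes my need: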

Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This includes, but is not limited to, the following classes of algorithms:

algorithms using hierarchical data structures, such as adaptive grids;

algorithms using recursion, where each level of recursion has parallelism, such as quicksort;

algorithms where work is naturally split into independent batches, where each batch involves complex parallel processing but cannot fully use a single GPU.

I would like to ask whether Taichi will support this feature in the future.

YuPeng · April 29, 2022, 08:13

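Taichi does not support these features at the moment. As for recursion, real functions need to be implemented first (this is under development); currently, a ti.func in Taichi is an inline function.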
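
Because inlined functions cannot recurse, the recursive detect_bvh_node_collision from the first post has to be rewritten with an explicit stack, as its TODO comment already notes. Below is a minimal plain-Python sketch of that rewrite. The Node class, the AABB tuple layout, and the leaf_test parameter (a stand-in for node_to_node_overlap_test) are assumptions made for this illustration. A real Taichi kernel would likely need a fixed-capacity array of node indices instead of a Python list; that part is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout: (min_x, min_y, min_z, max_x, max_y, max_z)
AABB = Tuple[float, float, float, float, float, float]


@dataclass
class Node:  # hypothetical stand-in for the BVH node in the pseudocode
    aabb: AABB
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def aabb_overlap(a: AABB, b: AABB) -> bool:
    # Two boxes overlap when their intervals overlap on every axis.
    return all(a[k] <= b[k + 3] and b[k] <= a[k + 3] for k in range(3))


def detect_bvh_node_collision(b1: Node, b2: Node, leaf_test=lambda a, b: True) -> bool:
    stack = [(b1, b2)]  # pending node pairs, replaces the call stack
    while stack:
        n1, n2 = stack.pop()
        if not aabb_overlap(n1.aabb, n2.aabb):
            continue  # prune this pair and everything below it
        if n1.is_leaf and n2.is_leaf:
            if leaf_test(n1, n2):
                return True
            continue
        # Descend into whichever side is not a leaf (both, if neither is).
        c1 = (n1,) if n1.is_leaf else (n1.left, n1.right)
        c2 = (n2,) if n2.is_leaf else (n2.left, n2.right)
        stack.extend((x, y) for x in c1 for y in c2)
    return False


# Toy trees: two leaves under one root, tested against single leaves.
leaf_a = Node((0, 0, 0, 1, 1, 1))
leaf_b = Node((2, 2, 2, 3, 3, 3))
root = Node((0, 0, 0, 3, 3, 3), leaf_a, leaf_b)
near = Node((0.5, 0.5, 0.5, 1.5, 1.5, 1.5))
far = Node((5, 5, 5, 6, 6, 6))
print(detect_bvh_node_collision(root, near), detect_bvh_node_collision(root, far))
```

The default leaf_test treats any pair of overlapping leaf boxes as a hit; passing a real triangle-level test restores the behavior of the original pseudocode.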

As for the BVH, I recommend taking a look at another BVH implementation to see whether it helps you.

johnsonlai · April 29, 2022, 08:52

OK, understood. Thank you.