Why is Taichi's PrefixSumExecutor slightly slower than a serial loop?

Parallel PrefixSumExecutor: 0.016s
Serialize PrefixSum: 0.0011s

import taichi as ti
ti.init(arch=ti.gpu)

num=120000
a=ti.field(int, shape=num)

@ti.kernel
def k():
    for i in range(num):
        a[i] = i

@ti.kernel
def prefix_sum(a: ti.template()):
    ti.loop_config(serialize=True)
    for i in range(1,num):
        a[i]=a[i]+a[i-1]

k()
pse = ti.algorithms.PrefixSumExecutor(num)
pse.run(a)
k()
prefix_sum(a)


import time
k()
start = time.time()
s=0
while s<1e2:
    pse.run(a)
    s+=1
end = time.time()
print(end-start)

k()
start = time.time()
s=0
while s<1e2:
    prefix_sum(a)
    s+=1
end = time.time()
print(end-start)

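The problem is in the timing code of the comparison. After the prefix_sum kernel is launched there is no sync, so execution runs straight on to end = time.time(). As a result the measurement only captures the kernel launch time, not the actual run time. Adding a ti.sync() before end fixes it.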
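For reference, here is a minimal sketch of a timing helper that applies this fix. The name time_synced and its parameters are made up for illustration and are not part of Taichi; it also uses time.perf_counter rather than time.time, which is just a personal preference for measuring intervals. The helper takes the sync function as an argument, so with the code above you would pass ti.sync.

```python
import time


def time_synced(fn, sync, repeats=100):
    """Time `repeats` calls of fn, waiting for queued work before reading the clock.

    Hypothetical helper for illustration only, not a Taichi API.
    `sync` is expected to block until all launched kernels have finished
    (for Taichi, pass ti.sync).
    """
    sync()  # drain work queued earlier so it is not charged to this measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn()  # may return as soon as the kernel is launched
    sync()  # wait for the launched kernels to actually finish
    return time.perf_counter() - start


# Usage with the code from the question (assumes ti, k, pse, prefix_sum, a exist):
#   k()
#   print(time_synced(lambda: pse.run(a), ti.sync))
#   k()
#   print(time_synced(lambda: prefix_sum(a), ti.sync))
```

The sync before the start timestamp matters too: otherwise the k() call that resets the field may still be running and gets counted in the first measurement.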
