In my classmate 张铭睿's SPH code, a prefix sum is computed by directly calling a built-in function, which apparently only runs on NVIDIA GPUs. I therefore wanted to write my own prefix-sum routine, but when using atomic_add and atomic_sub I ran into two problems:
- With parallel execution on either the CPU or the GPU, the computed prefix_sum is wrong, and the result differs from run to run.
- The parallel version takes far longer than the serial version, and still produces a wrong result.
On top of that, in Taichi's SPH library the counting_sort in particle_system also uses atomic_sub, so grid_ids_new comes out differently on every run, which in turn makes the final simulation results differ slightly each time (a rough sketch of what I mean follows).
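As I understand it, the scatter step of that counting sort looks roughly like the sketch below (assuming ti has been initialized as in the listing further down). The names num_particles, num_cells, grid_id, cell_offset_end, and sorted_idx are placeholders of my own, not the library's actual identifiers, and cell_offset_end is assumed to already hold each cell's end offset from a prefix sum over the per-cell counts:

num_particles = 8
num_cells = 4
grid_id = ti.field(ti.i32, shape=num_particles)       # cell index of each particle
cell_offset_end = ti.field(ti.i32, shape=num_cells)   # end offset of each cell (from a prefix sum)
sorted_idx = ti.field(ti.i32, shape=num_particles)    # particle indices reordered by cell

@ti.kernel
def counting_sort_scatter():
    for p in range(num_particles):
        c = grid_id[p]
        # atomic_sub returns the counter's value *before* the subtraction, so
        # which slot a particle gets inside its cell depends on the order in
        # which threads reach this line. That order is unspecified, which is
        # why grid_ids_new (and everything downstream) differs between runs.
        new_index = ti.atomic_sub(cell_offset_end[c], 1) - 1
        sorted_idx[new_index] = p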
How can I make every run produce the same, correct result? Below is a minimal reproduction; a rough sketch of a deterministic blocked scan is attached after the listing for comparison.
import taichi as ti
import time
import numpy as np
ti.init(arch=ti.cpu)
n = 102400
data = ti.field(dtype=ti.i32, shape=n)
prefix_sum = ti.field(dtype=ti.i64, shape=n)
@ti.kernel
def init_data():
    for i in range(n):
        data[i] = i + 1
@ti.kernel
def parallel_prefix_sum():
    # ti.loop_config(serialize=True)
    _prefix_sum_cur = ti.cast(0, ti.i64)
    for i in range(0, n):  # the outermost loop is auto-parallelized by Taichi
        # atomic_add returns the old value of _prefix_sum_cur, i.e. an exclusive scan
        prefix_sum[i] = ti.atomic_add(_prefix_sum_cur, data[i])
@ti.kernel
def serialize_prefix_sum():
    ti.loop_config(serialize=True)
    _prefix_sum_cur = ti.cast(0, ti.i64)
    for i in range(0, n):
        prefix_sum[i] = ti.atomic_add(_prefix_sum_cur, data[i])
init_data()
parallel_prefix_sum()
a = prefix_sum.to_numpy()
print(a[n-1])
# The print is different each time you run it. Why?
serialize_prefix_sum()
b = prefix_sum.to_numpy()
print(b[n-1]) # Correct: 5242828800
print(np.linalg.norm(a-b))
time1 = time.time()
for i in range(1000):
    serialize_prefix_sum()
time2 = time.time()
for i in range(1000):
    parallel_prefix_sum()
time3 = time.time()
print("para: ", time3-time2,"serial: ", time2-time1)
# CPU - para: 1.77 s, serial: 0.0655 s
# GPU - para: 4.17 s, serial: 0.0355 s
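For comparison, here is a rough sketch of a deterministic two-pass (blocked) exclusive scan written against the same data and prefix_sum fields as above. The block size and the helper field block_sums are my own additions, not anything from the SPH code. Because it uses no atomics, every element's position depends only on the data, not on thread scheduling, so the result should be identical on every run:

BLOCK = 1024                                   # elements per block (arbitrary choice)
num_blocks = (n + BLOCK - 1) // BLOCK
block_sums = ti.field(dtype=ti.i64, shape=num_blocks)

@ti.kernel
def scan_within_blocks():
    # Phase 1: blocks run in parallel; each block is scanned serially.
    for b in range(num_blocks):
        acc = ti.cast(0, ti.i64)
        for i in range(b * BLOCK, ti.min((b + 1) * BLOCK, n)):
            prefix_sum[i] = acc                # exclusive: sum of data before i, within the block
            acc += data[i]
        block_sums[b] = acc                    # total of this block

@ti.kernel
def scan_block_sums():
    # Phase 2: exclusive scan over the (few) block totals, done serially.
    acc = ti.cast(0, ti.i64)
    ti.loop_config(serialize=True)
    for b in range(num_blocks):
        t = block_sums[b]
        block_sums[b] = acc
        acc += t

@ti.kernel
def add_block_offsets():
    # Phase 3: add each block's starting offset to its elements, fully parallel.
    for i in range(n):
        prefix_sum[i] += block_sums[i // BLOCK]

scan_within_blocks()
scan_block_sums()
add_block_offsets()
print(prefix_sum.to_numpy()[n - 1])            # should always print 5242828800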