The core code is as follows:
import time
import taichi as ti

# Excerpt: temp, b, ca, cl, cm1, s, seismo_w, seismo_u, nzbc, nxbc, ng and dtx
# are defined elsewhere in the script.
@ti.kernel
def fd2d(it: int, ww: ti.template(), uu: ti.template(),
         fux: ti.template(), fuz: ti.template(), bwx: ti.template(), bwz: ti.template(),
         xx: ti.template(), zz: ti.template(), xz: ti.template()):
    # Update the wavefield components uu and ww from xx, zz and xz.
    for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        # print(x,y)
        uu[x, y] = temp[x, y] * uu[x, y] + b[x, y] * (xx[x, y + 1] - xx[x, y] + xz[x, y] - xz[x - 1, y])
        ww[x, y] = temp[x, y] * ww[x, y] + b[x, y] * (xz[x, y] - xz[x, y - 1] + zz[x + 1, y] - zz[x, y])
    # Write the source sample for time step it into ww.
    ww[2, 48] = s[it, 0]
    # Take differences of uu and ww, then update xx, zz and xz.
    for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        fux[x, y] = uu[x, y] - uu[x, y - 1]
        fuz[x, y] = uu[x + 1, y] - uu[x, y]
        bwx[x, y] = ww[x, y + 1] - ww[x, y]
        bwz[x, y] = ww[x, y] - ww[x - 1, y]
        # for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        xx[x, y] = temp[x, y] * xx[x, y] + (ca[x, y] * fux[x, y] + cl[x, y] * bwz[x, y]) * dtx
        zz[x, y] = temp[x, y] * zz[x, y] + (ca[x, y] * bwz[x, y] + cl[x, y] * fux[x, y]) * dtx
        xz[x, y] = temp[x, y] * xz[x, y] + (cm1[x, y] * (fuz[x, y] + bwx[x, y])) * dtx
    # Zero zz along row 1.
    for y in ti.ndrange((0, nxbc)):
        zz[1, y] = 0
    # Record ww and uu at ng positions for this time step.
    for x in ti.ndrange((0, ng)):
        seismo_w[it, x] = ww[2, 40 + x * 2]
        seismo_u[it, x] = uu[2, 40 + x * 2]
ti.profiler.clear_kernel_profiler_info()
T1 = time.time()
for it in range(1000):
    # print(it)
    fd2d(it, ww, uu, fux, fuz, bwx, bwz, xx, zz, xz)
    # if it%500==0:
    #     imagesc(ww.to_torch())
T2 = time.time()
print(T2 - T1)
ti.profiler.print_kernel_profiler_info()
Output (the first line is the wall-clock time in seconds from print(T2 - T1)):
0.4355292320251465
Kernel Profiler(count, default) @ CUDA on NVIDIA GeForce GTX 1060
[      %     total   count |      min       avg       max   ] Kernel name
[ 30.70%   0.009 s   1000x |    0.008     0.009     0.040 ms] fd2d_c80_0_kernel_4_range_for
[ 25.89%   0.008 s   1000x |    0.006     0.008     0.026 ms] fd2d_c80_0_kernel_0_range_for
[ 14.28%   0.004 s   1000x |    0.003     0.004     0.022 ms] fd2d_c80_0_kernel_3_range_for
[ 14.16%   0.004 s   1000x |    0.003     0.004     0.022 ms] fd2d_c80_0_kernel_1_serial
[ 12.86%   0.004 s   1000x |    0.003     0.004     0.022 ms] fd2d_c80_0_kernel_2_range_for
[  0.67%   0.000 s     40x |    0.004     0.005     0.016 ms] jit_evaluator_2_kernel_0_serial
[  0.53%   0.000 s     34x |    0.004     0.005     0.008 ms] jit_evaluator_3_kernel_0_serial
[  0.13%   0.000 s      2x |    0.020     0.020     0.021 ms] jit_evaluator_7_kernel_0_serial
[  0.12%   0.000 s      1x |    0.036     0.036     0.036 ms] jit_evaluator_14_kernel_0_serial
[  0.11%   0.000 s      3x |    0.005     0.011     0.017 ms] jit_evaluator_20_kernel_0_serial
[  0.10%   0.000 s      3x |    0.005     0.010     0.016 ms] jit_evaluator_19_kernel_0_serial
[  0.08%   0.000 s      3x |    0.005     0.008     0.011 ms] jit_evaluator_21_kernel_0_serial
[  0.08%   0.000 s      4x |    0.004     0.006     0.008 ms] jit_evaluator_9_kernel_0_serial
[  0.07%   0.000 s      4x |    0.004     0.005     0.007 ms] jit_evaluator_1_kernel_0_serial
[  0.05%   0.000 s      4x |    0.003     0.004     0.005 ms] jit_evaluator_0_kernel_0_serial
[  0.03%   0.000 s      2x |    0.004     0.005     0.005 ms] jit_evaluator_4_kernel_0_serial
[  0.03%   0.000 s      1x |    0.009     0.009     0.009 ms] jit_evaluator_12_kernel_0_serial
[  0.03%   0.000 s      1x |    0.008     0.008     0.008 ms] jit_evaluator_17_kernel_0_serial
[  0.03%   0.000 s      1x |    0.008     0.008     0.008 ms] jit_evaluator_16_kernel_0_serial
[  0.02%   0.000 s      1x |    0.007     0.007     0.007 ms] jit_evaluator_15_kernel_0_serial
[  0.02%   0.000 s      1x |    0.007     0.007     0.007 ms] jit_evaluator_13_kernel_0_serial
[100.00%] Total execution time:   0.030 s   number of results: 21
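The code above simulates elastic wave propagation: inside the ti.kernel, the for loops automatically parallelize the computation of the wavefield, the stresses, and so on, while an outer Python loop steps the simulation forward in time.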
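To make the structure concrete, here is a toy sketch in plain Python (not Taichi, and not the real fd2d). The 1D grid, the constant c and the names step, simulate and order are all made up for illustration. It shows why the loops over cells can run in parallel while the time loop cannot: within each phase a cell reads only the other field plus its own value, so the order in which cells are visited does not change the result.

```python
def step(v, s, c, order):
    """One time step of a toy 1D staggered-grid update (hypothetical stand-in for fd2d)."""
    n = len(v)
    # Phase 1: update v from differences of s. Each cell reads only s and its
    # own v[i], so the cells can be visited in any order (or in parallel).
    for i in order(range(1, n - 1)):
        v[i] = v[i] + c * (s[i + 1] - s[i])
    # Phase 2: update s from the freshly updated v. Again order-independent.
    for i in order(range(1, n - 1)):
        s[i] = s[i] + c * (v[i] - v[i - 1])


def simulate(n_cells, n_steps, order):
    v = [0.0] * n_cells
    s = [0.0] * n_cells
    s[n_cells // 2] = 1.0  # initial pulse in the middle of the grid
    for _ in range(n_steps):  # time steps depend on each other: must be sequential
        step(v, s, 0.5, order)
    return v, s


forward = simulate(16, 20, lambda r: r)             # visit cells left to right
backward = simulate(16, 20, lambda r: reversed(r))  # visit cells right to left
print(forward == backward)  # True: cell order inside a phase does not matter
```

The real kernel has the same shape: two order-independent sweeps over the grid per time step, with the steps themselves chained one after another.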
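The wall-clock time measured with Python's time module (about 0.436 s) is much larger than the total kernel time reported by the profiler (0.030 s), and the gap keeps growing as the number of loop iterations increases.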
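For scale, a quick back-of-the-envelope calculation from the numbers printed above. This is only a rough estimate: it lumps together everything that happens outside the kernels (launch overhead, Python-side work, any synchronization, profiler bookkeeping), and the profiler total also includes the small jit_evaluator kernels.

```python
wall_clock_s = 0.4355292320251465  # printed by print(T2 - T1)
kernel_total_s = 0.030             # "Total execution time" from the profiler
n_calls = 1000                     # iterations of the outer Python loop

# Time not accounted for by GPU kernel execution.
overhead_s = wall_clock_s - kernel_total_s
per_call_ms = overhead_s / n_calls * 1e3

print(f"time outside kernels: {overhead_s:.3f} s total, {per_call_ms:.3f} ms per call")
```

So roughly 0.4 ms per fd2d call is spent outside the kernels, more than ten times the roughly 0.03 ms of kernel time per call.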
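Is there a way to speed up the outer loop that calls the ti.kernel from Python?
In short, I need an efficient way to run the outer loop sequentially while the inner loops run in parallel.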