Is there a way to reduce the performance overhead caused by the outer Python loop?

``````
# temp, b, ca, cl, cm1, s, seismo_w, seismo_u, nzbc, nxbc, ng and dtx
# are globals defined elsewhere (not shown in this post).
@ti.kernel
def fd2d(it: int, ww: ti.template(), uu: ti.template(),
         fux: ti.template(), fuz: ti.template(), bwx: ti.template(), bwz: ti.template(),
         xx: ti.template(), zz: ti.template(), xz: ti.template()):
    # Update uu and ww from xx, zz and xz.
    for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        # print(x, y)
        uu[x, y] = temp[x, y] * uu[x, y] + b[x, y] * (xx[x, y + 1] - xx[x, y] + xz[x, y] - xz[x - 1, y])
        ww[x, y] = temp[x, y] * ww[x, y] + b[x, y] * (xz[x, y] - xz[x, y - 1] + zz[x + 1, y] - zz[x, y])

    # Write the sample for this time step into a single cell of ww.
    ww[2, 48] = s[it, 0]

    # Compute the differences of uu and ww, then update xx, zz and xz.
    for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        fux[x, y] = uu[x, y] - uu[x, y - 1]
        fuz[x, y] = uu[x + 1, y] - uu[x, y]
        bwx[x, y] = ww[x, y + 1] - ww[x, y]
        bwz[x, y] = ww[x, y] - ww[x - 1, y]

        # for x, y in ti.ndrange((1, nzbc - 1), (1, nxbc - 1)):
        xx[x, y] = temp[x, y] * xx[x, y] + (ca[x, y] * fux[x, y] + cl[x, y] * bwz[x, y]) * dtx
        zz[x, y] = temp[x, y] * zz[x, y] + (ca[x, y] * bwz[x, y] + cl[x, y] * fux[x, y]) * dtx
        xz[x, y] = temp[x, y] * xz[x, y] + (cm1[x, y] * (fuz[x, y] + bwx[x, y])) * dtx

    for y in ti.ndrange((0, nxbc)):
        zz[1, y] = 0
    # Record ww and uu along row 2 for this time step.
    for x in ti.ndrange((0, ng)):
        seismo_w[it, x] = ww[2, 40 + x * 2]
        seismo_u[it, x] = uu[2, 40 + x * 2]


ti.profiler.clear_kernel_profiler_info()
T1 = time.time()
# Outer time-stepping loop in Python: one kernel launch per step.
for it in range(1000):
    # print(it)
    fd2d(it, ww, uu, fux, fuz, bwx, bwz, xx, zz, xz)
    # if it % 500 == 0:
    #     imagesc(ww.to_torch())
T2 = time.time()
print(T2 - T1)
ti.profiler.print_kernel_profiler_info()
``````

[100.00%] Total execution time: 0.030 s number of results: 21

After `loop_config`, can the `for` loop one level down still run in parallel?

``````
import taichi as ti

ti.init(arch=ti.gpu)

@ti.kernel
def f():
    # ti.static unrolls the outer loop at compile time.
    for i in ti.static(range(10)):
        for j in range(100):
            print(i, j)

f()
``````

``````
import taichi as ti
import time
ti.init(arch=ti.gpu)

# Variant 1: outer loop unrolled inside a single kernel.
@ti.kernel
def f1():
    for i in ti.static(range(100)):
        for j in range(100):
            print(i, j)

# Variant 2: one small kernel, launched repeatedly from Python.
@ti.kernel
def subs():
    for j in range(100):
        print(j)

def f2():
    for i in range(100):
        subs()

T1 = time.time()
f1()
T2 = time.time()
Tf1 = T2 - T1
print(Tf1)

T1 = time.time()
f2()
T2 = time.time()
Tf2 = T2 - T1
print(Tf2)
``````

Loop unrolling does work, but unrolling 100 or 1000 iterations still shows no advantage:
Tf1= 0.6472702026367188
Tf2= 0.06530094146728516

Unrolling makes compilation take noticeably longer, but once the kernel is cached it should be fast.

[Taichi] version 1.6.0, llvm 15.0.4, commit f1c6fbbd, linux, python 3.10.6
[Taichi] Starting on arch=cuda
0.33597517013549805 ← loop unroll, first run (includes compilation)
0.013126373291015625 ← loop unroll, second run
0.023073911666870117 ← outer loop
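The numbers above separate the first call (which includes compilation) from later calls. A minimal sketch of a helper that measures the two separately is shown here. The name `time_first_and_warm` and its parameters are hypothetical, not part of Taichi, and the optional `sync` argument is an assumption for GPU backends, where one would typically pass `ti.sync` so that the clock is read only after the launched work has finished.

```python
import time


def time_first_and_warm(fn, repeats=5, sync=None):
    """Time the first call of `fn` separately from later (warm) calls.

    `sync` is an optional zero-argument callable run before each clock read
    (hypothetical parameter; with Taichi one might pass `ti.sync`).
    Returns (first_call_seconds, mean_warm_seconds).
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    def timed_call():
        if sync is not None:
            sync()  # drain earlier work so it is not charged to this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return time.perf_counter() - start

    first = timed_call()  # may include JIT compilation
    warm = [timed_call() for _ in range(repeats)]
    return first, sum(warm) / len(warm)
```

With the kernels from the post, this could be called as `time_first_and_warm(f1, sync=ti.sync)` and `time_first_and_warm(f2, sync=ti.sync)`. Comparing the warm averages, rather than the first calls, keeps compilation time out of the comparison between the unrolled kernel and the outer Python loop.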