# Problem with high-dimensional ti.var

The following code includes several `ti.var`s. It works fine when `batch_size`, `input_feature`, `hidden_feature`, and `out_feature` are all small. However, if I use values like 64, which are normal in neural networks, the program keeps running and never gives a result.

``````
import torch
import torch.nn.functional as F
import math
import torch.nn as nn
from torch.autograd import gradcheck  # used in the __main__ block
import taichi as ti

ti.get_runtime().set_default_fp(ti.f32)
real = ti.f32

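class LinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_data, weight_0, bias_0, weight_1, bias_1,
                ti_data, ti_weight_0, ti_bias_0, ti_weight_1, ti_bias_1,
                ti_output_0, ti_output_1, ti_kernel):
        # Keep the Taichi holders and the kernel around for the backward pass.
        ctx.ti_output_1 = ti_output_1
        ctx.ti_kernel = ti_kernel
        ctx.ti_data = ti_data
        ctx.ti_weight_0 = ti_weight_0
        ctx.ti_bias_0 = ti_bias_0
        ctx.ti_weight_1 = ti_weight_1
        ctx.ti_bias_1 = ti_bias_1
        # Copy the torch tensors into the Taichi holders, then run the kernel.
        ti_data.from_torch(input_data)
        ti_weight_0.from_torch(weight_0)
        ti_bias_0.from_torch(bias_0)
        ti_weight_1.from_torch(weight_1)
        ti_bias_1.from_torch(bias_1)
        ti_kernel()
        return ti_output_1.to_torch()

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient computation is truncated in the post; only the trailing
        # entries of the return value survive:
        # ..., None, None, None, None, None, None, None, None
        ...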

class Linear(nn.Module):
    def __init__(self, input_feature, hidden_feature, out_feature):
        super(Linear, self).__init__()
        # taichi parameter holders (batch_size is a module-level global)
        self.ti_data = ti.var(dt=real, shape=(batch_size, input_feature), needs_grad=True)
        self.ti_weight_0 = ti.var(dt=real, shape=(input_feature, hidden_feature), needs_grad=True)
        self.ti_output_0 = ti.var(dt=real, shape=(batch_size, hidden_feature), needs_grad=True)
        self.ti_weight_1 = ti.var(dt=real, shape=(hidden_feature, out_feature), needs_grad=True)
        self.ti_output_1 = ti.var(dt=real, shape=(batch_size, out_feature), needs_grad=True)
        # bias holders: referenced by the kernel and by forward() but not
        # declared in the original post (shapes assumed)
        self.ti_bias_0 = ti.var(dt=real, shape=(hidden_feature,), needs_grad=True)
        self.ti_bias_1 = ti.var(dt=real, shape=(out_feature,), needs_grad=True)
        # torch parameters
        self.weight_0 = nn.Parameter(torch.Tensor(input_feature, hidden_feature))
        self.bias_0 = nn.Parameter(torch.Tensor(hidden_feature))
        self.weight_1 = nn.Parameter(torch.Tensor(hidden_feature, out_feature))
        self.bias_1 = nn.Parameter(torch.Tensor(out_feature))
        self.weight_0.data.normal_(0, math.sqrt(2. / hidden_feature / input_feature))
        self.weight_1.data.normal_(0, math.sqrt(2. / hidden_feature / out_feature))

    @ti.classkernel
    def linear_kernel(self):
        for i in range(batch_size):
            # first layer: linear + ReLU
            for j in ti.static(range(hidden_feature)):
                dummy = 0.0
                for k in ti.static(range(input_feature)):
                    dummy += self.ti_data[i, k] * self.ti_weight_0[k, j]
                dummy += self.ti_bias_0[j]
                self.ti_output_0[i, j] = ti.max(dummy, 0)
            # second layer: linear
            for j in ti.static(range(out_feature)):
                dummy = 0.0
                for k in ti.static(range(hidden_feature)):
                    dummy += self.ti_output_0[i, k] * self.ti_weight_1[k, j]
                dummy += self.ti_bias_1[j]
                self.ti_output_1[i, j] = dummy

    def forward(self, input_data):
        return LinearFunction.apply(input_data, self.weight_0, self.bias_0, self.weight_1, self.bias_1,
                                    self.ti_data, self.ti_weight_0, self.ti_bias_0, self.ti_weight_1, self.ti_bias_1,
                                    self.ti_output_0, self.ti_output_1, self.linear_kernel)

if __name__ == '__main__':
    batch_size = 32
    input_feature = 64
    hidden_feature = 128
    out_feature = 64
    data = torch.rand(batch_size, input_feature, dtype=torch.float32, requires_grad=True)
    linear = Linear(input_feature, hidden_feature, out_feature)

    test = gradcheck(linear, data, eps=1e-3, atol=1e-4)
    print(test)

``````

Using `ti.static` forces the loop to be unrolled. Here you are unrolling too aggressively. See https://github.com/yuanming-hu/difftaichi/blob/master/examples/rigid_body.py#L112 for an example NN implementation.

Maybe check CPU usage in Task Manager to see whether too many resources are being taken. I have seen those threads eat up all the cores and never even reach “Runtime initialization” …

Is there a rule of thumb for how large a dimension is suitable for `ti.static`? In addition, is there any difference between CPU and GPU regarding the unrolled dimensions?

Usually < 30 is fine. In the script you provided there are two levels of force-unrolled loops, so you are getting an unrolling factor of `hidden_feature × input_feature`. The device selection (CPU/GPU) is not really related to unrolling, at least during compilation.
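To put numbers on that, here is a rough back-of-the-envelope count in plain Python. It assumes each innermost multiply-add becomes one emitted statement, which is a simplification (the code Taichi actually generates is not shown in this thread), and `unrolled_statements` is a made-up helper name.

```python
def unrolled_statements(outer, inner):
    """Loop bodies emitted when two nested loops are both fully unrolled."""
    return outer * inner

# Sizes taken from the script in the question.
input_feature, hidden_feature, out_feature = 64, 128, 64

layer_0 = unrolled_statements(hidden_feature, input_feature)  # first layer
layer_1 = unrolled_statements(out_feature, hidden_feature)    # second layer
total = layer_0 + layer_1

print(layer_0, layer_1, total)  # 8192 8192 16384
```

Under that assumption, the kernel body grows to roughly sixteen thousand statements, far beyond the "usually < 30" guideline.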

Do you mean the total unrolling factor should be < 30? I remember that the first for-loop is automatically unrolled. That is why I have seen code such as mpm where you unrolled a loop of 6400.

30 is not a strict bound. Going beyond 30 may lead to longer compilation time and lower performance.
The first for-loop is automatically parallelized instead of unrolled.

Aha, sorry for my poor knowledge of basic computer science.
Is the second for-loop also parallelized? If not, is there a way to do this in code?

No, and there’s unfortunately no way to parallelize it. Given current processor architectures, nested parallelism is generally hard to make efficient…

What if I have two for-loops on the same level?

Then both will get parallelized, if you are talking about the outer-most level.

OK. Yes, that’s what I am talking about.
So a possible way to parallelize two successive for-loops is to write them as a single for-loop and get the indices by mod and division?

Exactly!!
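A minimal sketch of that index trick, written in plain Python rather than Taichi so it runs anywhere. `flat_to_2d` is a hypothetical helper name, and the sizes are arbitrary. In a Taichi kernel the single flattened loop would sit at the outermost level, which is the one that gets parallelized.

```python
def flat_to_2d(idx, n_cols):
    """Recover (i, j) from a flattened row-major index."""
    return idx // n_cols, idx % n_cols

batch_size, hidden_feature = 4, 3

# Nested version: only the loop over i is at the outermost level.
nested = [(i, j) for i in range(batch_size) for j in range(hidden_feature)]

# Flattened version: one loop over batch_size * hidden_feature iterations,
# with i and j recovered by division and mod.
flattened = [flat_to_2d(idx, hidden_feature)
             for idx in range(batch_size * hidden_feature)]

assert nested == flattened  # both visit the same (i, j) pairs in the same order
```

This only helps when each `(i, j)` iteration is independent of the others, as is the case for the per-output dot products in the question.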