2022 08 12 Numerical Computing SIG
TL;DR:
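On Intel/AMD CPUs, explicitly specifying ij is slightly faster than the default, while ji is very slow. On CUDA (A100/A6000), the default and ij perform essentially the same, but ji is more than 3x faster.
Test environment: a single A100 requested on the cluster; the allocated node was also running other users' jobs at the time.
CPU: the kernel was run 1000 times in a for loop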
CPU model: AuthenticAMD AMD EPYC 7713P 64-Core Processor
# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[100.00% 42.677 s 1000x | 41.913 42.677 44.315 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time: 42.677 s number of results: 1
=========================================================================
[W 08/12/22 17:16:08.245 62890] [llvm_offline_cache.cpp:clean_cache@379] Lock /home/luod/.cache/taichi/ticache/metadata.lock failed
# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[100.00% 32.060 s 1000x | 30.762 32.060 33.472 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time: 32.060 s number of results: 1
=========================================================================
# ji
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[100.00% 148.995 s 1000x | 145.462 148.995 153.182 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time: 148.995 s number of results: 1
=========================================================================
CUDA 11.2.2: the kernel was run 10000 times in a for loop
# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[ 92.42% 8.047 s 10000x | 0.801 0.805 0.860 ms] foo_c64_0_kernel_0_range_for
[ 7.58% 0.660 s 1x | 660.375 660.375 660.375 ms] runtime_initialize
[ 0.00% 0.000 s 1x | 0.078 0.078 0.078 ms] runtime_initialize_snodes
[ 0.00% 0.000 s 1x | 0.017 0.017 0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time: 8.708 s number of results: 4
=========================================================================
# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[ 92.37% 8.048 s 10000x | 0.801 0.805 0.860 ms] foo_c64_0_kernel_0_range_for
[ 7.63% 0.665 s 1x | 664.905 664.905 664.905 ms] runtime_initialize
[ 0.00% 0.000 s 1x | 0.082 0.082 0.082 ms] runtime_initialize_snodes
[ 0.00% 0.000 s 1x | 0.016 0.016 0.016 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time: 8.713 s number of results: 4
=========================================================================
# ji
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[ 73.57% 1.887 s 10000x | 0.187 0.189 0.242 ms] foo_c64_0_kernel_0_range_for
[ 26.43% 0.678 s 1x | 677.757 677.757 677.757 ms] runtime_initialize
[ 0.00% 0.000 s 1x | 0.078 0.078 0.078 ms] runtime_initialize_snodes
[ 0.00% 0.000 s 1x | 0.017 0.017 0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time: 2.564 s number of results: 4
=========================================================================
# ji run 2
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[ % total count | min avg max ] Kernel name
-------------------------------------------------------------------------
[ 73.88% 1.886 s 10000x | 0.184 0.189 0.240 ms] foo_c64_0_kernel_0_range_for
[ 26.11% 0.667 s 1x | 666.616 666.616 666.616 ms] runtime_initialize
[ 0.00% 0.000 s 1x | 0.073 0.073 0.073 ms] runtime_initialize_snodes
[ 0.00% 0.000 s 1x | 0.017 0.017 0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time: 2.553 s number of results: 4
=========================================================================
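
As a quick sanity check of the TL;DR, the average kernel times above can be turned into speedup ratios relative to the default. This is only a minimal sketch: the numbers are copied by hand from the "avg" column of the profiler output, and the names avg_ms and speedup_vs_default are illustrative, not part of the original benchmark script.

```python
# Average kernel time in ms, copied from the "avg" column of the profiler logs above.
avg_ms = {
    "x64": {"default": 42.677, "ij": 32.060, "ji": 148.995},
    # Both CUDA ji runs report the same average (0.189 ms).
    "cuda": {"default": 0.805, "ij": 0.805, "ji": 0.189},
}


def speedup_vs_default(times):
    """Return default_time / variant_time per variant (>1 means faster than default)."""
    base = times["default"]
    return {name: round(base / t, 2) for name, t in times.items()}


ratios = {arch: speedup_vs_default(times) for arch, times in avg_ms.items()}
for arch, r in ratios.items():
    print(arch, r)
```

On these numbers, ij is about 1.33x faster than the default on the EPYC CPU, ji is about 3.5x slower there, and ji is about 4.26x faster than the default on the A100. Since the node was shared with other jobs, these ratios should be read as rough indications rather than precise measurements.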