The effect of specifying order='ij'

2022-08-12, Numerical Computing SIG

TL;DR:

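On both the Intel and AMD CPUs, explicitly specifying ij is somewhat faster than the default (about 1.3x to 1.5x by average kernel time), while ji is much slower (about 3.5x to 7x slower than the default). On CUDA (A100 and A6000), the default and ij perform essentially the same, but ji is roughly 4x faster.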
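For anyone unsure what the two `order` values change, here is a minimal pure-Python sketch of the memory layouts. It assumes the usual reading, in which `order='ij'` is row-major (j contiguous) and `order='ji'` is column-major (i contiguous). The helper `linear_offset` is hypothetical and is not part of Taichi. The kernel `foo` that was timed is not shown in this thread, so the sketch only illustrates the layouts and does not claim to explain the measured numbers.

```python
def linear_offset(i, j, shape, order):
    """Flat element index of (i, j) in a dense 2D array of the given shape.

    Hypothetical helper for illustration only; not a Taichi API.
    """
    n, m = shape
    if order == "ij":  # i is the outer axis, j is contiguous (row-major)
        return i * m + j
    if order == "ji":  # j is the outer axis, i is contiguous (column-major)
        return j * n + i
    raise ValueError(f"unknown order: {order!r}")


shape = (4, 8)

# Distance in elements between neighbours along each axis, for each layout.
step_j_ij = linear_offset(0, 1, shape, "ij") - linear_offset(0, 0, shape, "ij")
step_i_ij = linear_offset(1, 0, shape, "ij") - linear_offset(0, 0, shape, "ij")
step_j_ji = linear_offset(0, 1, shape, "ji") - linear_offset(0, 0, shape, "ji")
step_i_ji = linear_offset(1, 0, shape, "ji") - linear_offset(0, 0, shape, "ji")

print("order='ij': step along j =", step_j_ij, ", step along i =", step_i_ij)
print("order='ji': step along j =", step_j_ji, ", step along i =", step_i_ji)
```

Which axis ends up contiguous decides whether neighbouring loop iterations (CPU) or neighbouring threads (GPU) touch adjacent memory, so the two backends may well prefer different layouts.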

Test environment: a single A100 requested on a cluster, but the allocated node was also running other users' jobs.

CPU: for loop of 1000 iterations
CPU model: AuthenticAMD AMD EPYC 7713P 64-Core Processor

# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00%  42.677 s   1000x |   41.913    42.677    44.315 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:  42.677 s   number of results: 1
=========================================================================
[W 08/12/22 17:16:08.245 62890] [llvm_offline_cache.cpp:clean_cache@379] Lock /home/luod/.cache/taichi/ticache/metadata.lock failed

# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00%  32.060 s   1000x |   30.762    32.060    33.472 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:  32.060 s   number of results: 1
=========================================================================

# ji 
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00% 148.995 s   1000x |  145.462   148.995   153.182 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time: 148.995 s   number of results: 1
=========================================================================

CUDA 11.2.2: for loop of 10000 iterations

# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 92.42%   8.047 s  10000x |    0.801     0.805     0.860 ms] foo_c64_0_kernel_0_range_for
[  7.58%   0.660 s      1x |  660.375   660.375   660.375 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.078     0.078     0.078 ms] runtime_initialize_snodes
[  0.00%   0.000 s      1x |    0.017     0.017     0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time:   8.708 s   number of results: 4
=========================================================================

# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 92.37%   8.048 s  10000x |    0.801     0.805     0.860 ms] foo_c64_0_kernel_0_range_for
[  7.63%   0.665 s      1x |  664.905   664.905   664.905 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.082     0.082     0.082 ms] runtime_initialize_snodes
[  0.00%   0.000 s      1x |    0.016     0.016     0.016 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time:   8.713 s   number of results: 4
=========================================================================

# ji
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 73.57%   1.887 s  10000x |    0.187     0.189     0.242 ms] foo_c64_0_kernel_0_range_for
[ 26.43%   0.678 s      1x |  677.757   677.757   677.757 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.078     0.078     0.078 ms] runtime_initialize_snodes
[  0.00%   0.000 s      1x |    0.017     0.017     0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time:   2.564 s   number of results: 4
=========================================================================

# ji run 2
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.5
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA A100-SXM-80GB
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 73.88%   1.886 s  10000x |    0.184     0.189     0.240 ms] foo_c64_0_kernel_0_range_for
[ 26.11%   0.667 s      1x |  666.616   666.616   666.616 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.073     0.073     0.073 ms] runtime_initialize_snodes
[  0.00%   0.000 s      1x |    0.017     0.017     0.017 ms] runtime_memory_allocate_aligned
-------------------------------------------------------------------------
[100.00%] Total execution time:   2.553 s   number of results: 4
=========================================================================

Test environment: a small standalone server with nothing running other than system tasks.

CPU: for loop of 1000 iterations
Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz

# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00%  10.065 s   1000x |    5.296    10.065    22.308 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:  10.065 s   number of results: 1
=========================================================================

# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00%   6.808 s   1000x |    4.053     6.808    34.381 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:   6.808 s   number of results: 1
=========================================================================

# ji
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=x64
=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[100.00%  71.204 s   1000x |   53.141    71.204   104.714 ms] foo_c64_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:  71.204 s   number of results: 1
=========================================================================

NVIDIA RTX A6000, CUDA Version: 11.7, for loop of 10000 iterations

# default
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA RTX A6000
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 96.73%   5.717 s  10000x |    0.568     0.572     1.164 ms] foo_c64_0_kernel_0_range_for
[  3.26%   0.193 s      1x |  192.952   192.952   192.952 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.029     0.029     0.029 ms] runtime_memory_allocate_aligned
[  0.00%   0.000 s      1x |    0.003     0.003     0.003 ms] runtime_initialize_snodes
-------------------------------------------------------------------------
[100.00%] Total execution time:   5.910 s   number of results: 4

# ij
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA RTX A6000
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 96.77%   5.754 s  10000x |    0.569     0.575     1.039 ms] foo_c64_0_kernel_0_range_for
[  3.23%   0.192 s      1x |  192.219   192.219   192.219 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.031     0.031     0.031 ms] runtime_memory_allocate_aligned
[  0.00%   0.000 s      1x |    0.003     0.003     0.003 ms] runtime_initialize_snodes
-------------------------------------------------------------------------
[100.00%] Total execution time:   5.947 s   number of results: 4
=========================================================================

# ji
[Taichi] version 1.1.0, llvm 10.0.0, commit f5bb6464, linux, python 3.10.4
[Taichi] Starting on arch=cuda
=========================================================================
Kernel Profiler(count, default) @ CUDA on NVIDIA RTX A6000
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 87.87%   1.392 s  10000x |    0.135     0.139     0.365 ms] foo_c64_0_kernel_0_range_for
[ 12.13%   0.192 s      1x |  192.109   192.109   192.109 ms] runtime_initialize
[  0.00%   0.000 s      1x |    0.033     0.033     0.033 ms] runtime_memory_allocate_aligned
[  0.00%   0.000 s      1x |    0.003     0.003     0.003 ms] runtime_initialize_snodes
-------------------------------------------------------------------------
[100.00%] Total execution time:   1.584 s   number of results: 4
=========================================================================
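
To make the comparison across the four devices easier to read, the averages in the profiler tables above can be turned into speedups relative to the default layout. This is only a small arithmetic sketch. The numbers are copied from the `avg` column of the `foo_c64_0_kernel_0_range_for` rows, and the dictionary keys and the helper `speedup_vs_default` are names made up for this example.

```python
# Average kernel time in ms, copied from the "avg" column of the profiler tables.
avg_ms = {
    "EPYC 7713P (x64)": {"default": 42.677, "ij": 32.060, "ji": 148.995},
    "Xeon Gold 6226R (x64)": {"default": 10.065, "ij": 6.808, "ji": 71.204},
    "A100 (cuda)": {"default": 0.805, "ij": 0.805, "ji": 0.189},
    "RTX A6000 (cuda)": {"default": 0.572, "ij": 0.575, "ji": 0.139},
}


def speedup_vs_default(times):
    """Return default_time / time per layout; values above 1 are faster than default."""
    return {layout: times["default"] / t for layout, t in times.items()}


speedups = {device: speedup_vs_default(times) for device, times in avg_ms.items()}

for device, ratios in speedups.items():
    print(f"{device}: ij x{ratios['ij']:.2f}, ji x{ratios['ji']:.2f}")
```

A ratio above 1 means the layout is faster than the default and below 1 means it is slower. Since the A100 node was shared with other jobs, the first set of CPU and GPU numbers should be taken with some caution.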