Question: Memory bandwidth on a 24-core AMD server


I need some help determining whether the memory bandwidth I'm seeing under Linux on my server is normal. Here are the server specs:

HP ProLiant DL165 G7
2x AMD Opteron 6164 HE 12-Core
40 GB RAM (10 x 4 GB DDR3-1333)
Debian 6.0

Running mbw on this server, I get the following numbers:

foo1:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.58047    MiB: 1024.00000 Copy: 1764.082 MiB/s
1   Method: MEMCPY  Elapsed: 0.58012    MiB: 1024.00000 Copy: 1765.152 MiB/s
2   Method: MEMCPY  Elapsed: 0.58010    MiB: 1024.00000 Copy: 1765.201 MiB/s
AVG Method: MEMCPY  Elapsed: 0.58023    MiB: 1024.00000 Copy: 1764.811 MiB/s
0   Method: DUMB    Elapsed: 0.36174    MiB: 1024.00000 Copy: 2830.778 MiB/s
1   Method: DUMB    Elapsed: 0.35869    MiB: 1024.00000 Copy: 2854.817 MiB/s
2   Method: DUMB    Elapsed: 0.35848    MiB: 1024.00000 Copy: 2856.481 MiB/s
AVG Method: DUMB    Elapsed: 0.35964    MiB: 1024.00000 Copy: 2847.310 MiB/s
0   Method: MCBLOCK Elapsed: 0.23546    MiB: 1024.00000 Copy: 4348.860 MiB/s
1   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.230 MiB/s
2   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.359 MiB/s
AVG Method: MCBLOCK Elapsed: 0.23545    MiB: 1024.00000 Copy: 4349.149 MiB/s

On one of my other servers (based on an Intel Xeon E3-1270):

foo2:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.18960    MiB: 1024.00000 Copy: 5400.901 MiB/s
1   Method: MEMCPY  Elapsed: 0.18922    MiB: 1024.00000 Copy: 5411.690 MiB/s
2   Method: MEMCPY  Elapsed: 0.18944    MiB: 1024.00000 Copy: 5405.491 MiB/s
AVG Method: MEMCPY  Elapsed: 0.18942    MiB: 1024.00000 Copy: 5406.024 MiB/s
0   Method: DUMB    Elapsed: 0.14838    MiB: 1024.00000 Copy: 6901.200 MiB/s
1   Method: DUMB    Elapsed: 0.14818    MiB: 1024.00000 Copy: 6910.561 MiB/s
2   Method: DUMB    Elapsed: 0.14820    MiB: 1024.00000 Copy: 6909.628 MiB/s
AVG Method: DUMB    Elapsed: 0.14825    MiB: 1024.00000 Copy: 6907.127 MiB/s
0   Method: MCBLOCK Elapsed: 0.04362    MiB: 1024.00000 Copy: 23477.623 MiB/s
1   Method: MCBLOCK Elapsed: 0.04262    MiB: 1024.00000 Copy: 24025.151 MiB/s
2   Method: MCBLOCK Elapsed: 0.04258    MiB: 1024.00000 Copy: 24048.849 MiB/s
AVG Method: MCBLOCK Elapsed: 0.04294    MiB: 1024.00000 Copy: 23847.599 MiB/s

And here is my Intel-based laptop for reference:

laptop:~$ mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.40566    MiB: 1024.00000 Copy: 2524.269 MiB/s
1   Method: MEMCPY  Elapsed: 0.38458    MiB: 1024.00000 Copy: 2662.638 MiB/s
2   Method: MEMCPY  Elapsed: 0.38876    MiB: 1024.00000 Copy: 2634.043 MiB/s
AVG Method: MEMCPY  Elapsed: 0.39300    MiB: 1024.00000 Copy: 2605.600 MiB/s
0   Method: DUMB    Elapsed: 0.30707    MiB: 1024.00000 Copy: 3334.745 MiB/s
1   Method: DUMB    Elapsed: 0.30425    MiB: 1024.00000 Copy: 3365.653 MiB/s
2   Method: DUMB    Elapsed: 0.30342    MiB: 1024.00000 Copy: 3374.849 MiB/s
AVG Method: DUMB    Elapsed: 0.30491    MiB: 1024.00000 Copy: 3358.328 MiB/s
0   Method: MCBLOCK Elapsed: 0.07875    MiB: 1024.00000 Copy: 13003.670 MiB/s
1   Method: MCBLOCK Elapsed: 0.08374    MiB: 1024.00000 Copy: 12228.034 MiB/s
2   Method: MCBLOCK Elapsed: 0.07635    MiB: 1024.00000 Copy: 13411.216 MiB/s
AVG Method: MCBLOCK Elapsed: 0.07961    MiB: 1024.00000 Copy: 12862.006 MiB/s

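So according to mbw, my laptop is roughly 3 times faster than the server (in the MCBLOCK test)! Please help me explain this. I also tried mounting a RAM disk and benchmarking it with dd, and I see a similar difference, so I don't think mbw is to blame.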
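To put the "3 times" figure in context, here is a minimal sketch that divides the AVG rates from the three mbw runs above by the AMD server's rates. The dictionary keys and the helper name `speedup_over` are made up for illustration; the numbers are copied from the output above.

```python
# AVG lines copied from the three mbw runs above (MiB/s).
avg_mib_s = {
    "foo1 (Opteron 6164 HE)": {"MEMCPY": 1764.811, "DUMB": 2847.310, "MCBLOCK": 4349.149},
    "foo2 (Xeon E3-1270)": {"MEMCPY": 5406.024, "DUMB": 6907.127, "MCBLOCK": 23847.599},
    "laptop": {"MEMCPY": 2605.600, "DUMB": 3358.328, "MCBLOCK": 12862.006},
}


def speedup_over(baseline, other):
    """Return {method: other / baseline} for every method in the baseline run."""
    return {method: other[method] / baseline[method] for method in baseline}


server = avg_mib_s["foo1 (Opteron 6164 HE)"]
for host, rates in avg_mib_s.items():
    ratios = speedup_over(server, rates)
    print(host, {method: round(ratio, 2) for method, ratio in ratios.items()})
```

The laptop's lead is close to 3x only for MCBLOCK; for MEMCPY and DUMB it is nearer 1.5x and 1.2x. All three tests are single-threaded, so they say little about the aggregate bandwidth of a 24-core machine.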

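I have checked the BIOS settings, and the memory appears to be running at full speed. According to the hosting company, the modules are all fine.

Could this have something to do with NUMA? Node Interleaving appears to be disabled on this server. Would enabling it (and thereby turning NUMA off) make a difference?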

foo1:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 7898 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 1 free: 12073 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 12288 MB
node 2 free: 12034 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
node 3 free: 8032 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 
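The per-node sizes in that output already hint at how the DIMMs are spread out. The sketch below parses the `node N size:` lines and estimates the number of modules behind each node; it assumes every module is 4 GB (4096 MB), as stated in the specs above, and the helper name `node_sizes_mb` is hypothetical.

```python
import re

# Relevant part of the `numactl --hardware` output from foo1, copied from above.
NUMACTL_OUTPUT = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 12288 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
"""


def node_sizes_mb(text):
    """Map NUMA node id -> size in MB, taken from 'node N size: X MB' lines."""
    pairs = re.findall(r"^node (\d+) size: (\d+) MB", text, re.M)
    return {int(node): int(mb) for node, mb in pairs}


sizes = node_sizes_mb(NUMACTL_OUTPUT)
for node, mb in sorted(sizes.items()):
    # Assumption: every DIMM is 4 GB (4096 MB).
    print(f"node {node}: {mb} MB, about {round(mb / 4096)} DIMM(s)")
print("total:", sum(sizes.values()), "MB")
```

Under that assumption, nodes 0 and 3 hold two modules each and nodes 1 and 2 hold three each, so the memory is not spread evenly across the four nodes.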

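Update:

I disabled NUMA (numa=off on the Linux boot command line) and disabled ECC in the BIOS. No change; the numbers are still the same as above.

Update 2:

This is the memory layout according to dmidecode: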

PROC 1 DIMM 1
PROC 1 DIMM 4
PROC 1 DIMM 7
PROC 1 DIMM 10
PROC 1 DIMM 12

PROC 2 DIMM 1
PROC 2 DIMM 4
PROC 2 DIMM 7
PROC 2 DIMM 10
PROC 2 DIMM 12

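These are all 4 GB Samsung modules (part number M393B5270CH0-CH9).

I had a look at HP's guide on how to populate the memory in this server, and if I understand it correctly, the modules currently in the DIMM 12 slots should be in the DIMM 3 slots instead. Could such a misconfiguration explain the results I'm getting?

Update 3:

I have now had 2 modules removed, so that each side has 4 x 4 GB in slots 1-4-7-10 (4-4). Unfortunately, I don't see any difference in the benchmarks. Shouldn't the server be able to use all four channels now? I also tried the stream benchmark, multi-threaded, and the results were very disappointing. The only thing I can think of now is to ask the hosting company to replace the whole server...

Update 4:

I must have done something wrong when I ran stream against the latest setup (32 GB) yesterday, because today I'm seeing very good results: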

foo1:~# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 24
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 703 microseconds.
   (= 703 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       36873.0022       0.0009       0.0009       0.0010
Scale:      34699.5160       0.0009       0.0009       0.0010
Add:        30868.8427       0.0016       0.0016       0.0017
Triad:      25558.7904       0.0019       0.0019       0.0020
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

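(I have given up on mbw because it only runs single-threaded. It still gives the same poor results on this server.)

So the problem must have been those last two 4 GB modules forcing the server to run in single-channel mode, just as @chx points out below. The only remaining question is whether it is possible to use 40 GB and still get full bandwidth. Could I use 2 x 8 GB + 6 x 4 GB? Does it matter which channels I put the larger modules in?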
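As a quick check of the capacity arithmetic only, the sketch below lists which mixes of 4 GB and 8 GB modules reach a given total with exactly eight DIMMs (four per CPU, as in the 32 GB setup). The function name `fills` is made up for illustration, and the code says nothing about whether mixed module sizes keep every channel running at full speed; that remains the open question.

```python
from itertools import combinations_with_replacement


def fills(total_gb, slots, sizes_gb=(4, 8)):
    """Yield module-size combinations using exactly `slots` DIMMs that sum to `total_gb`."""
    for combo in combinations_with_replacement(sizes_gb, slots):
        if sum(combo) == total_gb:
            yield combo


# Assumption: 8 DIMMs in total, i.e. 4 per CPU with one module per channel.
for combo in fills(40, 8):
    print(combo)
```

With eight slots, two 8 GB plus six 4 GB modules is the only 4/8 GB mix that adds up to 40 GB.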


7
2017-09-25 15:27




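Well, your laptop isn't running ECC, so that could explain it. Is ECC running on the Intel server? - pauska
The RAM modules in the AMD server are ECC modules. I don't know about the Intel server; dmidecode doesn't give any information about the modules it uses. But can ECC really explain this huge difference? Googling suggests that ECC RAM carries a penalty of a few percent. What I'm seeing here is far more than that! - ntherning
What is the layout of the memory? Is it registered? - Chris S
@ChrisS: I have updated the question with more information about the memory modules and the current layout. - ntherning
A bad layout can wreak havoc, but yours is correct. I'm really not sure what is going on here, but I do know that current Opterons beat current Intel processors in all sorts of memory benchmarks, because the Opteron has 4 channels and Intel only has 3. The single-threaded nature of the mbw software might change things somewhat, although dd shows similar results... I'm not sure, but this isn't right. - Chris S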


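Answers:


You are forcing the system to run in single-channel (!) mode by using 5-5 modules per CPU instead of 4-4 or 8-8. That is the reason. Try removing 1-1 and report back.

The 6164 is a G34 socket CPU, which can do quad-channel operation if the memory modules are set up correctly. Your setup is the worst possible one.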
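For a sense of scale, here is a rough sketch of the theoretical peak figures. It assumes DDR3-1333 (nominally 1333 mega-transfers per second) on a 64-bit (8-byte) channel and ignores all overheads; the function name `peak_gb_s` is made up for illustration. The STREAM Copy rate is the one from Update 4 in the question.

```python
def peak_gb_s(mt_per_s, channels, bus_bytes=8):
    """Theoretical peak in GB/s (10^9 bytes/s): transfers/s * bytes per transfer * channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


DDR3_1333 = 1333  # nominal mega-transfers per second (assumption: modules run at rated speed)

per_channel = peak_gb_s(DDR3_1333, 1)
quad_two_sockets = peak_gb_s(DDR3_1333, 4 * 2)    # 4 channels on each of 2 CPUs
single_two_sockets = peak_gb_s(DDR3_1333, 1 * 2)  # 1 channel on each of 2 CPUs

# STREAM reports MB/s as 10^6 bytes/s; Copy rate taken from Update 4.
stream_copy = 36873.0022 / 1000

print(f"per channel:              {per_channel:.1f} GB/s")
print(f"2 CPUs, quad channel:     {quad_two_sockets:.1f} GB/s")
print(f"2 CPUs, single channel:   {single_two_sockets:.1f} GB/s")
print(f"STREAM Copy (32 GB setup): {stream_copy:.1f} GB/s")
```

The measured Copy rate is above what two single channels could deliver in theory, which fits the idea that more than one channel per CPU was active in the 32 GB configuration. Treat the comparison loosely: the STREAM run used a 45.8 MB working set, so CPU caches may have contributed to the measured figure.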


7
2017-09-25 21:37



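The DIMM population is fine! :) - ewwhite
The hosting company I rent from has changed the module layout to slots 1-4-7-10-3, which is exactly what HP says to do in its manual. There was no noticeable difference in my tests, though. I have now asked them to remove the last 4 GB module on each side. - ntherning
@chx - It looks like you were right about single-channel mode. See my latest update. - ntherning
You need four identical modules, one per channel, end of story. And because it is a dual-CPU system, you actually need eight to run them in quad-channel mode. So it is either 32 GB or 64 GB. There is no middle ground. - chx
Well, you may be right. But the nerd in me can't let this go! :-) If you aren't tired of me yet, please help me explain this: they have now put back the 4 GB modules that were taken out, so the server is at 40 GB and 5-5 again. stream gave bad results again. But I then tried removing the numa=off boot option, and after a reboot stream gives results close to the excellent ones I saw with 32 GB in my last update. - ntherning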