SIMD 实验 |
|
重要的是,这里的初始实现是 ANSI C,它不使用任何 SSE 指令或多线程。这怎么可能?Lua VM 中的操作码调度会带来不可忽略的开销。但是,如果我们解释每个操作码(指令)一次,并在多个数据元素上执行该操作码,我们预计可以降低操作码调度的相对开销,即使数据在每个操作码中是串行处理的。
当前的实现是一个实验性的原型,尚未完全实现。最初的目标仅仅是估计可能的性能提升。设计优先考虑实现的简单性。事实上,最简单的方法,需要最少的补丁,似乎是重新定义 TValue 类型以包含一个值数组。这显然会给非 SIMD 操作带来开销,但这对于现在的目标来说是可以接受的。(在 luaconf.h 中重新定义 lua_Number 类型可能是另一种比较简单的方法,但它也是一个全局更改。)更精细的实现可能会保留 TValue 的原样,并在 TValue 序列上定义 SIMD 操作,可能使用新的 SIMD 操作码,从而允许同时进行 SIMD 和非 SIMD 操作。
在这个实现中,每个 Lua 数字包含一定数量的子值(这个数字在全局变量 _SIMD_LEN 中定义,可以在 luaconf.h 中调整)。每个 Lua 数字中的单个子值可以使用“打包”运算符进行索引。从语法上讲,这个运算符在这里看起来像一个函数调用,但从语义上讲,它更像是在寄存器词法上的索引运算符,它有自己的操作码。
-- ex1.lua -- If you don't use the SIMD features, numbers will behave -- as usual. This is because any SIMD unaware operations -- (e.g. print) only use the first sub-value of a number. local x = 10 print(x + 50) --> 60 -- Individual subvalues can be accessed with the packed operator: assert(packed(x,1) == 10) -- success packed(x,2) = 20 assert(packed(x,1) == 10) assert(packed(x,2) == 20) local y; packed(y,1) = 100; packed(y,2) = 200; local z = x + y -- this is a SIMD instruction assert(packed(z,1) == 110) assert(packed(z,2) == 220) print(z) --> 110 (print is not currently SIMD-aware) print(_SIMD_LEN) --> prints some number (e.g. 2)
以下是这些操作码的样子。注意新的 GETPACKED 和 SETPACKED 指令。所有其他操作码保持不变。
$ ./src/luac -p -l ex1.lua
main <ex.lua:0,0> (51 instructions, 204 bytes at 0x682930)
0+ params, 5 slots, 0 upvalues, 3 locals, 12 constants, 0 functions
1 [4] LOADK 0 -1 ; 10
2 [5] GETGLOBAL 1 -2 ; print
3 [5] ADD 2 0 -3 ; - 50
4 [5] CALL 1 2 1
5 [8] GETGLOBAL 1 -4 ; assert
6 [8] GETPACKED 2 0 -5 ; 1
7 [8] EQ 1 2 -1 ; - 10
8 [8] JMP 1 ; to 10
9 [8] LOADBOOL 2 0 1
10 [8] LOADBOOL 2 1 0
11 [8] CALL 1 2 1
12 [9] SETPACKED 0 -6 -7 ; 2 20
13 [10] GETGLOBAL 1 -4 ; assert
14 [10] GETPACKED 2 0 -5 ; 1
15 [10] EQ 1 2 -1 ; - 10
16 [10] JMP 1 ; to 18
17 [10] LOADBOOL 2 0 1
18 [10] LOADBOOL 2 1 0
19 [10] CALL 1 2 1
20 [11] GETGLOBAL 1 -4 ; assert
21 [11] GETPACKED 2 0 -6 ; 2
22 [11] EQ 1 2 -7 ; - 20
23 [11] JMP 1 ; to 25
24 [11] LOADBOOL 2 0 1
25 [11] LOADBOOL 2 1 0
26 [11] CALL 1 2 1
27 [13] LOADNIL 1 1
28 [13] SETPACKED 1 -5 -8 ; 1 100
29 [13] SETPACKED 1 -6 -9 ; 2 200
30 [14] ADD 2 0 1
31 [15] GETGLOBAL 3 -4 ; assert
32 [15] GETPACKED 4 2 -5 ; 1
33 [15] EQ 1 4 -10 ; - 110
34 [15] JMP 1 ; to 36
35 [15] LOADBOOL 4 0 1
36 [15] LOADBOOL 4 1 0
37 [15] CALL 3 2 1
38 [16] GETGLOBAL 3 -4 ; assert
39 [16] GETPACKED 4 2 -6 ; 2
40 [16] EQ 1 4 -11 ; - 220
41 [16] JMP 1 ; to 43
42 [16] LOADBOOL 4 0 1
43 [16] LOADBOOL 4 1 0
44 [16] CALL 3 2 1
45 [17] GETGLOBAL 3 -2 ; print
46 [17] MOVE 4 2
47 [17] CALL 3 2 1
48 [19] GETGLOBAL 3 -2 ; print
49 [19] GETGLOBAL 4 -12 ; _SIMD_LEN
50 [19] CALL 3 2 1
51 [19] RETURN 0 1
现在,让我们做一些基准测试:将 1 到 2^28 的整数相加。在标准 Lua 中,它看起来像这样
-- test1-standard.lua local sum = 0 for i=1,2^28 do sum = sum + i end print(sum)
使用这个 SIMD 补丁,我们可以这样重写它
-- test1-simd.lua local N=_SIMD_LEN local j; for k=1,N do packed(j,k) = k end local psum; for k=1,N do packed(psum,k)= 0 end local fi; for k=1,N do packed(fi,k) = k end local fs; for k=1,N do packed(fs,k) = N end for i=fi,2^28,fs do psum = psum + i end local sum = 0 for i=1,N do -- print('partial sum', i, packed(psum,i)) sum = sum + packed(psum,i) end print('sum:', sum)
SIMD 版本计算 N 个部分和(每个部分并行计算),然后将部分和相加。
以下是基准测试运行时间作为编译后的 _SIMD_LEN(N)值的函数。
Lua version run-time ---------------- -------- unpatched 8.095s SIMD patch N=1 9.265s SIMD patch N=2 5.660s SIMD patch N=3 3.566s SIMD patch N=4 3.248s SIMD patch N=8 2.688s SIMD patch N=32 2.406s SIMD patch N=256 2.529s SIMD patch N=1024 2.894s
使用 SIMD 补丁,N=1 有效地否定了补丁,这对于与未打补丁的版本进行比较很有用。当 N 设置为 32 时,运行时间减少了 70%,至少在这个非常简单的例子中是这样。
针对 Lua 5.1.4 的补丁在下面的附录 A 中。
对这个补丁的一个简单扩展是使用真正的 SSE 操作 [2](希望这将进一步提高性能)。
diff -ur lua-5.1.4/src/lbaselib.c lua-5.1.4-simd/src/lbaselib.c
--- lua-5.1.4/src/lbaselib.c 2008-02-14 11:46:22.000000000 -0500
+++ lua-5.1.4-simd/src/lbaselib.c 2009-01-01 22:43:22.671875000 -0500
@@ -642,6 +642,11 @@
lua_setfield(L, -2, "__mode"); /* metatable(w).__mode = "kv" */
lua_pushcclosure(L, luaB_newproxy, 1);
lua_setglobal(L, "newproxy"); /* set global `newproxy' */
+
+ /* SIMD-PATCH */
+ lua_pushnumber(L, LUA_SIMD_LEN);
+ lua_setglobal(L, "_SIMD_LEN");
+
}
diff -ur lua-5.1.4/src/lcode.c lua-5.1.4-simd/src/lcode.c
--- lua-5.1.4/src/lcode.c 2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lcode.c 2009-01-01 22:20:37.890625000 -0500
@@ -317,6 +317,16 @@
e->k = VRELOCABLE;
break;
}
+
+ /* SIMD-PATCH */
+ case VPACKED: {
+ freereg(fs, e->u.s.aux);
+ freereg(fs, e->u.s.info);
+ e->u.s.info = luaK_codeABC(fs, OP_GETPACKED, 0, e->u.s.info, e->u.s.aux);
+ e->k = VRELOCABLE;
+ break;
+ }
+
case VINDEXED: {
freereg(fs, e->u.s.aux);
freereg(fs, e->u.s.info);
@@ -486,6 +496,14 @@
luaK_codeABx(fs, OP_SETGLOBAL, e, var->u.s.info);
break;
}
+
+ /* SIMD-PATCH */
+ case VPACKED: {
+ int e = luaK_exp2RK(fs, ex);
+ luaK_codeABC(fs, OP_SETPACKED, var->u.s.info, var->u.s.aux, e);
+ break;
+ }
+
case VINDEXED: {
int e = luaK_exp2RK(fs, ex);
luaK_codeABC(fs, OP_SETTABLE, var->u.s.info, var->u.s.aux, e);
diff -ur lua-5.1.4/src/llex.c lua-5.1.4-simd/src/llex.c
--- lua-5.1.4/src/llex.c 2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.c 2009-01-01 22:20:17.531250000 -0500
@@ -37,7 +37,11 @@
const char *const luaX_tokens [] = {
"and", "break", "do", "else", "elseif",
"end", "false", "for", "function", "if",
- "in", "local", "nil", "not", "or", "repeat",
+ "in", "local", "nil", "not", "or",
+
+ "packed", /* SIMD-PATCH */
+
+ "repeat",
"return", "then", "true", "until", "while",
"..", "...", "==", ">=", "<=", "~=",
"<number>", "<name>", "<string>", "<eof>",
diff -ur lua-5.1.4/src/llex.h lua-5.1.4-simd/src/llex.h
--- lua-5.1.4/src/llex.h 2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.h 2009-01-01 22:21:41.859375000 -0500
@@ -25,7 +25,11 @@
/* terminal symbols denoted by reserved words */
TK_AND = FIRST_RESERVED, TK_BREAK,
TK_DO, TK_ELSE, TK_ELSEIF, TK_END, TK_FALSE, TK_FOR, TK_FUNCTION,
- TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, TK_REPEAT,
+ TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR,
+
+ TK_PACKED, /* SIMD-PATCH */
+
+ TK_REPEAT,
TK_RETURN, TK_THEN, TK_TRUE, TK_UNTIL, TK_WHILE,
/* other terminal symbols */
TK_CONCAT, TK_DOTS, TK_EQ, TK_GE, TK_LE, TK_NE, TK_NUMBER,
diff -ur lua-5.1.4/src/lobject.h lua-5.1.4-simd/src/lobject.h
--- lua-5.1.4/src/lobject.h 2008-08-06 09:29:48.000000000 -0400
+++ lua-5.1.4-simd/src/lobject.h 2009-01-01 22:42:06.609375000 -0500
@@ -70,8 +70,16 @@
#define TValuefields Value value; int tt
+/* SIMD-PATCH */
+typedef struct lua_TValueItem {
+ TValuefields;
+} TValueItem;
+
typedef struct lua_TValue {
TValuefields;
+
+ TValueItem more[LUA_SIMD_LEN - 1]; /* SIMD-PATCH */
+
} TValue;
@@ -102,6 +110,9 @@
#define l_isfalse(o) (ttisnil(o) || (ttisboolean(o) && bvalue(o) == 0))
+/* SIMD-PATCH */
+#define packeditem(o,i) ( (TValue*)((TValueItem*)(o) + (i)) )
+
/*
** for internal debug only
*/
diff -ur lua-5.1.4/src/lopcodes.c lua-5.1.4-simd/src/lopcodes.c
--- lua-5.1.4/src/lopcodes.c 2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.c 2009-01-01 22:19:43.640625000 -0500
@@ -52,6 +52,11 @@
"CLOSE",
"CLOSURE",
"VARARG",
+
+ /* SIMD-PATCH */
+ "GETPACKED",
+ "SETPACKED",
+
NULL
};
@@ -98,5 +103,10 @@
,opmode(0, 0, OpArgN, OpArgN, iABC) /* OP_CLOSE */
,opmode(0, 1, OpArgU, OpArgN, iABx) /* OP_CLOSURE */
,opmode(0, 1, OpArgU, OpArgN, iABC) /* OP_VARARG */
+
+ /* SIMD-PATCH */
+ ,opmode(0, 1, OpArgR, OpArgK, iABC) /* OP_GETPACKED */
+ ,opmode(0, 0, OpArgK, OpArgK, iABC) /* OP_SETPACKED */
+
};
diff -ur lua-5.1.4/src/lopcodes.h lua-5.1.4-simd/src/lopcodes.h
--- lua-5.1.4/src/lopcodes.h 2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.h 2009-01-01 22:23:40.531250000 -0500
@@ -205,10 +205,18 @@
OP_CLOSURE,/* A Bx R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n)) */
OP_VARARG/* A B R(A), R(A+1), ..., R(A+B-1) = vararg */
-} OpCode;
+
+/* SIMD-PATCH */
+,OP_GETPACKED/* A B C R(A) := packed(R(B), RK(C)) */
+,OP_SETPACKED/* A B C packed(R(A), RK(B)) := RK(C) */
+} OpCode;
+
+#if 0 /* SIMD-PATCH */
#define NUM_OPCODES (cast(int, OP_VARARG) + 1)
+#endif
+#define NUM_OPCODES (cast(int, OP_SETPACKED) + 1)
diff -ur lua-5.1.4/src/lparser.c lua-5.1.4-simd/src/lparser.c
--- lua-5.1.4/src/lparser.c 2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.c 2009-01-01 22:20:07.343750000 -0500
@@ -691,6 +691,22 @@
/* primaryexp ->
prefixexp { `.' NAME | `[' exp `]' | `:' NAME funcargs | funcargs } */
FuncState *fs = ls->fs;
+
+ /* SIMD-PATCH */
+ if (ls->t.token == TK_PACKED) {
+ expdesc t;
+ expdesc k;
+ luaX_next(ls);
+ checknext(ls, '(');
+ expr(ls, &t);
+ checknext(ls, ',');
+ expr(ls, &k);
+ checknext(ls, ')');
+ init_exp(v, VPACKED, luaK_exp2anyreg(fs, &t));
+ v->u.s.aux = luaK_exp2RK(fs, &k);
+ return;
+ }
+
prefixexp(ls, v);
for (;;) {
switch (ls->t.token) {
diff -ur lua-5.1.4/src/lparser.h lua-5.1.4-simd/src/lparser.h
--- lua-5.1.4/src/lparser.h 2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.h 2009-01-01 22:48:25.640625000 -0500
@@ -26,12 +26,17 @@
VLOCAL, /* info = local register */
VUPVAL, /* info = index of upvalue in `upvalues' */
VGLOBAL, /* info = index of table; aux = index of global name in `k' */
+
+ /* SIMD-PATCH */
+ VPACKED, /* info = local register; aux = index register (or `k') */
+
VINDEXED, /* info = table register; aux = index register (or `k') */
VJMP, /* info = instruction pc */
VRELOCABLE, /* info = instruction pc */
VNONRELOC, /* info = result register */
VCALL, /* info = instruction pc */
VVARARG /* info = instruction pc */
+
} expkind;
typedef struct expdesc {
diff -ur lua-5.1.4/src/luaconf.h lua-5.1.4-simd/src/luaconf.h
--- lua-5.1.4/src/luaconf.h 2008-02-11 11:25:08.000000000 -0500
+++ lua-5.1.4-simd/src/luaconf.h 2009-01-01 23:42:57.625000000 -0500
@@ -510,6 +510,12 @@
*/
#define LUAI_UACNUMBER double
+/* SIMD-PATCH */
+/*
+@@ LUA_SIMD_LEN is the number of values computed in parallel.
+** This is readable via the global _SIMD_LEN.
+*/
+#define LUA_SIMD_LEN 8
/*
@@ LUA_NUMBER_SCAN is the format for reading numbers.
diff -ur lua-5.1.4/src/lvm.c lua-5.1.4-simd/src/lvm.c
--- lua-5.1.4/src/lvm.c 2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lvm.c 2009-01-01 22:43:45.437500000 -0500
@@ -357,6 +357,7 @@
#define Protect(x) { L->savedpc = pc; {x;}; base = L->base; }
+#if 0 /* SIMDPATCH */
#define arith_op(op,tm) { \
TValue *rb = RKB(i); \
TValue *rc = RKC(i); \
@@ -367,6 +368,23 @@
else \
Protect(Arith(L, ra, rb, rc, tm)); \
}
+#endif
+#define arith_op(op,tm) { \
+ TValue *rb = RKB(i); \
+ TValue *rc = RKC(i); \
+ if (ttisnumber(rb) && ttisnumber(rc)) { \
+ int j; \
+ for (j=0; j < LUA_SIMD_LEN; j++) { \
+ TValue * rai = packeditem(ra,j); \
+ lua_Number nb = nvalue(rb), nc = nvalue(rc); \
+ setnvalue(rai, op(nb, nc)); \
+ (*(TValueItem**)(void*)&rb) ++; \
+ (*(TValueItem**)(void*)&rc) ++; \
+ } \
+ } \
+ else \
+ Protect(Arith(L, ra, rb, rc, tm)); \
+ }
@@ -401,7 +419,15 @@
lua_assert(L->top == L->ci->top || luaG_checkopenop(i));
switch (GET_OPCODE(i)) {
case OP_MOVE: {
+
+ #if 0 /* SIMD-PATCH */
setobjs2s(L, ra, RB(i));
+ #endif
+ int j;
+ for (j=0; j < LUA_SIMD_LEN; j++) {
+ setobjs2s(L, packeditem(ra,j), packeditem(RB(i),j));
+ }
+
continue;
}
case OP_LOADK: {
@@ -654,8 +680,21 @@
if (luai_numlt(0, step) ? luai_numle(idx, limit)
: luai_numle(limit, idx)) {
dojump(L, pc, GETARG_sBx(i)); /* jump back */
+
+ #if 0 /* SIMD-PATCH */
setnvalue(ra, idx); /* update internal index... */
setnvalue(ra+3, idx); /* ...and external index */
+ #endif
+ { int j;
+ for (j=0; j<LUA_SIMD_LEN; j++) {
+ TValue* ra_ = packeditem(ra, j);
+ TValue* step_ = packeditem(ra+2, j);
+ lua_Number idx_ = luai_numadd(nvalue(ra_), nvalue(step_));
+ setnvalue(ra_, idx_);
+ setnvalue(ra_+3, idx_);
+ }
+ }
+
}
continue;
}
@@ -670,7 +709,19 @@
luaG_runerror(L, LUA_QL("for") " limit must be a number");
else if (!tonumber(pstep, ra+2))
luaG_runerror(L, LUA_QL("for") " step must be a number");
+
+ #if 0 /* SIMD-PATCH */
setnvalue(ra, luai_numsub(nvalue(ra), nvalue(pstep)));
+ #endif
+ { int j;
+ for (j=0; j < LUA_SIMD_LEN; j++) {
+ setnvalue(packeditem(ra,j),
+ luai_numsub(nvalue(packeditem(ra,j)),
+ nvalue(packeditem(pstep,j))) );
+ }
+ }
+
+
dojump(L, pc, GETARG_sBx(i));
continue;
}
@@ -757,6 +808,26 @@
}
continue;
}
+
+ /* SIMD-PATCH */
+ case OP_GETPACKED: {
+ size_t off = (size_t)nvalue(RKC(i));
+ if (off < 1 || off > LUA_SIMD_LEN) {
+ luaG_runerror(L, "packed offset out of range");
+ }
+ setobjs2s(L, ra, (TValue*)((TValueItem*)RB(i) + off - 1) );
+ continue;
+ }
+ case OP_SETPACKED: {
+ size_t off = (size_t)nvalue(RKB(i));
+ if (off < 1 || off > LUA_SIMD_LEN) {
+ luaG_runerror(L, "packed offset out of range");
+ }
+ setobjs2s(L, (TValue*)((TValueItem*)ra + off - 1), RKC(i) );
+ continue;
+ }
+
+
}
}
}
diff -ur lua-5.1.4/src/print.c lua-5.1.4-simd/src/print.c
--- lua-5.1.4/src/print.c 2007-03-25 20:17:38.000000000 -0400
+++ lua-5.1.4-simd/src/print.c 2009-01-01 22:19:20.562500000 -0500
@@ -116,6 +116,9 @@
printf("\t; %s",svalue(&f->k[bx]));
break;
case OP_GETTABLE:
+
+ case OP_GETPACKED: /* SIMD-PATCH */
+
case OP_SELF:
if (ISK(c)) { printf("\t; "); PrintConstant(f,INDEXK(c)); }
break;
@@ -128,6 +131,9 @@
case OP_EQ:
case OP_LT:
case OP_LE:
+
+ case OP_SETPACKED: /* SIMD-PATCH */
+
if (ISK(b) || ISK(c))
{
printf("\t; ");