SIMD 实验 |
|
重要的是,这里的初始实现是 ANSI C,它不使用任何 SSE 指令或多线程。这怎么可能?Lua VM 中的操作码调度会带来不可忽略的开销。但是,如果我们解释每个操作码(指令)一次,并在多个数据元素上执行该操作码,我们预计可以降低操作码调度的相对开销,即使数据在每个操作码中是串行处理的。
当前的实现是一个实验性的原型,尚未完全实现。最初的目标仅仅是估计可能的性能提升。设计优先考虑实现的简单性。事实上,最简单的方法,需要最少的补丁,似乎是重新定义 TValue 类型以包含一个值数组。这显然会给非 SIMD 操作带来开销,但这对于现在的目标来说是可以接受的。(在 luaconf.h 中重新定义 lua_Number 类型可能是另一种比较简单的方法,但它也是一个全局更改。)更精细的实现可能会保留 TValue 的原样,并在 TValue 序列上定义 SIMD 操作,可能使用新的 SIMD 操作码,从而允许同时进行 SIMD 和非 SIMD 操作。
在这个实现中,每个 Lua 数字包含一定数量的子值(这个数字在全局变量 _SIMD_LEN 中定义,可以在 luaconf.h 中调整)。每个 Lua 数字中的单个子值可以使用“打包”运算符进行索引。从语法上讲,这个运算符在这里看起来像一个函数调用,但从语义上讲,它更像是在寄存器词法上的索引运算符,它有自己的操作码。
-- ex1.lua -- If you don't use the SIMD features, numbers will behave -- as usual. This is because any SIMD unaware operations -- (e.g. print) only use the first sub-value of a number. local x = 10 print(x + 50) --> 60 -- Individual subvalues can be accessed with the packed operator: assert(packed(x,1) == 10) -- success packed(x,2) = 20 assert(packed(x,1) == 10) assert(packed(x,2) == 20) local y; packed(y,1) = 100; packed(y,2) = 200; local z = x + y -- this is a SIMD instruction assert(packed(z,1) == 110) assert(packed(z,2) == 220) print(z) --> 110 (print is not currently SIMD-aware) print(_SIMD_LEN) --> prints some number (e.g. 2)
以下是这些操作码的样子。注意新的 GETPACKED 和 SETPACKED 指令。所有其他操作码保持不变。
$ ./src/luac -p -l ex1.lua main <ex.lua:0,0> (51 instructions, 204 bytes at 0x682930) 0+ params, 5 slots, 0 upvalues, 3 locals, 12 constants, 0 functions 1 [4] LOADK 0 -1 ; 10 2 [5] GETGLOBAL 1 -2 ; print 3 [5] ADD 2 0 -3 ; - 50 4 [5] CALL 1 2 1 5 [8] GETGLOBAL 1 -4 ; assert 6 [8] GETPACKED 2 0 -5 ; 1 7 [8] EQ 1 2 -1 ; - 10 8 [8] JMP 1 ; to 10 9 [8] LOADBOOL 2 0 1 10 [8] LOADBOOL 2 1 0 11 [8] CALL 1 2 1 12 [9] SETPACKED 0 -6 -7 ; 2 20 13 [10] GETGLOBAL 1 -4 ; assert 14 [10] GETPACKED 2 0 -5 ; 1 15 [10] EQ 1 2 -1 ; - 10 16 [10] JMP 1 ; to 18 17 [10] LOADBOOL 2 0 1 18 [10] LOADBOOL 2 1 0 19 [10] CALL 1 2 1 20 [11] GETGLOBAL 1 -4 ; assert 21 [11] GETPACKED 2 0 -6 ; 2 22 [11] EQ 1 2 -7 ; - 20 23 [11] JMP 1 ; to 25 24 [11] LOADBOOL 2 0 1 25 [11] LOADBOOL 2 1 0 26 [11] CALL 1 2 1 27 [13] LOADNIL 1 1 28 [13] SETPACKED 1 -5 -8 ; 1 100 29 [13] SETPACKED 1 -6 -9 ; 2 200 30 [14] ADD 2 0 1 31 [15] GETGLOBAL 3 -4 ; assert 32 [15] GETPACKED 4 2 -5 ; 1 33 [15] EQ 1 4 -10 ; - 110 34 [15] JMP 1 ; to 36 35 [15] LOADBOOL 4 0 1 36 [15] LOADBOOL 4 1 0 37 [15] CALL 3 2 1 38 [16] GETGLOBAL 3 -4 ; assert 39 [16] GETPACKED 4 2 -6 ; 2 40 [16] EQ 1 4 -11 ; - 220 41 [16] JMP 1 ; to 43 42 [16] LOADBOOL 4 0 1 43 [16] LOADBOOL 4 1 0 44 [16] CALL 3 2 1 45 [17] GETGLOBAL 3 -2 ; print 46 [17] MOVE 4 2 47 [17] CALL 3 2 1 48 [19] GETGLOBAL 3 -2 ; print 49 [19] GETGLOBAL 4 -12 ; _SIMD_LEN 50 [19] CALL 3 2 1 51 [19] RETURN 0 1
现在,让我们做一些基准测试:将 1 到 2^28 的整数相加。在标准 Lua 中,它看起来像这样
-- test1-standard.lua local sum = 0 for i=1,2^28 do sum = sum + i end print(sum)
使用这个 SIMD 补丁,我们可以这样重写它
-- test1-simd.lua local N=_SIMD_LEN local j; for k=1,N do packed(j,k) = k end local psum; for k=1,N do packed(psum,k)= 0 end local fi; for k=1,N do packed(fi,k) = k end local fs; for k=1,N do packed(fs,k) = N end for i=fi,2^28,fs do psum = psum + i end local sum = 0 for i=1,N do -- print('partial sum', i, packed(psum,i)) sum = sum + packed(psum,i) end print('sum:', sum)
SIMD 版本计算 N 个部分和(每个部分并行计算),然后将部分和相加。
以下是基准测试运行时间作为编译后的 _SIMD_LEN(N)值的函数。
Lua version run-time ---------------- -------- unpatched 8.095s SIMD patch N=1 9.265s SIMD patch N=2 5.660s SIMD patch N=3 3.566s SIMD patch N=4 3.248s SIMD patch N=8 2.688s SIMD patch N=32 2.406s SIMD patch N=256 2.529s SIMD patch N=1024 2.894s
使用 SIMD 补丁,N=1 有效地否定了补丁,这对于与未打补丁的版本进行比较很有用。当 N 设置为 32 时,运行时间减少了 70%,至少在这个非常简单的例子中是这样。
针对 Lua 5.1.4 的补丁在下面的附录 A 中。
对这个补丁的一个简单扩展是使用真正的 SSE 操作 [2](希望这将进一步提高性能)。
diff -ur lua-5.1.4/src/lbaselib.c lua-5.1.4-simd/src/lbaselib.c --- lua-5.1.4/src/lbaselib.c 2008-02-14 11:46:22.000000000 -0500 +++ lua-5.1.4-simd/src/lbaselib.c 2009-01-01 22:43:22.671875000 -0500 @@ -642,6 +642,11 @@ lua_setfield(L, -2, "__mode"); /* metatable(w).__mode = "kv" */ lua_pushcclosure(L, luaB_newproxy, 1); lua_setglobal(L, "newproxy"); /* set global `newproxy' */ + + /* SIMD-PATCH */ + lua_pushnumber(L, LUA_SIMD_LEN); + lua_setglobal(L, "_SIMD_LEN"); + } diff -ur lua-5.1.4/src/lcode.c lua-5.1.4-simd/src/lcode.c --- lua-5.1.4/src/lcode.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lcode.c 2009-01-01 22:20:37.890625000 -0500 @@ -317,6 +317,16 @@ e->k = VRELOCABLE; break; } + + /* SIMD-PATCH */ + case VPACKED: { + freereg(fs, e->u.s.aux); + freereg(fs, e->u.s.info); + e->u.s.info = luaK_codeABC(fs, OP_GETPACKED, 0, e->u.s.info, e->u.s.aux); + e->k = VRELOCABLE; + break; + } + case VINDEXED: { freereg(fs, e->u.s.aux); freereg(fs, e->u.s.info); @@ -486,6 +496,14 @@ luaK_codeABx(fs, OP_SETGLOBAL, e, var->u.s.info); break; } + + /* SIMD-PATCH */ + case VPACKED: { + int e = luaK_exp2RK(fs, ex); + luaK_codeABC(fs, OP_SETPACKED, var->u.s.info, var->u.s.aux, e); + break; + } + case VINDEXED: { int e = luaK_exp2RK(fs, ex); luaK_codeABC(fs, OP_SETTABLE, var->u.s.info, var->u.s.aux, e); diff -ur lua-5.1.4/src/llex.c lua-5.1.4-simd/src/llex.c --- lua-5.1.4/src/llex.c 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/llex.c 2009-01-01 22:20:17.531250000 -0500 @@ -37,7 +37,11 @@ const char *const luaX_tokens [] = { "and", "break", "do", "else", "elseif", "end", "false", "for", "function", "if", - "in", "local", "nil", "not", "or", "repeat", + "in", "local", "nil", "not", "or", + + "packed", /* SIMD-PATCH */ + + "repeat", "return", "then", "true", "until", "while", "..", "...", "==", ">=", "<=", "~=", "<number>", "<name>", "<string>", "<eof>", diff -ur lua-5.1.4/src/llex.h lua-5.1.4-simd/src/llex.h --- lua-5.1.4/src/llex.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/llex.h 2009-01-01 22:21:41.859375000 -0500 @@ -25,7 +25,11 @@ /* terminal symbols denoted by reserved words */ TK_AND = FIRST_RESERVED, TK_BREAK, TK_DO, TK_ELSE, TK_ELSEIF, TK_END, TK_FALSE, TK_FOR, TK_FUNCTION, - TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, TK_REPEAT, + TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, + + TK_PACKED, /* SIMD-PATCH */ + + TK_REPEAT, TK_RETURN, TK_THEN, TK_TRUE, TK_UNTIL, TK_WHILE, /* other terminal symbols */ TK_CONCAT, TK_DOTS, TK_EQ, TK_GE, TK_LE, TK_NE, TK_NUMBER, diff -ur lua-5.1.4/src/lobject.h lua-5.1.4-simd/src/lobject.h --- lua-5.1.4/src/lobject.h 2008-08-06 09:29:48.000000000 -0400 +++ lua-5.1.4-simd/src/lobject.h 2009-01-01 22:42:06.609375000 -0500 @@ -70,8 +70,16 @@ #define TValuefields Value value; int tt +/* SIMD-PATCH */ +typedef struct lua_TValueItem { + TValuefields; +} TValueItem; + typedef struct lua_TValue { TValuefields; + + TValueItem more[LUA_SIMD_LEN - 1]; /* SIMD-PATCH */ + } TValue; @@ -102,6 +110,9 @@ #define l_isfalse(o) (ttisnil(o) || (ttisboolean(o) && bvalue(o) == 0)) +/* SIMD-PATCH */ +#define packeditem(o,i) ( (TValue*)((TValueItem*)(o) + (i)) ) + /* ** for internal debug only */ diff -ur lua-5.1.4/src/lopcodes.c lua-5.1.4-simd/src/lopcodes.c --- lua-5.1.4/src/lopcodes.c 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lopcodes.c 2009-01-01 22:19:43.640625000 -0500 @@ -52,6 +52,11 @@ "CLOSE", "CLOSURE", "VARARG", + + /* SIMD-PATCH */ + "GETPACKED", + "SETPACKED", + NULL }; @@ -98,5 +103,10 @@ ,opmode(0, 0, OpArgN, OpArgN, iABC) /* OP_CLOSE */ ,opmode(0, 1, OpArgU, OpArgN, iABx) /* OP_CLOSURE */ ,opmode(0, 1, OpArgU, OpArgN, iABC) /* OP_VARARG */ + + /* SIMD-PATCH */ + ,opmode(0, 1, OpArgR, OpArgK, iABC) /* OP_GETPACKED */ + ,opmode(0, 0, OpArgK, OpArgK, iABC) /* OP_SETPACKED */ + }; diff -ur lua-5.1.4/src/lopcodes.h lua-5.1.4-simd/src/lopcodes.h --- lua-5.1.4/src/lopcodes.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lopcodes.h 2009-01-01 22:23:40.531250000 -0500 @@ -205,10 +205,18 @@ OP_CLOSURE,/* A Bx R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n)) */ OP_VARARG/* A B R(A), R(A+1), ..., R(A+B-1) = vararg */ -} OpCode; + +/* SIMD-PATCH */ +,OP_GETPACKED/* A B C R(A) := packed(R(B), RK(C)) */ +,OP_SETPACKED/* A B C packed(R(A), RK(B)) := RK(C) */ +} OpCode; + +#if 0 /* SIMD-PATCH */ #define NUM_OPCODES (cast(int, OP_VARARG) + 1) +#endif +#define NUM_OPCODES (cast(int, OP_SETPACKED) + 1) diff -ur lua-5.1.4/src/lparser.c lua-5.1.4-simd/src/lparser.c --- lua-5.1.4/src/lparser.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lparser.c 2009-01-01 22:20:07.343750000 -0500 @@ -691,6 +691,22 @@ /* primaryexp -> prefixexp { `.' NAME | `[' exp `]' | `:' NAME funcargs | funcargs } */ FuncState *fs = ls->fs; + + /* SIMD-PATCH */ + if (ls->t.token == TK_PACKED) { + expdesc t; + expdesc k; + luaX_next(ls); + checknext(ls, '('); + expr(ls, &t); + checknext(ls, ','); + expr(ls, &k); + checknext(ls, ')'); + init_exp(v, VPACKED, luaK_exp2anyreg(fs, &t)); + v->u.s.aux = luaK_exp2RK(fs, &k); + return; + } + prefixexp(ls, v); for (;;) { switch (ls->t.token) { diff -ur lua-5.1.4/src/lparser.h lua-5.1.4-simd/src/lparser.h --- lua-5.1.4/src/lparser.h 2007-12-27 08:02:25.000000000 -0500 +++ lua-5.1.4-simd/src/lparser.h 2009-01-01 22:48:25.640625000 -0500 @@ -26,12 +26,17 @@ VLOCAL, /* info = local register */ VUPVAL, /* info = index of upvalue in `upvalues' */ VGLOBAL, /* info = index of table; aux = index of global name in `k' */ + + /* SIMD-PATCH */ + VPACKED, /* info = local register; aux = index register (or `k') */ + VINDEXED, /* info = table register; aux = index register (or `k') */ VJMP, /* info = instruction pc */ VRELOCABLE, /* info = instruction pc */ VNONRELOC, /* info = result register */ VCALL, /* info = instruction pc */ VVARARG /* info = instruction pc */ + } expkind; typedef struct expdesc { diff -ur lua-5.1.4/src/luaconf.h lua-5.1.4-simd/src/luaconf.h --- lua-5.1.4/src/luaconf.h 2008-02-11 11:25:08.000000000 -0500 +++ lua-5.1.4-simd/src/luaconf.h 2009-01-01 23:42:57.625000000 -0500 @@ -510,6 +510,12 @@ */ #define LUAI_UACNUMBER double +/* SIMD-PATCH */ +/* +@@ LUA_SIMD_LEN is the number of values computed in parallel. +** This is readable via the global _SIMD_LEN. +*/ +#define LUA_SIMD_LEN 8 /* @@ LUA_NUMBER_SCAN is the format for reading numbers. diff -ur lua-5.1.4/src/lvm.c lua-5.1.4-simd/src/lvm.c --- lua-5.1.4/src/lvm.c 2007-12-28 10:32:23.000000000 -0500 +++ lua-5.1.4-simd/src/lvm.c 2009-01-01 22:43:45.437500000 -0500 @@ -357,6 +357,7 @@ #define Protect(x) { L->savedpc = pc; {x;}; base = L->base; } +#if 0 /* SIMDPATCH */ #define arith_op(op,tm) { \ TValue *rb = RKB(i); \ TValue *rc = RKC(i); \ @@ -367,6 +368,23 @@ else \ Protect(Arith(L, ra, rb, rc, tm)); \ } +#endif +#define arith_op(op,tm) { \ + TValue *rb = RKB(i); \ + TValue *rc = RKC(i); \ + if (ttisnumber(rb) && ttisnumber(rc)) { \ + int j; \ + for (j=0; j < LUA_SIMD_LEN; j++) { \ + TValue * rai = packeditem(ra,j); \ + lua_Number nb = nvalue(rb), nc = nvalue(rc); \ + setnvalue(rai, op(nb, nc)); \ + (*(TValueItem**)(void*)&rb) ++; \ + (*(TValueItem**)(void*)&rc) ++; \ + } \ + } \ + else \ + Protect(Arith(L, ra, rb, rc, tm)); \ + } @@ -401,7 +419,15 @@ lua_assert(L->top == L->ci->top || luaG_checkopenop(i)); switch (GET_OPCODE(i)) { case OP_MOVE: { + + #if 0 /* SIMD-PATCH */ setobjs2s(L, ra, RB(i)); + #endif + int j; + for (j=0; j < LUA_SIMD_LEN; j++) { + setobjs2s(L, packeditem(ra,j), packeditem(RB(i),j)); + } + continue; } case OP_LOADK: { @@ -654,8 +680,21 @@ if (luai_numlt(0, step) ? luai_numle(idx, limit) : luai_numle(limit, idx)) { dojump(L, pc, GETARG_sBx(i)); /* jump back */ + + #if 0 /* SIMD-PATCH */ setnvalue(ra, idx); /* update internal index... */ setnvalue(ra+3, idx); /* ...and external index */ + #endif + { int j; + for (j=0; j<LUA_SIMD_LEN; j++) { + TValue* ra_ = packeditem(ra, j); + TValue* step_ = packeditem(ra+2, j); + lua_Number idx_ = luai_numadd(nvalue(ra_), nvalue(step_)); + setnvalue(ra_, idx_); + setnvalue(ra_+3, idx_); + } + } + } continue; } @@ -670,7 +709,19 @@ luaG_runerror(L, LUA_QL("for") " limit must be a number"); else if (!tonumber(pstep, ra+2)) luaG_runerror(L, LUA_QL("for") " step must be a number"); + + #if 0 /* SIMD-PATCH */ setnvalue(ra, luai_numsub(nvalue(ra), nvalue(pstep))); + #endif + { int j; + for (j=0; j < LUA_SIMD_LEN; j++) { + setnvalue(packeditem(ra,j), + luai_numsub(nvalue(packeditem(ra,j)), + nvalue(packeditem(pstep,j))) ); + } + } + + dojump(L, pc, GETARG_sBx(i)); continue; } @@ -757,6 +808,26 @@ } continue; } + + /* SIMD-PATCH */ + case OP_GETPACKED: { + size_t off = (size_t)nvalue(RKC(i)); + if (off < 1 || off > LUA_SIMD_LEN) { + luaG_runerror(L, "packed offset out of range"); + } + setobjs2s(L, ra, (TValue*)((TValueItem*)RB(i) + off - 1) ); + continue; + } + case OP_SETPACKED: { + size_t off = (size_t)nvalue(RKB(i)); + if (off < 1 || off > LUA_SIMD_LEN) { + luaG_runerror(L, "packed offset out of range"); + } + setobjs2s(L, (TValue*)((TValueItem*)ra + off - 1), RKC(i) ); + continue; + } + + } } } diff -ur lua-5.1.4/src/print.c lua-5.1.4-simd/src/print.c --- lua-5.1.4/src/print.c 2007-03-25 20:17:38.000000000 -0400 +++ lua-5.1.4-simd/src/print.c 2009-01-01 22:19:20.562500000 -0500 @@ -116,6 +116,9 @@ printf("\t; %s",svalue(&f->k[bx])); break; case OP_GETTABLE: + + case OP_GETPACKED: /* SIMD-PATCH */ + case OP_SELF: if (ISK(c)) { printf("\t; "); PrintConstant(f,INDEXK(c)); } break; @@ -128,6 +131,9 @@ case OP_EQ: case OP_LT: case OP_LE: + + case OP_SETPACKED: /* SIMD-PATCH */ + if (ISK(b) || ISK(c)) { printf("\t; ");