SIMD 实验

lua-users home
wiki

以下是对 Lua 的补丁,它在 Lua VM 中提供了一种单指令多数据 (SIMD) [1] 功能的实验性实现。

重要的是,这里的初始实现是 ANSI C,它不使用任何 SSE 指令或多线程。这怎么可能?Lua VM 中的操作码调度会带来不可忽略的开销。但是,如果我们解释每个操作码(指令)一次,并在多个数据元素上执行该操作码,我们预计可以降低操作码调度的相对开销,即使数据在每个操作码中是串行处理的。

当前的实现是一个实验性的原型,尚未完全实现。最初的目标仅仅是估计可能的性能提升。设计优先考虑实现的简单性。事实上,最简单的方法,需要最少的补丁,似乎是重新定义 TValue 类型以包含一个值数组。这显然会给非 SIMD 操作带来开销,但这对于现在的目标来说是可以接受的。(在 luaconf.h 中重新定义 lua_Number 类型可能是另一种比较简单的方法,但它也是一个全局更改。)更精细的实现可能会保留 TValue 的原样,并在 TValue 序列上定义 SIMD 操作,可能使用新的 SIMD 操作码,从而允许同时进行 SIMD 和非 SIMD 操作。

在这个实现中,每个 Lua 数字包含一定数量的子值(这个数字在全局变量 _SIMD_LEN 中定义,可以在 luaconf.h 中调整)。每个 Lua 数字中的单个子值可以使用“打包”运算符进行索引。从语法上讲,这个运算符在这里看起来像一个函数调用,但从语义上讲,它更像是在寄存器词法上的索引运算符,它有自己的操作码。

-- ex1.lua

-- If you don't use the SIMD features, numbers will behave
-- as usual.  This is because any SIMD unaware operations
-- (e.g. print) only use the first sub-value of a number. 
local x = 10
print(x + 50)  --> 60

-- Individual subvalues can be accessed with the packed operator:
assert(packed(x,1) == 10)  -- success
packed(x,2) = 20
assert(packed(x,1) == 10)
assert(packed(x,2) == 20)

local y; packed(y,1) = 100; packed(y,2) = 200;
local z = x + y     -- this is a SIMD instruction
assert(packed(z,1) == 110)
assert(packed(z,2) == 220)
print(z) --> 110   (print is not currently SIMD-aware)

print(_SIMD_LEN) --> prints some number (e.g. 2)

以下是这些操作码的样子。注意新的 GETPACKED 和 SETPACKED 指令。所有其他操作码保持不变。

$ ./src/luac -p -l ex1.lua

main <ex.lua:0,0> (51 instructions, 204 bytes at 0x682930)
0+ params, 5 slots, 0 upvalues, 3 locals, 12 constants, 0 functions
        1       [4]     LOADK           0 -1    ; 10
        2       [5]     GETGLOBAL       1 -2    ; print
        3       [5]     ADD             2 0 -3  ; - 50
        4       [5]     CALL            1 2 1
        5       [8]     GETGLOBAL       1 -4    ; assert
        6       [8]     GETPACKED       2 0 -5  ; 1
        7       [8]     EQ              1 2 -1  ; - 10
        8       [8]     JMP             1       ; to 10
        9       [8]     LOADBOOL        2 0 1
        10      [8]     LOADBOOL        2 1 0
        11      [8]     CALL            1 2 1
        12      [9]     SETPACKED       0 -6 -7 ; 2 20
        13      [10]    GETGLOBAL       1 -4    ; assert
        14      [10]    GETPACKED       2 0 -5  ; 1
        15      [10]    EQ              1 2 -1  ; - 10
        16      [10]    JMP             1       ; to 18
        17      [10]    LOADBOOL        2 0 1
        18      [10]    LOADBOOL        2 1 0
        19      [10]    CALL            1 2 1
        20      [11]    GETGLOBAL       1 -4    ; assert
        21      [11]    GETPACKED       2 0 -6  ; 2
        22      [11]    EQ              1 2 -7  ; - 20
        23      [11]    JMP             1       ; to 25
        24      [11]    LOADBOOL        2 0 1
        25      [11]    LOADBOOL        2 1 0
        26      [11]    CALL            1 2 1
        27      [13]    LOADNIL         1 1
        28      [13]    SETPACKED       1 -5 -8 ; 1 100
        29      [13]    SETPACKED       1 -6 -9 ; 2 200
        30      [14]    ADD             2 0 1
        31      [15]    GETGLOBAL       3 -4    ; assert
        32      [15]    GETPACKED       4 2 -5  ; 1
        33      [15]    EQ              1 4 -10 ; - 110
        34      [15]    JMP             1       ; to 36
        35      [15]    LOADBOOL        4 0 1
        36      [15]    LOADBOOL        4 1 0
        37      [15]    CALL            3 2 1
        38      [16]    GETGLOBAL       3 -4    ; assert
        39      [16]    GETPACKED       4 2 -6  ; 2
        40      [16]    EQ              1 4 -11 ; - 220
        41      [16]    JMP             1       ; to 43
        42      [16]    LOADBOOL        4 0 1
        43      [16]    LOADBOOL        4 1 0
        44      [16]    CALL            3 2 1
        45      [17]    GETGLOBAL       3 -2    ; print
        46      [17]    MOVE            4 2
        47      [17]    CALL            3 2 1
        48      [19]    GETGLOBAL       3 -2    ; print
        49      [19]    GETGLOBAL       4 -12   ; _SIMD_LEN
        50      [19]    CALL            3 2 1
        51      [19]    RETURN          0 1

现在,让我们做一些基准测试:将 1 到 2^28 的整数相加。在标准 Lua 中,它看起来像这样

-- test1-standard.lua
local sum = 0
for i=1,2^28 do sum = sum + i end
print(sum)

使用这个 SIMD 补丁,我们可以这样重写它

-- test1-simd.lua

local N=_SIMD_LEN

local j;    for k=1,N do packed(j,k)   = k end
local psum; for k=1,N do packed(psum,k)= 0 end
local fi;   for k=1,N do packed(fi,k)  = k end
local fs;   for k=1,N do packed(fs,k)  = N end

for i=fi,2^28,fs do
  psum = psum + i
end
local sum = 0
for i=1,N do
  -- print('partial sum', i, packed(psum,i))
  sum = sum + packed(psum,i)
end
print('sum:', sum)

SIMD 版本计算 N 个部分和(每个部分并行计算),然后将部分和相加。

以下是基准测试运行时间作为编译后的 _SIMD_LEN(N)值的函数。

Lua version       run-time
----------------  --------
unpatched         8.095s
SIMD patch N=1    9.265s
SIMD patch N=2    5.660s
SIMD patch N=3    3.566s
SIMD patch N=4    3.248s
SIMD patch N=8    2.688s
SIMD patch N=32   2.406s
SIMD patch N=256  2.529s
SIMD patch N=1024 2.894s

使用 SIMD 补丁,N=1 有效地否定了补丁,这对于与未打补丁的版本进行比较很有用。当 N 设置为 32 时,运行时间减少了 70%,至少在这个非常简单的例子中是这样。

针对 Lua 5.1.4 的补丁在下面的附录 A 中。

对这个补丁的一个简单扩展是使用真正的 SSE 操作 [2](希望这将进一步提高性能)。

--DavidManura

附录 A - Lua 补丁

diff -ur lua-5.1.4/src/lbaselib.c lua-5.1.4-simd/src/lbaselib.c
--- lua-5.1.4/src/lbaselib.c	2008-02-14 11:46:22.000000000 -0500
+++ lua-5.1.4-simd/src/lbaselib.c	2009-01-01 22:43:22.671875000 -0500
@@ -642,6 +642,11 @@
   lua_setfield(L, -2, "__mode");  /* metatable(w).__mode = "kv" */
   lua_pushcclosure(L, luaB_newproxy, 1);
   lua_setglobal(L, "newproxy");  /* set global `newproxy' */
+
+  /* SIMD-PATCH */
+  lua_pushnumber(L, LUA_SIMD_LEN);
+  lua_setglobal(L, "_SIMD_LEN");
+
 }
 
 
diff -ur lua-5.1.4/src/lcode.c lua-5.1.4-simd/src/lcode.c
--- lua-5.1.4/src/lcode.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lcode.c	2009-01-01 22:20:37.890625000 -0500
@@ -317,6 +317,16 @@
       e->k = VRELOCABLE;
       break;
     }
+
+    /* SIMD-PATCH */
+    case VPACKED: {
+      freereg(fs, e->u.s.aux);
+      freereg(fs, e->u.s.info);
+      e->u.s.info = luaK_codeABC(fs, OP_GETPACKED, 0, e->u.s.info, e->u.s.aux);
+      e->k = VRELOCABLE;
+      break;
+    }
+
     case VINDEXED: {
       freereg(fs, e->u.s.aux);
       freereg(fs, e->u.s.info);
@@ -486,6 +496,14 @@
       luaK_codeABx(fs, OP_SETGLOBAL, e, var->u.s.info);
       break;
     }
+
+    /* SIMD-PATCH */
+    case VPACKED: {
+      int e = luaK_exp2RK(fs, ex);
+      luaK_codeABC(fs, OP_SETPACKED, var->u.s.info, var->u.s.aux, e);
+      break;
+    }
+
     case VINDEXED: {
       int e = luaK_exp2RK(fs, ex);
       luaK_codeABC(fs, OP_SETTABLE, var->u.s.info, var->u.s.aux, e);
diff -ur lua-5.1.4/src/llex.c lua-5.1.4-simd/src/llex.c
--- lua-5.1.4/src/llex.c	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.c	2009-01-01 22:20:17.531250000 -0500
@@ -37,7 +37,11 @@
 const char *const luaX_tokens [] = {
     "and", "break", "do", "else", "elseif",
     "end", "false", "for", "function", "if",
-    "in", "local", "nil", "not", "or", "repeat",
+    "in", "local", "nil", "not", "or",
+
+    "packed", /* SIMD-PATCH */
+    
+    "repeat",
     "return", "then", "true", "until", "while",
     "..", "...", "==", ">=", "<=", "~=",
     "<number>", "<name>", "<string>", "<eof>",
diff -ur lua-5.1.4/src/llex.h lua-5.1.4-simd/src/llex.h
--- lua-5.1.4/src/llex.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/llex.h	2009-01-01 22:21:41.859375000 -0500
@@ -25,7 +25,11 @@
   /* terminal symbols denoted by reserved words */
   TK_AND = FIRST_RESERVED, TK_BREAK,
   TK_DO, TK_ELSE, TK_ELSEIF, TK_END, TK_FALSE, TK_FOR, TK_FUNCTION,
-  TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR, TK_REPEAT,
+  TK_IF, TK_IN, TK_LOCAL, TK_NIL, TK_NOT, TK_OR,
+  
+  TK_PACKED, /* SIMD-PATCH */
+  
+  TK_REPEAT,
   TK_RETURN, TK_THEN, TK_TRUE, TK_UNTIL, TK_WHILE,
   /* other terminal symbols */
   TK_CONCAT, TK_DOTS, TK_EQ, TK_GE, TK_LE, TK_NE, TK_NUMBER,
diff -ur lua-5.1.4/src/lobject.h lua-5.1.4-simd/src/lobject.h
--- lua-5.1.4/src/lobject.h	2008-08-06 09:29:48.000000000 -0400
+++ lua-5.1.4-simd/src/lobject.h	2009-01-01 22:42:06.609375000 -0500
@@ -70,8 +70,16 @@
 
 #define TValuefields	Value value; int tt
 
+/* SIMD-PATCH */
+typedef struct lua_TValueItem {
+  TValuefields;
+} TValueItem;
+
 typedef struct lua_TValue {
   TValuefields;
+
+  TValueItem more[LUA_SIMD_LEN - 1]; /* SIMD-PATCH */
+
 } TValue;
 
 
@@ -102,6 +110,9 @@
 
 #define l_isfalse(o)	(ttisnil(o) || (ttisboolean(o) && bvalue(o) == 0))
 
+/* SIMD-PATCH */
+#define packeditem(o,i) ( (TValue*)((TValueItem*)(o) + (i)) )
+
 /*
 ** for internal debug only
 */
diff -ur lua-5.1.4/src/lopcodes.c lua-5.1.4-simd/src/lopcodes.c
--- lua-5.1.4/src/lopcodes.c	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.c	2009-01-01 22:19:43.640625000 -0500
@@ -52,6 +52,11 @@
   "CLOSE",
   "CLOSURE",
   "VARARG",
+
+  /* SIMD-PATCH */
+  "GETPACKED",
+  "SETPACKED",
+
   NULL
 };
 
@@ -98,5 +103,10 @@
  ,opmode(0, 0, OpArgN, OpArgN, iABC)		/* OP_CLOSE */
  ,opmode(0, 1, OpArgU, OpArgN, iABx)		/* OP_CLOSURE */
  ,opmode(0, 1, OpArgU, OpArgN, iABC)		/* OP_VARARG */
+
+ /* SIMD-PATCH */
+ ,opmode(0, 1, OpArgR, OpArgK, iABC)		/* OP_GETPACKED */
+ ,opmode(0, 0, OpArgK, OpArgK, iABC)		/* OP_SETPACKED */
+
 };
 
diff -ur lua-5.1.4/src/lopcodes.h lua-5.1.4-simd/src/lopcodes.h
--- lua-5.1.4/src/lopcodes.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lopcodes.h	2009-01-01 22:23:40.531250000 -0500
@@ -205,10 +205,18 @@
 OP_CLOSURE,/*	A Bx	R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n))	*/
 
 OP_VARARG/*	A B	R(A), R(A+1), ..., R(A+B-1) = vararg		*/
-} OpCode;
+
+/* SIMD-PATCH */
+,OP_GETPACKED/*	A B C	R(A) := packed(R(B), RK(C))				*/
+,OP_SETPACKED/*	A B C	packed(R(A), RK(B)) := RK(C)				*/
 
 
+} OpCode;
+
+#if 0 /* SIMD-PATCH */
 #define NUM_OPCODES	(cast(int, OP_VARARG) + 1)
+#endif
+#define NUM_OPCODES	(cast(int, OP_SETPACKED) + 1)
 
 
 
diff -ur lua-5.1.4/src/lparser.c lua-5.1.4-simd/src/lparser.c
--- lua-5.1.4/src/lparser.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.c	2009-01-01 22:20:07.343750000 -0500
@@ -691,6 +691,22 @@
   /* primaryexp ->
         prefixexp { `.' NAME | `[' exp `]' | `:' NAME funcargs | funcargs } */
   FuncState *fs = ls->fs;
+
+  /* SIMD-PATCH */
+  if (ls->t.token == TK_PACKED) {
+    expdesc t;
+    expdesc k;
+    luaX_next(ls);
+    checknext(ls, '(');
+    expr(ls, &t);
+    checknext(ls, ',');
+    expr(ls, &k);
+    checknext(ls, ')');
+    init_exp(v, VPACKED, luaK_exp2anyreg(fs, &t));
+    v->u.s.aux = luaK_exp2RK(fs, &k);
+    return;
+  }
+
   prefixexp(ls, v);
   for (;;) {
     switch (ls->t.token) {
diff -ur lua-5.1.4/src/lparser.h lua-5.1.4-simd/src/lparser.h
--- lua-5.1.4/src/lparser.h	2007-12-27 08:02:25.000000000 -0500
+++ lua-5.1.4-simd/src/lparser.h	2009-01-01 22:48:25.640625000 -0500
@@ -26,12 +26,17 @@
   VLOCAL,	/* info = local register */
   VUPVAL,       /* info = index of upvalue in `upvalues' */
   VGLOBAL,	/* info = index of table; aux = index of global name in `k' */
+
+  /* SIMD-PATCH */
+  VPACKED,	/* info = local register; aux = index register (or `k') */
+
   VINDEXED,	/* info = table register; aux = index register (or `k') */
   VJMP,		/* info = instruction pc */
   VRELOCABLE,	/* info = instruction pc */
   VNONRELOC,	/* info = result register */
   VCALL,	/* info = instruction pc */
   VVARARG	/* info = instruction pc */
+
 } expkind;
 
 typedef struct expdesc {
diff -ur lua-5.1.4/src/luaconf.h lua-5.1.4-simd/src/luaconf.h
--- lua-5.1.4/src/luaconf.h	2008-02-11 11:25:08.000000000 -0500
+++ lua-5.1.4-simd/src/luaconf.h	2009-01-01 23:42:57.625000000 -0500
@@ -510,6 +510,12 @@
 */
 #define LUAI_UACNUMBER	double
 
+/* SIMD-PATCH */
+/*
+@@ LUA_SIMD_LEN is the number of values computed in parallel.
+** This is readable via the global _SIMD_LEN.
+*/
+#define LUA_SIMD_LEN 8
 
 /*
 @@ LUA_NUMBER_SCAN is the format for reading numbers.
diff -ur lua-5.1.4/src/lvm.c lua-5.1.4-simd/src/lvm.c
--- lua-5.1.4/src/lvm.c	2007-12-28 10:32:23.000000000 -0500
+++ lua-5.1.4-simd/src/lvm.c	2009-01-01 22:43:45.437500000 -0500
@@ -357,6 +357,7 @@
 #define Protect(x)	{ L->savedpc = pc; {x;}; base = L->base; }
 
 
+#if 0 /* SIMDPATCH */
 #define arith_op(op,tm) { \
         TValue *rb = RKB(i); \
         TValue *rc = RKC(i); \
@@ -367,6 +368,23 @@
         else \
           Protect(Arith(L, ra, rb, rc, tm)); \
       }
+#endif
+#define arith_op(op,tm) { \
+        TValue *rb = RKB(i); \
+        TValue *rc = RKC(i); \
+        if (ttisnumber(rb) && ttisnumber(rc)) { \
+          int j; \
+          for (j=0; j < LUA_SIMD_LEN; j++) { \
+            TValue * rai = packeditem(ra,j); \
+            lua_Number nb = nvalue(rb), nc = nvalue(rc); \
+            setnvalue(rai, op(nb, nc)); \
+            (*(TValueItem**)(void*)&rb) ++; \
+            (*(TValueItem**)(void*)&rc) ++; \
+          } \
+        } \
+        else \
+          Protect(Arith(L, ra, rb, rc, tm)); \
+      }
 
 
 
@@ -401,7 +419,15 @@
     lua_assert(L->top == L->ci->top || luaG_checkopenop(i));
     switch (GET_OPCODE(i)) {
       case OP_MOVE: {
+
+        #if 0 /* SIMD-PATCH */
         setobjs2s(L, ra, RB(i));
+        #endif
+        int j;
+        for (j=0; j < LUA_SIMD_LEN; j++) {
+          setobjs2s(L, packeditem(ra,j), packeditem(RB(i),j));
+        }
+
         continue;
       }
       case OP_LOADK: {
@@ -654,8 +680,21 @@
         if (luai_numlt(0, step) ? luai_numle(idx, limit)
                                 : luai_numle(limit, idx)) {
           dojump(L, pc, GETARG_sBx(i));  /* jump back */
+
+          #if 0 /* SIMD-PATCH */
           setnvalue(ra, idx);  /* update internal index... */
           setnvalue(ra+3, idx);  /* ...and external index */
+          #endif
+          { int j;
+            for (j=0; j<LUA_SIMD_LEN; j++) {
+              TValue* ra_ = packeditem(ra, j);
+              TValue* step_ = packeditem(ra+2, j);
+              lua_Number idx_ = luai_numadd(nvalue(ra_), nvalue(step_));
+              setnvalue(ra_, idx_);
+              setnvalue(ra_+3, idx_);
+            }
+          }
+
         }
         continue;
       }
@@ -670,7 +709,19 @@
           luaG_runerror(L, LUA_QL("for") " limit must be a number");
         else if (!tonumber(pstep, ra+2))
           luaG_runerror(L, LUA_QL("for") " step must be a number");
+
+        #if 0 /* SIMD-PATCH */
         setnvalue(ra, luai_numsub(nvalue(ra), nvalue(pstep)));
+        #endif
+        { int j;
+          for (j=0; j < LUA_SIMD_LEN; j++) {
+            setnvalue(packeditem(ra,j),
+              luai_numsub(nvalue(packeditem(ra,j)),
+                          nvalue(packeditem(pstep,j))) );
+          }
+        }
+
+
         dojump(L, pc, GETARG_sBx(i));
         continue;
       }
@@ -757,6 +808,26 @@
         }
         continue;
       }
+
+      /* SIMD-PATCH */
+      case OP_GETPACKED: {
+        size_t off = (size_t)nvalue(RKC(i));
+        if (off < 1 || off > LUA_SIMD_LEN) {
+          luaG_runerror(L, "packed offset out of range");
+        }
+        setobjs2s(L, ra, (TValue*)((TValueItem*)RB(i) + off - 1) );
+        continue;
+      }
+      case OP_SETPACKED: {
+        size_t off = (size_t)nvalue(RKB(i));
+        if (off < 1 || off > LUA_SIMD_LEN) {
+          luaG_runerror(L, "packed offset out of range");
+        }
+        setobjs2s(L, (TValue*)((TValueItem*)ra + off - 1), RKC(i) );
+        continue;
+      }
+
+
     }
   }
 }
diff -ur lua-5.1.4/src/print.c lua-5.1.4-simd/src/print.c
--- lua-5.1.4/src/print.c	2007-03-25 20:17:38.000000000 -0400
+++ lua-5.1.4-simd/src/print.c	2009-01-01 22:19:20.562500000 -0500
@@ -116,6 +116,9 @@
     printf("\t; %s",svalue(&f->k[bx]));
     break;
    case OP_GETTABLE:
+
+   case OP_GETPACKED: /* SIMD-PATCH */
+
    case OP_SELF:
     if (ISK(c)) { printf("\t; "); PrintConstant(f,INDEXK(c)); }
     break;
@@ -128,6 +131,9 @@
    case OP_EQ:
    case OP_LT:
    case OP_LE:
+
+   case OP_SETPACKED: /* SIMD-PATCH */
+
     if (ISK(b) || ISK(c))
     {
      printf("\t; ");

最近更改 · 偏好设置
编辑 · 历史记录
最后编辑于 2009 年 5 月 2 日凌晨 2:32 GMT (差异)