Lpeg Tutorial

Simple Matching

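Lpeg is a powerful notation for matching text data, and it is more capable than Lua string patterns and standard regular expressions. However, as with any language, you need to know the basic words and how to combine them.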

The best way to learn is to play with patterns in an interactive session, first defining some shortcuts:

$ lua -llpeg
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> match = lpeg.match -- match a pattern against a string
> P = lpeg.P -- match a string literally
> S = lpeg.S  -- match anything in a set
> R = lpeg.R  -- match anything in a range

If you do not want to create the shortcuts by hand, you can do this:

> setmetatable(_ENV or _G, { __index = lpeg or require"lpeg" })

I do not recommend doing this in serious code, but it is very convenient for exploring Lpeg.

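Matching happens at the start of the string; a successful match returns the position just after the matched text, and a failed match returns nil. (Here I am relying on f'x' being equivalent to f('x') in Lua; single quotes mean the same as double quotes.)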

> = match(P'a','aaa')
2
> = match(P'a','123')
nil

It works like string.find, except that it is anchored at the start of the string and returns only one index.
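
Because match is anchored, searching inside a string needs an explicit skip. The following is a minimal sketch of one way to get string.find-like behaviour. The helper name find is made up for illustration, and lpeg.Cp (a position capture) is a standard Lpeg function that this tutorial does not otherwise use.

```lua
local lpeg = require "lpeg"
local P, Cp = lpeg.P, lpeg.Cp

-- Hypothetical helper: skip characters until patt matches, then return
-- the start position and the position just after the match.
local function find(patt, subject)
  patt = P(patt)
  local search = (1 - patt)^0 * Cp() * patt * Cp()
  return search:match(subject)
end

print(find("dog", "hello dog"))   --> 7    10
print(find("cat", "hello dog"))   --> nil
```

Unlike string.find, the second number is the position after the match, which is consistent with what match itself returns.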

You can match against ranges or sets of characters:

> = match(R'09','123')
2
> =  match(S'123','123')
2

Matching more than one item is done with the ^ operator. In this case the match is equivalent to the Lua pattern '^a+': one or more occurrences of 'a'.

> = match(P'a'^1,'aaa')
4

Combining patterns in sequence is done with the * operator. This is equivalent to '^ab*': one 'a' followed by zero or more 'b's.

> = match(P'a'*P'b'^0,'abbc')
4

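So far, lpeg has given us a more verbose way to express regular expressions, but these patterns are composable: they can easily be built up from simpler patterns, without awkward string operations. In this way, lpeg patterns can be made easier to read than the equivalent regular expressions. Note that when constructing patterns you can often leave out an explicit P call, as long as one of the arguments is already a pattern: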

> maybe_a = P'a'^-1  -- one or zero matches of 'a'
> match_ab = maybe_a * 'b'
> = match(match_ab, 'ab')
3
> =  match(match_ab, 'b')
2
> =  match(match_ab, 'aaab')
nil

The + operator means either one pattern or the other:

> either_ab = (P'a' + P'b')^1 -- sequence of either 'a' or 'b'
> = either_ab:match 'aaa'
4
> =  either_ab:match 'bbaa'
5

Note that pattern objects have a match method!

Of course, S'ab'^1 would be a shorter way to say this, but the arguments here can be arbitrary patterns.
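
One detail worth seeing in code is that + is an ordered choice: the alternatives are tried left to right and the first one that matches wins, even if a later one would match more text. A small sketch, with pattern names chosen only for this example:

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local short_first = P"a" + P"ab"   -- 'a' succeeds, so 'ab' is never tried
local long_first  = P"ab" + P"a"   -- try the longer alternative first

print(short_first:match("ab"))     --> 2
print(long_first:match("ab"))      --> 3
```

When one alternative is a prefix of another, putting the longer one first is usually what you want.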

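Basic Captures

Getting the index after a match is all very well, and you can then use string.sub to extract the string. But there are ways to ask for captures explicitly: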

> C = lpeg.C  -- captures a match
> Ct = lpeg.Ct -- a table with all captures from the pattern

The first is equivalent to '(...)' in Lua patterns (or '\(...\)' in basic regular expressions):

> digit = R'09' -- anything from '0' to '9'
> digits = digit^1 -- a sequence of at least one digit
> cdigits= C(digits)  -- capture digits
> = cdigits:match '123'
123

So, to get the string value, enclose the pattern in C.

This pattern does not cover a general integer, which may have a '+' or '-' in front of it.

> int = S'+-'^-1 * digits
> = match(C(int),'+23')
+23

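Unlike with Lua patterns or regular expressions, you do not have to worry about escaping 'magic' characters: every character in a string stands for itself, so '(', '+', '*' and so on are just literal characters.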

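The / operator provides a special kind of capture: it passes the captured string through a function or a table. Here I add 1 to the result, just to show that it has been converted to a number by tonumber: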

> =  match(int/tonumber,'+123') + 1
124

Note that, as with string.match, a match can return multiple captures. This is equivalent to '^(a+)(b+)':

> = match(C(P'a'^1) * C(P'b'^1), 'aabbbb')
aa	bbbb
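
To see several kinds of capture working together, here is a small sketch that parses a name=integer pair. The names pair and sign and the input strings are invented for the example. It also shows the table form of /, which looks the captured string up as a key.

```lua
local lpeg = require "lpeg"
local C, R, S = lpeg.C, lpeg.R, lpeg.S

local digits = R"09"^1
local int = S"+-"^-1 * digits
local name = R("az", "AZ")^1

-- "name=int": capture the name as a string and the value as a number
local pair = C(name) * "=" * (int / tonumber)
local key, value = pair:match("width=-12")
print(key, value + 1)          --> width    -11

-- a table on the right of / is indexed with the captured string
local sign = C(S"+-") / { ["+"] = "plus", ["-"] = "minus" }
print(sign:match("-"))         --> minus
```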

Building More Complicated Patterns

Consider a general floating-point number:

> function maybe(p) return p^-1 end
> digits = R'09'^1
> mpm = maybe(S'+-')
> dot = '.'
> exp = S'eE'
> float = mpm * digits * maybe(dot*digits) * maybe(exp*mpm*digits)
> = match(C(float),'2.3')
2.3
> = match(C(float),'-2')
-2
> = match(C(float),'2e-02')
2e-02

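This lpeg pattern is easier to read than the regular expression equivalent '[-+]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?'; shorter is not always better! One reason is that we can work with patterns as expressions: factor out common patterns, write functions for convenience and clarity, and so on. Note that there is no penalty for writing things out in this fashion; lpeg remains a very fast way to parse text!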

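More complicated structures can be composed from these building blocks. Consider the task of parsing a list of floating-point numbers. A list is a number followed by zero or more groups, each consisting of a comma and a number: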

> listf = C(float) * (',' * C(float))^0
> = listf:match '2,3,4'
2	3	4

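This is cool, but it would be even cooler to have this as an actual list. This is where lpeg.Ct comes in; it collects all the captures within a pattern into a table.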

> = match(Ct(listf),'1,2,3')
table: 0x84fe628

Stock Lua does not pretty-print tables, but you can use Microlight for this job:

> tostring = require 'ml'.tstring
> = match(Ct(listf),'1,2,3')
{"1","2","3"}

These values are still strings. It is better to write listf so that it converts its captures:

> floatc = float/tonumber
> listf = floatc * (',' * floatc)^0

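This way of capturing lists is very general, since you can put any capturing expression in the place of floatc. But this list pattern is still too restrictive, because generally we want to ignore whitespace: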

> sp = P' '^0  -- zero or more spaces (like '%s*')
> function space(pat) return sp * pat * sp end -- surround a pattern with optional space
> floatc = space(float/tonumber) 
> listc = floatc * (',' * floatc)^0
> =  match(Ct(listc),' 1,2, 3')
{1,2,3}

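It is a matter of taste, but I prefer to allow optional space around the items, rather than allowing space specifically around the delimiter ','.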
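
Outside the interactive prompt, the same list pattern might be written as the following standalone script; this is a sketch using local names rather than the globals of the session. It also illustrates a point the session does not show: a match only has to cover a prefix of the subject, so appending -1 (which succeeds only at the end of the subject) is one way to reject trailing junk.

```lua
local lpeg = require "lpeg"
local P, R, S, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.Ct

local function maybe(p) return p^-1 end
local digits = R"09"^1
local mpm = maybe(S"+-")
local float = mpm * digits * maybe("." * digits) * maybe(S"eE" * mpm * digits)

local sp = P" "^0
local floatc = sp * (float / tonumber) * sp
local listc = Ct(floatc * ("," * floatc)^0)

local t = listc:match(" 1, 2.5 ,-3e2")
print(#t, t[2])                       --> 3    2.5

-- A match only has to cover a prefix of the subject...
print(#listc:match("1,2,oops"))       --> 2
-- ...so append -1 (end of subject) to insist on consuming everything.
local strict = listc * -1
print(strict:match("1,2,oops"))       --> nil
```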

With lpeg we can be programmers again when pattern matching, and reuse patterns:

function list(pat)
    pat = space(pat)
    return pat * (',' * pat)^0
end

So, a list of identifiers (according to the usual rules):

> idenchar = R('AZ','az')+P'_'
> iden = idenchar * (idenchar+R'09')^0
> =  list(C(iden)):match 'hello, dolly, _x, s23'
"hello"	"dolly"	"_x"	"s23"

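Using explicit ranges seems old-fashioned and error-prone. A more portable solution is to use the lpeg equivalent of character classes, which lpeg.locale builds according to the current locale: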

> l = {}
> lpeg.locale(l)
> for k in pairs(l) do print(k) end
"punct"
"alpha"
"alnum"
"digit"
"graph"
"xdigit"
"upper"
"space"
"print"
"cntrl"
"lower"
> iden =  (l.alpha+P'_') * (l.alnum+P'_')^0
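
As a quick check of the locale-based identifier pattern, here is a minimal sketch. It relies on lpeg.locale returning a fresh table when called without an argument, and the test strings are invented for the example.

```lua
local lpeg = require "lpeg"

local l = lpeg.locale()   -- with no argument, a new table of classes is returned
local iden = (l.alpha + "_") * (l.alnum + "_")^0

print(iden:match("_x1 = 2"))   --> 4   (matched '_x1')
print(iden:match("9lives"))    --> nil (cannot start with a digit)
```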

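Given this definition of list, it is easy to define a simple subset of the common CSV format, where each record is a list and records are separated by newlines: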

> listf =  list(float/tonumber)
> csv = Ct( (Ct(listf)+'\n')^1 )
> =  csv:match '1,2.3,3\n10,20, 30\n'
{{1,2.3,3},{10,20,30}}
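
For reference, the pieces built up in this section can be gathered into one self-contained script. This is a sketch that reuses the definitions above with local names; record is a name introduced here for the per-line pattern.

```lua
local lpeg = require "lpeg"
local P, R, S, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.Ct

local function maybe(p) return p^-1 end
local digits = R"09"^1
local mpm = maybe(S"+-")
local float = mpm * digits * maybe("." * digits) * maybe(S"eE" * mpm * digits)

local sp = P" "^0
local function list(pat)
  pat = sp * pat * sp              -- allow optional space around each item
  return pat * ("," * pat)^0
end

local record = Ct(list(float / tonumber))   -- one line -> one table of numbers
local csv = Ct((record + "\n")^1)           -- records or bare newlines, repeated

local rows = csv:match("1,2.3,3\n10,20, 30\n")
print(#rows, #rows[1], rows[2][3])          --> 2    3    30
```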

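A good reason to learn lpeg is that it performs very well. This pattern is typically much faster than parsing the same data with Lua string matching.

String Substitution

I will show that lpeg can do everything string.gsub can do, and is more general and flexible.

One operator we have not used yet is -, which means 'but not': p1 - p2 matches p1 only where p2 does not match. Consider the problem of matching double-quoted strings. In the simplest case, they are a double quote followed by any characters which are not a double quote, followed by a closing double quote. P(1) matches any single character, i.e. it is the equivalent of '.' in string patterns. A string may be empty, so we match zero or more non-quote characters: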

> Q = P'"'
> str = Q * (P(1) - Q)^0 * Q
> = C(str):match '"hello"'
"\"hello\""

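Or you may want to extract the contents of the string, without the quotes. In this case, using just 1 instead of P(1) is not ambiguous, and in fact this is how you will usually see this 'any x which is not a P' pattern written: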

> str2 = Q * C((1 - Q)^0) * Q
> = str2:match '"hello"'
"hello"

This pattern is clearly generalizable; often the terminating pattern is not the same as the opening pattern:

function extract_quote(openp,endp)
    openp = P(openp)
    endp = endp and P(endp) or openp
    local upto_endp = (1 - endp)^1 
    return openp * C(upto_endp) * endp
end

> return  extract_quote('(',')'):match '(and more)'
"and more"
> = extract_quote('[[',']]'):match '[[long string]]'
"long string"

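Now consider translating Markdown code (backtick-enclosed text) into the format understood by the Lua wiki (double-brace enclosed text). The naive way is to extract the string and concatenate the result, but this is clumsy and (as we will see) limits our options tremendously.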

function subst(openp,repl,endp)
    openp = P(openp)
    endp = endp and P(endp) or openp
    local upto_endp = (1 - endp)^1 
    return openp * C(upto_endp)/repl * endp
end

> =  subst('`','{{%1}}'):match '`code`'
"{{code}}"
> =  subst('_',"''%1''"):match '_italics_'
"''italics''"

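We have already met the capture-processing operator /, using tonumber to convert numbers. It also understands strings in a format very similar to that of string.gsub, where %n stands for the n-th capture.

This operation could be expressed exactly as: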

> = string.gsub('_italics_','^_([^_]+)_',"''%1''")
"''italics''"

But the advantage is that we do not have to build a custom string pattern, or worry about escaping 'magic' characters such as '(' and ')'.
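
To make the point about magic characters concrete, here is the subst function from above packaged as a runnable script and applied to delimiters that would need escaping in a Lua string pattern. The delimiters and replacement formats are chosen only for illustration.

```lua
local lpeg = require "lpeg"
local P, C = lpeg.P, lpeg.C

local function subst(openp, repl, endp)
  openp = P(openp)
  endp = endp and P(endp) or openp
  local upto_endp = (1 - endp)^1       -- one or more chars that do not start endp
  return openp * C(upto_endp) / repl * endp
end

-- '(*', '*)' and '**' consist of characters that are magic in Lua patterns,
-- but here they are plain literals.
print(subst("(*", "<%1>", "*)"):match("(* a note *)"))   --> < a note >
print(subst("**", "'''%1'''"):match("**bold**"))         --> '''bold'''
```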

lpeg.Cs is a substitution capture, which provides a more general mechanism for global string substitution. The lpeg manual gives this equivalent of string.gsub:

Cs = lpeg.Cs  -- substitution capture (this shortcut was not defined earlier)
function gsub (s, patt, repl)
    patt = P(patt)
    local p = Cs ((patt / repl + 1)^0)
    return p:match(s)
end

> =  gsub('hello dog, dog!','dog','cat')
"hello cat, cat!"

To understand the difference, here is the same pattern using plain C:

> p = C((P'dog'/'cat' + 1)^0)
> = p:match 'hello dog, dog!'
"hello dog, dog!"	"cat"	"cat"

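The C here just captures the whole match, and each '/' adds a new capture whose value is the replacement string.

With Cs, everything is captured, and a single string is built from all the captures. Some of those captures are modified by '/', and so we get substitutions.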
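
Since the replacement is applied with /, it does not have to be a string. The sketch below repeats the gsub function as a script and passes it a pattern and a function; the bracketing function and the sample text are invented for the example.

```lua
local lpeg = require "lpeg"
local P, R, Cs = lpeg.P, lpeg.R, lpeg.Cs

local function gsub(s, patt, repl)
  patt = P(patt)
  -- at each position: substitute if patt matches, otherwise copy one character
  local p = Cs((patt / repl + 1)^0)
  return p:match(s)
end

print(gsub("hello dog, dog!", "dog", "cat"))
--> hello cat, cat!

-- patt can be any pattern, and repl can be a function of the matched text
print(gsub("x=1, y=22", R"09"^1, function(d) return "<" .. d .. ">" end))
--> x=<1>, y=<22>
```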

In Markdown, block-quoted lines begin with '> '.

lf = P'\n'
rest_of_line_nl = C((1 - lf)^0*lf)         -- capture chars upto \n
quoted_line = '> '*rest_of_line_nl       -- block quote lines start with '> '
-- collect the quoted lines and put inside [[[..]]]
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n"

> = quote:match '> hello\n> dolly\n'
"[[[
> hello
> dolly
]]]
"

This is not quite right: Cs captures everything, including the '> '. But we can force some captures to return an empty string:

function empty(p)
    return C(p)/''
end

quoted_line = empty ('> ') * rest_of_line_nl
...

Now things will work properly!
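
Put together as a script, the corrected block-quote conversion might look like this. It is a sketch that simply combines the definitions above with local names.

```lua
local lpeg = require "lpeg"
local P, C, Cs = lpeg.P, lpeg.C, lpeg.Cs

local lf = P"\n"
local rest_of_line_nl = C((1 - lf)^0 * lf)    -- capture chars up to and including \n

-- capture p, but give the capture the empty string as its value
local function empty(p)
  return C(p) / ""
end

local quoted_line = empty("> ") * rest_of_line_nl
local quote = Cs(quoted_line^1) / "[[[\n%1]]]\n"

io.write(quote:match("> hello\n> dolly\n"))
--> [[[
--> hello
--> dolly
--> ]]]
```

Inside Cs, the '> ' prefix is replaced by the empty string while each line capture is replaced by itself, so only the line contents survive.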

Here is the program used to convert this document from Markdown to the Lua wiki format:

local lpeg = require 'lpeg'

local P,S,C,Cs,Cg = lpeg.P,lpeg.S,lpeg.C,lpeg.Cs,lpeg.Cg

local test = [[
## A title

here _we go_ and `a:bonzo()`:

    one line
    two line
    three line
       
and `more_or_less_something`

[A reference](http://bonzo.dog)

> quoted
> lines
 
]]

function subst(openp,repl,endp)
    openp = P(openp)  -- make sure it's a pattern
    endp = endp and P(endp) or openp
    -- pattern is 'bracket followed by any number of non-bracket followed by bracket'
    local contents = C((1 - endp)^1)
    local patt = openp * contents * endp    
    if repl then patt = patt/repl end
    return patt
end

function empty(p)
    return C(p)/''
end

lf = P'\n'
rest_of_line = C((1 - lf)^1)
rest_of_line_nl = C((1 - lf)^0*lf)

-- indented code block
indent = P'\t' + P'    '
indented = empty(indent)*rest_of_line_nl
-- which we'll assume are Lua code
block = Cs(indented^1)/'    [[[!Lua\n%1]]]\n'

-- use > to get simple quoted block
quoted_line = empty('> ')*rest_of_line_nl 
quote = Cs (quoted_line^1)/"[[[\n%1]]]\n"
 
code = subst('`','{{%1}}')
italic = subst('_',"''%1''")
bold = subst('**',"'''%1'''")
rest_of_line = C((1 - lf)^1)
title1 = P'##' * rest_of_line/'=== %1 ==='
title2 = P'###' * rest_of_line/'== %1 =='

url = (subst('[',nil,']')*subst('(',nil,')'))/'[%2 %1]'
 
-- try title2 ('###') before title1 ('##'); otherwise '##' also matches the start of '###'
item = block + title2 + title1 + code + italic + bold + quote + url + 1
text = Cs(item^1)

if arg[1] then
    local f = io.open(arg[1])
    test = f:read '*a'
    f:close()
end

print(text:match(test))

Because of escaping problems with this wiki, I have had to substitute '[' for '{', etc. in the source. Take care!

SteveDonovan, 12 June 2012

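Group and Back Captures

This section dissects the behavior of group and back captures (Cg() and Cb() respectively).

Group captures (hereafter "groups") come in two flavors: named and anonymous.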

    Cg(C"baz" * C"qux", "name") -- named group.

    Cg(C"foo" * C"bar")         -- anonymous group.

Let's first get the simple one out of the way: named groups inside table captures.

    Ct(Cc"foo" * Cg(Cc"bar" * Cc"baz", "TAG")* Cc"qux"):match"" 
    --> { "foo", "qux", TAG = "bar" }

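In a table capture, the value of the first capture inside the group ("bar") is assigned to the corresponding key ("TAG") in the table. As you can see, Cc"baz" has been lost in the process. The label must be a string (or a number that will be automatically converted to a string).

Note that the group must be a direct child of the table, otherwise the table capture will not handle it.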

    Ct(C(Cg(1,"foo"))):match"a"
    --> {"a"}
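
A typical use of named groups in a table capture is building a record with named fields. The following sketch parses a key=number field; the field names "key" and "value" and the input are assumptions made for the example.

```lua
local lpeg = require "lpeg"
local C, Cg, Ct, R = lpeg.C, lpeg.Cg, lpeg.Ct, lpeg.R

local name = R"az"^1
local number = R"09"^1 / tonumber

-- both groups are direct children of the Ct, so each one fills a key
local field = Ct(Cg(C(name), "key") * "=" * Cg(number, "value"))

local t = field:match("width=80")
print(t.key, t.value, #t)      --> width    80    0
```

The array part of the table stays empty, because named groups contribute only their keyed value.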

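Of Captures and Values

Before digging deeper into groups, we must first explore a subtlety in the way captures handle their subcaptures.

Some captures operate on the values produced by their subcaptures, while others operate on the capture objects. This is sometimes confusing.

Let's take the following pattern: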

    (1 * C( C"b" * C"c" ) * 1):match"abcd"
    --> "bc", "b", "c"

As you can see, it inserts three values in the capture stream.

Let's wrap it in a table capture:

    Ct(1 * C( C"b" * C"c" ) * 1):match"abcd"
    --> { "bc", "b", "c" }

Ct() operates on values. In the last example, the three values are inserted in order in the table.

Now, let's try a substitution capture:

    Cs(1 * C( C"b" * C"c" ) * 1):match"abcd"
    --> "abcd"

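Cs() operates on captures. It scans the first level of its nested captures, and takes only the first value of each one. In the example above, "b" and "c" are thus discarded. Here is another example that may make things clearer: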

    function the_func (bcd) 
        assert(bcd == "bcd")
        return "B", "C", "D" 
    end

    Ct(1 * ( C"bcd" / the_func ) * 1):match"abcde"
    --> {"B", "C", "D"}  -- All values are inserted.

    Cs(1 * ( C"bcd" / the_func ) * 1):match"abcde"
    --> "aBe"   -- the "C" and "D" have been discarded.

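A more detailed account of the by-value / by-capture behavior of each kind of capture will be the subject of another section.

Capture Opacity

Another important thing to realize is that most captures shadow their subcaptures, but some do not. As you can see in the last example, the value of C"bcd" is passed to the /function capture, but it does not end up in the final capture list. Ct() and Cs() are also opaque in this regard. They only produce, respectively, a table or a string.

On the other hand, C() is transparent. As we have seen above, the subcaptures of C() are also inserted in the stream.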

    C(C"b" * C"c"):match"bc" --> "bc", "b", "c"

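The only transparent captures are C() and the anonymous Cg().

Anonymous Groups

Cg() wraps its subcaptures in a single capture object, but does not produce anything of its own. Depending on the context, either all of its values will be inserted, or only the first one.

Here are some examples of anonymous groups: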

    (1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
    --> "b", "c", "d"

    Ct(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
    --> { "b", "c", "d" }

    Cs(1 * Cg(C"b" * C"c" * C"d") * 1):match"abcde"
    --> "abe" -- "c" and "d" are dropped.

In what circumstances is this behavior useful? In folding captures.

Let's write a very basic calculator that adds or subtracts one-digit numbers.

    function calc(a, op, b)
        a, b = tonumber(a), tonumber(b)
        if op == "+" then 
            return a + b
        else
            return a - b
        end
    end

    digit = R"09"

    calculate = Cf(
        C(digit) * Cg( C(S"+-") * C(digit) )^0
        , calc
    )
    calculate:match"1+2-3+4"
    --> 4

The capture tree will look like this [*]:

    {"Cf", func = calc, children = {
        {"C", val = "1"},
        {"Cg", children = {
            {"C", val = "+"},
            {"C", val = "2"}
        } },
        {"Cg", children = {
            {"C", val = "-"},
            {"C", val = "3"}
        } },
        {"Cg", children = {
            {"C", val = "+"},
            {"C", val = "4"}
        } }
    } }

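You have probably guessed what is going on... Like Cs(), Cf() operates on capture objects. It first extracts the first value of the first capture and uses it as the initial value. If there are no more captures, this value becomes the value of the Cf().

But we have more captures. In our case, it passes all the values of the second capture (the group) to calc(), tacked on after the first value. Here is the evaluation of the Cf() above: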

    first_arg = "1"
    next_ones: "+", "2"
    first_arg = calc("1", "+", "2") -- 3, calc() returns numbers

    next_ones: "-", "3"
    first_arg = calc(3, "-", "3")

    next_ones: "+", "4"
    first_arg = calc(0, "+", "4")

    return first_arg -- Tadaaaa.

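[*] Actually, at match time, the capture objects only store their bounds and auxiliary data (like calc() for the Cf()). The actual values are produced sequentially after the match has completed, but the tree displayed above makes things clearer. In the example above, the values of the nested C() and Cg(C(),C()) are actually produced one at a time, at each corresponding cycle of the folding process.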
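
The evaluation steps above can be reproduced by hand. The sketch below runs the Cf() calculator next to an explicit loop; the groups pattern and the fold_by_hand function are helpers invented for this comparison, with Ct() standing in for Cg() so that each operator/digit pair can be inspected as a table.

```lua
local lpeg = require "lpeg"
local C, Cf, Cg, Ct, R, S = lpeg.C, lpeg.Cf, lpeg.Cg, lpeg.Ct, lpeg.R, lpeg.S

local function calc(a, op, b)
  a, b = tonumber(a), tonumber(b)
  if op == "+" then return a + b else return a - b end
end

local digit = R"09"
local calculate = Cf(C(digit) * Cg(C(S"+-") * C(digit))^0, calc)

-- Same shape, but every group becomes a small table we can loop over.
local groups = Ct(C(digit) * Ct(C(S"+-") * C(digit))^0)

local function fold_by_hand(s)
  local t = groups:match(s)
  local acc = t[1]                        -- first value is the initial accumulator
  for i = 2, #t do
    acc = calc(acc, t[i][1], t[i][2])     -- accumulator first, then the group's values
  end
  return acc
end

print(calculate:match("1+2-3+4"), fold_by_hand("1+2-3+4"))   --> 4    4
```

With a single digit and no operators, both versions return the string "7" unchanged, since calc() is never called.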

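Named Groups

Named Cg() / Cb() pairs behave like an anonymous Cg(), but the values captured in the named Cg() are not inserted locally. They are teleported, and end up inserted in the stream at the position of the Cb().

Here is an example: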

    ( 1 * Cg(C"bc", "FOOO") * C"d" * 1 * Cb"FOOO" * Cb"FOOO"):match"abcde"
    -- > "d", "bc", "bc"

The values are teleported... and duplicated if there are several Cb(). Another example:

    ( 1 * Cg(C"b" * C"c" * C"d", "FOOO") * C"e" * Ct(Cb"FOOO") ):match"abcde"
    --> "e", { "b", "c", "d" }
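
A common practical use of a named group with a back capture is checking that a closing delimiter agrees with its opening one. The sketch below does this for a toy tag syntax, in the same spirit as the long-string example in the lpeg manual; the syntax, the group name "tag" and the pattern names are assumptions made for illustration. lpeg.Cmt is the match-time capture that is also used in the lookup example further down.

```lua
local lpeg = require "lpeg"
local P, R, C, Cb, Cg, Cmt = lpeg.P, lpeg.R, lpeg.C, lpeg.Cb, lpeg.Cg, lpeg.Cmt

local name = R"az"^1

-- remember the opening tag name in a named group (produces no value here)
local open = "<" * Cg(C(name), "tag") * ">"

-- at match time, compare the closing name with the remembered one
local same = Cmt(C(name) * Cb"tag", function(subject, pos, closing, opening)
  return closing == opening
end)
local close = "</" * same * ">"

local element = open * C((1 - P"<")^0) * close

print(element:match("<b>bold</b>"))   --> bold
print(element:match("<b>bold</i>"))   --> nil
```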

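Usually, for clarity, I alias Cg() as Tag() in my code. I use the former for anonymous groups and the latter for named groups.

Cb"FOOO" looks back for a corresponding Cg() that has succeeded. It goes back and up in the tree, and consumes captures. In other words, it searches its elder siblings, and the elder siblings of its ancestors, but not the ancestors themselves. Nor does it test the children of those siblings.

It proceeds as follows (start from the [ #### ] <--- [[ START ]] and follow the numbers back up).

The [ numbered ] captures are the ones that are tested, in order. The captures marked [ ** ] are not tested, for the reasons given next to each of them. This is hairy, but complete as far as I can tell.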

    Cg(-- [ ** ] ... This one would have been seen, 
       -- if the search hadn't stopped at *the one*.
       "Too late, mate."
        , "~@~"
    )

    * Cg( -- [ 3 ] The search ends here. <--------------[[ Stop ]]
        "This is *the one*!"
        , "~@~"
    )

    * Cg(--  [ ** ] ... The great grand parent. 
                     -- Cg with the right tag, but direct ancestor,
                     -- thus not checked.

        
        Cg( -- [ 2 ] ... Cg, but not the right tag. Skipped.
            Cg( -- [ ** ] good tag but masked by the parent (whatever its type)
                "Masked"
                , "~@~"
            )
            , "BADTAG"
        )

        * C( -- [ ** ] ... grand parent. Not even checked.

            ( 
                Cg( -- [ ** ] ... This subpattern will fail after Cg() succeeds.
                    -- The group is thus removed from the capture tree, and will
                    -- not be found during the lookup.
                    "FAIL"
                    , "~@~"
                ) 
                * false 
            )

            + Cmt(  -- [ ** ] ... Direct parent. Not assessed.
                C(1) -- [ 1 ] ... Not a Cg. Skip.

                * Cb"~@~"   -- [ #### ]  <----------------- [[ START HERE ]] --
                , function(subject, index, cap1, cap2) 
                    return assert(cap2 == "This is *the one*!")
                end
            )
        )
        , "~@~" -- [ ** ] This label goes with the great grand parent.
    )
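
Two of the lookup rules are easy to check in isolation. This is a small sketch with invented pattern names: when several completed groups share a tag, the most recent one wins, and a group captured inside an alternative that later fails is discarded.

```lua
local lpeg = require "lpeg"
local P, Cb, Cc, Cg = lpeg.P, lpeg.Cb, lpeg.Cc, lpeg.Cg

-- Two completed groups share the tag "x": Cb picks the most recent one.
local latest = Cg(Cc"first", "x") * Cg(Cc"second", "x") * Cb"x"
print(latest:match(""))        --> second

-- The "lost" group belongs to an alternative that fails, so it is removed
-- on backtracking and the lookup falls back to the earlier group.
local survivor = Cg(Cc"kept", "x")
               * (Cg(Cc"lost", "x") * P(false) + P(true))
               * Cb"x"
print(survivor:match(""))      --> kept
```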

-- PierreYvesGerardy

Last edited February 18, 2019 7:39 pm GMT