EarleyParseFunctionAlgorithm

(derived from Dick Grune and Ceriel J.H. Jacob's "Parsing Techniques -- A Practical Guide" -- my compliments to the authors! See page 149 in Edition 1 -- which is the 139th page in the .pdf book file.)

See also:
EarleyParseCompletedItemSetsExample.html
EarleyTreeBuilder.html


Assumptions:


1. The input Metamath database has been validated successfully against the Syntax Axiom Rules and Conventions. In other words, the syntax and grammar of the input have been determined to be valid (we will not proceed in this using bogus data!)

Note: there will likely need to be two invocations of the parse function, one for validating Syntax Axioms themselves, and another for validation of non-Syntax statements. Both functions need a parse process, but for opposite reasons! A parse attempt is required for Syntax Axioms to prove that they cannot be parsed using the grammar derived from other Syntax Axioms, meaning that each is independent; non-Syntax statements are parsed to prove that they can be derived from the grammar!

2. The Grammar Rules and associated information have been created from the Syntax Axioms of the Metamath Database and all information stored (for convenience) in Cnst and Axiom has been initialized; of local interest, specifically, the Cnst.earleyFIRST and Cnst.earleyRules items have been set to null prior to construction of EarleyParser (which happens at the end of Grammar.initializeGrammar().

3. Primary Input to EarleyParser.parseExpr consists of:
        public int parseExpr(ParseTree[]       parseTreeArrayIn,
                             Cnst              formulaTypIn,
                             ParseNodeHolder[] parseNodeHolderExprIn,
                             int               highestSeqIn)

3. Primary Output from  EarleyParser.parseExpr is int -- the number of generated ParseTrees. The input parseTreeArrayIn is treated as I/O and is loaded with the output parseTrees, starting at index 0; unused array entries are not rei-initialized, so the contents beyond the index = (output count - 1) is indeterminate.

4. Auxiliary Input accessed via Grammar, Cnst, etc.(grammar and grammar.maxFormulaCnt are input via the EarleyParser Constructor.)

5. Auxiliary Output:



Special Parse Cases: (NOTE: in the event that we discontinue using the "combinatorial explosion" method of resolving Nulls Permitteds and Type Conversions, these Special Cases will need to be abandoned -- they will have to pass through the regular parse just like everyone else :) The issue involves nulls prior to or after any symbol and making those matches to the Grammar Rules.)
  1. Expression length = 0: Occurs on Nulls Permitted Syntax Axioms or, in theory, on a Logical Hypothesis, Logical Axiom or Theorem -- statements beginning with a valid Theorem TypeCode (i.e. "|-"). In both cases, the parse should return a Nulls Permitted result, but the question is, for which Type Code? Answer: if the parsed Formula Type Code is a Provable Logic Statement Type code, then return a Nulls Permitted parse for one of the Logical Statement Type Codes (and if 2 results are possible then that is an ambiguity); otherwise, if the parsed Formula Type Code is not a Provable Logic Statement Type Code, then return a Nulls Permitted parse result for any of the non-Provable Logic Statement Type Codes, including the Logical Type Codes (ex. if Formula Type Code is "wff" then if there is a Nulls Permitted Rule for "wff", return that, else no parse is possible.) NOTE: At this time, the input is restricted to allow a maximum of ONE Provable Logic Statement Type Code and ONE Logic Statement Type Code.
  2. Expression with Length = 1 and (!parseNodeHolderExprIn[0].mObj.isCnst()) and (parseNodeHolderExprIn[0].getStmt().isVarHyp(): This occurs on Type Conversion Syntax Axioms, on Variable Hypothesis Statements and on Logical Statements (see ax-mp in set.mm!) If the parsed Formula's Type Code is a Provable Logic Statement Type Code then the parse result should be just a Variable Hypothesis parse tree -- otherwise, the parse result should be a Type Conversion parse tree. (NOTE: when parsing a Metamath database in its entirety, we do not send variable hypotheses through the parser, therefore weignore that scenario in this special case. Thus, if someone purposely sends, say, " xyz $f class x $." through the parser it will be treated as a type conversion, from class to class, and it will generate an error message becase there can be no conversion from a Type Code to itself.)


Processing Overview

There are four primary steps in parsing an expression.
  1. Rewrite parseNodeHolderExprIn with parse of each mObj.isCnst that refers to a "gimme" 1-symbol Syntax Axiom Rule (see Cnst.len1CnstNotationRule not = null and NotationRule.isGimmeNbr == 1). At the same time, increase the length of the rewritten expr so that the first symbol occurs in array index 1; that simplifies a fair amount of code, leaving only the -1 adjustment -- from position to index -- for accessing GrammarRule expression symbols.
  2. Check for special cases -- ParseNodeHolderExpr length = 0, and ParseNodeHolderExpr length = 1 consisting of just a variable (hypothesis). Output these automatically without bothering Mr. Earley...
  3. Perform the Earley Parse algorithm on the rewritten expression. This process recognizes grammatically correct expressions and performs record-keeping that enables generation of parse tree(s) using the Completed ItemSets.
  4. Generate the Parse Tree(s), if any, using the Earley ItemSets/CompleteItemSets. If the final Competed Itemset does not contain an EarleyItem @1 for the Start Rule Type, then there are no parses available (we'll have to figure out how to generate multiples as we go through this...)
In addition, at construction time, the Cnst.earleyFIRST set and the Cnst.earleyRules list must be initialized using the Grammar.


Processing Overview, Step 3: Perform the Earley Parse algorithm on the Rule Format Expr.

General Issues:



EarleyItem Data Structure -- stored in Itemsets

public class EarleyItem {
    /**
     *  rule - NotationRule reference used to avoid copying expr
     *         symbols over and over again.
     *
     *         EarleyItem.rule.ruleNbr is the high order sub-key
     *         of an Earley Item. Together, rule.ruleNbr and
     *         atIndex make a unique key, which is essential for
     *         maintaining *set* Earley Itemsets.
     */
    NotationRule rule;

    /**
     *  atIndex - index to next Cnst (Sym) in expr, equal to the
     *         Earley loop counter, "i"; 1 is first, not 0, which
     *         is left empty in expr for simplicity.
     *
     *         EarleyItem.atIndex is the low-order sub-key of an
     *         Earley Item. Together, rule.ruleNbr and atIndex make
     *         a unique key, which is essential for maintaining
     *         Earley Itemsets.
     */
    int          atIndex;

    /**
     *  dotindex - index of the Earley "dot" (or "gap") used instead
     *         of physically moving the dot around in the EarleyItem. A
     *         value of 1 means that the dot is *before* the first
     *         sym in the rule's expr, while a value equal to
     *         expr.length means that the EarleyItem is completed.
     */
    int          dotIndex;

    /**
     *  afterDot - if item completed equal to null, else equal to
     *         ruleExpr[dotIndex]. Used to avoid repetitious and
     *         annoying "hunting trips" into the rule's forest.
     */
    Cnst         afterDot;

    /*
     * Compare for equality with another EarleyItem.
     * Equal if and only if Rule and atIndexSym are
     * identical.
     *
     * @return returns true if equal, otherwise false.
     */
    public boolean equals(Object obj) {
        return (this == obj) ? true
            : !(obj instanceof EarleyItem) ? false
                    : rule != ((EarleyItem)obj).rule ? false
                        : atIndex != ((EarleyItem)obj).atIndex? false
                            : true;
    }

    public String toString() {
        StringBuilder s = new StringBuilder();
        s.append(GrammarConstants.ERRMSG_AT_CAPTION);
        s.append(atIndex);
        s.append(GrammarConstants.ERRMSG_DOT_CAPTION);
        s.append(dotIndex);
        s.append(GrammarConstants.ERRMSG_AFTER_DOT_CAPTION);
        if (afterDot == null) {
            s.append(" ");
        }
        else {
            s.append(afterDot);
        }
        s.append(" ");
        s.append(rule.ruleNbr);
        s.append(GrammarConstants.ERRMSG_COLON_CAPTION);
        s.append(rule.getBaseSyntaxAxiom().getTyp());
        s.append(GrammarConstants.ERRMSG_COLON_CAPTION);
        s.append(rule.getBaseSyntaxAxiom().getLabel());
        s.append(" ");
        s.append(GrammarConstants.ERRMSG_RULE_EXPR_CAPTION);
        s.append(rule.getRuleFormatExprAsString());
        return s.toString();
    }
}



Example: Parse "xyz $a |- ( ph <-> x = A ) $."

This generates the rule format expr (with empty index 0)
 "( wff  <->  set = class )"

And the correct parse RPN is "wph vx cv cA wceq wb"



0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )

ItemSet[0] (reduced to minimum rules for ease of exposition)

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
1
(
Itemset[0] preloaded by Predictor with all rules that derive directly from the Start Rule Type Code (and that have Max Sequence Number <= the input formula Max Sequence Number.) Input to Predictor(0) can be the precomputed Earley Itemset[0] that is based on just the grammar, not the input formula. The Predictor can also be enhanced with Lookahead using the EarleyFIRST set, which is also pre-computed using just the grammar -- this will reduce the size of both Itemset[0] and subsequent predictions.
18:wceq
"set = class"
wff
1
1
set
NOTE: With Lookahead, this item will not be loaded because set.earleyFirst does not contain "(", the first symbol of the parsed expression.










expr, Earley loop #1
0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )

ItemSet[1]
___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
2
wff
Scanner copied from Itemset[i - 1] because it contained afterDot = "(". dotIndex moved forward, after the "(" to the next sym, which is "wff".
3:wb
"( wff <-> wff )"
wff
2
1
(
Predictor copied all rules for wff because of previous "Active" item in ItemSet[1], which contains Non-Terminal afterDot = wff
18:wceq
"set = class"
wff
2
1
set
Predictor: ditto. Note: there are no rules present for class or set, so Predictor stops here!


CompletedItemSet[1]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





empty? yep.




expr, Earley loop #2

0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )


ItemSet[2]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
3
<->
Scanner copied from Itemset[i - 1] because it contained afterDot = "wff". dotIndex moved forward, after the "wff" to the next sym, which is "<->".





Predictor: does not add rules for terminal "<->"...done.








CompletedItemSet[2]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





empty? yep.




expr, Earley loop #3
0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )

ItemSet[3]
___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
4
wff
Scanner copied from Itemset[i - 1] because it contained afterDot = "<->". dotIndex moved forward, after the "<->" to the next sym in the rule, which is "wff".
3:wb
"( wff <-> wff )"
wff
4
1
(
Predictor copied all rules for wff because of previous "Active" item in ItemSet[3], which contains Non-Terminal afterDot = wff.
18:wceq
"set = class"
wff
4
1
set
Predictor: ditto. Note: there are no rules present for class or set, so Predictor stops here!


CompletedItemSet[3]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





empty? yep.




expr, Earley loop #4

0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )


ItemSet[4]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
18:wceq
"set = class"
wff
4
2
=
Scanner copied from Itemset[i - 1] because it contained afterDot = "set". dotIndex moved forward, after the "set" to the next sym in the rule, which is "=", a terminal.





Predictor: does not add rules for afterDot terminal, "="...done.


CompletedItemSet[4]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





empty? yep.




expr, Earley loop #5

0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )


ItemSet[5]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
18:wceq
"set = class"
wff
4
3
class
Scanner copied from Itemset[i - 1] because it contained afterDot = "=". dotIndex moved forward, after the "=" to the next sym in the rule, which is "class", a non-terminal.





Predictor: does not add rules for afterDot non-terminal, "class" because there are no grammar rules for "class".


CompletedItemSet[5]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





empty? yep. still empty. sigh!!!




expr, Earley loop #6

0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )


ItemSet[6]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
5
)
Completer copied copied from Itemset[3] because it contained  afterDot = "wff" and the CompletedItemSet item below has rule type "wff" atIndex 4 (it goes to Itemset[m - 1]. It then moved dotIndex forward and stores the item in Active because it is not completed, yet.





Predictor: does not add rules for afterDot terminal, nothing for it to do.


CompletedItemSet[6]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
18:wceq
"set = class"
wff
4
4

Scanner copied from Itemset[i - 1] because it contained afterDot = "class". dotIndex moved forward, after the "class" to the end! Therefore this is completed and should be in CompletedItemSet! Yay!





expr, Earley loop #7

0
  1
  2
  3
  4
  5
  6
  7

(
wff
<->
set
=
class )


ItemSet[7]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________





Predictor: nothing to do, Active ItemSet is empty. And on top of that, I don't think we need to run Predictor on the final symbol.


CompletedItemSet[7]

___________rule___________
rule
Typ
at
Index

dot
Index
after
Dot
_______________comments_______________
3:wb
"( wff <-> wff )"
wff
1
6

Scanner: copied copied from Itemset[6] because it contained  afterDot = ")".  It then moved dotIndex forward and sets afterDot to null because this item is completed -- so it is stored in CompletedItemSet instead of Active.
AND...since the final CompletedItemSet has an atIndex = 1 for the Start Rule Type "wff", we have a parse!!! Yay. (It only took about 1 1/2 hours to type this example in. Now to generate the parse tree for it...)