Provided by: libmarpa-r2-perl_12.000000-1_amd64

NAME

       Marpa::R2::Semantics::Rank - How ranks are computed

Synopsis

           use Marpa::R2;

           my $source = <<'END_OF_SOURCE';
             :discard ~ ws; ws ~ [\s]+
             :default ::= action => ::array

             Top ::= List action => main::group
             List ::= Item3 rank => 3
             List ::= Item2 rank => 2
             List ::= Item1 rank => 1
             List ::= List Item3 rank => 3
             List ::= List Item2 rank => 2
             List ::= List Item1 rank => 1
             Item3 ::= VAR '=' VAR action => main::concat
             Item2 ::= VAR '='     action => main::concat
             Item1 ::= VAR         action => main::concat
             VAR ~ [\w]+

           END_OF_SOURCE

           my @tests = (
               [ 'a',                 '(a)', ],
               [ 'a = b',             '(a=b)', ],
               [ 'a = b = c',         '(a=)(b=c)', ],
               [ 'a = b = c = d',     '(a=)(b=)(c=d)', ],
               [ 'a = b c = d',       '(a=b)(c=d)' ],
               [ 'a = b c = d e =',   '(a=b)(c=d)(e=)' ],
               [ 'a = b c = d e',     '(a=b)(c=d)(e)' ],
               [ 'a = b c = d e = f', '(a=b)(c=d)(e=f)' ],
           );

           my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );
           my @results = ();
           for my $test (@tests) {
               my ( $input, $output ) = @{$test};
               my $recce = Marpa::R2::Scanless::R->new(
                   {
                       grammar        => $grammar,
                       ranking_method => 'high_rule_only'
                   }
               );
               $recce->read( \$input );
               my $value_ref = $recce->value();
               if ( not defined $value_ref ) {
                   die 'No parse';
               }
               push @results, ${$value_ref};
           }

           sub flatten {
               my ($array) = @_;

               # say STDERR 'flatten arg: ', Data::Dumper::Dumper($array);
               my $ref = ref $array;
               return [$array] if $ref ne 'ARRAY';
               my @flat = ();
             ELEMENT: for my $element ( @{$array} ) {
                   my $ref = ref $element;
                   if ( $ref ne 'ARRAY' ) {
                       push @flat, $element;
                       next ELEMENT;
                   }
                   my $flat_piece = flatten($element);
                   push @flat, @{$flat_piece};
               }
               return \@flat;
           }

           sub concat {
               my ( $pp, @args ) = @_;

               # say STDERR 'concat: ', Data::Dumper::Dumper(\@args);
               my $flat = flatten( \@args );
               return join '', @{$flat};
           }

           sub group {
               my ( $pp, @args ) = @_;

               # say STDERR 'comma_sep args: ', Data::Dumper::Dumper(\@args);
               my $flat = flatten( \@args );
               return join '', map { +'(' . $_ . ')'; } @{$flat};
           }

Description

       This document describes rule ranking.  Rule ranking plays a role in parse ordering, which is described in
       a separate document.

Lexeme locations and fenceposts

       The lexeme locations are an ordered set of sets of lexemes.  Numbering of lexeme locations is 0-based.

       It is also useful to have an idea of locations between lexeme locations.  This is the classic "fencepost"
       issue: sometimes you want to count sections of fence, and at other times it is more convenient to count
       fenceposts.

       Let the lexeme locations be 0 through "N-1", so that there are "N" of them.  If "I" is greater than 0 and
       less than "N", then lexeme fencepost "I" is before lexeme location "I" and after lexeme location "I-1".
       Lexeme fencepost 0 is before lexeme location 0.  Lexeme fencepost "N" is after lexeme location "N-1".

       The above implies that

       •   If there are "N" lexeme locations, there are "N+1" lexeme fenceposts.

       •   Lexeme location "I" is always between lexeme fencepost "I" and lexeme fencepost "I+1".

       Lexeme locations and fenceposts are closely related to G1 locations.
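
       The numbering above can be checked with a few lines of plain Perl.  The sketch below is only arithmetic
       on the definitions; it is not Marpa code.

```perl
use strict;
use warnings;

# Three lexemes occupy lexeme locations 0, 1 and 2 (0-based).
my @lexemes = ( 'a', '=', 'b' );
my $n = scalar @lexemes;    # N = 3 lexeme locations

# N lexeme locations imply N+1 lexeme fenceposts, numbered 0 .. N.
my @fenceposts = ( 0 .. $n );
printf "%d locations, %d fenceposts\n", $n, scalar @fenceposts;

# Lexeme location I always lies between fencepost I and fencepost I+1.
for my $i ( 0 .. $#lexemes ) {
    printf "location %d ('%s') is between fenceposts %d and %d\n",
        $i, $lexemes[$i], $i, $i + 1;
}
```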

Dotted Rules

       To  understand  this document, it is important to understand what a dotted rule is.  An acquaintance with
       dotted rules is also important in understanding Marpa's progress reports.  Dotted  rules  are  thoroughly
       described in the progress report documentation.  This section repeats the main ideas from the perspective
       of this document.

       Recall  that a rule is a LHS (left hand side) and a RHS (right hand side).  The LHS is always exactly one
       symbol.  The RHS is zero or more symbols.  Consider the following example of a rule, given in the  syntax
       of Marpa's DSL.

           S ::= A B C

       Dotted  rules are used to track the progress of a parse through a rule.  They consist of a rule and a dot
       position, which marks the point in the RHS which scanning has reached.  It is  called  the  dot  position
       because,  traditionally,  the  position is represented by a dot.  The symbol before the dot is called the
       predot symbol.

       The following is an example of a dotted rule.  (The dot of a dotted rule is not part of Marpa's DSL  but,
       when it is useful for illustration, we will use it in the notation in this document.)

           S ::= A B . C

       In this rule, B is the predot symbol.

       When  the  dot  is after the last symbol of the RHS, the dotted rule is called a completion.  Here is the
       completion for the above rule:

           S ::= A B C .

       In this completion example, the symbol C is the predot symbol.

       When the dot is before the first symbol of the RHS, the  rule  is  called  a  prediction.   Here  is  the
       prediction of the rule we've been using for our examples:

           S ::= . A B C

       In predictions, there is no predot symbol.

Choicepoints

       Informally, a choicepoint is a place where the parser can decide among one or more parse choices.  A
       choicepoint can be thought of either as a set of parse choices, or as the tuple of the properties which
       all the parse choices for the choicepoint must have in common.

       For a choicepoint to work, all the choices must have enough in common that each of them could be replaced
       with  any other.  Thought of as a tuple of properties, a choicepoint is a triple: "(dp, start, current)".
       In this triple,

       •   "dp" is a dotted rule.  The predot symbol of "dp" is the predot symbol of the choicepoint -- if  "dp"
           is  a  prediction,  there is no predot symbol.  The rule of "dp" is the rule of the choicepoint.  The
           dot position of "dp" is the dot position of the choicepoint.

       •   "start" is the lexeme location where the dotted rule begins.  "start" is sometimes called the  origin
           of the choicepoint.

       •   "current" is the lexeme location corresponding to the dot in "dp".  "current" is sometimes called the
           "current location" of the choicepoint.

       The  choicepoint  is  a  prediction  choicepoint  if  "dp"  is  a prediction.  The choicepoint is a token
       choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is a token  symbol.   The
       choicepoint  is a rule choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is
       the LHS of a rule.  (Token symbols are never the LHS of a rule, and vice versa.)
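
       The triple can be pictured with ordinary Perl data structures.  The following is an illustrative sketch
       only: the hash keys and the "predot" helper are invented for this document and are not part of Marpa's
       API.

```perl
use strict;
use warnings;

# A choicepoint represented as its defining triple.  This is a sketch
# of the concept only -- it is NOT Marpa's internal data structure,
# and the field names are invented for illustration.
my $choicepoint = {
    dp      => { rule => 'List ::= List Item3', dot => 1 },
    start   => 0,    # origin: lexeme location where the rule begins
    current => 2,    # lexeme location corresponding to the dot
};

# The predot symbol is the RHS symbol just before the dot.  There is
# no predot symbol for a prediction, where the dot position is 0.
sub predot {
    my ($cp) = @_;
    my ( $lhs, $rhs ) = split /\s*::=\s*/, $cp->{dp}{rule};
    my @rhs_symbols = split ' ', $rhs;
    my $dot = $cp->{dp}{dot};
    return $dot > 0 ? $rhs_symbols[ $dot - 1 ] : undef;
}

print predot($choicepoint), "\n";    # List
```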

Parse choices

       As mentioned, a choicepoint can be seen as a set of one or more parse choices.  From the point of view of
       the individual parse trees, the traversal is top-down and left-to-right.

       Often, there is only one parse choice.  When there is only one parse choice, the choicepoint is  said  to
       be trivial.  Prediction and token choicepoints are always trivial.  If all of the choicepoints of a parse
       are trivial, the parse is unambiguous.

       Every  rule  choicepoint,  and  therefore  every  non-trivial  choicepoint,  has  a  set of parse choices
       associated with it.  For a rule choicepoint, each parse choice is a duple: "(predecessor, cause)", where
       "cause" and "predecessor" are also choicepoints.  The first element of the duple is the predecessor
       choicepoint, or predecessor, of the parse choice.  The second element of the duple is the cause
       choicepoint, or cause, of the parse choice.

       Let  "res" be a rule choicepoint and let "(pred, cuz)" be one of the parse choices of "res".  Then we say
       that "res" is the result of "cuz" and "pred"; and we say that "cuz" is one of the causes of "res".

       The predecessor of a parse choice represents a portion of the parse that "leads up to"  the  cause.   The
       predecessor  of  a  parse choice plays no role in ranking decisions, and this document will mostly ignore
       predecessors.

Causes

       Since it is a choicepoint, a cause choicepoint must also be a triple.  Let the result of "cuz" be "res".
       Let the triple for "cuz" be "(cuz-dp, cuz-origin, cuz-current)".  Let the triple for "res" be
       "(res-dp, res-origin, res-current)".

       If "res" is the result of "cuz", then

       •   "cuz-current" is always the same as "res-current".  In other words, the current location of "cuz"  is
           the same as the current location of "res".

       •   "cuz-origin"  is  before  "res-current".   In  other words, the origin of "cuz" is before the current
           location of "res".

       •   "cuz-origin" is at or after "res-origin".  In other words, the origin of "cuz" is  at  or  after  the
           origin  of  "res".   While  it  is  not  always  properly  between  "res-origin"  and  "res-current",
           "cuz-origin" is always in the range bounded by "res-origin" on one side, and by "res-current" on  the
           other.   "cuz-origin"  can  therefore be thought of as "in the middle" of the parse choice.  For this
           reason, "cuz-origin" is often called the middle location of the parse choice.

       •   The dotted rule of "cuz" will be a completion.

       •   The LHS of the dotted rule of "cuz" will be the predot symbol of "res".

       Summing up the above, the causes of a rule choicepoint vary only by middle position and  rule  RHS.   The
       causes of a rule choicepoint always share the same current lexeme location and rule LHS, and their dotted
       rules are always completions.

Rank is by Cause

       The  rank  of  a  parse  choice  is  that  of  its  cause.   The rank of a cause is the rank of its rule.
       Therefore, the rank of a parse choice is the rank of the rule of its cause.  That last  sentence  is  the
       most important sentence of this document, so we will repeat it:

           The rank of a parse choice is
           the rank of the rule of its cause.

Motivation

       By  the  definition  of  the previous section, the rule of the choicepoint itself plays no direct role in
       ranking.  Instead, the ranking is based on the rules of one of the choicepoint's child nodes.  At  first,
       this  may  seem  unnecessarily  roundabout,  and  even counter-intuitive, but on reflection, it will make
       sense.

       Ranking cannot be based directly on the rank of the choicepoint's rule because  every  parse  choice  for
       that  choicepoint  shares  the  choicepoint's  rule.   We  have  not  examined  the internal structure of
       predecessors, but ranking is not based on them for the same reason: the rule of a predecessor  is  always
       the same as the rule of its result choicepoint, and therefore the rule of its predecessor is the same for
       every parse choice of a choicepoint.

       On the other hand, the causes of parse choices can and often do differ in their rules.  Therefore ranking
       is based on the rules of the causes of a choicepoint.

       Note that ranking is only by direct causes.  It might reasonably be asked: why not, at least in the case
       of a tie, look at causes of causes?  Or why not resolve ties by looking at causes of predecessors?

       Marpa's built-in rule ranking was chosen to be the most powerful system that could  be  implemented  with
       effectively  zero  cost.   Ranking  by  direct  causes uses only information that is quickly and directly
       available, so that its runtime cost is probably not measurable.

       Complexity was also an issue.  If indirect causes are taken into account, we would need to specify  which
       causes,  and  under what circumstances they are used.  Only causes of causes?  Or causes of predecessors?
       Depth-first, or breadth-first?  Only in case of ties, or using  a  more  complex  metric?   To  arbitrary
       depth, or using a traversal that is cut off at some point?

       Ranking by direct causes only means the answer to all of these questions is as simple as it can be.  Once
       we get into analyzing the synopsis grammar in detail, the importance of simplicity will become clear.

       To be sure, there are apps whose requirements justify extra overhead and extra complexity.  For these
       apps, Marpa's ASFs allow full generality in ranking.

Examples

       Our examples in this document will look at the ranked grammar in the synopsis, and at minor variations of
       it.

   Longest highest, version 1
       We will first consider the example as it is given in the synopsis:

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top ::= List action => main::group
         List ::= Item3 rank => 3
         List ::= Item2 rank => 2
         List ::= Item1 rank => 1
         List ::= List Item3 rank => 3
         List ::= List Item2 rank => 2
         List ::= List Item1 rank => 1
         Item3 ::= VAR '=' VAR action => main::concat
         Item2 ::= VAR '='     action => main::concat
         Item1 ::= VAR         action => main::concat
         VAR ~ [\w]+

       Longest highest ranking

       The DSL in the synopsis ranks its items "longest highest".  Here "items" are represented by the symbols
       "<Item3>", "<Item2>" and "<Item1>".  The "longest" choice is considered to be the one with the most
       lexemes.  Working this idea out for this grammar, we see that the items  should  rank,  from  highest  to
       lowest: "<Item3>", "<Item2>" and "<Item1>".

       The non-trivial choicepoints

       Several  examples  and  their  results are shown in the synopsis.  If you study the grammar, you will see
       that the non-trivial choicepoints will be for the dotted rules:

           Top ::= List .
           List ::= List . Item3
           List ::= List . Item2
           List ::= List . Item1

       The above are all of the potential dotted rules for result choicepoints.

       The cause choicepoints

       In all of these dotted rules, the predot symbol is "<List>".  Recall that cause choicepoints  are  always
       completions.  Since the predot symbol of all the non-trivial choicepoints is "<List>", the dotted rule of
       a cause of the non-trivial choicepoints may be any of

           List ::= Item1 .
           List ::= Item2 .
           List ::= Item3 .
           List ::= List Item1 .
           List ::= List Item2 .
           List ::= List Item3 .

       In fact, if we study the grammar more closely, we will see that the only possible ambiguity is in the
       sequence of items, and that ambiguities always take the form "(v=)(v)" versus "(v=v)".  Therefore the
       only causes of non-trivial parse choices are

           List ::= Item3 .
           List ::= Item1 .
           List ::= List Item3 .
           List ::= List Item1 .

       The first non-trivial choicepoints

       Further study of the grammar shows that the first non-trivial choice must be between  two  parse  choices
       with these cause choicepoints:

           List ::= Item3 .
           List ::= Item1 .

       The rule "List ::= Item3" has rank 3, and it obviously outranks the rule "List ::= Item1", which has rank
       1.  Our example uses "high_rule_only" ranking, and our example therefore will leave only this choice:

           List ::= Item3 .

       With  only  one  choice  left,  the  resulting  choicepoint  becomes trivial, and, as defined above, that
       remaining choice is "longest", and therefore the correct one for the "longest highest" ranking.

       The second and subsequent non-trivial choicepoints

       We've looked at only the first non-trivial choice for our example code.  Again examining the grammar,  we
       see  that  the  second  and subsequent non-trivial choices will all be between parse choices whose causes
       have these two dotted rules:

           List ::= List Item3 .
           List ::= List Item1 .

       The rule "List ::= List Item3" has rank 3, and it obviously outranks the rule "List ::= List Item1",
       which has rank 1.  Our example uses "high_rule_only" ranking, and our example therefore will leave only
       this choice:

           List ::= List Item3 .

       Conclusion

       We have now shown that our example will reduce all choicepoints to a single choice, one which is
       consistent with "longest highest" ranking.  Since all choicepoints are reduced to a single choice, the
       ranked grammar is unambiguous.

       This analysis made a lot of unstated assumptions.  Below, there is a "Proof of  correctness".   It  deals
       with this same example, but proceeds much more carefully.

   Shortest highest, version 1
       Here  we  see the grammar of the synopsis, reworked for a "shortest highest" ranking.  "Shortest highest"
       is the reverse of "longest highest".

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top ::= List action => main::group
         List ::= Item3 rank => 1
         List ::= Item2 rank => 2
         List ::= Item1 rank => 3
         List ::= List Item3 rank => 1
         List ::= List Item2 rank => 2
         List ::= List Item1 rank => 3
         Item3 ::= VAR '=' VAR action => main::concat
         Item2 ::= VAR '='     action => main::concat
         Item1 ::= VAR         action => main::concat
         VAR ~ [\w]+

       Here are what the results will look like for "shortest highest".

           my @tests = (
               [ 'a',                 '(a)', ],
               [ 'a = b',             '(a=)(b)', ],
               [ 'a = b = c',         '(a=)(b=)(c)', ],
               [ 'a = b = c = d',     '(a=)(b=)(c=)(d)', ],
               [ 'a = b c = d',       '(a=)(b)(c=)(d)' ],
               [ 'a = b c = d e =',   '(a=)(b)(c=)(d)(e=)' ],
               [ 'a = b c = d e',     '(a=)(b)(c=)(d)(e)' ],
               [ 'a = b c = d e = f', '(a=)(b)(c=)(d)(e=)(f)' ],
           );

       The reader who wants an example of a ranking scheme to work out for themselves may find this one
       suitable.  The reasoning will be very similar to that for the "longest highest" example, just above.

   Longest highest, version 2
       The previous examples have shown the rules involved in parse ranking in "spelled out" form.  In fact, a
       more compact form of the grammar can be used, as shown below for "longest highest" ranking.

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top ::= List action => main::group
         List ::= Item rank => 1
         List ::= List Item rank => 0
         Item ::= VAR '=' VAR rank => 3 action => main::concat
         Item ::= VAR '='     rank => 2 action => main::concat
         Item ::= VAR         rank => 1 action => main::concat
         VAR ~ [\w]+

   Shortest highest, version 2
       This is the grammar for "shortest highest", in compact form:

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top ::= List action => main::group
         List ::= Item rank => 0
         List ::= List Item rank => 1
         Item ::= VAR '=' VAR rank => 1 action => main::concat
         Item ::= VAR '='     rank => 2 action => main::concat
         Item ::= VAR         rank => 3 action => main::concat
         VAR ~ [\w]+

       Again, the reader looking for simple examples to work out for themselves may want to rework the argument
       given above for the "spelled out" examples so that it applies to the compact examples.

Alternatives to Ranking

   Reimplementation as pure BNF
       It is generally better, when possible, to write a language as BNF, instead of using ranking.  The
       advantage of using BNF is that you can more readily determine exactly what language it is that you are
       parsing: ranked grammars may look easier to analyze at first glance, but the more you look at them, the
       trickier you realize they are.

       These "pure BNF" reimplementations rely on an observation used in the detailed proof of  the  ranked  BNF
       example:  The  parse  string  becomes  easier  to  analyze when you see it as a sequence of "var-bounded"
       substrings, where "var-bounded" means the substring is delimited by the start of the string, the  end  of
       the string, or the fencepost between two "<VAR>" lexemes.
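
       In the literal-string notation used in the proof below (where "v" stands for a "<VAR>" lexeme and "="
       for itself), cutting a token string at each such fencepost can be sketched with zero-width regex
       assertions.  This illustrates the observation only; it is not Marpa code.

```perl
use strict;
use warnings;

# A fencepost between two adjacent "v"s delimits "var-bounded"
# substrings, so we can split there with zero-width look-around.
my $w = 'v=vv=v';
my @var_bounded = split /(?<=v)(?=v)/, $w;
print join( ' | ', @var_bounded ), "\n";    # v=v | v=v
```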

       Longest highest as pure BNF

       Here is the "longest highest" example, reimplemented as BNF:

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top            ::= Max_Boundeds action => main::group
         Top            ::= Max_Boundeds Unbounded action => main::group
         Top            ::= Unbounded action => main::group
         Max_Boundeds   ::= Max_Bounded+
         Max_Bounded    ::= Eq_Finals Var_Final3
         Max_Bounded    ::= Var_Final
         Unbounded      ::= Eq_Finals
         Eq_Finals      ::= Eq_Final+
         Var_Final      ::= Var_Final3 | Var_Final1
         Var_Final3     ::= VAR '=' VAR action => main::concat
         Eq_Final       ::= VAR '='     action => main::concat
         Var_Final1     ::= VAR         action => main::concat
         VAR ~ [\w]+

       Shortest highest as pure BNF

       We  can  also  reimplement  the  "shortest  highest"  example  as  BNF.   One  of the advantages of a BNF
       (re)implementation, is that it often clarifies the grammar.  For example in this case, we note  that  the
       DSL rule

         Var_Final3     ::= VAR '=' VAR action => main::concat

       is, in fact, never used.  We therefore omit it:

         :discard ~ ws; ws ~ [\s]+
         :default ::= action => ::array

         Top            ::= Max_Boundeds action => main::group
         Top            ::= Max_Boundeds Unbounded action => main::group
         Top            ::= Unbounded action => main::group
         Max_Boundeds   ::= Max_Bounded+
         Max_Bounded    ::= Eq_Finals Var_Final
         Max_Bounded    ::= Var_Final
         Unbounded      ::= Eq_Finals
         Eq_Finals      ::= Eq_Final+
         Eq_Final       ::= VAR '='     action => main::concat
         Var_Final      ::= VAR         action => main::concat
         VAR ~ [\w]+

   Abstract Syntax Forests (ASFs)
       Ranking can also be implemented using Marpa's abstract syntax forests.  ASFs are likely to be less
       efficient than the built-in ranking, but they are a more general and powerful solution.

Comparison with PEG

       For those familiar with PEG, Marpa's ranking may seem familiar.  In fact, Marpa's ranking can be seen  as
       a "better PEG".

       A  PEG  specification  looks  like  a BNF grammar.  It is a common misconception that a PEG specification
       implements the same language that it would if interpreted as pure, unranked BNF.  This is the  case  only
       when the grammar is LL(1) -- in other words, rarely in practical use.

       Typically, a PEG implementer thinks their problem out in BNF, perhaps ordering the PEG rules to resolve a
       few of the choices, and then twiddles the PEG specification until the test suite passes.  This results in
       a parser whose behavior is to a large extent unknown.

       Marpa  with  ranking  allows  a  safer  form  of  PEG-style  parsing.  Both Marpa's DSL and PEG allow the
       implementer to specify any context-free grammar, but Marpa parses all  context-free  grammars  correctly,
       while PEG only correctly parses a small subset of the context-free grammars and fakes the rest.

       In  Marpa,  the  implementer of a ranked grammar can work systematically.  He can first write a "superset
       grammar", and then "sculpt" it using ranks until he has exactly the language he wants.   At  all  points,
       the  superset grammar will act as a "safety net".  That is, ranking or no ranking, Marpa returns no parse
       trees that are not specified by the BNF grammar.

       This "sculpted superset" is more likely to result in a satisfactory solution than the PEG approach.   The
       PEG  specification,  re-interpreted as pure BNF, will usually only parse a subset of what the implementer
       needs, and the implementer must use ranks to "stretch" the grammar to describe what he has in mind.

       Ranking is not as predictable as  specifying  BNF.   The  "safety  net"  provided  by  Marpa's  "sculpted
       superset" guards against over-liberal parsing.  This is important: over-liberal parsing can be a security
       loophole, and bugs caused by over-liberal parsing have been the topic of security advisories.

       Determining the exact language parsed by a ranked PEG grammar is extremely hard -- a small specialist
       literature has looked at this subject.  Most implementers don't bother trying -- when the test suite
       passes, they consider the PEG implementation complete.

       Ranking's  effect is more straightforward in Marpa.  Above, we showed very informally that the example in
       our synopsis does what it claims to do -- that the parses returned are actually, and only, those desired.
       A more formal proof is given below.

       Since Marpa parses all context-free grammars, and Marpa parses most unambiguous grammars in linear  time,
       Marpa  often  parses  your language efficiently when it is implemented as pure BNF, without rule ranking.
       As one example, the ranked grammar in the synopsis can be reimplemented as pure unranked BNF.   Where  it
       is  possible  to  convert  your  ranked  grammar to a pure BNF grammar, that will usually be the best and
       safest choice.

Details

       This section contains additional explanations, not essential to understanding the rest of this  document.
       Often  they  are  formal  or  mathematical.   While  some  people  find  these  helpful, others find them
       distracting, which is why they are segregated here.

   Earley parsing
       In this section we assume that the reader is comfortable reading and analyzing BNF.  We  will  often  use
       the basic theorem of Earley parsing.  As a reminder, the theorem states that an instance of a dotted rule
       is present in a parse, if and only if that instance is valid for the input so far.  More formally:

       (EARLEY):  Let  "G"  be  a  grammar.   Let "Syms" be the set of symbols in "G" and let exactly one of the
       symbols in "Syms" be distinguished as the "start symbol".  Call the "start symbol", "<Start>".

       Let "Terms" be a subset of "Syms" called the "terminals" of "G".  Let "W" be a  string  of  symbols  from
       "Terms".  Let "W[i]" be the "i"'th terminal of "W", so that the first terminal of "W" is "W[1]".

       Where  "alpha"  and  "beta"  are  sequences  of  symbols in "Syms", let "alpha => beta" mean that "alpha"
       derives "beta" in grammar "G" in zero or more steps.

       In this theorem, let "alpha", "beta", "gamma", "delta" be sequences of zero or more symbols in "Syms".

       Theorem: These two statement sets are equivalent:

       Statement set 1 is true if and only if we have all of the following:

       1a.  "eitem" is an instance of a dotted rule present in the parse of "W".  We will treat the instance as
       a triple: "(dp, origin, current)".  In the literature, a triple of this form is also called an "Earley
       item".

       1b.  "dp" is the dotted rule "<A> ::= beta . gamma".

       Statement set 2 is true if and only if we have all of the following:

       2a. "alpha => W[1] ... W[origin]"

       2b. "beta => W[origin+1] ... W[current]"

       2c. "<Start> => alpha <A> delta"

       The proof is omitted.  It can be found in Jay Earley's thesis, in his original paper, and in Aho and
       Ullman's 1972 textbook.

   Proof of correctness
       Let "G" be the grammar in the Marpa DSL synopsis, let "W" be a string, and let "ps-pure" be the set of
       parses allowed by "G", treated as if it were pure unranked BNF.  Let "ps-ranked" be the set of parses
       allowed by "G" after ranking is taken into account.

       We  will  say string "W" is valid if and only if "ps-pure" is non-empty.  We will say that a parse set is
       "unambiguous" if and only if it contains at most one parse.

       In the less formal discussion above, we took an intuitive approach to the idea of "longest highest",  one
       which  made  the  unstated  assumption  that optimizing for "longest highest" locally would also optimize
       globally.  Here we make our idea of "longest highest" explicit and justify it.

       What "longest highest" means locally is reasonably obvious -- if "item3" has more lexemes  than  "item1",
       then "item3" is "longer" than "item1".  It is less clear what it means globally.  Items might overlap, so
       that optimizing locally might make the parse as a whole suboptimal.

       We  will  define  a  "longest  highest" parse of "W" as one of the parses of "W" with the fewest "items".
       This is based on the observations that "W" is fixed in length; and that, if items are  longer,  fewer  of
       them will fit into "W".

       More formally, let "p" be a parse, and let "ps" be a parse set.  Let Items(p) be the number of items in
       "p", and let Items(ps) be the minimum number of items of any parse in "ps".  Then, if "lh" is a parse in
       "ps" and, for every parse "p" in "ps", "Items(lh) <= Items(p)" -- equivalently, if "Items(lh) =
       Items(ps)" -- we say that "lh" is a "longest highest" parse.
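
       Using the same parenthesized notation as the synopsis results, the "fewest items" criterion is easy to
       compute.  The sketch below is plain string handling, not Marpa code.

```perl
use strict;
use warnings;

# Count the items in a factoring written in "(...)" notation:
# each "(" opens exactly one item.
sub item_count {
    my ($factoring) = @_;
    my $count = () = $factoring =~ /\(/g;
    return $count;
}

# Two factorings of the same token string "v=v=v":
my @factorings = ( '(v=)(v=)(v)', '(v=)(v=v)' );

# The "longest highest" factoring is the one with the fewest items.
my ($lh) = sort { item_count($a) <=> item_count($b) } @factorings;
print "$lh\n";    # (v=)(v=v)
```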

       To represent token strings conveniently, we will write them as literal strings in which "v" stands for
       a "<VAR>" lexeme and equal signs represent themselves.  For example, we might represent the token string

           <VAR> = <VAR>

       as the string

           v=v

       Theorem: If "W" is a valid string, "ps-ranked" contains exactly one parse, and that parse is the "longest
       highest" parse.

       Proof: Recall that a Marpa "high_rule_only" parse first parses according to the pure unranked BNF, and
       then prunes the parse using ranking.

       Part 1: We will first examine "ps-pure", the parse set which results from the pure BNF phase.

       (1) "Item": An "item" will mean an instance of one of the symbols "Item1", "Item2" or "Item3".  When we
       want to show a token string's division into items, we will enclose the items in parentheses.  For example
       "(v=)(v=)(v)" will indicate that the token string is divided into 3 items; and "(v=)(v=v)" will
       represent the same token string divided into 2 items.

       (2) "Factoring": Analyzing the grammar "G", we see that every parse is a sequence of items, and that "G"
       is ambiguous for a string "W" if and only if that string divides into items in more than one way.  We
       will call a division of "W" into items a "factoring".

       (3) "Independence": Let "seq1" and "seq2" be two factorings of "W" into items.  Let "first" and "last" be
       two lexeme locations such that "first <= last".

       Let  "sub1"  be  a  subsequence of one or more items of "seq1" that starts at "first" and ends at "last".
       Similarly, let "sub2" be a subsequence of one or more items of "seq2" that starts at "first" and ends  at
       "last".   Let  "seq3"  be  the  result of replacing "sub1" in "seq1" with "sub2".  Then, because "G" is a
       context free grammar, "seq3" is also a factoring of "W".

       As an example, let "W" be "v=vv=v", let "seq1" be "(v=v)(v=v)", and let "seq2" be "(v=)(v)(v=)(v)".  Let
       "first" be 4 and "last" be 6, so that "sub1" is "(v=v)" and "sub2" is "(v=)(v)".  Then "seq3" is
       "(v=v)(v=)(v)".  We claimed that "seq3" must be a factoring of "W", and we can see that the claim is
       correct.

       (4): Recall the discussion of lexeme fenceposts.  We will say that a lexeme fencepost is a "factoring
       barrier" if it is not properly contained in any item.  More formally, let First(it) be the first lexeme
       location of an item "it", and let Last(it) be the last lexeme location of "it".  Then a lexeme fencepost
       "i" is a factoring barrier if, for every item "it", either "i <= First(it)" or "i > Last(it)".

       (5): Where "|W|" is the length of "W", we can see from (4) that fencepost "|W|" is a factoring barrier.

       (6): Also from (4) we see that lexeme fencepost 0 is a factoring barrier.

       (7): Let a "subfactoring" be a sequence of items bounded by factoring barriers.

       (8):  Let  a  "minimal  subfactoring"  be  a  subfactoring  which does not properly contain any factoring
       barriers.  From (5) and (6) we can see that "W" can be divided into one or more minimal subfactorings.

       (9): We can see from "G" that two "<VAR>" lexemes never occur together in the same item.  This means
       that, wherever we do find two "<VAR>" lexemes in a row, the lexeme fencepost between them is a factoring
       barrier.
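
       Steps (5), (6) and (9) together locate every factoring barrier, so a token string can be split into its
       minimal subfactorings directly.  The helper below is our own sketch, for illustration only.

       ```python
       # Illustration only: split a token string into its minimal subfactorings.
       # By (5), (6) and (9), the factoring barriers are fencepost 0, fencepost
       # |W|, and every fencepost between two adjacent "v" lexemes.

       def minimal_subfactorings(w):
           pieces, start = [], 0
           for i in range(1, len(w)):
               if w[i - 1] == "v" and w[i] == "v":   # barrier between two <VAR>s
                   pieces.append(w[start:i])
                   start = i
           pieces.append(w[start:])
           return pieces

       print(minimal_subfactorings("v=vv=v"))    # ['v=v', 'v=v']
       print(minimal_subfactorings("v=vv=vv="))  # ['v=v', 'v=v', 'v=']
       ```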

       (10): No item begins with an equal sign (""="").  Every item begins with a "<VAR>" token.

       (11): Every minimal subfactoring starts with a "<VAR>" token.  This is because every subfactoring starts
       with an item, and by (10) every item begins with a "<VAR>" token.

       (12): Every minimal subfactoring starts with a "<VAR>" token and alternates equal signs and "<VAR>"
       lexemes: that is, it has the pattern ""v=v=v= ..."".  This follows from (9) and (11).

       (13): We will call a minimal subfactoring "var-bounded" or simply "bounded" if it starts and ends with  a
       "<VAR>" token.  By (12), this means that it has the pattern ""v=v=v= ... v"".

       (14): We will call a minimal subfactoring "var-unbounded" or simply "unbounded" if it is not var-bounded.
       By (12), this means that it has the pattern ""v=v= ... v="".

       (15): If an "<Item1>" or an "<Item3>" occurs in a minimal subfactoring, it is the last item in that
       subfactoring.

       To show (15), assume for a reductio ad absurdum that there is a minimal subfactoring with a non-final
       "<Item1>" or "<Item3>".  Call that subfactoring "f", and call that item "it1".

       (15a): Both "<Item1>" and "<Item3>" end with a "<VAR>" token, so that "it1" ends in a "<VAR>" token.

       (15b):  Since  "it1",  by assumption for the reductio, is non-final, there is a next item.  Call the next
       item "it2".

       (15c): By (10), all items start with "<VAR>" lexemes, so that "it2" starts with a "<VAR>" token.

       (15d): From (15a) and (15c), we know that "it1" ends with a "<VAR>" token and "it2" starts with a "<VAR>"
       token, so that there are two "<VAR>" lexemes at the fencepost between "it1" and "it2".

       (15e): From (9) and (15d), there is a factoring barrier at the fencepost between "it1" and "it2".

       (15f): All items, including "it1" and "it2", are of non-zero length, so that, from (15e), we know that
       the fencepost between "it1" and "it2" is properly inside subfactoring "f".

       (15g):  From  (15f) we see that there is a factoring barrier properly inside "f".  But "f" is assumed for
       the reductio to be minimal and therefore cannot properly contain a factoring barrier.

       (15h): (15g) shows the reductio, and allows us to conclude that, if  "f"  contains  an  "<Item1>"  or  an
       "<Item3>", that item must be the final item.  This is what we needed to show for (15).

       (16): By (3), every minimal subfactoring is independent -- in other words, no ambiguity crosses factoring
       barriers.   So  we narrow our consideration of ambiguities to two cases: ambiguities in unbounded minimal
       subfactorings, and ambiguities in bounded minimal subfactorings.

       (17): Every unbounded minimal subfactoring is unambiguous.  To show (17), let "u" be an unbounded minimal
       subfactoring.  It follows from the definition of unbounded minimal subfactoring (14), that "u" ends in an
       equal sign.  Both "<Item1>" and "<Item3>" end in "<VAR>" lexemes, so that no "<Item1>" or  "<Item3>"  can
       be the last item in "u".  Therefore, by (15), there is no "<Item1>" or "<Item3>" item in "u".  This means
       that  every  item  in  "u" is an "<Item2>" item.  This in turn means that "u" is unambiguous, which shows
       (17).

       (18): Every bounded minimal subfactoring parses either as a sequence of "Item2"'s followed by an
       "Item1", or as a sequence of "Item2"'s followed by an "Item3".  This follows from (1), (13) and (15).

       (19): From (16), (17) and (18) it follows that every ambiguity between subfactorings is  between  bounded
       subfactorings whose divisions into items take the two forms

       ""(v=) ... (v=) (v)""

       and

       ""(v=) ... (v=v)""
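
       As a check on (19), the sketch below (again our own illustration, not Marpa code) enumerates the
       factorings of bounded minimal subfactorings of increasing length: each one longer than a bare "v" has
       exactly two factorings, of the two forms shown above.

       ```python
       # Illustration only: a bounded minimal subfactoring "v=v=...=v" has
       # exactly the two factorings described in (19).

       ITEMS = ("v=v", "v=", "v")

       def factorings(w):
           """Return every division of the token string w into items."""
           if w == "":
               return [[]]
           return [[item] + rest
                   for item in ITEMS if w.startswith(item)
                   for rest in factorings(w[len(item):])]

       for n in range(1, 5):
           w = "=".join("v" * n)     # "v", "v=v", "v=v=v", "v=v=v=v"
           fs = factorings(w)
           print(w, len(fs), fs)     # counts: 1, 2, 2, 2
       ```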

       Part  2:  We have now shown how pure BNF parsing works for the grammar in the synopsis.  We next show the
       consequences of ranking.

       (20): A completed instance of "List ::= Item1" can only be found at lexeme location 1.  We know this from
       "G" and (EARLEY).

       (21): A completed instance of "List ::= List Item1" can only be found at lexeme location 2 or after.   We
       know this from "G" and (EARLEY).

       (22): At most one dotted rule with "Item1" as its last item appears in a cause at any choicepoint.  We
       know this because all choices of a choicepoint must be completions with the same current location, and
       from (20) and (21).

       (23): A completed instance of "List ::= Item3" can only be found at lexeme location 3.  We know this from
       "G" and (EARLEY).

       (24):  A completed instance of "List ::= List Item3" can only be found at lexeme location 4 or after.  We
       know this from "G" and (EARLEY).

       (25): At most one dotted rule with "Item3" as its last item appears in a cause at any choicepoint.  We
       know this because all choices of a choicepoint must be completions with the same current location, and
       from (23) and (24).

       (26): Consider the ambiguity shown in (19) and let "current" be the lexeme fencepost at its end.  An item
       also ends at "current" so from "G" and (EARLEY), we know that there is at least one dotted rule  instance
       whose  predot  symbol is "List" and whose current location is "current".  The choicepoint can have one of
       these dotted rules:

            Top ::= List .
            List ::= List . Item3
            List ::= List . Item2
            List ::= List . Item1

       with location 0 as its origin and "current" as  its  current  location.   There  may  be  more  than  one
       choicepoint fulfilling these criteria.

       Of the choicepoints fulfilling the criteria, we arbitrarily pick one.  We call this choicepoint "result".
       We will see as we proceed that, while the choicepoints may have different dotted rules, our choice of one
       of  them for "result" makes no difference in the ranking logic.  The only properties of the "result" that
       we will use are dot position, predot symbol,  and  current  location.   These  properties  are  the  same
       regardless of which choicepoint we picked for "result".

       (27):  Recall that the "middle location" of a choice is always the origin of its cause.  From (26) we see
       that, for every possible rule in "result", the predot symbol is the first  symbol  of  the  dotted  rule.
       This  means  that  the origin of the causes of "result" must be the same as the origin of "result".  From
       (26) we also know that the origin of "result" is always location 0.  Therefore the origin of every one of
       the causes of "result" must be location 0.  Therefore every one of the causes of "result"  has  the  same
       middle location.

       (28):  From  (26) and (27) we know that, if "cuz" is a cause of "result", it has an origin at location 0;
       its current location is "current"; and its LHS is "List".  "cuz" may be any one of

            List ::= Item3 .
            List ::= Item2 .
            List ::= Item1 .
            List ::= List Item3 .
            List ::= List Item2 .
            List ::= List Item1 .

       (29): From (19) and (EARLEY) we know that dotted rules  ending  in  "Item2"  are  not  valid  choices  at
       location "current".

       (30):  From  (19),  (22), (28) and (EARLEY) we know that exactly one of the following two dotted rules is
       the cause of a parse choice for "result":

            List ::= Item1 .
            List ::= List Item1 .

       (31): From (19), (25), (28) and (EARLEY) we know that exactly one of the following two  dotted  rules  is
       the cause of a parse choice for "result":

            List ::= Item3 .
            List ::= List Item3 .

       (32): From (28), (29), (30) and (31) we know that we will have exactly two choices for "result"; that
       the final symbol of the cause in one will be "Item1"; and that the final symbol of the cause in the
       other will be "Item3".  From the grammar in the synopsis, we know that the cause ending in "Item1" will
       have rank 1, and that the cause ending in "Item3" will have rank 3.  From this we see, since the ranking
       method is "high_rule_only", that the choice whose final symbol is "Item3" will be the only one kept in
       "ps-ranked".  We further see from (19) that this choice will have fewer items than the alternative.

       (33): From (19) we know that there is at most one ambiguity per minimal subfactoring.  From (16) we know
       that the ambiguity resolutions of a minimal subfactoring are independent of the ambiguity resolutions in
       other minimal subfactorings.  From (32) we know that every subfactoring in "ps-ranked" has exactly one
       factoring, and that that factoring is the one with the fewest items.  Therefore, "ps-ranked" will
       contain exactly one parse, and that parse will be "longest highest".

       QED.
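
       The conclusion can be exercised end to end.  The sketch below is our own illustration of the proof's
       structure; Marpa itself reaches the same result by pruning the parse forest under the "high_rule_only"
       ranking method.  It computes the "longest highest" parse directly and reproduces the expected outputs of
       the synopsis tests, with "v" in place of each variable name.

       ```python
       # Illustration only: compute the "longest highest" parse directly from
       # the structure established in the proof.

       def longest_highest(w):
           # Split at the factoring barriers between adjacent "v" lexemes (9).
           pieces, start = [], 0
           for i in range(1, len(w)):
               if w[i - 1] == "v" and w[i] == "v":
                   pieces.append(w[start:i])
                   start = i
           pieces.append(w[start:])
           # Give each piece its fewest-items factoring: (17), (32).
           items = []
           for p in pieces:
               if p.endswith("="):          # unbounded: all Item2 ("v=")
                   items += [p[i:i + 2] for i in range(0, len(p), 2)]
               elif len(p) == 1:            # a lone "v": a single Item1
                   items.append(p)
               else:                        # bounded: Item2's, then a final Item3
                   items += [p[i:i + 2] for i in range(0, len(p) - 3, 2)]
                   items.append(p[-3:])
           return "".join("(" + it + ")" for it in items)

       # Mirrors the synopsis tests, with "v" for each variable name:
       print(longest_highest("v"))          # (v)
       print(longest_highest("v=v=v"))      # (v=)(v=v)
       print(longest_highest("v=v=v=v"))    # (v=)(v=)(v=v)
       print(longest_highest("v=vv=vv="))   # (v=v)(v=v)(v=)
       ```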

Copyright and License

         Copyright 2022 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.

perl v5.40.1                                       2025-06-25                    Marpa::R2::Semantics::Rank(3pm)