Provided by: libmarpa-r2-perl_12.000000-1_amd64 

NAME
Marpa::R2::Semantics::Rank - How ranks are computed
Synopsis
my $source = <<'END_OF_SOURCE';
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 3
List ::= Item2 rank => 2
List ::= Item1 rank => 1
List ::= List Item3 rank => 3
List ::= List Item2 rank => 2
List ::= List Item1 rank => 1
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
END_OF_SOURCE
my @tests = (
[ 'a', '(a)', ],
[ 'a = b', '(a=b)', ],
[ 'a = b = c', '(a=)(b=c)', ],
[ 'a = b = c = d', '(a=)(b=)(c=d)', ],
[ 'a = b c = d', '(a=b)(c=d)' ],
[ 'a = b c = d e =', '(a=b)(c=d)(e=)' ],
[ 'a = b c = d e', '(a=b)(c=d)(e)' ],
[ 'a = b c = d e = f', '(a=b)(c=d)(e=f)' ],
);
my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );
my @results = ();
for my $test (@tests) {
my ( $input, $output ) = @{$test};
my $recce = Marpa::R2::Scanless::R->new(
{
grammar => $grammar,
ranking_method => 'high_rule_only'
}
);
$recce->read( \$input );
my $value_ref = $recce->value();
if ( not defined $value_ref ) {
die 'No parse';
}
push @results, ${$value_ref};
}
sub flatten {
my ($array) = @_;
# say STDERR 'flatten arg: ', Data::Dumper::Dumper($array);
my $ref = ref $array;
return [$array] if $ref ne 'ARRAY';
my @flat = ();
ELEMENT: for my $element ( @{$array} ) {
my $ref = ref $element;
if ( $ref ne 'ARRAY' ) {
push @flat, $element;
next ELEMENT;
}
my $flat_piece = flatten($element);
push @flat, @{$flat_piece};
}
return \@flat;
}
sub concat {
my ( $pp, @args ) = @_;
# say STDERR 'concat: ', Data::Dumper::Dumper(\@args);
my $flat = flatten( \@args );
return join '', @{$flat};
}
sub group {
my ( $pp, @args ) = @_;
# say STDERR 'group args: ', Data::Dumper::Dumper(\@args);
my $flat = flatten( \@args );
return join '', map { +'(' . $_ . ')'; } @{$flat};
}
Description
This document describes rule ranking. Rule ranking plays a role in parse ordering, which is described in
a separate document.
Lexeme locations and fenceposts
The lexeme locations are an ordered set of sets of lexemes. Numbering of lexeme locations is 0-based.
It is also useful to have an idea of locations between lexeme locations. This is the classic "fencepost"
issue: sometimes you want to count sections of fence, and in other cases it is more convenient to count
fenceposts.
Let the lexeme locations be from 0 to "N-1". If "I" is greater than 0 and less than "N", then lexeme
fencepost "I" is before lexeme location "I" and after lexeme location "I-1". Lexeme fencepost 0 is
before lexeme location 0. Lexeme fencepost "N" is after lexeme location "N-1".
The above implies that
• If there are "N" lexeme locations, there are "N+1" lexeme fenceposts.
• lexeme location "I" is always between lexeme fencepost "I" and lexeme fencepost "I+1".
Lexeme locations and fenceposts are closely related to G1 locations.
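The location/fencepost arithmetic can be summarized in a short sketch. (Python is used here purely as illustration; none of these names are part of Marpa.)

```python
# Model N lexeme locations, numbered 0 .. N-1, and their fenceposts,
# numbered 0 .. N.  Fencepost I sits at the boundary before location I;
# fencepost N sits after the last location.
N = 5
locations = range(N)            # lexeme locations 0 .. N-1
fenceposts = range(N + 1)       # lexeme fenceposts 0 .. N

# N lexeme locations imply N+1 lexeme fenceposts.
assert len(fenceposts) == len(locations) + 1

def bounding_fenceposts(i):
    # Lexeme location I lies between fencepost I and fencepost I+1.
    return (i, i + 1)

assert bounding_fenceposts(0) == (0, 1)
assert bounding_fenceposts(N - 1) == (N - 1, N)
```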
Dotted Rules
To understand this document, it is important to understand what a dotted rule is. An acquaintance with
dotted rules is also important in understanding Marpa's progress reports. Dotted rules are thoroughly
described in the progress report documentation. This section repeats the main ideas from the perspective
of this document.
Recall that a rule is a LHS (left hand side) and a RHS (right hand side). The LHS is always exactly one
symbol. The RHS is zero or more symbols. Consider the following example of a rule, given in the syntax
of Marpa's DSL.
S ::= A B C
Dotted rules are used to track the progress of a parse through a rule. They consist of a rule and a dot
position, which marks the point in the RHS which scanning has reached. It is called the dot position
because, traditionally, the position is represented by a dot. The symbol before the dot is called the
predot symbol.
The following is an example of a dotted rule. (The dot of a dotted rule is not part of Marpa's DSL but,
when it is useful for illustration, we will use it in the notation in this document.)
S ::= A B . C
In this rule, B is the predot symbol.
When the dot is after the last symbol of the RHS, the dotted rule is called a completion. Here is the
completion for the above rule:
S ::= A B C .
In this completion example, the symbol C is the predot symbol.
When the dot is before the first symbol of the RHS, the rule is called a prediction. Here is the
prediction of the rule we've been using for our examples:
S ::= . A B C
In predictions, there is no predot symbol.
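The definitions above can be made concrete with a small sketch. (This is a hypothetical illustration in Python, not part of the Marpa API.) A dotted rule is modeled as a rule plus a dot position:

```python
# A dotted rule: an LHS, a RHS, and a dot position 0 .. len(rhs).
class DottedRule:
    def __init__(self, lhs, rhs, dot):
        self.lhs, self.rhs, self.dot = lhs, tuple(rhs), dot

    def predot(self):
        # The predot symbol is the symbol just before the dot.
        # A prediction (dot at position 0) has no predot symbol.
        return self.rhs[self.dot - 1] if self.dot > 0 else None

    def is_prediction(self):
        # Prediction: the dot is before the first RHS symbol.
        return self.dot == 0

    def is_completion(self):
        # Completion: the dot is after the last RHS symbol.
        return self.dot == len(self.rhs)

# S ::= A B . C  -- the predot symbol is B
dr = DottedRule('S', ['A', 'B', 'C'], 2)
assert dr.predot() == 'B'
assert DottedRule('S', ['A', 'B', 'C'], 3).is_completion()
assert DottedRule('S', ['A', 'B', 'C'], 0).is_prediction()
```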
Choicepoints
Informally, a choicepoint is a place where the parser can decide among one or more parse choices. A
choicepoint can be thought of either as a set of parse choices, or as the tuple of the properties which
all the parse choices for the choicepoint must have in common.
For a choicepoint to work, all the choices must have enough in common that each of them could be replaced
with any other. Thought of as a tuple of properties, a choicepoint is a triple: "(dp, start, current)".
In this triple,
• "dp" is a dotted rule. The predot symbol of "dp" is the predot symbol of the choicepoint -- if "dp"
is a prediction, there is no predot symbol. The rule of "dp" is the rule of the choicepoint. The
dot position of "dp" is the dot position of the choicepoint.
• "start" is the lexeme location where the dotted rule begins. "start" is sometimes called the origin
of the choicepoint.
• "current" is the lexeme location corresponding to the dot in "dp". "current" is sometimes called the
"current location" of the choicepoint.
The choicepoint is a prediction choicepoint if "dp" is a prediction. The choicepoint is a token
choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is a token symbol. The
choicepoint is a rule choicepoint if it is not a prediction choicepoint and the predot symbol of "dp" is
the LHS of a rule. (Token symbols are never the LHS of a rule, and vice versa.)
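These classification rules can be restated as a small sketch (hypothetical names, for illustration only):

```python
def predot(rhs, dot):
    # Predot symbol of a dotted rule (lhs, rhs, dot); None for a prediction.
    return rhs[dot - 1] if dot > 0 else None

def classify_choicepoint(rhs, dot, token_symbols):
    """Classify a choicepoint "(dp, start, current)" by its dotted rule "dp"."""
    if dot == 0:
        return 'prediction'              # dp is a prediction
    if predot(rhs, dot) in token_symbols:
        return 'token'                   # predot symbol is a token symbol
    return 'rule'                        # predot symbol is the LHS of a rule

# For S ::= A B . C, with B a token symbol:
assert classify_choicepoint(('A', 'B', 'C'), 2, {'B'}) == 'token'
# For S ::= . A B C:
assert classify_choicepoint(('A', 'B', 'C'), 0, {'B'}) == 'prediction'
```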
Parse choices
As mentioned, a choicepoint can be seen as a set of one or more parse choices. From the point of view of
the individual parse trees, the traversal is top-down and left-to-right.
Often, there is only one parse choice. When there is only one parse choice, the choicepoint is said to
be trivial. Prediction and token choicepoints are always trivial. If all of the choicepoints of a parse
are trivial, the parse is unambiguous.
Every rule choicepoint, and therefore every non-trivial choicepoint, has a set of parse choices
associated with it. For a rule choicepoint, each parse choice is a duple: "(predecessor, cause)", where
"cause" and "predecesor" are also choicepoints. The first element of the duple is the predecessor
choicepoint, or predecessor, of the parse choice. The second element of the duple is the cause
choicepoint, or cause of the parse choice.
Let "res" be a rule choicepoint and let "(pred, cuz)" be one of the parse choices of "res". Then we say
that "res" is the result of "cuz" and "pred"; and we say that "cuz" is one of the causes of "res".
The predecessor of a parse choice represents a portion of the parse that "leads up to" the cause. The
predecessor of a parse choice plays no role in ranking decisions, and this document will mostly ignore
predecessors.
Causes
Since it is a choicepoint, a cause choicepoint must also be a triple. Let the result of "cuz" be "res".
Let the triple for "cuz" be "(cuz-dp, cuz-origin, cuz-current)". Let the triple for "res" be
"(res-dp, res-origin, res-current)".
If "res" is the result of "cuz", then
• "cuz-current" is always the same as "res-current". In other words, the current location of "cuz" is
the same as the current location of "res".
• "cuz-origin" is before "res-current". In other words, the origin of "cuz" is before the current
location of "res".
• "cuz-origin" is at or after "res-origin". In other words, the origin of "cuz" is at or after the
origin of "res". While it is not always properly between "res-origin" and "res-current",
"cuz-origin" is always in the range bounded by "res-origin" on one side, and by "res-current" on the
other. "cuz-origin" can therefore be thought of as "in the middle" of the parse choice. For this
reason, "cuz-origin" is often called the middle location of the parse choice.
• The dotted rule of "cuz" will be a completion.
• The LHS of the dotted rule of "cuz" will be the predot symbol of "res".
Summing up the above, the causes of a rule choicepoint vary only by middle position and rule RHS. The
causes of a rule choicepoint always share the same current lexeme location and rule LHS, and their dotted
rules are always completions.
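The invariants above can be checked directly on the two triples. In this sketch (hypothetical, not the Marpa API) a choicepoint is reduced to its "origin" and "current" locations:

```python
def check_cause(res, cuz):
    """Assert the invariants relating a cause "cuz" to its result "res"."""
    # The current location of the cause equals that of the result.
    assert cuz['current'] == res['current']
    # The origin of the cause (the "middle" location) is before the
    # current location of the result ...
    assert cuz['origin'] < res['current']
    # ... and at or after the origin of the result.
    assert res['origin'] <= cuz['origin']
    return True

# A result spanning locations 0..5 with a cause spanning 3..5:
assert check_cause({'origin': 0, 'current': 5}, {'origin': 3, 'current': 5})
```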
Rank is by Cause
The rank of a parse choice is that of its cause. The rank of a cause is the rank of its rule.
Therefore, the rank of a parse choice is the rank of the rule of its cause. That last sentence is the
most important sentence of this document, so we will repeat it:
The rank of a parse choice is
the rank of the rule of its cause.
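As a sketch of how this plays out under the "high_rule_only" ranking method (hypothetical data structures, not the Marpa internals), pruning at one choicepoint keeps only the parse choices whose cause's rule has the highest rank:

```python
def prune_high_rule_only(parse_choices, rule_rank):
    """Keep only the parse choices whose cause has the highest-ranked rule.

    parse_choices: a list of (predecessor, cause) duples, where a cause is
    named by its completed rule.  rule_rank maps rule names to ranks."""
    best = max(rule_rank[cause] for _, cause in parse_choices)
    return [pc for pc in parse_choices if rule_rank[pc[1]] == best]

rank = {'List ::= Item3': 3, 'List ::= Item1': 1}
choices = [('pred-a', 'List ::= Item3'), ('pred-b', 'List ::= Item1')]
# Only the rank-3 choice survives; the choicepoint becomes trivial.
assert prune_high_rule_only(choices, rank) == [('pred-a', 'List ::= Item3')]
```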
Motivation
By the definition of the previous section, the rule of the choicepoint itself plays no direct role in
ranking. Instead, the ranking is based on the rules of one of the choicepoint's child nodes. At first,
this may seem unnecessarily roundabout, and even counter-intuitive, but on reflection, it will make
sense.
Ranking cannot be based directly on the rank of the choicepoint's rule because every parse choice for
that choicepoint shares the choicepoint's rule. We have not examined the internal structure of
predecessors, but ranking is not based on them for the same reason: the rule of a predecessor is always
the same as the rule of its result choicepoint, and is therefore the same for every parse choice of a
choicepoint.
On the other hand, the causes of parse choices can and often do differ in their rules. Therefore ranking
is based on the rules of the causes of a choicepoint.
Note that ranking is only by direct causes. It might reasonably be asked: why not, at least in the case
of a tie, look at causes of causes? Or why not resolve ties by looking at causes of predecessors?
Marpa's built-in rule ranking was chosen to be the most powerful system that could be implemented with
effectively zero cost. Ranking by direct causes uses only information that is quickly and directly
available, so that its runtime cost is probably not measurable.
Complexity was also an issue. If indirect causes are taken into account, we would need to specify which
causes, and under what circumstances they are used. Only causes of causes? Or causes of predecessors?
Depth-first, or breadth-first? Only in case of ties, or using a more complex metric? To arbitrary
depth, or using a traversal that is cut off at some point?
Ranking by direct causes only means the answer to all of these questions is as simple as it can be. Once
we get into analyzing the synopsis grammar in detail, the importance of simplicity will become clear.
To be sure, there are apps whose requirements justify extra overhead and extra complexity. For these
apps, Marpa's ASFs allow full generality in ranking.
Examples
Our examples in this document will look at the ranked grammar in the synopsis, and at minor variations of
it.
Longest highest, version 1
We will first consider the example as it is given in the synopsis:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 3
List ::= Item2 rank => 2
List ::= Item1 rank => 1
List ::= List Item3 rank => 3
List ::= List Item2 rank => 2
List ::= List Item1 rank => 1
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
Longest highest ranking
The DSL in the synopsis ranks its items "longest highest". Here "items" are represented by the symbols,
"<Item3>", "<Item2>" and "<Item1>". The "longest" choice is considered to be the one with the most
lexemes. Working this idea out for this grammar, we see that the items should rank, from highest to
lowest: "<Item3>", "<Item2>" and "<Item1>".
The non-trivial choicepoints
Several examples and their results are shown in the synopsis. If you study the grammar, you will see
that the non-trivial choicepoints will be for the dotted rules:
Top ::= List .
List ::= List . Item3
List ::= List . Item2
List ::= List . Item1
The above are all of the potential dotted rules for result choicepoints.
The cause choicepoints
In all of these dotted rules, the predot symbol is "<List>". Recall that cause choicepoints are always
completions. Since the predot symbol of all the non-trivial choicepoints is "<List>", the dotted rule of
a cause of the non-trivial choicepoints may be any of
List ::= Item1 .
List ::= Item2 .
List ::= Item3 .
List ::= List Item1 .
List ::= List Item2 .
List ::= List Item3 .
In fact, if we study the grammar more closely, we will see that the only possible ambiguity is in the
sequence of items, and that ambiguities always take the form ""(v=)(v)"" versus ""(v=v)"". Therefore the
only causes of non-trivial parse choices are
List ::= Item3 .
List ::= Item1 .
List ::= List Item3 .
List ::= List Item1 .
The first non-trivial choicepoints
Further study of the grammar shows that the first non-trivial choice must be between two parse choices
with these cause choicepoints:
List ::= Item3 .
List ::= Item1 .
The rule "List ::= Item3" has rank 3, and it obviously outranks the rule "List ::= Item1", which has rank
1. Our example uses "high_rank_only" ranking, and our example therefore will leave only this choice:
List ::= Item3 .
With only one choice left, the resulting choicepoint becomes trivial, and, as defined above, that
remaining choice is "longest", and therefore the correct one for the "longest highest" ranking.
The second and subsequent non-trivial choicepoints
We've looked at only the first non-trivial choice for our example code. Again examining the grammar, we
see that the second and subsequent non-trivial choices will all be between parse choices whose causes
have these two dotted rules:
List ::= List Item3 .
List ::= List Item1 .
The rule "List ::= List Item3" has rank 3, and it obviously outranks the rule "List ::= List Item1",
which has rank 1. Our example uses "high_rank_only" ranking, and our example therefore will leave only
this choice:
List ::= List Item3 .
Conclusion
We have now shown that our example will reduce all choicepoints to a single choice, one which is
consistent with "longest highest" ranking. Since all choicepoints are reduced to a single choice, the
ranked grammar is unambiguous.
This analysis made a lot of unstated assumptions. Below, there is a "Proof of correctness". It deals
with this same example, but proceeds much more carefully.
Shortest highest, version 1
Here we see the grammar of the synopsis, reworked for a "shortest highest" ranking. "Shortest highest"
is the reverse of "longest highest".
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item3 rank => 1
List ::= Item2 rank => 2
List ::= Item1 rank => 3
List ::= List Item3 rank => 1
List ::= List Item2 rank => 2
List ::= List Item1 rank => 3
Item3 ::= VAR '=' VAR action => main::concat
Item2 ::= VAR '=' action => main::concat
Item1 ::= VAR action => main::concat
VAR ~ [\w]+
Here are what the results will look like for "shortest highest".
my @tests = (
[ 'a', '(a)', ],
[ 'a = b', '(a=)(b)', ],
[ 'a = b = c', '(a=)(b=)(c)', ],
[ 'a = b = c = d', '(a=)(b=)(c=)(d)', ],
[ 'a = b c = d', '(a=)(b)(c=)(d)' ],
[ 'a = b c = d e =', '(a=)(b)(c=)(d)(e=)' ],
[ 'a = b c = d e', '(a=)(b)(c=)(d)(e)' ],
[ 'a = b c = d e = f', '(a=)(b)(c=)(d)(e=)(f)' ],
);
The reader who wants an example of a ranking scheme to work out for themselves may find this one
suitable. The reasoning will be very similar to that for the "longest highest" example, just above.
Longest highest, version 2
The previous examples have shown the rules involved in parse ranking in "spelled out" form. In fact, a
more compact form of the grammar can be used, as shown below for "longest highest" ranking.
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item rank => 1
List ::= List Item rank => 0
Item ::= VAR '=' VAR rank => 3 action => main::concat
Item ::= VAR '=' rank => 2 action => main::concat
Item ::= VAR rank => 1 action => main::concat
VAR ~ [\w]+
Shortest highest, version 2
This is the grammar for "shortest highest", in compact form:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= List action => main::group
List ::= Item rank => 0
List ::= List Item rank => 1
Item ::= VAR '=' VAR rank => 1 action => main::concat
Item ::= VAR '=' rank => 2 action => main::concat
Item ::= VAR rank => 3 action => main::concat
VAR ~ [\w]+
Again, the reader looking for simple examples to work out for themselves may want to rework the argument
given above for the "spelled out" examples for the compact examples.
Alternatives to Ranking
Reimplementation as pure BNF
It is generally better, when possible, to write a language as BNF, instead of using ranking. The
advantage of using BNF is that you can more readily determine exactly what language it is that you are
parsing: ranked grammars may look easier to analyze at first glance, but the more you look at them, the
trickier you realize they are.
These "pure BNF" reimplementations rely on an observation used in the detailed proof of the ranked BNF
example: The parse string becomes easier to analyze when you see it as a sequence of "var-bounded"
substrings, where "var-bounded" means the substring is delimited by the start of the string, the end of
the string, or the fencepost between two "<VAR>" lexemes.
Longest highest as pure BNF
Here is the "longest highest" example, reimplemented as BNF:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= Max_Boundeds action => main::group
Top ::= Max_Boundeds Unbounded action => main::group
Top ::= Unbounded action => main::group
Max_Boundeds ::= Max_Bounded+
Max_Bounded ::= Eq_Finals Var_Final3
Max_Bounded ::= Var_Final
Unbounded ::= Eq_Finals
Eq_Finals ::= Eq_Final+
Var_Final ::= Var_Final3 | Var_Final1
Var_Final3 ::= VAR '=' VAR action => main::concat
Eq_Final ::= VAR '=' action => main::concat
Var_Final1 ::= VAR action => main::concat
VAR ~ [\w]+
Shortest highest as pure BNF
We can also reimplement the "shortest highest" example as BNF. One of the advantages of a BNF
(re)implementation is that it often clarifies the grammar. For example, in this case we note that the
DSL rule
Var_Final3 ::= VAR '=' VAR action => main::concat
is, in fact, never used. We therefore omit it:
:discard ~ ws; ws ~ [\s]+
:default ::= action => ::array
Top ::= Max_Boundeds action => main::group
Top ::= Max_Boundeds Unbounded action => main::group
Top ::= Unbounded action => main::group
Max_Boundeds ::= Max_Bounded+
Max_Bounded ::= Eq_Finals Var_Final
Max_Bounded ::= Var_Final
Unbounded ::= Eq_Finals
Eq_Finals ::= Eq_Final+
Eq_Final ::= VAR '=' action => main::concat
Var_Final ::= VAR action => main::concat
VAR ~ [\w]+
Abstract Syntax Forests (ASFs)
Ranking can also be implemented using Marpa's abstract syntax forests. ASFs are likely to be less
efficient than the built in ranking, but they are a more general and powerful solution.
Comparison with PEG
For those familiar with PEG, Marpa's ranking may seem familiar. In fact, Marpa's ranking can be seen as
a "better PEG".
A PEG specification looks like a BNF grammar. It is a common misconception that a PEG specification
implements the same language that it would if interpreted as pure, unranked BNF. This is the case only
when the grammar is LL(1) -- in other words, rarely in practical use.
Typically, a PEG implementer thinks their problem out in BNF, perhaps ordering the PEG rules to resolve a
few of the choices, and then twiddles the PEG specification until the test suite passes. This results in
a parser whose behavior is to a large extent unknown.
Marpa with ranking allows a safer form of PEG-style parsing. Both Marpa's DSL and PEG allow the
implementer to specify any context-free grammar, but Marpa parses all context-free grammars correctly,
while PEG only correctly parses a small subset of the context-free grammars and fakes the rest.
In Marpa, the implementer of a ranked grammar can work systematically. He can first write a "superset
grammar", and then "sculpt" it using ranks until he has exactly the language he wants. At all points,
the superset grammar will act as a "safety net". That is, ranking or no ranking, Marpa returns no parse
trees that are not specified by the BNF grammar.
This "sculpted superset" is more likely to result in a satisfactory solution than the PEG approach. The
PEG specification, re-interpreted as pure BNF, will usually only parse a subset of what the implementer
needs, and the implementer must use ranks to "stretch" the grammar to describe what he has in mind.
Ranking is not as predictable as specifying BNF. The "safety net" provided by Marpa's "sculpted
superset" guards against over-liberal parsing. This is important: over-liberal parsing can be a security
loophole, and bugs caused by over-liberal parsing have been the topic of security advisories.
Determining the exact language parsed by a ranked grammar in PEG is extremely hard -- a small specialist
literature has looked at this subject. Most implementers don't bother trying -- when the test suite
passes, they consider the PEG implementation complete.
Ranking's effect is more straightforward in Marpa. Above, we showed very informally that the example in
our synopsis does what it claims to do -- that the parses returned are actually, and only, those desired.
A more formal proof is given below.
Since Marpa parses all context-free grammars, and Marpa parses most unambiguous grammars in linear time,
Marpa often parses your language efficiently when it is implemented as pure BNF, without rule ranking.
As one example, the ranked grammar in the synopsis can be reimplemented as pure unranked BNF. Where it
is possible to convert your ranked grammar to a pure BNF grammar, that will usually be the best and
safest choice.
Details
This section contains additional explanations, not essential to understanding the rest of this document.
Often they are formal or mathematical. While some people find these helpful, others find them
distracting, which is why they are segregated here.
Earley parsing
In this section we assume that the reader is comfortable reading and analyzing BNF. We will often use
the basic theorem of Earley parsing. As a reminder, the theorem states that an instance of a dotted rule
is present in a parse, if and only if that instance is valid for the input so far. More formally:
(EARLEY): Let "G" be a grammar. Let "Syms" be the set of symbols in "G" and let exactly one of the
symbols in "Syms" be distinguished as the "start symbol". Call the "start symbol", "<Start>".
Let "Terms" be a subset of "Syms" called the "terminals" of "G". Let "W" be a string of symbols from
"Terms". Let "W[i]" be the "i"'th terminal of "W", so that the first terminal of "W" is "W[1]".
Where "alpha" and "beta" are sequences of symbols in "Syms", let "alpha => beta" mean that "alpha"
derives "beta" in grammar "G" in zero or more steps.
In this theorem, let "alpha", "beta", "gamma", "delta" be sequences of zero or more symbols in "Syms".
Theorem: These two statement sets are equivalent:
Statement set 1 is true if and only if we have all of the following:
1a. "eitem" is an instance of a dotted rule, which we will treat as a triple: "dp, origin, current". In
the literature, a triple of this form is also called an "Earley item".
1b. "dp" is the dotted rule "<A> ::= beta . gamma>".
Statement set 2 is true if and only if we have all of the following:
2a. "alpha => W[1] ... W[origin]"
2b. "beta => W[origin+1] .. W[current]"
2c. "<Start> => alpha <A> delta"
The proof is omitted. It can be found in Jay Earley's thesis, in his original paper, and in Aho and
Ullman's 1972 textbook.
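To experiment with which dotted-rule instances (Earley items) appear for a given input, a minimal recognizer along the lines of the theorem can be sketched as follows. This is an illustration only, not Marpa's implementation; the grammar encoding (a dict from LHS to a list of RHS tuples) is hypothetical.

```python
def earley_chart(grammar, start, tokens):
    """Build the Earley sets for `tokens`.  An item is (lhs, rhs, dot, origin);
    chart[k] holds the items whose current location is k."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for k in range(n + 1):
        changed = True
        while changed:                       # iterate to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[k]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:       # predict: dot before a nonterminal
                        for prod in grammar[sym]:
                            item = (sym, prod, 0, k)
                            if item not in chart[k]:
                                chart[k].add(item)
                                changed = True
                    elif k < n and tokens[k] == sym:   # scan a terminal
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                        # complete: dot at end of RHS
                    for plhs, prhs, pdot, porigin in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            item = (plhs, prhs, pdot + 1, porigin)
                            if item not in chart[k]:
                                chart[k].add(item)
                                changed = True
    return chart

# S ::= A B C, with 'a', 'b', 'c' as the terminals of A, B, C.
g = {'S': [('A', 'B', 'C')], 'A': [('a',)], 'B': [('b',)], 'C': [('c',)]}
chart = earley_chart(g, 'S', ['a', 'b', 'c'])
# The completion "S ::= A B C ." with origin 0 appears at location 3.
assert ('S', ('A', 'B', 'C'), 3, 0) in chart[3]
```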
Proof of correctness
Let "G" be the grammar in the Marpa DSL synopsis, Let "W" be a string, and let "ps-pure" be the set of
parses allowed by "G", treated as if it was pure unranked BNF. Let "ps-ranked" be the set of parses
allowed by the "G" after ranking is taken into account.
We will say string "W" is valid if and only if "ps-pure" is non-empty. We will say that a parse set is
"unambiguous" if and only if it contains at most one parse.
In the less formal discussion above, we took an intuitive approach to the idea of "longest highest", one
which made the unstated assumption that optimizing for "longest highest" locally would also optimize
globally. Here we make our idea of "longest highest" explicit and justify it.
What "longest highest" means locally is reasonably obvious -- if "item3" has more lexemes than "item1",
then "item3" is "longer" than "item1". It is less clear what it means globally. Items might overlap, so
that optimizing locally might make the parse as a whole suboptimal.
We will define a "longest highest" parse of "W" as one of the parses of "W" with the fewest "items".
This is based on the observations that "W" is fixed in length; and that, if items are longer, fewer of
them will fit into "W".
More formally, let "p" be a parse, and let "ps" be a parse set. Let Items(p) be the number of items in
"p". Let Items(ps), be the minimum number of items of any parse in "ps". Then, if "lh" is a parse in
"ps" and if, for every parse "p" in parse set "ps", "Items(lh) <= Items(p)", we say that "lh" is a
"longest highest" parse.
To represent token strings conveniently, we will write them as literal strings in which ""v"" stands for
a "<VAR>" lexeme and equal signs represent themselves. For example, we might represent the token string
<VAR> = <VAR>
as the string
v=v
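Using this notation, the claim that "longest highest" means "fewest items" can be explored with a small sketch that enumerates the factorings of a token string. (Illustration only; the item inventory below mirrors the synopsis grammar.)

```python
# The three item forms of the synopsis grammar, in the string notation:
# Item3 is "v=v", Item2 is "v=", Item1 is "v".
ITEMS = ('v=v', 'v=', 'v')

def factorings(w):
    """All divisions of token string w into items."""
    if w == '':
        return [[]]
    result = []
    for item in ITEMS:
        if w.startswith(item):
            for rest in factorings(w[len(item):]):
                result.append([item] + rest)
    return result

def longest_highest(w):
    """The factorings with the fewest items -- the "longest highest" parses."""
    fs = factorings(w)
    fewest = min(len(f) for f in fs)
    return [f for f in fs if len(f) == fewest]

# "v=v" is ambiguous: (v=v) versus (v=)(v).  Fewest items wins.
assert factorings('v=v') == [['v=v'], ['v=', 'v']]
assert longest_highest('v=v') == [['v=v']]
# "v=v=v" corresponds to the synopsis test 'a = b = c' -> '(a=)(b=c)'.
assert longest_highest('v=v=v') == [['v=', 'v=v']]
```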
Theorem: If "W" is a valid string, "ps-ranked" contains exactly one parse, and that parse is the "longest
highest" parse.
Proof: Recall that a Marpa "high_rule_only" parse first parses according to the pure unranked BNF, and
then prunes the parse using ranking.
Part 1: We will first examine "ps-pure", the parse set which results from the pure BNF phase.
(1) "Item": An "item" will means an instance of one of the symbols "Item1", "Item2" or "Item3". When we
want to show a token string's division into items, we will enclose them in parentheses. For example
""(v=)(v=)(v)"" will indicate that the token string is divided into 3 items; and ""(v=)(v=v)"" will
represent the same token string divided into 2 items.
(2) "Factoring": Analyzing the grammar "G", we see that it is a sequence of items, and that "G" is
ambiguous for a string "W" if and only if that string divides into items in more than one way. We will
call a division of "W" into items a "factoring".
(3) "Independence": Let "seq1" and "seq2" be two factorings of "W" into items. Let "first" and "last" be
two lexeme locations such that "first <= last".
Let "sub1" be a subsequence of one or more items of "seq1" that starts at "first" and ends at "last".
Similarly, let "sub2" be a subsequence of one or more items of "seq2" that starts at "first" and ends at
"last". Let "seq3" be the result of replacing "sub1" in "seq1" with "sub2". Then, because "G" is a
context free grammar, "seq3" is also a factoring of "W".
As an example, let "W" be "v=vv=v", let "seq1" be "(v=v)(v=v)" and let "seq2" be "(v=)(v)(v=)(v)". Let
"first" be 4 and "last" be 6, so that "sub1" is "(v=v)" and "sub2" is "(v=)(v)". Then "seq3" is
"(v=v)(v=)(v)". We claimed that "seq3" must be a factoring of "W", and we can see that our claim is
correct.
(4): Recall the discussion of lexeme fenceposts. We will say that a lexeme fencepost "fp" is a
"factoring barrier" if it is not properly contained in any item. More formally, let First(it1) be the
first lexeme location of an item "it1", and let Last(it1) be the last lexeme location of "it1". Let "i"
be a lexeme fencepost such that, for every item "it", either "i <= First(it)" or "i > Last(it)". Then
lexeme fencepost "i" is a factoring barrier.
(5): Where "|W|" is the length of "W", we can see from (4) that fencepost "|W|" is a factoring barrier.
(6): Also from (4) we see that lexeme fencepost 0 is a factoring barrier.
(7): Let a "subfactoring" be a sequence of items bounded by factoring barriers.
(8): Let a "minimal subfactoring" be a subfactoring which does not properly contain any factoring
barriers. From (5) and (6) we can see that "W" can be divided into one or more minimal subfactorings.
(9): We can see from "G" that two "VAR"s never occur together in the same item. This means that wherever
we do find two "VAR"s in a row, the lexeme fencepost between them is a factoring barrier.
(10): No item begins with an equal sign (""=""). Every item begins with a "<VAR>" token.
(11): Every minimal subfactoring starts with a "<VAR>" token. This is because every subfactoring starts
with an item, and from (10).
(12): Every minimal subfactoring starts with a "<VAR>" token and alternates equal signs and "<VAR>"
lexemes: that is, it has the pattern ""v=v=v= ..."". This follows from (9) and (11).
(13): We will call a minimal subfactoring "var-bounded" or simply "bounded" if it starts and ends with a
"<VAR>" token. By (12), this means that it has the pattern ""v=v=v= ... v"".
(14): We will call a minimal subfactoring "var-unbounded" or simply "unbounded" if it is not var-bounded.
By (12), this means that it has the pattern ""v=v= ... v="".
(15): If a "<Item1>" or "<Item3>" occurs in a minimal subfactoring, it is the last item in that
subfactoring.
To show (15), assume for a reductio ad absurdam there is a minimal subfactoring with a non-final
"<Item1>" or "<Item3>". Call that subfactoring "f", and call that item, "it1".
(15a): Both "<Item1>" and "<Item3>" end with a "<VAR>" token, so that "it1" ends in a "<VAR>" token.
(15b): Since "it1", by assumption for the reductio, is non-final, there is a next item. Call the next
item "it2".
(15c): By (10), all items start with "<VAR>" lexemes, so that "it2" starts with a "<VAR>" token.
(15d): From (15a) and (15c), we know that "it1" ends with a "<VAR>" token and "it2" starts with a "<VAR>"
token, so that there are two "<VAR>" lexemes at the fencepost between "it1" and "it2".
(15e): From (9) and (15d), there is a factoring barrier at the fencepost between "it1" and "it2".
(15f): All items, including "it1" and "it2", are of non-zero length so that, from (15e), we know that
the fencepost between "it1" and "it2" is properly inside subfactoring "f".
(15g): From (15f) we see that there is a factoring barrier properly inside "f". But "f" is assumed for
the reductio to be minimal and therefore cannot properly contain a factoring barrier.
(15h): (15g) shows the reductio, and allows us to conclude that, if "f" contains an "<Item1>" or an
"<Item3>", that item must be the final item. This is what we needed to show for (15).
(16): By (3), every minimal subfactoring is independent -- in other words, no ambiguity crosses factoring
barriers. So we narrow our consideration of ambiguities to two cases: ambiguities in unbounded minimal
subfactorings, and ambiguities in bounded minimal subfactorings.
(17): Every unbounded minimal subfactoring is unambiguous. To show (17), let "u" be an unbounded minimal
subfactoring. It follows from the definition of unbounded minimal subfactoring (14), that "u" ends in an
equal sign. Both "<Item1>" and "<Item3>" end in "<VAR>" lexemes, so that no "<Item1>" or "<Item3>" can
be the last item in "u". Therefore, by (15), there is no "<Item1>" or "<Item3>" item in "u". This means
that every item in "u" is an "<Item2>" item. This in turn means that "u" is unambiguous, which shows
(17).
(18): Every bounded minimal subfactoring parses either as a sequence of "Item2"'s followed by an "Item1",
or as a sequence of "Item2"'s followed by an "Item3". This follows from (1), (13) and (15).
(19): From (16), (17) and (18) it follows that every ambiguity between subfactorings is between bounded
subfactorings whose divisions into items take the two forms
""(v=) ... (v=) (v)""
and
""(v=) ... (v=v)""
Part 2: We have now shown how pure BNF parsing works for the grammar in the synopsis. We next show the
consequences of ranking.
(20): A completed instance of "List ::= Item1" can only be found at lexeme location 1. We know this from
"G" and (EARLEY).
(21): A completed instance of "List ::= List Item1" can only be found at lexeme location 2 or after. We
know this from "G" and (EARLEY).
(22): At most one dotted rule with "Item1" as its last item appears in a cause at any choicepoint. We
know this because all choices of a choicepoint must be completions with the same current location, and
from (20) and (21).
(23): A completed instance of "List ::= Item3" can only be found at lexeme location 3. We know this from
"G" and (EARLEY).
(24): A completed instance of "List ::= List Item3" can only be found at lexeme location 4 or after. We
know this from "G" and (EARLEY).
(25): At most one dotted rule with "Item3" as its last item appears in a cause at any choicepoint. We
know this because all choices of a choicepoint must be completions with the same current location, and
from (23) and (24).
(26): Consider the ambiguity shown in (19) and let "current" be the lexeme fencepost at its end. An item
also ends at "current", so, from "G" and (EARLEY), we know that there is at least one dotted rule instance
whose predot symbol is "List" and whose current location is "current". The choicepoint can have one of
these dotted rules:
these dotted rules:
Top ::= List .
List ::= List . Item3
List ::= List . Item2
List ::= List . Item1
with location 0 as its origin and "current" as its current location. There may be more than one
choicepoint fulfilling these criteria.
Of the choicepoints fulfilling the criteria, we arbitrarily pick one. We call this choicepoint "result".
We will see as we proceed that, while the choicepoints may have different dotted rules, our choice of one
of them for "result" makes no difference in the ranking logic. The only properties of the "result" that
we will use are dot position, predot symbol, and current location. These properties are the same
regardless of which choicepoint we picked for "result".
(27): Recall that the "middle location" of a choice is always the origin of its cause. From (26) we see
that, for every possible rule in "result", the predot symbol is the first symbol of the dotted rule.
This means that the origin of the causes of "result" must be the same as the origin of "result". From
(26) we also know that the origin of "result" is always location 0. Therefore the origin of every one of
the causes of "result" must be location 0. Therefore every one of the causes of "result" has the same
middle location.
(28): From (26) and (27) we know that, if "cuz" is a cause of "result", it has an origin at location 0;
its current location is "current"; and its LHS is "List". "cuz" may be any one of
List ::= Item3 .
List ::= Item2 .
List ::= Item1 .
List ::= List Item3 .
List ::= List Item2 .
List ::= List Item1 .
(29): From (19) and (EARLEY) we know that dotted rules ending in "Item2" are not valid choices at
location "current".
(30): From (19), (22), (28) and (EARLEY) we know that exactly one of the following two dotted rules is
the cause of a parse choice for "result":
List ::= Item1 .
List ::= List Item1 .
(31): From (19), (25), (28) and (EARLEY) we know that exactly one of the following two dotted rules is
the cause of a parse choice for "result":
List ::= Item3 .
List ::= List Item3 .
(32): From (28), (29), (30) and (31) we know that we will have exactly two choices for "result"; that the
final symbol of the cause in one will be "Item1"; and that the final symbol of the cause in the other
will be "Item3". From the grammar in the synopsis, we know that the cause ending in "Item1" will have
rank 1; and that the cause ending in "Item3" will have rank 3. From this we see, since the ranking
method is "high_rule_only", that the choice whose final symbol is "Item3" will be the only one kept in
"P-ranked". We further see from (19) that this choice will have fewer items than the alternative.
(33): From (19) we know that there is at most one ambiguity per minimal subfactoring. From (16) we know
that ambiguity resolutions of a minimal subfactoring are independent of the ambiguity resolutions in
other minimal subfactorings. From (32) we know that every subfactoring in "P-ranked" has exactly one
factoring, and that that factoring is the one with the fewest items. Therefore, "P-ranked" will contain
exactly one parse, and that parse will be "longest highest".
QED.
Copyright and License
Copyright 2022 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.
perl v5.40.1 2025-06-25 Marpa::R2::Semantics::Rank(3pm)