This algorithm determines the occurrences of a string of characters
(symbols) called the pattern within another string called the text.
Refer to links A and B below.
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html A
http://www-igm.univ-mlv.fr/~lecroq/string/examples/exp14.html B
Link B gives an example and concludes that 17 character comparisons are
performed. If you count the number of matching trailing characters
plus one in each of the 5 attempts, (1,3,8,3,2) the 17 figure is the
number of trailing characters checked in the pattern vs. the text above
it, until a failure. But there is a "shift" involved that also
takes character comparisons.
Link A defines the "Shift" of at least the first 2 attempts (shift
= 1 and 4) as moving to "the segment Y[. . .] (what matched before)
with its rightmost occurrence in x (the pattern) that is preceded by a
different character from x[i] (the character that failed the check)."
Note 1. Faulty Count
He is not counting the characters compared to make the shift!
Note 2. Sequential produces the same result more efficiently
I don’t think there is anything that you can do to improve over a
simple (faster) sequential scan without precompiling. Added complexity
means addition inefficiency.
If you do precompile, then no comparison can be made as to the
efficiencies. Creating a table keyed to a character string is untold
orders of magnitude slower than retrieving a character. The author
says this algorithm is NOT precompiling, referring to its being minimal
in some contexts "for any string-matching algorithm in the model
where the pattern only is preprocessed."
So he is claiming that there is no precompiling of the text line being
searched, and is determining the rightmost occurrence by uncounted
character comparisons.
Note 3. Inefficient logical check (easily improved)
The shift algorithm makes the shift when the character to the left of
the previously matched segment Y[. . .] is not the same as the previous
character encountered in the text and which failed. But we really want
it to match the next character in the text. So we will have to check
it again, to establish that it is not only not the same character that
failed before, but that it is the correct character in the text.
So simply changing this to check that it is the matching character in
the text would do an additional shift when it does not match the text
and is not equal to the previously failing character. This would
sometimes then detect that there is no more pattern to check and we
immediately see the match without checking that character again.
Note 4. Additional repeated retrievals and comparisons (more complex
and overhead to fix)
Furthermore, as specified you will have to compare those same
characters again that you did to make the shift (and were not counted
by this web site.) You will have to go back to the end of the pattern
and compare the part to the right of what was checked in the pattern.
In the process, you will recheck the part that was used to make the
shift, finding the previously matched substring in the pattern.
Then notes 1-4 show that they are not accurately counting the number of
character comparisons performed, and that there are simple improvements
to what they call "the most efficient" and "the absolute
minimum", refuting these two claims.
So is it just Moore BS? Is there anything you can do to change the
expected results from a substring search without compiling? (I can
think of 1 extreme case: assume the strings are binary i.e. there are
only 2 characters, and optimize from that.)
C-B












[Charlie-Boo]
> This algorithm determines the occurrences of a string of characters
> (symbols) called the pattern within another string called the text.
> Refer to links A and B below.
> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html A
> http://www-igm.univ-mlv.fr/~lecroq/string/examples/exp14.html B
> Link B gives an example and concludes that 17 character comparisons are
> performed. If you count the number of matching trailing characters
> plus one in each of the 5 attempts, (1,3,8,3,2) the 17 figure is the
> number of trailing characters checked in the pattern vs. the text above
> it, until a failure.
In link B, the characters compared at each stage are shown shaded, and
labeled with little integers. They do sum to 17. Each stage starts by
comparing the last character of the pattern to its current position in the
target, and proceeding "left" until a mismatch is found or the pattern is
exhausted.
> But there is a "shift" involved
Right.
> that also takes character comparisons.
? Unclear why you believe that; it’s not true. The shift at each stage is
determined by precomputed tables. Look at the /implementation/ in link A,
where this is very clear in function BM() — the only character comparisons
done during the search are due to the "x[i] == y[i+j]" portion of the `for`
loop, and that’s executed 17 times in the link B example.
> Link A defines the "Shift" of at least the first 2 attempts (shift
> = 1 and 4) as moving to "the segment Y[. . .] (what matched before)
> with its rightmost occurrence in x (the pattern) that is preceded by a
> different character from x[i] (the character that failed the check)."
And all possible such shifts are precomputed, before the search proper
begins, via the preBmGs() and preBmBc() routines.
> Note 1. Faulty Count
> He is not counting the characters compared to make the shift!
There are none.
> Note 2. Sequential produces the same result more efficiently
> I don’t think there is anything that you can do to improve over a
> simple (faster) sequential scan without precompiling. Added complexity
> means addition inefficiency.
Don’t know what you mean by precompiling. Link A defines what it means by
"preprocessing", and the pattern definitely is preprocessed.
> If you do precompile, then no comparison can be made as to the
> efficiencies.
Comparing what to what?
> Creating a table keyed to a character string is untold orders of
> magnitude slower than retrieving a character.
Also unsure what that means, but the BM preprocessing steps given in link A
require, as the page says, O(m + |A|) time and space, where m is the length
of the pattern, and |A| is the size of the alphabet. That’s based on an
implementation that uses alphabet symbols to index a dense array directly;
throughout that site, the alphabet is assumed to be a subset of ASCII (as it
says in the introduction) so that’s reasonable in context. There are of
course variations that use more space-efficient data structures when the
alphabet is large.
Nothing here uses a table keyed to arbitrary strings in any case; at worst
this uses tables keyed by (individual) alphabet symbols.
> The author says this algorithm is NOT precompiling,
Again, the word "precompiling" appears nowhere in the text, and the author
is extremely clear about that the algorithm does do preprocessing on the
pattern.
> referring to its being minimal in some contexts "for any string-matching
> algorithm in the model where the pattern only is preprocessed."
Yes, it preprocesses the pattern, and in some cases makes a provably minimal
number of character comparisons across all possible string-matching
algorithms "where the pattern only is preprocessed". The text gives a
specific example of that.
> So he is claiming that there is no precompiling of the text line being
> searched,
Yes. But, as above, do note that the size of the alphabet also affects the
preprocessing time required for this version of the algorithm.
- Hide quoted text — Show quoted text -
> …
Tim Peters wrote:
> [Charlie-Boo]
> > But there is a "shift" involved that also takes character comparisons.
> ? Unclear why you believe that; it’s not true.
Exactly how do you determine "the segment Y[. . .] with its rightmost
occurrence in x that is preceded by a different character from x[i] "?
That is work not included in his figure of 17.
> The shift at each stage is determined by precomputed tables.
Then you are not counting the cost of the precompiling.
> > If you do precompile, then no comparison can be made as to the
> > efficiencies.
> Comparing what to what?
His algorithm to an algorithm that doesn’t use any precompiling.
> That’s based on an
> implementation that uses alphabet symbols to index a dense array directly;
Then you are still comparing apples to oranges and leaving out the
difference in the 17 figure. He talks about number of comparisons, and
now there is an unkown amount of additional work needed to produce
tables. Thus the 17 figure is wrong.
> > The author says this algorithm is NOT precompiling,
> Yes, it preprocesses the pattern
The skipping process is a function of both the pattern and the text
line, so you would need to precompile the text line as well. You might
as well precompile the whole answer!
Not to mention that you can’t compare the cost of creating a table
with a different process of retrieving from the text line.
Not to mention that this is not included in the 17 figure.
> note that the size of the alphabet also affects the
> preprocessing time required for this version of the algorithm.
Then how can he call it "the most efficient" and "the absolute
minimum"?
Then what is the significance of this algorithm? What is its advantage
over a simple sequential scan? The sequential scan is much simpler.
What is the advantage of this algorithm over that?
C-B
[Charlie-Boo]
…
>>> But there is a "shift" involved that also takes character comparisons.
[Tim Peters]
>> ? Unclear why you believe that; it’s not true.
[Charlie-Boo]
> Exactly how do you determine "the segment Y[. . .] with its rightmost
> occurrence in x that is preceded by a different character from x[i] "?
Via lookup in tables computed during pattern preprocessing.
> That is work not included in his figure of 17.
Correct. Already explained what 17 does count, and pointed to the exact
fragment of the implementation code it corresponds to.
>> The shift at each stage is determined by precomputed tables.
> Then you are not counting the cost of the precompiling.
Correct. The preprocessing and searching costs are accounted for separately
in his writeup, and are summarized in the first section ("Main features") at
the top of:
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
Separating them is standard practice in the literature, BTW.
>>> If you do precompile, then no comparison can be made as to the
>>> efficiencies.
>> Comparing what to what?
> His algorithm to an algorithm that doesn’t use any precompiling.
Then, as he says, the algorithm requires O(n/m) comparisons to fail to find
a^(m-1)b in b^n , which is far better (by about a factor of m) than
sequential search with no preprocessing.
>> That’s based on an implementation that uses alphabet symbols to index
>> a dense array directly;
> Then you are still comparing apples to oranges
I’m not comparing anything.
> and leaving out the difference in the 17 figure.
17 is the correct count of the number of comparisons. If you don’t believe
it, compile the code and /run/ the example yourself. Count the number of
comparisons and display it at the end. You’ll get 17 too.
He didn’t claim comparisons are the /only/ expense — it’s fine if you want
to count other operations too, but Lecroq is counting what he /said/ he was
counting. What he’s doing is standard when discussing this kind of
algorithm; the cost of preprocessing the pattern can often be amortized over
many searches, so it’s of practical value to analyze preprocessing and
searching costs separately.
> He talks about number of comparisons, and now there is an unkown
> amount of additional work needed to produce tables.
It’s not unknown. If you can’t read C code, read the "Main features"
section at the top of:
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
As that says, & as I said last time, preprocessing for this version of the
algorithm takes O(m + |A|) time, where m is the length of the pattern and
|A| is the size of the alphabet.
> Thus the 17 figure is wrong.
No.
>>> The author says this algorithm is NOT precompiling,
>> Yes, it preprocesses the pattern
> The skipping process is a function of both the pattern and the text
> line, so you would need to precompile the text line as well.
No. Preprocessing doesn’t look at the target string /at all/. Look at the
code: the target string isn’t even passed to the functions that do the
preprocessing. If the pattern is, e.g., "xxxxx", then for /any/ target
string that contains "a" in the position of the final "x", you know that
it’s impossible for the pattern to match if you shift it right one, two,
three, or four positions: that would still leave an "x" lined up with the
same "a". You need to shift at least five to have any possibility of
matching. This is independent of target string; it depends only on the
pattern and the alphabet, and as I also said last time, shifts are
precomputed for each symbol of the /alphabet/.
> You might as well precompile the whole answer!
No. In practice n (the length of the target string) is potentially
unbounded, while preprocessing takes O(m + |A|) time independent of n. The
result of preprocessing is a set of tables that can be used to try to speed
searches for the pattern in /any/ target string.
> Not to mention that you can’t compare the cost of creating a table
> with a different process of retrieving from the text line.
Why is that? Is one of them a magical process that deviously frustrates all
?
attempts to time it
> Not to mention that this is not included in the 17 figure.
17 counts the number of comparisons in the example. Period.
>> note that the size of the alphabet also affects the
>> preprocessing time required for this version of the algorithm.
> Then how can he call it "the most efficient" and "the absolute
> minimum"?
He didn’t say any such things /without qualification/. You were actually
more careful to avoid caricaturing his statements last time, so this is
degenerating now. Read what he actually said again — you were closer to
undertanding it last time around.
> Then what is the significance of this algorithm? What is its advantage
> over a simple sequential scan? The sequential scan is much simpler.
> What is the advantage of this algorithm over that?
Speed, and it’s widely /used/ because of that. It can be extremely useful
when the pattern is short compared to the target, so, e.g., it’s a method of
choice when implementing search in text editors.
You’re clearly uncomfortable with his style of presentation, so I suggest
looking at a simpler algorithm first. In practice, most people don’t
implement the full Boyer-Moore algorithm. Simpler variations more often
implemented also covered by Lecroq at that site include "Quick Search" and
"Horspool". I suggest looking at Quick Search first, which is implemented
in about a dozen lines of C total. The uses a variant of only the
"bad-character shift" portion of full-blown BM, and should be much easier to
understand.
On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
wrote:
>Tim Peters wrote:
>> [Charlie-Boo]
>> > But there is a "shift" involved that also takes character comparisons.
>> ? Unclear why you believe that; it’s not true.
>Exactly how do you determine "the segment Y[. . .] with its rightmost
>occurrence in x that is preceded by a different character from x[i] "?
>That is work not included in his figure of 17.
>> The shift at each stage is determined by precomputed tables.
>Then you are not counting the cost of the precompiling.
In case you missed what seems to me to be the most important
point in all the things that Tim has said about this:
The "precompiling" involves only the _target_ string. In particular
the "precompiling" time is independent of the length of the
_text_. The idea is that we’re searching for a short target
in a long text – hence the "precompiling" time is essentially
irrelevant.
(Irrelevant because it takes orders of magnitude less time than
the actual search, following the preprocessing stage.)
>> > If you do precompile, then no comparison can be made as to the
>> > efficiencies.
>>[...]
>Then how can he call it "the most efficient" and "the absolute
>minimum"?
>Then what is the significance of this algorithm? What is its advantage
>over a simple sequential scan? The sequential scan is much simpler.
>What is the advantage of this algorithm over that?
It’s much faster (when searching for a short target in a long text).
It’s kind of curious that someone who’s breaking new ground in CS,
continually going where no man has gone before, is unaware of all
this stuff. I mean _I_ knew what Tim’s reply was going to say,
although I’m not a programmer, and no I didn’t look at those
links either – this is all very well-known stuff (definition:
an aspect of theoretical CS that’s known even to _me_ is
"well-known".)
>C-B
************************
David C. Ullrich
[David C. Ullrich, to Charlie-Boo]
> …
> In case you missed what seems to me to be the most important
> point in all the things that Tim has said about this:
> The "precompiling" involves only the _target_ string. In particular
> the "precompiling" time is independent of the length of the
> _text_.
All true, although I used "pattern" for the needle and "target" for the
haystack.
> The idea is that we’re searching for a short target in a long
> text – hence the "precompiling" time is essentially irrelevant.
Well … yes & no. The specific version of Boyer-Moore he was looking at
precomputes shifts for every alphabet symbol, so for a large alphabet (like
Unicode) that specific version is often impractical. There are more-or-less
obvious ways to worm around that, though.
There are also practical applications where the string searched is
relatively short, but the same pattern is searched for in millions (billions
…) of distinct strings. For example, "on the fly" analysis of
ever-growing log files of various kinds fits that niche.
> (Irrelevant because it takes orders of magnitude less time than
> the actual search, following the preprocessing stage.)
Yup, same thing in the end.
> …
> It’s kind of curious that someone who’s breaking new ground in CS,
> continually going where no man has gone before, is unaware of all
> this stuff. I mean _I_ knew what Tim’s reply was going to say,
> although I’m not a programmer, and no I didn’t look at those
> links either – this is all very well-known stuff (definition:
> an aspect of theoretical CS that’s known even to _me_ is
> "well-known".)
Hey, I was impressed! I know a lot of professional programmers, and I bet
fewer than 5% of them know the essentials of how this algorithm works. I’ll
be /very/ impressed if you can explain why this discussion is on sci.logic
and sci.math
On Tue, 12 Sep 2006 01:36:08 -0400, "Tim Peters" <tim….@comcast.net>
wrote:
>[David C. Ullrich, to Charlie-Boo]
>> …
>> In case you missed what seems to me to be the most important
>> point in all the things that Tim has said about this:
>> The "precompiling" involves only the _target_ string. In particular
>> the "precompiling" time is independent of the length of the
>> _text_.
>All true, although I used "pattern" for the needle and "target" for the
>haystack.
Sorry.
>> The idea is that we’re searching for a short target in a long
>> text – hence the "precompiling" time is essentially irrelevant.
>Well … yes & no. The specific version of Boyer-Moore he was looking at
>precomputes shifts for every alphabet symbol, so for a large alphabet (like
>Unicode) that specific version is often impractical. There are more-or-less
>obvious ways to worm around that, though.
Hmm. I woulda thought that we regarded all characters not appearing
in needle as equivalent, so the size of the alphabet was irrelevant…
Oh. I suppose that’s one of the "more or less obvious ways to worm
around that" – this requires that the lookup table be some sort
of associative array, while with a short alphabet regarding all
characters as different allows us to use a straight array for the
lookup table, taking longer to "compile" and taking more space
but making lookups much faster. ???
- Hide quoted text — Show quoted text -
>There are also practical applications where the string searched is
>relatively short, but the same pattern is searched for in millions (billions
>…) of distinct strings. For example, "on the fly" analysis of
>ever-growing log files of various kinds fits that niche.
>> (Irrelevant because it takes orders of magnitude less time than
>> the actual search, following the preprocessing stage.)
>Yup, same thing in the end.
>> …
>> It’s kind of curious that someone who’s breaking new ground in CS,
>> continually going where no man has gone before, is unaware of all
>> this stuff. I mean _I_ knew what Tim’s reply was going to say,
>> although I’m not a programmer, and no I didn’t look at those
>> links either – this is all very well-known stuff (definition:
>> an aspect of theoretical CS that’s known even to _me_ is
>> "well-known".)
>Hey, I was impressed! I know a lot of professional programmers, and I bet
>fewer than 5% of them know the essentials of how this algorithm works. I’ll
>be /very/ impressed if you can explain why this discussion is on sci.logic
>and sci.math
Well, I’m afraid I can’t answer the last question – all that springs
to mind is "it’s in both groups because Charley posted it there",
which can’t be the right answer.
Regarding whether I’m impressive… oops, wrong question, we already
know that<g>. Regarding whether all this is further evidence that
I’m an impressive guy, I didn’t mean to give the idea that I actually
had an efficient working version of the algorithm in my head. What
I "know" is just a vague idea of how the algorithm works and why
it works that way – vague, but nonetheless accurate enough to see
why Charlie’s comments were all wrong, and accurate enough that
your comments were utterly unsurprising.
Something like this: Say needle has length m, and to simplify the
notation let’s say string indices are 1-based. "First look at
haystack[m]. If that character does not occur in needle then
you know that there is no match starting at any of the first
m positions, so you can look next at haystack[2*m]. Otoh if
haystack[m] does appear in needle then [mumble about how it's
clear that some preprocessing on needle tells you what to
look at next]".
What would result if I actually thought about it and tried
to write an implementation for fun is not clear. If I
actually had a practical problem to solve I’d look it
up… Surely you’re not saying that less than 5% of the
programmers you know "understand" the algorithm at
_that_ level of detail?
Ok, I was surprised at one thing you said, that being that
the algorithm is provably optimal. I would have thought that
there were sort of infinitely many ways to decide what to
do when you’ve found a character in haystack that does
appear in needle, giving faster searches after slower
preprocessing. For example, the other day thinking about
what I’d say if someone challenged me to actually explain
the algorithm the fanciest thing that sprang to mind was
a tree of all substrings of needle, where paths down the
tree spell out substrings in reverse order… I don’t
recall seeing anything like that the time that I did
look at an actual implementation in some book.
************************
David C. Ullrich
In article <V6Cdnb4zAZyi2pvYnZ2dnUVZ_sOdn…@comcast.com>, "Tim Peters"
<tim….@comcast.net> dox:
!! [David C. Ullrich, to Charlie-Boo]
[...]
!! > It’s kind of curious that someone who’s breaking new ground in CS,
!! > continually going where no man has gone before, is unaware of all
!! > this stuff. I mean _I_ knew what Tim’s reply was going to say,
!! > although I’m not a programmer, and no I didn’t look at those
!! > links either – this is all very well-known stuff (definition:
!! > an aspect of theoretical CS that’s known even to _me_ is
!! > "well-known".)
!!
!! Hey, I was impressed! I know a lot of professional programmers, and I bet
!! fewer than 5% of them know the essentials of how this algorithm works. I’ll
!! be /very/ impressed if you can explain why this discussion is on sci.logic
!! and sci.math
its much more common these days
due to the big 90′s push in bioinformatics
at least those programmers who went to more serious institutions
that prepare them for solid careers
instead of the .NET hacks that fill so many local colleges
now that we have huge online searchable databases
and DNA sequence matching
the demand for programmers with a good grasp of string algorithms
has grown
the 5% is much better than the 1% or less of the previous decade
(i bet c-b posts to sci.logic
because he knows he will learn from the responses there
… though i doubt it will be admitted)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
galathaea: prankster, fablist, magician, liar
- Hide quoted text — Show quoted text -
David C. Ullrich wrote:
> On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
> wrote:
> >Tim Peters wrote:
> >> [Charlie-Boo]
> >> > But there is a "shift" involved that also takes character comparisons.
> >> ? Unclear why you believe that; it’s not true.
> >Exactly how do you determine "the segment Y[. . .] with its rightmost
> >occurrence in x that is preceded by a different character from x[i] "?
> >That is work not included in his figure of 17.
> >> The shift at each stage is determined by precomputed tables.
> >Then you are not counting the cost of the precompiling.
> In case you missed what seems to me to be the most important
> point in all the things that Tim has said about this:
> The "precompiling" involves only the _target_ string. In particular
> the "precompiling" time is independent of the length of the
> _text_. The idea is that we’re searching for a short target
> in a long text
What does it mean by "The idea is that"? Where does it say that in
the text?
> – hence the "precompiling" time is essentially
> irrelevant.
> (Irrelevant because it takes orders of magnitude less time than
> the actual search, following the preprocessing stage.)
What does "essentially irrelevant" mean? We aren’t only
searching with short patterns or long text lines. Why is that special
case more relevant than others, e.g. short texts? You are arbitrarily
declaring what the frequencies will be like.
How do you know how long it will take to create the table? Accessing a
single variable determines the contents of a single particular memory
location. This is about as primitive as can be. But creating a table
from a pattern is a complex procedure. You cannot so arbitrarily
declare the opposite – especially on baseless assumptions about
frequencies.
> >> > If you do precompile, then no comparison can be made as to the
> >> > efficiencies.
> >>[...]
> >Then how can he call it "the most efficient" and "the absolute
> >minimum"?
> >Then what is the significance of this algorithm? What is its advantage
> >over a simple sequential scan? The sequential scan is much simpler.
> >What is the advantage of this algorithm over that?
> It’s much faster (when searching for a short target in a long text).
That is an informal characterization which carries no mathematical
weight. You are ignoring, or making baseless assumptions concerning,
variables such as the cost of creating the table or the distribution of
search pattern and text lengths.
If you want faster, create an array of linked lists of each character.
Then the look-up will skip (N-1)/N of search positions (symbol
comparisons) where N = the number of different symbols i.e. about
98-99%!
> It’s kind of curious that someone who’s breaking new ground in CS,
> continually going where no man has gone before, is unaware of all
> this stuff.
That’s a lie. You never established my unawareness of anything.
You attack the messenger because you have no basis to attack the
message.
> I mean _I_ knew what Tim’s reply was going to say,
> although I’m not a programmer, and no I didn’t look at those
> links either
Mental telepathy?
> – this is all very well-known stuff (definition:
> an aspect of theoretical CS that’s known even to _me_ is
> "well-known".)
Oh, I have no doubt that lots of people slurp up that . . .
- Hide quoted text — Show quoted text -
> >C-B
> ************************
> David C. Ullrich
Tim Peters wrote:
> The specific version of Boyer-Moore he was looking at
> precomputes shifts for every alphabet symbol, so for a large alphabet (like
> Unicode) that specific version is often impractical.
And yet he calls this very algorithm "the most efficient" and "the
absolute minimum"!
> Hey, I was impressed! I know a lot of professional programmers, and I bet
> fewer than 5% of them know the essentials of how this algorithm works.
How do you know what % knows about it? And why should they?
(Everything I have ever read by Moore was BS.)
The algorithm described is pitifully inefficient. SImply making the
changes I described makes it faster and never slower, starting with
comparing the character before the matched section to the pattern
rather than to the text character that failed.
C-B
- Hide quoted text — Show quoted text -
> I’ll
> be /very/ impressed if you can explain why this discussion is on sci.logic
> and sci.math
[Tim Peters]
>> The specific version of Boyer-Moore he was looking at
>> precomputes shifts for every alphabet symbol, so for a large alphabet
>> (like Unicode) that specific version is often impractical.
[Charlie-Boo]
> And yet he calls this very algorithm "the most efficient" and "the
> absolute minimum"!
I agree you can find the quoted phrases verbatim on:
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
but you’ve yanked them out of sorely needed context. Now you’re either
intentionally misrepresenting him, or you don’t know any better. But since
it’s the second time you’ve done that in this thread, and this is the second
time I’ve called you on it, I see no reason to continue responding: looks
like you’re either determined to argue in bad faith, or you’re ineducable.
BTW, as I also told you before, the introduction on that site makes clear
that he’s assuming the alphabet is a subset of ASCII, and the specific
version of Boyer-Moore he presents exploits that. He didn’t claim that this
specific version of Boyer-Moore was suitable for all alphabets.
[Tim, to David Ullrich]
>> Hey, I was impressed! I know a lot of professional programmers, and I
>> bet
>> fewer than 5% of them know the essentials of how this algorithm works.
> How do you know what % knows about it?
An educated guess, based on technical discussions with at least hundreds of
professional programmers over a span of decades.
> And why should they?
I didn’t say that they should. I said I was impressed by how much David
knew despite not working in a particularly related field. I know what he
does for a living; I have no idea what you do, so have no opinion on whether
your lack of knowledge here is, e.g., expected or appalling.
> (Everything I have ever read by Moore was BS.)
> The algorithm described is pitifully inefficient. SImply making the
> changes I described makes it faster and never slower, starting with
> comparing the character before the matched section to the pattern
> rather than to the text character that failed.
Honestly, just about everything you’ve posted about Lecroq’s site betrays
that you understood very little of what you read. To be fair, I understood
very little of your claimed improvements either. I already suggested that
you look at the simpler "Quick Search" B-M variant first, which indeed does
not compare "the text character that failed". See Sunday’s article in the
references listed on that page for extensive analysis of that variation,
along with a few others.
Tim Peters wrote:
> I agree you can find the quoted phrases verbatim on:
> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
> but you’ve yanked them out of sorely needed context.
Then give the meaning he meant, the meaning that I give, why what I say
gives that meaning instead of the real one, and how that differs from
his meaning.
> > How do you know what % knows about it?
> An educated guess, based on technical discussions with at least hundreds of
> professional programmers over a span of decades.
How do you relate that experience to one specific algorithm? And
please don’t claim there is a linear ordering of what people know.
Algorithms can be long and very simple, or short and very complex, and
are valued for these differing attributes.
The Moore algorithm, in contrast, has no clear advantage. It uses
precompiling that involves different types of processing (and thus
different weights), making it impossible to compare to simple
sequential comparisons. It is simply apples and oranges to compare a
variable number of gets to a variable number of array sets. It’s
only clear difference is the increase in complexity.
It is also not true that it is optimal in any way as described, simply
because of the minor changes that make it logically "smarter".
> > And why should they?
> I didn’t say that they should.
Then there’s no relevance as to how many people are aware of it.
It’s funny. You are saying that it is too unknown, and of course the
problem is that it is the opposite: a well-published algorithm that
shouldn’t be.
> I know what he
> does for a living; I have no idea what you do, so have no opinion on whether
> your lack of knowledge here is, e.g., expected or appalling.
I agree. His being a teacher makes his lack of knowledge appalling.
(Can you imagine his telling the little kids how wonderful Principia
Mathematica is, with all it’s 1,000+ pages, while Occam rolls over in
disbelief that something so long, complex and containing so many
primitives is seen as being a good design!)
How about unsubstantiated personal attacks: expected or appalling?
> > (Everything I have ever read by Moore was BS.)
> > The algorithm described is pitifully inefficient. SImply making the
> > changes I described makes it faster and never slower, starting with
> > comparing the character before the matched section to the pattern
> > rather than to the text character that failed.
> Honestly, just about everything you’ve posted about Lecroq’s site betrays
> that you understood very little of what you read. To be fair, I understood
> very little of your claimed improvements either.
You understood "very little" but draw conclusions concerning
"just about everything you’ve posted". Is that logically
possible?
> I already suggested that
> you look at the simpler "Quick Search" B-M variant first, which indeed does
> not compare "the text character that failed".
That doesn’t change what is claimed about this algorithm. (And I’m
glad you tacitly admit that it is less efficient than my suggested
improvement, showing it is clearly not "optimal" in any respect
whatsoever.) It is false in many ways.
C-B
- Hide quoted text — Show quoted text -
> See Sunday’s article in the
> references listed on that page for extensive analysis of that variation,
> along with a few others.
[David C. Ullrich, to Charlie-Boo]
>>> …
>>> In case you missed what seems to me to be the most important
>>> point in all the things that Tim has said about this:
>>> The "precompiling" involves only the _target_ string. In particular
>>> the "precompiling" time is independent of the length of the
>>> _text_.
[Tim Peters]
>> All true, although I used "pattern" for the needle and "target" for the
>> haystack.
[David C. Ullrich]
> Sorry.
Wasn’t offended
>>> The idea is that we’re searching for a short target in a long
>>> text – hence the "precompiling" time is essentially irrelevant.
>> Well … yes & no. The specific version of Boyer-Moore he was looking at
>> precomputes shifts for every alphabet symbol, so for a large alphabet
>> (like
>> Unicode) that specific version is often impractical. There are more-or-
>> less obvious ways to worm around that, though.
> Hmm. I woulda thought that we regarded all characters not appearing
> in needle as equivalent, so the size of the alphabet was irrelevant…
> Oh. I suppose that’s one of the "more or less obvious ways to worm
> around that"
Right! The specific version of B-M in question uses dense arrays, with an
explicitly intialized slot for every alphabet symbol.
> – this requires that the lookup table be some sort of associative
> array, while with a short alphabet regarding all characters as different
> allows us to use a straight array for the lookup table, taking longer
> to "compile" and taking more space but making lookups much faster. ???
Yup. Since there are dozens of ways to implement associative mappings,
there are dozens of variations on this theme )most of which don’t make a
real lick of difference).
…
>>> It’s kind of curious that someone who’s breaking new ground in CS,
>>> continually going where no man has gone before, is unaware of all
>>> this stuff. I mean _I_ knew what Tim’s reply was going to say,
>>> although I’m not a programmer, and no I didn’t look at those
>>> links either – this is all very well-known stuff (definition:
>>> an aspect of theoretical CS that’s known even to _me_ is
>>> "well-known".)
>> Hey, I was impressed! I know a lot of professional programmers, and
>> I bet fewer than 5% of them know the essentials of how this algorithm
>> works. I’ll be /very/ impressed if you can explain why this discussion
>> is on sci.logic and sci.math
> Well, I’m afraid I can’t answer the last question – all that springs
> to mind is "it’s in both groups because Charley posted it there",
> which can’t be the right answer.
Not impressive at all. That only explains /how/ the discussion got here,
not /why/ — I’m looking for a rigorous psychological explanation
> Regarding whether I’m impressive… oops, wrong question, we already
> know that<g>. Regarding whether all this is further evidence that
> I’m an impressive guy, I didn’t mean to give the idea that I actually
> had an efficient working version of the algorithm in my head.
I expect I got the impression you intended to make.
> What I "know" is just a vague idea of how the algorithm works and why
> it works that way – vague, but nonetheless accurate enough to see
> why Charlie’s comments were all wrong, and accurate enough that
> your comments were utterly unsurprising.
Exactly the impression I got.
> Something like this: Say needle has length m, and to simplify the
> notation let’s say string indices are 1-based. "First look at
> haystack[m]. If that character does not occur in needle then
> you know that there is no match starting at any of the first
> m positions, so you can look next at haystack[2*m]. Otoh if
> haystack[m] does appear in needle then [mumble about how it's
> clear that some preprocessing on needle tells you what to
> look at next]".
> What would result if I actually thought about it and tried
> to write an implementation for fun is not clear.
I expect you’d rediscover either the "Quick Search" or "Horspool" variants
quickly. If haystack[m] doesn’t match needle[m], then in one of those you
shift the needle right so that the rightmost occurrence of haystack[m] in
the needle "lines up" with haystack[m]; in the other you do much the same,
but instead line up haystack[m+1] with its rightmost occurence in the
needle. The latter (Quick Search) is especially easy to work out, because
it does exactly the same thing even if haystack[m] does match needle[m] but
the needle and haystack don’t match in some earlier position. For example,
for needle "abac",
shift["a"] = 2 (the rightmost ‘a’ is "2 from the right end")
shift["b"] = 3 ( " " ’b’ is "3 " " " " ")
shift["c"] = 1 (etc)
shift[anything else] = 5
Those can be computed with a very simple left-to-right scan of the needle:
for i = 1 to m:
shift[needle[i]] = m+1 – i
It makes no difference to correctness (in this variant) in which order you
compare needle and haystack positions, because the same shift count is used
regardless of which position fails to match. Example:
haystack: xbacdabbcabacab
needle: abac
^
"x" doesn’t match "a" at needle[1], so we look at shift[haystack[5]] =
shift["d"] = 5, and jump to:
haystack: xbacdabbcabacab
needle: abac
^
Then "b" doesn’t match "a" at needle[3], so we look at shift[haystack[10]] =
shift["a"] = 2, and jump to:
haystack: xbacdabbcabacab
needle: abac
^
Another shift of 2 and we find a match.
Full-blown B-M is substantially more complicated, but the easy variant above
is something you can rederive from scratch correctly in just a few minutes
after you’re done it once, and is very effective all by itself in "typical"
search scenarios.
> If I actually had a practical problem to solve I’d look it
> up…
After the above, I hope you won’t need to
> Surely you’re not saying that less than 5% of the
> programmers you know "understand" the algorithm at
> _that_ level of detail?
Yup, that was what I meant. Most professional programmers don’t know much
about string search algorithms, because they don’t /implement/ them, they
just /use/ whatever the programming language du jour supplies. And while I
don’t know about current curricula, last time I had a theoretical CompSci
algorithm course was in the very late 70′s, and all they covered was the
theoretically superior (due to excellent worst-case behavior) then-newish
Knuth-Morris-Pratt algorithm.
The ideas here appear obvious in retrospect, but they really aren’t to most
people. The B-M algorithm was first published in 1977, and the "shockingly
simple" Quick Search variant in 1990. I should add that most professional
programmers are too busy wrestling with flaky web technologies on overdue
projects to bother reading papers <0.4 wink>.
Of course /some/ programmers know a great deal about string searching.
"galathea" specifically mentioned bioinformatics, and it’s certainly true
that everyone programming in that field has to know this stuff cold.
> Ok, I was surprised at one thing you said, that being that
> the algorithm is provably optimal.
I sure hope I didn’t say that! My regrets & apologies if I did.
Charlie-Boo quoted bits of Lecroq out of context, and I tried to impress
upon him that context was important. Like so, which is what Lecroq actually
wrote:
When searching for a^(m-1)b in b^n the algorithm makes only O(n/m)
comparisons, which is the absolute minimum for any string-matching
algorithm in the model where the pattern only is preprocessed.
What he claims there is highly qualified, restricted to two specific
strings, and making only an O() claim. By the time Charlie-Boo regurgitated
that bit, it was a rant against Lecroq claiming that B-M did an "absolute
minimum" number of comparisons, period. Of course that’s not true; its
worst-case behavior is O(m*n), same as dirt-dumb sequential search. But B-M
is optimal in the highly qualified way Lecroq claimed.
> I would have thought that there were sort of infinitely many ways
> to decide what to do when you’ve found a character in haystack that
> does appear in needle, giving faster searches after slower
> preprocessing. For example, the other day thinking about
> what I’d say if someone challenged me to actually explain
> the algorithm the fanciest thing that sprang to mind was
> a tree of all substrings of needle, where paths down the
> tree spell out substrings in reverse order… I don’t
> recall seeing anything like that the time that I did
> look at an actual implementation in some book.
There are search algorithms with worst-case search time O(n), where n is the
) to view the K-M-P algorithm as an especially efficient way to
length of the haystack. Knuth-Morris-Pratt is one of them (worst-case # of
comparisons 2*n), and you’ve started down the road to that. More general
worst-case linear-time constructions work for general regular expressions
(the theoretical CompSci kind of regexp, not the kind implemented in Python
or Perl or Emacs or …), although preprocessing time there can be
worst-case exponential time in the size of the needle. It’s possible (even
helpful
build a recognizing automata for the especially simple class of regular
expressions consisting of fixed strings.
Tim Peters wrote:
> Charlie-Boo quoted bits of Lecroq out of context, and I tried to impress
> upon him that context was important. Like so, which is what Lecroq actually
> wrote:
> When searching for a^(m-1)b in b^n the algorithm makes only O(n/m)
> comparisons, which is the absolute minimum for any string-matching
> algorithm in the model where the pattern only is preprocessed.
> What he claims there is highly qualified
"The Boyer-Moore algorithm is considered as the most efficient
string-matching algorithm in usual applications."
A simple change (from "=A" to "~=B") makes it stop earlier in
some circumstances, so it cannot be optimal. Any claim of proving that
it is optimal is necessarily not sound.
Get real.
C-B
[Tim Peters]
>> I agree you can find the quoted phrases verbatim on:
>> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
>> but you’ve yanked them out of sorely needed context.
[Charlie-Boo]
> Then give the meaning he meant, the meaning that I give, why what I say
> gives that meaning instead of the real one, and how that differs from
> his meaning.
Repeated from a different reply:
Charlie-Boo quoted bits of Lecroq out of context, and I tried to
impress upon him that context was important. Like so, which is
what Lecroq actually wrote:
When searching for a^(m-1)b in b^n the algorithm makes only
O(n/m) comparisons, which is the absolute minimum for any
string-matching algorithm in the model where the pattern only
is preprocessed.
What he claims there is highly qualified, restricted to two
specific strings, and making only an O() claim. By the time
Charlie-Boo regurgitated that bit, it was a rant against Lecroq
claiming that B-M did an "absolute minimum" number of comparisons,
period. Of course that’s not true; its worst-case behavior is
O(m*n), same as dirt-dumb sequential search. But B-M is optimal
in the highly qualified way Lecroq claimed.
Do you still need an explanation for how what he actually claimed differs
from your out-of-context caricature of his claim?
>>> How do you know what % knows about it?
>> An educated guess, based on technical discussions with at least
>> hundreds of professional programmers over a span of decades.
> How do you relate that experience to one specific algorithm?
Very easily, and you’re again ignoring context: I was remarking on David
Ullrich’s correct understanding of the broad strokes. They apply to
literally dozens of variations on this general /class/ of string-search
algorithms. The single specific version of B-M you’re going on about is
just one of them, and the level of understanding David showed applies to all
of them. It’s at that high level too that I believe I have a good
understanding of how much colleagues have known.
> And please don’t claim there is a linear ordering of what people know.
> Algorithms can be long and very simple, or short and very complex, and
> are valued for these differing attributes.
I agree, but don’t see the relevance.
> The Moore algorithm, in contrast, has no clear advantage. It uses
> precompiling that involves different types of processing (and thus
> different weights), making it impossible to compare to simple
> sequential comparisons. It is simply apples and oranges to compare a
> variable number of gets to a variable number of array sets. It’s
> only clear difference is the increase in complexity.
You’re ignoring that it runs much faster than simple sequential comparison
in many applications. That’s why it’s used. Tying your brain into knots
trying to refute that is futile, since it’s easily demonstrated by timing
implementations. Wall-clock time doesn’t give a rip about "different
weights", etc.
> It is also not true that it is optimal in any way as described, simply
> because of the minor changes that make it logically "smarter".
See above: the claim for optimality Lecroq made is correct, although you
ignored almost everything he actually said.
>>> And why should they?
>> I didn’t say that they should.
> Then there’s no relevance as to how many people are aware of it.
I didn’t say that there was. I said I was impressed by how much David knew
about it. Remarking that I’d bet fewer than 5% of professional programmers
know as much was just a way of indicating why I was impressed.
BTW, why do you care? I’m getting the impression that I stumbled into a
long-running pissing contest between you and David here.
> It’s funny. You are saying that it is too unknown, and of course the
> problem is that it is the opposite: a well-published algorithm that
> shouldn’t be.
LOL — sorry, I gave up trying to parse to that.
>> I know what he does for a living; I have no idea what you do, so
>> have no opinion on whether your lack of knowledge here is, e.g.,
>> expected or appalling.
> I agree. His being a teacher makes his lack of knowledge appalling.
Huh? As far as I’m concerned, he demonstrated he knows more about it than
you.
> (Can you imagine his telling the little kids how wonderful Principia
> Mathematica is, with all it’s 1,000+ pages, while Occam rolls over in
> disbelief that something so long, complex and containing so many
> primitives is seen as being a good design!)
What?
> How about unsubstantiated personal attacks: expected or appalling?
As I said, I have no opinion on that wrt you. I do hold the opinion that
you don’t know what you’re talking about here, based on a growing string of
false claims and failures to understand what Lecroq said. I still have no
idea about whether I should expect that from you or not.
>>> (Everything I have ever read by Moore was BS.)
>>> The algorithm described is pitifully inefficient. SImply making the
>>> changes I described makes it faster and never slower, starting with
>>> comparing the character before the matched section to the pattern
>>> rather than to the text character that failed.
>> Honestly, just about everything you’ve posted about Lecroq’s site betrays
>> that you understood very little of what you read. To be fair, I
>> understood very little of your claimed improvements either.
> You understood "very little"
About your claimed improvements, yes.
> but draw conclusions concerning "just about everything you’ve posted".
About Lecroq’s site, yes.
> Is that logically possible?
Yes, if you pay attention to context. Your claimed improvements were not
statements about Lecroq’s site, for example. That was you making statements
about your own ideas. Your statements about what Lecroq wrote were
repeatedly wrong.
>> I already suggested that you look at the simpler "Quick Search" B-M
>> variant first, which indeed does not compare "the text character that
>> failed".
> That doesn’t change what is claimed about this algorithm.
True. It’s something you should look at if you want to learn something
instead of persisting in misunderstanding what Lecroq wrote about B-M. You
should find what he wrote about B-M easier to understand if you study his
presentation of the much simpler Quick Search first.
> (And I’m glad you tacitly admit that it is less efficient than my
> suggested improvement, showing it is clearly not "optimal" in any respect
> whatsoever.)
But it is optimal in exactly the specific way Lecroq claimed it was optimal,
as quoted above. Maybe you don’t know what "a^(m-1)b in b^n" means?
/Something’s/ bonkers here, because you continue to miss his meaning
completely.
- Hide quoted text — Show quoted text -
Tim Peters wrote:
> [Tim Peters]
> >> I agree you can find the quoted phrases verbatim on:
> >> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
> >> but you’ve yanked them out of sorely needed context.
> [Charlie-Boo]
> > Then give the meaning he meant, the meaning that I give, why what I say
> > gives that meaning instead of the real one, and how that differs from
> > his meaning.
> Repeated from a different reply:
> Charlie-Boo quoted bits of Lecroq out of context, and I tried to
> impress upon him that context was important. Like so, which is
> what Lecroq actually wrote:
> When searching for a^(m-1)b in b^n the algorithm makes only
> O(n/m) comparisons, which is the absolute minimum for any
> string-matching algorithm in the model where the pattern only
> is preprocessed.
> What he claims there is highly qualified, restricted to two
> specific strings, and making only an O() claim. By the time
> Charlie-Boo regurgitated that bit, it was a rant against Lecroq
> claiming that B-M did an "absolute minimum" number of comparisons,
> period. Of course that’s not true; its worst-case behavior is
> O(m*n), same as dirt-dumb sequential search. But B-M is optimal
> in the highly qualified way Lecroq claimed.
> Do you still need an explanation for how what he actually claimed differs
> from your out-of-context caricature of his claim?
[Charlie-Boo]
> And yet he calls this very algorithm "the most efficient" and "the
> absolute minimum"!
You quoted "absolute minimum" but left out the reference to "the most
efficient" (speaking of taking things out of context):
"The Boyer-Moore algorithm is considered as the most efficient
string-matching algorithm in usual applications."
Not quite highly qualified.
C-B
[Tim Peters]
>>>> I agree you can find the quoted phrases verbatim on:
>>>> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
>>>> but you’ve yanked them out of sorely needed context.
[Charlie-Boo]
>>> Then give the meaning he meant, the meaning that I give, why what I say
>>> gives that meaning instead of the real one, and how that differs from
>>> his meaning.
[Tim Peters]
- Hide quoted text — Show quoted text -
>> Repeated from a different reply:
>> Charlie-Boo quoted bits of Lecroq out of context, and I tried to
>> impress upon him that context was important. Like so, which is
>> what Lecroq actually wrote:
>> When searching for a^(m-1)b in b^n the algorithm makes only
>> O(n/m) comparisons, which is the absolute minimum for any
>> string-matching algorithm in the model where the pattern only
>> is preprocessed.
>> What he claims there is highly qualified, restricted to two
>> specific strings, and making only an O() claim. By the time
>> Charlie-Boo regurgitated that bit, it was a rant against Lecroq
>> claiming that B-M did an "absolute minimum" number of comparisons,
>> period. Of course that’s not true; its worst-case behavior is
>> O(m*n), same as dirt-dumb sequential search. But B-M is optimal
>> in the highly qualified way Lecroq claimed.
>> Do you still need an explanation for how what he actually claimed differs
>> from your out-of-context caricature of his claim?
[Charlie-Boo retroactively quoting Charlie-Boo] [quote A]
>> And yet he calls this very algorithm "the most efficient" and "the
>> absolute minimum"!
[Charlie-Boo]
> You quoted "absolute minimum" but left out the reference to "the most
> efficient" (speaking of taking things out of context):
Yes, the "different reply" wasn’t made to you to begin with, and wasn’t
responding to your "[quote A]" above, although it not coincidentally
addressed the "the absolute minimum" part of your caricature (which you made
more than once, not just in "[quote A]" above). I suppose you’re irked
because I didn’t write an entirely different explanation for you alone, but
the idea that you fairly characterized what Lecroq said is so transparently
goofy that repeatedly explaining why exceeded my limit for enduring pedantic
tedium.
> "The Boyer-Moore algorithm is considered as the most efficient
> string-matching algorithm in usual applications."
> Not quite highly qualified.
Not sure I follow. For example, you believe that:
the most efficient
is a reasonably accurate rephrasing of:
is considered as the most efficient string-matching algorithm in
usual applications
? If so, good luck with the rest of your life
Tim Peters wrote:
> > You quoted "absolute minimum" but left out the reference to "the most
> > efficient" (speaking of taking things out of context):
> Yes, the "different reply" wasn’t made to you to begin with, and wasn’t
> responding to your "[quote A]" above, although it not coincidentally
> addressed the "the absolute minimum" part of your caricature (which you made
> more than once, not just in "[quote A]" above). I suppose you’re irked
> because I didn’t write an entirely different explanation for you alone, but
> the idea that you fairly characterized what Lecroq said is so transparently
> goofy that repeatedly explaining why exceeded my limit for enduring pedantic
> tedium.
You left out the quote that shows they didn’t refer to your special
cases only. That’s all.
> > "The Boyer-Moore algorithm is considered as the most efficient
> > string-matching algorithm in usual applications."
> > Not quite highly qualified.
> Not sure I follow.
You used the phrase "highly qualified" when referring to one quote.
You left out the other quote, which I give above and is NOT "highly
qualified" (IMEO.) So you gave the wrong impression when you didn’t
include the entire quote.
> For example, you believe that:
> the most efficient
> is a reasonably accurate rephrasing of:
> is considered as the most efficient string-matching algorithm in
> usual applications
> ? If so, good luck with the rest of your life
It is the rephrasing done by the author who wants very much to say they
did the absolute maximum best and nobody else can ever do better. Like
when Chaitin calls his purported theorem the "strongest possible Godel
theorem" without explaining what that means and how he knows it.
(Rosser 1936 is already stronger than Godel 1931.)
So they say "is considered" instead of "is", the latter being a formal
mathematical statement and the former being undefined and committing
them to nothing.
And they throw in "in usual applications" (another undefined term) to
cover their tracks.
Just call me the bloodhound.
C-B
…
[Charlie-Boo]
>>> "The Boyer-Moore algorithm is considered as the most efficient
>>> string-matching algorithm in usual applications."
>>> Not quite highly qualified.
[Tim Peters]
>> Not sure I follow.
[Charlie-Boo]
> You used the phrase "highly qualified" when referring to one quote.
Which, in context, specifically referred to the "absolute minimum" part,
which was highly qualified.
> You left out the other quote, which I give above and is NOT "highly
> qualified" (IMEO.)
Which was about a different part.
> So you gave the wrong impression when you didn’t include the entire
> quote.
Possibly. Let’s see whether anyone else feels bamboozled by it.
>> For example, you believe that:
>> the most efficient
>> is a reasonably accurate rephrasing of:
>> is considered as the most efficient string-matching algorithm in
>> usual applications
>> ? If so, good luck with the rest of your life
> It is the rephrasing done by the author who wants very much to say they
> did the absolute maximum best and nobody else can ever do better. Like
> when Chaitin calls his purported theorem the "strongest possible Godel
> theorem" without explaining what that means and how he knows it.
> (Rosser 1936 is already stronger than Godel 1931.)
Weak, since Lecroq isn’t the originator of the B-M algorithm, and has no
stake in what others think of it.
> So they say "is considered" instead of "is", the latter being a formal
> mathematical statement and the former being undefined and committing
> them to nothing.
> And they throw in "in usual applications" (another undefined term) to
> cover their tracks.
Which is exactly why your "the most efficient" caricature gives a wrong
impression of what Lecroq wrote here too, but for a different reason. In
the other case you stripped necessary formality from a formal statement,
while in this case you stripped the informality from an /in/formal
statement, making him appear to be an idiot in both cases.
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
Description
The Boyer-Moore algorithm is considered as the most efficient
string-matching algorithm in usual applications. A simplified
version of it or the entire algorithm is often implemented in
text editors for the «search» and «substitute» commands.
Juat an informal introductory paragraph, not a formal claim in any sense
that B-M is "the most efficient" such algorithm period. In contrast, his
later claim of optimality in a specific case was meant to be a formal claim,
and was stated with care and without weasel-words. I expect that anyone
reading in good faith can tell the difference without problem.
> Just call me the bloodhound.
OK — but I suggest you’d benefit from more tracking and less barking.
Tim Peters wrote:
> In the other case you stripped necessary formality from a formal statement,
> while in this case you stripped the informality from an /in/formal
> statement, making him appear to be an idiot in both cases.
If it looks like an idiot, walks and talks like a idiot, . . .
If the public sees it that way, who am I to interfere with freedom of
opinion? I realize of course that I hold an unfair advantage over
others with my clout and prestige, allowing me to sway public opinion
to whatever side I choose. In fact, I have even been accused of
leading my loyal followers down the wrong path in my pronouncements.
> The Boyer-Moore algorithm is considered as the most efficient
> string-matching algorithm in usual applications. A simplified
> version of it or the entire algorithm is often implemented in
> text editors for the «search» and «substitute» commands.
> Juat an informal introductory paragraph, not a formal claim in any sense
> that B-M is "the most efficient" such algorithm period. In contrast, his
> later claim of optimality in a specific case was meant to be a formal claim,
> and was stated with care and without weasel-words.
No, the first statement is meant to be a claim that it is the best
possible algorithm, and the later formulas are meant to obfuscate the
situation with the hope that you then believe the first statement.
> > Just call me the bloodhound.
> OK — but I suggest you’d benefit from more tracking and less barking.
Barking up the wrong directed graph?
C-B
On Tue, 12 Sep 2006 17:44:27 -0400, "Tim Peters" <tim….@comcast.net>
wrote:
>[large amount of interesting stuff snipped]
Thanks.
So I guess the fact that I find all this fascinating means
I can never be a programmer. Oh well. (Heh: if the dept
head gets his way I _may_ soon be bogged down in various
web technologies, so I guess there’s hope.)
************************
David C. Ullrich
On Tue, 12 Sep 2006 15:45:42 -0400, "Tim Peters" <tim….@comcast.net>
wrote:
- Hide quoted text — Show quoted text -
>[Tim Peters]
>>> The specific version of Boyer-Moore he was looking at
>>> precomputes shifts for every alphabet symbol, so for a large alphabet
>>> (like Unicode) that specific version is often impractical.
>[Charlie-Boo]
>> And yet he calls this very algorithm "the most efficient" and "the
>> absolute minimum"!
>I agree you can find the quoted phrases verbatim on:
> http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
>but you’ve yanked them out of sorely needed context. Now you’re either
>intentionally misrepresenting him, or you don’t know any better. But since
>it’s the second time you’ve done that in this thread, and this is the second
>time I’ve called you on it, I see no reason to continue responding: looks
>like you’re either determined to argue in bad faith, or you’re ineducable.
You’re catching on.
- Hide quoted text — Show quoted text -
>BTW, as I also told you before, the introduction on that site makes clear
>that he’s assuming the alphabet is a subset of ASCII, and the specific
>version of Boyer-Moore he presents exploits that. He didn’t claim that this
>specific version of Boyer-Moore was suitable for all alphabets.
>[Tim, to David Ullrich]
>>> Hey, I was impressed! I know a lot of professional programmers, and I
>>> bet
>>> fewer than 5% of them know the essentials of how this algorithm works.
>> How do you know what % knows about it?
>An educated guess, based on technical discussions with at least hundreds of
>professional programmers over a span of decades.
>> And why should they?
>I didn’t say that they should. I said I was impressed by how much David
>knew despite not working in a particularly related field. I know what he
>does for a living; I have no idea what you do, so have no opinion on whether
>your lack of knowledge here is, e.g., expected or appalling.
>> (Everything I have ever read by Moore was BS.)
>> The algorithm described is pitifully inefficient. SImply making the
>> changes I described makes it faster and never slower, starting with
>> comparing the character before the matched section to the pattern
>> rather than to the text character that failed.
>Honestly, just about everything you’ve posted about Lecroq’s site betrays
>that you understood very little of what you read. To be fair, I understood
>very little of your claimed improvements either. I already suggested that
>you look at the simpler "Quick Search" B-M variant first, which indeed does
>not compare "the text character that failed". See Sunday’s article in the
>references listed on that page for extensive analysis of that variation,
>along with a few others.
************************
David C. Ullrich
On Tue, 12 Sep 2006 19:09:08 -0400, "Tim Peters" <tim….@comcast.net>
wrote:
>[Tim Peters]
>>> [...]
>BTW, why do you care? I’m getting the impression that I stumbled into a
>long-running pissing contest between you and David here.
<giggle> You really had no idea what you were getting into here?
See, the problem is that Charlie’s understandably pissed off at a lot
of people because he don’t get no respect. He’s the only person who
has ever formalized any aspect of computer science. He’s the only
person who’s ever made a _working_ automated theorem prover – all
the other work there has been just theoretical bullshit. As far as
I know he’s also the only person who’s ever recognized that BM is
just bullshit as well…
It’s remarkable, all that stuff in all those books, he’s the only
one out there pointing out that the emperor has no clothes, and
nobody seems to take him seriously. You’d be irritated too.
If you’d only shut up and listen you could learn a lot from the
guy. Just like with I-forget-the-name over in sci.math.
************************
David C. Ullrich
On 12 Sep 2006 11:53:05 -0700, "Charlie-Boo" <shymath…@gmail.com>
wrote:
- Hide quoted text — Show quoted text -
>David C. Ullrich wrote:
>> On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
>> wrote:
>> >Tim Peters wrote:
>> >> [Charlie-Boo]
>> >> > But there is a "shift" involved that also takes character comparisons.
>> >> ? Unclear why you believe that; it’s not true.
>> >Exactly how do you determine "the segment Y[. . .] with its rightmost
>> >occurrence in x that is preceded by a different character from x[i] "?
>> >That is work not included in his figure of 17.
>> >> The shift at each stage is determined by precomputed tables.
>> >Then you are not counting the cost of the precompiling.
>> In case you missed what seems to me to be the most important
>> point in all the things that Tim has said about this:
>> The "precompiling" involves only the _target_ string. In particular
>> the "precompiling" time is independent of the length of the
>> _text_. The idea is that we’re searching for a short target
>> in a long text
>What does it mean by "The idea is that"? Where does it say that in
>the text?
Um, for example where it states that the algorithm or simplified
versions of it are commonly used for search&replace in text
editors.
>> – hence the "precompiling" time is essentially
>> irrelevant.
>> (Irrelevant because it takes orders of magnitude less time than
>> the actual search, following the preprocessing stage.)
>What does "essentially irrelevant" mean?
This time is essentially irrelevant to that time if this
time is much smaller than that time.
>We aren’t only
>searching with short patterns or long text lines. Why is that special
>case more relevant than others, e.g. short texts? You are arbitrarily
>declaring what the frequencies will be like.
>How do you know how long it will take to create the table? Accessing a
>single variable determines the contents of a single particular memory
>location. This is about as primitive as can be. But creating a table
>from a pattern is a complex procedure. You cannot so arbitrarily
>declare the opposite – especially on baseless assumptions about
>frequencies.
Um. It happens really a lot that I search text in a text editor for
a certain string. What I’m searching for is a pattern like "10.1.1"
or "hilbert transform". The text is typically at least several KB
long, usually at least tens of KB. That’s fairly typical – those
"baseless assumptions" are in fact _realistic_ assumptions.
Here’s a quiz. Define g(m,n) = 1 + n, f(n,m) = m^2 + n/m. Which
is larger, g(10, 10000) or f(10, 10000)?
(Answer: g(10, 10000) is larger. Why? Because the fact that 1 is
so much smaller than 10^2 is essentially irrelevant next to the
fact that 10000/10 is so much smaller than 10000.)
- Hide quoted text — Show quoted text -
>> >> > If you do precompile, then no comparison can be made as to the
>> >> > efficiencies.
>> >>[...]
>> >Then how can he call it "the most efficient" and "the absolute
>> >minimum"?
>> >Then what is the significance of this algorithm? What is its advantage
>> >over a simple sequential scan? The sequential scan is much simpler.
>> >What is the advantage of this algorithm over that?
>> It’s much faster (when searching for a short target in a long text).
>That is an informal characterization which carries no mathematical
>weight. You are ignoring, or making baseless assumptions concerning,
>variables such as the cost of creating the table or the distribution of
>search pattern and text lengths.
Realistic assumptions.
>If you want faster, create an array of linked lists of each character.
>Then the look-up will skip (N-1)/N of search positions (symbol
>comparisons) where N = the number of different symbols i.e. about
>98-99%!
Fascinating. At the start of this you said you didn’t think that
a simple sequential search could be beat.
>> It’s kind of curious that someone who’s breaking new ground in CS,
>> continually going where no man has gone before, is unaware of all
>> this stuff.
>That’s a lie. You never established my unawareness of anything.
No, _you_ established that in your first post. Quote:
"Note 1. Faulty Count
He is not counting the characters compared to make the shift!
Note 2. Sequential produces the same result more efficiently
I don’t think there is anything that you can do to improve over a
simple (faster) sequential scan without precompiling. Added
complexity
means addition inefficiency."
Note 1 is simply wrong. And Note 2 is hilariously wrong.
>You attack the messenger because you have no basis to attack the
>message.
>> I mean _I_ knew what Tim’s reply was going to say,
>> although I’m not a programmer, and no I didn’t look at those
>> links either
>Mental telepathy?
Uh, no, just less than total ignorance of how those search
algorithms work.
>> – this is all very well-known stuff (definition:
>> an aspect of theoretical CS that’s known even to _me_ is
>> "well-known".)
>Oh, I have no doubt that lots of people slurp up that . . .
>> >C-B
************************
David C. Ullrich
David C. Ullrich wrote:
> <giggle> You really had no idea what you were getting into here?
Sorry, but giggling is only cute when babies do it.
> See, the problem is that Charlie’s understandably pissed off at a lot
> of people because he don’t get no respect.
I applaud your decision to give up on Mathematical Logic and try your
hand at Psychology instead. Maybe you’ll do better in that field.
Now, what makes you feel there is so much anger around you? Do you
simply feel more at ease discussing principles of Psychology than
principles of Mathematical Logic?
> He’s the only person who
> has ever formalized any aspect of computer science.
Why don’t you give an example if other people have done it? Every
time you are unable to substantiate a claim is simply more evidence
that it is not true. Fine with me!
> He’s the only
> person who’s ever made a _working_ automated theorem prover – all
> the other work there has been just theoretical bullshit.
Not so. Who ever said that? Please quote.
> It’s remarkable, all that stuff in all those books, he’s the only
> one out there pointing out that the emperor has no clothes
Totally false.
"All that stuff in all those books." I love it. You are a little boy
in awe of all the pretty, fancy books.
Student: "Professor Ullrich, can you prove your last point?"
DU: "All that stuff in all those books."
> nobody seems to take him seriously. You’d be irritated too.
What’s worse than making unsubstantiated statements? Making
statements that by their nature can never be substantiated or refuted.
Why should anyone take that seriously? Really. What’s the sense of
it all?
BTW If you really want to know the truth about how I "feel": You try
like hell to prove me wrong. If you could ever find an example of
someone formally generating a computer program or an incompleteness
theorem, you surely would plaster it here. But you don’t.
You see, you do a lot of my work for me. Your persistent reference to
people’s reputation and emotional state only shows me there are no
Mathematical principles to back up what you so desperately wish was
true.
- Hide quoted text — Show quoted text -
> ************************
> David C. Ullrich