Logic — math, philosophy & computational aspects

logic, math, philosophy, math games, math help, mathematical logic, philosophy of education, math facts




The Boyer-Moore Fast String Searching Algorithm: Just Moore BS?

This algorithm determines the occurrences of a string of characters
(symbols) called the pattern within another string called the text.
Refer to links A and B below.

http://www-igm.univ-mlv.fr/~lecroq/string/node14.html     A

http://www-igm.univ-mlv.fr/~lecroq/string/examples/exp14.html    B

Link B gives an example and concludes that 17 character comparisons are
performed.  If you count the number of matching trailing characters
plus one in each of the 5 attempts, (1,3,8,3,2) the 17 figure is the
number of trailing characters checked in the pattern vs. the text above
it, until a failure.  But there is a "shift" involved that also
takes character comparisons.

Link A defines the "Shift" of at least the first 2 attempts (shift
= 1 and 4) as moving to "the segment Y[. . .] (what matched before)
with its rightmost occurrence in x (the pattern) that is preceded by a
different character from x[i] (the character that failed the check)."

Note 1. Faulty Count

He is not counting the characters compared to make the shift!

Note 2. Sequential produces the same result more efficiently

I don’t think there is anything that you can do to improve over a
simple (faster) sequential scan without precompiling.  Added complexity
means addition inefficiency.

If you do precompile, then no comparison can be made as to the
efficiencies.  Creating a table keyed to a character string is untold
orders of magnitude slower than retrieving a character.  The author
says this algorithm is NOT precompiling, referring to its being minimal
in some contexts "for any string-matching algorithm in the model
where the pattern only is preprocessed."

So he is claiming that there is no precompiling of the text line being
searched, and is determining the rightmost occurrence by uncounted
character comparisons.

Note 3. Inefficient logical check (easily improved)

The shift algorithm makes the shift when the character to the left of
the previously matched segment Y[. . .] is not the same as the previous
character encountered in the text and which failed.  But we really want
it to match the next character in the text.  So we will have to check
it again, to establish that it is not only not the same character that
failed before, but that it is the correct character in the text.

So simply changing this to check that it is the matching character in
the text would do an additional shift when it does not match the text
and is not equal to the previously failing character.  This would
sometimes then detect that there is no more pattern to check and we
immediately see the match without checking that character again.

Note 4. Additional repeated retrievals and comparisons (more complex
and overhead to fix)

Furthermore, as specified you will have to compare those same
characters again that you did to make the shift (and were not counted
by this web site.)  You will have to go back to the end of the pattern
and compare the part to the right of what was checked in the pattern.
In the process, you will recheck the part that was used to make the
shift, finding the previously matched substring in the pattern.

Then notes 1-4 show that they are not accurately counting the number of
character comparisons performed, and that there are simple improvements
to what they call "the most efficient" and "the absolute
minimum", refuting these two claims.

So is it just Moore BS?  Is there anything you can do to change the
expected results from a substring search without compiling?  (I can
think of 1 extreme case: assume the strings are binary i.e. there are
only 2 characters, and optimize from that.)

C-B

posted by admin in Uncategorized and have Comments (24)






24 Responses to “The Boyer-Moore Fast String Searching Algorithm: Just Moore BS?”

  1. admin says:

    [Charlie-Boo]

    > This algorithm determines the occurrences of a string of characters
    > (symbols) called the pattern within another string called the text.
    > Refer to links A and B below.

    > http://www-igm.univ-mlv.fr/~lecroq/string/node14.html     A

    > http://www-igm.univ-mlv.fr/~lecroq/string/examples/exp14.html    B

    > Link B gives an example and concludes that 17 character comparisons are
    > performed.  If you count the number of matching trailing characters
    > plus one in each of the 5 attempts, (1,3,8,3,2) the 17 figure is the
    > number of trailing characters checked in the pattern vs. the text above
    > it, until a failure.

    In link B, the characters compared at each stage are shown shaded, and
    labeled with little integers.  They do sum to 17.  Each stage starts by
    comparing the last character of the pattern to its current position in the
    target, and proceeding "left" until a mismatch is found or the pattern is
    exhausted.

    > But there is a "shift" involved

    Right.

    > that also takes character comparisons.

    ?  Unclear why you believe that; it’s not true.  The shift at each stage is
    determined by precomputed tables.  Look at the /implementation/ in link A,
    where this is very clear in function BM() — the only character comparisons
    done during the search are due to the "x[i] == y[i+j]" portion of the `for`
    loop, and that’s executed 17 times in the link B example.

    > Link A defines the "Shift" of at least the first 2 attempts (shift
    > = 1 and 4) as moving to "the segment Y[. . .] (what matched before)
    > with its rightmost occurrence in x (the pattern) that is preceded by a
    > different character from x[i] (the character that failed the check)."

    And all possible such shifts are precomputed, before the search proper
    begins, via the preBmGs() and preBmBc() routines.

    > Note 1. Faulty Count

    > He is not counting the characters compared to make the shift!

    There are none.

    > Note 2. Sequential produces the same result more efficiently

    > I don’t think there is anything that you can do to improve over a
    > simple (faster) sequential scan without precompiling.  Added complexity
    > means addition inefficiency.

    Don’t know what you mean by precompiling.  Link A defines what it means by
    "preprocessing", and the pattern definitely is preprocessed.

    > If you do precompile, then no comparison can be made as to the
    > efficiencies.

    Comparing what to what?

    > Creating a table keyed to a character string is untold orders of
    > magnitude slower than retrieving a character.

    Also unsure what that means, but the BM preprocessing steps given in link A
    require, as the page says, O(m + |A|) time and space, where m is the length
    of the pattern, and |A| is the size of the alphabet.  That’s based on an
    implementation that uses alphabet symbols to index a dense array directly;
    throughout that site, the alphabet is assumed to be a subset of ASCII (as it
    says in the introduction) so that’s reasonable in context.  There are of
    course variations that use more space-efficient data structures when the
    alphabet is large.

    Nothing here uses a table keyed to arbitrary strings in any case; at worst
    this uses tables keyed by (individual) alphabet symbols.

    > The author says this algorithm is NOT precompiling,

    Again, the word "precompiling" appears nowhere in the text, and the author
    is extremely clear about that the algorithm does do preprocessing on the
    pattern.

    > referring to its being minimal in some contexts "for any string-matching
    > algorithm in the model where the pattern only is preprocessed."

    Yes, it preprocesses the pattern, and in some cases makes a provably minimal
    number of character comparisons across all possible string-matching
    algorithms "where the pattern only is preprocessed".  The text gives a
    specific example of that.

    > So he is claiming that there is no precompiling of the text line being
    > searched,

    Yes.  But, as above, do note that the size of the alphabet also affects the
    preprocessing time required for this version of the algorithm.

    - Hide quoted text — Show quoted text -

    > …

  2. admin says:

    Tim Peters wrote:
    > [Charlie-Boo]
    > > But there is a "shift" involved that also takes character comparisons.
    > ?  Unclear why you believe that; it’s not true.

    Exactly how do you determine "the segment Y[. . .] with its rightmost
    occurrence in x that is preceded by a different character from x[i] "?
    That is work not included in his figure of 17.

    >  The shift at each stage is determined by precomputed tables.

    Then you are not counting the cost of the precompiling.

    > > If you do precompile, then no comparison can be made as to the
    > > efficiencies.

    > Comparing what to what?

    His algorithm to an algorithm that doesn’t use any precompiling.

    > That’s based on an
    > implementation that uses alphabet symbols to index a dense array directly;

    Then you are still comparing apples to oranges and leaving out the
    difference in the 17 figure.  He talks about number of comparisons, and
    now there is an unkown amount of additional work needed to produce
    tables.  Thus the 17 figure is wrong.

    > > The author says this algorithm is NOT precompiling,
    > Yes, it preprocesses the pattern

    The skipping process is a function of both the pattern and the text
    line, so you would need to precompile the text line as well.  You might
    as well precompile the whole answer!

    Not to mention that you can’t compare the cost of creating a table
    with a different process of retrieving from the text line.

    Not to mention that this is not included in the 17 figure.

    > note that the size of the alphabet also affects the
    > preprocessing time required for this version of the algorithm.

    Then how can he call it "the most efficient" and "the absolute
    minimum"?

    Then what is the significance of this algorithm?  What is its advantage
    over a simple sequential scan?  The sequential scan is much simpler.
    What is the advantage of this algorithm over that?

    C-B

  3. admin says:

    [Charlie-Boo]

    >>> But there is a "shift" involved that also takes character comparisons.

    [Tim Peters]

    >> ?  Unclear why you believe that; it’s not true.

    [Charlie-Boo]

    > Exactly how do you determine "the segment Y[. . .] with its rightmost
    > occurrence in x that is preceded by a different character from x[i] "?

    Via lookup in tables computed during pattern preprocessing.

    > That is work not included in his figure of 17.

    Correct.  Already explained what 17 does count, and pointed to the exact
    fragment of the implementation code it corresponds to.

    >>  The shift at each stage is determined by precomputed tables.
    > Then you are not counting the cost of the precompiling.

    Correct.  The preprocessing and searching costs are accounted for separately
    in his writeup, and are summarized in the first section ("Main features") at
    the top of:

        http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    Separating them is standard practice in the literature, BTW.

    >>> If you do precompile, then no comparison can be made as to the
    >>> efficiencies.
    >> Comparing what to what?
    > His algorithm to an algorithm that doesn’t use any precompiling.

    Then, as he says, the algorithm requires O(n/m) comparisons to fail to find
    a^(m-1)b in b^n , which is far better (by about a factor of m) than
    sequential search with no preprocessing.

    >> That’s based on an implementation that uses alphabet symbols to index
    >> a dense array directly;
    > Then you are still comparing apples to oranges

    I’m not comparing anything.

    > and leaving out the difference in the 17 figure.

    17 is the correct count of the number of comparisons.  If you don’t believe
    it, compile the code and /run/ the example yourself.  Count the number of
    comparisons and display it at the end.  You’ll get 17 too.

    He didn’t claim comparisons are the /only/ expense — it’s fine if you want
    to count other operations too, but Lecroq is counting what he /said/ he was
    counting.  What he’s doing is standard when discussing this kind of
    algorithm; the cost of preprocessing the pattern can often be amortized over
    many searches, so it’s of practical value to analyze preprocessing and
    searching costs separately.

    > He talks about number of comparisons, and now there is an unkown
    > amount of additional work needed to produce tables.

    It’s not unknown.  If you can’t read C code, read the "Main features"
    section at the top of:

        http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    As that says, & as I said last time, preprocessing for this version of the
    algorithm takes O(m + |A|) time, where m is the length of the pattern and
    |A| is the size of the alphabet.

    > Thus the 17 figure is wrong.

    No.

    >>> The author says this algorithm is NOT precompiling,
    >> Yes, it preprocesses the pattern
    > The skipping process is a function of both the pattern and the text
    > line, so you would need to precompile the text line as well.

    No.  Preprocessing doesn’t look at the target string /at all/.  Look at the
    code:  the target string isn’t even passed to the functions that do the
    preprocessing.  If the pattern is, e.g., "xxxxx", then for /any/ target
    string that contains "a" in the position of the final "x", you know that
    it’s impossible for the pattern to match if you shift it right one, two,
    three, or four positions:  that would still leave an "x" lined up with the
    same "a".  You need to shift at least five to have any possibility of
    matching.  This is independent of target string; it depends only on the
    pattern and the alphabet, and as I also said last time, shifts are
    precomputed for each symbol of the /alphabet/.

    > You might as well precompile the whole answer!

    No.  In practice n (the length of the target string) is potentially
    unbounded, while preprocessing takes O(m + |A|) time independent of n.  The
    result of preprocessing is a set of tables that can be used to try to speed
    searches for the pattern in /any/ target string.

    > Not to mention that you can’t compare the cost of creating a table
    > with a different process of retrieving from the text line.

    Why is that?  Is one of them a magical process that deviously frustrates all
    attempts to time it ;-) ?

    > Not to mention that this is not included in the 17 figure.

    17 counts the number of comparisons in the example.  Period.

    >> note that the size of the alphabet also affects the
    >> preprocessing time required for this version of the algorithm.
    > Then how can he call it "the most efficient" and "the absolute
    > minimum"?

    He didn’t say any such things /without qualification/.  You were actually
    more careful to avoid caricaturing his statements last time, so this is
    degenerating now.  Read what he actually said again — you were closer to
    undertanding it last time around.

    > Then what is the significance of this algorithm?  What is its advantage
    > over a simple sequential scan?  The sequential scan is much simpler.
    > What is the advantage of this algorithm over that?

    Speed, and it’s widely /used/ because of that.  It can be extremely useful
    when the pattern is short compared to the target, so, e.g., it’s a method of
    choice when implementing search in text editors.

    You’re clearly uncomfortable with his style of presentation, so I suggest
    looking at a simpler algorithm first.  In practice, most people don’t
    implement the full Boyer-Moore algorithm.  Simpler variations more often
    implemented also covered by Lecroq at that site include "Quick Search" and
    "Horspool".  I suggest looking at Quick Search first, which is implemented
    in about a dozen lines of C total.  The uses a variant of only the
    "bad-character shift" portion of full-blown BM, and should be much easier to
    understand.

  4. admin says:

    On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
    wrote:

    >Tim Peters wrote:
    >> [Charlie-Boo]

    >> > But there is a "shift" involved that also takes character comparisons.

    >> ?  Unclear why you believe that; it’s not true.

    >Exactly how do you determine "the segment Y[. . .] with its rightmost
    >occurrence in x that is preceded by a different character from x[i] "?
    >That is work not included in his figure of 17.

    >>  The shift at each stage is determined by precomputed tables.

    >Then you are not counting the cost of the precompiling.

    In case you missed what seems to me to be the most important
    point in all the things that Tim has said about this:

    The "precompiling" involves only the _target_ string. In particular
    the "precompiling" time is independent of the length of the
    _text_. The idea is that we’re searching for a short target
    in a long text – hence the "precompiling" time is essentially
    irrelevant.

    (Irrelevant because it takes orders of magnitude less time than
    the actual search, following the preprocessing stage.)

    >> > If you do precompile, then no comparison can be made as to the
    >> > efficiencies.
    >>[...]

    >Then how can he call it "the most efficient" and "the absolute
    >minimum"?

    >Then what is the significance of this algorithm?  What is its advantage
    >over a simple sequential scan?  The sequential scan is much simpler.
    >What is the advantage of this algorithm over that?

    It’s much faster (when searching for a short target in a long text).

    It’s kind of curious that someone who’s breaking new ground in CS,
    continually going where no man has gone before, is unaware of all
    this stuff. I mean _I_ knew what Tim’s reply was going to say,
    although I’m not a programmer, and no I didn’t look at those
    links either – this is all very well-known stuff (definition:
    an aspect of theoretical CS that’s known even to _me_ is
    "well-known".)

    >C-B

    ************************

    David C. Ullrich

  5. admin says:

    [David C. Ullrich, to Charlie-Boo]

    > …
    > In case you missed what seems to me to be the most important
    > point in all the things that Tim has said about this:

    > The "precompiling" involves only the _target_ string. In particular
    > the "precompiling" time is independent of the length of the
    > _text_.

    All true, although I used "pattern" for the needle and "target" for the
    haystack.

    > The idea is that we’re searching for a short target in a long
    > text – hence the "precompiling" time is essentially irrelevant.

    Well … yes & no.  The specific version of Boyer-Moore he was looking at
    precomputes shifts for every alphabet symbol, so for a large alphabet (like
    Unicode) that specific version is often impractical.  There are more-or-less
    obvious ways to worm around that, though.

    There are also practical applications where the string searched is
    relatively short, but the same pattern is searched for in millions (billions
    …) of distinct strings.  For example, "on the fly" analysis of
    ever-growing log files of various kinds fits that niche.

    > (Irrelevant because it takes orders of magnitude less time than
    > the actual search, following the preprocessing stage.)

    Yup, same thing in the end.

    > …
    > It’s kind of curious that someone who’s breaking new ground in CS,
    > continually going where no man has gone before, is unaware of all
    > this stuff. I mean _I_ knew what Tim’s reply was going to say,
    > although I’m not a programmer, and no I didn’t look at those
    > links either – this is all very well-known stuff (definition:
    > an aspect of theoretical CS that’s known even to _me_ is
    > "well-known".)

    Hey, I was impressed!  I know a lot of professional programmers, and I bet
    fewer than 5% of them know the essentials of how this algorithm works.  I’ll
    be /very/ impressed if you can explain why this discussion is on sci.logic
    and sci.math ;-)

  6. admin says:

    On Tue, 12 Sep 2006 01:36:08 -0400, "Tim Peters" <tim….@comcast.net>
    wrote:

    >[David C. Ullrich, to Charlie-Boo]
    >> …
    >> In case you missed what seems to me to be the most important
    >> point in all the things that Tim has said about this:

    >> The "precompiling" involves only the _target_ string. In particular
    >> the "precompiling" time is independent of the length of the
    >> _text_.

    >All true, although I used "pattern" for the needle and "target" for the
    >haystack.

    Sorry.

    >> The idea is that we’re searching for a short target in a long
    >> text – hence the "precompiling" time is essentially irrelevant.

    >Well … yes & no.  The specific version of Boyer-Moore he was looking at
    >precomputes shifts for every alphabet symbol, so for a large alphabet (like
    >Unicode) that specific version is often impractical.  There are more-or-less
    >obvious ways to worm around that, though.

    Hmm. I woulda thought that we regarded all characters not appearing
    in needle as equivalent, so the size of the alphabet was irrelevant…

    Oh. I suppose that’s one of the "more or less obvious ways to worm
    around that" – this requires that the lookup table be some sort
    of associative array, while with a short alphabet regarding all
    characters as different allows us to use a straight array for the
    lookup table, taking longer to "compile" and taking more space
    but making lookups much faster. ???

    - Hide quoted text — Show quoted text -

    >There are also practical applications where the string searched is
    >relatively short, but the same pattern is searched for in millions (billions
    >…) of distinct strings.  For example, "on the fly" analysis of
    >ever-growing log files of various kinds fits that niche.

    >> (Irrelevant because it takes orders of magnitude less time than
    >> the actual search, following the preprocessing stage.)

    >Yup, same thing in the end.

    >> …
    >> It’s kind of curious that someone who’s breaking new ground in CS,
    >> continually going where no man has gone before, is unaware of all
    >> this stuff. I mean _I_ knew what Tim’s reply was going to say,
    >> although I’m not a programmer, and no I didn’t look at those
    >> links either – this is all very well-known stuff (definition:
    >> an aspect of theoretical CS that’s known even to _me_ is
    >> "well-known".)

    >Hey, I was impressed!  I know a lot of professional programmers, and I bet
    >fewer than 5% of them know the essentials of how this algorithm works.  I’ll
    >be /very/ impressed if you can explain why this discussion is on sci.logic
    >and sci.math ;-)

    Well, I’m afraid I can’t answer the last question – all that springs
    to mind is "it’s in both groups because Charley posted it there",
    which can’t be the right answer.

    Regarding whether I’m impressive… oops, wrong question, we already
    know that<g>. Regarding whether all this is further evidence that
    I’m an impressive guy, I didn’t mean to give the idea that I actually
    had an efficient working version of the algorithm in my head. What
    I "know" is just a vague idea of how the algorithm works and why
    it works that way – vague, but nonetheless accurate enough to see
    why Charlie’s comments were all wrong, and accurate enough that
    your comments were utterly unsurprising.

    Something like this: Say needle has length m, and to simplify the
    notation let’s say string indices are 1-based. "First look at
    haystack[m]. If that character does not occur in needle then
    you know that there is no match starting at any of the first
    m positions, so you can look next at haystack[2*m]. Otoh if
    haystack[m] does appear in needle then [mumble about how it's
    clear that some preprocessing on needle tells you what to
    look at next]".

    What would result if I actually thought about it and tried
    to write an implementation for fun is not clear. If I
    actually had a practical problem to solve I’d look it
    up… Surely you’re not saying that less than 5% of the
    programmers you know "understand" the algorithm at
    _that_ level of detail?

    Ok, I was surprised at one thing you said, that being that
    the algorithm is provably optimal. I would have thought that
    there were sort of infinitely many ways to decide what to
    do when you’ve found a character in haystack that does
    appear in needle, giving faster searches after slower
    preprocessing. For example, the other day thinking about
    what I’d say if someone challenged me to actually explain
    the algorithm the fanciest thing that sprang to mind was
    a tree of all substrings of needle, where paths down the
    tree spell out substrings in reverse order… I don’t
    recall seeing anything like that the time that I did
    look at an actual implementation in some book.

    ************************

    David C. Ullrich

  7. admin says:

    In article <V6Cdnb4zAZyi2pvYnZ2dnUVZ_sOdn…@comcast.com>, "Tim Peters"
    <tim….@comcast.net> dox:

    !! [David C. Ullrich, to Charlie-Boo]
    [...]
    !! > It’s kind of curious that someone who’s breaking new ground in CS,
    !! > continually going where no man has gone before, is unaware of all
    !! > this stuff. I mean _I_ knew what Tim’s reply was going to say,
    !! > although I’m not a programmer, and no I didn’t look at those
    !! > links either – this is all very well-known stuff (definition:
    !! > an aspect of theoretical CS that’s known even to _me_ is
    !! > "well-known".)
    !!
    !! Hey, I was impressed!  I know a lot of professional programmers, and I bet
    !! fewer than 5% of them know the essentials of how this algorithm works.  I’ll
    !! be /very/ impressed if you can explain why this discussion is on sci.logic
    !! and sci.math ;-)

    its much more common these days
      due to the big 90′s push in bioinformatics

    at least those programmers who went to more serious institutions
      that prepare them for solid careers
    instead of the .NET hacks that fill so many local colleges

    now that we have huge online searchable databases
      and DNA sequence matching
    the demand for programmers with a good grasp of string algorithms
    has grown

    the 5% is much better than the 1% or less of the previous decade

    (i bet c-b posts to sci.logic
      because he knows he will learn from the responses there
      … though i doubt it will be admitted)

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    galathaea: prankster, fablist, magician, liar

  8. admin says:

    - Hide quoted text — Show quoted text -

    David C. Ullrich wrote:
    > On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
    > wrote:

    > >Tim Peters wrote:
    > >> [Charlie-Boo]

    > >> > But there is a "shift" involved that also takes character comparisons.

    > >> ?  Unclear why you believe that; it’s not true.

    > >Exactly how do you determine "the segment Y[. . .] with its rightmost
    > >occurrence in x that is preceded by a different character from x[i] "?
    > >That is work not included in his figure of 17.

    > >>  The shift at each stage is determined by precomputed tables.

    > >Then you are not counting the cost of the precompiling.

    > In case you missed what seems to me to be the most important
    > point in all the things that Tim has said about this:

    > The "precompiling" involves only the _target_ string. In particular
    > the "precompiling" time is independent of the length of the
    > _text_. The idea is that we’re searching for a short target
    > in a long text

    What does it mean by "The idea is that"?  Where does it say that in
    the text?

    > – hence the "precompiling" time is essentially
    > irrelevant.

    > (Irrelevant because it takes orders of magnitude less time than
    > the actual search, following the preprocessing stage.)

    What does "essentially irrelevant" mean?  We aren’t only
    searching with short patterns or long text lines.  Why is that special
    case more relevant than others, e.g. short texts?  You are arbitrarily
    declaring what the frequencies will be like.

    How do you know how long it will take to create the table?  Accessing a
    single variable determines the contents of a single particular memory
    location.  This is about as primitive as can be.  But creating a table
    from a pattern is a complex procedure.  You cannot so arbitrarily
    declare the opposite – especially on baseless assumptions about
    frequencies.

    > >> > If you do precompile, then no comparison can be made as to the
    > >> > efficiencies.
    > >>[...]

    > >Then how can he call it "the most efficient" and "the absolute
    > >minimum"?

    > >Then what is the significance of this algorithm?  What is its advantage
    > >over a simple sequential scan?  The sequential scan is much simpler.
    > >What is the advantage of this algorithm over that?

    > It’s much faster (when searching for a short target in a long text).

    That is an informal characterization which carries no mathematical
    weight.  You are ignoring, or making baseless assumptions concerning,
    variables such as the cost of creating the table or the distribution of
    search pattern and text lengths.

    If you want faster, create an array of linked lists of each character.
    Then the look-up will skip (N-1)/N of search positions (symbol
    comparisons) where N = the number of different symbols i.e. about
    98-99%!

    > It’s kind of curious that someone who’s breaking new ground in CS,
    > continually going where no man has gone before, is unaware of all
    > this stuff.

    That’s a lie.  You never established my unawareness of anything.

    You attack the messenger because you have no basis to attack the
    message.

    > I mean _I_ knew what Tim’s reply was going to say,
    > although I’m not a programmer, and no I didn’t look at those
    > links either

    Mental telepathy?

    > – this is all very well-known stuff (definition:
    > an aspect of theoretical CS that’s known even to _me_ is
    > "well-known".)

    Oh, I have no doubt that lots of people slurp up that . . .

    - Hide quoted text — Show quoted text -

    > >C-B

    > ************************

    > David C. Ullrich

  9. admin says:

    Tim Peters wrote:
    > The specific version of Boyer-Moore he was looking at
    > precomputes shifts for every alphabet symbol, so for a large alphabet (like
    > Unicode) that specific version is often impractical.

    And yet he calls this very algorithm "the most efficient" and "the
    absolute minimum"!

    > Hey, I was impressed!  I know a lot of professional programmers, and I bet
    > fewer than 5% of them know the essentials of how this algorithm works.

    How do you know what % knows about it?  And why should they?
    (Everything I have ever read by Moore was BS.)

    The algorithm described is pitifully inefficient.  SImply making the
    changes I described makes it faster and never slower, starting with
    comparing the character before the matched section to the pattern
    rather than to the text character that failed.

    C-B

    - Hide quoted text — Show quoted text -

    > I’ll
    > be /very/ impressed if you can explain why this discussion is on sci.logic
    > and sci.math ;-)

  10. admin says:

    [Tim Peters]

    >> The specific version of Boyer-Moore he was looking at
    >> precomputes shifts for every alphabet symbol, so for a large alphabet
    >> (like Unicode) that specific version is often impractical.

    [Charlie-Boo]

    > And yet he calls this very algorithm "the most efficient" and "the
    > absolute minimum"!

    I agree you can find the quoted phrases verbatim on:

        http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    but you’ve yanked them out of sorely needed context.  Now you’re either
    intentionally misrepresenting him, or you don’t know any better.  But since
    it’s the second time you’ve done that in this thread, and this is the second
    time I’ve called you on it, I see no reason to continue responding:  looks
    like you’re either determined to argue in bad faith, or you’re ineducable.

    BTW, as I also told you before, the introduction on that site makes clear
    that he’s assuming the alphabet is a subset of ASCII, and the specific
    version of Boyer-Moore he presents exploits that.  He didn’t claim that this
    specific version of Boyer-Moore was suitable for all alphabets.

    [Tim, to David Ullrich]

    >> Hey, I was impressed!  I know a lot of professional programmers, and I
    >> bet
    >> fewer than 5% of them know the essentials of how this algorithm works.
    > How do you know what % knows about it?

    An educated guess, based on technical discussions with at least hundreds of
    professional programmers over a span of decades.

    > And why should they?

    I didn’t say that they should.  I said I was impressed by how much David
    knew despite not working in a particularly related field.  I know what he
    does for a living; I have no idea what you do, so have no opinion on whether
    your lack of knowledge here is, e.g., expected or appalling.

    > (Everything I have ever read by Moore was BS.)

    > The algorithm described is pitifully inefficient.  SImply making the
    > changes I described makes it faster and never slower, starting with
    > comparing the character before the matched section to the pattern
    > rather than to the text character that failed.

    Honestly, just about everything you’ve posted about Lecroq’s site betrays
    that you understood very little of what you read.  To be fair, I understood
    very little of your claimed improvements either.  I already suggested that
    you look at the simpler "Quick Search" B-M variant first, which indeed does
    not compare "the text character that failed".  See Sunday’s article in the
    references listed on that page for extensive analysis of that variation,
    along with a few others.

  11. admin says:

    Tim Peters wrote:
    > I agree you can find the quoted phrases verbatim on:

    >     http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    > but you’ve yanked them out of sorely needed context.

    Then give the meaning he meant, the meaning that I give, why what I say
    gives that meaning instead of the real one, and how that differs from
    his meaning.

    > > How do you know what % knows about it?

    > An educated guess, based on technical discussions with at least hundreds of
    > professional programmers over a span of decades.

    How do you relate that experience to one specific algorithm?  And
    please don’t claim there is a linear ordering of what people know.
    Algorithms can be long and very simple, or short and very complex, and
    are valued for these differing attributes.

    The Moore algorithm, in contrast, has no clear advantage.  It uses
    precompiling that involves different types of processing (and thus
    different weights), making it impossible to compare to simple
    sequential comparisons.  It is simply apples and oranges to compare a
    variable number of gets to a variable number of array sets.  It’s
    only clear difference is the increase in complexity.

    It is also not true that it is optimal in any way as described, simply
    because of the minor changes that make it logically "smarter".

    > > And why should they?

    > I didn’t say that they should.

    Then there’s no relevance as to how many people are aware of it.
    It’s funny.  You are saying that it is too unknown, and of course the
    problem is that it is the opposite: a well-published algorithm that
    shouldn’t be.

    > I know what he
    > does for a living; I have no idea what you do, so have no opinion on whether
    > your lack of knowledge here is, e.g., expected or appalling.

    I agree.  His being a teacher makes his lack of knowledge appalling.

    (Can you imagine his telling the little kids how wonderful Principia
    Mathematica is, with all it’s 1,000+ pages, while Occam rolls over in
    disbelief that something so long, complex and containing so many
    primitives is seen as being a good design!)

    How about unsubstantiated personal attacks: expected or appalling?

    > > (Everything I have ever read by Moore was BS.)

    > > The algorithm described is pitifully inefficient.  SImply making the
    > > changes I described makes it faster and never slower, starting with
    > > comparing the character before the matched section to the pattern
    > > rather than to the text character that failed.

    > Honestly, just about everything you’ve posted about Lecroq’s site betrays
    > that you understood very little of what you read.  To be fair, I understood
    > very little of your claimed improvements either.

    You understood "very little" but draw conclusions concerning
    "just about everything you’ve posted".  Is that logically
    possible?

    > I already suggested that
    > you look at the simpler "Quick Search" B-M variant first, which indeed does
    > not compare "the text character that failed".

    That doesn’t change what is claimed about this algorithm.  (And I’m
    glad you tacitly admit that it is less efficient than my suggested
    improvement, showing it is clearly not "optimal" in any respect
    whatsoever.)  It is false in many ways.

    C-B

    - Hide quoted text — Show quoted text -

    > See Sunday’s article in the
    > references listed on that page for extensive analysis of that variation,
    > along with a few others.

  12. admin says:

    [David C. Ullrich, to Charlie-Boo]

    >>> …
    >>> In case you missed what seems to me to be the most important
    >>> point in all the things that Tim has said about this:

    >>> The "precompiling" involves only the _target_ string. In particular
    >>> the "precompiling" time is independent of the length of the
    >>> _text_.

    [Tim Peters]

    >> All true, although I used "pattern" for the needle and "target" for the
    >> haystack.

    [David C. Ullrich]

    > Sorry.

    Wasn’t offended ;-)

    >>> The idea is that we’re searching for a short target in a long
    >>> text – hence the "precompiling" time is essentially irrelevant.
    >> Well … yes & no.  The specific version of Boyer-Moore he was looking at
    >> precomputes shifts for every alphabet symbol, so for a large alphabet
    >> (like
    >> Unicode) that specific version is often impractical.  There are more-or-
    >> less obvious ways to worm around that, though.
    > Hmm. I woulda thought that we regarded all characters not appearing
    > in needle as equivalent, so the size of the alphabet was irrelevant…

    > Oh. I suppose that’s one of the "more or less obvious ways to worm
    > around that"

    Right!  The specific version of B-M in question uses dense arrays, with an
    explicitly intialized slot for every alphabet symbol.

    > – this requires that the lookup table be some sort of associative
    > array, while with a short alphabet regarding all characters as different
    > allows us to use a straight array for the lookup table, taking longer
    > to "compile" and taking more space but making lookups much faster. ???

    Yup.  Since there are dozens of ways to implement associative mappings,
    there are dozens of variations on this theme )most of which don’t make a
    real lick of difference).

    >>> It’s kind of curious that someone who’s breaking new ground in CS,
    >>> continually going where no man has gone before, is unaware of all
    >>> this stuff. I mean _I_ knew what Tim’s reply was going to say,
    >>> although I’m not a programmer, and no I didn’t look at those
    >>> links either – this is all very well-known stuff (definition:
    >>> an aspect of theoretical CS that’s known even to _me_ is
    >>> "well-known".)
    >> Hey, I was impressed!  I know a lot of professional programmers, and
    >> I bet fewer than 5% of them know the essentials of how this algorithm
    >> works.  I’ll be /very/ impressed if you can explain why this discussion
    >> is on sci.logic and sci.math ;-)
    > Well, I’m afraid I can’t answer the last question – all that springs
    > to mind is "it’s in both groups because Charley posted it there",
    > which can’t be the right answer.

    Not impressive at all.  That only explains /how/ the discussion got here,
    not /why/ — I’m looking for a rigorous psychological explanation ;-)

    > Regarding whether I’m impressive… oops, wrong question, we already
    > know that<g>. Regarding whether all this is further evidence that
    > I’m an impressive guy, I didn’t mean to give the idea that I actually
    > had an efficient working version of the algorithm in my head.

    I expect I got the impression you intended to make.

    > What I "know" is just a vague idea of how the algorithm works and why
    > it works that way – vague, but nonetheless accurate enough to see
    > why Charlie’s comments were all wrong, and accurate enough that
    > your comments were utterly unsurprising.

    Exactly the impression I got.

    > Something like this: Say needle has length m, and to simplify the
    > notation let’s say string indices are 1-based. "First look at
    > haystack[m]. If that character does not occur in needle then
    > you know that there is no match starting at any of the first
    > m positions, so you can look next at haystack[2*m]. Otoh if
    > haystack[m] does appear in needle then [mumble about how it's
    > clear that some preprocessing on needle tells you what to
    > look at next]".

    > What would result if I actually thought about it and tried
    > to write an implementation for fun is not clear.

    I expect you’d rediscover either the "Quick Search" or "Horspool" variants
    quickly.  If haystack[m] doesn’t match needle[m], then in one of those you
    shift the needle right so that the rightmost occurrence of haystack[m] in
    the needle "lines up" with haystack[m]; in the other you do much the same,
    but instead line up haystack[m+1] with its rightmost occurence in the
    needle.  The latter (Quick Search) is especially easy to work out, because
    it does exactly the same thing even if haystack[m] does match needle[m] but
    the needle and haystack don’t match in some earlier position.  For example,
    for needle "abac",

    shift["a"] = 2  (the rightmost ‘a’ is "2 from the right end")
    shift["b"] = 3  ( "     "      ’b’ is "3   "   "    "    " ")
    shift["c"] = 1  (etc)
    shift[anything else] = 5

    Those can be computed with a very simple left-to-right scan of the needle:

        for i = 1 to m:
            shift[needle[i]] = m+1 – i

    It makes no difference to correctness (in this variant) in which order you
    compare needle and haystack positions, because the same shift count is used
    regardless of which position fails to match.  Example:

    haystack:  xbacdabbcabacab
      needle:  abac
                   ^

    "x" doesn’t match "a" at needle[1], so we look at shift[haystack[5]] =
    shift["d"] = 5, and jump to:

    haystack:  xbacdabbcabacab
      needle:       abac
                        ^
    Then "b" doesn’t match "a" at needle[3], so we look at shift[haystack[10]] =
    shift["a"] = 2, and jump to:

    haystack:  xbacdabbcabacab
      needle:         abac
                          ^

    Another shift of 2 and we find a match.

    Full-blown B-M is substantially more complicated, but the easy variant above
    is something you can rederive from scratch correctly in just a few minutes
    after you’re done it once, and is very effective all by itself in "typical"
    search scenarios.

    > If I actually had a practical problem to solve I’d look it
    > up…

    After the above, I hope you won’t need to ;-)

    > Surely you’re not saying that less than 5% of the
    > programmers you know "understand" the algorithm at
    > _that_ level of detail?

    Yup, that was what I meant.  Most professional programmers don’t know much
    about string search algorithms, because they don’t /implement/ them, they
    just /use/ whatever the programming language du jour supplies.  And while I
    don’t know about current curricula, last time I had a theoretical CompSci
    algorithm course was in the very late 70′s, and all they covered was the
    theoretically superior (due to excellent worst-case behavior) then-newish
    Knuth-Morris-Pratt algorithm.

    The ideas here appear obvious in retrospect, but they really aren’t to most
    people.  The B-M algorithm was first published in 1977, and the "shockingly
    simple" Quick Search variant in 1990.  I should add that most professional
    programmers are too busy wrestling with flaky web technologies on overdue
    projects to bother reading papers <0.4 wink>.

    Of course /some/ programmers know a great deal about string searching.
    "galathea" specifically mentioned bioinformatics, and it’s certainly true
    that everyone programming in that field has to know this stuff cold.

    > Ok, I was surprised at one thing you said, that being that
    > the algorithm is provably optimal.

    I sure hope I didn’t say that!  My regrets & apologies if I did.
    Charlie-Boo quoted bits of Lecroq out of context, and I tried to impress
    upon him that context was important.  Like so, which is what Lecroq actually
    wrote:

        When searching for a^(m-1)b in b^n the algorithm makes only O(n/m)
        comparisons, which is the absolute minimum for any string-matching
        algorithm in the model where the pattern only is preprocessed.

    What he claims there is highly qualified, restricted to two specific
    strings, and making only an O() claim.  By the time Charlie-Boo regurgitated
    that bit, it was a rant against Lecroq claiming that B-M did an "absolute
    minimum" number of comparisons, period.  Of course that’s not true; its
    worst-case behavior is O(m*n), same as dirt-dumb sequential search.  But B-M
    is optimal in the highly qualified way Lecroq claimed.

    > I would have thought that there were sort of infinitely many ways
    > to decide what to do when you’ve found a character in haystack that
    > does appear in needle, giving faster searches after slower
    > preprocessing. For example, the other day thinking about
    > what I’d say if someone challenged me to actually explain
    > the algorithm the fanciest thing that sprang to mind was
    > a tree of all substrings of needle, where paths down the
    > tree spell out substrings in reverse order… I don’t
    > recall seeing anything like that the time that I did
    > look at an actual implementation in some book.

    There are search algorithms with worst-case search time O(n), where n is the
    length of the haystack.  Knuth-Morris-Pratt is one of them (worst-case # of
    comparisons 2*n), and you’ve started down the road to that.  More general
    worst-case linear-time constructions work for general regular expressions
    (the theoretical CompSci kind of regexp, not the kind implemented in Python
    or Perl or Emacs or …), although preprocessing time there can be
    worst-case exponential time in the size of the needle.  It’s possible (even
    helpful ;-) ) to view the K-M-P algorithm as an especially efficient way to
    build a recognizing automata for the especially simple class of regular
    expressions consisting of fixed strings.

  13. admin says:

    Tim Peters wrote:
    > Charlie-Boo quoted bits of Lecroq out of context, and I tried to impress
    > upon him that context was important.  Like so, which is what Lecroq actually
    > wrote:

    >     When searching for a^(m-1)b in b^n the algorithm makes only O(n/m)
    >     comparisons, which is the absolute minimum for any string-matching
    >     algorithm in the model where the pattern only is preprocessed.

    > What he claims there is highly qualified

    "The Boyer-Moore algorithm is considered as the most efficient
    string-matching algorithm in usual applications."

    A simple change (from "=A" to "~=B") makes it stop earlier in
    some circumstances, so it cannot be optimal.  Any claim of proving that
    it is optimal is necessarily not sound.

    Get real.

    C-B

  14. admin says:

    [Tim Peters]

    >> I agree you can find the quoted phrases verbatim on:

    >>     http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    >> but you’ve yanked them out of sorely needed context.

    [Charlie-Boo]

    > Then give the meaning he meant, the meaning that I give, why what I say
    > gives that meaning instead of the real one, and how that differs from
    > his meaning.

    Repeated from a different reply:

        Charlie-Boo quoted bits of Lecroq out of context, and I tried to
        impress upon him that context was important.  Like so, which is
        what Lecroq actually wrote:

            When searching for a^(m-1)b in b^n the algorithm makes only
            O(n/m) comparisons, which is the absolute minimum for any
            string-matching algorithm in the model where the pattern only
            is preprocessed.

        What he claims there is highly qualified, restricted to two
        specific strings, and making only an O() claim.  By the time
        Charlie-Boo regurgitated that bit, it was a rant against Lecroq
        claiming that B-M did an "absolute minimum" number of comparisons,
        period.  Of course that’s not true; its worst-case behavior is
        O(m*n), same as dirt-dumb sequential search.  But B-M is optimal
        in the highly qualified way Lecroq claimed.

    Do you still need an explanation for how what he actually claimed differs
    from your out-of-context caricature of his claim?

    >>> How do you know what % knows about it?
    >> An educated guess, based on technical discussions with at least
    >> hundreds of professional programmers over a span of decades.
    > How do you relate that experience to one specific algorithm?

    Very easily, and you’re again ignoring context:  I was remarking on David
    Ullrich’s correct understanding of the broad strokes.  They apply to
    literally dozens of variations on this general /class/ of string-search
    algorithms.  The single specific version of B-M you’re going on about is
    just one of them, and the level of understanding David showed applies to all
    of them.  It’s at that high level too that I believe I have a good
    understanding of how much colleagues have known.

    > And please don’t claim there is a linear ordering of what people know.
    > Algorithms can be long and very simple, or short and very complex, and
    > are valued for these differing attributes.

    I agree, but don’t see the relevance.

    > The Moore algorithm, in contrast, has no clear advantage.  It uses
    > precompiling that involves different types of processing (and thus
    > different weights), making it impossible to compare to simple
    > sequential comparisons.  It is simply apples and oranges to compare a
    > variable number of gets to a variable number of array sets.  It’s
    > only clear difference is the increase in complexity.

    You’re ignoring that it runs much faster than simple sequential comparison
    in many applications.  That’s why it’s used.  Tying your brain into knots
    trying to refute that is futile, since it’s easily demonstrated by timing
    implementations.  Wall-clock time doesn’t give a rip about "different
    weights", etc.

    > It is also not true that it is optimal in any way as described, simply
    > because of the minor changes that make it logically "smarter".

    See above:  the claim for optimality Lecroq made is correct, although you
    ignored almost everything he actually said.

    >>> And why should they?
    >> I didn’t say that they should.
    > Then there’s no relevance as to how many people are aware of it.

    I didn’t say that there was.  I said I was impressed by how much David knew
    about it.  Remarking that I’d bet fewer than 5% of professional programmers
    know as much was just a way of indicating why I was impressed.

    BTW, why do you care?  I’m getting the impression that I stumbled into a
    long-running pissing contest between you and David here.

    > It’s funny.  You are saying that it is too unknown, and of course the
    > problem is that it is the opposite: a well-published algorithm that
    > shouldn’t be.

    LOL — sorry, I gave up trying to parse to that.

    >> I know what he does for a living; I have no idea what you do, so
    >> have no opinion on whether your lack of knowledge here is, e.g.,
    >> expected or appalling.
    > I agree.  His being a teacher makes his lack of knowledge appalling.

    Huh?  As far as I’m concerned, he demonstrated he knows more about it than
    you.

    > (Can you imagine his telling the little kids how wonderful Principia
    > Mathematica is, with all it’s 1,000+ pages, while Occam rolls over in
    > disbelief that something so long, complex and containing so many
    > primitives is seen as being a good design!)

    What?

    > How about unsubstantiated personal attacks: expected or appalling?

    As I said, I have no opinion on that wrt you.  I do hold the opinion that
    you don’t know what you’re talking about here, based on a growing string of
    false claims and failures to understand what Lecroq said.  I still have no
    idea about whether I should expect that from you or not.

    >>> (Everything I have ever read by Moore was BS.)

    >>> The algorithm described is pitifully inefficient.  SImply making the
    >>> changes I described makes it faster and never slower, starting with
    >>> comparing the character before the matched section to the pattern
    >>> rather than to the text character that failed.
    >> Honestly, just about everything you’ve posted about Lecroq’s site betrays
    >> that you understood very little of what you read.  To be fair, I
    >> understood very little of your claimed improvements either.
    > You understood "very little"

    About your claimed improvements, yes.

    > but draw conclusions concerning "just about everything you’ve posted".

    About Lecroq’s site, yes.

    > Is that logically possible?

    Yes, if you pay attention to context.  Your claimed improvements were not
    statements about Lecroq’s site, for example.  That was you making statements
    about your own ideas.  Your statements about what Lecroq wrote were
    repeatedly wrong.

    >> I already suggested that you look at the simpler "Quick Search" B-M
    >> variant first, which indeed does not compare "the text character that
    >> failed".
    > That doesn’t change what is claimed about this algorithm.

    True.  It’s something you should look at if you want to learn something
    instead of persisting in misunderstanding what Lecroq wrote about B-M.  You
    should find what he wrote about B-M easier to understand if you study his
    presentation of the much simpler Quick Search first.

    > (And I’m glad you tacitly admit that it is less efficient than my
    > suggested improvement, showing it is clearly not "optimal" in any respect
    > whatsoever.)

    But it is optimal in exactly the specific way Lecroq claimed it was optimal,
    as quoted above.  Maybe you don’t know what "a^(m-1)b in b^n" means?
    /Something’s/ bonkers here, because you continue to miss his meaning
    completely.

  15. admin says:

    - Hide quoted text — Show quoted text -

    Tim Peters wrote:
    > [Tim Peters]
    > >> I agree you can find the quoted phrases verbatim on:

    > >>     http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    > >> but you’ve yanked them out of sorely needed context.

    > [Charlie-Boo]
    > > Then give the meaning he meant, the meaning that I give, why what I say
    > > gives that meaning instead of the real one, and how that differs from
    > > his meaning.

    > Repeated from a different reply:

    >     Charlie-Boo quoted bits of Lecroq out of context, and I tried to
    >     impress upon him that context was important.  Like so, which is
    >     what Lecroq actually wrote:

    >         When searching for a^(m-1)b in b^n the algorithm makes only
    >         O(n/m) comparisons, which is the absolute minimum for any
    >         string-matching algorithm in the model where the pattern only
    >         is preprocessed.

    >     What he claims there is highly qualified, restricted to two
    >     specific strings, and making only an O() claim.  By the time
    >     Charlie-Boo regurgitated that bit, it was a rant against Lecroq
    >     claiming that B-M did an "absolute minimum" number of comparisons,
    >     period.  Of course that’s not true; its worst-case behavior is
    >     O(m*n), same as dirt-dumb sequential search.  But B-M is optimal
    >     in the highly qualified way Lecroq claimed.

    > Do you still need an explanation for how what he actually claimed differs
    > from your out-of-context caricature of his claim?

    [Charlie-Boo]

    > And yet he calls this very algorithm "the most efficient" and "the
    > absolute minimum"!

    You quoted "absolute minimum" but left out the reference to "the most
    efficient" (speaking of taking things out of context):

    "The Boyer-Moore algorithm is considered as the most efficient
    string-matching algorithm in usual applications."

    Not quite highly qualified.

    C-B

  16. admin says:

    [Tim Peters]

    >>>> I agree you can find the quoted phrases verbatim on:

    >>>>     http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    >>>> but you’ve yanked them out of sorely needed context.

    [Charlie-Boo]

    >>> Then give the meaning he meant, the meaning that I give, why what I say
    >>> gives that meaning instead of the real one, and how that differs from
    >>> his meaning.

    [Tim Peters]

    - Hide quoted text — Show quoted text -

    >> Repeated from a different reply:

    >>     Charlie-Boo quoted bits of Lecroq out of context, and I tried to
    >>     impress upon him that context was important.  Like so, which is
    >>     what Lecroq actually wrote:

    >>         When searching for a^(m-1)b in b^n the algorithm makes only
    >>         O(n/m) comparisons, which is the absolute minimum for any
    >>         string-matching algorithm in the model where the pattern only
    >>         is preprocessed.

    >>     What he claims there is highly qualified, restricted to two
    >>     specific strings, and making only an O() claim.  By the time
    >>     Charlie-Boo regurgitated that bit, it was a rant against Lecroq
    >>     claiming that B-M did an "absolute minimum" number of comparisons,
    >>     period.  Of course that’s not true; its worst-case behavior is
    >>     O(m*n), same as dirt-dumb sequential search.  But B-M is optimal
    >>     in the highly qualified way Lecroq claimed.

    >> Do you still need an explanation for how what he actually claimed differs
    >> from your out-of-context caricature of his claim?

    [Charlie-Boo retroactively quoting Charlie-Boo]   [quote A]

    >> And yet he calls this very algorithm "the most efficient" and "the
    >> absolute minimum"!

    [Charlie-Boo]

    > You quoted "absolute minimum" but left out the reference to "the most
    > efficient" (speaking of taking things out of context):

    Yes, the "different reply" wasn’t made to you to begin with, and wasn’t
    responding to your "[quote A]" above, although it not coincidentally
    addressed the "the absolute minimum" part of your caricature (which you made
    more than once, not just in "[quote A]" above).  I suppose you’re irked
    because I didn’t write an entirely different explanation for you alone, but
    the idea that you fairly characterized what Lecroq said is so transparently
    goofy that repeatedly explaining why exceeded my limit for enduring pedantic
    tedium.

    > "The Boyer-Moore algorithm is considered as the most efficient
    > string-matching algorithm in usual applications."

    > Not quite highly qualified.

    Not sure I follow.  For example, you believe that:

        the most efficient

    is a reasonably accurate rephrasing of:

        is considered as the most efficient string-matching algorithm in
        usual applications

    ?  If so, good luck with the rest of your life ;-)

  17. admin says:

    Tim Peters wrote:
    > > You quoted "absolute minimum" but left out the reference to "the most
    > > efficient" (speaking of taking things out of context):

    > Yes, the "different reply" wasn’t made to you to begin with, and wasn’t
    > responding to your "[quote A]" above, although it not coincidentally
    > addressed the "the absolute minimum" part of your caricature (which you made
    > more than once, not just in "[quote A]" above).  I suppose you’re irked
    > because I didn’t write an entirely different explanation for you alone, but
    > the idea that you fairly characterized what Lecroq said is so transparently
    > goofy that repeatedly explaining why exceeded my limit for enduring pedantic
    > tedium.

    You left out the quote that shows they didn’t refer to your special
    cases only.  That’s all.

    > > "The Boyer-Moore algorithm is considered as the most efficient
    > > string-matching algorithm in usual applications."

    > > Not quite highly qualified.

    > Not sure I follow.

    You used the phrase "highly qualified" when referring to one quote.
    You left out the other quote, which I give above and is NOT "highly
    qualified" (IMEO.)  So you gave the wrong impression when you didn’t
    include the entire quote.

    >     For example, you believe that:

    >     the most efficient

    > is a reasonably accurate rephrasing of:

    >     is considered as the most efficient string-matching algorithm in
    >     usual applications

    > ?  If so, good luck with the rest of your life ;-)

    It is the rephrasing done by the author who wants very much to say they
    did the absolute maximum best and nobody else can ever do better.  Like
    when Chaitin calls his purported theorem the "strongest possible Godel
    theorem" without explaining what that means and how he knows it.
    (Rosser 1936 is already stronger than Godel 1931.)

    So they say "is considered" instead of "is", the latter being a formal
    mathematical statement and the former being undefined and committing
    them to nothing.

    And they throw in "in usual applications" (another undefined term) to
    cover their tracks.

    Just call me the bloodhound.

    C-B

  18. admin says:


    [Charlie-Boo]

    >>> "The Boyer-Moore algorithm is considered as the most efficient
    >>> string-matching algorithm in usual applications."

    >>> Not quite highly qualified.

    [Tim Peters]

    >> Not sure I follow.

    [Charlie-Boo]

    > You used the phrase "highly qualified" when referring to one quote.

    Which, in context, specifically referred to the "absolute minimum" part,
    which was highly qualified.

    > You left out the other quote, which I give above and is NOT "highly
    > qualified" (IMEO.)

    Which was about a different part.

    > So you gave the wrong impression when you didn’t include the entire
    > quote.

    Possibly.  Let’s see whether anyone else feels bamboozled by it.

    >>     For example, you believe that:

    >>     the most efficient

    >> is a reasonably accurate rephrasing of:

    >>     is considered as the most efficient string-matching algorithm in
    >>     usual applications

    >> ?  If so, good luck with the rest of your life ;-)
    > It is the rephrasing done by the author who wants very much to say they
    > did the absolute maximum best and nobody else can ever do better.  Like
    > when Chaitin calls his purported theorem the "strongest possible Godel
    > theorem" without explaining what that means and how he knows it.
    > (Rosser 1936 is already stronger than Godel 1931.)

    Weak, since Lecroq isn’t the originator of the B-M algorithm, and has no
    stake in what others think of it.

    > So they say "is considered" instead of "is", the latter being a formal
    > mathematical statement and the former being undefined and committing
    > them to nothing.

    > And they throw in "in usual applications" (another undefined term) to
    > cover their tracks.

    Which is exactly why your "the most efficient" caricature gives a wrong
    impression of what Lecroq wrote here too, but for a different reason.  In
    the other case you stripped necessary formality from a formal statement,
    while in this case you stripped the informality from an /in/formal
    statement, making him appear to be an idiot in both cases.

        http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

        Description

        The Boyer-Moore algorithm is considered as the most efficient
        string-matching algorithm in usual applications. A simplified
        version of it or the entire algorithm is often implemented in
        text editors for the «search» and «substitute» commands.

    Juat an informal introductory paragraph, not a formal claim in any sense
    that B-M is "the most efficient" such algorithm period.  In contrast, his
    later claim of optimality in a specific case was meant to be a formal claim,
    and was stated with care and without weasel-words.  I expect that anyone
    reading in good faith can tell the difference without problem.

    > Just call me the bloodhound.

    OK — but I suggest you’d benefit from more tracking and less barking.

  19. admin says:

    Tim Peters wrote:
    > In the other case you stripped necessary formality from a formal statement,
    > while in this case you stripped the informality from an /in/formal
    > statement, making him appear to be an idiot in both cases.

    If it looks like an idiot, walks and talks like a idiot, . . .

    If the public sees it that way, who am I to interfere with freedom of
    opinion?  I realize of course that I hold an unfair advantage over
    others with my clout and prestige, allowing me to sway public opinion
    to whatever side I choose.  In fact, I have even been accused of
    leading my loyal followers down the wrong path in my pronouncements.

    >     The Boyer-Moore algorithm is considered as the most efficient
    >     string-matching algorithm in usual applications. A simplified
    >     version of it or the entire algorithm is often implemented in
    >     text editors for the «search» and «substitute» commands.

    > Juat an informal introductory paragraph, not a formal claim in any sense
    > that B-M is "the most efficient" such algorithm period.  In contrast, his
    > later claim of optimality in a specific case was meant to be a formal claim,
    > and was stated with care and without weasel-words.

    No, the first statement is meant to be a claim that it is the best
    possible algorithm, and the later formulas are meant to obfuscate the
    situation with the hope that you then believe the first statement.

    > > Just call me the bloodhound.

    >  OK — but I suggest you’d benefit from more tracking and less barking.

    Barking up the wrong directed graph?

    C-B

  20. admin says:

    On Tue, 12 Sep 2006 17:44:27 -0400, "Tim Peters" <tim….@comcast.net>
    wrote:

    >[large amount of interesting stuff snipped]

    Thanks.

    So I guess the fact that I find all this fascinating means
    I can never be a programmer. Oh well. (Heh: if the dept
    head gets his way I _may_ soon be bogged down in various
    web technologies, so I guess there’s hope.)

    ************************

    David C. Ullrich

  21. admin says:

    On Tue, 12 Sep 2006 15:45:42 -0400, "Tim Peters" <tim….@comcast.net>
    wrote:

    - Hide quoted text — Show quoted text -

    >[Tim Peters]
    >>> The specific version of Boyer-Moore he was looking at
    >>> precomputes shifts for every alphabet symbol, so for a large alphabet
    >>> (like Unicode) that specific version is often impractical.

    >[Charlie-Boo]
    >> And yet he calls this very algorithm "the most efficient" and "the
    >> absolute minimum"!

    >I agree you can find the quoted phrases verbatim on:

    >    http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

    >but you’ve yanked them out of sorely needed context.  Now you’re either
    >intentionally misrepresenting him, or you don’t know any better.  But since
    >it’s the second time you’ve done that in this thread, and this is the second
    >time I’ve called you on it, I see no reason to continue responding:  looks
    >like you’re either determined to argue in bad faith, or you’re ineducable.

    You’re catching on.

    - Hide quoted text — Show quoted text -

    >BTW, as I also told you before, the introduction on that site makes clear
    >that he’s assuming the alphabet is a subset of ASCII, and the specific
    >version of Boyer-Moore he presents exploits that.  He didn’t claim that this
    >specific version of Boyer-Moore was suitable for all alphabets.

    >[Tim, to David Ullrich]
    >>> Hey, I was impressed!  I know a lot of professional programmers, and I
    >>> bet
    >>> fewer than 5% of them know the essentials of how this algorithm works.

    >> How do you know what % knows about it?

    >An educated guess, based on technical discussions with at least hundreds of
    >professional programmers over a span of decades.

    >> And why should they?

    >I didn’t say that they should.  I said I was impressed by how much David
    >knew despite not working in a particularly related field.  I know what he
    >does for a living; I have no idea what you do, so have no opinion on whether
    >your lack of knowledge here is, e.g., expected or appalling.

    >> (Everything I have ever read by Moore was BS.)

    >> The algorithm described is pitifully inefficient.  SImply making the
    >> changes I described makes it faster and never slower, starting with
    >> comparing the character before the matched section to the pattern
    >> rather than to the text character that failed.

    >Honestly, just about everything you’ve posted about Lecroq’s site betrays
    >that you understood very little of what you read.  To be fair, I understood
    >very little of your claimed improvements either.  I already suggested that
    >you look at the simpler "Quick Search" B-M variant first, which indeed does
    >not compare "the text character that failed".  See Sunday’s article in the
    >references listed on that page for extensive analysis of that variation,
    >along with a few others.

    ************************

    David C. Ullrich

  22. admin says:

    On Tue, 12 Sep 2006 19:09:08 -0400, "Tim Peters" <tim….@comcast.net>
    wrote:

    >[Tim Peters]
    >>> [...]

    >BTW, why do you care?  I’m getting the impression that I stumbled into a
    >long-running pissing contest between you and David here.

    <giggle> You really had no idea what you were getting into here?

    See, the problem is that Charlie’s understandably pissed off at a lot
    of people because he don’t get no respect. He’s the only person who
    has ever formalized any aspect of computer science. He’s the only
    person who’s ever made a _working_ automated theorem prover – all
    the other work there has been just theoretical bullshit. As far as
    I know he’s also the only person who’s ever recognized that BM is
    just bullshit as well…

    It’s remarkable, all that stuff in all those books, he’s the only
    one out there pointing out that the emperor has no clothes, and
    nobody seems to take him seriously. You’d be irritated too.

    If you’d only shut up and listen you could learn a lot from the
    guy. Just like with I-forget-the-name over in sci.math.

    ************************

    David C. Ullrich

  23. admin says:

    On 12 Sep 2006 11:53:05 -0700, "Charlie-Boo" <shymath…@gmail.com>
    wrote:

    - Hide quoted text — Show quoted text -

    >David C. Ullrich wrote:
    >> On 9 Sep 2006 14:54:25 -0700, "Charlie-Boo" <shymath…@gmail.com>
    >> wrote:

    >> >Tim Peters wrote:
    >> >> [Charlie-Boo]

    >> >> > But there is a "shift" involved that also takes character comparisons.

    >> >> ?  Unclear why you believe that; it’s not true.

    >> >Exactly how do you determine "the segment Y[. . .] with its rightmost
    >> >occurrence in x that is preceded by a different character from x[i] "?
    >> >That is work not included in his figure of 17.

    >> >>  The shift at each stage is determined by precomputed tables.

    >> >Then you are not counting the cost of the precompiling.

    >> In case you missed what seems to me to be the most important
    >> point in all the things that Tim has said about this:

    >> The "precompiling" involves only the _target_ string. In particular
    >> the "precompiling" time is independent of the length of the
    >> _text_. The idea is that we’re searching for a short target
    >> in a long text

    >What does it mean by "The idea is that"?  Where does it say that in
    >the text?

    Um, for example where it states that the algorithm or simplified
    versions of it are commonly used for search&replace in text
    editors.

    >> – hence the "precompiling" time is essentially
    >> irrelevant.

    >> (Irrelevant because it takes orders of magnitude less time than
    >> the actual search, following the preprocessing stage.)

    >What does "essentially irrelevant" mean?  

    This time is essentially irrelevant to that time if this
    time is much smaller than that time.

    >We aren’t only
    >searching with short patterns or long text lines.  Why is that special
    >case more relevant than others, e.g. short texts?  You are arbitrarily
    >declaring what the frequencies will be like.

    >How do you know how long it will take to create the table?  Accessing a
    >single variable determines the contents of a single particular memory
    >location.  This is about as primitive as can be.  But creating a table
    >from a pattern is a complex procedure.  You cannot so arbitrarily
    >declare the opposite – especially on baseless assumptions about
    >frequencies.

    Um. It happens really a lot that I search text in a text editor for
    a certain string. What I’m searching for is a pattern like "10.1.1"
    or "hilbert transform". The text is typically at least several KB
    long, usually at least tens of KB. That’s fairly typical – those
    "baseless assumptions" are in fact _realistic_ assumptions.

    Here’s a quiz. Define g(m,n) = 1 + n, f(n,m) = m^2 + n/m. Which
    is larger, g(10, 10000) or f(10, 10000)?

    (Answer: g(10, 10000) is larger. Why? Because the fact that 1 is
    so much smaller than 10^2 is essentially irrelevant next to the
    fact that 10000/10 is so much smaller than 10000.)

    - Hide quoted text — Show quoted text -

    >> >> > If you do precompile, then no comparison can be made as to the
    >> >> > efficiencies.
    >> >>[...]

    >> >Then how can he call it "the most efficient" and "the absolute
    >> >minimum"?

    >> >Then what is the significance of this algorithm?  What is its advantage
    >> >over a simple sequential scan?  The sequential scan is much simpler.
    >> >What is the advantage of this algorithm over that?

    >> It’s much faster (when searching for a short target in a long text).

    >That is an informal characterization which carries no mathematical
    >weight.  You are ignoring, or making baseless assumptions concerning,
    >variables such as the cost of creating the table or the distribution of
    >search pattern and text lengths.

    Realistic assumptions.

    >If you want faster, create an array of linked lists of each character.
    >Then the look-up will skip (N-1)/N of search positions (symbol
    >comparisons) where N = the number of different symbols i.e. about
    >98-99%!

    Fascinating. At the start of this you said you didn’t think that
    a simple sequential search could be beat.

    >> It’s kind of curious that someone who’s breaking new ground in CS,
    >> continually going where no man has gone before, is unaware of all
    >> this stuff.

    >That’s a lie.  You never established my unawareness of anything.

    No, _you_ established that in your first post. Quote:

    "Note 1. Faulty Count

    He is not counting the characters compared to make the shift!

    Note 2. Sequential produces the same result more efficiently

    I don’t think there is anything that you can do to improve over a
    simple (faster) sequential scan without precompiling.  Added
    complexity
    means addition inefficiency."

    Note 1 is simply wrong. And Note 2 is hilariously wrong.

    >You attack the messenger because you have no basis to attack the
    >message.

    >> I mean _I_ knew what Tim’s reply was going to say,
    >> although I’m not a programmer, and no I didn’t look at those
    >> links either

    >Mental telepathy?

    Uh, no, just less than total ignorance of how those search
    algorithms work.

    >> – this is all very well-known stuff (definition:
    >> an aspect of theoretical CS that’s known even to _me_ is
    >> "well-known".)

    >Oh, I have no doubt that lots of people slurp up that . . .

    >> >C-B

    ************************

    David C. Ullrich

  24. admin says:

    David C. Ullrich wrote:
    > <giggle> You really had no idea what you were getting into here?

    Sorry, but giggling is only cute when babies do it.

    > See, the problem is that Charlie’s understandably pissed off at a lot
    > of people because he don’t get no respect.

    I applaud your decision to give up on Mathematical Logic and try your
    hand at Psychology instead.  Maybe you’ll do better in that field.

    Now, what makes you feel there is so much anger around you?  Do you
    simply feel more at ease discussing principles of Psychology than
    principles of Mathematical Logic?

    > He’s the only person who
    > has ever formalized any aspect of computer science.

    Why don’t you give an example if other people have done it?  Every
    time you are unable to substantiate a claim is simply more evidence
    that it is not true.  Fine with me!

    > He’s the only
    > person who’s ever made a _working_ automated theorem prover – all
    > the other work there has been just theoretical bullshit.

    Not so.  Who ever said that?  Please quote.

    > It’s remarkable, all that stuff in all those books, he’s the only
    > one out there pointing out that the emperor has no clothes

    Totally false.

    "All that stuff in all those books."  I love it.  You are a little boy
    in awe of all the pretty, fancy books.

    Student: "Professor Ullrich, can you prove your last point?"
    DU: "All that stuff in all those books."

    > nobody seems to take him seriously. You’d be irritated too.

    What’s worse than making unsubstantiated statements?  Making
    statements that by their nature can never be substantiated or refuted.

    Why should anyone take that seriously?  Really.  What’s the sense of
    it all?

    BTW If you really want to know the truth about how I "feel":  You try
    like hell to prove me wrong.  If you could ever find an example of
    someone formally generating a computer program or an incompleteness
    theorem, you surely would plaster it here.  But you don’t.

    You see, you do a lot of my work for me.  Your persistent reference to
    people’s reputation and emotional state only shows me there are no
    Mathematical principles to back up what you so desperately wish was
    true.

    - Hide quoted text — Show quoted text -

    > ************************

    > David C. Ullrich







Place your comment

You must be logged in to post a comment.