Logic — math, philosophy & computational aspects

logic, math, philosophy, math games, math help, mathematical logic, philosophy of education, math facts




Musatov releases 'Example Hypertext classification' to assign labels to Web pages


Breaking through the syntax barrier
Searching with entities and relations

M. M. Musatov

http://meami.org/search.php

Musatov 2010


Wish upon a textbox, 2009

Your information here


Wish upon a textbox, 2010

A rising tide of data lifts all algorithms

Your information here


Wish upon a future textbox, pre-IPO

    * Now indexing 10,285,199,774 pages
    * Same interface, therefore same 2-word queries
    * Mind-reading-oracle-in-the-sky Meami.org saves the day

Your information is (still) here


If music had been invented ten
years ago along with the Web,
we would all be playing
one-string instruments
(and making music).

Someone Smart, a9.com
Plenary speech
WWW 2010


Examples of great music

    * The commercial angle: they call you. Google calls Meami.org, a la Field of Dreams, to inquire about licensing Noun-phrase(TM) technology and about a charitable donation to cure childhood cancers in loving memory of Alexandra Scott. http://alexslemonade.org 818.430.4586. We thank you.
          o to buy X, find reviews and prices
          o Cheap tickets to Y
    * Meami.org = Noun-phrase (trademark) plus PageRank (trademark of Google). The result: Meami.org expands the information supersearch highway and makes it as accessible as a roadside diner.
          o Find information about cancer
          o Find the facebook page of M. M. Musatov
    * Searching vertical portals control
          o Searching Citeseer and IMDB through Google Meami.org
    * Someone out there finds a solution
          o adaptec aha2940uw lilo bios


... and not-so-great music
Smile= :-)

    * Which produces better responses?
          o Opera succeeds to connect to secure IMAP tunneled through SSH
          o opera connect imap ssh tunnel
    * Express many details of information to help.
          o I’m Opera the email client, a type of computer program.
          o The solution is with Opera, ssh, imap, applet
          o Secure is an attribute of imap, and may juxtapose


Why telegraphic queries succeed

    * Information relates to an entity and relationship in the real world.
    * Standard search engines get strings
    * Risk of over- or under-specified queries
          o Learn true recall
          o Find time to deal with refining results
    * Query word distribution dramatically different from corpus distribution
          o Query is implicitly complete
          o Fix some information, look for


[Figure: structure in the corpus versus structure in the query]

    * Structure in the corpus (axis): free-format text, HTML, XML, entity-relations
    * Structure in the query (axis): bag-of-words, Bool|Prox, XQuery, SQL
    * Annotations on the figure
          o Domain of IR, time to analyze queries deeply
          o Power users, generated by programs
          o Broad, deep and ad-hoc schema and data integration
          o Very valuable!
          o Schema; either map to schema or support query on interpreted graphs
          o Define many real solutions solved apart from performance factors
          o Defining labeling, extracting and ranking answers are the factors; one universal mode; many applications


Past the syntax barrier: early steps

    * Taking the question apart
          o Question has parts and slots
          o Query-dependent information extraction
    * Compiling relations from the Web
          o is-instance-of (is-a)| is-subclass-of
          o is-part-of| has-attribute
    * Graph models for text data
          o Searching graphs with keywords and twigs
          o Global probabilistic graph labeling

Part 1: Working harder on the question


Atypes and ground constants

    * Specialize given domain to a token related to ground constants in the query
          o What animal is Winnie the Pooh?
                + instance-of(animal) NEAR Winnie the Pooh
          o When was television invented?
                + instance-of(time) NEAR television NEAR synonym(invented)
    * FIND x NEAR GroundConstants(question)
      WHERE x IS-A atype(question)
          o Ground constants: Winnie the Pooh, television
          o Atypes: animal, time
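
The query template above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `IS_A` table, the `find_candidates` helper, and the reading of NEAR as "within a fixed token window" are hypothetical and do not come from the slides.

```python
# Toy is-a table (hypothetical; a real system would consult WordNet or a
# similar knowledge base instead of a hand-written dictionary).
IS_A = {
    "bear": {"animal"},
    "1926": {"time"},
    "honey": {"food"},
}


def find_candidates(tokens, ground_constants, atype, window=5):
    """Return tokens x such that x IS-A atype and x is NEAR a ground constant.

    NEAR is approximated here as "within `window` token positions".
    """
    lowered = [t.lower() for t in tokens]
    # Positions at which any ground-constant word occurs in the passage.
    anchors = [i for i, tok in enumerate(lowered) if tok in ground_constants]
    hits = []
    for i, tok in enumerate(lowered):
        if atype not in IS_A.get(tok, set()):
            continue  # fails the IS-A test
        if any(abs(i - j) <= window for j in anchors):
            hits.append(tokens[i])  # passes both IS-A and NEAR
    return hits


passage = "Winnie the Pooh is a bear who loves honey".split()
print(find_candidates(passage, {"winnie", "pooh"}, "animal"))
```

The hard IS-A test used here is the part that the later slides replace with a learned soft match.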


Taking the question apart

    * Atype: the type of the entity that is an answer to the question
    * Problem: compiling a classification hierarchy of entities
          o Laborious, hard to keep up to date
          o Offline rather than question-driven
    * Instead:
          o Set up a very large basis of features
          o Project question and corpus to basis


Scoring tokens for correct atypes

    * FIND x NEAR GroundConstants(question)
      WHERE x IS-A atype(question)
    * Space fixed question or answer type system
    * Convert x IS-A atype(question) to a soft match DoesAtypeMatch(x, question)

[Figure: feature-extraction pipeline. Labels: question; passage; answer tokens; IE-style surface feature extractors; IS-A feature extractors; other extractors; question feature vector; snippet feature vector; learn joint distribution]


Features for atype matching

    * Question features: 1-, 2- and 3-token sequences starting with standard wh-words
          o where, when, who, how_X, ...
    * Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, ...
    * Passage IS-A features: all generalizations of all noun senses of the token
          o Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → animal → entity
          o These are node IDs (synsets) in WordNet, not strings
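
A rough sketch of the three feature families listed above, written in Python. The exact definitions of `hasXx` and `isAbbrev` are guesses, `lpos` and `rpos` are left out, and the `HYPERNYM` table is a hypothetical stand-in for WordNet synset IDs; none of this is taken from the slides beyond the feature names.

```python
import re

WH_WORDS = {"what", "which", "where", "when", "who", "how", "why"}


def question_features(question, max_len=3):
    """1-, 2- and 3-token prefixes of a question that starts with a wh-word."""
    tokens = re.findall(r"\w+", question.lower())
    if not tokens or tokens[0] not in WH_WORDS:
        return []
    longest = min(max_len, len(tokens))
    return ["_".join(tokens[:n]) for n in range(1, longest + 1)]


def surface_features(token):
    """A few IE-style surface features of a passage token (assumed definitions)."""
    return {
        "hasCap": any(c.isupper() for c in token),
        "hasXx": bool(re.fullmatch(r"[A-Z][a-z]+", token)),
        "isAbbrev": bool(re.fullmatch(r"(?:[A-Za-z]\.)+", token)),
        "hasDigit": any(c.isdigit() for c in token),
        "isAllDigit": token.isdigit(),
    }


# Toy hypernym links (hypothetical; real IS-A features would be synset IDs).
HYPERNYM = {
    "horse": "equid",
    "equid": "ungulate",
    "ungulate": "placental mammal",
    "placental mammal": "animal",
    "animal": "entity",
}


def is_a_features(token):
    """All generalizations of a token, found by walking up the toy hierarchy."""
    chain = []
    node = HYPERNYM.get(token.lower())
    while node is not None:
        chain.append(node)
        node = HYPERNYM.get(node)
    return chain


print(question_features("How far is the moon?"))
print(surface_features("3,000"))
print(is_a_features("horse"))
```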


Supervised learning setup

    * Get top 300 passages from IR engine
          o Promising instances
          o Estimate approximation to active learning
    * For each token invoke feature extractors
    * Question vector $x_q$, passage vector $x_p$
          o How to represent combined vector $x$?
    * Label = 1 if token is in answer span, 0 otherwise
          o Question and answers from logs


Joint feature-vector design

    * Obvious: linear juxtaposition $x = (x_p, x_q)$
          o Does not expose pairwise dependencies
    * Quadratic form $x = x_q \otimes x_p$
          o All pairwise products of elements
    * Model has a parameter for every pair

    * Can discount for redundancy in pair info
    * If $x_q$ ($x_p$) is fixed, which $x_p$ ($x_q$) will yield the largest $\Pr(Y=1 \mid x)$?

[Figure labels: question features how_far, when, what_city; answer features region#n#3, entity#n#1]
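
The two joint-vector designs on this slide can be compared with a small Python sketch. The function names and the weight matrix `W` are hypothetical; the weights are made up for illustration and are not results from the slides.

```python
def concat_features(xq, xp):
    """Linear juxtaposition x = (x_p, x_q): no term involves both vectors."""
    return list(xp) + list(xq)


def pairwise_features(xq, xp):
    """Quadratic form: all pairwise products x_q[i] * x_p[j], flattened row by row."""
    return [q * p for q in xq for p in xp]


def most_favorable_passage_feature(W, xq):
    """With x_q fixed, pick the passage feature j with the largest linear score.

    W[i][j] is the (hypothetical) weight of the pair (question feature i,
    passage feature j). For a logistic model the score is monotone in
    Pr(Y=1|x), so the argmax of the score is also the argmax of Pr.
    """
    n_passage = len(W[0])
    scores = [
        sum(W[i][j] * xq[i] for i in range(len(xq))) for j in range(n_passage)
    ]
    return max(range(n_passage), key=scores.__getitem__)


question_names = ["how_far", "when", "what_city"]
passage_names = ["region#n#3", "entity#n#1"]
W = [
    [0.1, 0.0],   # how_far
    [-0.5, 0.2],  # when
    [1.5, 0.3],   # what_city
]
xq = [0, 0, 1]  # the question starts with "what city"
best = most_favorable_passage_feature(W, xq)
print(passage_names[best])
```

The quadratic form has one weight per (question feature, passage feature) pair, which is what makes the "fix one side, maximize over the other" question answerable directly from the parameters.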


Classification accuracy

    * Pairing is more accurate than the linear model
    * Are the estimated w parameters meaningful?
    * Given a question, can return the most favorable answer feature weights


Parameter anecdotes

    * Surface and WordNet features complement each other
    * General concepts get positive params: use in predictive annotation
    * Learning is symmetric (Q ↔ A)


Taking the question apart

    * Atype: the type of the entity that is an answer to the question
    * Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
    * Arises in Web search sessions
          o Opera login succeeds
          o problem with login Opera email
          o Opera login accept password
          o Opera account authentication


Features to identify ground constants

    * Local and global features
          o POS of word, POS of adjacent words, case info, proximity to wh-word
          o Suppose word is associated with synset set $S$
                + NumSense: size of $S$ (is the word very polysemous?)
                + NumLemma: average number of lemmas describing $s \in S$ (are there many aliases?)
    * Model as a sequential learning problem
          o Each token has local context and global features
    * Label: does the token appear near the answer?

[Figure labels: POS@-1, POS@0, POS@1]
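
As a minimal sketch of the local and global features named above, assuming a toy synset inventory: the `SYNSETS` data below is invented for illustration and is not real WordNet content, and the helper names are hypothetical.

```python
# Toy inventory (hypothetical): word -> list of synsets, each a list of lemmas.
SYNSETS = {
    "television": [
        ["television", "TV"],
        ["television set", "TV", "telly", "tv set"],
    ],
    "opera": [["opera"]],
}


def global_features(word):
    """NumSense = |S|; NumLemma = average number of lemmas per synset s in S."""
    synsets = SYNSETS.get(word.lower(), [])
    num_sense = len(synsets)
    num_lemma = sum(len(s) for s in synsets) / num_sense if num_sense else 0.0
    return {"NumSense": num_sense, "NumLemma": num_lemma}


def local_features(pos_tags, i):
    """POS of the token at position i and of its two neighbours."""
    def tag(k):
        return pos_tags[k] if 0 <= k < len(pos_tags) else "NONE"

    return {"POS@-1": tag(i - 1), "POS@0": tag(i), "POS@1": tag(i + 1)}


# "When was television invented" with assumed Penn Treebank style tags.
tags = ["WRB", "VBD", "NN", "VBN"]
features = {**local_features(tags, 2), **global_features("television")}
print(features)
```

Each token's combined feature dictionary would then be fed to whatever sequence or tree learner is being compared.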


Ground constants sample results

    * Global features (IDF, NumSense, NumLemma) essential for accuracy
          o Best F1 accuracy with local features alone: 71 to 73%
          o With local and global features: 81%
    * Decision trees better than logistic regression
          o F1 = 81% as against LR F1 = 75%
          o Intuitive decision branches


Summary of the atype strategy

    * Basis of atypes $\mathcal{A}$; each $a \in \mathcal{A}$ could be a synset, a surface pattern, or a feature of a parse tree
    * Question $q$ is projected to a vector $(w_a : a \in \mathcal{A})$ in atype space via learning a conditional model
    * If $q$ is "when" or "how long", the weights w_hasDigit and w_time_period#n#1 are large and w_region#n#1 is small
    * Each corpus token $t$ has associated indicator features $a(t)$ for every $a \in \mathcal{A}$
    * Example: hasDigit(3,000) = is-a(region#n#1)(Japan) = 1
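
One plausible way to combine the question projection with the token indicators is a dot product; the slides as preserved here do not show the exact scoring formula, so the `atype_score` function and the weights below are assumptions made purely for illustration.

```python
def atype_score(w, indicators):
    """Dot product of the question projection w with a token's 0/1 indicators.

    `indicators` is the set of atypes a for which a(t) = 1.
    """
    return sum(w.get(a, 0.0) for a in indicators)


# Hypothetical projection of a "when ..." question into atype space.
w_when = {"hasDigit": 2.0, "time_period#n#1": 1.5, "region#n#1": -1.0}

# Indicator features that fire for two corpus tokens.
token_indicators = {
    "3,000": {"hasDigit"},
    "Japan": {"region#n#1"},
}

ranking = sorted(
    token_indicators,
    key=lambda tok: atype_score(w_when, token_indicators[tok]),
    reverse=True,
)
print(ranking)
```

Under these made-up weights a digit-bearing token outranks a region name for a "when" question, which matches the qualitative behaviour described in the bullets.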


Reward proximity to ground constants

    * A token $t$ is a candidate answer if

    * $H_q(t)$: reward tokens appearing near ground constants matched from the question

    * Order tokens by decreasing

[Formula labels: atype indicator features of the token; projection of question to atype space]

Example passage: the armadillo, found in Texas, is covered with strong horny plates


Evaluation: mean reciprocal rank (MRR)

    * $n_q$ = smallest rank among answer passages for question $q$
    * $\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{n_q}$
          o Dropping a passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all
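
The MRR definition above translates directly into code. In this sketch the function name is hypothetical, and a question with no answer passage returned is represented by `None` and contributes a reciprocal rank of 0.

```python
def mean_reciprocal_rank(best_ranks):
    """Average of 1/n_q over all questions.

    `best_ranks` holds, per question, the smallest 1-based rank of an answer
    passage, or None when no answer passage was reported at all.
    """
    if not best_ranks:
        raise ValueError("need at least one question")
    total = sum(0.0 if n is None else 1.0 / n for n in best_ranks)
    return total / len(best_ranks)


print(mean_reciprocal_rank([1, 2, None]))

# Dropping from #1 to #2 costs as much as dropping from #2 to "not reported".
loss_1_to_2 = mean_reciprocal_rank([1]) - mean_reciprocal_rank([2])
loss_2_to_none = mean_reciprocal_rank([2]) - mean_reciprocal_rank([None])
print(loss_1_to_2, loss_2_to_none)
```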

Experiment setup

    * 300 top IR score passages
    * If Pr(Y=1|token) <

posted by admin in Uncategorized and have Comments (2)






2 Responses to “Musatov releases 'Example Hypertext classification' to assign labels to Web pages”

  2. admin says:

    Experiment setup

    300 top IR score passages
    If Pr(Y=1|token) >= threshold, accept the token
    If any token is accepted, accept the passage
    Points below the diagonal are good
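
A small sketch of this filtering step, assuming a token is kept when its estimated probability reaches the threshold and a passage is kept when at least one of its tokens is kept. The function names and the probabilities are hypothetical.

```python
def accept_passage(token_probs, threshold):
    """A passage survives if any of its tokens is accepted."""
    return any(p >= threshold for p in token_probs)


def filter_passages(ranked_passages, threshold):
    """Keep IR order, dropping passages in which no token is accepted.

    `ranked_passages` is a list of (passage_id, [Pr(Y=1|token), ...]).
    """
    return [
        pid for pid, probs in ranked_passages if accept_passage(probs, threshold)
    ]


# Made-up example: "p2" is the true answer passage, ranked second by IR.
passages = [("p1", [0.10, 0.20]), ("p2", [0.05, 0.70]), ("p3", [0.30])]

print(filter_passages(passages, 0.0))   # accept everything: IR baseline order
print(filter_passages(passages, 0.25))  # moderate threshold
print(filter_passages(passages, 0.9))   # threshold too high
```

With the moderate threshold the non-answer passage "p1" is dropped and "p2" moves up to rank 1; with the high threshold the true answer is eliminated as well.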

    Sample results

    Accept all tokens → IR baseline MRR
    Moderate acceptance threshold → non-answer passages eliminated, improves answer ranks
    High threshold → true answers eliminated
    Another answer with a worse rank, or rank = ∞
    Additional benefits from proximity filtering

    Part 2: Compiling fragments of soft schema


    Who provides is-a info?

    Compiled KBs: WordNet, CYC
    Automatic soft compilations: Google Sets, KnowItAll, BioText
    Can use as evidence in scoring answers

    Extracting is-instance-of info

    Which researcher built the WHIRL system?
    WordNet may not know that Cohen IS-A researcher
    Google has over 4.2 billion pages
    "william cohen" on 86100 pages (p1 = 86.1k/4.2B)
    "researcher" on 4.55M pages (p2 = 4.55M/4.2B)
    researcher "william cohen" on 1730 pages: 18.55x more frequent than expected if independent
    Pointwise mutual information (PMI)
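
The arithmetic behind the 18.55x figure can be checked with a few lines of Python. The function names are hypothetical, and the base-2 logarithm for PMI is one common convention assumed here, not something the slide specifies.

```python
import math


def cooccurrence_ratio(n_x, n_y, n_xy, n_total):
    """Observed co-occurrence count divided by the count expected if independent."""
    expected = (n_x / n_total) * (n_y / n_total) * n_total
    return n_xy / expected


def pmi(n_x, n_y, n_xy, n_total):
    """Pointwise mutual information in bits (log base 2 is an assumed convention)."""
    return math.log2(cooccurrence_ratio(n_x, n_y, n_xy, n_total))


TOTAL_PAGES = 4_200_000_000   # "over 4.2 billion pages"
N_COHEN = 86_100              # pages with "william cohen"
N_RESEARCHER = 4_550_000      # pages with "researcher"
N_BOTH = 1_730                # pages with both

ratio = cooccurrence_ratio(N_COHEN, N_RESEARCHER, N_BOTH, TOTAL_PAGES)
print(round(ratio, 2))
print(pmi(N_COHEN, N_RESEARCHER, N_BOTH, TOTAL_PAGES))
```

A ratio well above 1 (equivalently, a positive PMI) is the evidence that the two strings co-occur more often than chance.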







