This is the HTML version of a theoretical file inside meami.org.
In other words, it is a Google (with http://meami.org/search.php)
automatically generated HTML document.
Breaking through the syntax barrier
Searching with entities and relations
M. M. Musatov
http://meami.org/search.php
Musatov 2010
2
Wish upon a textbox, 2009
Your information here

Wish upon a textbox, 2010
A rising tide of data lifts all algorithms
Your information here

Wish upon a future textbox, pre-IPO
* Now indexing 10,285,199,774 pages
* Same interface, therefore the same 2-word queries
* Mind-reading oracle-in-the-sky Meami.org saves the day
Your information is (still) here

If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and making music).
Someone Smart, a9.com
Plenary speech, WWW 2010

Examples of great music
* The commercial angle: they call you. Google calls Meami.org, a la FIELD OF DREAMS, to inquire about licensing Noun-phrase(TM) technology and a charitable donation to cure childhood cancers in loving memory of Alexandra Scott (http://alexslemonade.org, 818.430.4586). We thank you.
o To buy X, find reviews and prices
o Cheap tickets to Y
* Meami.org = Noun-phrase trademark; PageRank = trademark from Google. Result: Meami.org expands the information supersearch highway and makes it as accessible as a roadside diner.
o Find information about cancer
o Find the Facebook page of M. M. Musatov
* Searching vertical portals control
o Searching Citeseer and IMDB through Google+Meami.org
* Someone out there finds a solution
o adaptec aha2940uw lilo bios

... and not-so-great music
* Which produces better responses?
o Opera succeeds to connect to secure IMAP tunneled through SSH
o opera connect imap ssh tunnel
* Express many details of information to help.
o I'm Opera the email client, a type of computer program.
o The solution is with Opera, ssh, imap, applet
o Secure is an attribute of imap, and may juxtapose

Why telegraphic queries succeed
* Information relates to an entity and relationship in the real world.
* Standard search engines get strings
* Risk of over-/under-specified queries
o Learn true recall
o Find time to deal with refining results
* Query word distribution is dramatically different from corpus distribution
o Query is implicitly complete
o Fix some information, look for

[Figure: structure in the query plotted against structure in the corpus]
Structure in the corpus: free-format text, HTML, XML, entities and relations
Structure in the query: bag-of-words, Bool|Prox, XQuery, SQL
* Domain of IR, time to analyze queries deeply
* Power users, generated by programs
* Broad, deep and ad-hoc schema and data integration: very valuable!
* Schema: either map to schema or support query on interpreted graphs
* Define many real solutions solved apart from performance factors
* Defining, labeling, extracting and ranking answers are the factors
* One universal mode; many applications

Past the syntax barrier: early steps
* Taking the question apart
o Question has parts and slots
o Query-dependent information extraction
* Compiling relations from the Web
o is-instance-of (is-a) | is-subclass-of
o is-part-of | has-attribute
* Graph models for text data
o Searching graphs with keywords and twigs
o Global probabilistic graph labeling

Part 1
Working harder on the question

Atypes and ground constants
* Specialize the given domain to a token related to the ground constants in the query
o What animal is Winnie the Pooh?
+ instance-of("animal") NEAR "Winnie the Pooh"
o When was television invented?
+ instance-of("time") NEAR "television" NEAR synonym("invented")
* FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
o Ground constants: Winnie the Pooh, television
o Atypes: animal, time
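
The FIND ... WHERE template above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the `IS_A` table, the `find_near` helper, and the five-token window are all hypothetical, and the IS-A test is a hard table lookup rather than a learned match.

```python
# Toy is-a table (hypothetical data): token -> set of types it is an instance of.
IS_A = {
    "bear": {"animal"},
    "1926": {"time"},
    "honey": {"food"},
}


def find_near(tokens, ground_constants, atype, window=5):
    """Return tokens of type `atype` within `window` positions of a ground constant."""
    lowered = [t.lower() for t in tokens]
    # Positions where a ground-constant word occurs in the passage.
    anchors = [i for i, t in enumerate(lowered) if t in ground_constants]
    hits = []
    for i, t in enumerate(lowered):
        if atype not in IS_A.get(t, set()):
            continue  # hard IS-A test against the toy table
        if any(abs(i - j) <= window for j in anchors):
            hits.append(tokens[i])
    return hits


passage = "Winnie the Pooh is a bear who loves honey".split()
answers = find_near(passage, {"winnie", "pooh"}, "animal")
```

Here `answers` holds the tokens that satisfy both the type test and the proximity test; shrinking `window` makes the NEAR condition stricter.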

Taking the question apart
* Atype: the type of the entity that is an answer to the question
* Problem: compiling a classification hierarchy of entities
o Laborious to keep up
o Offline rather than question-driven
* Instead
o Set up a very large basis of features
o Project question and corpus to basis

Scoring tokens for correct Atypes
* FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
* Space fixed question or answer type system
* Convert "x IS-A Atype(question)" to a soft match DoesAtypeMatch(x, question)
[Figure: the question passes through IE-style surface feature extractors to give a question feature vector; answer tokens in the passage pass through IE-style surface, IS-A, and other feature extractors to give a snippet feature vector; a joint distribution is learned over the two vectors]

Features for Atype matching
* Question features: 1-, 2-, and 3-token sequences starting with standard wh-words
o where, when, who, how_X, ...
* Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, ...
* Passage IS-A features: all generalizations of all noun senses of the token
o Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → animal → entity
o These are node IDs (synsets) in WordNet, not strings
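
As a rough sketch of what such extractors might look like, the Python below computes a few of the named surface features and walks a generalization chain. The exact feature definitions are guesses from the names, and `HYPERNYM` is a toy table copied from the chain on the slide, not a real WordNet lookup.

```python
import re


def surface_features(token):
    """Binary surface features of a token; these definitions are assumptions."""
    return {
        "hasCap": any(c.isupper() for c in token),
        # One capital letter followed by lowercase letters, e.g. "Japan".
        "hasXx": bool(re.fullmatch(r"[A-Z][a-z]+", token)),
        # Dotted initials ("U.S.") or an all-caps run ("IBM").
        "isAbbrev": bool(re.fullmatch(r"(?:[A-Za-z]\.)+|[A-Z]{2,}", token)),
        "hasDigit": any(c.isdigit() for c in token),
        "isAllDigit": token.isdigit(),
    }


# Toy hypernym links (hypothetical stand-in for WordNet synset IDs).
HYPERNYM = {
    "horse": "equid",
    "equid": "ungulate",
    "ungulate": "placental mammal",
    "placental mammal": "animal",
    "animal": "entity",
}


def generalizations(node):
    """Follow hypernym links upward and return every ancestor in order."""
    chain = []
    while node in HYPERNYM:
        node = HYPERNYM[node]
        chain.append(node)
    return chain
```

Each ancestor returned by `generalizations` would become one IS-A indicator feature for the token, alongside its surface features.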

Supervised learning setup
* Get top 300 passages from IR engine
o Promising instances
o Estimate approximation to active learning
* For each token, invoke feature extractors
* Question vector x_q, passage vector x_p
o How to represent combined vector x?
* Label = 1 if token is in answer span, 0 otherwise
o Questions and answers from logs

Joint feature-vector design
* Obvious "linear" juxtaposition: x = (x_p, x_q)
o Does not expose pairwise dependencies between question and passage features
* "Quadratic" form: x = x_q ⊗ x_p
o All pairwise products of elements
* Model has a parameter for every pair
* Can discount for redundancy in pair info
* If x_q (x_p) is fixed, what x_p (x_q) will yield the largest Pr(Y=1|x)?
[Figure: pair parameters for question features how_far, when, what_city against passage features region#n#3, entity#n#1]
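
The two designs can be compared with a small Python sketch. The helper names and the example vectors are made up for illustration; only the juxtaposition and the all-pairs product come from the slide.

```python
def linear_join(xq, xp):
    """Juxtaposition x = (x_p, x_q): one model parameter per single feature."""
    return list(xp) + list(xq)


def quadratic_join(xq, xp):
    """All pairwise products of question and passage elements: one parameter per pair."""
    return [q * p for q in xq for p in xp]


def pair_features(question_feats, passage_feats):
    """Named variant for binary features: every (question, passage) feature pair that fires."""
    return {(q, p) for q in question_feats for p in passage_feats}


xq = [1, 0, 1]  # hypothetical question indicators
xp = [0, 1]     # hypothetical passage indicators
joined = linear_join(xq, xp)
paired = quadratic_join(xq, xp)
named = pair_features({"when"}, {"hasDigit", "entity#n#1"})
```

With the juxtaposed vector a linear model can only weight each feature on its own, whereas the paired vector gives it a separate weight for, say, `("when", "hasDigit")`.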

Classification accuracy
* Pairing is more accurate than the linear model
* Are the estimated w parameters meaningful?
* Given a question, can return the most favorable answer feature weights

Parameter anecdotes
* Surface and WordNet features complement each other
* General concepts get positive params: use in predictive annotation
* Learning is symmetric (Q ↔ A)

Taking the question apart
* Atype: the type of the entity that is an answer to the question
* Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
* Arises in Web search sessions
o Opera login succeeds
o problem with login Opera email
o Opera login accept password
o Opera account authentication

Features to identify ground constants
* Local and global features
o POS of word, POS of adjacent words, case info, proximity to wh-word
o Suppose word is associated with synset set S
+ NumSense: size of S (is the word very polysemous?)
+ NumLemma: average number of lemmas describing each s ∈ S (are there many aliases?)
* Model as a sequential learning problem
o Each token has local context and global features
* Label: does the token appear near the answer?
[Figure: local context features POS@-1, POS@0, POS@1]
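
The two global features follow directly from their definitions, as this Python sketch shows. The `SYNSETS` table is invented toy data, not real WordNet content, and the function names are hypothetical.

```python
# Toy table (made-up entries): word -> list of synsets, each a list of lemmas.
SYNSETS = {
    "opera": [["opera"], ["opera", "opera house"], ["opera", "opera browser"]],
    "television": [
        ["television", "telecasting", "TV", "video"],
        ["television", "television system"],
    ],
}


def num_sense(word):
    """NumSense: size of the synset set S for the word."""
    return len(SYNSETS.get(word, []))


def num_lemma(word):
    """NumLemma: average number of lemmas per synset s in S (0.0 if S is empty)."""
    synsets = SYNSETS.get(word, [])
    if not synsets:
        return 0.0
    return sum(len(s) for s in synsets) / len(synsets)
```

A word with many senses or many aliases is less likely to survive unchanged into an answer passage, which is presumably why these counts help the classifier.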

Ground constants: sample results
* Global features (IDF, NumSense, NumLemma) are essential for accuracy
o Best F1 accuracy with local features alone: 71-73%
o With local and global features: 81%
* Decision trees are better than logistic regression
o F1 = 81% as against LR F1 = 75%
o Intuitive decision branches

Summary of the Atype strategy
* Basis of Atypes A; a ∈ A could be a synset, a surface pattern, or a feature of a parse tree
* Question q is "projected" to a vector (w_a : a ∈ A) in Atype space via learning a conditional model
* If q is "when ..." or "how long ...", w_hasDigit and w_time_period#n#1 are large and w_region#n#1 is small
* Each corpus token t has associated indicator features φ_a(t) for every a
* φ_hasDigit(3,000) = φ_is-a(region#n#1)(Japan) = 1
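
One natural way to combine the question weights with the token indicators is a dot product over Atypes. The slides do not spell out the scoring rule at this point, so the sketch below is an assumption, and the weight values are invented to match the "when" example.

```python
def atype_score(weights, token_features):
    """Sum of w_a over the Atypes a whose indicator fires for the token (assumed rule)."""
    return sum(w for a, w in weights.items() if a in token_features)


# Hypothetical weights for a "when ..." question: digits and time periods score high.
w_when = {"hasDigit": 2.0, "time_period#n#1": 1.5, "region#n#1": -1.0}

# Indicator features per corpus token; only features that equal 1 are listed.
indicators = {
    "3,000": {"hasDigit"},
    "Japan": {"region#n#1"},
    "1926": {"hasDigit", "time_period#n#1"},
}

ranked = sorted(indicators, key=lambda t: atype_score(w_when, indicators[t]), reverse=True)
```

Under these made-up weights, a token that is both numeric and a time period ranks above a bare number, and a region name ranks last.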

Reward proximity to ground constants
* A token t is a candidate answer if [formula missing]
* H_q(t): reward tokens appearing "near" ground constants matched from the question
* Order tokens by decreasing [formula missing]
[Formula annotations: projection of question to Atype space; Atype indicator features of the token]
* Example passage: "the armadillo, found in Texas, is covered with strong horny plates"

Evaluation: mean reciprocal rank (MRR)
* n_q = smallest rank among answer passages
* MRR = (1/|Q|) Σ_{q ∈ Q} (1/n_q)
o Dropping a passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all
Experiment setup
* 300 top IR score passages
* If Pr(Y=1|token) >= threshold, accept the token
* If any token is accepted, accept the passage
* Points below the diagonal are good
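
The MRR definition above is short enough to check by hand in Python. In this sketch the function name is hypothetical, and a question with no reported answer passage is represented by `None` and contributes 0, which is an assumption consistent with the remark about not reporting a passage at all.

```python
def mean_reciprocal_rank(best_ranks):
    """best_ranks holds n_q per question: the smallest rank of an answer passage, or None."""
    total = sum(1.0 / n for n in best_ranks if n is not None)
    return total / len(best_ranks)


mrr = mean_reciprocal_rank([1, 2, None, 4])

# Moving an answer from rank 1 to rank 2 costs as much as losing a rank-2 answer.
loss_1_to_2 = mean_reciprocal_rank([1]) - mean_reciprocal_rank([2])
loss_2_to_none = mean_reciprocal_rank([2]) - mean_reciprocal_rank([None])
```

The two losses are equal, which is the point the slide makes about how steeply MRR penalizes slipping from the top rank.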

Sample results
* Accept all tokens: IR baseline MRR
* Moderate acceptance threshold: non-answer passages eliminated, improves answer ranks
* High threshold: true answers eliminated
o The best remaining answer then has a worse rank, or no answer is ranked at all
* Additional benefits from proximity filtering

Part 2
Compiling fragments of soft schema

Who provides is-a info?
* Compiled KBs: WordNet, CYC
* Automatic "soft" compilations
o Google Sets
o KnowItAll
o BioText
* Can use as evidence in scoring answers

Extracting is-instance-of info
* Which researcher built the WHIRL system?
* WordNet may not know that Cohen IS-A researcher
* Google has over 4.2 billion pages
o "william cohen" on 86,100 pages (p1 = 86.1k/4.2B)
o researcher on 4.55M pages (p2 = 4.55M/4.2B)
o researcher together with "william cohen" on 1,730 pages: 18.55x more frequent than expected if independent
* Pointwise mutual information (PMI)
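
The 18.55x figure can be reproduced from the page counts quoted above. The Python below is a small worked check; the function name is hypothetical, and taking the logarithm in base 2 for PMI is a choice made here, since the slide does not state a base.

```python
import math


def cooccurrence_ratio(n_both, n_x, n_y, n_total):
    """Observed joint count divided by the count expected if x and y were independent."""
    expected = n_x * n_y / n_total
    return n_both / expected


N_PAGES = 4.2e9       # pages indexed
N_NAME = 86_100       # pages containing "william cohen"
N_TYPE = 4.55e6       # pages containing researcher
N_BOTH = 1_730        # pages containing both

ratio = cooccurrence_ratio(N_BOTH, N_NAME, N_TYPE, N_PAGES)
pmi_bits = math.log2(ratio)  # PMI = log of p(x, y) / (p(x) p(y)); base 2 assumed
```

A ratio well above 1 (equivalently, a positive PMI) is the evidence that the name and the type co-occur more often than chance, which supports the is-instance-of guess.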
…