Assignment 2: Examples

Example of normalized text similarity

Let us consider the following document d:

Simple Simon met a pieman going to the fair;
Said Simple Simon to the pieman "Let me taste your ware"
Said the pieman to Simple Simon "Show me first your penny"
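Said Simple Simon to the pieman "Sir, I have not any!"

and the following query q:

simple simple simon

We will calculate the similarity between them using the normalized similarity formula, which in this example amounts to sim(q, d) = (q_1 · d_1 + q_2 · d_2 + ... ) / N_d, i.e. the sum of the products q_i · d_i over all terms, divided by N_d.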

The assignment requires that we use term frequencies as weights, i.e. q_i is the term frequency of the i-th term in the query, and d_i is the term frequency of the i-th term in the document. For the above example, the values are as follows.

Query:

q_simple = 2
q_simon = 1

All other q_i values are 0.

Document:

d_simple = 4
d_simon = 4
d_met = 1
d_a = 1
d_pieman = 4
d_going = 1
d_to = 4
d_the = 4
d_fair = 1
d_said = 3
d_let = 1
d_me = 2
d_taste = 1
d_your = 2
d_ware = 1
d_show = 1
d_first = 1
d_penny = 1
d_sir = 1
d_I = 1
d_have = 1
d_not = 1
d_any = 1
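
The counts above can be reproduced with a short script. This is only a minimal sketch: the function name term_frequencies is our own, and we assume that a word is a run of letters and that case is ignored (so "I" is counted under the key "i"). The assignment itself may prescribe a different tokenization.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word occurs in text.

    Assumption: a word is a maximal run of the letters a-z after
    lowercasing, so punctuation and quotes are ignored.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


document = (
    'Simple Simon met a pieman going to the fair; '
    'Said Simple Simon to the pieman "Let me taste your ware" '
    'Said the pieman to Simple Simon "Show me first your penny" '
    'Said Simple Simon to the pieman "Sir, I have not any!"'
)
query = "simple simple simon"

d = term_frequencies(document)  # d_i values
q = term_frequencies(query)     # q_i values

# Print every term with its frequency, most frequent first.
for term, count in d.most_common():
    print(f"d_{term} = {count}")
for term, count in q.most_common():
    print(f"q_{term} = {count}")
```

A Counter returns 0 for a term it has never seen, which matches the statement that all other q_i values are 0.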

Now that we know all the q_i and d_i values, we calculate the numerator of the similarity formula:

q_simple · d_simple + q_simon · d_simon + q_met · d_met + q_a · d_a + q_pieman · d_pieman + q_going · d_going + q_to · d_to + q_the · d_the + q_fair · d_fair + q_said · d_said + q_let · d_let + q_me · d_me + q_taste · d_taste + q_your · d_your + q_ware · d_ware + q_show · d_show + q_first · d_first + q_penny · d_penny + q_sir · d_sir + q_I · d_I + q_have · d_have + q_not · d_not + q_any · d_any =
2 · 4 + 1 · 4 + 0 · 1 + 0 · 1 + 0 · 4 + 0 · 1 + 0 · 4 + 0 · 4 + 0 · 1 + 0 · 3 + 0 · 1 + 0 · 2 + 0 · 1 + 0 · 2 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 = 12

N_d in the similarity formula is the number of words in the document. The assignment says we count stop words as well; therefore, in the above example, N_d = 42.

Finally, we calculate sim(q, d) as 12/42 ≈ 0.28571.
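
The whole calculation can also be checked in code. The sketch below assumes the formula as it is used in this example (sum of q_i · d_i divided by N_d); the function name similarity and the choice to return 0.0 for an empty document are our own assumptions, not part of the assignment text.

```python
from collections import Counter


def similarity(q, d):
    """Sum of q_i * d_i over all terms, divided by N_d.

    q and d map each term to its term frequency. N_d is the total
    number of words in the document, stop words included.
    """
    # Terms absent from the query contribute 0, so it is enough
    # to loop over the query terms only.
    numerator = sum(q[term] * d[term] for term in q)
    n_d = sum(d.values())
    if n_d == 0:
        return 0.0  # assumption: an empty document has similarity 0
    return numerator / n_d


q = Counter({"simple": 2, "simon": 1})
d = Counter({
    "simple": 4, "simon": 4, "met": 1, "a": 1, "pieman": 4,
    "going": 1, "to": 4, "the": 4, "fair": 1, "said": 3,
    "let": 1, "me": 2, "taste": 1, "your": 2, "ware": 1,
    "show": 1, "first": 1, "penny": 1, "sir": 1, "I": 1,
    "have": 1, "not": 1, "any": 1,
})

print(sum(d.values()))              # 42
print(round(similarity(q, d), 5))   # 0.28571
```

Looping over the query terms only gives the same numerator as the long sum above, because every term that does not occur in the query has q_i = 0.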
