Impact of Page Rank

Assignment 2 in Internet Search Techniques and Business Intelligence. You work in a group of 4 people as you did for the previous assignment.

Task

The task is to simulate a simple search engine. Consider the following steps of your work:

Documents

Queries

Select one of the two queries:

    a)   DSV football
    b)   Nikos DSV

Procedure

1. Text Similarity

Match the selected query against all the documents and list the documents in order of their similarity score. Calculate the normalized document similarity, using term frequency as the term weight. That is, apply the following similarity formula:

Similarity formula
where

See an example of calculating similarity between a document and a query. In your report, you are not required to repeat all the details that you find in the example. Note that because term frequencies are used as the only term weights, it is possible to convert the formula to something very simple which makes the similarity calculations trivial.
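The calculation can also be checked with a short script. The sketch below assumes that the normalized similarity is the cosine measure with raw term frequencies as weights; the formula itself is not reproduced in this text, so treat that form as an assumption. The function names and the three documents X, Y and Z are hypothetical and are not the documents of the assignment.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map each lowercased, whitespace-separated term to its raw frequency."""
    return Counter(text.lower().split())


def similarity(query, document):
    """Cosine-normalized similarity with term frequencies as weights (assumed form)."""
    q, d = tf_vector(query), tf_vector(document)
    dot = sum(q[term] * d[term] for term in q)  # Counter returns 0 for absent terms
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


# Hypothetical documents for illustration only.
docs = {
    "X": "DSV football team wins",
    "Y": "Nikos teaches at DSV",
    "Z": "football football news",
}

# Document ids ordered by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: similarity("DSV football", docs[name]),
                 reverse=True)
```

With these made-up documents, X contains both query terms once and ranks first; Z repeats one query term and ranks above Y, which contains a single occurrence of one query term.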

2. Page Rank

Calculate the page rank of each of the 5 documents. "0.ddmm" is the page rank value of the external document; it enters the formula like any other page rank value coming from another page. Here dd stands for the day and mm for the month of birth of the oldest person in the group. For example, if the birthday is November 24, then 0.ddmm = 0.2411.
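As a small illustration of the 0.ddmm convention, the following sketch builds the value from a day and a month. The function name external_page_rank is hypothetical.

```python
def external_page_rank(day, month):
    """Build the 0.ddmm value from a birth day and a birth month."""
    return float(f"0.{day:02d}{month:02d}")


# November 24 gives 0.2411, as in the example above.
value = external_page_rank(24, 11)
```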

Use any means of calculation you find appropriate. Allow 5 digits after the decimal point. Consider examples:

Verify your page rank: remove the external page, i.e. set 0.ddmm = 0, and recalculate the page rank values. The average page rank value over the five Documents A through E should be 0.65301. Please note that the average is not 1 because Document E has no outgoing links and so does not return its page rank to the collection.

Typical mistakes. Be careful when you count the number of outgoing links. Document A has 3 outgoing links, Document B has 2, Document C has 2, Document D has 4, and Document E has no outgoing links.

Be careful when you do Excel iterations (if you do). Check several times whether you still have the right formula in all the Excel cells.
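If you prefer a script to Excel iterations, the calculation can be sketched as below. This is a minimal sketch under several assumptions that are not stated in this text: the page rank formula is taken to be PR(p) = (1 - d) + d · Σ PR(t)/C(t) over the pages t that link to p, with damping factor d = 0.85; the external page is assumed to have a single outgoing link, so its whole page rank reaches the page it links to; and the three-page graph P, Q, R is hypothetical and is not the link structure of Documents A through E.

```python
def page_rank(links, external=None, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to p.

    links:    dict mapping each page to the list of pages it links to.
    external: optional dict mapping a page to the fixed share it receives
              from an external page (external page rank divided by the
              number of outgoing links of the external page).
    """
    external = external or {}
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Pages without outgoing links never pass the filter,
            # so there is no division by zero.
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            incoming += external.get(page, 0.0)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    # The assignment asks for 5 digits after the decimal point.
    return {page: round(value, 5) for page, value in pr.items()}


# Hypothetical link structure for illustration only.
links = {"P": ["Q", "R"], "Q": ["R"], "R": ["P"]}
closed = page_rank(links)
with_external = page_rank(links, external={"P": 0.2411})
```

Calling the function without the external argument corresponds to the verification step, where 0.ddmm = 0. In a graph where every page has at least one outgoing link, the average of the returned values is 1; a page without outgoing links lowers the average.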

3. Combining Text Similarity with Page Rank

Order the documents using a different (but still very simple) formula that makes use of page ranks:

SIM1(qd) = sim(qd) + 0.5 · PRinitial(d)

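where sim(qd) is the text similarity from part (1) and PRinitial(d) is the page rank of document d calculated in part (2).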

4. Re-linking the Documents

Your task is to move up the last two documents in the document list obtained in part (3) by changing the page rank of these documents. In order to change the page rank you re-link the documents as follows:

Re-calculate the page rank of each document, re-order the documents:

SIM2(qd) = sim(qd) + 0.5 · PRnew(d)

Observe the new placement of the two previously bottom-most documents.
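The two orderings, SIM1 and SIM2, differ only in the page rank values that are plugged in, so one helper covers both. In the sketch below the function name combined_ranking and all the numeric values are hypothetical; use your own similarity scores and page ranks instead.

```python
def combined_ranking(sim, page_rank, weight=0.5):
    """Return (doc, score) pairs sorted by sim + weight * page rank, best first."""
    scores = {doc: sim[doc] + weight * page_rank[doc] for doc in sim}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for illustration only.
sim = {"A": 0.50, "B": 0.30, "C": 0.00}
pr_initial = {"A": 0.20, "B": 1.00, "C": 0.40}
pr_new = {"A": 0.20, "B": 0.40, "C": 1.60}

sim1 = combined_ranking(sim, pr_initial)  # ordering for SIM1
sim2 = combined_ranking(sim, pr_new)      # ordering for SIM2
```

With these made-up numbers, document C is last under SIM1 and first under SIM2, although its text similarity is unchanged.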

Report

Write:

In order to show the relevance of the documents to the query applying different formulas, fill in the table below. Remember to re-order the documents for each similarity metric according to their placement in the list.

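Placement     Initial Page Rank    New Page Rank        sim(qd)              SIM1(qd)             SIM2(qd)
in the list   Doc id   PR value    Doc id   PR value    Doc id   sim value   Doc id   SIM value   Doc id   SIM value
1             ......   ......      ......   ......      ......   ......      ......   ......      ......   ......
2             ......   ......      ......   ......      ......   ......      ......   ......      ......   ......
3             ......   ......      ......   ......      ......   ......      ......   ......      ......   ......
4             ......   ......      ......   ......      ......   ......      ......   ......      ......   ......
5             ......   ......      ......   ......      ......   ......      ......   ......      ......   ......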

Don't forget to write your name on the report. Attach the Excel sheets and program code if you have any. This makes error detection easier.


Eriks Sneiders