Sunday, 23 October 2011

Information Retrieval


Information retrieval is different to data retrieval! Information retrieval is about finding information that is not organised in a rigid structure; this is in contrast to data retrieval, in which data is stored in clearly defined tables in a database. In addition, information retrieval relies upon relevance: because information is unstructured, and is sought in a more general and subjective way than the deterministic methods used to query a relational database, the search system ranks its results by relevance when it displays them. One way of doing this is to estimate the probability that each document is relevant (the probabilistic model). In the words of Baeza-Yates and Ribeiro-Neto, “the key goal of an IR system is to retrieve information which might be useful or relevant to the user.”

Unstructured information can only be searched efficiently if it has been indexed. To prepare an index you must first define how the text can be retrieved. This involves several steps: identifying fields, which allows a search to be restricted to certain elements, e.g. author or title; identifying words, i.e. separating words from each other, usually using spaces; removing stop words such as and, to, the; stemming, i.e. removing suffixes; and synonyms, i.e. specifying a list of synonyms for the key terms. After the text is prepared in this way an index can be built. The most popular structure for indexes is the inverted file structure, in which a Keyword file is related to a Postings file. For each keyword, the Postings file lists the document IDs of the documents in which the keyword appears, along with how many times it appears in each document.
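To make the steps above more concrete, here is a minimal sketch in Python of preparing text and building an inverted file. It is only an illustration under some assumptions of my own: the suffix list and the helper names (stem, prepare, build_index) are made up for this example, the stemmer is far cruder than a real one, and the field and synonym steps are left out.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"and", "to", "the"}  # the stop words named above; real lists are longer
SUFFIXES = ("ing", "ed", "s")      # hypothetical, deliberately crude suffix list


def stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def prepare(text):
    """Split on whitespace, lowercase, drop stop words, then stem."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Map each keyword to its postings: {document ID: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(prepare(text)).items():
            index[term][doc_id] = count
    return dict(index)


# Two made-up documents, keyed by document ID.
docs = {
    1: "Removing the stains",
    2: "Stains and stained carpets",
}
index = build_index(docs)
print(index["stain"])  # {1: 1, 2: 2}
```

Here the dictionary keys play the part of the Keyword file and the inner dictionaries play the part of the Postings file.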

How is information actually retrieved?
Retrieval models: Exact Match/Best Match.

Exact match:
Dominant model is BOOLEAN logic – AND, OR and NOT operators. Pretty self-explanatory. In the exercise we did, we looked at two different search engines – Google and Bing – and saw what effect using Boolean operators had on the relevance of our search results. Google didn’t pay any attention to the Boolean operators, and I found instead that using + - signs did narrow results so that they became more relevant to my subjective information needs. Bing did work with Boolean operators.
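As a rough sketch of how exact match works underneath, the Boolean operators map neatly onto set operations over the lists of documents for each keyword. The postings and the docs_for helper below are invented for illustration; they are not how Google or Bing actually do it.

```python
# Hypothetical postings: keyword -> set of IDs of documents containing it.
postings = {
    "chocolate": {1, 2, 4},
    "stain": {2, 3, 4},
    "carpet": {4, 5},
}


def docs_for(term):
    """Return the documents containing a keyword (empty set if unknown)."""
    return postings.get(term, set())


# chocolate AND stain: documents containing both keywords.
both = docs_for("chocolate") & docs_for("stain")

# chocolate OR carpet: documents containing at least one keyword.
either = docs_for("chocolate") | docs_for("carpet")

# stain NOT carpet: documents containing the first keyword but not the second.
without = docs_for("stain") - docs_for("carpet")

print(both, either, without)
```

A document either satisfies the query or it doesn't, so there is no ranking here: every result is an equally exact match.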

Best match:
Natural language query, i.e. “how can I get a chocolate stain out?” The results for this query are presented in a ranked list, in decreasing order of relevance; the probabilistic model is one way of producing such a ranking. I never usually search using this method, instead doing keyword searches, but I actually found this quite effective.
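To show what a ranked list looks like in practice, here is a small sketch. Note that it is not the probabilistic model: I have swapped in a much simpler score (counting how many words in the document match the query) just to illustrate ranking in decreasing order. The documents and the function names are made up.

```python
def score(query, document):
    """Count how many words in the document also appear in the query."""
    query_terms = set(query.lower().split())
    return sum(1 for word in document.lower().split() if word in query_terms)


def rank(query, docs):
    """Return IDs of matching documents, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    # Sort by score descending, then by ID so that ties come out in a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in scored if s > 0]


# Made-up documents for illustration.
docs = {
    "a": "how to get a chocolate stain out of a shirt",
    "b": "chocolate cake recipe",
    "c": "train timetable",
}
ranking = rank("get chocolate stain out", docs)
print(ranking)  # ['a', 'b']
```

Unlike exact match, document "b" still appears even though it only matches one word of the query; it just comes lower down the list.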

Alternatively you can also browse by clicking through links. This is what I often do.

Query modification: if your search isn’t successful then you can modify your query either manually or automatically.

Evaluation:
Quantitative: two main measures are used for quantitative evaluation of how relevant the search results are: precision and recall.
Precision = the proportion of retrieved documents that are relevant: relevant documents retrieved / total documents retrieved.
Recall = the proportion of relevant documents that are retrieved: relevant documents retrieved / total number of relevant documents in the database.
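The two measures can be worked through with a small sketch. The document IDs below are invented, and the function names are just my own choice for illustration.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / total documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # nothing retrieved, so treat precision as zero
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Relevant documents retrieved / total relevant documents in the database."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # nothing relevant exists, so treat recall as zero
    return len(retrieved & relevant) / len(relevant)


# Made-up example: the search returned 4 documents, 2 of which are relevant,
# and the database holds 5 relevant documents in total.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 6, 8, 10}

print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # 0.4
```

In a real web search you rarely know the total number of relevant documents, which is why recall is much harder to measure than precision.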

In the exercise I measured the precision of the results returned by each search engine by taking the top 5 results and seeing how many of these were relevant to my information need. I then gave each search engine a mark for how relevant its results were.
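Measuring precision over just the top 5 results can be sketched as below. The relevance judgements and the engine names are placeholders I made up, not the marks I actually gave; dividing by k even when fewer than k results come back is an assumption of this sketch.

```python
def precision_at_k(judgements, k=5):
    """Fraction of the top k results that were judged relevant.

    judgements is a list of True/False values in ranked order,
    one per result, True meaning "relevant to my information need".
    """
    return sum(1 for judged in judgements[:k] if judged) / k


# Placeholder judgements for two hypothetical engines.
engine_a = [True, True, False, True, False, True]
engine_b = [True, False, False, False, True]

print(precision_at_k(engine_a))  # 0.6
print(precision_at_k(engine_b))  # 0.4
```

Only the first five judgements count, so the sixth result for engine_a is ignored.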
