Information retrieval is different to data retrieval!
Information retrieval is based upon finding information that is not organised
in a rigid structure; this is in contrast to data retrieval, in which data is
stored in clearly defined tables in a database. In addition, information
retrieval relies upon relevance: because information is unstructured, and is
sought in a more general and subjective way than the deterministic methods used
to query a relational database, the search system ranks its results by relevance
when it displays them. One way of doing this is to estimate the probability that
each document is relevant (the probabilistic model). In the words of Baeza-Yates
and Ribeiro-Neto, “the key goal of an IR
system is to retrieve information which might be useful or relevant to the
user.”
Unstructured information can only be searched efficiently if
the data is indexed. To prepare an index you must first define how the text
can be retrieved. This involves several steps: identifying fields, which allows
search to be restricted to certain elements, e.g. author or title; identifying
words, i.e. separating words from each other, usually using spaces; removing stop
words such as and, to, the; stemming, i.e. removing suffixes so that variants of
a word match; and synonyms, i.e. specifying a
list of synonyms for the key terms. After the text is prepared in this way an
index can be built. The most popular structure for indexes is the inverted
file structure, in which a Keyword file is related to a Postings file. For each
keyword, the Postings file lists the IDs of the documents in which the keyword
appears, along with how many times it appears within each document.
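To make the indexing steps above more concrete, here is a minimal Python sketch. The stop word list, the suffix list and the function names are all my own assumptions for illustration; a real system would use a proper stemmer (such as Porter's) and would also handle fields and synonyms, which are left out here.

```python
import re
from collections import defaultdict

# Illustrative stop word and suffix lists; both are assumptions, not a standard.
STOP_WORDS = {"and", "to", "the", "a", "of", "in"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Split text into lowercase words on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Very crude stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(w) for w in tokenize(text) if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each keyword to its postings: {doc_id: times the keyword appears}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in prepare(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    1: "Removing stains from the carpet",
    2: "The carpet and the stain",
}
index = build_inverted_index(docs)
print(index["stain"])  # {1: 1, 2: 1}
```

Note that "stains" and "stain" end up under the same keyword because of stemming, and "the" never reaches the index because it is a stop word.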
How is information actually retrieved?
Retrieval models: Exact Match/Best Match.
Exact match:
The dominant model is Boolean logic, using the AND, OR and NOT operators.
These are pretty self-explanatory: AND requires both terms, OR accepts either,
and NOT excludes a term. In the exercise
we did we looked at two different search engines, Google and Bing, and saw
what effect using Boolean operators had on the relevance of our search results.
Google didn’t pay any attention to the Boolean operators, and I found instead
that using the + and - signs did narrow results so that they became more relevant to my
subjective information needs. Bing did work with Boolean operators.
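As a rough illustration of how exact match works against an index, the Boolean operators map neatly onto set operations. The postings below are made up for the example, and this is only a sketch, not how Google or Bing actually process queries.

```python
# Hypothetical postings: each keyword maps to the set of document IDs containing it.
postings = {
    "chocolate": {1, 2, 4},
    "stain": {2, 3, 4},
    "carpet": {3, 4},
}


def docs_for(term):
    """Return the set of documents containing term (empty set if the term is unknown)."""
    return postings.get(term, set())


# chocolate AND stain: documents containing both terms
and_result = docs_for("chocolate") & docs_for("stain")

# chocolate OR carpet: documents containing either term
or_result = docs_for("chocolate") | docs_for("carpet")

# stain NOT carpet: documents with "stain" but without "carpet"
not_result = docs_for("stain") - docs_for("carpet")

print(sorted(and_result))  # [2, 4]
print(sorted(or_result))   # [1, 2, 3, 4]
print(sorted(not_result))  # [2]
```

A document either matches or it doesn't, so there is no ranking here: that is the main difference from best match.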
Best match:
Natural language query: e.g. how can I get a chocolate stain
out? The results for this query will be presented in a ranked list,
in decreasing order of estimated relevance, which is what the probabilistic
model produces. I
don’t usually search using this method, instead doing keyword searches, but I
actually found this quite effective.
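To show what a ranked list looks like in practice, here is a small Python sketch. Please note the assumption: it uses a simple TF-IDF style score as a stand-in, not a true probabilistic model, and the documents and function names are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing score."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # Terms found in fewer documents get a larger weight.
                idf = math.log(1 + n_docs / df[term])
                score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "how to get a chocolate stain out of a shirt",
    2: "chocolate cake recipe",
    3: "how to get to the station",
}
ranking = rank("how can i get a chocolate stain out", docs)
print([doc_id for doc_id, _ in ranking])  # [1, 3, 2]
```

Unlike exact match, every document that shares at least one word with the query gets a score, and the list is simply sorted so that the best match comes first.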
Alternatively you can also browse by clicking through links.
This is what I often do.
Query modification: if your search isn’t successful then you
can modify your query either manually or automatically.
Evaluation:
Quantitative: two main measures are used for quantitative
evaluation of how relevant the search results are: precision and recall.
Precision = the proportion of retrieved documents that are
relevant: relevant documents retrieved / total documents retrieved.
Recall = the proportion of relevant documents that are retrieved:
relevant documents retrieved / total number of
relevant documents in the database.
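The two formulas are easy to check with a quick Python sketch. The document IDs below are made up, and the function name is just my own choice.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 5 documents retrieved, 4 relevant documents in the database, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4, 5], relevant=[2, 4, 6, 8])
print(p, r)  # 0.4 0.5
```

Recall needs you to know every relevant document in the collection, which is why it is much harder to measure for a web search engine than precision.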
In the exercise I measured the precision of the results
returned by each search engine by taking the top 5 results and seeing how many
of these were relevant to my information need. I then gave each search engine
a mark for how relevant its results were.
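Measuring precision over the top 5 results is often called precision at 5. A minimal sketch of that calculation is below; the relevance judgements are invented purely for illustration and are not my actual results from the exercise.

```python
def precision_at_k(judgements, k=5):
    """Precision over the top k results.

    judgements is a ranked list of booleans, True meaning the result was
    judged relevant. Missing results (fewer than k returned) count as
    not relevant, because the divisor is always k.
    """
    return sum(judgements[:k]) / k


# Hypothetical judgements for two unnamed search engines.
engine_a = [True, True, False, True, False]
engine_b = [True, False, False, True, False]

print(precision_at_k(engine_a))  # 0.6
print(precision_at_k(engine_b))  # 0.4
```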