Sunday, 30 October 2011

Coursework 1:

Information retrieval: a view from the legal library.

Information retrieval: what is it? According to Baeza-Yates and Ribeiro-Neto (1999) ‘the key goal of an IR system is to retrieve information which might be useful or relevant to the user.’ It is possible to extract three core properties of information retrieval from this statement: systems, relevance and users. Information retrieval using digital technologies revolves around the interplay between these three elements, and IR models have usually been viewed with an emphasis either on relevance in relation to users (user-centric) or on relevance in relation to systems (system-centred) (Chowdhury, 2004). The two views are quite different: from the point of view of a user what is relevant might be defined by a broad range of external factors, while from a system’s point of view what is relevant might be determined by performing calculations. However different these two approaches are, they are both important for successful information retrieval. Therefore, in this essay I am going to look at some of the system models for information retrieval, and evaluate them from the point of view of a user in the legal profession. After that I shall look at a few of the issues surrounding managing digital information using information technology.

Evaluating:

The Boolean model works by allowing users to submit queries to a database using the AND, OR and NOT operators to refine a search. Chowdhury (2004) describes one limitation of the Boolean method of searching: it selects documents simply by matching a query against index terms. The system therefore does not provide a relevance ranking of the documents retrieved, and a user may have to order the results in some other way, e.g. chronologically or alphabetically. For users conducting legal research, however, this may not be a limitation, as this kind of search can provide high recall, which is useful for a legal professional wanting to gain a 360° view of a topic (Rosenfeld and Morville, 2002). In addition, legal practitioners need recent commentary, and as such precision becomes less important than having lots of results that might be relevant and which can be displayed in date order.
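
To make the Boolean model more concrete, here is a minimal sketch in Python. The three tiny "documents", and the names docs_containing and boolean_search, are invented purely for illustration; a real legal database will work on a far larger scale and in a more sophisticated way.

```python
# Toy collection: document ID -> text (made-up examples, not real cases).
DOCS = {
    1: "negligence duty of care",
    2: "contract breach damages",
    3: "negligence damages",
}

def docs_containing(term):
    """Return the set of document IDs whose text contains the term."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}

def boolean_search(must=(), either=(), must_not=()):
    """AND over `must`, OR over `either`, NOT over `must_not`."""
    results = set(DOCS)
    for term in must:                      # AND: keep only documents with every term
        results &= docs_containing(term)
    if either:                             # OR: keep documents with at least one term
        any_match = set()
        for term in either:
            any_match |= docs_containing(term)
        results &= any_match
    for term in must_not:                  # NOT: drop documents with the term
        results -= docs_containing(term)
    # A document either matches or it does not, so there is no ranking;
    # the IDs are simply sorted so the output is predictable.
    return sorted(results)

print(boolean_search(must=["negligence", "damages"]))
print(boolean_search(must=["damages"], must_not=["contract"]))
```

Note that the function can only say whether a document matches or not, which is exactly why the results have to be ordered by something else, such as date.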

Another IR model is Best Match. Best match searching involves weighting the importance of each term in a query or document, coupled with a means of using these term weightings to calculate the similarity between a document and a query (Chowdhury, 2004). The documents with the highest similarity to the query will be ranked as most relevant. Natural language querying can be used in this type of IR system. Westlaw UK and Lexis Library (online legal databases) both use this kind of natural language system in addition to the Boolean method, which allows greater flexibility in searching (Baeza-Yates and Ribeiro-Neto, 1999).
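
As a rough illustration of term weighting and similarity, the sketch below uses one common scheme (term frequency multiplied by inverse document frequency, compared using cosine similarity). This is an assumption chosen for the example, not a description of how Westlaw UK or Lexis Library actually rank results, and the documents and function names are made up.

```python
import math
from collections import Counter

# Made-up documents for illustration only.
DOCS = {
    "A": "negligence duty of care negligence",
    "B": "contract breach damages",
    "C": "damages for negligence",
}

def idf(term):
    """Rarer terms get a higher weight (inverse document frequency)."""
    df = sum(1 for text in DOCS.values() if term in text.split())
    return math.log(len(DOCS) / df) if df else 0.0

def weights(text):
    """Weight each term by how often it occurs, times how rare it is."""
    counts = Counter(text.split())
    return {term: tf * idf(term) for term, tf in counts.items()}

def cosine(a, b):
    """Similarity between two weighted term vectors (0 = nothing shared)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query):
    """Return document IDs ordered from most to least similar to the query."""
    q = weights(query)
    scores = {doc_id: cosine(q, weights(text)) for doc_id, text in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("negligence duty"))
```

Unlike the Boolean model, every document gets a score, so a document that only partly matches the query still appears, just lower down the list.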

Hypertext browsing is a slightly different approach to information retrieval as it does not rely on search software; rather it relies upon good hypertext design which allows easy navigation in an online database (Rosenfeld and Morville, 2002). The success of retrieving information through browsing can depend in some circumstances on the prior knowledge of the user, both in terms of the structure of the database and in terms of details of the information sought (dates and citations of case law, for example). If the user does not have this knowledge, then a mixture of searching and browsing is an excellent way to retrieve information: searching initially to locate the case in the database, then browsing to find out its relevance.

Browsing has a significance which is unique to legal research because of the way that the legal system works. If we stay with the example of case law for now, we can see that hypertext provides a relevance ranking quite by accident. This is because in the English legal system, the importance of a case will be reflected by how many other cases use it as a precedent. In online legal databases like Lexis Library and Westlaw UK, cases citing a given case are represented within that case as hypertext links. Therefore the more of these links there are in a case law document, the more important that case is, and also the more relevant that case is to that topic of law.[1] So you can see that browsing as a form of IR in online legal databases can provide a relevance ranking of sorts. Although not directly linked technologically, this provides a neat parallel with the way that a big search engine like Google detects relevance (speaking simplistically) by assessing the number of links pointing to a website, and presenting websites with the most links to them as the most highly relevant to a given search (Langville and Meyer, 2006).
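
The idea of counting citing cases can be sketched very simply. The case names and citation data below are hypothetical, and the function only counts incoming citations; Google's actual method, as Langville and Meyer describe, goes further by also taking into account the importance of the pages doing the linking.

```python
from collections import Counter

# Hypothetical citation data: each case -> the earlier cases it cites.
CITES = {
    "Case D": ["Case A", "Case B"],
    "Case C": ["Case A"],
    "Case B": ["Case A"],
    "Case A": [],
}

def rank_by_citations(cites):
    """Order cases by how many other cases cite them (most cited first)."""
    counts = Counter({case: 0 for case in cites})  # start every case at zero
    for cited_cases in cites.values():
        counts.update(cited_cases)                 # add one per incoming citation
    # Sort by count (descending), then by name so that ties are predictable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for case, times_cited in rank_by_citations(CITES):
    print(case, times_cited)
```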

Managing:

In order to effectively use information technologies to facilitate management of digital information it is important that user-centric and system-centred approaches work harmoniously together. This means that the information needs of a legal researcher must inform the organisation of the digital information behind the scenes.
The user interface is a good example of this, especially when providing different search fields, as not only is it vital to take into account the different fields that will be necessary for the website’s audience, but having different search fields could also impact upon the way inverted files are managed. According to Rowley and Hartley (2008) “Inverted files are often created for author names, title words, subject-indexing terms, and author-title acronyms.”  As inverted files contain the addresses of documents with relevant keywords, having separate files for different search fields is a good way to use technology to manage digital documents. In big online full-text databases like Westlaw UK and Lexis Library this could also extend to dividing indexes into categories of legal information, such as cases, legislation or journals, and then into subdivisions such as party names, subject, date.
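
A minimal sketch of keeping a separate inverted file for each search field might look like the following. The records, field names and function names are all invented for the example; it is only meant to show why a search restricted to one field never has to look at the index for another.

```python
from collections import defaultdict

# Invented records standing in for documents in a legal database.
RECORDS = [
    {"id": 1, "party": "Smith Jones", "subject": "negligence"},
    {"id": 2, "party": "Jones Brown", "subject": "contract"},
    {"id": 3, "party": "Smith Green", "subject": "contract negligence"},
]

def build_field_indexes(records, fields):
    """Build one inverted file per field: word -> set of record IDs."""
    indexes = {field: defaultdict(set) for field in fields}
    for record in records:
        for field in fields:
            for word in record[field].lower().split():
                indexes[field][word].add(record["id"])
    return indexes

INDEXES = build_field_indexes(RECORDS, ["party", "subject"])

def search(field, word):
    """Look a word up in the inverted file for one field only."""
    return sorted(INDEXES[field].get(word.lower(), set()))

print(search("party", "Smith"))
print(search("subject", "contract"))
```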







Bibliography.


Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval, Essex, Pearson Education Ltd.

Chowdhury, G.G. (2004) Introduction to Modern Information Retrieval, London, Facet Publishing.

Cooke, A. (1999) A Guide to Finding Quality Information on the Internet, London, Library Association Publishing.

Langville, A. and Meyer, C. (2006) Google’s PageRank and Beyond, Oxfordshire, Princeton University Press.

Rosenfeld, L. and Morville, P. (2002) Information Architecture for the World Wide Web, CA, O’Reilly & Associates, Inc.

Rowley, J. and Hartley, R. (2008) Organizing Knowledge: an Introduction to Managing Access to Information, Hampshire, Ashgate Publishing.



[1] JustCite (another online legal database) has represented this by creating a visual precedent map.

Sunday, 23 October 2011

Information Retrieval


Information retrieval is different to data retrieval! Information retrieval is based upon finding information that is not organised in a rigid structure; this is in contrast to data retrieval, in which data is stored in clearly defined tables in a database. In addition, information retrieval relies upon relevance: because information is unstructured, and is sought in a more general and subjective way than the deterministic methods used to query a relational database, the search system provides a relevance ranking when it displays results. One way it can do this is by calculating the probability of relevance (the probabilistic model). In the words of Baeza-Yates and Ribeiro-Neto, “the key goal of an IR system is to retrieve information which might be useful or relevant to the user.”

Access to unstructured information can only be achieved if the data is indexed. To prepare an index you must first define how the text can be retrieved. This involves several steps: identifying fields, which allows a search to be restricted to certain elements, e.g. author or title; identifying words, i.e. separating words from each other, usually using spaces; removing stop words such as and, to, the; stemming, i.e. removing suffixes; and specifying a list of synonyms for the key terms. After the text is prepared in this way an index can be built. The most popular structure for indexes is the inverted file, in which a Keyword file is related to a Postings file. For each keyword, the Postings file lists the IDs of the documents in which the keyword appears, along with how many times it appears within each document.
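
To help myself picture these steps, here is a small sketch in Python. It covers splitting into words, removing stop words, stemming and building the inverted file, but leaves out fields and synonyms. The stop word list, the very crude stem function and the three sample documents are my own assumptions for illustration; a real system would use a proper stemming algorithm.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"and", "to", "the", "a", "of"}

def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def prepare(text):
    """Split on spaces, drop stop words, then stem what is left."""
    words = text.lower().split()
    return [stem(word) for word in words if word not in STOP_WORDS]

def build_index(docs):
    """Inverted file: keyword -> {document ID: times the keyword appears}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(prepare(text)).items():
            index[term][doc_id] = count
    return index

DOCS = {
    1: "the court and the appeal",
    2: "appeals to the courts",
    3: "appealing the appeal",
}
INDEX = build_index(DOCS)

print(INDEX["appeal"])   # postings for the keyword "appeal"
print(INDEX["court"])    # postings for the keyword "court"
```

Here the keys of INDEX play the part of the Keyword file, and the dictionary stored against each key plays the part of its entry in the Postings file.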

How is information actually retrieved?
Retrieval models: Exact Match/Best Match.

Exact match:
Dominant model is BOOLEAN logic – AND, OR and NOT operators. Pretty self explanatory. In the exercise we did we looked at two different search engines – Google and Bing – and saw what effect using Boolean operators had on the relevance of our search results. Google didn’t pay any attention to the Boolean operators, and I found instead that using + and - signs did narrow results so that they became more relevant to my subjective information needs. Bing did work with Boolean operators.

Best match:
Natural language query: e.g. how can I get a chocolate stain out? The results for this query will be presented in a ranked list, in decreasing order of relevance, which is how the probabilistic model presents its results. I never usually search using this method, instead doing keyword searches, but I actually found this quite effective.

Alternatively you can also browse by clicking through links. This is what I often do.

Query modification: if your search isn’t successful then you can modify your query either manually or automatically.

Evaluation:
Quantitative: two main measures used for quantitative evaluation of how relevant the search results are: precision and recall.
Precision = the proportion of retrieved documents that are relevant: relevant documents retrieved/total documents retrieved.
Recall = relevant documents retrieved/ total number of relevant documents in the database.
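
The two measures can be written out as a short sketch. The document numbers in the example are made up, and the function names are just my own labels for the two formulas above.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / total documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Relevant documents retrieved / total relevant documents in the database."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Made-up example: a search returns five documents, two of which are
# among the four relevant documents that exist in the database.
retrieved = [1, 2, 3, 4, 5]
relevant = [2, 4, 6, 8]
print(precision(retrieved, relevant))  # 2 of 5 retrieved are relevant -> 0.4
print(recall(retrieved, relevant))     # 2 of 4 relevant were retrieved -> 0.5
```

Recall is much harder to measure on a real search engine, because you would need to know every relevant document that exists, not just the ones that came back.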

In the exercise I measured the precision of the results returned by each search engine by taking the top 5 results and seeing how many of these were relevant to my information need. I then gave each search engine a mark for how relevant its results were.

Friday, 21 October 2011

DITA Session 3: Databases, YAY.


In week 3 our DITA session was focused on 'Structuring and querying information stored in databases'.


At first I found it hard to grasp the relevance of this topic, which may sound strange as of course databases and information retrieval are central to the role of an information professional and have revolutionized the organization, management and storage of data. But when we got onto the exercises for the session I became a bit confused by the nature of the SQL querying language that we used to interrogate the database. I mean, I understood the concept of what we were doing, but I couldn't see how to apply this in the real world, as all the databases I use at work are much simpler to operate - less picky about the terms you enter! So the next day when I got to work (in a corporate law library) I asked my (only) colleague if she had had any experience with SQL and database querying, and she gave me a very interesting insight into the development of the legal information databases that we use in our library, and how methods for querying them have changed over time. She said that when she started in librarianship in the mid 90s, it was necessary to interrogate databases using all the operators that I had experienced the day before in the practical exercise. So after that I had a bit more of an idea of how SQL fits into the information world.

Background:

Evolution from the File Approach to the Database Approach:

With the advent of computers into organizations in the 50s and 60s, data began to be stored in files, with different departments controlling different files of data. Often this led to the duplication of data and meant that data representation was not standardized across departments. Another disadvantage of this approach was 'program-data dependency', whereby the physical structure and storage of the data files and records were defined in the application program, meaning that a change to the storage structure of a file could not be made without also changing the application program.

The limitations of the File Based approach led to the development of databases and database management systems (DBMS). A database is defined by Connolly and Begg (2010) as: 'A shared collection of logically related data and its description, designed to meet the information needs of an organization.'  This definition highlights the communal nature of databases that makes them so successful. Instead of isolated files full of data, databases hold centralized records of data that can be managed uniformly and efficiently.

Relational databases:

This software represents the second generation of DBMSs (the first being network and hierarchical; a good description can be found here: http://www.theukwebdesigncompany.com/articles/types-of-databases.php). Relational databases allow information from different tables of data to be searched simultaneously in order to respond to user queries with relevant data. Relational databases are made up of tables (relations) which each have their own names, and are made up of named columns (attributes) of data, and rows (tuples) which contain one value per column. Tables in a relational database can be linked together using 'keys'. A primary key in a table is used to uniquely identify a row in that table. A foreign key can be inserted into a table, and will hold the primary key of a row in a different table. This is useful because the details of that row are stored only once: if a row in table A is updated and table B refers to it by a foreign key, it will not be necessary to update all the matching records in table B.
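
To see primary and foreign keys in action, here is a small sketch using Python's built-in sqlite3 module. The publisher and book tables, and everything stored in them, are invented for the example and are not the database we used in the session.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publisher (
    publisher_id INTEGER PRIMARY KEY,   -- uniquely identifies each publisher
    name         TEXT NOT NULL
);
CREATE TABLE book (
    book_id      INTEGER PRIMARY KEY,   -- uniquely identifies each book
    title        TEXT NOT NULL,
    publisher_id INTEGER REFERENCES publisher(publisher_id)  -- foreign key
);
INSERT INTO publisher VALUES (1, 'Old Name Press');
INSERT INTO book VALUES (10, 'First Book', 1);
INSERT INTO book VALUES (11, 'Second Book', 1);
""")

def titles_with_publisher():
    """Join the two tables through the foreign key."""
    rows = conn.execute("""
        SELECT book.title, publisher.name
        FROM book
        JOIN publisher ON book.publisher_id = publisher.publisher_id
        ORDER BY book.book_id
    """)
    return rows.fetchall()

# The publisher's name is stored once, so one update is enough;
# neither row in the book table has to be touched.
conn.execute("UPDATE publisher SET name = 'New Name Press' WHERE publisher_id = 1")
print(titles_with_publisher())
```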









Sunday, 9 October 2011

Session 2: The Internet and the World Wide Web!

 Introduction
This week we are focusing on the internet, the world wide web and creating web pages.
To aid me in my studies I borrowed a copy of 'Weaving the Web' by the creator of the World Wide Web - Tim Berners-Lee - from the University Library. I was very excited to read about how the WWW came into existence, and the author's account is made even more fascinating to an Information Scientist because the web was originally conceived as a way of organizing the sharing of information succinctly. In our lectures for the Library and Information Science Foundation, Lyn has reminded us that there have been several thinkers throughout history who have envisaged sharing information via linking documents together, but Tim Berners-Lee happened to be in the right environment at the right stage in technological developments to allow his vision to succeed. I am also fascinated by the way that the internet allows the world wide web to mimic the human brain in the random links that can be made by using it. I can't help wishing that there was a diagram displaying all the links that have been made across the internet between web pages, a bit like this: http://infosthetics.com/archives/2009/02/nytimes_yearly_visual_overview.html
Has it been done already I wonder?


As well as Berners-Lee's book, I have been using 'Internet & World Wide Web: How To Program' 3rd Edition, Deitel, and 'Cascading Style Sheets' Holzschlag, to help me with the more technical elements of this week's exercise, and I have found the latter very helpful. Plus there is a very loving foreword by Eric A. Meyer about Molly Holzschlag which is quite sweet, and makes me want to meet her!


The exercise:


So, first of all let me introduce you to my web page: http://www.student.city.ac.uk/~abkr563/cssexperiment.html
It all went swimmingly until the CSS (Cascading Style Sheet) needed to be inserted, and to be honest, the image won't appear right now, so I need to fix that later.


I learned from my lecture notes that a CSS could be inserted directly into the html document (embedded), or you could provide a link to a pre-made CSS within the html doc (linked). Ultimately I did the latter after not being able to make the first option work for me, but this is something I need to explore further, as I feel unsatisfied at not being able to do the first option!


So to make my CSS I opened the wonderful application EditPlus 2 (it's great) and selected the CSS file extension. In order to write my CSS I followed instructions in 'Cascading Style Sheets' Holzschlag (as noted above). I made a Structured style sheet, meaning that the styling corresponded directly to the structure of my html document. There were a few hiccups due to me forgetting to insert various bits of punctuation, but now things is cool.


But rather than recount the process I went through, I feel it would be more beneficial to highlight the key points I have learned about CSS, even though there are so many of them!


  • When following instructions for Task 5 in the exercise I applied one of the CSSs Andy supplied to my browser. This resulted in all web pages I opened looking one particular way, and illustrated the flexibility of CSS. I was reminded of what Richard said in the lecture about the choice of emphasis belonging to the web browser. In that instance he was referring to semantic tags, but to me, seeing all the web pages I opened under the influence of Andy's CSS really illustrated the fact that the browser was interpreting the instructions.
  • It also made clear the nature of CSS as a STYLE language as opposed to a structural language. It gives instructions about the presentational aspects of a document, such as the colour, margin indents etc. A structural language - such as html in this instance - gives instructions as to the structure of a document: the head, the body, paragraphs, line breaks etc. If you use the metaphor of a tailor, the structural language is the pattern, and the style language is the material used.
  • After I had created my CSS I compared mine with Andy's and noticed a clear difference: his rules seemed to be grouped succinctly, whereas mine seemed to be extended. So I had a look in Holzschlag's book to see if I could understand a bit better what was going on here. On page 70 I found some details about GROUPING. Grouping is a shorthand way of writing rules. For example, instead of writing out:
    body {
        margin-top: 100px;
        margin-bottom: 20px;
        margin-right: 20px;
        margin-left: 100px;
    }
    you can simply group the declarations into one: body { margin: 100px 20px 20px 100px; }
    I think the advantages of this are clear, but you have to make sure that the order of the values is correct: for margin the values are read as top, right, bottom, left.
  • I think the last point I would like to note is the structure of rules themselves:


    • Selector - identifies the part of the document to be styled
    • Declaration - indicates the style of the selected part, and is made up of a property and a value. So for example a selector could be a paragraph, and a declaration could be made up of color (the property) and the actual colour you want (the value).
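
To check my understanding of how a rule breaks down, here is a small Python sketch that splits a rule into its selector and its declarations, and expands the grouped margin from the example above. The function names parse_rule and expand_margin are my own, and this is nothing like a full CSS parser: it assumes a single simple rule and a margin with exactly four values.

```python
def parse_rule(rule):
    """Split one simple CSS rule into (selector, {property: value})."""
    selector, _, body = rule.partition("{")
    body = body.rstrip().rstrip("}")          # drop the closing brace
    declarations = {}
    for declaration in body.split(";"):
        if ":" in declaration:
            prop, _, value = declaration.partition(":")
            declarations[prop.strip()] = value.strip()
    return selector.strip(), declarations

def expand_margin(value):
    """Expand a four-value margin, which is read as top, right, bottom, left."""
    top, right, bottom, left = value.split()
    return {
        "margin-top": top,
        "margin-right": right,
        "margin-bottom": bottom,
        "margin-left": left,
    }

selector, declarations = parse_rule("body { margin: 100px 20px 20px 100px }")
print(selector)                                # the selector
print(expand_margin(declarations["margin"]))   # the four separate margins
```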

SO that's it.........for now..............







Monday, 26 September 2011

DITA Session one; the weather is beautiful:

 Introduction:

The first thing I would like to say is that I am an IT novice. Until now I have used my intuition to navigate my way through the world of computers, and luckily I have been successful at this, even earning the title of go-to IT problems girl at work.


Therefore, this first session and the things I have gleaned from my reading have fascinated me. As someone who previously had no idea about the internal workings of computers or how to count in binary numbers, or why those strange characters appear sometimes when I just want to look at my Word document, this session was truly illuminating and I can't wait to learn more.


To me, the highlights from my week's learning are the following points (put briefly):


-The relationships between bits>bytes>files>documents. Each element contains more information than the last, like a crescendo.

-Open vs. proprietary formats. I have heard people say that they hate Microsoft before, and now I understand their reasoning a little bit better. I have had some experience with OpenOffice and now I am curious about how they have managed to get around the legalities.
-Binary data can be formatted in lots of different ways, and this is signified by file extensions. I would like to explore the mechanics of this in more detail.
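
As a small experiment with the first and last of these points, the Python sketch below shows a character as bits, and shows how the same bytes can turn into "strange characters" when they are read with the wrong encoding. The word used and the function name to_bits are just my own choices for the example.

```python
def to_bits(character):
    """Show the 8 bits (one byte) behind a simple character."""
    return format(ord(character), "08b")

print(to_bits("A"))             # the letter A stored as a pattern of bits

# The same text saved as bytes using the UTF-8 encoding.
data = "café".encode("utf-8")
print(len(data))                # five bytes, because the é needs two

# Reading those bytes back with the right and the wrong encoding.
print(data.decode("utf-8"))     # what was intended
print(data.decode("latin-1"))   # the same bytes, misread
```

The bytes never change between the last two lines; only the rule used to interpret them does, which is the same pairing of format and application that file extensions are meant to signal.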

Overview of the exercise:
In this week’s practical session we focused on exploring formatting and accessing data in different software applications, for example Notepad and Word. It is almost like a game of pairs. The type of document you save a file as must be compatible with the application you use to open the file. If it is not, you will get a nasty surprise in the form of a language that is possibly from outer-space, very scary - see the screen grab below which shows a .docx file opened in Notepad:

So, for example, when you open a .docx file (which is compatible with Word) in an application that does not display formatting – such as Notepad, which is compatible with .txt files – you are presented with all the code behind the formatting instructions. I learned that Notepad acts as an x-ray machine to files stored in other formats, so that when one of these files is opened in Notepad the formatting instructions – margin sizes, fonts etc – are displayed. I predict that this will be quite handy when it comes to doing some Digital Architecture.
The Exercise:
The exercise involved creating a short piece of text (about the weather, very British), saving it as a particular type of file (.txt, .docx, .html), and then opening it in applications that were not compatible with that type of file.  

First it was a .txt file compatible with Notepad. When opened in Word everything displayed fine. But when the text was saved in Word as a .docx file and opened in Notepad, the formatting codes appeared along with the text, and there was a lot of it! 

We then created a web page (.html file) and opened that in Notepad, and it was possible to view all the markup text in detail, which I found very interesting, not even being sarcastic! This particular part of the exercise illustrated the importance of open, non-proprietary formats, as although the tags are detailed and you need training in order to understand them, the characters are still recognizable.

Lastly we linked a picture to a document instead of simply pasting it in. A copied and pasted image is static, whereas a linked image will change in the document as you edit the original file. According to my reading of the lecture notes, this now rendered my piece a 'document' instead of a 'file', as now there were several files from different addresses in my computer's memory collected within one document, and this combining of files equals a document.
                                          My drawing of the weather:
