MoL-2001-08: Text Categorization and Prototypes

MoL-2001-08: Bergo, Alexander (2001) Text Categorization and Prototypes. [Report]

Abstract

There is a basic dichotomy in text categorization between having a good categorizer (kNN) and having low processing costs (Rocchio). We will try to reconcile the two forces by developing centroid-based or, as we prefer to call them, prototype-based algorithms that look for similarities and dissimilarities. The basic idea is to measure the distance between two objects, not their closeness. In general, our algorithms will consist of two phases: (1) generating prototypes, and (2) comparing the documents to be classified to the prototypes. In some settings, however, the two steps will be combined into a single one. This is done by first measuring the distance between the document to be categorized and each of the previously categorized documents. The dissimilarity scores for each category are then averaged and ranked, and the category with the lowest average dissimilarity score is assigned to the document we want to categorize.

One of the core issues, then, is to come up with a good notion of prototype. We will try to implement notions of prototype that are inspired by fields like psychology, cognitive linguistics and philosophy, mainly by measuring dissimilarities between objects. We find the pairwise distance between two objects and then, for each object sub-space, we find the mean distance from the object we want to place in n-dimensional term space and thereby assign a category. The aim is to choose the correct categories for the test documents by adopting the least dissimilar category. Our approach to distance measuring is based on variations of the so-called Minkowski and Canberra metrics. Among other things, we will use these notions of dissimilarity to implement Rocchio prototypes.

Our main aim is to apply the new approach in such a way that it can possibly outperform the better approaches in the field today. We will try to use methods that perform well both in categorizing documents correctly and in saving computation time.

The rest of the thesis is organized as follows.
In Chapter 2 we recall further technical facts about the kNN and Rocchio classifiers; these will serve as our starting points. Then, in Chapter 3 we present the basic ideas about the dissimilarity measures that we will use, based on the Canberra and Minkowski metrics. The next step is to evaluate our newly developed methods; in Chapter 4 we provide the basis for a reliable test of the systems, by taking a brief look at the Reuters collection, at ways of representing documents, and at an example of how the relevant computations are done. In Chapter 5 we present our experimental results, first for kNN, then for the Rocchio classifier, and then for a variety of dissimilarity systems. The final chapter of the thesis is devoted to a discussion, a conclusion and some thoughts on how to develop the systems from here onwards.
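
As an illustration of the combined, single-step variant described in the abstract (measure the dissimilarity to every previously categorized document, average per category, pick the lowest average), a minimal Python sketch might look as follows. It is not taken from the thesis: the abstract speaks of "variations of" the Minkowski and Canberra metrics without specifying them, so the sketch assumes the standard textbook definitions, and the function names, the toy term-weight vectors and the category labels are all hypothetical.

```python
from collections import defaultdict


def minkowski(x, y, p=2.0):
    """Standard Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def canberra(x, y):
    """Standard Canberra distance; terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def least_dissimilar_category(doc, labelled_docs, distance=canberra):
    """Average the dissimilarity of `doc` to the documents of each category
    and return the category with the lowest mean, plus all mean scores."""
    per_category = defaultdict(list)
    for vector, category in labelled_docs:
        per_category[category].append(distance(doc, vector))
    means = {c: sum(d) / len(d) for c, d in per_category.items()}
    return min(means, key=means.get), means


# Hypothetical toy data: term-weight vectors over three terms.
training = [
    ([3.0, 0.0, 1.0], "grain"),
    ([2.0, 0.0, 0.0], "grain"),
    ([0.0, 4.0, 1.0], "trade"),
    ([0.0, 2.0, 2.0], "trade"),
]
category, scores = least_dissimilar_category([2.0, 0.0, 1.0], training)
print(category, scores)
```

Passing `distance=minkowski` (or a `lambda` that fixes another order `p`) swaps the metric without changing the ranking logic. Because every training document is compared to the test document, this variant has a kNN-like cost; the two-phase variant would instead compare against one precomputed prototype per category.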

Item Type: Report
Report Nr: MoL-2001-08
Series Name: Master of Logic Thesis (MoL) Series
Year: 2001
Uncontrolled Keywords: not yet available
Date Deposited: 12 Oct 2016 14:38
Last Modified: 12 Oct 2016 14:38
URI: https://eprints.illc.uva.nl/id/eprint/721
