GClasses
GClasses::GVocabulary Class Reference

Detailed Description

This is a helper class which is useful for text-mining. It collects words, stems them, filters them through a list of stop-words, and assigns a discrete number to each word.
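The pipeline described above (filter out short words and stop words, then assign each remaining word a discrete number) can be sketched with the standard library alone. This is a minimal stand-in, not the GVocabulary implementation: `ToyVocabulary` and its members are hypothetical names, stemming is omitted, and it assumes that "too short" means fewer characters than the minimum word size.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for illustration only; this is NOT GVocabulary.
class ToyVocabulary {
public:
    // Assumption: words with fewer than minWordSize characters are ignored.
    explicit ToyVocabulary(std::size_t minWordSize = 4)
        : m_minWordSize(minWordSize) {}

    // Registers a word that should always be ignored.
    void addStopWord(const std::string& w) { m_stopWords.insert(w); }

    // Adds a word unless it is too short or is a stop word.
    // (Stemming is omitted in this sketch.)
    void addWord(const std::string& w) {
        if (w.size() < m_minWordSize || m_stopWords.count(w) > 0)
            return;
        // emplace leaves an existing entry untouched, so a repeated word
        // keeps the index it received the first time it was added.
        m_index.emplace(w, m_index.size());
    }

    // Number of unique words stored so far.
    std::size_t wordCount() const { return m_index.size(); }

    // Index of the word, or (size_t)-1 when the word is not stored.
    std::size_t wordIndex(const std::string& w) const {
        auto it = m_index.find(w);
        return it == m_index.end() ? static_cast<std::size_t>(-1) : it->second;
    }

private:
    std::size_t m_minWordSize;
    std::unordered_set<std::string> m_stopWords;
    std::unordered_map<std::string, std::size_t> m_index;
};
```

In this sketch, indices are handed out in insertion order, so the first accepted word gets index 0. Whether GVocabulary numbers words the same way is not stated here.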

#include <GText.h>

Public Member Functions

 GVocabulary (bool stemWords)
 
 ~GVocabulary ()
 
void addStopWord (const char *szWord)
 Adds a stop word (a common word that should always be ignored).
 
void addTypicalStopWords ()
 Adds a typical set of stop words.
 
void addWord (const char *szWord, size_t nLen)
 Adds a word to the vocabulary. (If the word is too short or is in the stop-word list, it will not be added.)
 
void addWordsFromTextBlock (const char *text, size_t len)
 Adds all the words in the text block to the vocabulary.
 
size_t docCount ()
 Returns the number of documents from which words have been added so far.
 
GHeap * heap ()
 Returns a pointer to the heap this uses to store strings.
 
void newDoc ()
 To track statistics about the number of documents that contain each word, and the maximum number of times each word occurs in any document, call this method each time you start adding words from a new document (including the first one). If you do not want these statistics, you never need to call this method. If you call it for the first time after words have already been added, it throws an exception.
 
void setMinWordSize (size_t n)
 Sets the minimum word size. Smaller words will be ignored. The default is 4.
 
GWordStats & stats (size_t word)
 Returns the stats about a word. Throws if you weren't tracking stats (i.e., if you didn't call newDoc before each new document).
 
double weight (size_t word)
 Computes the weight that should be added to a document vector for each occurrence of a word in the vector-space document model. It is log(number_of_docs/docs_containing_word)/max_word_frequency.
 
size_t wordCount ()
 Returns the number of unique words in this vocabulary.
 
size_t wordIndex (const char *szWord, size_t len)
 Returns the index of the specified word. Returns -1 (that is, (size_t)-1, the largest size_t value, because the return type is unsigned) if the word is not in the vocabulary (or is too short or is a stop word).
 

Protected Attributes

size_t m_docNumber
 
size_t m_minWordSize
 
GHeap * m_pHeap
 
GStemmer * m_pStemmer
 
GConstStringHashTable * m_pStopWords
 
GConstStringToIndexHashTable * m_pVocabulary
 
std::vector< GWordStats > * m_pWordStats
 
size_t m_vocabSize
 
char wordBuf [64]
 

Constructor & Destructor Documentation

GClasses::GVocabulary::GVocabulary ( bool  stemWords)
GClasses::GVocabulary::~GVocabulary ( )

Member Function Documentation

void GClasses::GVocabulary::addStopWord ( const char *  szWord)

Adds a stop word (a common word that should always be ignored).

void GClasses::GVocabulary::addTypicalStopWords ( )

Adds a typical set of stop words.

void GClasses::GVocabulary::addWord ( const char *  szWord,
size_t  nLen 
)

Adds a word to the vocabulary. (If the word is too short or is in the stop-word list, it will not be added.)

void GClasses::GVocabulary::addWordsFromTextBlock ( const char *  text,
size_t  len 
)

Adds all the words in the text block to the vocabulary.
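How the text block is split into words is not documented here. As a rough illustration only, the sketch below assumes that a word is a maximal run of alphabetic characters and that words are lowercased; `toySplitWords` is a hypothetical helper, not part of GClasses, and the real tokenization rules may differ.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a text block into lowercase words.
// Assumption: a word is a maximal run of alphabetic characters.
std::vector<std::string> toySplitWords(const char* text, std::size_t len) {
    std::vector<std::string> words;
    std::string cur;
    for (std::size_t i = 0; i < len; ++i) {
        // Cast to unsigned char before calling the <cctype> functions,
        // which have undefined behavior for negative char values.
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isalpha(c)) {
            cur.push_back(static_cast<char>(std::tolower(c)));
        } else if (!cur.empty()) {
            // A non-letter ends the current word.
            words.push_back(cur);
            cur.clear();
        }
    }
    // Flush the last word if the block ends with a letter.
    if (!cur.empty())
        words.push_back(cur);
    return words;
}
```

Each resulting word could then be passed to a method such as addWord, which applies the length and stop-word filters.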

size_t GClasses::GVocabulary::docCount ( )
inline

Returns the number of documents from which words have been added so far.

GHeap* GClasses::GVocabulary::heap ( )
inline

Returns a pointer to the heap this uses to store strings.

void GClasses::GVocabulary::newDoc ( )

To track statistics about the number of documents that contain each word, and the maximum number of times each word occurs in any document, call this method each time you start adding words from a new document (including the first one). If you do not want these statistics, you never need to call this method. If you call it for the first time after words have already been added, it throws an exception.
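The bookkeeping this method enables can be sketched as follows. This is a simplified stand-in under stated assumptions, not the library's code: `ToyDocStats`, `ToyWordStats`, and their field names are hypothetical, and words are keyed by string rather than by vocabulary index.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical per-word statistics; field names are assumptions,
// not the members of GWordStats.
struct ToyWordStats {
    std::size_t docsContaining = 0;  // documents in which the word appears
    std::size_t maxInOneDoc = 0;     // highest count within any single document
    std::size_t countInCurDoc = 0;   // count within the most recent document seen
    std::size_t lastDoc = 0;         // 1-based number of that document
};

class ToyDocStats {
public:
    // Call before adding the words of each document, including the first.
    void newDoc() {
        if (m_docNumber == 0 && m_sawUntrackedWord)
            throw std::logic_error("newDoc was not called before the first word");
        ++m_docNumber;
    }

    void addWord(const std::string& w) {
        if (m_docNumber == 0) {
            // No document was started, so statistics are not tracked.
            m_sawUntrackedWord = true;
            return;
        }
        ToyWordStats& s = m_stats[w];
        if (s.lastDoc != m_docNumber) {
            // First occurrence of this word in the current document.
            s.lastDoc = m_docNumber;
            s.countInCurDoc = 0;
            ++s.docsContaining;
        }
        ++s.countInCurDoc;
        s.maxInOneDoc = std::max(s.maxInOneDoc, s.countInCurDoc);
    }

    std::size_t docCount() const { return m_docNumber; }

    // Throws std::out_of_range if no statistics exist for the word.
    const ToyWordStats& stats(const std::string& w) const { return m_stats.at(w); }

private:
    std::size_t m_docNumber = 0;
    bool m_sawUntrackedWord = false;
    std::unordered_map<std::string, ToyWordStats> m_stats;
};
```

The two tracked quantities, documents containing the word and the maximum count in one document, are the ones the description above names.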

void GClasses::GVocabulary::setMinWordSize ( size_t  n)
inline

Sets the minimum word size. Smaller words will be ignored. The default is 4.

GWordStats& GClasses::GVocabulary::stats ( size_t  word)

Returns the stats about a word. Throws if you weren't tracking stats (i.e., if you didn't call newDoc before each new document).

double GClasses::GVocabulary::weight ( size_t  word)

Computes the weight that should be added to a document vector for each occurrence of a word in the vector-space document model. It is log(number_of_docs/docs_containing_word)/max_word_frequency.
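As a worked illustration of this formula, the standalone function below evaluates it from three counts. `toyWeight` and its parameter names are hypothetical, and the natural logarithm is an assumption, since the base of the log is not stated here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper evaluating
//   log(number_of_docs / docs_containing_word) / max_word_frequency
// Assumption: "log" is the natural logarithm.
double toyWeight(std::size_t numDocs,
                 std::size_t docsContainingWord,
                 std::size_t maxWordFrequency) {
    // Both divisors must be nonzero for the formula to be defined.
    assert(docsContainingWord > 0 && maxWordFrequency > 0);
    double ratio = static_cast<double>(numDocs)
                 / static_cast<double>(docsContainingWord);
    return std::log(ratio) / static_cast<double>(maxWordFrequency);
}
```

With these definitions, a word that appears in every document has weight 0, because log(1) is 0, and rarer words receive larger weights.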

size_t GClasses::GVocabulary::wordCount ( )
inline

Returns the number of unique words in this vocabulary.

size_t GClasses::GVocabulary::wordIndex ( const char *  szWord,
size_t  len 
)

Returns the index of the specified word. Returns -1 (that is, (size_t)-1, the largest size_t value, because the return type is unsigned) if the word is not in the vocabulary (or is too short or is a stop word).
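Because the return type is size_t, a "not found" result cannot be detected with a test such as `index < 0`, which is always false for an unsigned value. A minimal sketch of a safe check follows; `kToyInvalidIndex` and `toyIsValidIndex` are hypothetical names introduced for illustration, not GClasses identifiers.

```cpp
#include <cstddef>
#include <limits>

// The value a size_t-returning function yields when it "returns -1".
constexpr std::size_t kToyInvalidIndex = static_cast<std::size_t>(-1);

// Converting -1 to an unsigned type wraps to that type's maximum value.
static_assert(kToyInvalidIndex == std::numeric_limits<std::size_t>::max(),
              "(size_t)-1 is the largest size_t value");

// Hypothetical helper: true when a lookup result denotes a real word.
inline bool toyIsValidIndex(std::size_t index) {
    return index != kToyInvalidIndex;
}
```

A caller would compare the result of a lookup against the sentinel before using it as an index.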

Member Data Documentation

size_t GClasses::GVocabulary::m_docNumber
protected
size_t GClasses::GVocabulary::m_minWordSize
protected
GHeap* GClasses::GVocabulary::m_pHeap
protected
GStemmer* GClasses::GVocabulary::m_pStemmer
protected
GConstStringHashTable* GClasses::GVocabulary::m_pStopWords
protected
GConstStringToIndexHashTable* GClasses::GVocabulary::m_pVocabulary
protected
std::vector<GWordStats>* GClasses::GVocabulary::m_pWordStats
protected
size_t GClasses::GVocabulary::m_vocabSize
protected
char GClasses::GVocabulary::wordBuf[64]
protected