A helper class for text mining. It collects words, optionally stems them, filters out stop words, and assigns a unique integer index to each remaining word.
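The filtering and indexing steps described above can be sketched roughly as follows. This is a minimal illustration, not the GVocabulary implementation: `TinyVocabulary` and `kNotFound` are hypothetical names, stemming is left out entirely, and the only behavior taken from this page is that short words and stop words are skipped, that the default minimum word size is 4, and that each accepted word gets its own index.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sentinel for "word not in vocabulary" (assumption: the largest size_t).
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical stand-in that mimics the documented behavior; stemming is omitted.
class TinyVocabulary {
public:
    void addStopWord(const std::string& word) { stopWords_.insert(word); }

    void setMinWordSize(std::size_t n) { minWordSize_ = n; }

    // Adds a word unless it is too short or is in the stop-word list.
    void addWord(const std::string& word) {
        if (word.size() < minWordSize_ || stopWords_.count(word) > 0)
            return;
        // emplace leaves an existing entry untouched, so a repeated word keeps its index.
        index_.emplace(word, index_.size());
    }

    // Returns the index of the word, or kNotFound if it was never accepted.
    std::size_t wordIndex(const std::string& word) const {
        auto it = index_.find(word);
        if (it == index_.end())
            return kNotFound;
        return it->second;
    }

    // Number of unique words accepted so far.
    std::size_t wordCount() const { return index_.size(); }

private:
    std::size_t minWordSize_ = 4;  // default documented on this page
    std::unordered_set<std::string> stopWords_;
    std::unordered_map<std::string, std::size_t> index_;
};
```

Indices are handed out in order of first appearance here, which is a design choice of the sketch; the page does not say how GVocabulary orders its indices.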
|
| GVocabulary (bool stemWords) |
|
| ~GVocabulary () |
|
void | addStopWord (const char *szWord) |
| Adds a stop word (a common word that should always be ignored).
|
|
void | addTypicalStopWords () |
| Adds a typical set of stop words.
|
|
void | addWord (const char *szWord, size_t nLen) |
| Adds a word to the vocabulary. (If the word is too short or is in the stop-word list, it will not be added.)
|
|
void | addWordsFromTextBlock (const char *text, size_t len) |
| Adds all the words in the text block to the vocabulary.
|
|
size_t | docCount () |
| Returns the number of documents from which words have been added so far.
|
|
GHeap * | heap () |
| Returns a pointer to the heap this uses to store strings.
|
|
void | newDoc () |
| To track statistics about the number of documents that contain each word, and the maximum number of times each word occurs in any document, call this method each time you start adding words from a new document (including the first one). If you do not need these statistics, you never need to call this method. Calling it for the first time after words have already been added throws an exception.
|
|
void | setMinWordSize (size_t n) |
| Sets the minimum word size. Smaller words will be ignored. The default is 4.
|
|
GWordStats & | stats (size_t word) |
| Returns the stats about a word. Throws if you weren't tracking stats (i.e., if you didn't call newDoc before each new document).
|
|
double | weight (size_t word) |
| Computes the weight that should be added to a document vector for each occurrence of a word in the vector-space document model. The weight is log(number_of_docs / docs_containing_word) / max_word_frequency.
|
|
size_t | wordCount () |
| Returns the number of unique words in this vocabulary.
|
|
size_t | wordIndex (const char *szWord, size_t len) |
| Returns the index of the specified word. Returns -1 (which, converted to the unsigned return type size_t, is the largest representable value) if the word is not in the vocabulary (or is too short or is a stop word).
|
|
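The weight formula given for weight() can be illustrated with a small free function. This is only a sketch of the stated formula: `wordWeight` and its parameter names are hypothetical, and the natural logarithm is an assumption, since the page writes "log" without naming a base.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of GVocabulary.
// Computes log(numDocs / docsContainingWord) / maxWordFrequency,
// assuming the natural logarithm and non-zero denominators.
double wordWeight(std::size_t numDocs,
                  std::size_t docsContainingWord,
                  std::size_t maxWordFrequency) {
    double ratio = static_cast<double>(numDocs) /
                   static_cast<double>(docsContainingWord);
    return std::log(ratio) / static_cast<double>(maxWordFrequency);
}
```

Under these assumptions, a word that appears in every document gets a weight of zero, because the ratio inside the logarithm is 1. Rarer words get larger weights, and the division by the maximum per-document frequency scales the weight down for words that can repeat many times within one document.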
void GClasses::GVocabulary::newDoc () |
|
To track statistics about the number of documents that contain each word, and the maximum number of times each word occurs in any document, call this method each time you start adding words from a new document (including the first one). If you do not need these statistics, you never need to call this method. Calling it for the first time after words have already been added throws an exception.
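The calling convention for newDoc can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the GVocabulary source: `StatsTrackerSketch`, `WordStatsSketch`, and their fields are hypothetical names, words are keyed by string rather than by index, and `std::logic_error` is an arbitrary choice, since the page does not say which exception type is thrown.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical per-word statistics (field names are assumptions).
struct WordStatsSketch {
    std::size_t docsContaining = 0;     // documents that contain the word
    std::size_t maxPerDoc = 0;          // most occurrences in any single document
    std::size_t countInCurrentDoc = 0;  // occurrences in the document being read
    std::size_t lastDoc = 0;            // number of the last document that used the word
};

// Hypothetical stand-in that mimics the documented newDoc contract.
class StatsTrackerSketch {
public:
    // Call before adding words from each document, including the first.
    void newDoc() {
        if (docCount_ == 0 && wordsAdded_ > 0)
            throw std::logic_error("newDoc was not called before the first word");
        ++docCount_;
    }

    void addWord(const std::string& word) {
        ++wordsAdded_;
        if (docCount_ == 0)
            return;  // newDoc was never called, so no stats are tracked
        WordStatsSketch& s = stats_[word];
        if (s.lastDoc != docCount_) {  // first occurrence in this document
            s.lastDoc = docCount_;
            s.countInCurrentDoc = 0;
            ++s.docsContaining;
        }
        ++s.countInCurrentDoc;
        s.maxPerDoc = std::max(s.maxPerDoc, s.countInCurrentDoc);
    }

    std::size_t docCount() const { return docCount_; }

    // Throws if stats were not tracked (or, via at(), if the word is unknown).
    const WordStatsSketch& stats(const std::string& word) const {
        if (docCount_ == 0)
            throw std::logic_error("stats were not tracked");
        return stats_.at(word);
    }

private:
    std::size_t docCount_ = 0;
    std::size_t wordsAdded_ = 0;
    std::unordered_map<std::string, WordStatsSketch> stats_;
};
```

In this sketch, the two tracked quantities are the ones a weight of the form log(number_of_docs / docs_containing_word) / max_word_frequency would need, which is presumably why the real class asks callers to mark document boundaries.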