
Typical Document Processing Operations

First rule of document pre-processing: an improper pre-processing scheme may lead to loss of lexical content.
Hence, pre-processing steps are specific to the problem. That said, a few pre-processing steps apply to most applications.
They are: a. tokenization b. normalization c. substitution.
Other well-known pre-processing steps: a. case folding b. stemming c. lemmatization d. removing misspellings e. punctuation handling.
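
The three common steps above can be sketched as a small pipeline. This is only a minimal illustration using the Python standard library: the function names, the token pattern, and the `SUBSTITUTIONS` table are assumptions made for the example, and a real scheme should be chosen for the problem at hand.

```python
import re
import unicodedata
from typing import List

# Hypothetical substitution table; the right entries depend on the problem.
SUBSTITUTIONS = {"can't": "cannot", "won't": "will not"}


def normalize(text: str) -> str:
    """Normalization: unify Unicode forms and collapse runs of whitespace."""
    # NFKC replaces compatibility characters (e.g. the "fi" ligature U+FB01)
    # with their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Tokenization: split into word tokens, keeping an internal apostrophe."""
    # Assumed pattern: ASCII letters/digits, optionally followed by 'letters.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def substitute(tokens: List[str]) -> List[str]:
    """Substitution: replace known tokens with their expanded form."""
    out: List[str] = []
    for tok in tokens:
        # A replacement may expand to several tokens ("won't" -> "will", "not").
        out.extend(SUBSTITUTIONS.get(tok.lower(), tok).split())
    return out


def preprocess(text: str) -> List[str]:
    return substitute(tokenize(normalize(text)))
```

For example, `preprocess("I  can't  \ufb01nd the\t\ufb01le")` yields `['I', 'cannot', 'find', 'the', 'file']`. Note that the punctuation is dropped by the token pattern, which is exactly the kind of choice that can lose lexical content if it does not suit the application.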

What is a stemming operation?
- The process of reducing an inflected word to its root. An inflected word is a noun, verb, or adjective with one or more letters added to mark a grammatical form; for example, "walked" and "walking" are inflected forms of "walk".

What is lemmatization?
- Lemmatization also reduces an inflected word to its root. The difference is that the result of stemming need not be a valid word, whereas the result of lemmatization (the lemma) must be part of the vocabulary of the given language.
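
The difference between the two can be seen in a toy sketch. Neither function below is a real stemmer or lemmatizer: the suffix list and the `LEMMAS` dictionary are assumptions invented for this example, whereas practical systems use rule sets such as Porter's or a full dictionary of the language.

```python
# Assumed suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

# Hypothetical lemma dictionary; a real one covers the language's vocabulary.
LEMMAS = {"studies": "study", "running": "run", "better": "good"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned as-is."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

Here `toy_stem("studies")` returns `"stud"` and `toy_stem("running")` returns `"runn"`, neither of which is an English word, while `toy_lemmatize` returns `"study"` and `"run"` for the same inputs.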

What is case folding and its usage?
- Case folding is part of the Unicode standard. It maps any two strings that differ from one another only by case to the same "case-folded" form, even when those strings include characters with complex case mappings. In the simplest view, all letters are converted to a single case, either upper or lower, whichever is chosen. It helps with normalization and makes text searches more relevant.
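
As a small illustration, Python's built-in `str.casefold()` applies Unicode case folding, which differs from plain lowercasing for characters with complex case mappings such as the German "ß". The helper name `same_ignoring_case` is an assumption made for this sketch.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings by their case-folded forms."""
    return a.casefold() == b.casefold()


# "\u00df" is the German sharp s ("ß").
street = "Stra\u00dfe"

# Plain lowercasing leaves the sharp s untouched, so the comparison fails.
lower_match = street.lower() == "STRASSE".lower()

# Case folding maps the sharp s to "ss", so both strings fold to "strasse".
folded_match = same_ignoring_case(street, "STRASSE")
```

With these inputs `lower_match` is `False` and `folded_match` is `True`, which is why case folding, rather than lowercasing, is the safer choice for case-insensitive search.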
