Typical Document Processing Operations

First rule of document pre-processing: an improper pre-processing scheme may lead to loss of lexical content.
Hence, pre-processing steps are specific to the problem. That said, a few pre-processing steps apply to most applications.
They are: a. tokenization b. normalization c. substitution.
Other well-known pre-processing steps: a. case folding b. stemming c. lemmatization d. removing misspellings e. punctuation handling.
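
The three common steps can be sketched as a small pipeline. This is only a minimal illustration using the Python standard library: the function names, the token pattern, and the `SUBSTITUTIONS` table are assumptions made up for the example, and a real application would choose its own rules.

```python
import re

# Hypothetical substitution table: maps informal tokens to standard words.
SUBSTITUTIONS = {"u": "you", "r": "are", "gr8": "great"}


def tokenize(text):
    # Split text into word-like tokens, keeping an internal apostrophe
    # so that "don't" stays a single token.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def normalize(tokens):
    # Normalization here is just lower-casing; other schemes are possible.
    return [t.lower() for t in tokens]


def substitute(tokens):
    # Replace tokens found in the table, leave all others unchanged.
    return [SUBSTITUTIONS.get(t, t) for t in tokens]


def preprocess(text):
    return substitute(normalize(tokenize(text)))


print(preprocess("U r GR8, don't stop!"))
# ['you', 'are', 'great', "don't", 'stop']
```

Note that the punctuation is silently dropped by the token pattern. This is exactly the kind of choice that can lose lexical content if it does not fit the problem.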

What is a stemming operation?
- The process of reducing an inflected word to its root. An inflected word is a noun, verb, or adjective with one or more letters added to mark a particular grammatical form.
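
To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy sketch, not an established algorithm such as Porter's: the suffix list, the minimum stem length, and the name `naive_stem` are all assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "cats", "studies", "is"]:
    print(w, "->", naive_stem(w))
# jumping -> jump
# jumped -> jump
# cats -> cat
# studies -> studi
# is -> is
```

The result for "studies" is "studi", which is not a dictionary word. Stemmers in general make no promise that the root they return is a valid word.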

What is lemmatization?
- Lemmatization also reduces an inflected word to its root. The difference is that the result of stemming need not be a proper word in the vocabulary, whereas the result of lemmatization has to be part of the given language's vocabulary.
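
The difference can be shown with a tiny dictionary-based lemmatizer. This is only a sketch: the `LEMMAS` table is a hypothetical, hand-written stand-in for the full vocabulary and morphological rules that a real lemmatizer relies on.

```python
# Hypothetical lookup table from inflected forms to dictionary forms (lemmas).
LEMMAS = {
    "studies": "study",
    "studying": "study",
    "mice": "mouse",
    "ran": "run",
}


def lemmatize(word):
    word = word.lower()
    # Fall back to the word itself when no entry is known.
    return LEMMAS.get(word, word)


print(lemmatize("studies"))   # study
print(lemmatize("mice"))      # mouse

# For contrast, blindly stripping the "es" suffix (a crude form of stemming)
# produces a string that is not a word in the vocabulary.
print("studies"[:-2])         # studi
```

Because the lookup returns only forms that exist in the vocabulary, irregular forms such as "mice" can be handled, which plain suffix stripping cannot do.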

What is case folding and its usage?
- Case folding is part of the Unicode standard. It maps any two strings that differ from one another only by case to the same "case-folded" form, even when those strings include characters with complex case mappings. In the simplest setting, it means converting all letters to a single case, either upper case or lower case, whichever is chosen. It helps in normalization and in making text searches more relevant.
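
In Python, the built-in `str.casefold()` method applies Unicode case folding, while `str.lower()` only lower-cases. The short sketch below shows where the two differ; the helper name `same_ignoring_case` is an assumption made up for this example.

```python
def same_ignoring_case(a, b):
    # Compare two strings by their case-folded forms.
    return a.casefold() == b.casefold()


# Simple ASCII text: lower() and casefold() agree.
print("Hello World".lower())      # hello world
print("Hello World".casefold())   # hello world

# The German sharp s has a complex case mapping:
# lower() leaves it alone, casefold() maps it to "ss".
print("Straße".lower())           # straße
print("Straße".casefold())        # strasse

print(same_ignoring_case("Straße", "STRASSE"))   # True
```

For search and matching, comparing case-folded strings is usually the safer choice, since plain lower-casing would treat "Straße" and "STRASSE" as different.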
