First rule of document pre-processing: improper pre-processing schemes may lead to loss of lexical content.
Hence, pre-processing steps are specific to the problem. That said, a few pre-processing steps apply to most applications.
They are: a. Tokenization b. Normalization c. Substitution.
Other well-known pre-processing steps: a. Case folding b. Stemming c. Lemmatization d. Removing misspellings e. Removing punctuation.
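As a rough illustration of the three common steps, here is a minimal Python sketch. The function name preprocess, the <URL> and <NUM> placeholders, and the regular expressions are assumptions made for this example; the right choices depend on the problem at hand.

```python
import re

def preprocess(text):
    """Hypothetical pipeline: normalization, then substitution, then tokenization."""
    # Normalization: lower-case the text and collapse runs of whitespace.
    text = " ".join(text.lower().split())
    # Substitution: replace URLs and standalone numbers with placeholder tokens.
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"\b\d+\b", "<NUM>", text)
    # Tokenization: keep placeholders and word characters, drop punctuation.
    return re.findall(r"<[A-Z]+>|\w+", text)

print(preprocess("Visit https://Example.com NOW, 3 times!"))
# ['visit', '<URL>', 'now', '<NUM>', 'times']
```

Note that this sketch already throws away punctuation and case, which is exactly the kind of lexical loss the first rule warns about, so each step should be checked against the task.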
What is a stemming operation?
- The process of reducing an inflected word to its root. An inflected word is a noun, verb, or adjective with one or more letters added to mark a different grammatical form.
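To make the idea concrete, here is a toy suffix-stripping stemmer in Python. It is only a sketch: the suffix list and the minimum stem length of 3 are assumptions chosen for illustration, and real stemmers (the Porter stemmer, for example) use far more elaborate rules.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("jumping", "jumped", "cats", "studies"):
    print(w, "->", toy_stem(w))
# jumping -> jump
# jumped -> jump
# cats -> cat
# studies -> stud
```

The last result, "stud", shows that a stem is not guaranteed to be the word a reader would expect as the root.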
What is lemmatization?
- Lemmatization also reduces an inflected word to its root. The difference is that the result of stemming need not be a proper word in the vocabulary, whereas the result of lemmatization has to be part of the vocabulary of the given language.
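A dictionary lookup is the simplest way to picture lemmatization. The sketch below assumes a tiny hand-written table (LEMMA_TABLE is hypothetical); real lemmatizers rely on a full vocabulary and usually on part-of-speech information as well.

```python
# Hypothetical lookup table mapping inflected forms to dictionary words (lemmas).
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "mice": "mouse",
    "was": "be",
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lower-cased word if it is unknown."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

print(toy_lemmatize("studies"))  # study
print(toy_lemmatize("Mice"))     # mouse
```

Unlike a suffix stripper, the lookup returns "study", a real vocabulary word, and it can handle irregular forms such as "mice".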
What is case folding and its usage?
- Case folding is part of the Unicode standard. It allows any two strings that differ from one another only by case to map to the same "case-folded" form, even when those strings include characters with complex case mappings. [In simple terms: convert all letters to a single case, either upper case or lower case, whichever is chosen.] It helps in normalization and in making text searches relevant, since a query then matches text regardless of case.
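In Python, the built-in str.casefold() method applies Unicode case folding. The helper name same_ignoring_case below is an assumption for this sketch; the German "ß" is used because it is a common example of a complex case mapping.

```python
def same_ignoring_case(a, b):
    """Compare two strings by their Unicode case-folded forms."""
    return a.casefold() == b.casefold()

print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
print(same_ignoring_case("Straße", "STRASSE"))  # True
```

Plain lower-casing leaves "ß" unchanged, so "Straße" and "STRASSE" would not match; case folding maps both to "strasse".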