Typical Document Processing Operations

First rule of document pre-processing: an improper pre-processing scheme may lead to loss of lexical content.
Hence, pre-processing steps are specific to the problem. That said, a few pre-processing steps apply to most applications.
They are: a. tokenization b. normalization c. substitution.
Other well-known pre-processing steps: a. case folding b. stemming c. lemmatization d. removing misspellings e. handling punctuation.
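
To make the three common steps concrete, here is a minimal sketch using only the Python standard library. The function names, the placeholder tokens `<URL>` and `<NUM>`, and the order of the steps are assumptions chosen for illustration; a real pipeline would pick rules that suit its own problem.

```python
import re
import unicodedata

def normalize(text):
    # Unicode NFKC normalization, then collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def substitute(text):
    # Replace URLs and numbers with placeholder tokens (an illustrative choice).
    text = re.sub(r"https?://\S+", "<URL>", text)
    return re.sub(r"\d+(?:\.\d+)?", "<NUM>", text)

def tokenize(text):
    # A token is a placeholder, a word (optionally with an internal
    # apostrophe), or a single punctuation character.
    return re.findall(r"<[A-Z]+>|\w+(?:'\w+)?|[^\w\s]", text)

def preprocess(text):
    return tokenize(substitute(normalize(text)))

print(preprocess("Visit  https://example.com  in 2024!"))
# ['Visit', '<URL>', 'in', '<NUM>', '!']
```

Note that substitution runs before tokenization here so that a URL survives as one token. This is exactly the kind of choice that depends on the problem: if the digits or the URL carry meaning for the task, replacing them loses lexical content.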

What is a stemming operation?
- The process of reducing an inflected word to its root. An inflected word is a noun, verb, or adjective with one or more letters added to mark a grammatical form. For example, "jumped" and "jumping" both reduce to "jump".
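
As a rough illustration, a stemmer can be sketched as plain suffix stripping. The suffix list and the minimum stem length below are assumptions made up for this example; this is not the Porter algorithm or any library's implementation, which use far more careful rules.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least 3 letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumped", "cats", "running", "studies", "is"]:
    print(w, "->", naive_stem(w))
# jumped -> jump
# cats -> cat
# running -> runn
# studies -> studi
# is -> is
```

The outputs "runn" and "studi" are not dictionary words, which is typical of stemming: the goal is only that related forms collapse to the same string.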

What is lemmatization?
- Lemmatization also reduces an inflected word to its root. The difference is that the result of stemming need not be a proper word in the vocabulary, whereas the result of lemmatization has to be part of the given language's vocabulary.
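
A toy sketch of the idea, assuming a tiny hand-written lookup table: every output is a real word because it comes from a vocabulary. The table and the function name are hypothetical; real lemmatizers rely on a full dictionary and usually on part-of-speech information as well.

```python
# Hypothetical mini-vocabulary mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "studies": "study",
    "running": "run",
    "mice": "mouse",
    "was": "be",
}

def toy_lemmatize(word):
    word = word.lower()
    # Fall back to the word itself when it is not in the table.
    return LEMMA_TABLE.get(word, word)

for w in ["Studies", "running", "mice", "table"]:
    print(w, "->", toy_lemmatize(w))
# Studies -> study
# running -> run
# mice -> mouse
# table -> table
```

Compare "studies": crude suffix stripping would give something like "studi", while the lemma is the vocabulary word "study". Irregular forms such as "mice" cannot be handled by suffix rules at all.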

What is case folding and what is it used for?
- Case folding is part of the Unicode standard. It maps any two strings that differ only by case to the same "case-folded" form, even when those strings include characters with complex case mappings. In the simple case this amounts to converting all letters to a single case, either upper case or lower case, whichever is chosen. It helps with normalization and makes text searches more relevant.
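
In Python, the built-in `str.casefold()` applies Unicode case folding, and it differs from `str.lower()` for characters with complex case mappings such as the German "ß". The helper name `same_ignoring_case` below is just an illustrative choice.

```python
def same_ignoring_case(a, b):
    # Compare the case-folded forms rather than the lowercased ones.
    return a.casefold() == b.casefold()

print("Straße".lower())     # straße
print("Straße".casefold())  # strasse

print("Straße".lower() == "STRASSE".lower())    # False
print(same_ignoring_case("Straße", "STRASSE"))  # True
```

For plain ASCII text, lowercasing and case folding give the same result, so the distinction only matters once the text contains such characters.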
