足ることを知らず

Data Science, global business, management and MBA

Day 53 in MIT Sloan Fellows Class 2023, Advanced Data Analytics and Machine Learning in Finance 3, Basics of NLP

NLP = playing with unstructured string data

String data is super messy.

Before running ML or advanced analytics, we need to cleanse data.

Encoding is another tricky issue in text data analysis. Remember emoji?

We need to understand ASCII and Unicode briefly.

 

ASCII

  • ASCII has been the basic standard for English text since 1963. It covers the characters you are used to: a-z, A-Z, 0-9, punctuation such as . , / ; and more.
  • It can be represented everywhere and causes the fewest issues. It is often used in models to avoid a dimensionality explosion.
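
For example (a minimal sketch, assuming Python 3), every ASCII character maps to a code point between 0 and 127:

# ASCII characters map to code points 0-127
for ch in ["A", "z", "0", "/"]:
    print(ch, ord(ch))   # 'A' -> 65, 'z' -> 122, '0' -> 48, '/' -> 47
print(chr(65))           # 65 -> 'A'
print("café".isascii())  # False: 'é' is outside the ASCII range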

Unicode

  • Unicode is an expanded character set that handles international languages.
  • There are multiple encodings, but the most common is UTF-8 (the default in Python 3).
  • UTF-8 is used by more than 95% of websites. Each character takes one to four bytes, and characters in the ASCII range stay a single byte identical to ASCII.
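
Here is a minimal sketch of how UTF-8 behaves in Python 3 (the example string is made up; byte counts depend on the characters used):

# Minimal sketch of UTF-8 encoding in Python 3
text = "100€ 😀"
data = text.encode("utf-8")          # str -> bytes
print(len(text), len(data))          # 6 characters, 11 bytes
print("A".encode("utf-8"))           # b'A': ASCII characters stay a single byte
print("€".encode("utf-8"))           # b'\xe2\x82\xac': three bytes
print(data.decode("utf-8") == text)  # True: decoding restores the original string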

Regular Expressions (Regex)

Regexes are sequences of characters expressing a search pattern.

For example, suppose you want to find a word under the following conditions:

  • find the string 'iphone'
  • only if preceded by 'apple released the new' or 'apple announced the new'
  • exclude 'iphone 6'
  • containing spaces but no newlines
  • 'apple' may be misspelled as 'aple'

Then you can write a pattern as follows.

r'app?le (?:released|announced) the new (iphone) [^6]'
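
A minimal sketch of using this pattern with Python's re module (the test sentences below are made-up examples):

import re

pattern = r'app?le (?:released|announced) the new (iphone) [^6]'
sentences = [
    "apple released the new iphone 7 today",      # should match
    "aple announced the new iphone x yesterday",  # should match (misspelled 'aple')
    "apple released the new iphone 6 today",      # excluded: iphone 6
]
for s in sentences:
    print(bool(re.search(pattern, s)), "-", s)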

 

For additional practice and learning, I recommend the following websites.

buildregex.com

www.rexegg.com

regex101.com

NLP Basics

Let me introduce an interesting quote about NLP.

Machine learning is mostly counting.  - Eugene Yurtsev, 2020

And to count text properly, we definitely need sophisticated "cleansing" processes.
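
For instance, a very small cleansing step might look like this (a minimal sketch; real pipelines add stopword removal, stemming or lemmatization, and more):

import re

def cleanse(text):
    """Toy cleansing: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep only letters, digits, whitespace
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

print(cleanse("Apple released the NEW iPhone!!  (2023)"))
# -> 'apple released the new iphone 2023'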

 

After cleansing, we start the analysis with n-grams. Unigrams and bigrams are the simplest; as you increase n, the n-grams capture deeper context, but they become difficult to analyze by counting alone.
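
A minimal sketch of extracting n-grams from tokenized text (plain Python, no libraries assumed):

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "apple released the new iphone".split()
print(ngrams(tokens, 1))  # unigrams: [('apple',), ('released',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('apple', 'released'), ('released', 'the'), ...]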

Bag of Words (BOW) is an unordered, structured representation of the n-grams in each document. To keep the analysis efficient, we store it as a sparse matrix, because a BOW matrix is mostly zeros (over 99%).
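
A minimal sketch using scikit-learn's CountVectorizer (assuming a recent scikit-learn is installed); it returns the bag-of-words counts directly as a SciPy sparse matrix. The three "documents" below are a tiny hypothetical corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple released the new iphone",
    "the new iphone is expensive",
    "apple announced new earnings",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # SciPy sparse matrix, shape (3, n_terms)
print(vectorizer.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # dense view, just for inspection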

 

Term frequency-inverse document frequency (TF-IDF) is another concept for evaluating the "importance" of a word: it gives the importance of a word in a document relative to the overall ubiquity of the word across the corpus.
Note that this works well, but the weighting is arbitrary!
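
A minimal sketch with scikit-learn's TfidfVectorizer, reusing the same hypothetical corpus as above (the exact weights depend on scikit-learn's smoothing and normalization choices):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "apple released the new iphone",
    "the new iphone is expensive",
    "apple announced new earnings",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)  # sparse matrix of TF-IDF weights
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))    # words shared by every document get lower weight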

 

I will introduce vectorization and cosine distance in the next post.