足ることを知らず

Data Science, global business, management and MBA

Day 75 in MIT Sloan Fellows Class 2023, Advanced Data Analytics and Machine Learning in Finance 6, Playing with text data

Tokenization

What is Tokenization | Methods to Perform Tokenization

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

 

Before processing natural language, we need to identify the words that constitute a string of characters. That is why tokenization is the most basic step in NLP (text data): the meaning of the text can then be interpreted by analyzing the words it contains.

 

Tokenization in Python

 

  • Sentence Tokenization – Splitting a paragraph into sentences.
  • PunktSentenceTokenizer – Efficient when we have huge chunks of data.
  • Word Tokenization – Splitting a sentence into words.
  • PunktWordTokenizer – It does not separate punctuation from the words.
  • CountVectorizer
  • TfidfVectorizer
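
A minimal sketch of the tokenizers and vectorizers listed above, assuming NLTK and scikit-learn are installed; the sample text is made up for illustration:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("punkt")  # Punkt sentence/word tokenizer models

text = "Tokenization splits text into tokens. Each token is a small unit, such as a word."

# Sentence tokenization: split the paragraph into sentences
sentences = sent_tokenize(text)
print(sentences)

# Word tokenization: split a sentence into words (punctuation becomes its own token)
words = word_tokenize(sentences[0])
print(words)

# CountVectorizer: tokens -> raw term counts per document
count_vec = CountVectorizer()
counts = count_vec.fit_transform(sentences)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TfidfVectorizer: tokens -> TF-IDF weighted counts per document
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(sentences)
print(tfidf.toarray())
```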

Difference between Embedding and Vectorizer

Transformer Embeddings and Tokenization

Tokenization converts a text into a list of integers. Embedding converts the list of integers into a list of vectors (a list of embeddings). Positional information about each token is then added to the embeddings using positional encodings or positional embeddings.
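
A minimal sketch of this pipeline, using a toy vocabulary and PyTorch; the framework choice and the vocabulary are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Tokenization: text -> list of integers (token ids) via a toy vocabulary
vocab = {"playing": 0, "with": 1, "text": 2, "data": 3}
tokens = "playing with text data".split()
token_ids = torch.tensor([vocab[t] for t in tokens])   # [0, 1, 2, 3]

# Embedding: list of integers -> list of vectors (one d_model-dim vector per token)
d_model = 8
token_emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = token_emb(token_ids)                               # shape: (4, 8)

# Positional embedding: add information about each token's position in the sequence
pos_emb = nn.Embedding(num_embeddings=512, embedding_dim=d_model)
positions = torch.arange(len(tokens))
x = x + pos_emb(positions)                             # token meaning + position

print(x.shape)  # torch.Size([4, 8])
```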

A typical example of embedding is word2vec. So tokenization builds the dictionary for our machine learning, whereas embedding describes each word/token with a quantifiable expression, a vector. A major benefit is the reduction in dimensionality (similar to PCA).
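
A small word2vec sketch, assuming gensim as the library and a made-up three-sentence corpus:

```python
from gensim.models import Word2Vec

corpus = [
    ["tokenization", "splits", "text", "into", "tokens"],
    ["embedding", "maps", "tokens", "to", "vectors"],
    ["vectors", "have", "low", "dimensions"],
]

# Each word is described by a dense 50-dim vector, far below the vocabulary size
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["tokens"].shape)          # (50,) -- one vector per word
print(model.wv.most_similar("tokens"))   # words with the most similar vectors
```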

 

One hot encoding

One-hot encoding transforms categorical (label-encoded) data into a table of binary variables. There are two main reasons to use it.

Firstly, it improves the efficiency of machine learning.

Many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric. In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than a hard limitation of the algorithms themselves.

 

Secondly, integer encoding assumes a natural ordering between categories, which may result in poor performance or unexpected results (e.g., predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation: the integer-encoded variable is removed and a new binary variable is added for each unique integer value.
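
A minimal sketch of integer encoding versus one-hot encoding with scikit-learn; the library choice and the color example are assumptions for illustration:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]

# Integer (label) encoding: implies an ordering between colors that does not exist
label_enc = LabelEncoder()
labels = label_enc.fit_transform([c[0] for c in colors])
print(labels)                    # e.g. [2 1 0 1]

# One-hot encoding: one new binary variable per unique category, no implied order
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
onehot_enc = OneHotEncoder(sparse_output=False)
onehot = onehot_enc.fit_transform(colors)
print(onehot_enc.categories_)    # [array(['blue', 'green', 'red'], ...)]
print(onehot)                    # [[0. 0. 1.], [0. 1. 0.], [1. 0. 0.], [0. 1. 0.]]
```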