Tokenization
What is Tokenization | Methods to Perform Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
Before processing natural language, we need to identify the words that make up a string of characters. That is why tokenization is the most basic step in NLP (working with text data). This matters because much of the meaning of a text can be interpreted by analyzing the words present in it.
Tokenization in Python
- Sentence Tokenization – Splitting sentences in the paragraph
- PunktSentenceTokenizer – Efficient to use when we have huge chunks of data.
- Word Tokenization – Splitting words in a sentence.
- PunktWordTokenizer – It doesn’t separate the punctuation from the words.
- CountVectorizer
- TfidfVectorizer
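The list above can be illustrated with a minimal sketch that uses only the Python standard library. It is a stand-in for the idea, not the NLTK or scikit-learn implementation: the helper names `sent_tokenize`, `word_tokenize`, and `count_vector` are hypothetical, the sentence splitter is a simple regular expression rather than the Punkt algorithm, and this word tokenizer does separate punctuation from words. The last helper mimics what a count-based vectorizer produces: one count per vocabulary term.

```python
import re
from collections import Counter


def sent_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence):
    """Split a sentence into word tokens and separate punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(t.lower() for t in tokens)
    return [counts[term] for term in vocabulary]


text = "Tokenization is simple. It splits text into tokens!"

# Sentence tokenization: paragraph -> sentences
sentences = sent_tokenize(text)

# Word tokenization: sentence -> words (punctuation becomes its own token)
tokens = [word_tokenize(s) for s in sentences]

# Vocabulary of lowercase word tokens, punctuation excluded
vocabulary = sorted({t.lower() for sent in tokens for t in sent if t.isalnum()})

# One count vector per sentence, aligned with the vocabulary order
vectors = [count_vector(sent, vocabulary) for sent in tokens]

print(sentences)
print(tokens)
print(vocabulary)
print(vectors)
```

For real text, a trained sentence tokenizer handles abbreviations such as "Dr." far better than this regular expression, which would split after them.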
Difference between Embedding and Vectorizer
Transformer Embeddings and Tokenization
Tokenization converts a text into a list of integers (token IDs). Embedding converts that list of integers into a list of vectors (a list of embeddings). Positional information about each token is then added to the embeddings using positional encodings or positional embeddings.
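The three steps described here (text to integers, integers to vectors, adding positional information) can be sketched in plain Python. This is only an illustration under stated assumptions: the embedding table is filled with random numbers instead of learned weights, the dimension of 4 is arbitrary, whitespace splitting stands in for a real tokenizer, and the sinusoidal formula is one common choice of positional encoding, not the only one. The names `build_vocab` and `positional_encoding` are hypothetical.

```python
import math
import random


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


# Step 1: tokenization, text -> list of integers
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

# Step 2: embedding, list of integers -> list of vectors
dim = 4  # assumed embedding size, chosen only for readability
rng = random.Random(0)  # random values stand in for learned weights
embedding_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
embedded = [embedding_table[i] for i in token_ids]

# Step 3: add positional information to each embedding
model_input = [
    [e + p for e, p in zip(vec, positional_encoding(pos, dim))]
    for pos, vec in enumerate(embedded)
]

print(token_ids)
print(len(model_input), len(model_input[0]))
```

Both occurrences of "the" share the same token ID and the same embedding, but their final vectors differ because they sit at different positions.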
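A typical example of an embedding is word2vec. So tokenization creates the dictionary (vocabulary) for our machine learning model, whereas embedding describes each word or token with a quantifiable expression, a vector. The major benefit is the reduction in dimensionality (similar to PCA): each token is described by a short dense vector rather than by a vector as long as the vocabulary.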
One hot encoding
One-hot encoding transforms integer-encoded categorical data into a binary table with one column per category. There are two reasons to use it.
Firstly, the efficiency of machine learning implementations.
Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric. In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than a hard limitation of the algorithms themselves.
Secondly, integer encoding on its own implies a natural ordering between categories. When no such ordering exists, the model may still assume one, which can result in poor performance or unexpected results (such as predictions halfway between categories).
In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
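As a small sketch of that procedure, the example below first integer-encodes a list of labels and then replaces each integer with a binary row. The helper names `integer_encode` and `one_hot` and the color labels are made up for illustration, and the alphabetical ordering of categories is an arbitrary assumption.

```python
def integer_encode(labels):
    """Map each distinct label to an integer, in alphabetical order."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    return [index[label] for label in labels], categories


def one_hot(codes, num_categories):
    """Replace each integer code with a binary row containing a single 1."""
    rows = []
    for code in codes:
        row = [0] * num_categories
        row[code] = 1
        rows.append(row)
    return rows


labels = ["red", "green", "blue", "green"]

# Integer encoding: blue -> 0, green -> 1, red -> 2
codes, categories = integer_encode(labels)

# One-hot encoding: one binary column per unique integer value
encoded = one_hot(codes, len(categories))

print(categories)
print(codes)
for label, row in zip(labels, encoded):
    print(label, row)
```

The integer codes suggest that "red" (2) is somehow greater than "blue" (0), while in the one-hot rows no category is larger than another, which is exactly the ordering problem this encoding avoids.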