Do you also want to learn NLP as quickly as possible?

Perhaps you are here because, like me, you want to learn natural language processing as quickly as possible. Let's start.

The first thing we need is to install a few dependencies:

1. Python >= 3.7: https://www.python.org/downloads/
2. An IDE, or Jupyter Notebook. To install Jupyter Notebook, just open your cmd (terminal) and type `pip install notebook`. After that, type `jupyter notebook` to run it; you will see that your notebook opens at http://127.0.0.1:8888/?token=...
3. Packages: `pip install nltk`. NLTK is a Python library that can be used to perform all of the NLP tasks covered here (stemming, lemmatization, etc.).

In this blog, we are going to learn about:

- Tokenization
- Stopwords
- Stemming
- Lemmatization
- WordNet
- Part-of-speech tagging
- Bag of words

Before learning anything else, let's first understand NLP. Natural language refers to the way we humans communicate with each other, and processing basically means putting data into an understandable form. So we can say that NLP (natural language processing) is a way of helping computers communicate with humans in their own language. It is one of the broadest fields of research, because there is a huge amount of data out there, and a large share of that data is text. With so much data available, we need techniques to process it and retrieve useful information from it. Now that we have an understanding of what NLP is, let's go through each topic one by one.

1. Tokenization

Tokenization is the process of dividing a whole text into tokens. It comes in two main types:

- Word tokenization (separated by words)
- Sentence tokenization (separated by sentences)

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. Python is awesome"

print(sent_tokenize(example_text))
print(word_tokenize(example_text))
```

In the code above, the first line imports nltk, and the second line imports our tokenizers, sent_tokenize and word_tokenize, from nltk.tokenize. To use a tokenizer on a text, we just need to pass the text as a parameter. The output will look something like this:

```
## sent_tokenize (separated by sentences)
['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'Python is awesome']

## word_tokenize (separated by words)
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'Python', 'is', 'awesome']
```
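One caveat that trips people up: the tokenizers rely on NLTK's pre-trained punkt model, which `pip install nltk` does not download for you. If the code above raises a LookupError, fetching the model once should fix it:

```python
import nltk

## The tokenizers need the pre-trained "punkt" sentence model;
## it only has to be downloaded once and is then cached locally.
nltk.download('punkt')
```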
2. Stopwords

In general, stopwords are the words in a language that do not add much meaning to a sentence. In NLP, stopwords are the words that are not important for analyzing the data, for example: he, she, is, etc. Our main task is to remove all the stopwords from the text before doing any further processing. There are a total of 179 stopwords in English, and using NLTK we can see all of them; we just need to import stopwords from the nltk.corpus library.

```python
from nltk.corpus import stopwords

print(stopwords.words('english'))

######################
######OUTPUT##########
######################
## ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
```

To remove the stopwords from a particular text:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = 'he is a good boy. he is very good in coding'
text = word_tokenize(text)
text_with_no_stopwords = [word for word in text if word not in stopwords.words('english')]
text_with_no_stopwords

##########OUTPUT##########
## ['good', 'boy', '.', 'good', 'coding']
```

3. Stemming

Stemming is the process of reducing a word to its word stem by stripping off the suffixes and prefixes affixed to the root of the word, which is known as the lemma. In simple words, we can say that stemming is the process of removing plurals and other affixes from a word. Example: loved → love, learning → learn.

In Python, we can implement stemming by using PorterStemmer, which we can import from the nltk.stem library. One thing to remember about stemming is that it works best with single words.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## Creating an object for PorterStemmer
example_words = ['earn', 'earning', 'earned', 'earns']  ## Example words

for w in example_words:
    print(ps.stem(w))  ## Stemming each word with the ps object

##########OUTPUT##########
## earn
## earn
## earn
## earn
```

Here we can see that 'earning', 'earned', and 'earns' are all stemmed to their root word 'earn'.
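As a side note, PorterStemmer is not the only stemmer that ships in nltk.stem. SnowballStemmer (often called "Porter2") is a slightly more careful alternative that also supports languages other than English. A minimal sketch; the example words here are my own:

```python
from nltk.stem import SnowballStemmer

## Snowball is available for English plus many other languages.
snowball = SnowballStemmer('english')

for w in ['running', 'generously', 'fairly']:
    print(snowball.stem(w))

## 'generously' becomes 'generous' here, whereas PorterStemmer
## would produce the less readable 'gener'.
```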
4. Lemmatization

Lemmatization usually refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. In simple words, lemmatization does the same job as stemming, the difference being that lemmatization returns a meaningful word.

Example:
Stemming: history → histori
Lemmatizing: history → history

It is mostly used when designing chatbots, Q&A bots, text prediction, and so on. (As with the tokenizers, you may need to run nltk.download('wordnet') once before using the lemmatizer.)

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  ## Creating an object for the lemmatizer
example_words = ['history', 'formality', 'changes']

for w in example_words:
    print(lemmatizer.lemmatize(w))

#########OUTPUT############
## ----Lemmatizer-----
## history
## formality
## change
##
## -----Stemming------
## histori
## formal
## chang
```

5. WordNet

WordNet is a lexical database, i.e. a dictionary, for the English language, specifically designed for natural language processing. We can use WordNet for finding synonyms and antonyms. In Python, we can import wordnet from nltk.corpus.

Code for finding the synonyms and antonyms of a given word:

```python
from nltk.corpus import wordnet

synonyms = []  ## Creating an empty list for all the synonyms
antonyms = []  ## Creating an empty list for all the antonyms

for syn in wordnet.synsets("happy"):  ## Looking up the given word
    for i in syn.lemmas():            ## Walking through the matching lemmas
        synonyms.append(i.name())     ## Appending all the synonyms
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())  ## ...and the antonyms

print(set(synonyms))  ## Converting them into sets for unique values
print(set(antonyms))

#########OUTPUT##########
## {'felicitous', 'well-chosen', 'happy', 'glad'}
## {'unhappy'}
```
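WordNet is also what powers WordNetLemmatizer, and it explains a quirk of the lemmatizer: by default, lemmatize() treats every word as a noun, so verb forms often come back unchanged. Passing a part-of-speech hint via the pos parameter fixes that. A minimal sketch:

```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

## Treated as a noun by default, 'running' stays as it is ...
print(lemmatizer.lemmatize('running'))                    ## running
## ... but tagged as a verb, it is reduced to its true lemma.
print(lemmatizer.lemmatize('running', pos=wordnet.VERB))  ## run
```

Getting those hints automatically is exactly what part-of-speech tagging, our next topic, is for.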
6. Part-of-Speech Tagging

Part-of-speech tagging is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag here is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Part-of-speech tag list:

CC   coordinating conjunction
CD   cardinal digit
DT   determiner
EX   existential there (like: "there is" ... think of it like "there exists")
FW   foreign word
IN   preposition/subordinating conjunction
JJ   adjective 'big'
JJR  adjective, comparative 'bigger'
JJS  adjective, superlative 'biggest'
LS   list marker 1)
MD   modal could, will
NN   noun, singular 'desk'
NNS  noun, plural 'desks'
NNP  proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT  predeterminer 'all the kids'
POS  possessive ending parent's
PRP  personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB   adverb very, silently
RBR  adverb, comparative better
RBS  adverb, superlative best
RP   particle give up
TO   to go 'to' the store
UH   interjection errrrrrrrm
VB   verb, base form take
VBD  verb, past tense took
VBG  verb, gerund/present participle taking
VBN  verb, past participle taken
VBP  verb, sing. present, non-3rd person take
VBZ  verb, 3rd person sing. present takes
WDT  wh-determiner which
WP   wh-pronoun who, what
WP$  possessive wh-pronoun whose
WRB  wh-adverb where, when

In Python, we can do POS tagging using nltk.pos_tag:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')

sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit.
Mrs all projecting favourable now unpleasing. Son law garden chatty temper.
Oh children provided to mr elegance marriage strongly.
Off can admiration prosperous now devonshire diminution law.
'''

words = word_tokenize(sample_text)
print(nltk.pos_tag(words))
```

```
################OUTPUT############
[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), ('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), ('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), ('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', '.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), ('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), ('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
```

7. Bag of Words

So far we have learned about tokenizing, stemming, and lemmatizing. All of these are part of text cleaning. After cleaning the text, we need to convert it into some kind of numerical representation, called vectors, so that we can feed the data to a machine learning model for further processing. For converting the data into vectors, we make use of some predefined libraries in Python.

Let's see how the vector representation works:

```
sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good

### After removal of stopwords, lemmatization or stemming
sent1 = good boy
sent2 = good girl
sent3 = boy girl good

### Now we calculate the frequency of each word
### by counting its occurrences
word   frequency
good   3
boy    2
girl   2

### Then, according to the occurrence of each word in a sentence,
### we assign 1 (present) or 0 (not present)
      | f1   | f2   | f3
      | girl | good | boy
sent1 |  0   |  1   |  1
sent2 |  1   |  1   |  0
sent3 |  1   |  1   |  1

### After this we pass the vector form to a machine learning model
```

The above process can be done using CountVectorizer in Python; we can import it from sklearn.feature_extraction.text.

Code to implement CountVectorizer in Python:

```python
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()
sent = pd.DataFrame(['he is a good boy', 'she is a good girl', 'boy and girl are good'],
                    columns=['text'])

corpus = []
for i in range(0, 3):
    words = sent['text'][i]
    words = word_tokenize(words)
    texts = [lemmatizer.lemmatize(word) for word in words
             if word not in set(stopwords.words('english'))]
    text = ' '.join(texts)
    corpus.append(text)

print(corpus)  #### Cleaned data

cv = CountVectorizer()  ## Creating an object for CountVectorizer
X = cv.fit_transform(corpus).toarray()
X  ## Vectorized form

############OUTPUT##############
## ['good boy', 'good girl', 'boy girl good']
## array([[1, 0, 1],
##        [0, 1, 1],
##        [1, 1, 1]], dtype=int64)
```

Congratulations, now you know the basics of NLP. Like 👋 and don't forget to share your views with the community.
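Bonus: to see how all of these pieces fit together, here is the cleaning-plus-vectorizing flow from this post condensed into one runnable snippet. The clean_text helper and the example sentences are my own, and cv.get_feature_names_out() assumes scikit-learn 1.0 or newer:

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(sentence):
    ## Tokenize, drop punctuation and stopwords, lemmatize the rest.
    words = word_tokenize(sentence.lower())
    return ' '.join(lemmatizer.lemmatize(w) for w in words
                    if w.isalpha() and w not in stop_words)

sentences = ['He is a good boy.', 'She is a good girl.', 'Boy and girl are good.']
corpus = [clean_text(s) for s in sentences]

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

print(cv.get_feature_names_out())  ## The vocabulary CountVectorizer learned
print(X)                           ## One count vector per cleaned sentence
```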