Python Text Mining: Perform Text Processing, Word Embedding, Text Classification and Machine Translation
Ebook: 571 pages, 2 hours


About this ebook

Natural Language Processing (NLP) has proven useful in a wide range of applications, and extracting information from text data sets requires careful attention to methods, techniques, and approaches.
'Python Text Mining' includes a number of application cases, demonstrations, and approaches that will help you deepen your understanding of feature extraction from data sets. You will get an understanding of good information retrieval, a critical step in accomplishing many machine learning tasks. We will learn to classify text into discrete segments solely on the basis of model properties, not on the basis of user-supplied criteria. The book will walk you through many methodologies, such as classification, that will enable you to rapidly construct recommendation engines, subject segmentation, and sentiment analysis applications. Toward the end, we will also look at machine translation and transfer learning.

By the end of this book, you'll know exactly how to gather web-based text, process it, and then apply it to the development of NLP applications.
Language: English
Release date: Mar 26, 2022
ISBN: 9789389898798


    Book preview

    Python Text Mining - Alexandra George

    CHAPTER 1

    Basic Text Preprocessing Techniques

    Introduction

    The market demand for data scientists has been rising rapidly because they are responsible for carrying out every step of the data science workflow single-handedly. Chief among these steps are data capture and data cleaning. The life cycle, or workflow, of any project is defined by the steps shown below:

    Figure 1.1: Data science workflow

    These steps decide the quality of the data we are going to work with. If the quality of the data is compromised, the resulting model will also turn out to be erroneous.

    Structure

    In this chapter, we will learn:

    How to scrape tweets from Twitter

    The common pre-processing techniques for text data, such as:

    HTML tag removal

    Accented character removal

    Contraction expansion

    Stemming and lemmatization

    Emoji handling

    Special character and stop word removal

    How to apply these pre-processing techniques to tweet data scraped from Twitter (Project 1)

    How to scrape data from Inshorts and then apply the pre-processing techniques we learned (Project 2)

    Objectives

    After studying this chapter, you should be able to:

    Understand the different common pre-processing techniques we use for text data

    Identify which type of pre-processing is necessary for the data

    Effectively maximize information retrieval from dirty data

    Data preparation

    Data preparation is one of the most important steps in data science. We as data scientists will spend almost 80% of our time collecting and cleaning the data. It is only this step that will determine the quality of the results we can expect.

    Only when the data is clean enough can the model find effective patterns in it, so it is very important to choose the correct preprocessing steps. The steps we need to perform are not fixed and depend purely on the impurities in the data we are dealing with. Conversely, this means we need a clear understanding of the data set in order to choose which preprocessing steps to perform.

    We shall understand the data preprocessing more in detail with the help of a practical application.

    Project 1: Twitter data analysis

    The enormous growth of the Internet has resulted in a tsunami of data. This data is used for a variety of applications and acts as a crucial element for understanding the voice of the people.

    For this, we first need to scrape the data and then process it to convert it into usable insights.

    Scraping the data

    Scraping data from Twitter is straightforward if you have a developer account! Steps to convert your account into a developer account and to generate keys can be found at the following links:

    Steps to apply for a developer account: https://developer.twitter.com/en/docs/basics/developer-portal/faq

    Steps to generate consumer and access keys can be found at this link: https://themepacific.com/how-to-generate-api-key-consumer-token-access-key-for-twitter-oauth/994/

    Necessary libraries:

    Pandas

    Tweepy

    TQDM

    Pandas is used for converting the data into a dataframe. Tweepy is an open-source library that enables scraping data from Twitter. TQDM comes from an Arabic word meaning progress; it is used to add progress bars to long-running loops:

    Any programming language offers plenty of packages, most of which will not be necessary for a given task. So the first step, before going into the actual program, is to install the necessary packages locally in our working Jupyter notebook.

    Importing the necessary libraries:

    The package function we define here checks whether a package is installed: it tries to import the package, and if the import fails, it installs the package using pip. It is always better to use such a function for packages you are not sure are installed. In case you think a name is lengthy or complex, you can always use the as keyword to give the package an alternative name to be used further in the program.

    import os
    import importlib

    def package(package_name):
        # try to import the package; if the import fails, install it with pip
        try:
            importlib.import_module(package_name)
        except ImportError:
            print('Trying to install required module')
            cmd = 'python -m pip install ' + package_name
            os.system(cmd)

    # making use of the function
    package('tweepy')
    import tweepy
    import pandas as pd
    package('tqdm')
    from tqdm import tqdm

    Define your consumer and access keys (the steps to convert your Twitter account into a developer account are given in the links provided earlier):

    #input your credentials here

    consumer_key= 'Your Consumer key'

    consumer_secret= 'Your Secret Key'

    access_token= 'Your Access token'

    access_token_secret='Your Secret Access token'

    OAuth is a protocol that provides authorization for web-based applications and APIs. Since we are accessing the Twitter API through tweepy, we need to authenticate with Twitter:

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth,wait_on_rate_limit=True)

    We get the data for the desired hashtag and append it to the dataframe we create. I am scraping the hashtag #ClimateChange, but please feel free to try out different hashtags:

    # Create a dataframe to hold the scraped tweets
    twitter_data = pd.DataFrame(columns=['Tweet_date', 'User', 'Tweets'])

    for tweet in tweepy.Cursor(api.search, q='#ClimateChange', count=500,
                               lang='en',
                               since='2019-10-16').items():
        # print(tweet.created_at, tweet.text)
        date = tweet.created_at
        user = tweet.user
        text = tweet.text.encode('utf-8')
        twitter_data = twitter_data.append(
            {'Tweet_date': date, 'User': user, 'Tweets': text},
            ignore_index=True)

    Now we have scraped the data we want with the help of the tweepy package. Tweepy is capable of much more than scraping just the tweet text and date, so I urge you to explore its possibilities further.
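    As a small illustrative sketch (not from the book, and assuming the same api object and hashtag as above), here is how a few extra fields exposed by tweepy's Status objects, such as the author's handle, retweet count, and favorite count, could be collected:

    # illustrative sketch: collect a few extra fields from each Status object
    extra_data = pd.DataFrame(columns=['User', 'Retweets', 'Favorites', 'Tweet'])
    for tweet in tweepy.Cursor(api.search, q='#ClimateChange', lang='en').items(100):
        extra_data = extra_data.append(
            {'User': tweet.user.screen_name,      # author's handle
             'Retweets': tweet.retweet_count,     # number of retweets
             'Favorites': tweet.favorite_count,   # number of likes
             'Tweet': tweet.text},
            ignore_index=True)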

    Data pre-processing

    Data pre-processing consists of multiple sub-steps. Let us now look at each of these steps written as a function, understand them independently, and then combine them for use on our data.

    Importing necessary packages

    The first step is to install the necessary packages for our program, which is done below by calling the function we created in the previous step:

    #making use of the package function here
    package('bs4')
    from bs4 import BeautifulSoup
    package('contractions')
    package('spacy')
    package('nltk')

    import re
    import unicodedata
    import contractions
    import spacy
    import nltk
    import numpy as np

    nltk.download('punkt')
    nltk.download('stopwords')
    # make sure the small English model is available before loading it
    os.system('python -m spacy download en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')
    ps = nltk.porter.PorterStemmer()

    Here, the package re is for regular expressions. bs4 is Beautiful Soup, which is used for parsing XML and HTML documents. unicodedata provides access to the Unicode Character Database (UCD), which defines the character properties of all Unicode characters (as per the Python documentation page https://docs.python.org/3/library/unicodedata.html).

    HTML parsing

    To access the text of any webpage, we scrape it using a special package called Beautiful Soup. Here we write a small function that uses the BeautifulSoup package to extract only the text from the webpage and a regular expression (re) to clean up the line breaks:

    def strip_html_tags(text):
        # parse the markup and keep only the visible text
        soup = BeautifulSoup(text, 'html.parser')
        stripped_text = soup.get_text()
        # collapse \r, \n and \r\n runs into single newlines
        stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
        return stripped_text

    Here, we make use of the BeautifulSoup package, which can parse HTML or XML. We create a BeautifulSoup object called soup by fitting it with the text and specifying html.parser as the parser.

    We can then use the soup object to extract the desired parts of the text; soup.get_text() returns just the text. There are many other operations available, which we leave for you to explore. Finally, we perform a substitution with a regular expression (re) so that patterns like \r, \n, and \r\n are all converted to a single \n in the tag-stripped text.

    Figure 1.2: Available functions inside Beautiful Soup

    Output:

    Now, let us take a look at how the function works:

    Figure 1.3: Shows how HTML tags are removed and patterns like \r,\n and \r\n are replaced from the text

    Since the data comes directly from websites, the underlying text can only be seen after we remove the HTML tags. This is the first step whenever we scrape data from websites.
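    As a quick usage sketch (the sample_html string below is a made-up input, not the book's example), the function behaves roughly as follows:

    # hypothetical input for illustration
    sample_html = '<html><body><h1>Climate</h1>\r\n<p>Warming is accelerating.</p></body></html>'
    print(strip_html_tags(sample_html))
    # prints roughly:
    # Climate
    # Warming is accelerating.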

    Removing accented characters

    Accented characters and marks are very important parts of both written and spoken language. These characters are mainly from European languages like Spanish, German, Italian, French, and Portuguese:

    def remove_accented_chars(text):

    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    return text

    Here, we perform Unicode normalization using the NFKD form (Normal Form KD, where KD stands for Compatibility Decomposition). Since ASCII is a subset of Unicode, we encode the normalized text to ASCII, ignoring the characters that cannot be represented, and then decode the result back into a UTF-8 string.

    Output:

    Figure 1.4: Showing the output of the function to remove accented characters

    Accented characters are mostly due to the presence of non-English words in the text data. If they are not treated properly, they can cause encoding issues when you save the data to a database.
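    A quick usage sketch (with a made-up input string) shows the effect of the function defined above:

    # hypothetical input for illustration
    print(remove_accented_chars('The café served crème brûlée to the naïve critic'))
    # 'The cafe served creme brulee to the naive critic'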

    Expanding contractions

    To make the code reusable, we make it a point to put each piece of code into a function, so that whenever we need it later we can simply call the function. Here we write a function that replaces contractions with their expanded forms.

    def expand_contractions(text):

    return contractions.fix(text)

    Here, this function makes use of the contractions package to expand the contractions in the text.

    Output:

    Figure 1.5: Output after expanding contractions

    As seen in the output, contractions like didn't and wasn't are converted to did not and was not.
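    A quick usage sketch (with a made-up input) illustrates the function above; the exact expansions come from the contractions package:

    # hypothetical input for illustration
    print(expand_contractions("I didn't know she wasn't coming because they'll be late"))
    # roughly: 'I did not know she was not coming because they will be late'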

    Lemmatization and stemming

    Stemming and lemmatization are two processes that convert words from their derived forms to their root forms. For example:

    Derived forms: am, are, is -> Root form: be

    Word with inflections: cycle, cycle's, cycles -> Root form: cycle

    Although both perform the same kind of operation, they differ in method. Stemming is a rough process of chopping off word endings to produce root forms called stems, and it is right most of the time. Lemmatization reduces words to their root forms by making use of a dictionary. Using a proper vocabulary and morphology to convert words to their root forms is better than stemming, but it comes at the cost of heavier computation.

    Fail case

    Both aim to produce the same results until we give them words like saw. The stemmer might produce just s, while the lemmatizer might produce see or saw depending on the context:

    def spacy_lemmatize_text(text):

    text = nlp(text)

    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])

    return text

    def simple_stemming(text, stemmer=ps):

    text = ' '.join([stemmer.stem(word) for word in text.split()])

    return text

    Output:

    Figure 1.6: Output after stemming and lemmatization

    We make use of spaCy to build our lemmatizer and NLTK's PorterStemmer for stemming. The output shows the variation between stemming and lemmatization. Which one to choose depends purely on the use case: the stemmer simply chops off trailing characters, while lemmatization does the job properly using a vocabulary. One should not forget that because lemmatization involves the use of a vocabulary, it is also computationally costlier than stemming.
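    As a quick comparison sketch (with a made-up sentence; the exact lemmas depend on the spaCy model version), the two functions defined above can be contrasted like this:

    # hypothetical input for illustration
    sentence = 'The striped bats are hanging on their feet'
    print(simple_stemming(sentence))       # roughly: 'the stripe bat are hang on their feet'
    print(spacy_lemmatize_text(sentence))  # roughly: 'the striped bat be hang on their foot'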

    Removing special characters

    Special characters add impurities to the unstructured data. They can be removed with the help of simple regular expressions:

    def remove_special_characters(text, remove_digits=False):
        # keep letters and whitespace; keep digits too unless remove_digits=True
        pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
        text = re.sub(pattern, '', text)
        return text

    If remove_digits = True, then we will remove the numbers as well.

    Output:

    Figure 1.7: Output after removing special characters

    Some special characters like '.' or '?' can add value to the data, so the list is customizable: whatever we want to keep can be added inside the character class, for example r'[^a-zA-Z0-9.?\s]'.
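    A quick usage sketch (with a made-up input) shows both behaviours of the function above:

    # hypothetical input for illustration
    print(remove_special_characters('Well, this was fun! 100% worth it... right?'))
    # roughly: 'Well this was fun 100 worth it right'
    print(remove_special_characters('Well, this was fun! 100% worth it... right?', remove_digits=True))
    # roughly: 'Well this was fun  worth it right'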

    Removing stop words

    Words that possess little to no significance in a sentence are called stop words. Words such as a, an, the, or and are stop words. They can be articles, conjunctions, prepositions, and so on:

    def remove_stopwords(text, is_lower_case=False, stopwords=None):

    if not stopwords:

    stopwords = nltk.corpus.stopwords.words('english')

    tokens = nltk.word_tokenize(text)

    tokens = [token.strip() for token in tokens]

    if is_lower_case:

    filtered_tokens = [token for token in tokens if token not in stopwords]

    else:

    filtered_tokens = [token for token in tokens if token.lower() not in stopwords]

    filtered_text = ' '.join(filtered_tokens)

    return filtered_text

    First, we load the stopword corpus from nltk. Then we tokenize the sentence into words, check whether each word is in the stopword corpus, and keep only the words that are not.

    Output:

    Figure 1.8: Output after removing stopwords

    The preceding example clearly shows what happens when we remove the stop words from the data. Words like 'there, is, an, to, how' from the input are not seen in the output. This is just the default set of stopwords; words can be added to or removed from that list as per the requirement.
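    A quick usage sketch (with made-up inputs) shows the default behaviour and one way to customize the list, here by keeping the negation word not:

    # hypothetical input for illustration
    print(remove_stopwords('There is an easy way to learn how text mining works'))
    # roughly: 'easy way learn text mining works'

    # keep negations by dropping 'not' from the default list
    custom = [w for w in nltk.corpus.stopwords.words('english') if w != 'not']
    print(remove_stopwords('This is not a good movie', stopwords=custom))
    # roughly: 'not good movie'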

    Handling emojis or emoticons

    Emojis or emoticons if handled properly can give us a lot of meaning when it comes to sentiment analysis or any other text analysis:

    import emoji

    # Converting emojis to words
    def convert_emojis(text):
        # demojize turns an emoji into text such as ':grinning_face:'
        return emoji.demojize(text).replace(':', ' ').replace(',', ' ')

    # Converting emoticons to words
    def convert_emoticons(text):
        # EMOTICONS is a dictionary mapping emoticon strings such as ':-)'
        # to their descriptions (for example, from the emot package)
        for emot in EMOTICONS:
            text = re.sub(u'(' + emot + ')',
                          '_'.join(EMOTICONS[emot].replace(',', '').split()), text)
        return text

    Emojis can be converted into text by using the emoji package, which contains a function called demojize.

    Output:

    Figure 1.9: Output after converting the emojis

    Converting emojis is one option; we convert them to maximize the information retrieved from the data. But another option is also available.
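    A quick usage sketch (with a made-up tweet; the exact emoji name depends on the version of the emoji package) shows the conversion:

    # hypothetical input for illustration
    print(convert_emojis('The weather is great today 😀'))
    # roughly: 'The weather is great today  grinning_face '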

    Emoji removal

    If emojis are not converted into text, they are just another impurity in the unstructured data and can be removed:

    def remove_emoji(string):
        emoji_pattern = re.compile('['
                                   u'\U0001F600-\U0001F64F'  # emoticons
                                   u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                                   u'\U0001F680-\U0001F6FF'  # transport & map symbols
                                   u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                                   u'\U00002702-\U000027B0'
                                   u'\U000024C2-\U0001F251'
                                   ']+', flags=re.UNICODE)
        return emoji_pattern.sub(r'', string)

    Output:

    Figure 1.10: Output after removing emojis

    Of course, removing emojis might discard some of the strong signals the data has to convey. But it is also convenient in places where emojis would only add unwanted noise to the data.
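    A quick usage sketch (with a made-up input) shows the removal in action:

    # hypothetical input for illustration
    print(remove_emoji('Game day 🏀🔥 let us go'))
    # roughly: 'Game day  let us go'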

    Text acronym abbreviation

    Nowadays, acronyms like brb, ttyl, and so on are becoming more and more popular on social media platforms. If you are an avid social media user, you would have come across these acronyms at least
