Text analytics for topics inferred from a dataset of hotel reviews.

Praveen Kumar P
5 min read · Jun 2, 2021

One of the most useful aspects of natural language processing is the ability to automatically extract what topics people are talking about from large amounts of text.

Understanding what consumers are concerned about, and taking the problems and experiences they share on social media into account, is extremely valuable to businesses and helps them shape their marketing strategies. But sifting through massive volumes of unstructured data and assembling it into relevant topics by hand is incredibly difficult.

In this project, we’ll pre-process a dataset of hotel reviews and use Latent Dirichlet Allocation (LDA) from the Gensim package to extract the topics that are naturally discussed in it. The model outputs each topic as a set of weighted keywords, where a weight represents how important a keyword is to that topic. Finally, we examine the generated topics using a visualisation model.

Importing the Necessary Libraries

We start by importing the libraries the project needs.

Gensim relies entirely on Python’s built-in generators and iterators for streamed data processing; memory efficiency was one of its design goals and remains a key feature of the library. Our first step, then, is to import Gensim, a Python library for topic modelling.

Following that, the Python libraries spacy and pyLDAvis are imported. Spacy helps to process and “understand” large amounts of text, whereas pyLDAvis is intended for interpreting the topics in a topic model that has been fitted to a corpus of text data. The package uses information extracted from a fitted LDA topic model to power an interactive web-based visualisation.
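A minimal sketch of the imports used throughout this walkthrough (the exact set in the original notebook may differ slightly):

```python
import re
import string

import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
```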

Let’s get the data ready for modelling by cleaning it up.

The first step is to load the data into a dataframe.
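Something like the following, where the file name “hotel_reviews.csv” and the review-text column “Description” are placeholders for whatever your copy of the dataset uses:

```python
import pandas as pd

# "hotel_reviews.csv" and "Description" are placeholders;
# substitute the actual file and review-text column of your dataset.
df = pd.read_csv("hotel_reviews.csv")
print(df.shape)
df.head()
```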

A regular expression is used to search the whole text for non-text patterns, and anything that matches the pattern is removed. The cleaned data is stored in the column “cleandata”, ready for the next stage.
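A sketch of that cleaning step; the exact pattern is an assumption, so adapt it to the noise actually present in your data:

```python
import re

def clean_text(text):
    # keep letters, digits, whitespace, and basic sentence punctuation
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", " ", str(text))
    # collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

df["cleandata"] = df["Description"].apply(clean_text)
```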

Contractions are words or combinations of words shortened by dropping letters and replacing them with an apostrophe, e.g. “aren’t you going there?” for “are you not going there?”. Expanding contractions before creating word vectors reduces the number of dimensions. We tokenize the text, expand the contractions, and transform the tokens back into strings for the next step of analysis. The result is stored in the column “Description_New”, ready to proceed to the next stage.
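One way to do this is with the third-party contractions package (pip install contractions); this is a sketch, not necessarily the method used in the original notebook:

```python
import contractions  # third-party package: pip install contractions

def expand_contractions(text):
    # fix() expands e.g. "aren't" -> "are not", token by token
    expanded = [contractions.fix(word) for word in text.split()]
    return " ".join(expanded)

df["Description_New"] = df["cleandata"].apply(expand_contractions)
```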

The next step is to determine the language in which each review was written and then delete any reviews that were not written in English. For that we first need the pre-trained language-identification model for the fasttext library (courtesy of Facebook).
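With fasttext’s pre-trained model (lid.176.bin, downloadable from https://fasttext.cc/docs/en/language-identification.html), detection looks roughly like this:

```python
import fasttext

# pre-trained language-identification model from
# https://fasttext.cc/docs/en/language-identification.html
lang_model = fasttext.load_model("lid.176.bin")

def detect_language(text):
    # predict() returns labels like "__label__en"; fasttext rejects
    # newlines, so replace them before predicting
    labels, _ = lang_model.predict(str(text).replace("\n", " "))
    return labels[0].replace("__label__", "")

df["langs"] = df["Description_New"].apply(detect_language)
```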

Now we tokenize the data in “Description_New” and convert it to lower case. Next, the tokens are checked against the punctuation characters in Python’s string library and stripped of any matches. Finally, the tokens are joined back together, and the column “Description_Final” is ready for topic modelling.
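A sketch of those three steps, using a simple whitespace split as the tokenizer:

```python
import string

def normalise(text):
    tokens = text.lower().split()                      # tokenize + lower-case
    table = str.maketrans("", "", string.punctuation)  # punctuation -> nothing
    tokens = [t.translate(table) for t in tokens]      # strip punctuation
    return " ".join(t for t in tokens if t)            # re-join non-empty tokens

df["Description_Final"] = df["Description_New"].apply(normalise)
```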

Topic Modelling.

To begin, filter the dataframe to English-language rows using the column ‘langs’ and convert the ‘Description_Final’ column to a plain Python list.
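For example (the resulting list is named data here for readability):

```python
# keep only the reviews fasttext identified as English
df_en = df[df["langs"] == "en"]

# pull the review text out as a plain Python list
data = df_en["Description_Final"].values.tolist()
```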

Let’s transform each sentence in ‘Description_Final’ into a list of words by removing all punctuation and extra characters.
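Gensim’s simple_preprocess does exactly this; the generator below follows the common Gensim tutorial pattern:

```python
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True also removes punctuation and accent marks
        yield simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
```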

Bigrams are pairs of words that frequently appear together in a text, and trigrams are groups of three frequently co-occurring terms. Gensim’s Phrases model can build and score bigrams, trigrams, quadgrams, and more. The two most important arguments to Phrases are min_count and threshold: the higher these parameters are set, the harder it is for words to be combined into bigrams.
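A sketch with commonly used values for min_count and threshold (the original values are not shown in the article):

```python
from gensim.models.phrases import Phrases, Phraser

# higher min_count / threshold => fewer, stricter phrase merges
bigram = Phrases(data_words, min_count=5, threshold=100)
trigram = Phrases(bigram[data_words], threshold=100)

# Phraser is a lighter, faster wrapper once training is finished
bigram_mod = Phraser(bigram)
trigram_mod = Phraser(trigram)
```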

The bigram model is now built. Next, let’s define and call the functions for removing stopwords, creating bigrams, and lemmatisation, in that order.
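The helpers might look like this; Gensim’s built-in STOPWORDS list and spacy’s small English model stand in for whatever the original notebook used:

```python
import spacy
from gensim.parsing.preprocessing import STOPWORDS

def remove_stopwords(texts):
    return [[w for w in doc if w not in STOPWORDS] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    out = []
    for doc in texts:
        spacy_doc = nlp(" ".join(doc))
        out.append([tok.lemma_ for tok in spacy_doc if tok.pos_ in allowed_postags])
    return out

data_nostops = remove_stopwords(data_words)
data_bigrams = make_bigrams(data_nostops)
data_lemmatized = lemmatization(data_bigrams)
```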

Next, make the dictionary and corpus needed for topic modelling: the dictionary (id2word) and the corpus are the two primary inputs to the LDA topic model.
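In Gensim that looks like this:

```python
import gensim.corpora as corpora

# dictionary: maps each unique token to an integer id
id2word = corpora.Dictionary(data_lemmatized)

# corpus: each document as a bag-of-words list of (token_id, frequency)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
```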

We now have everything we need to train the LDA model. In addition to the corpus and dictionary, we supply the number of topics. chunksize is the number of documents used in each training chunk, ‘update_every’ specifies how frequently the model parameters should be updated, and ‘passes’ is the total number of training passes.
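A sketch of the training call; the hyperparameter values shown here are common tutorial defaults, not necessarily those of the original notebook:

```python
import gensim

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,      # assumed; tune for your data
    random_state=100,
    update_every=1,     # update parameters after every chunk
    chunksize=100,      # documents per training chunk
    passes=10,          # total sweeps over the corpus
    alpha="auto",
    per_word_topics=True,
)
```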

Model perplexity and topic coherence provide a convenient way to assess the quality of a given topic model. A low perplexity combined with a coherence score of 46.5 percent indicates a reasonably successful model.
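Both metrics are available directly from Gensim:

```python
from gensim.models import CoherenceModel

# lower perplexity indicates a better fit to the data
print("Perplexity:", lda_model.log_perplexity(corpus))

# c_v coherence: higher is better
coherence_model = CoherenceModel(
    model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence="c_v"
)
print("Coherence Score:", coherence_model.get_coherence())
```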

The interactive chart in the pyLDAvis package is intended to be useful for examining the produced topics and the associated keywords.
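In a notebook, the chart is produced like this (in pyLDAvis releases before 3.2 the module is pyLDAvis.gensim rather than pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis
```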

Word clouds of the top-N keywords in each topic are graphical representations of word frequency that give greater prominence to words that appear more often in the source text.
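One way to draw them, feeding each topic’s keyword weights straight into the wordcloud package (an assumption; the article does not show its plotting code):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud

# one word cloud per topic, sized by the LDA keyword weights
for topic_id in range(lda_model.num_topics):
    weights = dict(lda_model.show_topic(topic_id, topn=20))
    wc = WordCloud(background_color="white").generate_from_frequencies(weights)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Topic {topic_id}")
plt.show()
```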

Conclusion.

To develop the LDA model, we started from scratch: importing, cleaning, and processing the dataset. We then saw several ways to visualise the topic model’s outputs, such as word clouds and sentence colouring, which intuitively show which topic is dominant in each document.

References

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#14computemodelperplexityandcoherencescore

