An illustrated guide on how to perform sentiment analysis on Arabic text
![Dikrula Folorunsho](https://miro.medium.com/v2/resize:fill:88:88/2*ewWfR4kpGgx_K3ClOV__Mw.jpeg)
![Towards data science](https://miro.medium.com/v2/resize:fill:48:48/1*CJe3891yB1A1mzMdqemkdg.jpeg)
Identifying and classifying opinions expressed in text (also known as sentiment analysis) is one of the most frequently performed tasks in NLP. Despite being one of the most widely spoken languages in the world, Arabic has received little attention when it comes to sentiment analysis. This article therefore focuses on implementing Arabic Sentiment Analysis (ASA) in Python.
- Dataset
- Import libraries and explore data
- Preprocessing text
- Sentiment analysis using various ML algorithms
- Conclusion
- References
The dataset used in this article consists of 1,800 tweets labeled as positive or negative. It can be found here.
The two classes are well balanced.
As someone who is used to working with English text, I found it difficult to carry the preprocessing steps routinely applied to English over to Arabic. Luckily, I came across a GitHub repository with code for cleaning Arabic text. This step essentially involves removing punctuation marks, Arabic diacritics (short vowels and other harakat), elongation (tatweel), and stop words (available in the NLTK corpus).
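The cleaning steps described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the repository mentioned above: the function name `clean_arabic_text`, the punctuation set, and the five-word stop list are assumptions made for the example (the article relies on NLTK's Arabic stop-word list instead).

```python
import re
import string

# Arabic diacritics (harakat), fathatan through sukun: U+064B to U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Tatweel (kashida), the elongation character: U+0640.
TATWEEL = "\u0640"
# ASCII punctuation plus the Arabic comma, semicolon, and question mark.
PUNCTUATION = string.punctuation + "\u060C\u061B\u061F"
PUNCT_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
# Tiny illustrative stop-word list (an assumption for this sketch);
# the article uses nltk.corpus.stopwords.words("arabic") instead.
STOP_WORDS = {"في", "من", "على", "هذا", "إلى"}


def clean_arabic_text(text, stop_words=STOP_WORDS):
    """Strip diacritics, elongation, punctuation, and stop words."""
    text = DIACRITICS.sub("", text)      # remove short vowels and other harakat
    text = text.replace(TATWEEL, "")     # remove elongation
    text = text.translate(PUNCT_TABLE)   # replace punctuation with spaces
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


sample = "هَذَا الفِيلْمُ جَمِيـــلٌ جِدًّا!"
print(clean_arabic_text(sample))
```

Replacing punctuation with spaces rather than deleting it keeps words apart when a tweet has no space after a comma or full stop.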
The purpose of this article is to show how different feature extraction techniques can be used for sentiment analysis. For simplicity, however, we only discuss word vectorization (specifically tf-idf) here. As in other supervised learning tasks, the data is first separated into features (the tweet feed) and labels (the sentiment). It is then split into training and test sets, and several classifiers are fitted, starting with logistic regression.
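The split and the tf-idf vectorization described above might look like the following in scikit-learn. This is a sketch under stated assumptions: the eight toy sentences and their `pos`/`neg` labels are hypothetical stand-ins for the cleaned tweets, and the 25% test size and `random_state` are arbitrary choices, not values from the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned tweets and their labels.
texts = [
    "فيلم جميل جدا",
    "خدمة سيئة جدا",
    "تجربة رائعة",
    "منتج سيء",
    "يوم جميل",
    "طعام سيء جدا",
    "فريق رائع",
    "خدمة بطيئة",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Hold out a quarter of the data, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf weights from the training text only.
X_train_tfidf = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary for the test text.
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training portion only keeps information from the test tweets out of the idf weights.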
Logistic regression
Logistic regression is a very common classification algorithm. It is easy to implement and serves as a useful baseline for classification tasks. To keep the code short, we use scikit-learn's Pipeline class, which chains vectorization, transformation, and classification, and tune it with grid search. You can read more about grid search in the official scikit-learn documentation.
On the tweet dataset, logistic regression achieved 84% accuracy.
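The pipeline and grid search described above can be sketched as follows. The step names (`vect`, `tfidf`, `clf`), the parameter grid, and the toy data are all assumptions chosen for illustration. They are not the article's actual settings, and the toy data will not reproduce its accuracy figure.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned tweets and their labels.
texts = [
    "فيلم جميل جدا",
    "خدمة سيئة جدا",
    "تجربة رائعة",
    "منتج سيء",
    "يوم جميل",
    "طعام سيء جدا",
    "فريق رائع",
    "خدمة بطيئة",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Vectorization -> tf-idf transformation -> classification.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy data is tiny; a real run would use more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["فيلم جميل"]))
```

Because the vectorizer sits inside the pipeline, each cross-validation fold fits its vocabulary on that fold's training portion only.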
Random forest classifier
Naive Bayes classifier (multinomial)
Support vector machine
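The random forest, multinomial Naive Bayes, and support vector machine classifiers listed above can each be dropped into the same kind of tf-idf pipeline. The sketch below is one way to compare them, not the article's original code. The toy data, the hyperparameters, and the choice of `LinearSVC` for the support vector machine are assumptions, and the scores printed for eight toy sentences say nothing about the results on the real tweets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the cleaned tweets and their labels.
texts = [
    "فيلم جميل جدا",
    "خدمة سيئة جدا",
    "تجربة رائعة",
    "منتج سيء",
    "يوم جميل",
    "طعام سيء جدا",
    "فريق رائع",
    "خدمة بطيئة",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# One entry per classifier; hyperparameters are illustrative defaults.
classifiers = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "multinomial naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Each model gets its own tf-idf vectorizer fitted on the training text.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```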
This article walked through the steps involved in Arabic sentiment analysis. The main difference between Arabic and English NLP lies in the preprocessing step. All fitted classifiers gave good accuracy scores, ranging from 84 to 85%: Naive Bayes, logistic regression, and random forest achieved 84%, while the linear support vector machine did one percentage point better. The models could be improved further with techniques such as word embeddings and recurrent neural networks, which we will try to implement in the next article.
https://github.com/motazsaad/process-arabic-text/blob/master/clean_arabic_text.py