The internet is a very effective means of obtaining huge amounts of information in different forms, such as documents. Millions of documents are now available from various sources, and most of them contain valuable information. Classifying documents manually is time-consuming and difficult, especially when people must estimate the category from the information a document contains. Automatic text classification is therefore used to discover the basic information in text documents while saving human effort and time.
Automatic text categorization assigns texts to a set of predetermined categories based on their content. Specifically, it involves filtering and routing, clustering information in related texts, and then classifying the texts into specified topics. The text classification process is divided into three main phases: first, compile the training data; second, select a set of features to represent the text categories; third, test the testing data with the selected machine learning algorithm. Machine learning (ML) refers to methods that learn automatically, without human intervention, to make accurate predictions or behave intelligently. Text classification (TC) is one of the important areas of ML. TC is a data mining method that assigns categories to texts in a web page, book library, media articles, gallery, etc. The predetermined categories are based on content, and TC thereby extracts valuable information from large unstructured text resources, as in email filtering (spam or legitimate).
The classification of Arabic texts has received great attention in recent research because of the importance of the Arabic language and the large population that speaks it. In this paper, we introduce the HRWiTD algorithm, which automatically analyzes Arabic texts to estimate their classifications (categories). The abbreviation stands for the highest repetition of words in a text document. The proposed technique is built on three main stages. The pre-processing stage removes noisy data. The feature extraction stage learns the dataset and builds the Learning Dataset file from the features extracted from the train set; this file contains non-duplicate words with their highest repetition values and categories. The classification stage estimates the classification of texts using the HRWiTD algorithm: the expected classification of a text is the category with the largest number of words. If the average of the total repetition for all words in a text that have a predetermined classification (category) is less than 33.33%, the proposed classification of the text is the “General” category.
The HRWiTD algorithm has been applied to convergent samples of six categories, namely culture, economic, public, political, social, and sports, to obtain the best classification accuracy. The selected corpus was obtained from the Saudi Press Agency (SPA) and contains 1421 Arabic newswire texts. It was divided into two sets, a 70% train set and a 30% test set, a division reported to give the best classification accuracy. The train set is analyzed to obtain a predetermined category for each word in all texts and to construct the Learning Dataset file, which is then used to predict the category of each text in the test set.
Based on recent research, various machine learning algorithms have been successfully applied to Arabic text. The best-known techniques for classifying Arabic text, ordered from best to worst, are the C5.0 classifier, Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (C4.5), and K-nearest neighbor (KNN). These techniques are recognized as simple and efficient methods for classifying texts. In this research, however, they did not achieve satisfactory accuracy: the best average accuracy over all categories was 52.86%, obtained with the C5.0 classifier. The HRWiTD algorithm, in contrast, achieved the best text classification performance, with the highest average accuracy (86.84%).
The rest of the paper is organized as follows. The second section presents relevant work. The third section introduces the proposed work in detail, including the HRWiTD algorithm and the evaluation method. The fourth section presents the experimental results of the proposed algorithm and of the most popular machine learning algorithms, with a comparison. The last section is the conclusion.
2. Related Work
Text classification (TC), in the data mining field, is the process of extracting useful knowledge from text by analyzing complex textual data. The TC process is the automatic classification of a set of texts into categories based on their content.
In many text mining algorithms, pre-processing is one of the main components of text classification. Typically, the TC framework begins with pre-processing, followed by feature extraction, and finally classification. In detail, the process of classifying texts is divided into nine steps, in the following order: 1) data collection; 2) word processing to remove noisy data; 3) data segmentation into the train set and the test set; 4) feature extraction, to extract and generate the repetition list of the data set features; 5) feature selection, based on ten feature selection methods [term frequency (TF), document frequency (DF), information gain (IG), CHI squared (CHI), the Ng, Goh and Low (NGL) coefficient, the Darmstadt indexing approach (DIA) association factor, mutual information (MI), odds ratio (Odds), the Galavotti, Sebastiani, Simi (GSS) coefficient and relevancy score (RS)] and seven weighting methods (Boolean, frequency, relative frequency, TFiDF, TFC, LTC and entropy); 6) feature representation; 7) machine learning; 8) applying a classification model; and 9) performance evaluation.
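To make some of the weighting methods named above concrete, the following is a minimal sketch of four of them (Boolean, frequency, relative frequency and TFiDF). The function name `weight_vectors` and the TFiDF variant used here (raw count times the natural logarithm of N/df) are assumptions made for illustration; the surveyed works may use other variants, and TFC, LTC and entropy weighting are not covered.

```python
import math
from collections import Counter


def weight_vectors(docs):
    """Compute four term weightings for each document.

    docs is a list of token lists. The TFiDF form (count * ln(N / df))
    is one common variant, assumed here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        total = sum(tf.values())
        result.append({
            "boolean": {t: 1 for t in tf},
            "frequency": dict(tf),
            "relative_frequency": {t: c / total for t, c in tf.items()},
            "tfidf": {t: c * math.log(n_docs / df[t]) for t, c in tf.items()},
        })
    return result


weights = weight_vectors([["a", "b", "a"], ["b", "c"]])
print(weights[0]["relative_frequency"])
```

A term that occurs in every document receives a TFiDF weight of zero under this variant, which is why such terms contribute little to distinguishing categories.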
Automatic text classification is used to classify texts in many languages, including Arabic. Arabic is the native language of more than 300 million people and is widely spread across the world.
Recently, much research has been published on machine learning algorithms for the classification of Arabic text. El-Kourdi et al. used Naïve Bayes to classify Arabic documents automatically. Sawaf et al. used a statistical approach based on Maximum Entropy to classify and cluster news articles. Sawaf et al. also described a method based on Association Rules to classify Arabic documents. Al-Harbi et al. compared the SVM algorithm and the Decision Tree algorithm. Al-Kabi and Al-Sinjilawi compared the Vector Space Model and Naïve Bayes for the classification of Arabic documents. Khreisat compared the KNN and SVM algorithms. Kanaan et al. used a Naïve Bayesian classifier to classify Arabic texts distributed equally among several categories.
Different machine learning algorithms applied to Arabic texts have produced different classification accuracies. The most popular machine learning algorithms for classifying Arabic documents, based on the most frequent selection methods (CHI, TF, DF, IG and None), are C5.0, SVM, NB, C4.5 and KNN, respectively.
3. Proposed Work
In this paper, Arabic texts are classified in three main phases: pre-processing, feature extraction and classification. In the pre-processing stage, noisy data such as numbers, punctuation, kashida, stop words and diacritics are removed. In the feature extraction stage, features are identified while learning the train set, and a Learning Dataset file is built. This file contains each word only once, together with its highest repetition value and the corresponding category (for a word that occurs in several categories, only the category with the highest repetition is kept). In the classification stage, the HRWiTD algorithm classifies each text in the test set by matching its words against the words in the Learning Dataset file to obtain a predicted classification (category) for each word. When more than two thirds (66.67%) of the words in a text have undefined categories, the classification of the text is ambiguous and a particular classification is difficult to determine. The “General” category includes all types of texts, some of which may belong to a specific category and some to an unspecified one; the best classification for an ambiguous text is therefore “General”. Accordingly, in the suggested approach, if the average of the total repetition of all words in a text that have a predetermined classification (category) is greater than one third (33.33%), the expected classification of the text is the category with the largest number of words. Otherwise, the proposed classification is “General”.
The accuracy of classification with the HRWiTD algorithm is evaluated through the confusion matrix. This method compares the predicted classification of each text with its actual classification (one of six categories) in the Arabic news corpus (SPA).
This section describes the main stages of the classification of Arabic texts in detail. Figure 1 shows the stages, which include data collection, document preprocessing, data division, feature extraction from the train set, filtering, feature extraction, data representation, applying the HRWiTD algorithm, and performance evaluation.
Figure 1. Arabic text classification stages.
3.1. Data Collection
Data collection is the first and a very important stage in the classification of Arabic texts. We chose an Arabic newswire source, the Saudi Press Agency (SPA), which includes convergent samples of six categories. We chose SPA for two reasons: the actual classification (category) is available for each text in the corpus, and SPA texts are available on the Web. SPA statistics are shown in Table 1.
3.2. Documents Preprocessing
Pre-processing improves the classification of text documents by removing worthless data. Such data may include numbers, punctuation, kashida, Hamza, diacritics, and stop words. Some words, such as prepositions and pronouns, do not belong to any classification, so we add them to a stop word list (see Table 2). Pre-processing also normalizes text documents by changing TaaMarboutah “ة” to “ا”. The ATC Tool is used to remove worthless data from the selected corpus.
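The removal steps described in this section could be sketched as follows. This is only an illustration under stated assumptions, not the ATC Tool itself: the function name `preprocess`, the Unicode ranges chosen for diacritics and kashida, and the five-word stop list are hypothetical, and Hamza handling and TaaMarboutah normalization are deliberately left out.

```python
import re

# Arabic diacritics (U+064B to U+0652) and kashida/tatweel (U+0640).
DIACRITICS_AND_KASHIDA = re.compile(r"[\u064B-\u0652\u0640]")
# Everything that is neither an Arabic letter nor whitespace:
# digits, punctuation (Arabic and Latin) and non-Arabic characters.
NON_ARABIC_LETTERS = re.compile(r"[^\u0621-\u064A\s]")

# Hypothetical, tiny stop-word list used only for this sketch;
# the paper's actual list is the one given in its Table 2.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return the list of remaining words after removing noisy data."""
    text = DIACRITICS_AND_KASHIDA.sub("", text)  # drop marks inside words
    text = NON_ARABIC_LETTERS.sub(" ", text)     # digits/punctuation -> space
    return [word for word in text.split() if word not in stop_words]


print(preprocess("ذَهَبَ الولد إلى المدرســة، 123!"))
```

Diacritics and kashida are removed before the other characters so that a word written with elongation or vowel marks collapses to the same form that appears in the stop-word list.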
3.3. Data Division
At this stage, the ATC Tool is used to divide the corpus into two partitions, the train set and the test set. The train set contains 70% of the selected corpus and the test set contains 30%, a division reported to give the best classification performance. The user can also select the percentages of the train set and the test set manually.
Table 1. SPA statistics of the selected corpus.
Table 2. Removable stop words.
(Suffix or prefix of singular/dual/plural/feminine/masculine with any of the stop words listed in this table) or (article with any of the stop words listed in this table).
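The data division stage could look like the sketch below. The helper name `split_corpus`, the random shuffle and the fixed seed are assumptions made for illustration; the text does not say how the ATC Tool chooses which texts go into each partition.

```python
import random


def split_corpus(texts, train_ratio=0.7, seed=0):
    """Shuffle the corpus and split it into (train_set, test_set).

    train_ratio is user-selectable, mirroring the manual percentage
    option described for the ATC Tool. The shuffle and seed are
    assumptions of this sketch.
    """
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_corpus([f"text_{i}" for i in range(10)])
print(len(train_set), len(test_set))
```

A split that preserves the number of texts per category would need to be applied to each category separately; this sketch does not do that.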
3.4. Feature Extraction
In this stage, we use data from the train set and the test set, from an internal or external source. Features are extracted and the repetition list of words is generated using the ATC tool. The tool lists and saves the repetitions of each word in all texts of the train set in a train list file, and likewise the repetitions of each word in all texts of the test set in a test list file. In addition, a field is added to both files to label the category of each word. The category of a word in the train list is its actual category. On the other hand, the category of a word in the test list is taken from the entry for the same word in the Learning Dataset file.
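As an illustration of the train list described above, the sketch below counts word repetitions and labels each row with the actual category of its text. The name `build_train_list` and the row layout `(word, repetition, category)` are hypothetical, and counting repetitions separately for each text is one possible reading of the description.

```python
from collections import Counter


def build_train_list(train_texts):
    """Build rows of (word, repetition, category) from the train set.

    train_texts is an iterable of (tokens, category) pairs. Repetitions
    are counted per text here, which is an assumption of this sketch.
    """
    rows = []
    for tokens, category in train_texts:
        for word, repetition in Counter(tokens).items():
            rows.append((word, repetition, category))
    return rows


rows = build_train_list([
    (["goal", "match", "goal"], "Sport"),
    (["goal", "bank"], "Economic"),
])
for row in rows:
    print(row)
```

The same word can appear in several rows with different categories; resolving those duplicates is the job of the next stage.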
3.5. Filtering
At this stage, the train list file is filtered by removing duplicate words together with their classifications. For each word, the entry with the highest repetition is kept with its related data (repetition number and category), and the entries for the same word with lower repetition are deleted.
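The filtering rule can be sketched in a few lines. The function name `build_learning_dataset` and the input layout of `(word, repetition, category)` rows are hypothetical; keeping the first entry when two repetitions are equal is also an assumption, since the text does not say how ties are resolved.

```python
def build_learning_dataset(train_rows):
    """Keep, for each word, only the entry with the highest repetition.

    train_rows is an iterable of (word, repetition, category) rows.
    Returns a dict mapping word -> (repetition, category). On a tie,
    the entry seen first is kept (an assumption of this sketch).
    """
    best = {}
    for word, repetition, category in train_rows:
        if word not in best or repetition > best[word][0]:
            best[word] = (repetition, category)
    return best


learning_dataset = build_learning_dataset([
    ("goal", 5, "Sport"),
    ("goal", 2, "Economic"),
    ("bank", 1, "Sport"),
    ("bank", 4, "Economic"),
])
print(learning_dataset)
```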
3.6. Data Representation (Train Set/Test Set)
At this stage, the train list file produced by the filtering stage is formatted into the Learning Dataset file. The test list file produced by the feature extraction stage is used for classifying texts with the HRWiTD algorithm. The data are represented as an array with n rows and m columns, where rows correspond to the words in a text and columns correspond to repetition and category.
3.7. Classification Algorithm (HRWiTD)
In this step, the Learning Dataset file produced by the data representation stage and the test list file are used in the classification algorithm (HRWiTD). The test list file stores the predicted classification (obtained from the Learning Dataset file) for all words in each text. The predicted classification file stores the predicted classification of all test texts. Details of the HRWiTD algorithm are given in Figure 2.
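Based only on the prose description of the decision rule in this paper (not on Figure 2), the classification step might be sketched as below. The name `classify_text` and the dict layout of `learning_dataset` (word mapped to a `(repetition, category)` pair) are hypothetical. Two points are assumptions of this sketch rather than statements of the paper: every occurrence of a word counts as one vote for its category, and a share of exactly one third falls back to “General”.

```python
from collections import Counter


def classify_text(tokens, learning_dataset, threshold=1 / 3):
    """Predict a category for one text from its preprocessed tokens.

    learning_dataset maps word -> (repetition, category). Words missing
    from it have an undefined category and cast no vote.
    """
    if not tokens:
        return "General"
    # One vote per word occurrence whose category is known.
    votes = Counter(
        learning_dataset[word][1] for word in tokens if word in learning_dataset
    )
    known_share = sum(votes.values()) / len(tokens)
    if known_share > threshold:
        # Category with the largest number of words in the text.
        return votes.most_common(1)[0][0]
    return "General"


dataset = {"goal": (5, "Sport"), "match": (3, "Sport"), "bank": (4, "Economic")}
print(classify_text(["goal", "match", "bank", "weather"], dataset))
print(classify_text(["goal", "weather", "rain", "cloud"], dataset))
```

In the first call three of the four words have a known category, so the text receives the category with the most votes; in the second only one word in four is known, so the text falls back to “General”.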
3.8. Performance Evaluation
The performance of the HRWiTD algorithm in classifying texts has been evaluated using the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized and makes confusion between classes easy to identify, e.g. when one class is commonly mislabeled as another. Most performance measures are computed from the confusion matrix. The predicted classification is assigned by the HRWiTD algorithm, and the confusion matrix evaluates performance by comparing the actual and predicted classifications (see Table 3).
Figure 2. HRWiTD algorithm to classify Arabic texts.
Table 3. Confusion matrix.
Entries in the confusion matrix have the following meaning in the context of our study:
・ True negative (TN) is the number of correct predictions that an instance is negative.
・ False positive (FP) is the number of incorrect predictions that an instance is positive.
・ False negative (FN) is the number of incorrect predictions that an instance is negative.
・ True positive (TP) is the number of correct predictions that an instance is positive.
・ Total is the sum of all the above variables; see Equation (1):
$$\mathrm{Total} = TN + FP + FN + TP \tag{1}$$
Overall, the accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using Equation (2):
$$AC = \frac{TN + TP}{TN + FP + FN + TP} \tag{2}$$
・ There are two possible predicted classifications: “Positive” and “Negative”. If we are predicting a target classification of a text (e.g. “Sport”), “Positive” means the text belongs to that target classification, and “Negative” means it does not.
・ For each of the six categories, the classifier (HRWiTD algorithm) makes a total of 420 predictions (the test data, out of the 1421 texts), with 70 texts per category.
・ Out of those 420 cases, the classifier predicted “Positive” FP + TP times and “Negative” TN + FN times.
・ In reality, FN + TP texts belong to the target classification, and TN + FP texts do not.
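The confusion matrix entries and the accuracy for one target category could be computed as in the following sketch. The function names and the toy labels are hypothetical; accuracy is taken as the share of correct predictions, (TN + TP) divided by the total, as described in the text.

```python
def confusion_counts(actual, predicted, target):
    """Return (TN, FP, FN, TP) for one target category.

    A text is "Positive" when its label equals the target category
    and "Negative" otherwise.
    """
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1
        elif p == target:
            fp += 1
        elif a == target:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Proportion of all predictions that were correct."""
    total = tn + fp + fn + tp
    return (tn + tp) / total


actual = ["Sport", "Sport", "Economic", "Social"]
predicted = ["Sport", "Economic", "Economic", "Sport"]
counts = confusion_counts(actual, predicted, target="Sport")
print(counts, accuracy(*counts))
```

Repeating the computation with each of the six categories as the target gives one accuracy value per category.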
4. Experimental Results and Discussion
The HRWiTD algorithm is used to classify Arabic texts. The confusion matrix method was used to determine its classification accuracy, which is 86.84% in this experiment (see Table 5 for details). For comparison, the same data set was applied to several well-known classification techniques: models were developed with the C5.0, decision tree C4.5, NB, KNN and SVM classifiers (created using RapidMiner 5.0). We tested the performance of the models on the test set and evaluated their accuracy with a cross-validation technique, setting the number of validations in the X-Validation operator. These classifiers were evaluated with two advanced term selection methods, CHI square (CHI) and information gain (IG), and with different weighting methods (Boolean, Entropy, Frequency, LTC, Relative Frequency, TFC and TFiDF). Moreover, two simple term selection methods, TF (term frequency) and DF (document frequency), were used. The top 10, 15, 20, 25, and 30 terms for each classification in the dataset were selected as the representative terms, based on their TF and DF values. The classification accuracy results for the classifiers are shown in Table 4.
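Selecting the top terms of each category by TF or DF, as described above, could be sketched as follows. The function name `top_terms` and its parameters are hypothetical, and this is not the RapidMiner process used in the experiments; it only illustrates the two simple selection criteria.

```python
from collections import Counter, defaultdict


def top_terms(train_texts, k=30, method="DF"):
    """Return {category: [top-k terms]} ranked by TF or DF.

    train_texts is an iterable of (tokens, category) pairs. With "TF"
    every occurrence of a term counts; with "DF" each text counts at
    most once per term.
    """
    scores = defaultdict(Counter)
    for tokens, category in train_texts:
        scores[category].update(tokens if method == "TF" else set(tokens))
    return {
        category: [term for term, _ in counter.most_common(k)]
        for category, counter in scores.items()
    }


texts = [
    (["goal", "goal", "goal", "match"], "Sport"),
    (["match", "team"], "Sport"),
]
print(top_terms(texts, k=1, method="TF"))
print(top_terms(texts, k=1, method="DF"))
```

The two criteria can disagree: in the toy data, one term is repeated often inside a single text and leads by TF, while another term appears in more texts and leads by DF.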
The data in Table 4 show that the C5.0 classifier gives the best classification accuracy, followed by KNN, NB and then SVM, with C4.5 the worst. Across the different classifiers, Boolean, Frequency, and TFiDF proved to be the weighting methods that gave the best classification accuracy on the data set.
Table 4 gives the machine learning settings that produced the best classification accuracy for each classifier:
・ C5.0 classifier: representation = Frequency, training size = 70% with DF, term selection = CHI square, and terms = top 30 terms of each category.
・ KNN classifier: representation = Frequency, training size = 70% with DF, term selection = CHI square, and terms = top 30 terms of each category.
・ NB classifier: representation = TFiDF, training size = 70% with DF, term selection = IG, and terms = top 30 terms of each category.
・ SVM classifier: representation = LTC, training size = 70% with TF, term selection = CHI square, and terms = top 30 terms of each category.
・ C4.5 classifier: representation = Boolean, Frequency, TFiDF, training size = 70% with TF and DF, term selection = CHI square and IG, and terms = all top terms of each category.
Table 5 shows the detailed results of the two best classification techniques, the C5.0 classifier and the HRWiTD algorithm. The average accuracy over the six categories is calculated using Equation (3), where the per-category accuracies are AC (Culture), AC (Economic), AC (General), AC (Political), AC (Social) and AC (Sport):
$$\overline{AC} = \frac{AC(\mathrm{Culture}) + AC(\mathrm{Economic}) + AC(\mathrm{General}) + AC(\mathrm{Political}) + AC(\mathrm{Social}) + AC(\mathrm{Sport})}{6} \tag{3}$$
The table also shows the accuracy of C5.0 for those six categories, with its best result obtained using the Frequency weighting method, CHI for term selection, and DF with 30 terms per category. The train set was 70% for both techniques. The HRWiTD algorithm achieved a total accuracy of 86.84%, better than C5.0 and the other classification techniques. However, C5.0 was better than the HRWiTD algorithm at classifying the “General” category.
Table 4. Results of the best classification accuracy for different classifier techniques.
Table 5. The best results of classification accuracy C5.0 classifier and HRWiTD algorithm.
5. Conclusion
In summary, this paper classifies Arabic texts automatically using the HRWiTD algorithm. We applied it to 1421 Arabic newswire texts from the Saudi Press Agency (SPA). The corpus includes convergent samples of six categories (culture, economic, public, political, social, and sports). The average overall classification accuracy for the six categories is 86.84%, evaluated using the confusion matrix method. The classification technique is built on three main phases: preprocessing, feature extraction and classification using the HRWiTD algorithm. The repetition for a predetermined category of each word in the text is calculated. If the average of the total for those words is less than 33.33%, the expected classification of the text is the “General” category; otherwise, it is the category with the largest number of words. We compared the accuracy of the proposed algorithm (HRWiTD) with that of the most popular techniques: the accuracies of the C5.0, KNN, SVM, NB and C4.5 classifiers are 52.86%, 52.38%, 51.90%, 51.90% and 30%, respectively. These techniques performed best when using advanced term selection methods (CHI, IG, None), different weighting methods (Boolean, Entropy, Frequency, LTC, Relative Frequency, TFC and TFiDF), and two simple term selection methods (TF and DF). We therefore conclude that, in the selected domain, the HRWiTD algorithm is the best technique for classifying Arabic texts. In addition, it gives the best classification accuracy for each individual classification except the “General” category.
In future work, first, the HRWiTD algorithm needs to be improved to classify all text categories better; here we cover only six categories, and texts from other categories were assigned to the general category. Second, the experimental corpus needs to be extended with different resources to demonstrate efficiency. In this research, we applied the proposed algorithm to 1421 texts, and the categories of a number of words in those texts are unknown, which can lead to poor classification. The corpus must therefore be much larger to obtain the best learning.
This research is supported by the German Academic Exchange Service (DAAD).