Wednesday, April 3, 2019

Improving the Accuracy of Arabic DC System

The main goal of this research is to investigate and develop appropriate text collections, tools, and procedures for Arabic document classification (DC). The following specific objectives have been set to achieve this goal:

To examine the impact of preprocessing tasks, including normalization, stop word removal, and stemming, on the accuracy of an Arabic DC system.

To introduce a novel technique for Arabic stemming in order to improve the accuracy of the document classification system. The new stemming algorithm tries to overcome the deficiencies of state-of-the-art Arabic stemming techniques by dealing with multiword expressions (MWEs) and foreign Arabized words, and by handling the majority of broken plural forms so that they are reduced to their singular form.

To use Arabic text summarization as a feature reduction technique that eliminates noise in the documents and selects the most salient sentences to represent the original documents.

To explore the impact of different feature selection techniques on the accuracy of Arabic document classification, and to propose and implement a new variant of the Term Frequency Inverse Document Frequency (TFIDF) weighting method that takes into account the importance of the first appearance of a word and the compactness of the word, both of which can be taken as factors that determine the important features in the document.

To implement various classifiers and compare their performance.

1.1. Problem Statement

Despite the achievements in document classification, the performance of document classification systems is far from satisfactory. Document classification tasks are characterized by natural language. This means DC is closely related to natural language processing (NLP), which requires knowledge of the subject area. In general, natural language exhibits many syntactic and semantic ambiguities in addition to its complexities [45].
In the context of DC, a researcher tries to address various problems arising from the characteristics of documents during feature extraction and feature representation, or problems emanating from the classification algorithms. The following sections outline the research problems.

1.1.1. Preprocessing Text Problem

The preprocessing stage is a challenge, and it can affect the performance of any DC system either positively or negatively. Improving the preprocessing stage for a highly inflected language such as Arabic will therefore enhance the efficiency and accuracy of an Arabic DC system. Despite the lack of standard Arabic morphological analysis tools, most previous studies on Arabic DC have proposed preprocessing tasks to reduce the dimensionality of feature vectors without comprehensively examining their contribution to the effectiveness of the DC system.

One of the challenges facing researchers in Arabic document classification is the absence of a strong and effective stemming algorithm. Arabic is a morphologically complex language [46]; it uses both inflectional and derivational morphology. Because of these two types of morphology, a single word may yield hundreds or even thousands of variant forms [47]. Stemming matters in document classification because it makes the process less dependent on particular forms of words and reduces the high dimensionality of the feature space, which in turn enhances the performance of the classification system. In spite of the rapid research conducted in other languages, Arabic still suffers from a shortage of research and development. State-of-the-art Arabic stemmers suffer from high stemming error rates due to understemming errors, overstemming errors, and their failure to handle multiword expressions (MWEs), broken plural forms, and Arabized words.
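As a rough illustration of two of the preprocessing tasks named in this section, normalization and stop word removal, the following Python sketch may help. It is not the preprocessing pipeline used in this research: the normalization rules, the tiny stop word list, and the function names `normalize` and `preprocess` are all assumptions chosen for illustration, and stemming is deliberately left out.

```python
import re

# Assumed normalization rules (common in Arabic text processing, not taken from the source).
DIACRITICS = re.compile("[\u064B-\u0652]")    # tanween, short vowels, shadda, sukun
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
TATWEEL = "\u0640"                            # elongation character


def normalize(text):
    """Strip diacritics and tatweel, then unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)  # all alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> yeh
    text = text.replace("\u0629", "\u0647")   # teh marbuta -> heh
    return text


# Hypothetical, deliberately tiny stop word list, stored in normalized form
# so that it matches the normalized tokens.
STOP_WORDS = {normalize(w) for w in ["في", "من", "على", "إلى", "عن"]}


def preprocess(text):
    """Normalize, split on whitespace, and drop stop words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]
```

Storing the stop words in normalized form is a small design choice that avoids mismatches such as a stop word written with hamza failing to match its normalized token.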
The limitations of the current Arabic stemming methods have therefore motivated this author to investigate a novel technique for Arabic stemming, to be used for extracting the roots of Arabic words in order to improve the accuracy of the document classification system (Chapter 5).

1.1.2. High Dimensionality of the Feature Space

Extremely high-dimensional feature spaces and large volumes of data are recurring problems in automatic document classification. High dimensionality arises because the number of features used in the classification process grows along with the dimensionality of the feature vectors [13, 15, 48, 49]. Practical examples show that the number of features making up the feature space can run into the thousands.

A large number of these features are irrelevant to the classification task, and there are several reasons to remove them when this can be done without affecting classification accuracy. First, the performance of some classification algorithms is negatively affected when dealing with a high-dimensional feature space. Second, over-fitting may occur when the classification algorithm is trained on all features. Finally, some features are common and occur in all or most of the categories [50].

To solve this problem, the dimensionality of the feature vector must be reduced without degrading classification performance. It is important to extract the features with high discriminating power using various techniques. Text summarization, feature selection, and feature weighting are common techniques used in document classification to reduce the high dimensionality of the feature space and to improve the efficiency and accuracy of the classification system.
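To make the idea of keeping only features with high discriminating power more concrete, here is a minimal Python sketch of feature selection with the standard chi-square statistic. It is an illustration under stated assumptions, not the implementation used in this research: the function names `chi_square` and `select_features`, the choice of taking the maximum score over classes, and the toy documents are all hypothetical.

```python
def chi_square(docs, labels, term, cls):
    """Chi-square statistic for one (term, class) pair from a 2x2 contingency table.

    a: docs in cls containing term      b: docs outside cls containing term
    c: docs in cls without term         d: docs outside cls without term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == cls:
            if has_term:
                a += 1
            else:
                c += 1
        else:
            if has_term:
                b += 1
            else:
                d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is absent or universal: no discriminating power
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over any class."""
    vocab = {t for doc in docs for t in doc}
    classes = set(labels)
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Toy example with placeholder tokens (not real data).
toy_docs = [{"goal", "match"}, {"goal", "team"}, {"market", "bank"}, {"market", "team"}]
toy_labels = ["sport", "sport", "economy", "economy"]
top_terms = select_features(toy_docs, toy_labels, 2)
```

In this toy example the term "team" occurs equally often in both classes, so its score is zero and it is dropped, which is the behavior described above for features that are common to most categories.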
Term frequency (TF) weighted by inverse document frequency (IDF), abbreviated as TFIDF, can partially solve the problem of variation in content and length across documents, but it cannot account for how the important words are distributed within a document. In general, a document is written in an organized manner to describe its main idea(s). For example, the main topic of a news article may be mentioned in the title and the first part of the document to draw the reader's attention. Therefore, depending on their location, different parts of a document may contribute to its main topic(s) to different degrees [51]. In this dissertation, we propose new feature weighting methods that address the distribution of important words within the document (Chapter 6).

To meet the objectives stated in this research, the research questions of this study can be summarized as follows:

What is the impact of text preprocessing techniques such as normalization, stop word removal, and stemming on the performance of an Arabic DC system? Which Arabic text preprocessing methods are to be implemented in this research? What are their advantages and disadvantages? How can their performance be compared and improved in order to improve the accuracy of the Arabic document classification system?

What is the impact of feature reduction techniques on Arabic document classification? How can the high dimensionality of the feature space, and the difficulty of selecting the features that matter for understanding a document, be overcome?

Which classification algorithms perform best when applied to different representations of an Arabic dataset?

1.2. Research Contribution

This research focuses on exploring different preprocessing and dimensionality reduction techniques and investigating their effect on Arabic document classification performance.
More specifically, the main contributions of this thesis are as follows:

We demonstrate that preprocessing tasks such as normalization, stop word removal, and stemming have a positive impact on classification accuracy for Arabic datasets, especially given the complicated morphological structure of the Arabic language. Furthermore, we demonstrate that choosing appropriate combinations of preprocessing tasks provides a significant improvement in document classification accuracy, depending on the feature size and the classification technique.

We propose a novel stemmer for Arabic document classification. The proposed stemmer attempts to overcome the weaknesses of root-based stemming and light stemming, in addition to dealing with the majority of broken plural forms, MWEs, and foreign Arabized words. We compare the proposed stemmer with well-known Arabic stemmers, including root-based stemming (the Khoja stemmer) and light stemming (the Larkey stemmer), to study its contribution to improving the classification system.
The comparison is carried out for different datasets, classification techniques, and performance measures.

We demonstrate that document summarization helps to improve the efficiency of Arabic document classification by reducing the high dimensionality of the feature space without affecting the value or content of the documents, thereby saving memory space and execution time in the classification process.

We investigate the impact of different feature selection techniques, namely Information Gain (IG), the Ng-Goh-Low (NGL) coefficient, Chi-square testing (CHI), and the Galavotti-Sebastiani-Simi (GSS) coefficient, which have a significant impact on reducing the dimensionality of the feature space and thus on improving the performance of the Arabic document classification system.

We investigate the impact of feature representation schemes on the accuracy of Arabic document classification. A document usually consists of several parts, and the important features that are most closely associated with its topic tend to appear in the first parts or to be repeated across several parts of the document. The proposed weighting methods therefore take into account the importance of the first appearance of a word and the compactness of the word, both of which can be taken as factors that determine the important features in the document.

Unfortunately, there is no free benchmark dataset for Arabic document classification. One of the aims of this research is to compile a dataset for Arabic document classification that covers different text genres. It will be used in this research and can serve in the future as a benchmark for computational linguistics research, including text mining and information retrieval. The dataset was collected from several published papers on Arabic document classification and by crawling well-known and reputable Arabic websites.
Compiling free and publicly available corpora is a step forward for the field of Arabic document classification.
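The idea of letting the first appearance of a word influence its weight, mentioned in the contributions above, can be sketched roughly as follows. This is not the weighting method proposed in the thesis, whose formula is not given in this text: the function name `first_appearance_tfidf` and the linear position boost are purely hypothetical, and the compactness factor is not modelled at all.

```python
import math


def first_appearance_tfidf(docs):
    """Return one {term: weight} dict per document (each document is a list of tokens).

    weight = tf * idf * boost, where boost (an assumed, illustrative factor)
    is 2.0 for a term that first appears at the very start of the document
    and decays linearly towards 1.0 for terms that first appear at the end.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    weighted = []
    for doc in docs:
        weights = {}
        for pos, term in enumerate(doc):
            if term in weights:
                continue  # only the first appearance determines the boost
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            boost = 1.0 + (1.0 - pos / len(doc))  # hypothetical position factor
            weights[term] = tf * idf * boost
        weighted.append(weights)
    return weighted


# Toy example with placeholder tokens (not real data).
toy_weights = first_appearance_tfidf([["x", "y"], ["z", "c"], ["c"]])
```

In the toy example, "x" and "y" have the same term frequency and document frequency, so the only difference between their weights comes from where each word first appears.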
