CS  Vol.7 No.8 , June 2016
Content Based Segregation of Pertinent Documents Using Adaptive Progression
ABSTRACT
Due to the emerging technology era, today a number of firms share their service/product descriptions. Such a group of information in the textual form has some structured information, which is beneath the unstructured text. A new attainment which facilitates the form of a structured metadata by recognizing documents which are likely to have some type and this information is then used for both segregation and search process. The idea of this advent describes some attributes of a text that will match with the query object which acts as identifier both for segregation as well as for storage and retrieval. An adaptive technique is proposed to deal with relevant attributes to annotate a document by satisfying the users querying needs. The solution for annotation-attribute suggestion problem is not based on the probabilistic model or prediction but it is based on the basic keywords that a user can use to query a database to retrieve a document. Experiment results show that Querying value and Content Value approach is much useful in predicting a tag for a document and thus prediction is also based on Querying value and Content value which greatly improves the utility of shared data which is a drawback in the existing system. This approach is different, as we consider only the basic keywords to be matched with the content of a document. When compared with other approaches in the existing system, Clarity is a primary goal as we expect that the annotator may improve the annotations on process. The discovered tags assist on quest of retrieval as an alternative to bookmarking.

Received 21 April 2016; accepted 10 May 2016; published 22 June 2016

1. Introduction

Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intellect, contraption erudition information and database system [1] . The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for foreside from the raw investigation step, it involves Database and data management aspect, data pre-processing, model and inference considerations, interestingness metrics, complexity consideration, post-processing of discovered structures, visualization, and online updating [1] [2] . Many existing organization share their descriptions about products and services. For illustration, Scientific networks, social networks or disaster management group share their information. Prevailing technologies like content management software (e.g.: -Microsoft Share point) allows users to share documents and tag them in a improvised manner Like that, Google Base allows users to define objects for them either by choosing from predefined or to define their own attributes. This process may facilitate subsequent information discovery. Many annotation systems provide a single way for annotation:”un typed” annotation. Consider that a user may annotate a weather report using a tag such as “Storm Category 5” [2] . In general the most effective expressive annotation strategies use ”attribute-value” pairs as they can contain more un typed information than un typed approaches. In such cases, the above information can be entered as (Storm Category, 5). A most recent work in using the most expressive queries is the “pay-as-you-go” querying strategies in Data space in which users provide the data integration hints at the query time [3] [4] . In such hypothesis based system, the structured information is already present and the difficulty is with matching the source attributes with the query attributes.

In this paper, a cost segregation approach which is similar to CADS (collaborative Adaptive Data Sharing Platform) is proposed. It is an approach which has two ways of segregating a document [8] .

1) By “annotate-as-you-create” platform which facilitates fielded data annotation.

2) By “automated annotation” of a document with its content. In contrast, we are to segregate a document based on its content towards to generate attribute values for attributes that are often used by querying users.

CAD Sharing Platform Approach is to stimulate and lower the cost of creating nicely annotated chronicles that can be promptly useful for commonly furnished semi structured queries such as the ones in Figure 4. Our key intention stimulate the annotation of the chronicles at creation time, while the originator is still in the document creation stage, despite the fact the techniques can also be used for post generation document annotation. In our framework the originator generates a new document and uploads it to the warehouse. After the upload, CADS examines the text and originates the adaptive insertion form. The form holds the best attribute names chronicle text and information need (query workload), and the most feasible attribute values given the chronicle text. The originator can inspect the form, recast the generated metadata as necessary and acquiesce the annotated chronicle for storage.

2. Related Work

However, for human beings it’s simple to judge whether two words are similar or not. But for a computer, it’s a difficult task, which involves psychology, philosophy, artificial intelligence and other fields of knowledge. Hence a computer being a syntactic machine, it cannot understand the semantics [9] . In order to that semantic associated with words or their meanings are to be represented as syntax. Semantic similarity measurement between words or their meaning is a basic research area in the fields of natural language processing, intelligent retrieval, document clustering, document classification, automatic question answering, word sense disambiguation, machine translation etc. The basic process of data processing is represented in Figure 1. Almost all existing studies

Figure 1. Data processing.

on semantic similarity approaches are only concerned with supervised metrics that are low term coverage and difficult to update [10] .

Collaborative annotation Systems like IBM MPEG-7 tool favors this type of annotation for an object and uses the previously used tags for annotations of new objects. An eloquent amount of work has been done to predict tags for documents or resources like web pages, images, videos [13] , from the user’s perspective and involvement, this approach takes different forms on what is anticipated as an input to the system. However the goals are similar to predict the missing tags that are related to an object. The solution that they have proposed is based on a probabilistic framework that considers the evidences in the document content in the query workload [11] . There are two ways to combine these two pieces of evidence, content value and querying value. A model that considers both the components conditionally independent and a linear weighted model. A technique which suggests attributes that improve the visibility of the documents with respect to the query workload by up to 50% was proposed. The method what they had used to extract the structured information from unstructured text is CADS approach. The CADS approach is the collaborative adaptive data sharing platform which is an “annotate- as-you-create” infrastructure that facilitates field the data annotation. Figure 1 illustrates the basic steps of data processing.

The consolidated model for CADS is quiet similar to a data space, in which a heterogeneous source is proposed for a loosely integrated model [12] . A cardinal difference is that data space use blending of existing annotations to produce solutions for a query. But our work evinces the appropriate annotations at the insertion time, by considering the query workload to identify the important attributes to add. A real Time application-Google Base A real time application which is a related data model is Google-Base [13] , in which users can specify attribute/value pairs of their desire. Information Extraction Information extraction is mainly related with the context of value suggestion for the computed attributes. We can classify the IE into two namely open IE closed IE. Closed IE is much cumbersome but open IE is close to CADS approach. In recent years, for document clustering semantic similarity between words or terms has become an increasingly important research topic in data mining. Semantics identify concepts which allow for the extraction of information from data and looking for the meaning of documents or queries concepts need to be captured. It plays an important role in underlying higher level application and become a key point in research.

3. Limitations in the Existing System

Our inspiring frame work is a disaster management situation, inspired by the experience in building business continuity information network [3] for disaster situations in South Florida. During calamities, we have many users and concerns proclaiming and ingesting information. For example, in a hurricane situation, local government firms report shelter location, damages in structures, or structural warnings. Many algorithms and approaches about semantic similarity measurement between words and their meaning are in the range to improve judgment between documents. The semantic similarity approaches or metrics between words and their meanings can be categorized as supervised metrics and unsupervised metrics. The resource based metrics and knowledge-rich text mining requires such human resources and is referred as supervised metrics. Some problems may rise due to existing algorithms that are low in their efficiency due to their technology lag, tedious task, time consuming and resource restrictions. Because of the vastly available documents and high growth of the document both in size and number, it’s difficult to analyze each document separately and directly. It requires lot of human resources.

In Figure 2, it shows a report extracted from the National Hurricane Center repository, narrating the status of a hurricane event in 2008. The report gives the storm location, wind speed, warnings, category, advisory identifier number, and the date it was revealed. Despite the fact, this is a text chronicle; it contains absolutely many attributes names and values, for example, (storm category).

In Figure 3 we could improve the standard of the penetrating through the database. For occurrence, Figure 4 shows three specimen queries for which the report of is a good answer and the lack of the appropriate annotations makes it hard to retrieve it and rank it properly.

The drawbacks of the existing system are the user’s interest towards the attribute suggestion is considered at last not at first. To automate this CADS process we need to have a large database. Some users might find the attribute suggested and their values to be useful for storing the document. Some users might find the attribute suggested and their values do not much useful for storing the document. As, humans interest differs from person to person and the importance what they give will also differ.

4. Proposed Work

Most recently there has been a huge research interest in developing web based similarity measures. But in this

Figure 2. Example of an unstructured document.

Figure 3. Desirable annotation for the document.

Figure 4. Quires that can benefit from the annotations.

Figure 5 shows the process of the analyzing and segregating the document based upon the users input. One objective is to support planning of the knowledge discovery process and buildings of workflows for a user task. The second objective is to support the meta-mining. The Existing language resources and their algorithms are time consuming and it demands a human resources. To address this problem we propose unsupervised semantic similarity computations between the words, their relevant meanings and terms in the document. This algorithm work automatically and does not require any human supervision. The knowledge for annotation process to save their document is quite not much needed. In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive. With the state of art supervised semantic algorithms that employ language specific knowledge resource and also retrieved useful documents. The advantage of the proposed frameworks is that even if the meaning is known we can get our required documents. If the word is also known we can get the exact documents. Search result is accurate when compared with the results given by the operating systems.

5. Experimental Evaluation

The effect of word search with the existing operating systems is initially investigated for the search results. Basically the measures can be classified in to supervised measure and unsupervised measure. Supervised measure uses hand grafted resources like ontology. Various research journals and picked out several unsupervised

Figure 5. Process of the adaptive approach.

similarity and distance measure that play a vital role in Data Object Clustering. Those similarity measures and highlighting their merits was the mentioned tasks of information retrieval. The best measures which provide efficiency and accuracy in the tasks of information retrieval.

5.1. Attribute Suggestion

1) OPT Full Match―It is a technique which uses the subset of the ground truth attributes for each document that satisfies the maximum number of queries. It is an NP-hard problem. For simple workload it works well at the same time for a huge workload it takes some significant amount of measurable time.

2) OPT Partial Match―It is a technique which maximizes the number of query conditions satisfied. It is found by making a single pass on workload.

Table 1. Specifying attributes.

Figure 6. Provide the string for searching.

5.2. Keyword Fetching and Matching

Initially, the user has to specify his interest as either he is going to search with Meaning or Word as a string. If he doesn’t know the exact word that he is going to search he can use the application to get the meaning of that word and then he can search the corresponding string. If he knows that he is going to search with meaning, then he can give the string as it is and can search. This plays a dual part in which depending upon the users choice the search or retrieve process initiates.

5.3. Segregation and Annotation of the Document

5.4. Document Retrieval

Figure 7. Semantic analysis of the keyword.

relevant type if it is identified or else as a whole it is processed. The user has to give the desired word for search. Figure 8 shows the process of obtaining the required document based on search.

After that, the search process uses an information extraction algorithm. It retrieves the required documents as either in T [Good] or T [Best]. In general, every day new words may arise. So, the database has to be frequently updated with the words and their meanings.

6. Performance Analysis

An option for the user’s interest is provided. If a person is interested to search for a meaning, they can give the search string directly. If the choice is word and if they don’t know the meaning, they can use the inbuilt search engine to first get the meaning and then they can search. In the search process too, they can limit the search by using the directory and sub folder selection. Figure 9 illustrates the performance analysis of the document based on the segregation using adaptive progression technique.

In addition, it is going to retrieve the documents alone. So, Unwanted time waste in searching the entire system for the particular word with images, video files and etc are avoided.

7. Conclusion and Future Work

In general, a search processed by an operating system doesn’t give importance for the word given. It instead, gives importance for the folders name or files name with which it is stored. It retrieves all the documents, files like images, power point presentations and so on. But, when we use this system to find the documents that exactly matches a word, it retrieves the exact documents that match with the keyword alone. So, unwanted files processing is avoided and it saves the time. The efficiency is more and only the exact documents are retrieved. Time to search the entire hard drive to unrelated documents is time consuming. So, we retrieve the documents alone by using this IE algorithm. It helps us to improve the accessibility for a document. Unwanted time waste in searching all the files will be eliminated. So, it gives the best result. Even if the meaning for a particular word is not known, we can use this application to retrieve the relevant meanings. For such users it will be very useful.

Figure 8. Obtaining the required document based on search.

Figure 9. Performance analysis of document.

As a future work, classifying documents as belonging to a particular domain by using the common words that we can use to have in a document can be performed. The domain classification is quite a tedious process because of the up gradation in technology on daily basis. It requires updating of a database on daily basis.

Cite this paper
Pitchandi, P. , Muthukumaravel, S. and Boopathy, S. (2016) Content Based Segregation of Pertinent Documents Using Adaptive Progression. Circuits and Systems, 7, 1856-1865. doi: 10.4236/cs.2016.78160.
References
[1]   (2011) Google. Google Base.
http://www.google.com/base

[2]   Jeffery, S.R., Franklin, M.J. and Halevy, A.Y. (2008) Pay-as-You-Go User Feedback for Data Space Systems. SIGMOD’08 Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 847-860.
http://dx.doi.org/10.1145/1376616.1376701

[3]   Jain, A. and Ipeirotis, P.G. (2009) A Quality-Aware Optimizer for Information Extraction. ACM Transactions on Database Systems, 34, Article 5.

[4]   J.M. Ponte and W.B. Croft (1998) A Language Modeling Approach to Information Retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), ACM, New York, 275-281.
http://dx.doi.org/10.1145/290941.291008

[5]   Chang, K.C.-C. and Hwang, S.-W. (2002) Minimal Probing: Supporting Expensive Predicates for Top-K Queries. Proc. ACM SIGMOD International Conference on Management Data, Madison, Wisconsin, 4-6 June 2002, 12 p.
http://dx.doi.org/10.1145/564691.564731

[6]   Heymann, P., Ramage, D. and Garcia-Molina, H. (2008) Social Tag Prediction. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08), ACM, New York, 531-538.
http://dx.doi.org/10.1145/1390334.1390425

[7]   Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W.C. and Giles, C.L. (2008) Real-Time Automatic Tag Recommendation. Proc. 31st Ann. Int’l ACM SIGIR. Conf. Research and Development in Information Retrieval (SIGIR’08), 515-522.
http://dx.doi.org/10.1145/1390334.1390423

[8]   Etzioni, O., Banko, M., Soderland, S. and Weld, D.S. (2008) Open Information Extraction from the Web. Communications of the ACM, 51, 68-74.
http://dx.doi.org/10.1145/1409360.1409378

[9]   Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M. and Shen, W. (2006) Community Information Management. IEEE Data Engineering Bulletin, 29, 64-72.

[10]   Chu, E., Baid, A., Chai, X., Doan, A. and Naughton, J. (2009) Combining Keyword Search and Forms for Ad Hoc Querying of Databases. Proceedings of ACM SIGMOD International Conference on Management Data, 349-360.
http://dx.doi.org/10.1145/1559845.1559883

[11]   Banerjee, J., Kim, W., Kim, H.J. and Korth, H.F. (1987) Semantics and Implementation of Schema Evolution in Object- Oriented Databases. Proceedings of ACM SIGMOD International Conference on Management Data, 16, 311-322.

[12]   Nandi, A. and Jagadish, H.V. (2007) Assisted Querying Using Instant-Response Interfaces. Proceedings of ACM SIGMOD International Conference on Management Data, 472-483.
http://dx.doi.org/10.1145/1247480.1247640

[13]   Yin, D., Xue, Z., Hong, L. and Davison, B.D. (2010) A Probabilistic Model for Personalized Tag Prediction. Proceedings of ACM SIGMOD International Conference on Knowledge Discovery Data Mining, Washington, DC, July 2010, 959-968.
http://dx.doi.org/10.1145/1835804.1835925

 
 
Top