Identifying the articles that laid the foundations of a specialty or a specific research topic has long been regarded as one of the essential objectives of a literature review (Hart, 1998). The literature review, necessary in any investigation, has been aptly defined by Webster and Watson (2002) as the analysis of the past that prepares the vision of the future every good scientific article should contain. Establishing the so-called “state of the art” has become an essential step in conducting research. Surprising as it may seem, the need for such reviews was recognized almost from the very appearance of scientific journals, centuries ago (Sciences, 1823).
One of the main objectives in preparing a state of the art is to identify those articles that have laid both the conceptual and the methodological foundations of a discipline, that is to say, those contributions that in fact “do not age” (Singer, 2009). It is therefore usual in the specialized literature to find studies that determine the seminal articles published in a given journal (Parkinson et al., 2013), the role of one such contribution in a particular discipline (Dolman, Miralles, & de Jeu, 2014) or in a specific technique (Nash, Walker, Gidwani, & Ajuied, 2015; Nash, Walker, Lucas, & Ajuied, 2016), or the most important contributions in a given branch of science (Riordon, Zubritsky, & Newman, 2000).
Identifying the so-called seminal articles has become a de facto standard when preparing a state of the art in the most diverse disciplines. To identify these articles of unquestionable significance (Berkani, Hanifi, & Dahmani, 2020; Silva, Villa, & Cabrera, 2020), different alternatives have been proposed, such as collaborative models (Wang & Blei, 2011) and personalized systems that recommend the most relevant articles (Pera & Ng, 2011). Less studied is how to identify these articles and their possible genealogy (Bae, Hwang, Kim, & Faloutsos, 2011, 2014). Today's researcher faces a volume of information that makes finding the most relevant works anything but simple, and doing so requires considerable time and effort (Alonso, Perez, & Hidalgo, 2016; Bravo Hidalgo & León González, 2018).
Within this problem area, this contribution started from the idea that seminal articles are recognized as such, and do not age, for two reasons:
1) They have been cited in a significant way, that is, they are recognized by the scientific community.
2) They remain valid for several years.
These two simple reasons should lead them to stand out as outliers in the space:
VY = f(C)
where VY is the Validity in Years of a given article, that is, the time elapsed from the publication of the article until the current date, and C is the number of citations received by the article during that period.
Data mining offers different possibilities for data analysis (Berkhin, 2006), including different techniques (Bakar, Mohemad, Ahmad, & Deris, 2006; Buthong, Luangsodsai, & Sinapiromsaran, 2013) and algorithms for the detection of atypical values (Ramaswamy, Rastogi, & Shim, 2000). At the same time, different applications have been developed that facilitate the use of data mining (Rangra & Bansal, 2014). Among these, Rapidminer offers a whole set of possibilities for data analysis (Amer & Goldstein, 2012; Jungermann, 2009), in particular for the detection of outliers (Buthong et al., 2013). An outlier has long been defined (Barnett & Lewis, 1974) as an observation, or set of observations, that appears to be inconsistent with the rest of the data set under analysis.
Based on these considerations, this contribution set out to determine whether seminal articles can be distinguished as outliers in the space VY = f(C), using the facilities offered by Rapidminer (https://rapidminer.com/) to classify them in that space. Another aspect that cannot be ignored is how the articles, and the number of citations received by each one, are determined. For this purpose, this research also explored how far the articles considered seminal coincide when Google Scholar (Martin-Martin, Orduna-Malea, Harzing, & Delgado López-Cózar, 2017) is used compared with another database widely recognized by the scientific community, Scopus (Burnham, 2006).
2. Material and Methods
To form the space VY = f(C), both Scopus and Google Scholar were searched for the following terms in English, in the title of the articles, for the period 1960-2019:
1) Knowledge management
2) Entrepreneurship
3) Marketing
For each of the search terms, the 990 most-cited articles were selected. These were exported to Excel using the export facilities offered by Scopus and by Publish or Perish (POP). The resulting database consists of the fields Cites, Authors, Title, Year, and Validity, the last calculated by subtracting the year of publication from the last year of the search range (2019).
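As an illustration of how the Validity field and a point in the space VY = f(C) can be derived from an exported record, a minimal Python sketch follows. The class name `Article`, the helper `to_point`, and the example record are hypothetical and are not taken from the study's data; only the field names and the reference year 2019 come from the text.

```python
from dataclasses import dataclass

# Last year of the search range stated in the text (1960-2019).
LAST_SEARCH_YEAR = 2019


@dataclass
class Article:
    """One exported record; field names mirror the fields listed in the text."""
    cites: int
    authors: str
    title: str
    year: int

    @property
    def validity(self) -> int:
        # Validity in years: last year of the search range minus publication year.
        return LAST_SEARCH_YEAR - self.year


def to_point(article: Article) -> tuple[float, float]:
    """Map an article to a (Cites, Validity) point in the space VY = f(C)."""
    return (float(article.cites), float(article.validity))


# Hypothetical record for illustration only.
example = Article(cites=1200, authors="Doe, J.", title="An example title", year=2001)
print(example.validity, to_point(example))
```

In this sketch an article published in the last year of the range has a Validity of 0, which follows directly from the subtraction described above.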
In order to compare the similarity between the two sets of articles determined for each term, a Similarity Index (SI) was calculated from:
SI = 2C / (A_{Google Scholar} + B_{Scopus})
where:
SI is the Similarity Index.
A_{Google Scholar} and B_{Scopus} are the numbers of articles in each of the sets considered (990 for each).
C is the number of items shared by both sets. This number is easily calculated in Excel with a formula that compares the two-column matrix (Title_{Google Scholar}, Title_{Scopus}) for coincidences.
This SI reproduces the original idea of Sorensen, formulated many years ago to establish the similarity of groups of equal size.
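The Similarity Index can also be computed outside Excel. The following Python sketch is one possible way to do so and is not the procedure used in the study. The functions `normalize_title` and `similarity_index` are hypothetical names, and the title normalization (lower-casing, collapsing whitespace) is an assumption introduced here so that trivial formatting differences do not prevent a match. Titles are treated as sets, so repeated titles within one list count once.

```python
def normalize_title(title: str) -> str:
    """Lower-case a title and collapse runs of whitespace (an assumption of this sketch)."""
    return " ".join(title.lower().split())


def similarity_index(titles_a, titles_b) -> float:
    """Sorensen-style index: 2C / (A + B), with C the number of shared titles."""
    set_a = {normalize_title(t) for t in titles_a}
    set_b = {normalize_title(t) for t in titles_b}
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty sets
    shared = len(set_a & set_b)
    return 2 * shared / (len(set_a) + len(set_b))


# Toy lists for illustration only; they are not data from the study.
google_scholar_titles = ["Paper A", "Paper B", "Paper C", "Paper D"]
scopus_titles = ["paper a", "Paper  B", "Paper X", "Paper Y"]

si = similarity_index(google_scholar_titles, scopus_titles)
print(si)
```

With two lists of four titles that share two items, the index is 2·2 / (4 + 4) = 0.5. Two identical lists give 1 and two disjoint lists give 0.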
Outliers were detected using Rapidminer; the process configured for this purpose is shown in Figure 1.
The first operator reads the Excel file and processes the Cites and Validity fields; this was done for each search term and for each of the databases used (Google Scholar and Scopus). The second operator identifies the outliers in the data set and allows both the number of neighbors (k) and the number of outliers (n) to be specified. So that the different search terms could be compared, these parameters were set, after some preliminary tests, to the values:
n = 10
k = 10
The distances between each article and its k neighbors were computed as Euclidean distances. In practical terms, an attempt was made to answer the question: How to determine the 10 articles that can be considered seminal for each of the search terms analyzed?
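To make the roles of k and n concrete, the sketch below ranks points by the Euclidean distance to their k-th nearest neighbor and returns the n highest-ranked ones, in the spirit of the distance-based approach of Ramaswamy, Rastogi, and Shim (2000) cited earlier. It is an illustration under stated assumptions and not Rapidminer's implementation: the function name `knn_distance_outliers` is hypothetical, the points are raw (Cites, Validity) pairs, and no normalization is applied because the text does not state whether the fields were rescaled.

```python
import math


def knn_distance_outliers(points, k=10, n=10):
    """Return the indices of the n points farthest from their k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be at least 1 and smaller than the number of points")
    scores = []
    for i, p in enumerate(points):
        # Euclidean distances from point i to every other point, ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # distance to the k-th nearest neighbor
    scores.sort(reverse=True)  # largest distance first
    return [i for _, i in scores[:n]]


# Toy data for illustration only: a dense cluster of (Cites, Validity) points
# plus one heavily cited, long-lived article at index 24.
points = [(float(c), float(v)) for c in range(10, 16) for v in range(1, 5)]
points.append((5000.0, 25.0))

print(knn_distance_outliers(points, k=3, n=1))
```

Because citation counts span a much wider range than years, raw Euclidean distances in this sketch are dominated by the Cites axis. Whether to rescale the two fields before computing distances is a design choice the text leaves open.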
Table 1 below summarizes the Number of Cites, Years, Cites/Year and Cites/Paper as well as the Similarity Index for the three search terms considered. This information is of great importance for the purpose of characterizing the papers detected in the different databases used.
The Similarity Index remains similar between Knowledge Management and Entrepreneurship but decreases for Marketing.
Figure 1. Process outlier detection in Rapidminer. The data used are those detected under the search criteria previously defined.
Table 1. Summary data. Google Scholar matches with Scopus. Similarity index.
Analysis of Seed Articles
Figure 2 shows the outliers detected in the space VY_{Scopus} = f(C_{Scopus}) for the knowledge management case.
Table 2 presents the results for the SI for the articles determined as outliers, and which can therefore be categorized as seminal, using Google Scholar and Scopus and for the three search terms used. In other words, this table identifies each of the documents detected as outliers.
Figure 2. Outliers in the space VY_{Scopus} = f(C_{Scopus}); knowledge management case.
Table 2. Seminal papers found: knowledge management, entrepreneurship and marketing.
When the results of the searches in Scopus and Google Scholar were compared for Knowledge Management, Entrepreneurship and Marketing, no marked similarity was found between the sets of articles obtained from the two databases. The values of the Similarity Index remained below 0.52%, similar for Knowledge Management and Entrepreneurship but lower for Marketing.
The detection of outliers using data mining techniques, in particular Rapidminer, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in the space VY = f(C) in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest examining other search terms to determine whether the trend found is maintained.
 Acs, Z. J., Braunerhjelm, P., Audretsch, D. B., & Carlsson, B. (2009). The Knowledge Spillover Theory of Entrepreneurship. Small Business Economics, 32, 15-30.
 Bae, D. H., Hwang, S. M., Kim, S. W., & Faloutsos, C. (2011). Constructing Seminal Paper Genealogy. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 2101-2104).
 Bakar, Z. A., Mohemad, R., Ahmad, A., & Deris, M. M. (2006). A Comparative Study for Outlier Detection Techniques in Data Mining. In IEEE Conference on Cybernetics and Intelligent Systems.
 Baron, R. A. (2003). Human Resource Management and Entrepreneurship: Some Reciprocal Benefits of Closer Links. Human Resource Management Review, 13, 253-256.
 Berkani, L., Hanifi, R., & Dahmani, H. (2020). Hybrid Recommendation of Articles in Scientific Social Networks Using Optimization and Multiview Clustering. In 3rd International Conference on Smart Applications and Data Analysis for Smart Cyber-Physical Systems (pp. 117-132, Vol. 1207). Berlin: Springer.
 Berkes, F., Colding, J., & Folke, C. (2000). Rediscovery of Traditional Ecological Knowledge as Adaptive Management. Ecological Applications, 10, 1251-1262.
 Berry, L. L. (1995). Relationship Marketing of Services—Growing Interest, Emerging Perspectives. Journal of the Academy of Marketing Science: Official Publication of the Academy of Marketing Science, 23, 236-245.
 Buthong, N., Luangsodsai, A., & Sinapiromsaran, K. (2013). Outlier Detection Score Based on Ordered Distance Difference. In Computer Science and Engineering Conference.
 Dolman, A. J., Miralles, D. G., & de Jeu, R. A. M. (2014). Fifty Years since Monteith’s 1965 Seminal Paper: The Emergence of Global Ecohydrology. Ecohydrology, 7, 897-902.
 Gold, A. H., Malhotra, A., & Segars, A. H. (2001). Knowledge Management: An Organizational Capabilities Perspective. Journal of Management Information Systems, 18, 185-214.
 Gomez-Perez, A., Fernández-López, M., & Corcho, O. (2006). Ontological Engineering: with Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Berlin: Springer Science & Business Media.
 Harzing, A.-W., & Alakangas, S. (2016). Google Scholar, Scopus and the Web of Science: A Longitudinal and Cross-Disciplinary Comparison. Scientometrics, 106, 787-804.
 Henseler, J., Ringle, C. M., & Sinkovics, R. R. (2009). The Use of Partial Least Squares Path Modeling in International Marketing. In Advances in International Marketing (pp. 277-319, Vol. 20). Bingley: Emerald Group Publishing Ltd.
 Jacsó, P. (2009). Calculating the h-Index and Other Bibliometric and Scientometric Indicators from Google Scholar with the Publish or Perish Software. Online Information Review, 33, 1189-1200.
 Jarvis, C. B., Mackenzie, S. B., Podsakoff, P. M., Giliatt, N., & Mee, J. F. (2003). A Critical Review of Construct Indicators and Measurement Model Misspecification in Marketing and Consumer Research. Journal of Consumer Research, 30, 199-218.
 Kotler, P., & Gertner, D. (2002). Country as Brand, Product, and Beyond: A Place Marketing and Brand Management Perspective. Journal of Brand Management, 9, 249-261.
 Kozinets, R. V. (2002). The Field behind the Screen: Using Netnography for Marketing Research in Online Communities. Journal of Marketing Research, 39, 61-72.
 Lee, H., & Choi, B. (2003). Knowledge Management Enablers, Processes, and Organizational Performance: An Integrative View and Empirical Examination. Journal of Management Information Systems, 20, 179-228.
 Martin-Martin, A., Orduna-Malea, E., Harzing, A.-W., & Delgado López-Cózar, E. (2017). Can We Use Google Scholar to Identify Highly-Cited Documents? Journal of Informetrics, 11, 152-163.
 Palmatier, R. W., Dant, R. P., Grewal, D., & Evans, K. R. (2006). Factors Influencing the Effectiveness of Relationship Marketing: A Meta-Analysis. Journal of Marketing, 70, 136-153.
 Parkinson, L., Richardson, K., Sims, J., Wells, Y., Naganathan, V., Brooke, E., & Lindley, R. (2013). Identifying Seminal Papers in the Australasian Journal on Ageing 1982-2011: A Delphi Consensus Approach. Australasian Journal on Ageing, 32, 6-11.
 Pera, M. S., & Ng, Y.-K. (2011). A Personalized Recommendation System on Scholarly Publications. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
 Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Record (ACM Special Interest Group on Management of Data), 29, 427-438.
 Sanchez, R., & Mahoney, J. T. (1996). Modularity, Flexibility, and Knowledge Management in Product and Organization Design. Strategic Management Journal, 17, 63-76.
 Sciences, T. P. (1823). ART. I. A Comparative View of the State of Medical Science among the Ancients and Moderns, Its Revolutions in Different Periods of the World, and an Enumeration of Some of the Errors Which Check Its Progress. The Philadelphia Journal of the Medical and Physical Sciences, 7, 211-226.
 Silva, J., Villa, J. V., & Cabrera, D. (2020). An Intelligent Approach to Design and Development of Personalized Meta Search: Recommendation of Scientific Articles. In 16th International Conference on Distributed Computing and Artificial Intelligence, DCAI 2019 (pp. 99-106, Vol. 1003). Berlin: Springer Verlag.
 Smallbone, D., & Welter, F. (2012). Entrepreneurship and Institutional Change in Transition Economies: The Commonwealth of Independent States, Central and Eastern Europe and China Compared. Entrepreneurship and Regional Development, 24, 215-233.
 Stevenson, H. H., & Jarillo, J. C. (2007). A Paradigm of Entrepreneurship: Entrepreneurial Management. In Entrepreneurship: Concepts, Theory and Perspective (pp. 155-170). Berlin: Springer.
 Tax, S. S., Brown, S. W., & Chandrashekaran, M. (1998). Customer Evaluations of Service Complaint Experiences: Implications for Relationship Marketing. Journal of Marketing, 62, 60-76.
 Tranfield, D., Denyer, D., & Smart, P. (2003). Towards a Methodology for Developing Evidence-Informed Management Knowledge by Means of Systematic Review. British Journal of Management, 14, 207-222.
 Wang, C., & Blei, D. M. (2011). Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
 Zahra, S. A. (2012). Organizational Learning and Entrepreneurship in Family Firms: Exploring the Moderating Effect of Ownership and Cohesion. Small Business Economics, 38, 51-65.