Web Mining: Knowledge Discovery Domains

Prof. Ms. Swati A. Abhang

Abstract: Web mining applies data mining, artificial intelligence, chart technology, and related techniques to web data in order to trace users’ visiting characteristics and extract their usage patterns. Web mining technologies are well suited to knowledge discovery on the Web. The knowledge extracted from the Web can be used to improve the performance of Web information retrieval, question answering, and Web-based data warehousing. In this paper, we provide an introduction to Web mining as well as a review of the Web mining categories.

Keywords: Data mining; web mining; web usage mining

I. INTRODUCTION

Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. To serve users better, web mining applies data mining, artificial intelligence, chart technology, and related techniques to web data, traces users’ visiting characteristics, and then extracts their usage patterns. It has quickly become one of the most important areas in Computer and Information Sciences because of its direct applications in e-commerce, e-CRM, Web analytics, information retrieval and filtering, and Web information systems.

According to the differences in the mining objects, there are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining.

Fig. 1 Web mining categories and objects

Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web.
Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.

II. WEB MINING PROCESS

The web mining process is generally divided into five stages: data acquisition, data preprocessing, pattern discovery, pattern analysis, and pattern application.

(1) Data acquisition
Web mining can collect raw data from clients, servers, and registered or remote agents. Their data types are quite different, and the data processing methods differ accordingly. Data collected from different sources reflect different access patterns in the use of the Web.

(2) Data preprocessing
Data preprocessing applies a series of processing steps to the raw data to obtain the target information. Its result is the input to the mining algorithm, so it directly influences mining quality. Data preprocessing mainly includes data cleaning, user identification, session identification, path completion, and transaction identification. It yields user session sets that reflect the user’s browsing process fairly objectively, which prepares the ground for improving the accuracy of the final mined patterns and the quality of recommendations.

(3) Pattern discovery
Pattern discovery uses mining algorithms to extract information and knowledge that is valid, novel, latent, useful, and ultimately understandable. The technologies used in Web usage mining include statistical analysis, path analysis, association analysis, sequential pattern analysis, classification analysis, clustering analysis, and dependency modeling. There are two kinds of clustering on the Web: user clustering and page clustering.

(4) Pattern analysis
The user behavior patterns obtained from mining need to be analyzed, explained, and visualized with suitable tools and techniques, so that the interesting patterns can be selected and turned into knowledge that people can understand and query. There are many pattern analysis methods, such as visualization, data query, and OLAP. Visualization techniques, for example graphical displays that paint different values in different colors, can make an overall pattern or trend stand out. Content and structure information can also be used to filter for specific patterns, such as those containing a specific usage type, a specific content type, or pages with a specific hyperlink structure.

(5) Pattern application
The meaningful conclusions and patterns mined can be applied in many ways, such as modifying web page content, improving web service design, customizing personalized interfaces for users, and providing personalized e-commerce services.
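The preprocessing stage described above (data cleaning, user identification, session identification) can be illustrated with a minimal sketch. The record layout, the sample data, the (IP, user agent) heuristic for identifying users, and the 30-minute session timeout are all assumptions made for illustration; the paper does not prescribe them, and path completion and transaction identification are not covered here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (ip, user_agent, timestamp, url, status).
RAW_LOG = [
    ("10.0.0.1", "Firefox", "2024-01-01 10:00:00", "/index.html", 200),
    ("10.0.0.1", "Firefox", "2024-01-01 10:00:01", "/logo.png", 200),
    ("10.0.0.1", "Firefox", "2024-01-01 10:05:00", "/products.html", 200),
    ("10.0.0.2", "Chrome", "2024-01-01 10:06:00", "/index.html", 200),
    ("10.0.0.1", "Firefox", "2024-01-01 11:30:00", "/index.html", 200),
    ("10.0.0.2", "Chrome", "2024-01-01 10:07:00", "/missing.html", 404),
]

# Assumed cleaning rules: embedded resources are not page views.
IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")
# Assumed inactivity gap that closes a session.
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Data cleaning: drop failed requests and embedded resources."""
    return [
        r for r in records
        if r[4] == 200 and not r[3].lower().endswith(IGNORED_SUFFIXES)
    ]


def identify_users(records):
    """User identification: treat each (IP, user agent) pair as one user."""
    users = defaultdict(list)
    for ip, agent, ts, url, _status in records:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        users[(ip, agent)].append((when, url))
    return users


def identify_sessions(users, timeout=SESSION_TIMEOUT):
    """Session identification: split each user's visits at long gaps."""
    sessions = []
    for user, visits in users.items():
        visits = sorted(visits)
        current = [visits[0][1]]
        for (prev_time, _), (time, url) in zip(visits, visits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


sessions = identify_sessions(identify_users(clean(RAW_LOG)))
for user, pages in sessions:
    print(user, pages)
```

The resulting session list is the kind of input that the pattern discovery stage would consume.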
III. WEB CONTENT MINING

Web content mining is a form of text mining and can take advantage of the semi-structured nature of web page text. Query interfaces share similar or common query patterns. For instance, a frequently used pattern is a text label followed by a selection list with numeric values. The HTML tags of today’s web pages, and even more so the XML markup of tomorrow’s web pages, carry information that concerns not only layout but also logical structure. HTML may be invalid, which causes problems when extracting information. Most previous work extracts information from HTML pages, and some of it first converts invalid HTML pages to valid HTML before applying the extraction process; in this paper we use the XML format of web pages for extracting information. The extractor system presented in this paper takes XML pages as input and accesses the XML tags in documents through the XML DOM API.
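As a rough illustration of DOM-based extraction from an XML page, the sketch below uses Python's standard `xml.dom.minidom` module to find the pattern mentioned above, a text label followed by a selection list with numeric values. The sample page, its element names, and the function `extract_selection_lists` are hypothetical; this is not the extractor system of the paper.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed XML rendering of a small query interface.
PAGE = """<form name="search">
  <label for="price">Max price</label>
  <select id="price">
    <option>100</option>
    <option>200</option>
    <option>500</option>
  </select>
</form>"""


def extract_selection_lists(xml_text):
    """Return (label text, option values) for every <select> in the page."""
    dom = parseString(xml_text)

    # Map the id each label points at to the label's text.
    labels = {}
    for label in dom.getElementsByTagName("label"):
        labels[label.getAttribute("for")] = label.firstChild.data.strip()

    # Walk the selection lists and collect their option values.
    result = []
    for select in dom.getElementsByTagName("select"):
        values = [
            option.firstChild.data.strip()
            for option in select.getElementsByTagName("option")
        ]
        result.append((labels.get(select.getAttribute("id"), ""), values))
    return result


fields = extract_selection_lists(PAGE)
print(fields)
```

Because the parser rejects input that is not well formed, a sketch like this only works once the page is available as valid XML, which is the motivation given above for using the XML format.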
DOM is a standard interface that takes a web page as input and represents it as a structured tree of interfaces, objects, and the relations between them. A sample DOM tree, extracted from a sample query interface, is shown in Fig. 2.

Web content mining targets knowledge discovery in which the main objects are the traditional collections of text documents and, more recently, collections of multimedia documents such as images, video, and audio that are embedded in or linked to Web pages. Web content mining can be viewed from two perspectives: the agent-based approach and the database approach.

Fig. 2 Simple DOM tree

The first approach aims at improving information finding and filtering and can be placed into the following three categories:

1. Intelligent Search Agents. These agents search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information.

2. Information Filtering/Categorization. These agents use information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.

3. Personalized Web Agents. These agents learn user preferences and discover Web information based on these preferences and on the preferences of other users with similar interests.

The second approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it. The two main categories are multilevel databases and Web query systems.

IV. WEB STRUCTURE MINING

The challenge for Web structure mining is to deal with the structure of the hyperlinks within the Web itself. Link analysis is an old area of research. However, with the growing interest in Web mining, research on structure analysis has increased, and these efforts have resulted in a newly emerging research area called Link Mining, which is located at the intersection of the work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. There is a potentially wide range of application areas for this new area of research, including the Internet.

The Web contains a variety of objects with almost no unifying structure, with differences in authoring style and content much greater than in traditional collections of text documents. The objects in the WWW are web pages, and links are in-links, out-links, and co-citations (two pages that are both linked to by the same page).
Attributes include HTML tags, word appearances, and anchor texts. This diversity of objects creates new problems and challenges, since it is not possible to directly make use of existing techniques such as those from database management or information retrieval. Link mining has also changed how some of the traditional data mining tasks are approached. Below we summarize some of the link mining tasks that are applicable in Web structure mining.

1. Link-based Classification. Link-based classification is the most recent upgrade of a classic data mining task to linked domains. The task is to predict the category of a web page based on the words that occur on the page, links between pages, anchor text, HTML tags, and other possible attributes found on the web page.

2. Link-based Cluster Analysis. The goal in cluster analysis is to find naturally occurring sub-classes.
The data is segmented into groups, where similar objects are grouped together and dissimilar objects are placed in different groups. Unlike the previous task, link-based cluster analysis is unsupervised and can be used to discover hidden patterns in data.

3. Link Type. There is a wide range of tasks concerning the prediction of the existence of links, such as predicting the type of link between two entities or predicting the purpose of a link.

4. Link Strength. Links can be associated with weights.

5. Link Cardinality. The main task here is to predict the number of links between objects.

There are many ways to use the link structure of the Web to create notions of authority.
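To make the link-structure notions above concrete, here is a small sketch over a hypothetical four-page link graph. It derives in-links from out-links, counts co-citations, and computes hub and authority scores with the HITS iteration. HITS is used here only as one well-known example of an authority measure; the paper does not name a specific algorithm, and the graph and function names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical link graph: page -> pages it links to (out-links).
# Every linked page also appears as a key.
LINKS = {
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}


def in_links(links):
    """Invert the out-link map to get the in-links of every page."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    return incoming


def co_citations(links):
    """Count, for each pair of pages, how many pages link to both."""
    counts = {}
    for targets in links.values():
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """HITS: good hubs point to good authorities and vice versa."""
    incoming = in_links(links)
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(iterations):
        auth = normalize({p: sum(hub[q] for q in incoming[p]) for p in links})
        hub = normalize({p: sum(auth[q] for q in links[p]) for p in links})
    return hub, auth


incoming = in_links(LINKS)
cocited = co_citations(LINKS)
hub, auth = hits(LINKS)
print(incoming["D"], cocited, max(auth, key=auth.get))
```

In this toy graph, pages C and D are co-cited by both A and B, and D, which every other page links to, receives the highest authority score.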
The main goal in developing applications for link mining is to make good use of the understanding of this intrinsic social organization of the Web.

V. WEB USAGE MINING

Concept of web usage mining

Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help in understanding user behavior and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in Web usage mining, driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking. Web usage mining mines data from the log records of web pages. Log records contain much useful information, such as the URL, the IP address, and the access time.
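As an illustration of the fields a log record carries, the following sketch parses one line in the widely used Common Log Format into IP address, time, URL, and status. The choice of this format, the sample line, and the function `parse_line` are assumptions; the paper does not say which log format it has in mind.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month names are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical sample line using a documentation-only IP address.
SAMPLE = '192.0.2.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326'

record = parse_line(SAMPLE)
print(record["ip"], record["time"].isoformat(), record["url"], record["status"])
```

Lines that do not match the pattern return `None`, so malformed entries can be discarded during data cleaning.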
Analyzing logs and discovering patterns in them can help us find more potential customers and track service quality.

Web usage mining is the process of applying data mining technology to web data in order to extract the patterns that users are interested in from their network behavior. When people visit a website, they leave data such as IP address, pages visited, and visiting time; web usage mining collects, analyzes, and processes this log and record data.

Approach of web usage mining

Web usage mining generally includes the following steps: data collection, data pretreatment, establishing an interest model, and data post-processing (pattern analysis).

(1) Data collection
Data collection is the first step of web usage mining. The authenticity and integrity of the data directly affect how smoothly the subsequent work proceeds and the quality of the final personalized recommendation service. Therefore, scientific, reasonable, and advanced techniques must be used to gather the various data. At present, web usage mining draws on three main data sources: server data, client data, and intermediate data (proxy server data and packet sniffing).

(2) Data pretreatment
Some databases are incomplete, inconsistent, and noisy. Data pretreatment applies a unifying transformation to those databases.
As a result, the database becomes integrated and consistent, yielding a database that can be mined. Data pretreatment mainly includes data cleaning, user recognition, user session recognition, and data formatting.

(3) Establish interest model
Statistical methods are used to analyze and mine the pretreated data. We can discover the interests of a user or a user community and then construct an interest model. At present, the commonly used machine learning methods are clustering, classification, association discovery, and sequential pattern discovery. Each method has its own strengths and shortcomings, but classification and clustering are currently the most effective.
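One of the methods listed above, association discovery, can be sketched on a handful of hypothetical user sessions: pages that frequently occur together in a session are reported with their support, and a rule "visitors of page X also visit page Y" is scored by its confidence. The session data, the 0.5 support threshold, and the function names are illustrative assumptions, and this brute-force pair counting stands in for a full association rule miner.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions produced by the pretreatment step.
SESSIONS = [
    ["/index.html", "/products.html", "/cart.html"],
    ["/index.html", "/products.html"],
    ["/index.html", "/about.html"],
    ["/products.html", "/cart.html"],
]


def frequent_pairs(sessions, min_support=0.5):
    """Return page pairs whose share of sessions is at least min_support."""
    pair_counts = Counter()
    for session in sessions:
        # Count each pair once per session, in a canonical order.
        pair_counts.update(combinations(sorted(set(session)), 2))
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def confidence(sessions, antecedent, consequent):
    """Share of sessions containing antecedent that also contain consequent."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


pairs = frequent_pairs(SESSIONS)
rule_confidence = confidence(SESSIONS, "/cart.html", "/products.html")
print(pairs, rule_confidence)
```

Pairs and rules that survive the thresholds are candidates for the interest model; the thresholds themselves would need tuning on real log data.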
(4) Pattern analysis
The interest patterns that have been established are analyzed and generalized further. First, the less significant rules or models are deleted from the interest model repository. Next, OLAP and similar techniques are used to carry out comprehensive mining and analysis. Then the discovered data or knowledge is visualized. Finally, the personalized service is provided to the e-commerce website.

CONCLUSION

In this paper we surveyed Web mining, focusing on its categories: Web Content Mining, Web Structure Mining, and Web Usage Mining. We have discussed the process of all three types of mining in detail.