Data Mining the World Wide Web
What is Web Mining?
- Web mining can be defined as the use of data mining techniques and algorithms to extract useful information directly from the web, including web documents and services, web content, hyperlinks, and server logs.
- The World Wide Web contains a large amount of data and is a rich source for data mining. The objective of web mining is to find patterns in web data by collecting and examining it in order to gain insights.
- A distinctive property of web mining is the variety of data types it works with. The web has multiple aspects that yield different approaches to the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs.
- Web mining can broadly be seen as the application of adapted data mining techniques to the web.
Three Types of Web Mining
Web Content Mining
- It is the process of extracting useful information from the content of web documents.
- Web content consists of several types of data: text, images, audio, video, etc.
- Content data is the collection of facts that a web page is designed to convey. It can reveal useful and interesting patterns about user needs.
- When the content is text, the mining draws on machine learning and natural language processing and is called text mining.
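As a rough illustration of text-oriented content mining, the sketch below strips the markup from a page and counts how often each word occurs. It uses only Python's standard library; the names `TextExtractor`, `term_frequencies`, and `sample_page` are hypothetical, and a real pipeline would add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def term_frequencies(html):
    """Return a Counter of lowercase words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(words)


# Hypothetical page used only for illustration.
sample_page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Web mining</h1><p>Mining the web for patterns.</p></body></html>"
)
freqs = term_frequencies(sample_page)
print(freqs.most_common(3))
```

The resulting word counts are the kind of raw feature that text mining or machine learning methods would then work on.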
Web Structure Mining
- Web structure mining is used to discover the link structure of hyperlinks. It identifies how web pages are connected, either through links between pages or as a direct link network.
- In web structure mining, the web is considered a directed graph, with web pages as vertices that are connected by hyperlinks.
- The best-known application is the Google search engine, whose ranking of results was originally based primarily on the PageRank algorithm.
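To make the directed-graph view concrete, here is a minimal sketch of a simplified PageRank computed by power iteration on a tiny link graph. It is a textbook-style version, not the production algorithm of any search engine; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the `toy_web` graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: an edge A -> B means page A links to page B.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page C ends up with the highest score because three of the four pages link to it, while D, which nothing links to, keeps only the baseline share.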
Web Usage Mining
- It is used to extract useful information and knowledge from web log records, and it helps in recognizing user access patterns for web pages.
- The content and structure of a collection of web pages reflect the intentions of the pages' authors, whereas the individual requests show how consumers actually view these pages. Web usage mining therefore studies records of requests made by a website's visitors, which are usually collected as web server logs.
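The first step in usage mining is usually turning raw log lines into structured records. The sketch below assumes the server writes the Common Log Format; the regular expression, the helper `parse_line`, and the sample lines (which use documentation-only IP addresses) are illustrative assumptions, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log lines used only for illustration.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:56:02 +0000] "GET /products.html HTTP/1.1" 200 1045',
    '192.0.2.7 - - [10/Oct/2023:14:01:10 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'not a log line',
]

records = [rec for rec in map(parse_line, sample_log) if rec is not None]
page_hits = Counter(rec["path"] for rec in records)
print(page_hits.most_common())
```

Malformed lines are skipped rather than raising an error, since real logs often contain entries that do not fit the expected pattern.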
Some of the important methods to identify and analyze the web usage patterns are given below:
Session and Visitor Analysis
- Session analysis is performed on the preprocessed data and covers time, visitor records, days, sessions, etc.
- This data can be used to analyze visitor behavior.
- A report is created after this analysis, containing details of frequently visited web pages and common entry and exit pages.
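A minimal sketch of session analysis is shown below: requests from the same visitor are grouped into one session until a gap of inactivity occurs, and the first and last page of each session are counted as entry and exit pages. The 30-minute timeout is an assumed convention rather than something the text prescribes, and `build_sessions`, the visitor names, and the request data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity cutoff


def build_sessions(requests):
    """Group (visitor, timestamp, page) tuples into lists of pages per session."""
    sessions = []
    last_seen = {}  # visitor -> (time of last request, that visitor's open session)
    for visitor, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous[0] > SESSION_TIMEOUT:
            current = []  # start a new session
            sessions.append(current)
        else:
            current = previous[1]
        current.append(page)
        last_seen[visitor] = (ts, current)
    return sessions


# Hypothetical preprocessed requests used only for illustration.
requests = [
    ("alice", datetime(2023, 10, 10, 13, 55), "/index.html"),
    ("alice", datetime(2023, 10, 10, 13, 58), "/products.html"),
    ("alice", datetime(2023, 10, 10, 15, 30), "/index.html"),
    ("bob", datetime(2023, 10, 10, 14, 1), "/index.html"),
    ("bob", datetime(2023, 10, 10, 14, 5), "/contact.html"),
]

sessions = build_sessions(requests)
entry_pages = Counter(session[0] for session in sessions)
exit_pages = Counter(session[-1] for session in sessions)
print(len(sessions), entry_pages, exit_pages)
```

Here the visitor "alice" produces two sessions because more than 30 minutes pass between her second and third requests.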
OLAP (Online Analytical Processing)
- It performs multidimensional analysis of complex data.
- It can be performed on various parts of log-related data for a specific period of time.
- OLAP tools can be used to derive important business intelligence metrics.
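The idea of multidimensional analysis can be sketched without a dedicated OLAP tool: each log record is treated as a fact with several dimensions, and hit counts are aggregated over whichever dimensions are of interest. The helpers `roll_up` and `slice_rows`, the chosen dimensions (day, page, status), and the sample facts below are hypothetical and only mimic what a real OLAP engine does at scale.

```python
from collections import Counter

# Hypothetical fact rows derived from log records.
facts = [
    {"day": "Mon", "page": "/index.html", "status": 200},
    {"day": "Mon", "page": "/index.html", "status": 200},
    {"day": "Mon", "page": "/missing.html", "status": 404},
    {"day": "Tue", "page": "/index.html", "status": 200},
    {"day": "Tue", "page": "/products.html", "status": 200},
]


def roll_up(rows, dimensions):
    """Count hits grouped by the given dimensions (keys are tuples of values)."""
    return Counter(tuple(row[dim] for dim in dimensions) for row in rows)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dimension] == value]


hits_by_day = roll_up(facts, ["day"])
hits_by_day_and_page = roll_up(facts, ["day", "page"])
monday_by_status = roll_up(slice_rows(facts, "day", "Mon"), ["status"])

print(hits_by_day)
print(hits_by_day_and_page)
print(monday_by_status)
```

Grouping by fewer dimensions corresponds to rolling up to a coarser summary, while fixing one dimension to a single value corresponds to slicing the data for a specific period.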
Challenges in Web Mining
The complexity of Web Pages
- Web pages are extremely complex compared to traditional text documents.
- There is an enormous number of documents in the digital library of the web, and these documents are not organized in any specific order.
Dynamic Data Source
- Data on the internet is updated quickly, for example news.
Diversity of Client Networks
- The client network on the web is expanding quickly.
- More than a hundred million workstations are connected to the internet, and the number is still increasing rapidly.
Relevancy of Data
- A particular person is generally interested in only a small portion of the web. The rest of the web contains data that is unfamiliar to that user and may lead to unwanted results.
The web is too broad
- The size of the web is tremendous and rapidly increasing.
Applications of Web Mining
- Marketing and conversion tool.
- Data analysis on website.
- Audience behavior analysis.
- Testing and analysis of a site.
- Advertising and campaign performance analysis.