Selected Readings for Computer Science Majors, No. 17

Editor: hnldzy, 2004-12-31

Web Harvesting

As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes.

Consider that even when you use a search engine to locate data, you still have to do the following tasks to capture the information you need: scan the content until you find the information; mark the information (usually by highlighting it with a mouse); switch to another application (such as a spreadsheet, database or word processor); and paste the information into that application.

A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors, lies with Web harvesting tools.

Web harvesting software automatically extracts information from the Web and picks up where search engines leave off, doing the work the search engine can't. Extraction tools automate the reading, copying and pasting necessary to collect information for analysis, and they have proved useful for pulling together information on competitors, prices and financial data of all types.

There are three ways we can extract more useful information from the Web.

The first technique, Web content harvesting, is concerned directly with the specific content of documents or their descriptions, such as HTML files, images or e-mail messages. Since most text documents are relatively unstructured (at least as far as machine interpretation is concerned), one common approach is to exploit what's already known about the general structure of documents and map this to some data model.
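As a minimal sketch of mapping known document structure to a data model, the Python below reads an HTML table and turns each row into a record keyed by the header cells. The article names no particular tool; the class `TableRowParser`, the function `harvest_table` and the sample page fragment are all hypothetical, invented for this illustration, and only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def harvest_table(html):
    """Map an HTML table to a list of dicts keyed by the first (header) row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = harvest_table(SAMPLE_HTML)
print(records)
```

The records come out as plain dictionaries, so they could be written to a spreadsheet or database without the manual copy-and-paste steps described earlier. Real pages are rarely this regular, and a production tool would need to cope with nested tables and malformed markup.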

Another approach to Web content harvesting involves trying to improve on the content searches that tools like search engines perform. This type of content harvesting goes beyond keyword extraction and the production of simple statistics relating to words and phrases in documents.
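The article does not say which methods go beyond simple word statistics. As one illustrative assumption, the Python sketch below uses TF-IDF weighting, a standard information-retrieval technique that scores a word by how often it appears in one document relative to how many documents contain it. The function names and the three sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document.

    score = (count of word in doc / words in doc)
            * log(number of docs / number of docs containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count, for each word, how many documents contain it at least once.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical documents, invented for this example.
DOCS = [
    "the price of the widget rose",
    "the price of the gadget fell",
    "the server logs the visits",
]

scores = tf_idf(DOCS)
for word, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{word:8s} {score:.3f}")
```

A raw count would rank "the" highest in the first sentence because it occurs twice. Under this weighting it scores zero because it appears in every document, while "widget", which appears in only one, scores highest.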

The second technique, Web structure harvesting, takes advantage of the fact that Web pages can reveal more information than just their obvious content. Links from other sources that point to a particular Web page indicate the popularity of that page, while links within a Web page that point to other resources may indicate the richness or variety of topics covered in that page. This is like analyzing bibliographical citations: a paper that's often cited in bibliographies and other papers is usually considered to be important.
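The two link measures can be sketched as simple counts over a link graph. The Python below assumes the links have already been collected into a dictionary that maps each page to the pages it links to; the function `link_counts` and the four-page sample graph are hypothetical. Counting inbound links is only the crudest popularity signal, and this sketch does not try to reproduce any particular ranking algorithm.

```python
from collections import Counter


def link_counts(link_graph):
    """Return (in_links, out_links) for a {page: [linked pages]} graph.

    Duplicate links and links from a page to itself are ignored in both counts.
    """
    in_links = Counter()
    out_links = {}
    for page, targets in link_graph.items():
        distinct = set(targets) - {page}
        out_links[page] = len(distinct)   # variety of resources the page points to
        for target in distinct:
            in_links[target] += 1         # one more page "cites" the target
    return in_links, out_links


# Hypothetical site, invented for this example.
GRAPH = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html", "d.html"],
}

in_links, out_links = link_counts(GRAPH)
for page in GRAPH:
    print(page, "inbound:", in_links[page], "outbound:", out_links[page])
```

Here `c.html` is linked from three other pages, which marks it as the most "cited" page in the sample, while `a.html` has the most distinct outbound links.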

The third technique, Web usage harvesting, uses data recorded by Web servers about user interactions to help understand user behavior and evaluate the effectiveness of the Web structure.

General access-pattern tracking, one form of usage harvesting, analyzes Web logs to understand access patterns and trends in order to identify structural issues and resource groupings.
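A minimal sketch of this kind of log analysis follows. It assumes the server writes the widely used Common Log Format; the article does not specify a format, and the sample log lines are invented. It counts requests per page and treats requests that return status 404 as one possible sign of a structural issue, such as a broken link.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Count requests per path, and separately the requests that returned 404."""
    hits = Counter()
    missing = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the expected format
        hits[entry["path"]] += 1
        if entry["status"] == "404":
            missing[entry["path"]] += 1
    return hits, missing


# Hypothetical log lines, invented for this example.
LOG_LINES = [
    '192.0.2.1 - - [31/Dec/2004:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.2 - - [31/Dec/2004:10:00:05 +0000] "GET /prices.html HTTP/1.0" 200 2310',
    '192.0.2.1 - - [31/Dec/2004:10:01:10 +0000] "GET /prices.html HTTP/1.0" 200 2310',
    '192.0.2.3 - - [31/Dec/2004:10:02:00 +0000] "GET /old/report.html HTTP/1.0" 404 -',
    'not a log line',
]

hits, missing = summarize(LOG_LINES)
print("most requested:", hits.most_common(2))
print("not found:", dict(missing))
```

Grouping paths that are often requested in the same visit would be a natural next step toward the resource groupings mentioned above, but that is beyond this sketch.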

Customized usage tracking analyzes the trends of individual users so that Web sites can be personalized for them. Over time, based on access patterns, a site can be dynamically customized for a user in terms of the information displayed, the depth of the site structure and the format of the resources presented.
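As a rough illustration of per-user tracking, the sketch below builds a profile of the site sections each user visits most and returns the top sections, which a site could then choose to show first. It assumes visits are already available as (user, path) pairs and that the first path segment names a section; the user names, paths and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def section_of(path):
    """Return the first path segment; top-level pages count as 'home'."""
    parts = [part for part in path.split("/") if part]
    return parts[0] if len(parts) > 1 else "home"


def build_profiles(visits):
    """Count, for each user, how many visits fell in each site section."""
    profiles = defaultdict(Counter)
    for user, path in visits:
        profiles[user][section_of(path)] += 1
    return profiles


def preferred_sections(profiles, top_n=2):
    """Return each user's most visited sections, most frequent first."""
    return {
        user: [section for section, _ in counts.most_common(top_n)]
        for user, counts in profiles.items()
    }


# Hypothetical visit records, invented for this example.
VISITS = [
    ("alice", "/finance/rates.html"),
    ("alice", "/finance/stocks.html"),
    ("alice", "/news/today.html"),
    ("bob", "/sports/scores.html"),
    ("bob", "/index.html"),
    ("bob", "/sports/teams.html"),
]

preferences = preferred_sections(build_profiles(VISITS))
print(preferences)
```

A real site would update such profiles continuously as new log entries arrive, which is what allows the customization to change over time.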

