
Computer Science English: Final Assignment

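Reference 1

AN APPLICATION OF THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE TO THE 1990 U.S. DECENNIAL CENSUS

William E. Winkler and Yves Thibaudeau

U.S. Bureau of the Census

Abstract: This paper describes a methodology for computer matching the Post Enumeration Survey with the Census. Computer matching is the first stage of a process for producing adjusted Census counts. All crucial matching parameters are computed solely using characteristics of the files being matched. No a priori knowledge of truth of matches is assumed. No previously created lookup tables are needed. The methods are illustrated with numerical results using files from the 1988 Dress Rehearsal Census for which the truth of matches is known.

Key words and phrases: EM Algorithm; String Comparator Metric; LP Algorithm; Decision Rule; Error Rate.

Translation (from the Chinese)

Title: An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census

Abstract: This paper presents a method for computer matching the Post Enumeration Survey against the Census. Computer matching is the first stage in producing adjusted Census counts. All key matching parameters are computed using only the characteristics of the files being matched. No prior knowledge of the true match status is assumed. Previously created lookup tables are not needed either. The method is illustrated with numerical results on files from the 1988 Dress Rehearsal Census, for which the true match status is known.

Key words and phrases: EM algorithm; string comparator metric; LP algorithm; decision rule; error rate.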
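The decision rule named in the keywords of the first reference can be illustrated with a minimal sketch of the textbook Fellegi-Sunter scoring scheme. This is not the authors' implementation: the field names, the m and u probabilities, and the two thresholds below are all hypothetical placeholders, and the fields are assumed to be conditionally independent. The abstract says the real parameters are estimated from the files being matched; here they are simply hard-coded.

```python
import math

# Hypothetical per-field probabilities (illustrative only, not from the paper):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}


def match_weight(agreements):
    """Sum log2 likelihood ratios over fields, assuming independent fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if agreements[field]:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision rule; both thresholds are arbitrary placeholders."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


if __name__ == "__main__":
    pair = {"last_name": True, "first_name": False, "birth_year": False}
    w = match_weight(pair)
    print(round(w, 2), classify(w))
```

In practice the thresholds would be chosen to bound the false-match and false-non-match error rates, and pairs that fall between them would go to clerical review.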

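Reference 2

Data Cleaning: Problems and Current Approaches

Erhard Rahm, Hong Hai Do

University of Leipzig, Germany

Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.

Translation (from the Chinese)

Title: Data Cleaning: Problems and Current Approaches

Abstract: We classify the data quality problems that data cleaning addresses and give an overview of the main solution approaches. Data cleaning is needed above all when integrating heterogeneous data sources, and it should be handled together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.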

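Reference 3

Record Linkage: Current Practice and Future Directions

Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford

CSIRO Mathematical and Information Sciences

GPO Box 664, Canberra, ACT 2601, Australia

CMIS Technical Report No. 03/83

Abstract: Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining have proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach are assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.

Keywords: record linkage, data cleaning, entity identification, entity reconciliation, object isomerism, merge/purge, list washing.

Translation (from the Chinese)

Title: Record Linkage: Current Practice and Future Directions

Abstract: Record linkage is the task of quickly and accurately identifying, in one or more data sources, the records that correspond to the same entity. Record linkage is also known as data cleaning, entity reconciliation or identification, and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and its associated algorithm. Recent work in information retrieval, federated database systems, and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether, and how, these new alternatives improve on time, accuracy, and degree of automation for a particular record linkage application.

Keywords: record linkage, data cleaning, entity identification, entity reconciliation, object isomerism, merge/purge, list washing.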

Reference 4

LEARNING OBJECT IDENTIFICATION RULES FOR INFORMATION INTEGRATION

Sheila Tejada (1), Craig A. Knoblock (1), and Steven Minton (2)

(1) University of Southern California/Information Sciences Institute, 4676 Admiralty Way, Marina del Rey CA 90292

(2) Fetch Technologies, 4676 Admiralty Way, Marina del Rey CA 90292

(December 2001)

Abstract: When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time consuming and error-prone. In our approach, Active Atlas learns to tailor mapping rules, through limited user input, to a specific application domain. The experimental results demonstrate that we achieve higher accuracy and require less user involvement than previous methods across various application domains.

Translation (from the Chinese)

Title: Learning Object Identification Rules for Information Integration

Abstract: When integrating information from multiple websites, the same data objects can appear in inconsistent text formats across sites, which makes it difficult to identify matching objects by exact text match.
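The problem described in the fourth abstract, exact text match failing on inconsistently formatted objects, can be illustrated with a small sketch. This is not Active Atlas and it does not learn anything: the two records, the attribute weights, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a, rec_b, weights):
    """Weighted average of per-attribute similarity over shared attributes."""
    shared = [k for k in weights if k in rec_a and k in rec_b]
    total = sum(weights[k] for k in shared)
    if total == 0:
        return 0.0  # nothing to compare
    return sum(weights[k] * similarity(rec_a[k], rec_b[k]) for k in shared) / total


# Hypothetical listings of the same restaurant on two different sites.
site_a = {"name": "Art's Deli", "street": "12224 Ventura Blvd."}
site_b = {"name": "Art's Delicatessen", "street": "12224 Ventura Boulevard"}

# Assumed attribute importance; a learning system would tune these per domain.
weights = {"name": 0.7, "street": 0.3}

exact = site_a == site_b
score = weighted_score(site_a, site_b, weights)

if __name__ == "__main__":
    print("exact match:", exact)
    print("weighted similarity:", round(score, 3))
```

The exact comparison reports no match, while the weighted similarity stays well above zero. Hand-setting the weights and the acceptance threshold is the manual step that the abstract says Active Atlas replaces with learning from limited user input.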
