
Search Engine Technology

– This is the hybrid approach
– Index provides fast access to a subset of database records
– Scan subset to find solution set
• IR Problem: cannot predict the keys that people will use in queries
• Hybrids: Use small index, then scan a subset of the collection
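The hybrid idea can be illustrated with a minimal sketch. It assumes a coarse index that maps each term to fixed-size blocks of documents, followed by a scan of only the candidate blocks. The names `build_block_index`, `hybrid_phrase_search`, and `block_size` are hypothetical, chosen for illustration rather than taken from the slides.

```python
from collections import defaultdict

def build_block_index(docs, block_size=2):
    """Map each term to the set of block numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id // block_size)
    return index

def hybrid_phrase_search(docs, index, phrase, block_size=2):
    """Use the small index to pick candidate blocks, then scan only those."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Blocks that contain every query term; an unknown term gives no candidates.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    hits = []
    for block in sorted(candidates):
        start = block * block_size
        for doc_id in range(start, min(start + block_size, len(docs))):
            if phrase.lower() in docs[doc_id].lower():  # the scan step
                hits.append(doc_id)
    return hits

docs = [
    "inverted files are common",
    "signature files are compact",
    "bitmaps are simple",
    "word level inverted files store positions",
]
index = build_block_index(docs)
print(hybrid_phrase_search(docs, index, "inverted files"))
```

The index stays small because it records blocks rather than individual documents or positions; the price is that every candidate block still has to be scanned.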
2021/3/6
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
Indexes: Implementation
• Common implementations of indexes
– Bitmaps
– Signature files
No positional data indexed
– Inverted files
Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model
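The three steps can be sketched as follows, assuming a Boolean AND retrieval model and a lexicon held as an in-memory dictionary of sorted document-id lists. Both are assumptions made for illustration; the function names `boolean_and_search` and `intersect` are hypothetical.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and_search(lexicon, query):
    """Look up terms, fetch their postings, and combine them with AND."""
    # 1. Find the query terms in the lexicon.
    terms = query.lower().split()
    if not terms or any(t not in lexicon for t in terms):
        return []
    # 2. Retrieve the postings for each lexicon entry.
    postings = [lexicon[t] for t in terms]
    # 3. Manipulate the postings: intersect them, shortest list first.
    postings.sort(key=len)
    result = list(postings[0])
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result

lexicon = {"index": [1, 2, 4, 7], "inverted": [2, 4, 5], "file": [2, 3, 4, 7]}
print(boolean_and_search(lexicon, "inverted index file"))
```

Starting with the shortest postings list keeps the intermediate result small. A statistical model would replace step 3 with score accumulation instead of intersection.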
– Every word in a document is a potential search term
• IR Solution: Index by all keys (words): full-text indexes
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
date: Tue, 15 Apr 2003 08:13:06 GMT // time of harvest
ip: 162.105.129.12 // IP address
unzip-length: 30233 // if this field is present, the data part is compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part starts here
XXXXXXXX
….
XXXXXXXX // data end
// a trailing newline
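A record in this layout (header fields, a blank line, then `length` bytes of data) could be read with a sketch like the one below. The slides say only that the data is compressed when `unzip-length` is present, so the use of zlib is an assumption, as are the function name `parse_record` and the `example.com` URL in the sample record.

```python
import zlib

def parse_record(raw: bytes):
    """Split one record into (headers, data) following the layout above."""
    head, sep, rest = raw.partition(b"\n\n")  # headers end at the blank line
    if not sep:
        raise ValueError("no blank line after the header fields")
    headers = {}
    for line in head.decode("ascii").splitlines():
        key, _, value = line.partition(":")   # split at the first colon only
        headers[key.strip()] = value.strip()
    data = rest[: int(headers["length"])]     # 'length' bytes of stored data
    if "unzip-length" in headers:
        # Assumption: zlib. The source only states that the data is compressed.
        data = zlib.decompress(data)
        if len(data) != int(headers["unzip-length"]):
            raise ValueError("unzip-length does not match the data")
    return headers, data

# Build a small sample record (hypothetical content) and parse it back.
body = b"<html>hello</html>"
packed = zlib.compress(body)
raw = (
    b"version: 1.0\nurl: http://example.com/\n"
    b"date: Tue, 15 Apr 2003 08:13:06 GMT\nip: 162.105.129.12\n"
    + f"unzip-length: {len(body)}\nlength: {len(packed)}\n\n".encode("ascii")
    + packed + b"\n"
)
headers, data = parse_record(raw)
print(headers["ip"], data)
```

Reading exactly `length` bytes, rather than searching for an end marker, matters because compressed data may itself contain newline bytes.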
File Organizations (Indexes)
Tianwang Storage Format
version: 1.0 // version number
url: / // URL
origin: / // original URL
• Use indexes for direct access
– Evaluation time O(query term occurrences in collection)
– Practical for “large” collections
– Many opportunities for optimization
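Crawling, Organizing, Serving
• Crawling
– Batch crawling, incremental crawling; crawl targets, crawl strategies
• Preprocessing
– Keyword extraction; duplicate page elimination; link analysis; indexing
• Serving
– Query modes and matching; result ranking; document summaries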
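Search Engine System Workflow
Tianwang Search Engine System Workflow
Distributed Web Crawling System Architecture
[Figure labels: crawl process, coordination process (node), scheduling module]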
– Boolean
– Statistical (tf, df, ctf, doclen, maxtf)
– Often about 10% the size of the raw data, compressed
• Positional
– Feature location within document
– Granularities include word, sentence, paragraph, etc.
– Coarse granularities are less precise, but take less space
– Word-level granularity is about 20-30% the size of the raw data, compressed
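The statistical features named above can be computed in one pass over a collection. The sketch below assumes the usual readings of the abbreviations, which the slides do not spell out: tf is a term's frequency within a document, df the number of documents containing the term, ctf the term's total frequency in the collection, doclen the document length in terms, and maxtf the largest tf in a document. Whitespace tokenization and the name `collection_stats` are also assumptions.

```python
from collections import Counter

def collection_stats(docs):
    """Return (tf, df, ctf, doclen, maxtf) for a list of document strings."""
    tf = [Counter(text.lower().split()) for text in docs]  # tf[d][term]
    doclen = [sum(c.values()) for c in tf]                 # terms per document
    maxtf = [max(c.values(), default=0) for c in tf]       # largest tf per document
    df = Counter()    # number of documents containing each term
    ctf = Counter()   # total occurrences of each term in the collection
    for c in tf:
        df.update(c.keys())
        ctf.update(c)
    return tf, df, ctf, doclen, maxtf

docs = ["to be or not to be", "to index or to scan"]
tf, df, ctf, doclen, maxtf = collection_stats(docs)
print(tf[0]["to"], df["to"], ctf["to"], doclen, maxtf)
```

Here `tf` is per document and per term, `df` and `ctf` are per term, and `doclen` and `maxtf` are per document, which is why a statistical index stays small next to a positional one.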
• Common index components
– Dictionary (lexicon)
– Postings
• document ids
• word positions
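A word-level inverted file with these two components can be sketched as below. The layout is an assumption for illustration: the dictionary stores each term's document frequency, and the postings map a term to document ids with 1-based word positions. The name `build_inverted_file` is hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Return (dictionary, postings) for a word-level inverted file."""
    postings = defaultdict(dict)  # term -> {doc_id: [word positions]}
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(text.lower().split(), start=1):
            postings[term].setdefault(doc_id, []).append(pos)
    # Dictionary (lexicon): one entry per term, here its document frequency.
    dictionary = {term: len(by_doc) for term, by_doc in sorted(postings.items())}
    return dictionary, dict(postings)

docs = ["the inverted file", "the file of the index"]
dictionary, postings = build_inverted_file(docs)
print(dictionary["the"], postings["the"])
```

Dropping the position lists and keeping only the document ids gives a document-level inverted file, which is smaller but cannot answer phrase or proximity queries directly.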
Inverted Files
Word-Level Inverted File
• Choices for accessing data during query evaluation
• Scan the entire collection
– Typical in early (batch) retrieval systems
– Computational and I/O costs are O(characters in collection)
– Practical for only “small” text collections
– Large memory systems make scanning feasible
• In the 1960s: the SMART system, developed by Gerard Salton and his students
• Cranfield evaluations, conducted by Cyril Cleverdon
• The 1970s and 1980s saw many developments built on the advances of the 1960s
• 1992: inception of the Text Retrieval Conference (TREC)
• The algorithms developed in IR were employed for searching the Web from 1996
Clustering of SIGIR papers by topic vs. year
Cluster \ Year (year columns: 1971, 1978 through 2002, and Total)
Clusters: Databases, NL Interfaces; General !; Models; Question answering; Syntactic phrases & SDR; Conceptual IR, KB IR; Compression; Clustering; Relevance feedback; Inverted files & Implementations; Term weighting; Message understanding & TDT; Filtering; Hypertext IR, Multiple evidence; Image retrieval; Probabilistic & Language models; Boolean & extended Boolean; Japanese & Chinese IR; DBMS & IR; Users & Search; Visualisation; Signature files; Distributed IR; Evaluation; Topic distillation & Linkage retrieval; Latent semantic indexing; Text categorisation; Document summarisation; Cross lingual