IR-chapter2:the term vocabulary

作者: woodsouthmmm | 来源:发表于2017-04-17 21:52 被阅读0次

IR-chapter2:the term vocabulary
信息检索导论三：词典及容错式检索
TERM environment variable not se
ElasticSearch - term和全文搜索的区别
Balance
term
ES 深入理解搜索
2018-12-06
有关你的将来
ES6.x 各种查询

0.document delineation and character sequence decoding

obtain the character sequence in a document

determine the character encoding
determine the document file
assumption: text is a linear sequence of characters.

choose a document unit

indexing granularity
trade-off between precision(small - sentence) and recall(large - book).

1.determine the vocabulary of terms

tokenization

terminology

token
an instance of a sequence of characters
type
the class of all tokens containing the same character sequence.
term
A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

to sleep perchance to dream,

5 tokens,4 types (because there are two instances of to), 3 terms ( if to is omitted from the index as a stop word)

tricky cases

do the exact same tokenization for both the dictionary and query

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
language specific

unusual specific tokens:email address, URL, IP
Items such as the date of an email, which have a clear semantic
type, are often indexed separately as document metadata
hyphenations, splitting on white space
encourage users to enter hyphens wherever they may be
possible(depends on user' training)
new language, new issue
Chinese:word segmentation

dropping common terms: stop words

collect frequency, hand-filtered build stop list
modern IR system abandoned, for its harm to phrase query

normalization

token normalization

the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens

equivalence classing

using mapping rules that remove characters like hyphens
maintain relations between unnormalized tokens.

expansion lists

query expansion lists
perform the expansion during index construction
Although more space for postings seemed to be less efficient, for modern IR system, the increased flexibility is more appealing.

common forms of normalization

Accents and diacritics.
Spanish
Non-ASCII text is hardly used on many computer systems, it might be best to equate all words to a form without diacritics.
Capitalization/case-folding
Do case-folding by reducing all letters to lower case.(companies, government organization, personal names)
Making every token lowercase is to just make some tokens lowercase.(if your users usually use lowercase regardless of the correct case of words?)
other issues in English:
different version of English.
other languages:
Japanese: multiple alphabets

stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a
word, such as organize, organizes, and organizing.
stemming
a crude heuristic process that chops off the ends of words
lemmatization
doing things properly with the use of a vocabulary and morphological analysis of words.

2.faster posting list intersection via skip points

skip list: adding skip points

skip points

for a postings list of length P, use √P evenly spaced skip pointers.

intersect with skip list

the optimal encoding for an inverted index

Traditionally, CPUs were slow, and so highly compressed techniques were not optimal.
Now CPUs are fast and disk is slow, so reducing disk postings list size dominates.

3.positional postings and phrase queries

Many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes.

Biword indexes

longer phrase: break them down
NX*N: a extended biword indexes
i.e.the abolition of slavery
increase the space cost

Positional indexes

<doc1ID,frequency: <�position1, position2, . . . �>;
doc2ID,frequency: <�position1, position2, . . . �>;
...>

an example of positional indexes

k word proximity searches

Combination schemes

A combination strategy uses a phrase index, or just a biword index, for certain common queries(like Micheal Jackson) and uses a positional index for other phrase queries.

网友评论

本文标题：IR-chapter2:the term vocabulary

本文链接：https://www.haomeiwen.com/subject/mdhxzttx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！