美文网首页
IR-chapter2:the term vocabulary

IR-chapter2:the term vocabulary

作者: woodsouthmmm | 来源:发表于2017-04-17 21:52 被阅读0次

0.document delineation and character sequence decoding

obtain the character sequence in a document

  • determine the character encoding
  • determine the document file
  • assumption: text is a linear sequence of characters.

choose a document unit

  • indexing granularity
    trade-off between precision(small - sentence) and recall(large - book).

1.determine the vocabulary of terms

tokenization

terminology
  • token
    an instance of a sequence of characters
  • type
    the class of all tokens containing the same character sequence.
  • term
    A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

to sleep perchance to dream,

5 tokens,4 types (because there are two instances of to), 3 terms ( if to is omitted from the index as a stop word)

tricky cases
  • do the exact same tokenization for both the dictionary and query

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
language specific

  • unusual specific tokens:email address, URL, IP
    Items such as the date of an email, which have a clear semantic
    type, are often indexed separately as document metadata
  • hyphenations, splitting on white space
    encourage users to enter hyphens wherever they may be
    possible(depends on user' training)
  • new language, new issue
    Chinese:word segmentation

dropping common terms: stop words

  • collect frequency, hand-filtered build stop list
  • modern IR system abandoned, for its harm to phrase query

normalization

token normalization

the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens

equivalence classing
  • using mapping rules that remove characters like hyphens
  • maintain relations between unnormalized tokens.
expansion lists
  • query expansion lists
  • perform the expansion during index construction
    Although more space for postings seemed to be less efficient, for modern IR system, the increased flexibility is more appealing.
common forms of normalization
  • Accents and diacritics.
    Spanish
    Non-ASCII text is hardly used on many computer systems, it might be best to equate all words to a form without diacritics.
  • Capitalization/case-folding
    Do case-folding by reducing all letters to lower case.(companies, government organization, personal names)
    Making every token lowercase is to just make some tokens lowercase.(if your users usually use lowercase regardless of the correct case of words?)
  • other issues in English:
    different version of English.
  • other languages:
    Japanese: multiple alphabets

stemming and lemmatization

  • For grammatical reasons, documents are going to use different forms of a
    word, such as organize, organizes, and organizing.
  • stemming
    a crude heuristic process that chops off the ends of words
  • lemmatization
    doing things properly with the use of a vocabulary and morphological analysis of words.

2.faster posting list intersection via skip points

skip list: adding skip points

skip points

for a postings list of length P, use √P evenly spaced skip pointers.

intersect with skip list

the optimal encoding for an inverted index

  • Traditionally, CPUs were slow, and so highly compressed techniques were not optimal.
  • Now CPUs are fast and disk is slow, so reducing disk postings list size dominates.

3.positional postings and phrase queries

Many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes.

Biword indexes

  • longer phrase: break them down
  • NX*N: a extended biword indexes
    i.e.the abolition of slavery
  • increase the space cost

Positional indexes

<doc1ID,frequency: <�position1, position2, . . . �>;
doc2ID,frequency: <�position1, position2, . . . �>;
...>


an example of positional indexes k word proximity searches

Combination schemes

A combination strategy uses a phrase index, or just a biword index, for certain common queries(like Micheal Jackson) and uses a positional index for other phrase queries.

相关文章

网友评论

      本文标题:IR-chapter2:the term vocabulary

      本文链接:https://www.haomeiwen.com/subject/mdhxzttx.html