5 Categorizing and Tagging Words

Author: shashaslife | Published 2017-11-27 17:01

import os, re, nltk

from nltk.corpus import words, state_union, brown, treebank

from collections import defaultdict

Lists and tuples

# words = ['I', 'turned', 'off', 'the', 'spectroroute', 'the']
# words2 = ('I', 'turned', 'off', 'the', 'spectroroute', 'the', 'I')
# print(set(words))
# # print(reversed(words))   # reversed() returns an iterator; wrap it in list() to see the items
# print(sorted(words))
# print(set(words2))
# print(reversed(words2))
# print(sorted(words2))
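The commented experiment above needs no NLTK at all; a minimal self-contained version (variable names changed so they don't shadow the `nltk.corpus.words` import) shows the key behaviours:

```python
# set() removes duplicates, sorted() returns a new list, and
# reversed() returns a lazy iterator that must be materialised.
toy_words = ['I', 'turned', 'off', 'the', 'spectroroute', 'the']
toy_words2 = ('I', 'turned', 'off', 'the', 'spectroroute', 'the', 'I')

unique = set(toy_words)                  # only one 'the' survives
ordered = sorted(toy_words2)             # works on tuples too; always returns a list
backwards = list(reversed(toy_words2))   # list() materialises the iterator

print(unique)
print(ordered)
print(backwards)
```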

# Nouns
# brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
# word_tag_pairs = nltk.bigrams(brown_news_tagged)
# noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
# fdist = nltk.FreqDist(noun_preceders)
# common_preceders = [tag for (tag, value) in fdist.most_common()]
# print(common_preceders)   # the tag classes that most often precede a noun
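The same "which tags precede a noun" count can be sketched with only the standard library, using a toy tagged sentence in place of the Brown corpus (the data here is illustrative):

```python
from collections import Counter

# Toy (word, tag) sequence standing in for brown.tagged_words(...)
tagged = [('the', 'DET'), ('dog', 'NOUN'), ('saw', 'VERB'),
          ('a', 'DET'), ('big', 'ADJ'), ('cat', 'NOUN')]

# Pair each token with its successor, as nltk.bigrams does
pairs = list(zip(tagged, tagged[1:]))

# Collect the tag of every token that immediately precedes a NOUN
noun_preceders = [a[1] for (a, b) in pairs if b[1] == 'NOUN']
fdist = Counter(noun_preceders)
print(fdist.most_common())
```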

# Verbs

Find verbs whose past tense (VBD) and past participle (VBN) forms are identical:

# wsj = treebank.tagged_words()
# cfd1 = nltk.ConditionalFreqDist(wsj)
# vl = [w for w in cfd1.conditions() if 'VBN' in cfd1[w] and 'VBD' in cfd1[w]]
# print(vl)

Find the position of a past participle and its tag:

# cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
# vbn_list = list(cfd2['VBN'])
# idx1 = wsj.index(('kicked', 'VBN'))
# print(idx1)

Get the word preceding each past participle:

# for v in vbn_list:
#     idx = wsj.index((v, 'VBN'))
#     print(wsj[idx-1:idx])

Equivalent to:

# print([wsj[wsj.index((v, 'VBN'))-1:wsj.index((v, 'VBN'))] for v in vbn_list])

# Adjectives and Adverbs

Inverting the dictionary (conditioning on the tag rather than the word) is a common technique:

# def findtags(tag_prefix, tagged_text):
#     cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
#                                    if tag.startswith(tag_prefix))
#     return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
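A runnable stdlib sketch of the same idea, with `defaultdict(Counter)` standing in for `nltk.ConditionalFreqDist` and a toy tagged text (names and data are illustrative):

```python
from collections import defaultdict, Counter

def findtags(tag_prefix, tagged_text):
    # Condition on the tag; count the words carrying each matching tag
    cfd = defaultdict(Counter)
    for (word, tag) in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: cfd[tag].most_common(5) for tag in cfd}

toy = [('quickly', 'RB'), ('soon', 'RB'), ('faster', 'RBR'),
       ('quickly', 'RB'), ('dog', 'NN')]
result = findtags('RB', toy)
print(result)
```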

# Exploring tagged corpora
# brown_learned_tagged = brown.tagged_words(categories='learned', tagset='universal')
# tags = [b[1] for (a, b) in nltk.bigrams(brown_learned_tagged) if a[0] == 'often']
# # print(tags)
# fd = nltk.FreqDist(tags)
# print(fd.tabulate())

# Words that take more than three different tags, with their tags by frequency
# brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
# cfd = nltk.ConditionalFreqDist((word.lower(), tag)
#                                for (word, tag) in brown_news_tagged)
# for word in sorted(cfd.conditions()):
#     if len(cfd[word]) > 3:
#         tags = [tag for (tag, _) in cfd[word].most_common()]
#         # print(cfd[word])
#         print(word, tags)

# Dictionaries: defaultdict
# news_words = brown.words(categories='news')
# fd = nltk.FreqDist(news_words)
# v1000 = [word for (word, _) in fd.most_common(1000)]
# mapping = defaultdict(lambda: 'UNK')   # unseen words map to 'UNK'
# for word in v1000:
#     mapping[word] = word
# new_words = [mapping[word] for word in news_words]
# print(new_words[:20])
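The "map everything outside the top-n vocabulary to UNK" trick works the same way without the corpus; a tiny sketch with a toy text and a vocabulary of the two most frequent words:

```python
from collections import Counter, defaultdict

text = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
fd = Counter(text)
vocab = [word for (word, _) in fd.most_common(2)]   # the 2 most frequent words

mapping = defaultdict(lambda: 'UNK')   # any unknown key yields 'UNK'
for word in vocab:
    mapping[word] = word

new_text = [mapping[word] for word in text]
print(new_text)
```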

# Incrementally updating a dictionary
# words = words.words('en')   # note: this rebinds `words`, shadowing the corpus module from here on
# last_letters = defaultdict(list)
# for word in words:
#     key = word[-2:]   # a new key is created with an empty list automatically
#     last_letters[key].append(word)
# print(last_letters['zy'][:10])

# anagrams = defaultdict(list)   # group all words built from the same letters
# for word in words:
#     key = ''.join(sorted(word))
#     anagrams[key].append(word)
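The anagram grouping runs unchanged on any word list; a self-contained sketch with a toy list (no corpus download needed):

```python
from collections import defaultdict

wordlist = ['eat', 'tea', 'ate', 'tan', 'nat', 'bat']

anagrams = defaultdict(list)
for word in wordlist:
    key = ''.join(sorted(word))   # same letters -> same canonical key
    anagrams[key].append(word)

print(anagrams['aet'])
```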

NLTK provides a shorthand for this pattern:

# anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
# print(anagrams['abc'])

# Inverting a dictionary, for fast lookup by value
# pos = {'cats': 'N', 'name': 'N', 'old': 'ADJ', 'young': 'ADJ', 'run': 'V', 'sing': 'V'}
# # pos2 = dict((value, key) for (key, value) in pos.items())   # keeps only one word per tag
# pos2 = nltk.Index((value, key) for (key, value) in pos.items())
# print(pos2['N'])
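`nltk.Index` is essentially a `defaultdict(list)` over (key, value) pairs, so the inversion can be sketched with the standard library alone:

```python
from collections import defaultdict

pos = {'cats': 'N', 'name': 'N', 'old': 'ADJ', 'young': 'ADJ',
       'run': 'V', 'sing': 'V'}

# A plain dict((value, key) ...) would keep only one word per tag;
# an index of lists keeps them all
pos2 = defaultdict(list)
for word, tag in pos.items():
    pos2[tag].append(word)

print(pos2['N'])
```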

# Automatic tagging
# The lookup tagger: use the most likely tag of each of the 100 most frequent words
# brown_tagged_sents = brown.tagged_sents(categories='news')
# fd = nltk.FreqDist(brown.words(categories='news'))
# cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# most_freq_words = fd.most_common(100)
# likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
# baseline_tagger = nltk.UnigramTagger(model=likely_tags)
# print(cfd['news'].max())
# print(cfd['news'].tabulate())
# print(baseline_tagger.evaluate(brown_tagged_sents))
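The lookup-tagger idea reduces to a word-to-most-likely-tag table; a stdlib sketch with toy tagged words (a model-only `UnigramTagger` likewise returns `None` for unseen words):

```python
from collections import Counter, defaultdict

tagged_words = [('the', 'AT'), ('the', 'AT'), ('run', 'VB'),
                ('run', 'NN'), ('run', 'VB'), ('fox', 'NN')]

# Count each word's tags and keep the most frequent one
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1
likely_tags = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def lookup_tag(word):
    # None for unseen words, as a model-only UnigramTagger behaves
    return likely_tags.get(word)

print([lookup_tag(w) for w in ['the', 'run', 'unseen']])
```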

# N-Gram Tagging

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]   # split the data 90/10 into train and test
# print(train_sents[3])
test_sents = brown_tagged_sents[size:]

unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.size())
# print(unigram_tagger.tag(brown_sents[3]))

# bigram_tagger = nltk.BigramTagger(train_sents)
# print(bigram_tagger.evaluate(test_sents))

# Combining taggers with backoff
# t0 = nltk.DefaultTagger('NN')
# t1 = nltk.UnigramTagger(train_sents, backoff=t0)
# t2 = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)
# print(t2.evaluate(test_sents))
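The backoff chain t2 → t1 → t0 amounts to trying the most specific model first and falling through; it can be sketched as plain dictionary lookups (toy tables, purely illustrative):

```python
# Toy lookup tables standing in for the trained taggers
bigram_model = {('the', 'bank'): 'NN'}        # (previous word, word) -> tag
unigram_model = {'the': 'AT', 'saw': 'VBD'}   # word -> tag

def tag_word(prev, word):
    # t2 (bigram context) first, then t1 (unigram), then t0's default 'NN'
    if (prev, word) in bigram_model:
        return bigram_model[(prev, word)]
    return unigram_model.get(word, 'NN')

print([tag_word(p, w) for p, w in [('the', 'bank'), (None, 'saw'), (None, 'zork')]])
```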

# Evaluate on a held-out genre with a confusion matrix
# test_tags = [tag for sent in brown.sents(categories='editorial')
#              for (word, tag) in t2.tag(sent)]
# gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
# print(nltk.ConfusionMatrix(gold_tags, test_tags))

# What fraction of tokens appear in an ambiguous (previous tag, word) context?
# cfd = nltk.ConditionalFreqDist(
#     ((x[1], y[0]), y[1])
#     for sent in brown_tagged_sents
#     for x, y in nltk.bigrams(sent))
# ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
# print(sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N())
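That final statistic — the share of tokens whose (previous tag, word) context has been seen with more than one tag — can be computed with `Counter` on toy observations (data invented for illustration):

```python
from collections import defaultdict, Counter

# Each observation: ((previous tag, word), tag actually assigned)
observations = [(('DET', 'deal'), 'NOUN'),
                (('DET', 'deal'), 'NOUN'),
                (('VERB', 'deal'), 'NOUN'),
                (('VERB', 'deal'), 'VERB'),
                (('DET', 'dog'), 'NOUN')]

cfd = defaultdict(Counter)
for context, tag in observations:
    cfd[context][tag] += 1

# Contexts observed with more than one tag are ambiguous
ambiguous = [c for c in cfd if len(cfd[c]) > 1]
total = sum(sum(cnt.values()) for cnt in cfd.values())
ambiguous_mass = sum(sum(cfd[c].values()) for c in ambiguous)
print(ambiguous_mass / total)
```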


Source: https://www.haomeiwen.com/subject/coudbxtx.html