哦,不!你不小心把一个长篇文章中的空格、标点都删掉了,并且大写也弄成了小写。像句子"I reset the computer. It still didn’t boot!"已经变成了"iresetthecomputeritstilldidntboot"。在处理标点符号和大小写之前,你得先把它断成词语。当然了,你有一本厚厚的词典dictionary,不过,有些词没在词典里。假设文章用sentence表示,设计一个算法,把文章断开,要求未识别的字符最少,返回未识别的字符数。
注意:本题相对原题稍作改动,只需返回未识别的字符数。
示例:
输入:
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
输出: 7
解释: 断句后为"jess looked just like tim her brother",共7个未识别字符。
提示:
- 0 <= len(sentence) <= 1000
- dictionary中总字符数不超过 150000。
- 你可以认为dictionary和sentence中只包含小写字母。
解题思路
方法一:直接动态规划
数组dp[i]的元素代表:第i个字符之前的未识别字符有几个。利用字符串切片来找是否与dictionary中的词匹配,匹配时状态转移方程为:
之所以会取min,是因为存在重叠的可能,譬如:
dictionary = ['leo','apple']
sentence = 'appleo'
'''
这种情况下,最少的未匹配方案就是'apple','o'
'''
最外层遍历sentence,在每个位置对dictionary中的单词再遍历,找切片字符串是否匹配,提前先给dictionary单词按照长度进行升序排序,不然会出错。
class Solution:
def respace(self, dictionary: List[str], sentence: str) -> int:
if not sentence:
return 0
if not dictionary:
return len(sentence)
dictionary.sort(key = lambda x: len(x))
dp = [0] * (len(sentence)+1) # 多一个,参看dp的定义
for i in range(1,len(sentence)+1):
dp[i] = dp[i-1] + 1
for word in dictionary:
if len(word) > i:
break
if sentence[i-len(word):i] == word:
dp[i] = min(dp[i], dp[i-len(word)])
return dp[-1]
方法二:Trie树+DP
参照官方题解,对dictionary中的单词倒叙插入到树中,对于结束词打上word_end标记。
倒叙插入的Trie树会好好利用单词的相同后缀。如:

代码:
# 根据给定的单词表构建嵌套的字典树
class Trie:
def __init__(self):
self.root = {}
self.word_end = -1
def insert(self, word):
curNode = self.root
# 将单词逆序构建
for c in word[::-1]:
if c not in curNode:
curNode[c] = {}
curNode = curNode[c]
curNode[self.word_end] = True
class Solution:
def respace(self, dictionary: List[str], sentence: str) -> int:
# 从给定的单词表中构建一颗逆序字典树
tire = Trie()
for word in dictionary:
tire.insert(word)
n = len(sentence)
dp = [0] * (n + 1)
for i in range(1, n + 1):
dp[i] = dp[i-1] + 1
# Trie树查找
curNode = tire.root
for j in range(i, 0, -1):
c = sentence[j - 1]
if c not in curNode:
break
elif tire.word_end in curNode[c]:
dp[i] = min(dp[i], dp[j - 1])
curNode = curNode[c]
return dp[-1]
网友评论