I am running Cloudera Spark 1.5.0 with Python 2.6.6.

I have defined these 3 functions:

def tf(tokens):
    """ Compute Term/Token Frequency
    Args:
    tokens (list of str): input list of tokens from tokenize
    Returns:
    dictionary: a dictionary of tokens to its TF values
    """
    tokenCounts = {}
    for t in tokens:
        tokenCounts[t] = tokenCounts.get(t, 0) + (1.0/len(tokens))
    return tokenCounts

def idfs(corpus):
    """ Compute IDF
    Args:
    corpus (RDD): input corpus
    Returns:
    RDD: a RDD of (token, IDF value)
    """
    uniqueTokens = corpus.flatMap(lambda x: list(set(x[1])))
    tokenCountPairTuple = uniqueTokens.map(lambda x: (x, 1))
    tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a,b: a+b)
    N = corpus.count()
    return (tokenSumPairTuple.map(lambda x: (x[0], float(N)/float(x[1]))))

def tfidf(tokens, idfs):
    """ Compute TF-IDF
    Args:
    tokens (list of str): input list of tokens from tokenize
    idfs (dictionary): record to IDF value
    Returns:
    dictionary: a dictionary of records to TF-IDF values
    """
    tfs = tf(tokens)
    tfIdfDict = {k: v*idfs[k] for k, v in tfs.items()}
    return tfIdfDict

From another function, I call it like this:

w1 = tfidf(tokenize(string1),idfsDictionary)

I get this error:

tfIdfDict = {k: v*idfs[k] for k, v in tfs.items()}
                            ^
SyntaxError: invalid syntax

What is wrong with my syntax? The same code runs fine in the Databricks environment.
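
A likely explanation, for reference: dict comprehensions (`{k: v for ...}`) were added to the language in Python 2.7, so a Python 2.6.6 interpreter reports a SyntaxError on that line. The Databricks environment presumably runs a newer Python, which is an assumption and not something shown in the post. Below is a minimal sketch of a 2.6-compatible rewrite that passes a generator of (key, value) pairs to `dict()`. The sample tokens and the `sample_idfs` dictionary are hypothetical values made up for illustration.

```python
def tf(tokens):
    """Compute term frequency for each token in a list of tokens."""
    token_counts = {}
    for t in tokens:
        token_counts[t] = token_counts.get(t, 0) + (1.0 / len(tokens))
    return token_counts


def tfidf(tokens, idfs):
    """Compute TF-IDF without using a dict comprehension."""
    tfs = tf(tokens)
    # dict() over a generator of (key, value) tuples is valid on Python 2.6,
    # whereas {k: v for ...} requires Python 2.7 or later.
    return dict((k, v * idfs[k]) for k, v in tfs.items())


# Hypothetical sample data, not taken from the original post.
sample_idfs = {"spark": 2.0, "python": 4.0}
w1 = tfidf(["spark", "python", "spark", "spark"], sample_idfs)
```

The behavior is the same as the original `tfidf`; only the construction of the dictionary changes. Upgrading the cluster's Python to 2.7 or later would be the other way to make the original line parse.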
