python - UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 39: ordinal not in range(128)

Question

I'm new to Python and have been trying to fix this for two hours.

Here is the code:

import praw
import json
import requests
import tweepy
import time

access_token = 'REDACTED'
access_token_secret = 'REDACTED'
consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'

def strip_title(title):
    if len(title) < 94:
        return title
    else:
        return title[:93] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=20):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]         
        short_link = shorten(post_link)
        mini_post_dict[post_title] = short_link 
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('yasoob_python reddit twitter bot '
                'monitoring %s' %(subreddit)) 
    subreddit = r.get_subreddit(subreddit)
    return subreddit

def shorten(url):
    headers = {'content-type': 'application/json'}
    payload = {"longUrl": url}
    url = "https://www.googleapis.com/urlshortener/v1/url"
    r = requests.post(url, data=json.dumps(payload), headers=headers)
    link = json.loads(r.text)['id']
    return link

def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(30)
        else:
            print "[bot] Already posted" 

if __name__ == '__main__':
    main()

Traceback:

root@li732-134:~# python twitter.py
[bot] setting up connection with Reddit
[bot] Getting posts from Reddit
[bot] Generating short link using goo.gl
[bot] Already posted
[bot] Already posted
[bot] Already posted
[bot] Posting this link on twitter
Traceback (most recent call last):
  File "twitter.py", line 82, in <module>
    main()
  File "twitter.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "twitter.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 39: ordinal not in range(128)

I really don't know what to do. Can someone point me in the right direction?

Edit: added the code and traceback.

score 1 · Accepted Answer

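Even if you call decode(), the bytes you receive must be in the expected, correctly encoded form.

If \xea appears in a UTF-8 string, it must be followed by two more bytes, and not just any bytes: each must fall in the valid continuation range (0x80 to 0xBF). Otherwise the data is not valid UTF-8.

For example, here are two Unicode code points. The first, U+0056, takes only one byte. The next, U+A000, needs three bytes, which we know because its encoding starts with \xea: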

http://hexutf8.com/?q=0x560xea0x800x80

Remove just the last continuation byte from the sequence above and it is no longer valid UTF-8:

http://hexutf8.com/?q=0x560xea0x80

I don't see where you posted the value that fails, but I would double-check that you are actually receiving valid UTF-8 data.
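The two byte sequences linked above can also be checked locally. This is a minimal sketch, not part of the original answer; the helper name is_valid_utf8 is made up for illustration, and the code is written to run under both Python 2 and Python 3.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+0056 ("V") is a single byte; U+A000 is the three bytes EA 80 80.
complete = b"\x56\xea\x80\x80"
# The same data with the last continuation byte removed.
truncated = b"\x56\xea\x80"

decoded = complete.decode("utf-8")

print(is_valid_utf8(complete))   # True
print(is_valid_utf8(truncated))  # False
print(len(decoded))              # 2 (two code points)
```

A check like this only tells you whether the bytes are well-formed UTF-8; it says nothing about whether UTF-8 is the encoding the sender intended.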

score 0 · Accepted Answer

The error occurs here:

print post+" "+post_dict[post]+" #python"

The problem appears to be that this line concatenates ASCII strings with Unicode strings. Try concatenating only Unicode strings:

print post + u" " + post_dict[post] + u" #python"

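If you still have problems, check the output of type(post) and type(post_dict[post]); both should be Unicode strings. If either is not, you need to convert it to a Unicode string using the correct encoding (most likely UTF-8). This can be done with: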

post.decode('UTF-8')

or:

post_dict[post].decode('UTF-8')

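The above converts a byte string to a Unicode string in Python 2. Once that is done, you can safely concatenate the Unicode strings. The key in Python 2 is never to mix regular (byte) strings with Unicode strings, because Python 2 converts between them implicitly using the ASCII codec, which fails on any non-ASCII character.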
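The conversion described above can be wrapped in a small helper so that every piece is Unicode before it is concatenated. This is only a sketch: to_text, the sample title, and the example.com link are all made up for illustration, and UTF-8 is assumed as the source encoding. The last lines reproduce the error from the title by encoding to ASCII explicitly; whether the bot's terminal was actually using ASCII is an assumption, not something the traceback confirms.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value

# Hypothetical sample data: a UTF-8 encoded title and a Unicode link.
title = b"f\xc3\xaa" b"te"            # UTF-8 bytes for u"f\xeate"
link = u"http://example.com/abc"

# Every operand is Unicode, so no implicit ASCII conversion takes place.
status = to_text(title) + u" " + to_text(link) + u" #python"

# Encoding to ASCII fails on u'\xea', as in the question's error message.
try:
    status.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 can represent every character.
utf8_bytes = status.encode("utf-8")

print(ascii_ok)          # False
print(repr(utf8_bytes))
```

Decoding once at the point where data enters the program, and encoding once where it leaves, keeps the rest of the code working with Unicode strings only.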
