
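I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it started failing, and I assume that is because someone's title on reddit has a formatting issue. The error I'm getting is: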

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

Here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be greatly appreciated. Thanks in advance!


3 Answers


Consider this simple program:

print(u'\u201c' + "python")

If you print to a terminal (one with an appropriate character encoding), you get

“python

However, if you redirect the output to a file, you get a UnicodeEncodeError:

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

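When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only print bytes, so unicode must be encoded before it can be printed.)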
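As a quick diagnostic, you can check which encoding Python has picked for standard output. This is only a sketch: the variable name `stdout_encoding` is made up for illustration, and the value you see depends on your terminal, locale, and Python version. In Python 2 it is typically `None` when output is redirected to a file or pipe, which is the situation where the implicit ascii fallback kicks in.

```python
import sys

# sys.stdout.encoding is the encoding print uses to turn text into bytes.
# getattr() guards against stdout replacements that lack the attribute.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Write the report to stderr so it is still visible when stdout is redirected.
sys.stderr.write('stdout encoding: {!r}\n'.format(stdout_encoding))
```

Running the script once normally and once with `> /tmp/out` should show whether the encoding changes between the two cases.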

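When you redirect output to a file, Python cannot determine the character encoding, because a file has no declared encoding. So by default, Python 2 implicitly encodes all unicode with the ascii codec before writing it to the file. Since u'\u201c' cannot be encoded in ascii, you get a UnicodeEncodeError. (Only the first 128 Unicode code points, 0 through 127, can be encoded in ascii.)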
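The failure can be reproduced without print or redirection at all, by calling the codec directly. A minimal sketch (the names `text`, `error`, and `utf8_bytes` are illustrative, not taken from the script above):

```python
text = u'\u201cpython'  # starts with a left double quotation mark (U+201C)

error = None
try:
    # This is the conversion Python 2 attempts implicitly when no
    # encoding is known for the output stream.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# An encoding that covers all of Unicode handles the same text fine.
utf8_bytes = text.encode('utf-8')
```

Here `error` ends up holding the same kind of UnicodeEncodeError as in the traceback, while the UTF-8 call succeeds and returns bytes.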

This problem is explained in detail in the Why Print Fails wiki.


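To fix the problem, first of all, avoid adding unicode and byte strings together. Doing so causes an implicit conversion using the ascii codec in Python 2, and raises an exception in Python 3. To future-proof your code, it is better to be explicit. For example, look up the link first (post_dict is keyed by the unicode title), then encode both pieces explicitly before formatting and printing the bytes: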

# Look up the link while post is still unicode, since that is the dict key.
link = post_dict[post]
print('{} {} #python'.format(post.encode('utf-8'), link.encode('utf-8')))
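Another way to stay explicit is to build the whole line as unicode and encode it once, at the point where it is written out. The sketch below is one possible shape, not the only fix: `write_line` is a hypothetical helper, `http://example.com` is a placeholder URL, and the `io.BytesIO` buffer stands in for whatever byte stream you actually write to.

```python
import io


def write_line(stream, title, link):
    """Format the tweet text as unicode, then encode once and write bytes."""
    line = u'{} {} #python\n'.format(title, link)
    stream.write(line.encode('utf-8'))


# Demonstrate with an in-memory byte stream and a title containing
# the curly quotes that triggered the original error.
buf = io.BytesIO()
write_line(buf, u'\u201cHello\u201d', u'http://example.com')
```

Because the encoding happens in one place, the result no longer depends on whether the output goes to a terminal or to a file.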
answered 2016-01-17T11:14:08.437

answered 2016-01-17T11:00:48.410

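The problem is probably caused by mixing byte strings and unicode strings when concatenating. As an alternative to prefixing every string literal with u, perhaps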

from __future__ import unicode_literals

solves the problem for you. See here for a more in-depth explanation, and to decide whether it is an option for you.

answered 2016-01-17T10:58:43.453