

news 2025/11/28 23:27:22

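When you first start handling Chinese text in Python (Python 2 in this article), whether reading files, receiving messages, or parsing HTTP parameters, the first run usually produces garbled output somewhere: in string processing, in file reads and writes, or in print.

Most people then respond by trying encode/decode calls until something works, without really thinking about why the text was garbled in the first place. That is why these two errors show up so often during debugging.

Error 1

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Error 2

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)


First things first

You need a general understanding of character sets and character encodings: ASCII, Unicode, UTF-8, and so on. Two useful background articles:

    字符编码笔记：ASCII，Unicode和UTF-8 (Notes on character encodings: ASCII, Unicode and UTF-8)
    淘宝搜索技术博客-中文编码杂谈 (Taobao Search Tech Blog: notes on Chinese encodings)


str and unicode

Both str and unicode are subclasses of basestring, so you can test whether a value is a string of either kind like this:

    def is_str(s):
        return isinstance(s, basestring)

Converting between str and unicode (see the documentation for decode and encode):

    str     -> decode('the_coding_of_str')   -> unicode
    unicode -> encode('the_coding_you_want') -> str

The difference

str is a byte string: the bytes produced by encoding unicode text.

Ways to create one:

    s = '中文'
    s = u'中文'.encode('utf-8')

    >>> type('中文')
    <type 'str'>

Its length is the number of bytes:

    >>> u'中文'.encode('utf-8')
    '\xe4\xb8\xad\xe6\x96\x87'
    >>> len(u'中文'.encode('utf-8'))
    6

unicode is the real string type: a sequence of characters.

Ways to create one:

    s = u'中文'
    s = '中文'.decode('utf-8')
    s = unicode('中文', 'utf-8')

    >>> type(u'中文')
    <type 'unicode'>

Its length is the number of characters, which is what your program logic actually wants:

    >>> u'中文'
    u'\u4e2d\u6587'
    >>> len(u'中文')
    2

Conclusion: work out whether you are holding a str or a unicode, then use the matching method (str.decode or unicode.encode).

To check which one you have:

    >>> isinstance(u'中文', unicode)
    True
    >>> isinstance('中文', unicode)
    False
    >>> isinstance('中文', str)
    True
    >>> isinstance(u'中文', str)
    False

A simple rule: never call encode on a str, and never call decode on a unicode. (Calling encode on a str is in fact possible, as described in the last section, but for the sake of simplicity it is not recommended.)

    >>> '中文'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

    >>> u'中文'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

To convert between two encodings, use unicode as the intermediate form:

    # s is a str encoded in code_A
    s.decode('code_A').encode('code_B')


Files, IDEs and the console

A useful way to picture the processing flow: treat Python as a pool with one inlet and one outlet. At the inlet, convert everything to unicode. Inside the pool, work only with unicode. At the outlet, convert to the target encoding. (There are exceptions, of course, such as logic that has to operate on a specific encoding.)

    1. Read the file or other external input.
    2. decode from the external encoding to unicode.
    3. Process (internally, everything is unicode).
    4. encode to the required target encoding.
    5. Write to the destination (a file or the console).

IDEs and consoles produce errors or garbled text when the encoding of the printed bytes differs from the encoding the IDE or console itself expects. Encode the output to the matching encoding and it displays correctly. In the session below the console expects UTF-8, so only the second line prints readable text:

    >>> print u'中文'.encode('gbk')
    >>> print u'中文'.encode('utf-8')
    中文


Recommendations

Standardize your encodings

Use one encoding everywhere so that no single stage can introduce garbled text. This covers the environment, the IDE or text editor, file encodings, and database and table encodings.

Get the source file encoding right

This one matters a lot. The default encoding of a .py file is ASCII. If a source file contains non-ASCII characters, it needs an encoding declaration at the top (see the documentation). The declaration must be on the first or second line of the file. Without it, a non-ASCII character triggers this error:

    File "XXX.py", line 3
    SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How to declare it:

    # -*- coding: utf-8 -*-

or

    #coding=utf-8

If the header declares coding=utf-8, then a = '中文' holds UTF-8 bytes.
If the header declares coding=gb2312, then a = '中文' holds GB2312 bytes (which also decode as GBK, since GBK is a superset of GB2312).

So use one declared encoding in the header of every source file in a project, and make sure the declared encoding matches the encoding the file is actually saved in (this depends on your editor).

Use unicode for hard-coded strings that take part in processing

This separates the type of the literal from the encoding of the source file itself, so the literal has no dependency on it and can be used anywhere in the flow.

    if s == u'中文':  # rather than s == '中文'
        pass
    # Note: make sure s has already been converted to unicode by this point.

Once the steps above are in place, you only need to keep track of two things: unicode, and the one encoding you have chosen (usually utf-8).

Processing order:

    1. Decode early
    2. Unicode everywhere
    3. Encode later


Related modules and methods

Getting and setting the system default encoding

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> sys.getdefaultencoding()
    'utf-8'

str.encode('other_coding')

Python lets you call encode directly on a str in one encoding to get a str in another encoding:

    # str_A is utf-8
    str_A.encode('gbk')

What actually runs is:

    str_A.decode('sys_codec').encode('gbk')

Here sys_codec is the value returned by sys.getdefaultencoding() in the previous section. This implicit decode is also why calling encode on a non-ASCII str raises UnicodeDecodeError when the default encoding is ascii.

So the system default encoding and str.encode are related. I rarely use this form myself, mainly because it feels complicated and hard to control. Explicitly decoding on input and explicitly encoding on output is simpler (personal opinion).

chardet

Detects the encoding of a file (available as a separate download).

    >>> import chardet
    >>> f = open('test.txt', 'r')
    >>> result = chardet.detect(f.read())
    >>> result
    {'confidence': 0.99, 'encoding': 'utf-8'}

Converting a \u escape string to the corresponding unicode string

    >>> u'中'
    u'\u4e2d'

    >>> s = '\u4e2d'
    >>> print s.decode('unicode_escape')
    中

    >>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
    >>> a.decode('unicode_escape')
    u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'

See also: the Python unicode documentation.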
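The processing order described in the article (decode early, unicode everywhere, encode later) and the rule of converting between encodings through unicode can be sketched in a few lines. This is only an illustration, not code from the original article: the function name `transcode` and the sample text are assumptions made for the example, and it uses `unicode_literals` so that the same file should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def transcode(raw, source_encoding, target_encoding):
    """Convert bytes between encodings, using unicode text as the pivot.

    `transcode` is a hypothetical helper written for this sketch.
    """
    text = raw.decode(source_encoding)    # 1. decode early (the inlet)
    text = text.strip()                   # 2. process unicode text only
    return text.encode(target_encoding)   # 3. encode late (the outlet)


# Simulated external input: UTF-8 bytes with a trailing newline.
utf8_bytes = "中文\n".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")

print(len(utf8_bytes))               # 7: six UTF-8 bytes plus the newline
print(len(gbk_bytes))                # 4: two bytes per character in GBK
print(len(gbk_bytes.decode("gbk")))  # 2: two characters once decoded
```

The byte counts differ between encodings, while the character count of the decoded text stays at 2, which is why the article recommends doing all length checks and comparisons on unicode rather than on encoded bytes.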


http://www.zqtcl.cn/news/878231/