Fixing garbled Chinese ("&#number;") from tostring() when scraping web pages - 白红宇的个人博客

Published: 2021-05-07 19:49:21 | Category: Technical articles


Code that produces the garbled output

import re
from lxml import etree

with open('real_case.html', 'r', encoding='utf-8') as f:
    c = f.read()

tree = etree.HTML(c)
table_element = tree.xpath("//div[@class='table-box'][1]/table/tbody/tr")
# Regular expression that strips <...> tags
pattern1_attrib = re.compile(r"<.*?>")
for row in table_element:
    try:
        td1 = row.xpath('td')[0]
        # The garbled text appears after this tostring() call
        s1 = etree.tostring(td1).decode('utf-8')
        s1 = pattern1_attrib.sub('', s1)
        print(s1)
    except IndexError:
        # Skip rows that contain no <td> cell
        continue

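Garbled output: the Chinese characters are printed as numeric character references of the form "&#number;".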
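The "&#number;" sequences are not a decoding failure. lxml's tostring() serializes to ASCII by default, so every non-ASCII character is written as a numeric character reference holding its Unicode code point. The sketch below reproduces the same kind of output with the standard library only; the string `sample` is a made-up stand-in for one table cell, not data from real_case.html.

```python
import html

# Hypothetical cell text; the real page content is not shown in the post.
sample = "中文"

# Encoding to ASCII with "xmlcharrefreplace" turns each non-ASCII character
# into a "&#number;" reference, the same form that appears in the output above.
escaped = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# Each number is simply the decimal Unicode code point of the character.
code_points = [ord(ch) for ch in sample]
print(code_points)

# html.unescape() maps the references back to the original characters.
restored = html.unescape(escaped)
print(restored == sample)
```

Because the references carry the full code point, no information is lost; the text only needs to be converted back to characters.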

Corrected code
Import the html module from the standard library and call its unescape() function on the serialized string.

import re
from lxml import etree
# Import the html module from the standard library
import html

with open('real_case.html', 'r', encoding='utf-8') as f:
    c = f.read()

tree = etree.HTML(c)
table_element = tree.xpath("//div[@class='table-box'][1]/table/tbody/tr")
# Regular expression that strips <...> tags
pattern1_attrib = re.compile(r"<.*?>")
for row in table_element:
    try:
        td1 = row.xpath('td')[0]
        s1 = etree.tostring(td1).decode('utf-8')
        s1 = pattern1_attrib.sub('', s1)
        # unescape() converts character references to the corresponding
        # Unicode characters, following the rules defined by the HTML5 standard.
        s1 = html.unescape(s1)
        print(s1)
    except IndexError:
        # Skip rows that contain no <td> cell
        continue

Result: the cell text is printed as readable Chinese characters instead of "&#number;" references.
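An alternative that the original post does not use, offered only as a sketch: tostring() accepts an `encoding` argument, and passing `encoding="unicode"` makes it return a str with non-ASCII characters left as they are, so there is nothing to unescape afterwards. Adding `method="text"` also drops the tags, which may remove the need for the regular expression. The markup below is a hypothetical sample, not the structure of real_case.html.

```python
from lxml import etree

# Hypothetical sample markup standing in for one cell of the scraped table.
doc = etree.HTML("<table><tr><td><b>中文</b>内容</td></tr></table>")
td1 = doc.xpath("//td")[0]

# encoding="unicode" returns a str and keeps non-ASCII characters unescaped.
markup = etree.tostring(td1, encoding="unicode")
print(markup)

# method="text" serializes only the text content, without any tags.
text_only = etree.tostring(td1, method="text", encoding="unicode")
print(text_only)
```

Whether this fits depends on what the cell contains: `method="text"` concatenates all nested text, which may or may not match what the regular expression in the post leaves behind.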
