Parsing XML Files with Python - 白红宇's Personal Blog

Parsing XML Files with Python

Published: 2021-05-08 04:51:57 | Category: Featured Articles

This article is translated from: https://developer.yahoo.com/python/python-xml.html

Using Python to Parse XML Files

Many YDN APIs offer JSON output, and JSON is a convenient data structure for most purposes. When an API does not offer JSON output, you can parse its data with Python's XML modules instead.

Using minidom

Python provides modules for parsing XML; xml.dom.minidom, the one used in this section, is the better fit for working with Yahoo! APIs.

Here is an example that parses a Yahoo API response with xml.dom.minidom:

import urllib
from xml.dom import minidom

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    # Python 2 API; in Python 3 this call is urllib.request.urlopen()
    dom = minidom.parse(urllib.urlopen(url))

minidom.parse() takes a file-like object. Here that object is the one returned by urllib.urlopen().
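To make "file-like object" concrete, here is a minimal Python 3 sketch that feeds minidom.parse() an in-memory stream instead of a network response. The SAMPLE_RSS document is a made-up stand-in that only imitates the shape of the feed; it is not real API output.

```python
import io
from xml.dom import minidom

# Hypothetical sample document (an assumption, not real feed output).
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample Weather - Lawrence, KS</title>
  </channel>
</rss>"""

# io.BytesIO wraps the bytes in an object with a read() method,
# which is what minidom.parse() expects from a file-like object.
dom = minidom.parse(io.BytesIO(SAMPLE_RSS))

# documentElement is the root element of the parsed document.
root_name = dom.documentElement.tagName
print(root_name)
```

When the XML is already held in a string, minidom.parseString() accepts it directly and skips the wrapper.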

    forecasts = []
    for node in dom.getElementsByTagNameNS(WEATHER_NS, 'forecast'):
        forecasts.append({
            'date': node.getAttribute('date'),
            'low': node.getAttribute('low'),
            'high': node.getAttribute('high'),
            'condition': node.getAttribute('text')
        })

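getElementsByTagNameNS() is a namespace-aware function that takes two arguments: the first is the namespace URL and the second is the tag name. We could call getElementsByTagName('yweather:forecast') instead, but the namespace-aware version makes the program more robust, because it matches on the namespace itself rather than on whatever prefix the document happens to use.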
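The robustness point can be illustrated with a small Python 3 sketch. The SAMPLE document below is hypothetical: it binds the same namespace URL to the prefix "w" rather than "yweather", which a feed is free to do.

```python
from xml.dom import minidom

WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# Hypothetical document: same namespace URL, different prefix ("w").
SAMPLE = (
    '<rss xmlns:w="%s"><channel><item>'
    '<w:forecast date="20 Jul 2006" low="75" high="103" text="Mostly Sunny"/>'
    '</item></channel></rss>' % WEATHER_NS
)

dom = minidom.parseString(SAMPLE)

# Matches on (namespace URL, local name), so the prefix does not matter.
by_namespace = dom.getElementsByTagNameNS(WEATHER_NS, 'forecast')

# Matches on the literal qualified name, so it misses <w:forecast>.
by_prefix = dom.getElementsByTagName('yweather:forecast')

print(len(by_namespace), len(by_prefix))
```

The namespace-aware lookup still finds the element, while the prefix-based lookup comes back empty.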

    ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
    return {
        'current_condition': ycondition.getAttribute('text'),
        'current_temp': ycondition.getAttribute('temp'),
        'forecasts': forecasts,
        'title': dom.getElementsByTagName('title')[0].firstChild.data
    }

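In the last line, dom.getElementsByTagName('title')[0] picks the first <title> element in the document, and firstChild.data extracts the text inside it.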
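The chain in that last line can be broken into its steps. This is a Python 3 sketch on a made-up one-line document (SAMPLE is an assumption, not feed output):

```python
from xml.dom import minidom

# Hypothetical sample document.
SAMPLE = '<rss><channel><title>Sample Weather - Lawrence, KS</title></channel></rss>'
dom = minidom.parseString(SAMPLE)

# Step 1: the first <title> Element in document order.
title_element = dom.getElementsByTagName('title')[0]

# Step 2: its first child, which here is a Text node.
text_node = title_element.firstChild

# Step 3: the character data held by that Text node.
title_text = text_node.data
print(title_text)
```

Note that an empty element such as <title/> has no children, so firstChild would be None and .data would raise AttributeError; code that cannot trust its input may want to check for that first.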

Here is the complete code:

import urllib
from xml.dom import minidom

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    # Python 2 API; in Python 3 this call is urllib.request.urlopen()
    dom = minidom.parse(urllib.urlopen(url))
    forecasts = []
    for node in dom.getElementsByTagNameNS(WEATHER_NS, 'forecast'):
        forecasts.append({
            'date': node.getAttribute('date'),
            'low': node.getAttribute('low'),
            'high': node.getAttribute('high'),
            'condition': node.getAttribute('text')
        })
    ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
    return {
        'current_condition': ycondition.getAttribute('text'),
        'current_temp': ycondition.getAttribute('temp'),
        'forecasts': forecasts,
        'title': dom.getElementsByTagName('title')[0].firstChild.data
    }

Let's try it with ZIP code 66044:

>>> from pprint import pprint
>>> pprint(weather_for_zip(66044))
{'current_condition': u'Fair',
 'current_temp': u'85',
 'forecasts': [{'condition': u'Mostly Sunny',
                'date': u'20 Jul 2006',
                'high': u'103',
                'low': u'75'},
               {'condition': u'Scattered Thunderstorms',
                'date': u'21 Jul 2006',
                'high': u'82',
                'low': u'61'}],
 'title': u'Yahoo! Weather - Lawrence, KS'}

Using ElementTree

Let's implement the same example once more, this time with the ElementTree API:

import urllib
# Standalone elementtree package; since Python 2.5 the same API ships
# in the standard library as xml.etree.ElementTree
from elementtree.ElementTree import parse

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    rss = parse(urllib.urlopen(url)).getroot()

ElementTree's parse() function takes a file-like object and returns an ElementTree object that represents the whole XML document.

The getroot() method returns the root of the element tree, which in this example is the <rss> element.
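The two steps, parse() and then getroot(), can be seen in isolation in this Python 3 sketch. It uses the standard-library xml.etree.ElementTree and a hypothetical in-memory document (SAMPLE_RSS is an assumption) in place of the network response.

```python
import io
from xml.etree.ElementTree import parse

# Hypothetical sample document standing in for the feed.
SAMPLE_RSS = b'<rss version="2.0"><channel><title>Sample Weather</title></channel></rss>'

# parse() reads from a file-like object and returns an ElementTree,
# a wrapper around the whole document.
tree = parse(io.BytesIO(SAMPLE_RSS))

# getroot() returns the top-level Element, here <rss>.
rss = tree.getroot()
print(rss.tag, rss.get('version'))
```

For XML that is already in a string, xml.etree.ElementTree.fromstring() returns the root Element directly.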

    forecasts = []
    for element in rss.findall('channel/item/{%s}forecast' % WEATHER_NS):
        forecasts.append({
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text')
        })

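Here a path pattern is used to find the <yweather:forecast> elements, which are children of <item>, which in turn is a child of <channel>. The attributes of these elements can be read with the get() method.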
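A short Python 3 sketch of the path pattern at work, using the standard-library xml.etree.ElementTree. The SAMPLE document is hypothetical; its attribute values are borrowed from the sample output shown earlier in the article.

```python
from xml.etree.ElementTree import fromstring

WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# Hypothetical sample document with two forecast elements.
SAMPLE = """<rss xmlns:yweather="%s"><channel><item>
<yweather:forecast date="20 Jul 2006" low="75" high="103" text="Mostly Sunny"/>
<yweather:forecast date="21 Jul 2006" low="61" high="82" text="Scattered Thunderstorms"/>
</item></channel></rss>""" % WEATHER_NS

rss = fromstring(SAMPLE)

# ElementTree spells a namespaced tag as "{namespace-url}localname".
path = 'channel/item/{%s}forecast' % WEATHER_NS
forecasts = rss.findall(path)

tags = [element.tag for element in forecasts]
highs = [element.get('high') for element in forecasts]

# get() returns None for an attribute that is not present.
missing = forecasts[0].get('humidity')

print(tags[0])
print(highs)
```

The path is relative to the element it is called on, which is why it starts at channel rather than at rss.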

Next, find the <yweather:condition> element:

    ycondition = rss.find('channel/item/{%s}condition' % WEATHER_NS)
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title')
    }

Here is the complete code:

import urllib
# Standalone elementtree package; since Python 2.5 the same API ships
# in the standard library as xml.etree.ElementTree
from elementtree.ElementTree import parse

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    rss = parse(urllib.urlopen(url)).getroot()
    forecasts = []
    for element in rss.findall('channel/item/{%s}forecast' % WEATHER_NS):
        forecasts.append({
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text')
        })
    ycondition = rss.find('channel/item/{%s}condition' % WEATHER_NS)
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title')
    }
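For readers on Python 3, the following is one possible adaptation, offered as a sketch rather than as the article's own code. It separates parsing from fetching so it can run without the network (the feed URL may no longer respond), and it uses the optional namespaces mapping accepted by find() and findall() instead of the {url} spelling. The names parse_weather, NAMESPACES and SAMPLE_FEED are hypothetical, and the sample document is made up from the values in the output shown earlier.

```python
from xml.etree.ElementTree import fromstring

WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'
# Prefix-to-URL mapping used only by our own path expressions.
NAMESPACES = {'yweather': WEATHER_NS}

# Hypothetical sample feed (an assumption, not real API output).
SAMPLE_FEED = """<rss version="2.0" xmlns:yweather="%s">
  <channel>
    <title>Yahoo! Weather - Lawrence, KS</title>
    <item>
      <yweather:condition text="Fair" temp="85"/>
      <yweather:forecast date="20 Jul 2006" low="75" high="103" text="Mostly Sunny"/>
    </item>
  </channel>
</rss>""" % WEATHER_NS


def parse_weather(xml_text):
    """Extract current conditions and forecasts from feed XML text."""
    rss = fromstring(xml_text)
    forecasts = [
        {
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text'),
        }
        for element in rss.findall('channel/item/yweather:forecast', NAMESPACES)
    ]
    ycondition = rss.find('channel/item/yweather:condition', NAMESPACES)
    if ycondition is None:
        # find() returns None when nothing matches; fail with a clear message.
        raise ValueError('no yweather:condition element in feed')
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title'),
    }


weather = parse_weather(SAMPLE_FEED)
print(weather['title'])
```

To fetch live data on Python 3, the XML text would come from urllib.request.urlopen(url).read(), the Python 3 counterpart of the urllib.urlopen() call used in the article.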
