Scraping Chinese City Information from a Travel Website with Python - 白红宇's Personal Blog

Published: 2021-05-10 03:34:01 | Category: Original article

Analysis

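The target is the China city list on the travel site.
The site paginates through a "next page" button, so the URL of each page very likely follows a pattern. Here are the links for the first two pages: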

https://place.qyer.com/china/citylist-0-0-1/
https://place.qyer.com/china/citylist-0-0-2/

There is indeed a pattern. Let's check a page further along as well:

https://place.qyer.com/china/citylist-0-0-169/

That settles it. The list has 171 pages, and the number in the link runs from 1 to 171, so a for loop can fetch the content of every page.
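
As a small illustration of the pattern above, the page URLs can be generated from a template. This is only a sketch: the names `BASE_URL`, `TOTAL_PAGES`, and `build_page_urls` are made up for the example, and the page count of 171 is the one observed at the time of writing, so it may differ today.

```python
# URL template taken from the links above; {} is the page number.
BASE_URL = "https://place.qyer.com/china/citylist-0-0-{}/"
TOTAL_PAGES = 171  # page count observed in this article (may change over time)


def build_page_urls(total_pages=TOTAL_PAGES):
    """Return the list-page URLs for pages 1..total_pages (inclusive)."""
    return [BASE_URL.format(page) for page in range(1, total_pages + 1)]


urls = build_page_urls()
print(urls[0])    # first page
print(urls[-1])   # last page
print(len(urls))  # number of pages to request
```
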
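The next step is working out how to extract the content of a single page. I am most comfortable with XPath, but BeautifulSoup works just as well if you prefer it.
Chrome's developer tools show clearly that each city corresponds to one li tag, so I first extract all the li tags. The result is a list in which every item is itself a Selector object, which means xpath can be called again on each li to extract the content inside that node.
After that, write the XPath expression for each field you want to extract. You can use "Copy full XPath" in Chrome's developer tools, or write the expressions yourself in the XPath Helper extension.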
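
The "select every li first, then query inside each li" idea can be tried offline on a small fragment. The sketch below deliberately swaps parsel for the standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset, so that it runs without extra installs. With parsel the same two steps are `selector.xpath(...)` followed by `li.xpath(...).get()`. The sample markup is invented for the demo and merely imitates the `plcCitylist` and `pois` class names used in the program; `parse_cities` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that imitates the structure described above.
SAMPLE = """
<div>
  <ul class="plcCitylist">
    <li>
      <h3><a href="/city-a/">City A  </a></h3>
      <p class="pois"><a> Spot 1 </a><a> Spot 2 </a></p>
    </li>
    <li>
      <h3><a href="/city-b/">City B </a></h3>
      <p class="pois"><a> Spot 3 </a></p>
    </li>
  </ul>
</div>
"""


def parse_cities(markup):
    """Return (city_name, joined_spots) tuples, one per li element."""
    root = ET.fromstring(markup)
    rows = []
    # Step 1: select every li under the city list.
    for li in root.findall(".//ul[@class='plcCitylist']/li"):
        # Step 2: run relative queries inside each li node.
        name = li.find("./h3/a").text.strip()
        spots = [a.text.strip() for a in li.findall("./p[@class='pois']/a")]
        rows.append((name, "、".join(spots)))
    return rows


rows = parse_cities(SAMPLE)
for row in rows:
    print(row)
```

Real pages are rarely well-formed XML, which is one reason to use an HTML-aware parser such as parsel for the actual scraping.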

Writing the Code

Here is the complete code for the program:

import csv  # write the CSV file

import parsel  # parse HTML
import requests  # send HTTP requests
from fake_useragent import UserAgent  # generate a User-Agent request header


def getdata(url):
    headers = {
        "User-Agent": UserAgent().chrome
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = response.apparent_encoding
    selector = parsel.Selector(response.text)
    # extract all li tags, one per city
    lis = selector.xpath('//ul[@class="plcCitylist"]/li')
    for li in lis:
        # get() returns None when nothing matches, so fall back to ''
        city_names = li.xpath('./h3/a/text()').get() or ''
        city_names = city_names.rstrip()
        number_people = li.xpath('./p[2]/text()').get()
        place_hot = li.xpath('./p[@class="pois"]/a/text()').getall()
        place_hot = [place.strip() for place in place_hot]
        place_hot = '、'.join(place_hot)
        place_url = li.xpath('./p[@class="pics"]/a/@href').get()
        img_url = li.xpath('./p[@class="pics"]/a/img/@src').get()
        print(city_names, number_people, place_url, img_url, place_hot, sep='|')
        # append one row per city to the CSV file
        with open('qiongyouDate.csv', mode='a', encoding='utf-8', newline='') as file_object:
            csv_write = csv.writer(file_object)
            csv_write.writerow([city_names, number_people, place_url, img_url, place_hot])


def main():
    # pages are numbered 1 through 171
    for i in range(1, 172):
        url = "https://place.qyer.com/china/citylist-0-0-{}/".format(i)
        getdata(url)


if __name__ == '__main__':
    main()
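
The program reopens the CSV file in append mode for every city and writes no header row. One possible variation, shown here only as a sketch and not as part of the original program, is to collect the rows first and write them in one pass with a header. The column names in `FIELDS`, the helper `write_rows`, and the demo row are all assumptions made for this example.

```python
import csv
import os
import tempfile

# Hypothetical column names matching the five values written per city.
FIELDS = ["city_name", "number_people", "place_url", "img_url", "place_hot"]


def write_rows(path, rows):
    """Write a header row followed by all data rows, opening the file once."""
    with open(path, mode="w", encoding="utf-8", newline="") as file_object:
        writer = csv.writer(file_object)
        writer.writerow(FIELDS)
        writer.writerows(rows)


# Made-up demo row; real rows would come from the scraping loop.
sample_rows = [
    ["City A", "100", "//example.com/a/", "//example.com/a.jpg", "Spot 1、Spot 2"],
]

out_path = os.path.join(tempfile.mkdtemp(), "cities_demo.csv")
write_rows(out_path, sample_rows)

# Read the file back to confirm what was written.
with open(out_path, encoding="utf-8", newline="") as file_object:
    read_back = list(csv.reader(file_object))
print(read_back)
```

Because this version uses mode "w", running it twice replaces the file instead of appending duplicate rows, which is the main behavioural difference from the append-mode approach.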