Scraping the NetEase Tech Rolling News Feed
Published: 2021-06-29 18:16:16  Category: Technical articles


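Background and requirements

I wrote this scraper to complete a homework assignment and to practice web scraping at the same time. It uses XPath expressions to match the content that needs to be scraped.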
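
To make the XPath matching concrete, here is a minimal sketch that runs without network access. It is not the scraper itself. The markup in SAMPLE_LIST_PAGE is a hypothetical, simplified stand-in for the list page, and the helper name extract_detail_links is made up for illustration. The sketch uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the href is read with .get("href") instead of the /@href step that lxml allows.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the list page (assumption:
# the real page is much larger and is not guaranteed to be well-formed XML).
SAMPLE_LIST_PAGE = """
<html><body>
  <ul id="news-flow-content">
    <li><div class="titleBar clearfix"><h3><a href="http://tech.163.com/a.html">First headline</a></h3></div></li>
    <li><div class="titleBar clearfix"><h3><a href="http://tech.163.com/b.html">Second headline</a></h3></div></li>
  </ul>
  <ul id="other">
    <li><div class="titleBar clearfix"><h3><a href="http://example.com/ignored.html">Ignored</a></h3></div></li>
  </ul>
</body></html>
"""


def extract_detail_links(markup):
    """Return the href of every headline link inside the news list."""
    root = ET.fromstring(markup.strip())
    # Only links under the <ul> with id "news-flow-content" are matched;
    # the second <ul> has the same inner structure but a different id.
    anchors = root.findall(
        ".//ul[@id='news-flow-content']//li"
        "//div[@class='titleBar clearfix']//h3//a"
    )
    return [anchor.get("href") for anchor in anchors]


links = extract_detail_links(SAMPLE_LIST_PAGE)
print(links)
```

The id predicate on the outer ul is what keeps unrelated links out of the result, which is why the expression starts from the list container instead of matching every h3 link on the page.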

News page to scrape

Information to scrape

Implementation

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/3/13 13:08
# @Author  : cunyu
# @Site    : cunyu1943.github.io
# @File    : NetaseNewsSpider.py
# @Software: PyCharm

import requests
from lxml import etree
import xlwt

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
}


# Get the list of news detail page URLs from one list page
def getNewsDetailUrlList(url):
    """
    :param url: URL of one list page
    :return newsDetailList: news detail URLs found on that page
    """
    response = requests.get(url, headers=headers)
    html = response.content.decode('gbk')
    selector = etree.HTML(html)
    newsDetailList = selector.xpath('//ul[@id="news-flow-content"]//li//div[@class="titleBar clearfix"]//h3//a/@href')
    return newsDetailList


# Get the news title
def getNewsTitle(detailUrl):
    """
    :param detailUrl: news detail URL
    :return newsTitle: news title (list of text nodes)
    """
    response = requests.get(detailUrl, headers=headers)
    html = response.content.decode('gbk')
    selector = etree.HTML(html)
    newsTitle = selector.xpath('//div[@class="post_content_main"]//h1/text()')
    return newsTitle


# Get the news body text
def getNewsContent(detailUrl):
    """
    :param detailUrl: news detail URL
    :return newsContent: news body (list of text nodes)
    """
    response = requests.get(detailUrl, headers=headers)
    html = response.content.decode('gbk')
    selector = etree.HTML(html)
    newsContent = selector.xpath('//div[@class="post_text"]//p/text()')
    return newsContent


# TODO: write the news titles and content to a file


# Build the list of paginated URLs
def getUrlList(baseUrl, num):
    """
    :param baseUrl: base URL
    :param num: last page number to include
    :return urlList: list of paginated URLs
    """
    urlList = []
    urlList.append(baseUrl)
    for i in range(2, num + 1):
        # Pages after the first are suffixed with _02, _03, ...
        urlList.append(baseUrl + "_" + str(i).zfill(2))
    return urlList


if __name__ == '__main__':
    baseUrl = "http://tech.163.com/special/gd2016"
    num = int(input('Enter the number of pages to scrape: '))
    urlList = getUrlList(baseUrl, num)
    print(urlList)

    detailUrl = []
    for url in urlList:
        for i in getNewsDetailUrlList(url):
            detailUrl.append(i)
    print(detailUrl)
    print(getNewsTitle(detailUrl[0]))
    print(getNewsContent(detailUrl[0]))

    # # Save the scraped text to text files
    # with open('news.txt', 'w', encoding='utf-8') as f, open('newsTitle.txt', 'w', encoding='utf-8') as titleFile,\
    #         open('newsContent.txt', 'w', encoding='utf-8') as contentFile:
    #     print('Scraping...')
    #     for i in detailUrl:
    #         f.write(''.join(getNewsTitle(i)))
    #         f.write('\n')
    #         f.write(''.join(getNewsContent(i)))
    #         f.write('\n')
    #
    #         titleFile.write(''.join(getNewsTitle(i)))
    #         titleFile.write('\n')
    #
    #         contentFile.write(''.join(getNewsContent(i)))
    #         contentFile.write('\n')
    #
    #     print('Files written successfully')

    # Save the scraped text to an Excel file
    # Create an Excel workbook
    workbook = xlwt.Workbook(encoding='utf-8')
    news_sheet = workbook.add_sheet('news')
    news_sheet.write(0, 0, 'Title')
    news_sheet.write(0, 1, 'Content')

    print('Scraping...')
    for i in range(len(detailUrl)):
        # print(detailUrl[i])
        # xpath() returns a list of text nodes; join them into one string,
        # because xlwt cannot write a list into a cell
        news_sheet.write(i + 1, 0, ''.join(getNewsTitle(detailUrl[i])))
        news_sheet.write(i + 1, 1, ''.join(getNewsContent(detailUrl[i])))

    # Save the written data to the target Excel file
    workbook.save('网易新闻.xls')
    print('File written successfully')
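
One detail of the Excel step is worth spelling out. The XPath text() queries return lists of text fragments, while a spreadsheet cell holds a single value, and Excel caps a cell at 32767 characters. The sketch below shows one possible way to prepare a cell value before writing it. The helper name to_cell_text and the choice to drop blank fragments and join with newlines are assumptions made for illustration, not part of the original script.

```python
EXCEL_CELL_LIMIT = 32767  # maximum number of characters Excel allows in one cell


def to_cell_text(fragments, limit=EXCEL_CELL_LIMIT):
    """Join XPath text fragments into one string that fits in a single cell."""
    # Strip surrounding whitespace and skip fragments that are empty afterwards.
    cleaned = [piece.strip() for piece in fragments]
    text = "\n".join(piece for piece in cleaned if piece)
    # Truncate overly long articles so they stay within the cell limit.
    return text[:limit]


title = to_cell_text(["  Sample headline \n"])
body = to_cell_text(["First paragraph.", "   ", "Second paragraph."])
print(title)
print(body)
```

With a helper like this, the two write calls in the loop could pass to_cell_text(getNewsTitle(url)) and to_cell_text(getNewsContent(url)) as the cell values.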

Results

  • Output of running the code

  • Saved file

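Summary

Overall this is a fairly simple exercise. The code still has room for improvement, and I will update it later. If you have other ideas, feel free to share them.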

Reposted from: https://cunyu1943.blog.csdn.net/article/details/88534928
