Python + Selenium 爬取网易云课堂课时标题及时长

转载请注明出处：

文章目录

软件安装

selenium
pip install selenium

geckodriver

目标页面

一开始用常规方法请求下来，发现源码中根本找不到任何课时信息，说明该网页用 JavaScript 来动态加载内容。

使用开发者工具分析一下，发现浏览器请求了如下的地址获取课时详情信息：

在预览界面可以看到各课时信息的 Unicode 编码。

尝试直接请求上述地址，显然会报错，不想去研究请求头具体应该传哪些参数了，直接上 Selenium，反正就爬一个页面，对性能没什么要求。

代码

说明

study163seleniumff.py 是主运行文件

helper.py 是辅助模块，与主运行文件同目录

geckodriver.exe 需要放在 ../drivers/ 这个相对路径下

from selenium.webdriver import Firefoxfrom selenium.webdriver.firefox.options import Optionsfrom lxml import etreeimport csvfrom helper import Chapter, Lesson# 请求数据url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'options = Options()options.add_argument('-headless')  # 无头参数driver = Firefox(    executable_path='../drivers/geckodriver',    firefox_options=options)driver.get(url)text = driver.page_sourcedriver.quit()# 解析数据html = etree.HTML(text)chapters = html.xpath('//div[@class="chapter"]')TABLEHEAD = ['章节号', '章节名', '课时号', '课时名', '课时长']rows = []for each in chapters:    chapter = Chapter(each)    lessons = chapter.get_lessons()    for each in lessons:        lesson = Lesson(each)        chapter_info = chapter.chapter_info        lesson_info = lesson.lesson_info        values = (*chapter_info, *lesson_info)        row = dict(zip(TABLEHEAD, values))        rows.append(row)# 存储数据with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:    writer = csv.DictWriter(f, TABLEHEAD)    writer.writeheader()    writer.writerows(rows)

class Chapter:    def __init__(self, chapter):        self.chapter = chapter        self._chapter_info = None    def parse_all(self):        # 章节号        chapter_num = self.chapter.xpath(            './/span[contains(@class, "chaptertitle")]/text()')[0]        # 去掉章节号最后的冒号        chapter_num = chapter_num[:-1]        # 章节名        chapter_name = self.chapter.xpath(            './/span[contains(@class, "chaptername")]/text()')[0]        return chapter_num, chapter_name    @property    def chapter_info(self):        self._chapter_info = self.parse_all()        return self._chapter_info        def get_lessons(self):        return self.chapter.xpath(            './/div[@data-lesson]')class Lesson:    def __init__(self, lesson):        self.lesson = lesson        self._lesson_info = None    @property    def lesson_info(self):        # 课时号        lesson_num = self.lesson.xpath(            './/span[contains(@class, "ks")]/text()')[0]        # 课时名        lesson_name = self.lesson.xpath(            './/span[@title]/@title')[0]        # 课时长        lesson_len = self.lesson.xpath(            './/span[contains(@class, "kstime")]/text()')[0]        self._lesson_info = lesson_num, lesson_name, lesson_len        return self._lesson_info

最终结果

最终结果保存为 courseinfo.csv，与主运行文件同路径。

完成于 2018.11.16

上一篇：用 Python 切换输入法

下一篇：使用 Python 第三方库 moviepy 剪辑视频

发表评论

关于作者

喝酒易醉，品茶养心，人生如梦，品茶悟道，何以解忧？唯有杜康！

-- 愿君每日到此一游！

Python + Selenium 爬取网易云课堂课时标题及时长

文章目录

软件安装

目标页面

代码

说明

最终结果

发表评论

最新留言

关于作者

推荐文章