Using a Python crawler to download your own CSDN articles and save them as Markdown files - 白红宇's personal blog

Published: 2021-05-28 17:13:02. Category: Featured articles


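Introduction

Crawlers play an important role in collecting data from the web, but they must be used with care. Do not crawl a site without explicit permission; doing so may lead to legal disputes.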

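How to tell whether a site allows crawling

To find out whether a site allows crawling, append /robots.txt to its domain and read the file that comes back. For example: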

https://blog.csdn.net/robots.txt
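As a small illustration of the rule above, the robots.txt address can be derived from any page URL on the same site. This is only a sketch using the standard library; the function name robots_url and the sample page URL are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host, then point the path at /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # "someuser" is a placeholder, not a real account.
    print(robots_url("https://blog.csdn.net/someuser/article/details/1"))
```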

Paths that must not be crawled

Analysis of that file shows that the following paths are generally off limits:

/images/

/content/

/ui/

/js/

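Any URL that contains a question mark (?)

Apart from the paths above, most other addresses are generally allowed, so fetching only article content is usually not a problem.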
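The rules listed above can be checked in code before any request is sent. The sketch below is one possible approach, not the method used in this article: it feeds a hand-written robots.txt body (an assumption built from the directories listed here; the real file may differ) to the standard-library urllib.robotparser. That parser matches plain path prefixes, so the question-mark rule is checked separately. The names SAMPLE_ROBOTS, build_parser and is_allowed are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body, reconstructed from the directories listed above.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if url passes both the prefix rules and the '?' rule."""
    # URLs with a query string are treated as disallowed, per the list above.
    if urlsplit(url).query:
        return False
    return parser.can_fetch(user_agent, url)
```

A crawler would call is_allowed for every candidate URL and skip any that return False.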

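Seeing is believing: the result first

Article title: Time synchronization on Linux with chrony (Linux通过chrony进行时间同步)

The content retrieved by the crawler is:

Technical content: a detailed walkthrough of installing, configuring, and using the chrony time synchronization tool.

Worked example: code samples showing how to configure chrony on a server and synchronize its clock.

Keeping the code readable

Note: the code below has been cleaned up so that it is easier to read and run: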

import requests
from os import path
import re
from requests_html import HTML
import json
from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, user, headers):
        self.user = user
        self.url = f"https://blog.csdn.net/{user}?type=blog"
        self.headers = headers

    def write_md(self, file, txt):
        # Write the article text to a Markdown file and report the outcome.
        try:
            with open(file, 'w', encoding='utf-8') as f:
                f.write(txt)
            return f"{file} write: succeeded"
        except OSError:
            return f"{file} write: failed"

    def get_status(self):
        # Check that the blog home page can be reached.
        response = requests.get(url=self.url, headers=self.headers)
        if response.status_code == 200:
            print("Request succeeded")
            return True
        else:
            print("Request failed")
            return False

    def get_articles(self):
        # Collect every <h4> (article titles) and <a> (links) on the page.
        response = requests.get(url=self.url, headers=self.headers)
        response.encoding = 'utf-8'
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        articles = soup.find_all('h4')
        article_urls = soup.find_all('a')
        return articles, article_urls

    def extract_article(self, title, url):
        response = requests.get(url=url, headers=self.headers)
        if response.status_code != 200:
            print(f"Request failed: {title}")
            return False, ""

        response.encoding = response.apparent_encoding
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        content_div = soup.find('div', "markdown_views prism-tomorrow-night")
        if content_div is None:
            # No article body found; report failure instead of raising.
            return False, ""
        content = ""
        for i in content_div:
            temp = str(i)
            # Remove svg-related content
            temp = re.sub(r'.*svg.*', '', temp)
            temp = re.sub(r'
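The listing breaks off partway through extract_article. As a rough, standard-library-only illustration of two steps it contains, dropping lines that mention svg and writing text to a .md file, a minimal sketch might look like the following. The helpers safe_filename and save_markdown are hypothetical and are not part of the original class; only the svg regular expression is taken from the listing.

```python
import re
from pathlib import Path


def strip_svg_lines(fragment: str) -> str:
    """Blank out every line that mentions 'svg', using the regex from the listing."""
    return re.sub(r'.*svg.*', '', fragment)


def safe_filename(title: str) -> str:
    """Hypothetical helper: replace characters that are invalid in file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() + ".md"


def save_markdown(directory: Path, title: str, body: str) -> Path:
    """Hypothetical helper: write body to <directory>/<sanitized title>.md."""
    target = directory / safe_filename(title)
    target.write_text(body, encoding="utf-8")
    return target
```

Sanitizing the title first matters because article titles often contain characters such as / or ? that cannot appear in a file name on every platform.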

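Summary

With the approach above you can check whether a site permits crawling and collect its content responsibly. Remember that crawling must comply with the applicable rules, so that it does not lead to legal disputes.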
