Beautiful Soup for Python Web Scraping: An In-Depth Usage Guide

Published: 2021-05-07 13:03:49 | Category: Featured Articles


Beautiful Soup: Getting Started and Practical Tips

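Beautiful Soup is a Python library for parsing web pages, widely used in web crawling and data-scraping tasks. This article walks through its core features and some practical techniques.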

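I. Introduction to Beautiful Soup

Beautiful Soup (BS for short) is a parsing library for extracting structured data from HTML and XML documents. Its flexible query methods and rich API have made it a staple of Python web scraping.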

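II. Installation and Environment Setup

Beautiful Soup is installed with pip. Install the current Beautiful Soup 4 release, which is published on PyPI as beautifulsoup4 and imported as bs4:

```bash
pip install beautifulsoup4
```

Python's built-in html.parser works without any extra packages. To use the faster lxml parser or the more lenient html5lib parser, install that library as well. If installation fails, try an offline package or a source install.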
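To check which parsers are usable in a given environment, one option is to try each in turn. The sketch below is only an illustration: the helper name available_parsers is hypothetical, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the parser names from `candidates` that can be loaded here."""
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
        except FeatureNotFound:
            continue  # the library behind this parser is not installed
        found.append(name)
    return found


print(available_parsers())
```

html.parser ships with Python, so it should always appear in the result.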

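III. Quick Start with Beautiful Soup

The first step is to create a BeautifulSoup object. A simple example:

```python
from bs4 import BeautifulSoup

# Sample markup reused by the examples that follow
html = '<p class="content">Hello World</p>'
soup = BeautifulSoup(html, 'html.parser')
```

Through the soup object, we can access and manipulate every node in the document.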

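IV. Core Objects in Beautiful Soup

Beautiful Soup converts an HTML document into a tree in which every node is one of four types:

Tag: an HTML tag

NavigableString: a navigable string (such as a text node)

BeautifulSoup: the document as a whole

Comment: an HTML comment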

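1. The Tag Object

The Tag object is the central object in Beautiful Soup and represents an HTML tag. For example:

```python
tag = soup.find('p')
print(tag.name)   # Output: p
print(tag.attrs)  # Output: {'class': ['content']}
```

tag.name returns the tag name and tag.attrs returns the attributes as a dictionary. A single attribute value can be read with the get method, for example tag.get('class'). Because class is a multi-valued attribute, its value is a list.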
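The difference between the ways of reading an attribute is easiest to see side by side. The following is a minimal sketch; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
soup = BeautifulSoup(
    '<a class="btn primary" href="/home" id="nav">Home</a>', 'html.parser'
)
link = soup.a

href = link['href']          # dictionary-style access; raises KeyError if absent
title = link.get('title')    # get() returns None when the attribute is absent
classes = link.get('class')  # class is multi-valued, so a list comes back

print(href, title, classes)
```

Prefer get() when an attribute may be missing, since indexing would raise an exception.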

2. The NavigableString Object

A NavigableString represents the text inside a tag. When a tag has a single text child, the .string attribute returns it:

```python
string_text = tag.string
print(string_text)  # Output: Hello World
```
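One pitfall worth knowing: .string returns None as soon as a tag has more than one child. A small sketch with made-up markup shows this, along with get_text() as a fallback that joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> holds a text node and a <b> tag
soup = BeautifulSoup('<p>Hello <b>World</b></p>', 'html.parser')

single = soup.b.string      # <b> has exactly one child, so its text is returned
mixed = soup.p.string       # <p> has two children, so .string is None
joined = soup.p.get_text()  # get_text() concatenates every nested string

print(single, mixed, joined)
```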

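3. The BeautifulSoup Object

The BeautifulSoup object represents the document as a whole and behaves like a special Tag object. It does not correspond to a real tag, so soup.name returns the special value [document]:

```python
print(soup.name)  # Output: [document]
```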

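4. The Comment Object

The Comment object holds an HTML comment. It is a special kind of NavigableString, and Beautiful Soup strips the comment delimiters from its content:

```python
# The sample markup has no comment, so parse a small snippet that contains one
comment_soup = BeautifulSoup('<p><!--comment text--></p>', 'html.parser')
comment = comment_soup.p.string
print(comment)        # Output: comment text
print(type(comment))  # Output: <class 'bs4.element.Comment'>
```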
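When scraping, we often want every comment in a page, for example to drop them before extracting text. A possible approach, sketched here on made-up markup, is to filter strings by type with find_all:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup containing two comments
html = '<div><!-- header --><p>Body</p><!-- footer --></div>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only the string nodes that are Comment instances
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
cleaned = [c.strip() for c in comments]

print(cleaned)
```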

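V. Traversing and Working with the Document Tree

Beautiful Soup makes it easy to traverse the document tree and extract the data we need. Some common operations follow.

1. Getting Child Nodes

The .contents attribute returns a tag's direct children as a list:

```python
children = tag.contents
for child in children:
    print(child)  # Output: Hello World
```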
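With nested markup, it becomes clearer that only direct children are returned. The sketch below uses a made-up list and also shows .children, which iterates over the same nodes without building a list.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup
html = '<ul><li>one</li><li>two <b>bold</b></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

direct = ul.contents                         # list holding the two <li> tags
child_names = [c.name for c in ul.children]  # .children iterates the same nodes

print(len(direct), child_names)
```

The nested b tag does not appear, because it is a grandchild of ul.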

2. Getting All Descendants

The .descendants attribute recursively yields every node beneath the current tag:

```python
descendants = tag.descendants
for descendant in descendants:
    print(descendant)  # Output: Hello World
```
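Because .descendants yields text nodes as well as tags, it is common to filter by type. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical nested markup
html = '<ul><li>one</li><li>two <b>bold</b></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

all_nodes = list(soup.ul.descendants)  # tags and strings, in document order
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(all_nodes), tag_names)
```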

3. Getting Text Content

If a tag contains several text nodes, the .strings attribute yields all of them:

```python
strings = tag.strings
for string in strings:
    print(string)  # Output: Hello World
```
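Real pages are usually indented, so .strings also yields whitespace-only nodes. The related .stripped_strings generator skips those and trims the rest, as this sketch with made-up markup shows:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation between the tags
html = '<div>\n  <h1> Title </h1>\n  <p>First</p>\n</div>'
soup = BeautifulSoup(html, 'html.parser')

raw = list(soup.div.strings)             # includes whitespace-only strings
clean = list(soup.div.stripped_strings)  # whitespace removed, blanks skipped

print(clean)
```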

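VI. Searching the Document Tree

Beautiful Soup has powerful search support: the find_all method returns every node that matches the given filters.

1. Searching by Tag Name

Pass a tag name to select the matching nodes:

```python
p_tags = soup.find_all('p')
for p_tag in p_tags:
    print(p_tag)  # Output: <p class="content">Hello World</p>
```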
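When only the first match or the first few matches are needed, find and the limit argument avoid scanning the whole document. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three paragraphs
soup = BeautifulSoup('<p>a</p><p>b</p><p>c</p>', 'html.parser')

first = soup.find('p')                   # first match, or None if nothing matches
first_two = soup.find_all('p', limit=2)  # stop after two matches
missing = soup.find('table')             # no <table> here, so this is None

print(first.string, [t.string for t in first_two], missing)
```

Note that find returns None for no match while find_all returns an empty list, so the two need different checks.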

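2. Filtering by Attribute

Nodes can also be filtered by attribute. For example, to find p tags whose class is content:

```python
# class is a reserved word in Python, so the keyword argument is class_
has_class = soup.find_all('p', class_='content')
for has_class_tag in has_class:
    print(has_class_tag)  # Output: <p class="content">Hello World</p>
```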
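Other attributes work the same way. The sketch below, on made-up markup, shows a few equivalent forms, including passing True to match any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links
html = (
    '<a id="home" href="/">Home</a>'
    '<a class="ext" href="https://example.com">Ext</a>'
    '<a>No link</a>'
)
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='home')                       # keyword argument
by_class = soup.find_all('a', class_='ext')            # class_ for CSS class
by_attrs = soup.find_all('a', attrs={'class': 'ext'})  # same filter as a dict
with_href = soup.find_all('a', href=True)              # attribute merely present

print(len(by_id), len(by_class), len(with_href))
```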

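3. Searching with Regular Expressions

Passing a compiled regular expression as the string argument matches against a tag's text:

```python
import re

# A regex passed as the second positional argument would match the CSS class
# instead. The sample markup has no such text, so parse a snippet that does.
math_soup = BeautifulSoup('<p>math content</p>', 'html.parser')
math_tags = math_soup.find_all('p', string=re.compile(r'^math'))
for math_tag in math_tags:
    print(math_tag.string)  # Output: math content
```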
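Regular expressions can also be applied to tag names and attribute values. A minimal sketch on made-up markup, collecting all heading tags and all links that end in .pdf:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup
html = (
    '<h1>Top</h1><h2>Sub</h2><p>Text</p>'
    '<a href="/docs/a.pdf">A</a><a href="/docs/b.html">B</a>'
)
soup = BeautifulSoup(html, 'html.parser')

headings = soup.find_all(re.compile(r'^h[1-6]$'))            # match on tag name
pdf_links = soup.find_all('a', href=re.compile(r'\.pdf$'))   # match on attribute

print([h.name for h in headings], [a['href'] for a in pdf_links])
```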

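4. Combining Filters

Several filters can be combined for more complex searches, such as a list of tag names together with an attribute dictionary:

```python
# Match <p> or <span> tags whose class is content
specific_tags = soup.find_all(['p', 'span'], {'class': 'content'})
for specific_tag in specific_tags:
    print(specific_tag)  # Output: <p class="content">Hello World</p>
```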
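When the condition cannot be expressed with names and attribute values alone, find_all also accepts a function that takes a tag and returns True or False. A sketch on made-up markup; the helper name has_class_but_no_id is only illustrative.

```python
from bs4 import BeautifulSoup


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr('class') and not tag.has_attr('id')


# Hypothetical markup
html = '<p class="a" id="x">one</p><p class="b">two</p><span>three</span>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.find_all(has_class_but_no_id)
print([m.string for m in matches])
```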

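VII. CSS Selectors

Beautiful Soup supports CSS selector syntax. The select method returns a list of all matching nodes:

```python
# classname, idname and tagname are placeholders; the sample markup has no match
selected_elements = soup.select('p.classname #idname tagname')
for selected in selected_elements:
    print(selected)
```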
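A concrete document makes the selector easier to follow. The sketch below uses made-up markup and also shows select_one, which returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup
html = (
    '<div id="main"><p class="intro">Hi</p><p>Other</p></div>'
    '<p class="intro">Outside</p>'
)
soup = BeautifulSoup(html, 'html.parser')

intro_in_main = soup.select('#main p.intro')  # p.intro nested inside #main
first_p = soup.select_one('p')                # first match only
none_found = soup.select_one('table')         # None when nothing matches

print([t.get_text() for t in intro_in_main], first_p.get_text(), none_found)
```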

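1. Combining Selectors

Selectors can be chained with combinators: a space selects descendants, and ~ selects later siblings:

```python
# The class and id names are placeholders
mixed_selector = 'p.classname #idname ~ span.classname2'
selected_mixed = soup.select(mixed_selector)
for selected in selected_mixed:
    print(selected)
```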
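To compare the combinators directly, the following sketch runs four selectors against the same made-up markup: descendant (space), child (>), later sibling (~) and adjacent sibling (+).

```python
from bs4 import BeautifulSoup

# Hypothetical markup
html = (
    '<div id="box"><h2>Title</h2><p>one</p><span>x</span><p>two</p>'
    '<section><p>deep</p></section></div>'
)
soup = BeautifulSoup(html, 'html.parser')


def texts(selector):
    """Return the text of every node matched by a CSS selector."""
    return [node.get_text() for node in soup.select(selector)]


descendants = texts('#box p')  # any <p> anywhere inside #box
children = texts('#box > p')   # only <p> tags directly under #box
siblings = texts('h2 ~ p')     # every <p> sibling that follows the <h2>
adjacent = texts('h2 + p')     # only the <p> immediately after the <h2>

print(descendants, children, siblings, adjacent)
```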

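Conclusion

Beautiful Soup gives Python scrapers a powerful and flexible toolkit that suits parsing tasks from simple to complex. Used well, methods such as find_all and select extract the data you need efficiently. This article is only an introduction to Beautiful Soup; further techniques and case studies are worth exploring.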
