Web Scraping with R: How to Fill Missing Values - 白红宇's personal blog


Published: 2021-05-09 09:08:21 | Category: Blog posts


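This post walks through scraping movie data from IMDb with R, and shows how to handle values that are missing on the page.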

##### >> Preparation

We use the rvest package by Hadley Wickham. Install rvest and load it first.

```r
# Install the package (only needed once)
install.packages('rvest')
# Load the library
library('rvest')
```

##### >> Downloading and parsing HTML file

Use read_html() to download the page and parse it into an XML document.

```r
url <- 'https://www.imdb.com/search/title?count=100&release_date=2018-01-01,2018-12-31&view=advanced'
webpage <- read_html(url)
```
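To see what read_html() returns without touching the network, a minimal sketch can parse a literal HTML string instead of a URL. The markup below is a made-up stand-in, not IMDb's actual page.

```r
library(rvest)

# Hypothetical markup standing in for a downloaded page
html <- '<html><head><title>Demo</title></head><body><p class="msg">hello</p></body></html>'
page <- read_html(html)

# read_html() returns an xml_document that the html_* helpers can query
is_doc <- inherits(page, 'xml_document')
msg <- html_text(html_node(page, '.msg'))

print(is_doc)
print(msg)
```

Parsing a small string like this is also a convenient way to try out a selector before running it against the live page.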

##### >> Extracting Nodes

We want to extract the following fields for each movie:

Rank

Title

Runtime

Genre

Rating

Metascore

Description

Votes

Gross

Use html_nodes() to select nodes from the XML document with a CSS selector.

```r
# Use a CSS selector to extract the nodes
rank_data_html <- html_nodes(webpage, '.text-primary')
# Convert the nodes to text
rank_data <- html_text(rank_data_html)
# Convert the text values to numeric values
rank_data <- as.numeric(rank_data)
```
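Text scraped from a page rarely converts to numbers directly. As a sketch of the cleanup that usually precedes as.numeric(), the example below uses invented sample strings in the shape such fields often take (a trailing dot on the rank, a unit on the runtime, thousands separators on the votes). The real page text may be formatted differently.

```r
# Invented sample strings; real scraped values may be formatted differently
rank_raw    <- c('1.', '2.', '3.')
runtime_raw <- c('120 min', '95 min', '143 min')
votes_raw   <- c('12,345', '678', '1,234,567')

# as.numeric() accepts a trailing decimal point as-is
rank <- as.numeric(rank_raw)
# Strip the unit and the thousands separators before converting
runtime <- as.numeric(gsub(' min', '', runtime_raw, fixed = TRUE))
votes   <- as.numeric(gsub(',', '', votes_raw, fixed = TRUE))

print(rank)
print(runtime)
print(votes)
```

Anything that as.numeric() cannot parse becomes NA with a warning, so it is worth checking the result with anyNA() after each conversion.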

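This step needs some familiarity with HTML/CSS. If you have none, learn the basics of HTML/CSS first.

To find the right selector, open the page in Chrome and press F12 to bring up the developer tools. Locate the element you want, note its CSS selector, and use that selector in the script.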

##### >> Handling Missing Values

Start by extracting the Metascore in the same way:

```r
metascore_data_html <- html_nodes(webpage, '.metascore')
metascore_data <- html_text(metascore_data_html)
length(metascore_data)
```

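The result has fewer elements than there are movies on the page, because some movies have no Metascore. The missing entries need to be filled with NA so that the vector stays aligned with the other fields.

The key is the difference between html_node and html_nodes, which ?html_node describes as follows: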

html_node is like [[ it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, the length of html_nodes might be longer or shorter.

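So we first select the parent DOM node of every movie, then look for the Metascore DOM node inside each parent. A parent DOM node without a Metascore yields a missing node, which becomes NA.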

```r
metascore_data_html <- html_node(html_nodes(webpage, '.lister-item-content'), '.metascore')
metascore_data <- html_text(metascore_data_html)
length(metascore_data)
```
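The difference between the two calls is easiest to see on a small page. The sketch below uses made-up markup with three items, one of which has no score; only the class names are borrowed from the selectors above.

```r
library(rvest)

# Made-up markup: three items, the second has no .metascore element
html <- '<html><body>
  <div class="lister-item-content"><h3>A</h3><span class="metascore">81</span></div>
  <div class="lister-item-content"><h3>B</h3></div>
  <div class="lister-item-content"><h3>C</h3><span class="metascore">64</span></div>
</body></html>'
page <- read_html(html)

# html_nodes() on the whole page silently skips the item without a score
flat <- html_text(html_nodes(page, '.metascore'))

# html_node() on each parent returns one result per parent, NA when absent
items <- html_nodes(page, '.lister-item-content')
aligned <- as.numeric(html_text(html_node(items, '.metascore')))

print(length(flat))
print(aligned)
```

The flat vector loses the information about which item had no score, while the per-parent version keeps one slot per item.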

##### >> Making a Data Frame

Combine the extracted vectors into a data frame.

```r
movies <- data.frame(
  rank = rank_data,
  title = title_data,
  description = description_data,
  runtime = runtime_data,
  genre = genre_data,
  rating = rating_data,
  metascore = metascore_data,
  votes = votes_data,
  gross = gross_data
)
```
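data.frame() needs columns of compatible length, which is why the missing scores have to be kept as NA rather than dropped. Once the data frame exists, the gaps can be inspected with base R. The values below are invented for illustration.

```r
# Invented values; metascore has a gap where the page had no score
movies_demo <- data.frame(
  rank = c(1, 2, 3),
  title = c('A', 'B', 'C'),
  metascore = c(81, NA, 64),
  stringsAsFactors = FALSE
)

# Count the missing entries and list the affected titles
n_missing <- sum(is.na(movies_demo$metascore))
missing_titles <- movies_demo$title[is.na(movies_demo$metascore)]

# Summaries need na.rm = TRUE, otherwise the result is NA
mean_score <- mean(movies_demo$metascore, na.rm = TRUE)

print(n_missing)
print(missing_titles)
print(mean_score)
```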

##### >> Exporting CSV File

Finally, export the data frame to a CSV file.

```r
write.csv(movies, file = file.choose(new = TRUE), row.names = FALSE)
```
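file.choose() opens an interactive dialog, so it does not suit unattended runs. A non-interactive sketch, assuming a temporary file is an acceptable stand-in for the chosen path, also shows how the na argument controls what is written for missing values. The data is invented.

```r
# Invented data with one missing score
movies_demo <- data.frame(rank = c(1, 2), metascore = c(81, NA))

# tempfile() stands in for the path returned by file.choose()
path <- tempfile(fileext = '.csv')
# na = '' writes missing values as empty fields instead of the string NA
write.csv(movies_demo, file = path, row.names = FALSE, na = '')

# Empty fields in numeric columns are read back as NA
round_trip <- read.csv(path)
print(round_trip)
unlink(path)
```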

##### >> Notes

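Other useful rvest functions:

html_tag(): get the tag name of a DOM node

html_attr(): get a single attribute of a DOM node

html_attrs(): get all attributes of a DOM node

guess_encoding() and repair_encoding(): detect and repair text encoding problems

jump_to(), follow_link(), back(), forward(): navigate between pages within a browsing session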
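The node helpers listed above can be tried on a single made-up link. This sketch uses html_name() for the tag name because, as far as I know, it replaced html_tag() in later rvest releases; treat that as an assumption and check the version you have installed.

```r
library(rvest)

# Made-up markup for illustration
page <- read_html('<html><body><a id="home" href="/index.html" class="nav">Home</a></body></html>')
link <- html_node(page, 'a')

tag   <- html_name(link)          # tag name of the node
href  <- html_attr(link, 'href')  # a single attribute
attrs <- html_attrs(link)         # all attributes as a named character vector

print(tag)
print(href)
print(names(attrs))
```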

##### >> Sample Code
