1. 爬虫作用 用网络爬虫技术让重复性的手工流程实现自动化处理。2. 爬取准备 a. 检查robots.txt 在链接后加robots.txt查看是否有要求或限制 User-agent : 后表示禁止的用户代理 Crawl-delay : 后表示要求的爬取延迟 Sitemap : 后的链接提供网站地图文件 例:伯乐在线提供的网站地图 b. 估算网站大小 site: +网站链接或URL路径 (用goole吧) c. 识别网站所用技术 i. 在windows powershell 中输入pip查看是否已安装pip ii. 使用pip install builtwith安装 builtwith模块 iii. 使用该模块将URL作为参数,对该URL进行分析 >>> import builtwith >>> builtwith.parse('http://example.webscraping.com') {u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'], u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'], u'programming-languages': [u'Python'], u'web-servers': [u'Nginx'] } >>> builtwith.parse('http://jianshu.com') {u'javascript-frameworks': [u'Prototype', u'RequireJS'], u'web-frameworks': [u'Twitter Bootstrap', u'Ruby on Rails'],u'Twprogramming-languages': [u'Ruby'], u'web-servers': [u'Tengine']} >>> builtwith.parse('http://chinadaily.com.cn') {u'javascript-frameworks': [u'jQuery'], u'web-servers': [u'Nginx']} >>> builtwith.parse('http://oschina.net') {u'javascript-frameworks': [u'jQuery', u'Vue.js'], u'web-servers': [u'Tengine']} d. 寻找网站所有者 i. 安装WHOIS协议封装库 pip install python-whois ii. 使用 >>>import whois >>> print whois.whois('jianshu.com') { "updated_date": [ "2016-04-06 00:00:00", "2016-04-06 10:24:47" ], "status": [ "clientTransferProhibited https://icann.org/epp#clientTransferProhibited", "clientTransferProhibited" ], "name": "Shanghai Bai Ji Information Technology Inc. Ltd,", "dnssec": "Unsigned", "city": "Shanghai", "expiration_date": [ "2020-03-20 00:00:00", "2020-03-20 18:28:58" ], "zipcode": "200433", "domain_name": "JIANSHU.COM", "country": "CN", "whois_server": "whois.name.com", "state": "Shanghai", "registrar": "Name.com, Inc.", "referral_url": "http://www.name.com", "address": "Innospace 2, B1, Building #5, KIC, No.316 Songhu Road , Yangpu District", "name_servers": [ "F1G1NS1.DNSPOD.NET", "F1G1NS2.DNSPOD.NET", "f1g1ns1.dnspod.net", "f1g1ns2.dnspod.net" ], "org": "Shanghai Bai Ji Information Technology Inc. Ltd,", "creation_date": [ "2008-03-20 00:00:00", "2008-03-20 18:28:58" ], "emails": [ "contact@jianshu.com", "abuse@name.com" ] }