
Introduction to Python Web Scraping, by Deng Xudong (邓旭东)

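Contents
I. Introduction
II. Background knowledge
• How a crawler works • HTML • Python basics
III. Web requests
• Building URLs from patterns • The requests library
IV. Parsing web pages
• How to parse a page • BeautifulSoup • Using the re library
V. Starting the crawl
• Conditionals and loops • try...except exception handling • Data storage
VI. Dealing with anti-scraping measures
• Controlling request frequency • Masquerading as a browser • Using proxy IPs
VII. Advanced crawlers
• selenium + Firefox (version 36) • Capturing network packets to handle dynamic pages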

The for loop

>>> for x in ['1', '2', '3']:
...     print(x)
...
1
2
3
>>> url = '/u/1562c7f164'
>>> r = requests.get(url)
Requests disguised as a browser
>>> Headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
>>> r = requests.get(url, headers=Headers)
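To see what the User-Agent header changes, here is a minimal sketch that builds a request object and inspects it without sending anything. It uses urllib.request from the standard library rather than requests, so it runs with no extra installs and no network. The URL https://example.com/u/demo is a placeholder chosen for illustration, not the site from the slides.

```python
from urllib.request import Request

# Placeholder URL (an assumption for illustration only).
url = "https://example.com/u/demo"

# Same User-Agent string as in the slide above.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/56.0.2924.87 Safari/537.36"
    )
}

# Build the request object; nothing is sent over the network here.
req = Request(url, headers=headers)

# urllib stores header names in capitalized form, i.e. "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)
```

With requests, the same dictionary is passed as requests.get(url, headers=headers), as in the slide.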
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'<a class="sister" href="/elsie" id="link1"></a>'
'<a class="sister" href="/lacie" id="link2">Lacie</a> and'
'<a class="sister" href="/tillie" id="link3">Tillie</a>'
'and they lived at the bottom of a well....'
In a word: anything your browser can display, you can scrape and crawl.
Introduction: is web scraping easy to learn?
A simple truth
>>> from math import pow
>>> YouJoinUs = {'is': True}
>>> if YouJoinUs['is']:
...     result = pow(1.01, 365)
...     print(result)   # about 37.7834
...
How a crawler works
• Blue line: the crawler sends a request • Red line: the server returns a response
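The request/response cycle can be tried out without touching a real website. The sketch below starts a tiny throwaway server on the local machine and then plays the crawler's role against it. DemoHandler and PAGE are made-up names for this illustration, and http.client from the standard library stands in for requests.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><head><title>demo</title></head><body>hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a website: answers every GET with PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Port 0 asks the operating system for any free local port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Request: the crawler asks the server for a page.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")

# Response: the server answers with a status code and the HTML body.
response = conn.getresponse()
status = response.status
body = response.read().decode("utf-8")
conn.close()

server.shutdown()
server.server_close()

print(status)
print(body)
```

Everything a crawler does later (parsing, storing) operates on the body string obtained in the response step.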
HTML tags

Visiting the Python Chinese Community (Python中文社区) at https:///zimei returns the following HTML file:
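Tuple (tuple)
(1, 2, 3, 4)
('1', '2', '3', '4')
('a', 'b', 'c', 'd')
Set (set)

{'a', 'b', 'c'}
A set is a basic data type made up of unique (non-repeating) elements.
Dictionary (dict)

>>> Dict = {'name': '邓旭东', 'age': 26, 'gender': 'male'}
Inside the braces, data is stored in pairs: the key is to the left of the colon and the value is to the right.
>>> Dict['age']
26
>>> Dict['name']
'邓旭东'
List (list)

[1, 2, 3, 4, 5]
['1', '2', '3', '4', '5']
['a', 'b', 'c', 'd']

[(1, 2), (1, 2)]
…
List elements can be strings, numbers, tuples, dicts, or sets. Writing [a, b, c] is an error unless a, b, and c are variables.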
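The four container types can be compared side by side in a short script. This is only an illustrative sketch; the variable names (point, letters, person, mixed) are invented here and do not come from the slides.

```python
# Tuple: ordered and immutable.
point = (1, 2, 3, 4)

# Set: duplicates collapse into a single element.
letters = {"a", "b", "c", "a"}

# Dict: key/value pairs, looked up by key.
person = {"name": "邓旭东", "age": 26, "gender": "male"}

# List: ordered and mutable; elements may be of mixed types.
mixed = [1, "2", (1, 2), {"k": "v"}, {"x"}]
mixed.append("new item")

# [a, b, c] is valid only because a, b, c are defined as variables first.
a, b, c = 1, 2, 3
from_variables = [a, b, c]

print(len(letters))
print(person["age"])
print(from_variables)
```

For a crawler, lists usually hold the URLs to visit and dicts hold one scraped record each.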
Pretty-printing the contents of the bsObj object
Tag objects
>>> bsObj.title
<title>The Dormouse's story</title>
>>> bsObj.head
<head><title>The Dormouse's story</title></head>
>>> bsObj.a
<a class="sister" href="/elsie" id="link1"></a>
Note: attribute access returns only the first matching tag in the whole document. It cannot be used to retrieve every matching tag.
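When every matching tag is needed rather than the first, BeautifulSoup's find_all method is the usual route. As a dependency-free sketch of the same idea, the example below uses the re library (also covered in this deck) to collect the href of every link in a shortened copy of the sample HTML. The regular expression is an assumption tailored to this simple markup and is not a general HTML parser.

```python
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<a href="/elsie" class="sister" id="link1"></a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
</body></html>
"""

# Capture the href value of every <a ...> tag, not just the first one.
hrefs = re.findall(r'<a\s+[^>]*href="([^"]*)"', html)
print(hrefs)
```

For anything beyond simple, regular markup a real parser such as BeautifulSoup is the safer choice, since regular expressions break easily on nested or irregular HTML.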
How to parse a web page
1. Firefox Firebug / Google Chrome Developer Tools
2. BeautifulSoup / the re library
Python basics
BeautifulSoup
Two main kinds of object: Tag and NavigableString
>>> from bs4 import BeautifulSoup
>>> html = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... The Dormouse's story
... Once upon a time there were three little sisters; and their names were
... <a href="/elsie" class="sister" id="link1"></a>,
... <a href="/lacie" class="sister" id="link2">Lacie</a> and
... <a href="/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.
... ...
... """
>>> bsObj = BeautifulSoup(html, "html.parser")
Building URLs from patterns

Common requests methods
Accessing a page with cookies
>>> Cookie = {'Cookie': 'UM_distinctid=15ab64ecfd6592-0afad5b368bd691d3b6853-13c680-15ab64ecfd7b6; remember_user_token=W1sxMjEzMTM3XSwiJDJhJDEwJHhjYklYOGl2eTQ0Yi54W C5seVh2UWUiLCIxNDg5ODI2OTgwLjg4ODQyODciXQ%3D%3D--ac835770a030c0595b2993289e39c37d82ea27e2; CNZZDATA1258679142=559069578-1488626597https%253A%252F%%252F%7C1489923851'}
>>> r = requests.get(url, headers=Cookie)  # the dict holds a raw Cookie header, so it is sent as a header
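The cookies= parameter of requests.get expects a dictionary that maps each cookie name to its value, not one long header string. A minimal sketch of converting a copied cookie string into that form follows. The helper cookie_header_to_dict and the sample string are invented for this illustration and are not part of requests.

```python
def cookie_header_to_dict(cookie_header):
    """Split 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because values may contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Made-up cookie string for illustration (not a real session).
raw = "session_id=abc123; theme=dark; token=a=b"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dictionary can then be passed as requests.get(url, cookies=cookies).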

url = Base_url.format(num=Num*20)
print(url)

https:///tag?start=0
https:///tag?start=20
https:///tag?start=40
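The pattern behind these URLs can be sketched as a short loop. The host example.com is a placeholder, since the real host is not shown in the slides, and the step of 20 per page is an assumption read off the printed URLs.

```python
# Hypothetical base URL; only the ?start= pattern comes from the slides.
base_url = "https://example.com/tag?start={num}"

# Assuming the start offset advances by 20 per page: 0, 20, 40, ...
urls = [base_url.format(num=page * 20) for page in range(3)]

for url in urls:
    print(url)
```

Increasing the argument to range extends the list to as many pages as the site exposes.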
