
A Brief Introduction to Python Web Scraping (Part 1)


requests

requests acts as an HTTP client, much like the curl command in the shell.

requests does the same job as urllib and urllib2, so why do so many people use requests instead of urllib2?
The official documentation puts it this way:

Python's standard library urllib2 already provides most of the HTTP features you need, but its API is dreadful: even a simple task takes a pile of code.

So with something this simple available, why insist on doing it the hard way?

First, the sidebar illustration from the official site. I have seen people say it looks like a soft-shelled turtle; forgive my limited insight, but I really cannot see it!

(image: requests-sidebar.png)

  • Chrome's Inspect (DevTools) panel

Usage:

# -*- coding: UTF-8 -*-

import requests

# A GET request returns a Response object
response = requests.get('http://news.sina.com.cn/')
print response    # <Response [200]>

# -*- coding: UTF-8 -*-

import requests

response = requests.get('http://news.sina.com.cn/')
# Print the body of the response as text
print response.text

The output turns out to be mojibake:

<!DOCTYPE html>
<!-- [ published at 2017-03-31 17:12:04 ] -->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>æ°é»ä¸­å¿é¦é¡µ_æ°æµªç½</title>

Check the encoding: requests did not realize the page is UTF-8; it fell back to ISO-8859-1 based on the HTTP headers.

print response.encoding

ISO-8859-1

Set the encoding explicitly:

response.encoding = 'utf-8'
print response.encoding    # utf-8

The complete script now reads:
# -*- coding: UTF-8 -*-

import requests

response = requests.get('http://news.sina.com.cn/')
response.encoding = 'utf-8'
print response.text



<!DOCTYPE html>
<!-- [ published at 2017-03-31 17:15:19 ] -->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>新闻中心首页_新浪网</title>
<meta name="keywords" content="新闻,时事,时政,国际,国内,社会,法治,聚焦,评论,文化,教育,新视点,深度,网评,专题,环球,传播,论坛,图片,军事,焦点,排行,环保,校园,法治,奇闻,真情">
<meta name="description" content="新浪网新闻中心是新浪网最重要的频道之一,24小时滚动报道国内、国际及社会新闻。每日编发新闻数以万计。">

..........................................
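Hard-coding 'utf-8' works for this site. As a side note, a minimal sketch (not from the original post): requests can also detect the charset from the page body itself via apparent_encoding, which uses chardet under the hood.

# -*- coding: UTF-8 -*-
# A minimal sketch: let requests guess the charset from the body instead of
# hard-coding it.

import requests

response = requests.get('http://news.sina.com.cn/')
print response.encoding             # ISO-8859-1 -- guessed from the HTTP headers
print response.apparent_encoding    # utf-8 -- detected from the body by chardet

# Adopt the detected encoding before reading .text
response.encoding = response.apparent_encoding
print response.text[:200]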

An example from the official documentation:

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
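Before the comparison, a small sketch (not from the original post) of the other options a crawler usually needs: query parameters, a custom User-Agent, and a timeout. httpbin.org is used here only because it echoes the request back.

# -*- coding: UTF-8 -*-
# A minimal sketch of common requests options for a small crawler.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}    # illustrative value, pretend to be a browser
params = {'q': 'python'}                   # appended to the URL as ?q=python

response = requests.get('http://httpbin.org/get',
                        params=params,
                        headers=headers,
                        timeout=5)         # give up instead of hanging forever

print response.status_code     # 200
print response.json()['args']  # {u'q': u'python'}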

Now the same thing without requests, using urllib2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2

gh_url = 'https://api.github.com'

req = urllib2.Request(gh_url)

password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, gh_url, 'user', 'pass')

auth_manager = urllib2.HTTPBasicAuthHandler(password_manager)
opener = urllib2.build_opener(auth_manager)

urllib2.install_opener(opener)

handler = urllib2.urlopen(req)

print handler.getcode()
print handler.headers.getheader('content-type')

# ------
# 200
# 'application/json'

Nothing stings like a side-by-side comparison!


Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your favourite parser, it gives you idiomatic ways to navigate, search, and modify the parse tree. It can save you hours or even days of work.

Installation

pip install beautifulsoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.

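As a quick illustration, a minimal sketch (not from the original post): the parser is chosen by the second argument to BeautifulSoup. 'html.parser' ships with Python, while 'lxml' is faster but needs an extra pip install lxml.

# -*- coding: UTF-8 -*-
# A minimal sketch: the second argument picks the parser.

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com').text

soup_std = BeautifulSoup(html, 'html.parser')   # standard-library parser
soup_lxml = BeautifulSoup(html, 'lxml')         # third-party lxml parser, same API

print soup_std.title.text     # Example Domain
print soup_lxml.title.text    # Example Domain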

Using Beautiful Soup
Beautiful Soup turns a complex HTML document into a tree of Python objects. Broadly, these objects fall into four classes: Tag, NavigableString, BeautifulSoup, and Comment.

A BeautifulSoup object represents the entire document; most of the time it can be treated like a Tag object.

A Tag object corresponds to an HTML tag.

Strings usually live inside tags; Beautiful Soup wraps a tag's text in the NavigableString class.
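A minimal sketch (not from the original post) showing all four object types on a tiny hand-written document:

# -*- coding: UTF-8 -*-
# The four Beautiful Soup object types.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="msg">hello</p><b><!--a comment--></b>', 'html.parser')

print type(soup)             # <class 'bs4.BeautifulSoup'>            -- the whole document
print type(soup.p)           # <class 'bs4.element.Tag'>              -- an HTML tag
print type(soup.p.string)    # <class 'bs4.element.NavigableString'>  -- the text inside a tag
print type(soup.b.string)    # <class 'bs4.element.Comment'>          -- an HTML comment
print soup.p['class']        # ['msg'] -- tag attributes behave like a dict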

A first example:

# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

print soup.head.title             # <title>Example Domain</title>
print soup.head.title.text        # 'Example Domain'
# Get the tag name
print soup.head.title.name        # 'title'
print soup.html.body.div.h1.text  # 'Example Domain'
print soup.html.body.a            # <a href="http://www.iana.org/domains/example">More information...</a>
print soup.html.body.a.attrs      # {'href': 'http://www.iana.org/domains/example'}
print soup.find_all('a')          # [<a href="http://www.iana.org/domains/example">More information...</a>]

print soup.html.body.a.text       # 'More information...'

A concrete test:

# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup

html = '''<html>
            <body>
                <div>
                    <h1>Example Domain</h1>
                    <p>This domain is established to be used for illustrative examples in documents. You may use this
                    domain in examples without prior coordination or asking for permission.</p>
                    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
                </div>
            </body>
        </html>'''
soup = BeautifulSoup(html)
print soup.text

If no parser type is given, Beautiful Soup emits a warning (not an exception), telling you that the parser should be specified explicitly when the soup is created:

 UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Add the parser type:

soup = BeautifulSoup(html, 'html.parser')

Output:

Example Domain
This domain is established to be used for illustrative examples in documents. You may use this
                    domain in examples without prior coordination or asking for permission.
More information...

Extracting the content of specific HTML tags

# select() returns a list of the matching tags; items can be accessed by index
In [3]: p_data = soup.select('p')

In [4]: print p_data
[<p>This domain is established to be used for illustrative examples in documents
. You may use this\n                    domain in examples without prior coordin
ation or asking for permission.</p>, <p><a href="http://www.iana.org/domains/exa
mple">More information...</a></p>]

# Index into the list and read the text inside the tag
In [5]: print (p_data[1].text)
More information...

In [6]: for p in p_data:
   ...:     print p
   ...:
<p>This domain is established to be used for illustrative examples in documents.You may use this domain in examples without prior coordination or asking for
permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>

Pulling a particular attribute, such as href, out of every a tag:


In [13]: links = soup.select('a')
# links is a list of Tag objects; each tag's attributes can be read like a dict
In [15]: for link in links:
    ...:     print link['href']
    ...:
http://www.iana.org/domains/example
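One caveat worth a minimal sketch (not from the original post): link['href'] raises a KeyError if a tag has no href attribute, so .get() is the safer form inside a loop.

# -*- coding: UTF-8 -*-
# tag['href'] raises KeyError when the attribute is missing; .get() returns None.

from bs4 import BeautifulSoup

html = '<a href="http://www.iana.org/domains/example">More information...</a><a name="top">no href here</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.select('a'):
    print link.get('href')    # prints the URL for the first tag, None for the second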

Selecting elements by a specific CSS id or class

  • To select by id, prefix the id with #:
In [17]: html = '''<html lang="en">
   ...:             <body>
   ...:                 <h1 id="title">Hello</h1>
   ...:                 <a href="#" class="link"> Link test 1 </a>
   ...:                 <a href="# link2" class="link"> Link test 2 </a>
   ...:             </body>
   ...:             </html>'''

In [18]: soup = BeautifulSoup(html, 'html.parser')
   ...:

In [19]: link = soup.select('#title')
   ...: print link
   ...:
[<h1 id="title">Hello</h1>]
  • To select by class, prefix the class name with .:
In [20]: for i in soup.select('.link'):
    ...:     print i
    ...:
<a class="link" href="#"> Link test 1 </a>
<a class="link" href="# link2"> Link test 2 </a>

# -*- coding: UTF-8 -*-
# Fetch Dalian real-estate news from leju.com

from __future__ import print_function

import requests
from bs4 import BeautifulSoup


def get_detail_news(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select('.title')[0].text
    # Get the publication time; note it comes back as a plain string, not a datetime
    time = soup.select('.origin span strong')[0].contents[0]
    source = soup.select('.linkRed02')[0].text
    print(title, time, source)

    # Get the article body -- loop version, kept commented out for reference
    # content = []
    # for item in soup.select('.article-body p'):
    #     content.append(item.text.strip())
    #
    # print(content)

    # The same thing as a one-liner
    print('\n'.join([item.text.strip() for item in soup.select('.article-body p')]))

    # Get the comment count (comments are injected via JavaScript); unique_id identifies this article
    # comment = soup.select('.comment span')[0].text
    # res = requests.get('http://comment.leju.com/api/comment/getarchivecommentcount?key=b6fc20c083d6672f62ea61c5f32921b7&unique_id=6254167782330783780')
    # print(res.text)


get_detail_news('http://dl.leju.com/news/2017-04-02/13096254167782330783780.shtml')
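The comment in the crawler above notes that the scraped time is still a plain string. A minimal sketch of the conversion, assuming a value like '2017-04-02 10:30' (hypothetical; the real format on the page may differ):

# -*- coding: UTF-8 -*-
# Parse a scraped time string into a datetime object.

from __future__ import print_function
from datetime import datetime

time_str = '2017-04-02 10:30'                        # hypothetical scraped value
dt = datetime.strptime(time_str, '%Y-%m-%d %H:%M')   # convert string -> datetime
print(dt.year, dt.month, dt.day)                     # 2017 4 2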

