Crawling Baidu Tieba with Scrapy

I. Crawling Baidu Tieba with Scrapy
1. Use tag attributes and attribute values in XPath expressions to select the content of elements.
2. Use the items module of the Scrapy project to associate the scraped content with fields.
3. Write the scraped information to a file in JSON format, then read it back and present it as a list of dictionaries.
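Point 1 — selecting elements by a tag attribute — can be illustrated without Scrapy using the standard library's `xml.etree.ElementTree`, which supports the same `[@attr='value']` predicate syntax (Scrapy's `response.xpath` adds fuller XPath 1.0 support such as `contains()`). The markup below is a hypothetical miniature of a thread-list page, not the real Tieba HTML:

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for a thread-list page (hypothetical markup).
html = """
<ul>
  <li class="j_thread_list"><a>Thread one</a></li>
  <li class="ad">Advertisement</li>
  <li class="j_thread_list"><a>Thread two</a></li>
</ul>
"""

root = ET.fromstring(html)
# Select only the <li> elements whose class attribute equals 'j_thread_list'
threads = root.findall(".//li[@class='j_thread_list']")
titles = [li.find("a").text for li in threads]
print(titles)  # ['Thread one', 'Thread two']
```

The attribute predicate filters out the advertisement item, which is exactly what the spider below does on the real page with `//li[@class=' j_thread_list clearfix']`.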

II. Steps
1. Create the project

scrapy startproject prjbaiduspider

2. Change into the prjbaiduspider directory and create the spider

scrapy genspider baidutiebaspider baidu.com

3. Edit items.py to associate the scraped information with fields

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PrjbaiduspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    reply = scrapy.Field()

4. Edit baidutiebaspider.py to extract the information and assign it to the fields

import scrapy
from prjbaiduspider.items import PrjbaiduspiderItem


class BaidutiebaspiderSpider(scrapy.Spider):
    name = 'baidutiebaspider'
    allowed_domains = ['baidu.com']
    start_urls = ['https://tieba.baidu.com/f?kw=%E9%87%91%E9%B1%BC&ie=utf-8']

    def parse(self, response):
        # select the set of <li> elements by their class attribute
        results = response.xpath("//li[@class=' j_thread_list clearfix']")
        for rst in results:
            # create an item instance
            item = PrjbaiduspiderItem()
            # get the title by tag attribute; contains() matches part of the
            # class value of the <div> container
            title = rst.xpath(".//div[contains(@class,'threadlist_title pull_left j_th_tit ')]/a/text()").extract()
            # get the author
            author = rst.xpath(".//div[contains(@class,'threadlist_author pull_right')]//span[contains(@class,'frs-author-name-wrap')]/a/text()").extract()
            # get the reply count
            reply = rst.xpath(".//div[contains(@class,'col2_left j_threadlist_li_left')]/span/text()").extract()

            # assign the scraped values to the item fields
            item['title'] = title
            item['author'] = author
            item['reply'] = reply

            yield item

5. Write the run script and set the output file

from scrapy import cmdline

# run the spider and write the results to item.json
cmdline.execute('scrapy crawl baidutiebaspider -o item.json'.split())

The resulting item.json file looks like this:
[
{"title": ["\u517b\u9c7c\u4e0d\u7528\u5438\u4fbf\u6d17\u68c9\u624d\u662f\u771f\u7684\u723d\uff0c\u5bb6\u91cc\u7684\u73bb\u7483\u7f38\u4e0a\u6ee4\u8fd8\u662f\u8981\u6d17\uff0c\u4e00\u5929\u5582"], "author": ["xlc5686514"], "reply": ["70"]},
{"title": ["\u60f3\u95ee\u4e0b\uff0c\u5927\u5bb6\u7528\u54ea\u79cd\u6f5c\u6c34\u6cf5\uff0c\u6c42\u4e00\u6b3e\u58f0\u97f3\u5c0f\u7684\u6f5c\u6c34\u6cf5"], "author": ["x\u67d0\u67d0\u7537"], "reply": ["21"]},
{"title": ["\u4e00\u6761\u5341\u5757\uff0c\u5356\u5bb6\u548c\u6211\u8bf4\u597d\u82d7\u5b50\uff0c\u6ca1\u60f3\u5230\u8fd9\u4e48\u5783\u573e"], "author": ["\u50b2\u5a07\u5c0f"], "reply": ["7"]},
......
]
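The `\uXXXX` sequences above are standard JSON escapes: Scrapy's JSON exporter writes ASCII-safe output by default, and `json.load` restores the original Chinese characters when the file is read back. A quick sketch of the round trip, using the crawl keyword (which also appears URL-encoded as `kw=%E9%87%91%E9%B1%BC` in the spider's start URL):

```python
import json
from urllib.parse import unquote

# '\u91d1\u9c7c' is the JSON-escaped form of the crawl keyword;
# json.loads turns the escapes back into the two original characters
keyword = json.loads('"\\u91d1\\u9c7c"')
print(keyword, len(keyword))  # 金鱼 2

# the same text appears URL-encoded in the spider's start URL
assert unquote('%E9%87%91%E9%B1%BC') == keyword
```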

6. Read the JSON file and convert it to a list of dictionaries

import json

with open('item.json') as f:
    new_list = json.load(f)
    rownum = 0
    for i in new_list:
        rownum += 1
        print("line{}: title:{}, author:{}, reply:{}.".format(
            rownum, i['title'][0], i['author'][0], i['reply'][0]))

The output looks like this:
line37: title:发个关于加热器的请教贴, author:夏夜追凉, reply:36.
line38: title:金鱼缸不要放造浪泵, author:仙音环绕, reply:12.
line39: title:新下缸频繁擦缸、抖鳍。求各位大佬支招, author:原来10292, reply:3.
line40: title:帮我看看哪个灯好,吉印的有点贵暂时不考虑,他们区别大吗, author:鬼谷, reply:5.
line41: title:菌腮治了7天 死了3天 还剩2条 不知道好没好 状态一般, author:httpwoa, reply:8.

.......

Exercise: scrape the section listings of Tieba or the Tianya site, following the requirements shown in the example above, and extract across multiple pages.
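For the paging part of the exercise, one approach is to generate the listing URLs from the `pn` query parameter, which on Tieba appears to advance in steps of 50 per page. This is a sketch under that assumption; the step size and parameter name should be verified against the live site:

```python
from urllib.parse import urlencode


def tieba_page_urls(keyword: str, pages: int, step: int = 50):
    """Build the listing URLs for the first `pages` pages of a Tieba forum.

    Assumes the `pn` parameter advances by `step` per page (as observed on
    tieba.baidu.com); adjust if the site changes.
    """
    base = "https://tieba.baidu.com/f"
    return [f"{base}?{urlencode({'kw': keyword, 'ie': 'utf-8', 'pn': i * step})}"
            for i in range(pages)]


for url in tieba_page_urls("金鱼", 3):
    print(url)
```

In the spider, these URLs could replace `start_urls`; alternatively, each page's "next" link can be extracted in `parse` and followed with `response.follow`.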

© 2021 成都云创动力科技有限公司 蜀ICP备20006351号-1