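I'm using this forum as the demo target site, purely for demonstration. If that's a problem, let me know and I'll delete the post.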
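This post is meant to show how powerful BS4 is under Python. It treats the whole web page as an object made up of elements, for example <a href='...'>...</a> or <div>...</div>, and every element can be located precisely, comments included.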
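It takes very little code and extracts page content precisely. If you're still cutting content out of pages with regular expressions, you're OUT.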
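To make the "every element can be located, comments included" point concrete, here is a minimal sketch. The HTML snippet, the id `box` and the link targets are made up for illustration and are not taken from any real page. It uses the `string=` argument of `find_all`, which exists in newer Beautiful Soup 4 releases (older ones call it `text=`).

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, invented for this illustration only.
html = """
<div id="box">
  <!-- a hidden note -->
  <a href="/thread-1.html">First thread</a>
  <a href="/thread-2.html">Second thread</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate one element by id, then every <a> tag inside it.
box = soup.find(id="box")
links = [(a["href"], a.string) for a in box.find_all("a")]

# Comments are nodes in the tree too: filter text nodes by the Comment type.
comments = [c.strip() for c in soup.find_all(string=lambda s: isinstance(s, Comment))]

print(links)
print(comments)
```

No regular expression is involved: the tags and the comment are picked out by their place in the parsed tree.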
Chinese documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Beautiful Soup will save you hours or even days of work.
As of this post, Beautiful Soup 4.2.0 has been released, and it supports Python 3.5.

```python
import bs4
import urllib.request


def bathome(soup, idstring):
    # Locate the container by its id, then collect every link inside it.
    homegrids = soup.find(id=idstring)
    links = homegrids.find_all('a')

    # The links alternate: odd ones hold the user name, even ones the article title.
    user = ''
    for count, link in enumerate(links, start=1):
        if count % 2 == 0:
            print('%-15s article: %3s' % (user, link.string))
        else:
            user = 'User: %s' % link.string


url = 'http://www.bathome.net/'
web = urllib.request.urlopen(url)
soup = bs4.BeautifulSoup(web, 'html.parser')
print("Latest topics:")
bathome(soup, 'homegrids_c_1')
print("Latest replies:")
bathome(soup, 'homegrids_c_2')
print("Hot topics:")
bathome(soup, 'homegrids_c_3')
```
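If you want to try the pairing logic without hitting the forum, something like the sketch below should work. The sample HTML, the user names and titles, and the helper name `user_article_pairs` are all hypothetical. It only assumes, as the script above does, that the links inside the container alternate between user and article.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one of the forum's link blocks (not the real markup).
SAMPLE = """
<div id="homegrids_c_1">
  <a href="/u/1">alice</a><a href="/t/10">Batch tips</a>
  <a href="/u/2">bob</a><a href="/t/11">VBS question</a>
</div>
"""


def user_article_pairs(soup, idstring):
    """Return (user, article) tuples from the links inside the given id."""
    container = soup.find(id=idstring)
    if container is None:
        # find() returns None when the id is missing, so bail out early.
        return []
    anchors = container.find_all("a")
    # Even positions are users, odd positions are articles: zip the two slices.
    return [(u.string, t.string) for u, t in zip(anchors[0::2], anchors[1::2])]


soup = BeautifulSoup(SAMPLE, "html.parser")
pairs = user_article_pairs(soup, "homegrids_c_1")
for user, title in pairs:
    print("User: %-10s Article: %s" % (user, title))
```

Returning a list instead of printing inside the function makes the result easy to reuse, and the `None` check avoids an AttributeError if the page layout changes and the id disappears.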