This week I used Python to collect all the VR-related news on TechCrunch. There are so many stories on TechCrunch that I didn't want to click them one by one, so my script opens the site page by page, finds the matching titles, goes into each article, and pastes all of its images into an HTML file.
Here’s my code:
import time

import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

def downloadImgs(url):
    # Fetch an article page and print every <img> tag in its body.
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    entry = soup.select('.article-entry')
    if not entry:  # some pages have no article body; skip them
        return
    for img in entry[0].select('img'):
        print(img, "<br>")

def search_all_content(url):
    # Scan one listing page and keep only titles that mention "vr".
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    for title in soup.select('.post-title a'):
        blob = TextBlob(title.text)
        # TextBlob's word count is case-insensitive, so "VR" matches too.
        if blob.words.count('vr') > 0:
            print("<div>", title, "<br>")
            downloadImgs(title['href'])
            print("</div>")

base_url = 'https://techcrunch.com/page/'

print('''<!DOCTYPE html><html><head>
<meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
<style>
div { padding: 30px; }
a {
  position: relative;
  font-size: 25px;
  font-weight: bold;
  color: black;
  text-decoration: none;
  padding-bottom: 5px;
}
a:hover { text-decoration: underline; }
</style>
</head><body>''')

for pagenumber in range(1, 200):
    search_all_content(base_url + str(pagenumber))
    time.sleep(0.5)  # be polite to the server between pages

print('</body></html>')
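The filtering step leans on TextBlob only for a case-insensitive word count. If you don't want the TextBlob dependency, the same check can be sketched with a standard-library regex (the helper name `mentions_vr` is mine, not from the script above):

```python
import re

def mentions_vr(title):
    # Match "vr" as a whole word, ignoring case, so "VR" and "vr"
    # count but words that merely contain the letters (e.g. "Chevrolet")
    # do not.
    return re.search(r'\bvr\b', title, re.IGNORECASE) is not None

print(mentions_vr("Oculus unveils new VR headset"))  # True
print(mentions_vr("Chevrolet launches a new app"))   # False
```

The `\b` word boundaries are what keep the match from firing inside longer words, which a plain substring check would not.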
I redirect the script's output to a file, so I end up with all the news titles and images in one HTML page.
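Instead of printing everything and redirecting stdout in the shell, the script could also build the markup in memory and write the file itself. A minimal sketch of that idea (the filename `vr_news.html` and the helper `add_article` are my own choices, not part of the script above):

```python
# Collect the markup in a list and write it out once at the end.
parts = ['<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>']

def add_article(title, link, img_tags):
    # img_tags would be the <img> tag strings scraped from the article page.
    parts.append('<div><a href="{}">{}</a><br>'.format(link, title))
    parts.extend(img_tags)
    parts.append('</div>')

add_article('Example VR story', 'https://techcrunch.com/example',
            ['<img src="https://example.com/pic.jpg">'])
parts.append('</body></html>')

with open('vr_news.html', 'w', encoding='utf-8') as f:
    f.write('\n'.join(parts))
```

This avoids depending on shell redirection and makes it easy to, say, write one file per page later.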