From the HTML returned by the requested URL, we extract the values between the <loc> and </loc> tags, then issue a new request for each extracted link.
+ improved download() (falls back to a default charset, retries on 5xx server errors)
code:
import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset  # fall back to the default charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            # retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries=num_retries - 1)
    return html

def crawl_sitemap(url):
    sitemap = download(url)
    # extract every URL between <loc> and </loc>
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:
        html = download(link)

crawl_sitemap('http://example.webscraping.com/sitemap.xml')
#download('http://example.webscraping.com/sitemap.xml')
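To see just the <loc> extraction step in isolation, here is a minimal sketch with a hand-written sitemap string (the URLs below are made up for demonstration; a real run would fetch the sitemap with download()):

```python
import re

# a tiny hand-written sitemap for demonstration, not fetched from the network
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url><loc>http://example.webscraping.com/view/1</loc></url>
  <url><loc>http://example.webscraping.com/view/2</loc></url>
</urlset>"""

# the non-greedy group (.*?) captures each URL between the tags
links = re.findall('<loc>(.*?)</loc>', sitemap)
print(links)
```

The non-greedy `(.*?)` matters: a greedy `(.*)` would swallow everything from the first <loc> to the last </loc> on the same line.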