webmagic: downloading a page
Published: 2019-06-24


Below is the download method of HttpClientDownloader, webmagic's official default Downloader implementation.

    @Override
    public Page download(Request request, Task task) {
        Site site = null;
        if (task != null) {
            site = task.getSite();
        }
        Set<Integer> acceptStatCode;
        String charset = null;
        Map<String, String> headers = null;
        if (site != null) {
            acceptStatCode = site.getAcceptStatCode();
            charset = site.getCharset();
            headers = site.getHeaders();
        } else {
            acceptStatCode = Sets.newHashSet(200);
        }
        logger.info("downloading page {}", request.getUrl());
        CloseableHttpResponse httpResponse = null;
        int statusCode = 0;
        try {
            HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers);
            httpResponse = getHttpClient(site).execute(httpUriRequest);
            statusCode = httpResponse.getStatusLine().getStatusCode();
            request.putExtra(Request.STATUS_CODE, statusCode);
            if (statusAccept(acceptStatCode, statusCode)) {
                Page page = handleResponse(request, charset, httpResponse, task);
                onSuccess(request);
                return page;
            } else {
                logger.warn("code error " + statusCode + "\t" + request.getUrl());
                return null;
            }
        } catch (IOException e) {
            logger.warn("download page " + request.getUrl() + " error", e);
            if (site.getCycleRetryTimes() > 0) {
                return addToCycleRetry(request, site);
            }
            onError(request);
            return null;
        } finally {
            request.putExtra(Request.STATUS_CODE, statusCode);
            try {
                if (httpResponse != null) {
                    //ensure the connection is released back to pool
                    EntityUtils.consume(httpResponse.getEntity());
                }
            } catch (IOException e) {
                logger.warn("close response fail", e);
            }
        }
    }
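The branches of download can be easier to see without the HttpClient plumbing. The following is a minimal sketch in plain Java, not webmagic code: the Fetcher interface, the Outcome enum and the class name are hypothetical names made up for illustration, and it assumes that statusAccept is simply a membership test on the accepted set (its body is not shown in this post).

```java
import java.io.IOException;
import java.util.Set;

// Hypothetical stand-in for the control flow of download(); not webmagic code.
class DownloadFlowSketch {

    /** Hypothetical fetcher: returns an HTTP status code or throws on I/O failure. */
    interface Fetcher {
        int fetch(String url) throws IOException;
    }

    /** Possible outcomes, one per branch of download(). */
    enum Outcome { PAGE, REJECTED_STATUS, RETRY_SCHEDULED, FAILED }

    static Outcome download(String url, Set<Integer> acceptStatCode,
                            int cycleRetryTimes, Fetcher fetcher) {
        try {
            int statusCode = fetcher.fetch(url);
            // Assumption: a status is accepted when it is in the configured set.
            return acceptStatCode.contains(statusCode)
                    ? Outcome.PAGE
                    : Outcome.REJECTED_STATUS;
        } catch (IOException e) {
            // An I/O failure re-queues the request only when cycle retry is enabled.
            return cycleRetryTimes > 0 ? Outcome.RETRY_SCHEDULED : Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        Set<Integer> accepted = Set.of(200);
        System.out.println(download("http://example.com", accepted, 0, u -> 200));
        System.out.println(download("http://example.com", accepted, 0, u -> 404));
        System.out.println(download("http://example.com", accepted, 3,
                u -> { throw new IOException("timeout"); }));
    }
}
```

In the real method the two non-page outcomes without retry both surface as a null return, so callers cannot tell a rejected status from a failed download without reading the log or the Request.STATUS_CODE extra.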

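The first method that download calls, getHttpUriRequest, builds an org.apache.http.client.methods.HttpUriRequest. It is an important method: this is where the request headers, the timeouts, and the cookie spec are set.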

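Most importantly, this is also where the proxy IP is applied at the lowest level: the proxy pool is used when it is enabled, otherwise the single proxy configured on the Site.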

    protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers) {
        RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
        if (headers != null) {
            for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
                requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
            }
        }
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
                .setConnectionRequestTimeout(site.getTimeOut())
                .setSocketTimeout(site.getTimeOut())
                .setConnectTimeout(site.getTimeOut())
                .setCookieSpec(CookieSpecs.BEST_MATCH);
        if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) {
            HttpHost host = site.getHttpProxyFromPool();
            requestConfigBuilder.setProxy(host);
            request.putExtra(Request.PROXY, host);
        } else if (site.getHttpProxy() != null) {
            HttpHost host = site.getHttpProxy();
            requestConfigBuilder.setProxy(host);
            request.putExtra(Request.PROXY, host);
        }
        requestBuilder.setConfig(requestConfigBuilder.build());
        return requestBuilder.build();
    }
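The proxy selection in getHttpUriRequest comes down to a fixed precedence: an enabled pool wins, then the single configured proxy, then no proxy at all. Here is a small sketch of just that rule in plain Java. The ProxyPool interface, poolOf helper and "host:port" strings are hypothetical stand-ins for webmagic's proxy pool and HttpHost, used only to keep the example self-contained.

```java
import java.util.Optional;

// Hypothetical sketch of the proxy precedence rule; not webmagic code.
class ProxyChoiceSketch {

    /** Hypothetical stand-in for the proxy pool held by Site. */
    interface ProxyPool {
        boolean isEnable();
        String next(); // returns a "host:port" string in this sketch
    }

    /** Hypothetical helper that builds a pool always handing out one proxy. */
    static ProxyPool poolOf(boolean enabled, String proxy) {
        return new ProxyPool() {
            @Override public boolean isEnable() { return enabled; }
            @Override public String next() { return proxy; }
        };
    }

    /** Pool first (if present and enabled), then the fixed proxy, else none. */
    static Optional<String> chooseProxy(ProxyPool pool, String fixedProxy) {
        if (pool != null && pool.isEnable()) {
            return Optional.ofNullable(pool.next());
        } else if (fixedProxy != null) {
            return Optional.of(fixedProxy);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ProxyPool enabled = poolOf(true, "10.0.0.1:8080");
        ProxyPool disabled = poolOf(false, "10.0.0.1:8080");
        System.out.println(chooseProxy(enabled, "10.0.0.9:3128"));
        System.out.println(chooseProxy(disabled, "10.0.0.9:3128"));
        System.out.println(chooseProxy(null, null));
    }
}
```

One consequence of this ordering is that a fixed proxy set on the Site is silently ignored whenever an enabled pool is also configured.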

Next is the second method that download calls, getHttpClient, which returns an org.apache.http.impl.client.CloseableHttpClient:

    private CloseableHttpClient getHttpClient(Site site) {
        if (site == null) {
            return httpClientGenerator.getClient(null);
        }
        String domain = site.getDomain();
        //Map<String, CloseableHttpClient> httpClients
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }
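getHttpClient keeps one client per domain and creates it lazily with a double-checked lookup inside synchronized (this). The sketch below shows the same "one instance per domain, created once" idea, but it deliberately swaps in a different technique, ConcurrentHashMap.computeIfAbsent, rather than reproducing webmagic's locking. The class name and the generic factory are hypothetical; the declaration of httpClients is not shown in this post, so nothing here claims to match it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical per-domain cache; C stands in for CloseableHttpClient.
class PerDomainCacheSketch<C> {

    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    PerDomainCacheSketch(Function<String, C> factory) {
        this.factory = factory;
    }

    /**
     * Returns the cached client for the domain, creating it at most once.
     * Note: ConcurrentHashMap rejects null keys, so a null domain must be
     * handled by the caller (the original handles site == null up front).
     */
    C get(String domain) {
        return clients.computeIfAbsent(domain, factory);
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        PerDomainCacheSketch<String> cache =
                new PerDomainCacheSketch<>(d -> d + "#" + created.incrementAndGet());
        cache.get("a.com");
        cache.get("a.com"); // served from the cache, factory not called again
        cache.get("b.com");
        System.out.println("clients created: " + created.get());
    }
}
```

The trade-off is that computeIfAbsent runs the factory while holding an internal lock for that key, so the factory should not itself touch the same map.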

The third method that download calls, handleResponse, returns a us.codecraft.webmagic.Page object, which is webmagic's own wrapper around the downloaded page:

    protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
        String content = getContent(charset, httpResponse);
        Page page = new Page();
        page.setRawText(content);
        page.setUrl(new PlainText(request.getUrl()));
        page.setRequest(request);
        page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
        return page;
    }
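handleResponse passes the Site's charset to getContent, whose body is not shown in this post. To illustrate why the charset matters when turning the response bytes into the raw text stored on the Page, here is a hypothetical helper in plain Java. It assumes the configured charset is used when present and falls back to UTF-8 otherwise; the fallback is an assumption for this sketch only and may differ from what webmagic actually does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical decoding helper; not webmagic's getContent.
class ContentDecodeSketch {

    /** Decodes the body with the given charset name, or UTF-8 when it is null. */
    static String decode(byte[] body, String charset) {
        Charset cs = (charset != null)
                ? Charset.forName(charset)
                : StandardCharsets.UTF_8; // assumed fallback, for illustration
        return new String(body, cs);
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1 is not valid UTF-8 for the last byte.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "ISO-8859-1").equals("caf\u00e9"));
        System.out.println(decode(latin1, null).equals("caf\u00e9"));
    }
}
```

Decoding with the wrong charset does not throw; it quietly produces replacement characters, which is why setting the charset on the Site is worth doing for pages that are not UTF-8.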

 

Reposted from: https://www.cnblogs.com/guazi/p/6676260.html
