下面是webmagic官方的默认实现HttpClientDownloader中的下载方法。
@Override public Page download(Request request, Task task) { Site site = null; if (task != null) { site = task.getSite(); } SetacceptStatCode; String charset = null; Map headers = null; if (site != null) { acceptStatCode = site.getAcceptStatCode(); charset = site.getCharset(); headers = site.getHeaders(); } else { acceptStatCode = Sets.newHashSet(200); } logger.info("downloading page {}", request.getUrl()); CloseableHttpResponse httpResponse = null; int statusCode=0; try { HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers); httpResponse = getHttpClient(site).execute(httpUriRequest); statusCode = httpResponse.getStatusLine().getStatusCode(); request.putExtra(Request.STATUS_CODE, statusCode); if (statusAccept(acceptStatCode, statusCode)) { Page page = handleResponse(request, charset, httpResponse, task); onSuccess(request); return page; } else { logger.warn("code error " + statusCode + "\t" + request.getUrl()); return null; } } catch (IOException e) { logger.warn("download page " + request.getUrl() + " error", e); if (site.getCycleRetryTimes() > 0) { return addToCycleRetry(request, site); } onError(request); return null; } finally { request.putExtra(Request.STATUS_CODE, statusCode); try { if (httpResponse != null) { //ensure the connection is released back to pool EntityUtils.consume(httpResponse.getEntity()); } } catch (IOException e) { logger.warn("close response fail", e); } } }
上面第一个标黄的方法,构造org.apache.http.client.methods.HttpUriRequest。这是一个挺重要的方法,这里面涉及到各种请求头文件之类的东西。
还有最重要的代理ip这里也是底层实现的地方。
protected HttpUriRequest getHttpUriRequest(Request request, Site site, Mapheaders) { RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl()); if (headers != null) { for (Map.Entry headerEntry : headers.entrySet()) { requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue()); } } RequestConfig.Builder requestConfigBuilder = RequestConfig.custom() .setConnectionRequestTimeout(site.getTimeOut()) .setSocketTimeout(site.getTimeOut()) .setConnectTimeout(site.getTimeOut()) .setCookieSpec(CookieSpecs.BEST_MATCH); if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) { HttpHost host = site.getHttpProxyFromPool(); requestConfigBuilder.setProxy(host); request.putExtra(Request.PROXY, host); }else if(site.getHttpProxy()!= null){ HttpHost host = site.getHttpProxy(); requestConfigBuilder.setProxy(host); request.putExtra(Request.PROXY, host); } requestBuilder.setConfig(requestConfigBuilder.build()); return requestBuilder.build(); }
下面进入download方法中标黄的第二个方法,这个方法返回一个org.apache.http.impl.client.CloseableHttpClient类型对象:
private CloseableHttpClient getHttpClient(Site site) { if (site == null) { return httpClientGenerator.getClient(null); } String domain = site.getDomain(); //MaphttpClients CloseableHttpClient httpClient = httpClients.get(domain); if (httpClient == null) { synchronized (this) { httpClient = httpClients.get(domain); if (httpClient == null) { httpClient = httpClientGenerator.getClient(site); httpClients.put(domain, httpClient); } } } return httpClient; }
进入download第三个标黄的方法,该方法返回一个us.codecraft.webmagic.Page对象,这个page对象是webmagic自己封装的对象:
protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException { String content = getContent(charset, httpResponse); Page page = new Page(); page.setRawText(content); page.setUrl(new PlainText(request.getUrl())); page.setRequest(request); page.setStatusCode(httpResponse.getStatusLine().getStatusCode()); return page; }