public class StandardSitemapResolver extends Object implements ISitemapResolver
Implementation of ISitemapResolver as per sitemap.xml standard
defined at
http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
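The notion of a URL "root" and the resolve-once rule can be illustrated with a minimal sketch. The class `UrlRootSketch` and its methods are hypothetical names used for illustration only. They are not part of the Norconex API, and the resolver's actual implementation may differ.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Norconex API.
final class UrlRootSketch {
    // URL roots for which sitemaps were already resolved.
    private final Set<String> resolvedRoots = new HashSet<>();

    // Returns "scheme://host", plus ":port" when a port is explicitly present.
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        String root = uri.getScheme() + "://" + uri.getHost();
        return uri.getPort() == -1 ? root : root + ":" + uri.getPort();
    }

    // Returns true only the first time a given URL root is seen.
    boolean shouldResolve(String url) {
        return resolvedRoots.add(urlRoot(url));
    }

    public static void main(String[] args) {
        UrlRootSketch tracker = new UrlRootSketch();
        System.out.println(urlRoot("https://example.com:8443/docs/page.html"));
        System.out.println(tracker.shouldResolve("http://example.com/a.html"));
        System.out.println(tracker.shouldResolve("http://example.com/b/c.html"));
    }
}
```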
The Sitemap specification dictates that a sitemap.xml file defined
in a sub-directory applies only to URLs found in that sub-directory and
its children. This behavior is respected by default. Setting lenient
to true lifts this restriction.
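The sub-directory rule can be sketched as a simple prefix check. The class `SitemapScopeSketch` and its methods are hypothetical and only illustrate the idea. The sketch assumes the sitemap URL is absolute and contains a path, and it does not claim to reproduce how this class performs the check internally.

```java
// Hypothetical helper for illustration; not part of the Norconex API.
final class SitemapScopeSketch {
    // Directory portion of a sitemap URL: everything up to and including the last '/'.
    // Assumes an absolute URL with a path, e.g. http://example.com/catalog/sitemap.xml.
    static String directoryOf(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    // A page URL is accepted when lenient is true, or when it sits in the
    // sitemap's directory or one of its children.
    static boolean accepts(String sitemapUrl, String pageUrl, boolean lenient) {
        return lenient || pageUrl.startsWith(directoryOf(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "http://example.com/catalog/sitemap.xml";
        System.out.println(accepts(sitemap, "http://example.com/catalog/item/1.html", false));
        System.out.println(accepts(sitemap, "http://example.com/images/logo.png", false));
        System.out.println(accepts(sitemap, "http://example.com/images/logo.png", true));
    }
}
```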
Paths relative to URL roots can be specified, and an attempt will be made
to load and parse any sitemap found at those locations for each URL root
encountered (except for "start URLs" sitemaps, see below). Default paths
are /sitemap.xml and /sitemap_index.xml.
Passing null or an empty path array to
setSitemapPaths(String...) prevents attempts to locate
sitemaps; only sitemaps found in robots.txt or defined as start
URLs will then be considered.
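As a rough illustration of how relative paths combine with a URL root, the following sketch builds the candidate sitemap URLs. `SitemapPathsSketch`, `DEFAULT_PATHS`, and `candidates` are hypothetical names. The default values mirror the ones documented above, but the code is an assumption about the general idea, not the class's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Norconex API.
final class SitemapPathsSketch {
    // Mirrors the documented defaults.
    static final String[] DEFAULT_PATHS = {"/sitemap.xml", "/sitemap_index.xml"};

    // Builds one candidate URL per path. A null or empty array yields no
    // candidates, which corresponds to "do not try to locate sitemaps".
    static List<String> candidates(String urlRoot, String... paths) {
        List<String> urls = new ArrayList<>();
        if (paths == null) {
            return urls;
        }
        for (String path : paths) {
            urls.add(urlRoot + path);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(candidates("http://example.com", DEFAULT_PATHS));
        System.out.println(candidates("http://example.com", (String[]) null));
    }
}
```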
Sitemaps can be specified as "start URLs" (defined in
HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined
that way will be the only ones resolved for the root URL they represent
(sitemap paths or sitemaps defined in robots.txt won't apply).
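The precedence described above can be sketched as a small selection function. `SitemapPrecedenceSketch` and `select` are hypothetical names. The ordering of robots.txt sitemaps before path-based ones in the fallback branch is an assumption made for the example; the documentation does not specify it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Norconex API.
final class SitemapPrecedenceSketch {
    // Picks the sitemap locations to resolve for a single URL root.
    static List<String> select(List<String> startUrlSitemaps,
                               List<String> robotsTxtSitemaps,
                               List<String> pathSitemaps) {
        // Start URL sitemaps, when present, are the only ones used for the root.
        if (!startUrlSitemaps.isEmpty()) {
            return new ArrayList<>(startUrlSitemaps);
        }
        // Otherwise fall back to robots.txt and path-based sitemaps
        // (the relative order here is an assumption).
        List<String> selected = new ArrayList<>(robotsTxtSitemaps);
        selected.addAll(pathSitemaps);
        return selected;
    }

    public static void main(String[] args) {
        List<String> start = new ArrayList<>();
        start.add("http://example.com/custom-sitemap.xml");
        List<String> robots = new ArrayList<>();
        robots.add("http://example.com/robots-sitemap.xml");
        List<String> paths = new ArrayList<>();
        paths.add("http://example.com/sitemap.xml");

        System.out.println(select(start, robots, paths));
        System.out.println(select(new ArrayList<>(), robots, paths));
    }
}
```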
Sitemaps are first stored in a local temporary file before
being parsed. The tempDir constructor argument specifies the
directory in which those files are stored. When null, the system
temporary directory is used, as returned by
FileUtils.getTempDirectoryPath().
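The null fallback can be sketched with the JDK alone. `TempDirSketch` and its methods are hypothetical names, and the `sitemap-` file prefix is an arbitrary choice for the example. The sketch reads the `java.io.tmpdir` system property directly, which is the value Apache Commons IO's `FileUtils.getTempDirectoryPath()` returns.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical helper for illustration; not part of the Norconex API.
final class TempDirSketch {
    // Uses the given directory, or the system temporary directory when null.
    static File effectiveTempDir(File tempDir) {
        return tempDir != null
                ? tempDir
                : new File(System.getProperty("java.io.tmpdir"));
    }

    // Creates an empty temporary file in which a downloaded sitemap could be stored.
    static File newTempFile(File tempDir) {
        try {
            return Files.createTempFile(
                    effectiveTempDir(tempDir).toPath(), "sitemap-", ".xml").toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = newTempFile(null);
        System.out.println("Temporary sitemap file: " + file);
        file.delete();
    }
}
```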
| Modifier and Type | Field and Description |
|---|---|
| static String[] | DEFAULT_SITEMAP_PATHS |
| Constructor and Description |
|---|
| StandardSitemapResolver(File tempDir, SitemapStore sitemapStore) |
| Modifier and Type | Method and Description |
|---|---|
| boolean | equals(Object other) |
| long | getFrom() |
| String[] | getSitemapLocations() - Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs() |
| String[] | getSitemapPaths() - Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
| File | getTempDir() - Gets the directory where temporary sitemap files are written. |
| int | hashCode() |
| boolean | isEscalateErrors() |
| boolean | isLenient() |
| void | resolveSitemaps(HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs) - Resolves the sitemap instructions for a URL "root" (e.g. |
| void | setEscalateErrors(boolean escalateErrors) |
| void | setFrom(long from) |
| void | setLenient(boolean lenient) |
| void | setSitemapLocations(String... sitemapLocations) - Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[]) |
| void | setSitemapPaths(String... sitemapPaths) - Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
| void | setTempDir(File tempDir) - Sets the directory where temporary sitemap files are written. |
| void | stop() - Stops any ongoing sitemap resolution. |
| String | toString() |
public static final String[] DEFAULT_SITEMAP_PATHS

public StandardSitemapResolver(File tempDir, SitemapStore sitemapStore)

public String[] getSitemapPaths()

public void setSitemapPaths(String... sitemapPaths)
sitemapPaths - sitemap paths.

public void resolveSitemaps(HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs)
Specified by: resolveSitemaps in interface ISitemapResolver
httpClient - the http client to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLAdder - where to store retrieved site map URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())

@Deprecated public String[] getSitemapLocations()
Deprecated. Use HttpCrawlerConfig.getStartSitemapURLs()

public void setSitemapLocations(String... sitemapLocations)
Deprecated. Use HttpCrawlerConfig.setStartSitemapURLs(String[])
sitemapLocations - sitemap locations

public boolean isLenient()

public void setLenient(boolean lenient)

public long getFrom()

public void setFrom(long from)

public boolean isEscalateErrors()

public void setEscalateErrors(boolean escalateErrors)

public File getTempDir()

public void setTempDir(File tempDir)
tempDir - directory

public void stop()
Specified by: stop in interface ISitemapResolver

Copyright © 2009–2020 Norconex Inc. All rights reserved.