public class StandardSitemapResolver extends Object implements ISitemapResolver
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
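As an illustration of what a URL "root" looks like, the following minimal sketch derives one with the standard java.net.URI class. The class and method names (UrlRootSketch, urlRoot) are hypothetical and not part of this API; the resolver's own derivation may differ.

```java
import java.net.URI;

/** Illustrative helper only; not part of the Norconex API. */
final class UrlRootSketch {

    /** Returns the protocol, host and (when explicitly present) port of a URL. */
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        StringBuilder root = new StringBuilder();
        root.append(uri.getScheme()).append("://").append(uri.getHost());
        // getPort() returns -1 when the URL does not specify a port.
        if (uri.getPort() != -1) {
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }
}
```

Under this sketch, two pages on the same protocol, host and port share one root, so their sitemaps would be resolved only once.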
The sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. Setting lenient to true lifts this restriction.
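The sub-directory rule can be pictured as a simple prefix check. The sketch below is an assumption about how such a check could be written, not the code used by this class; SitemapScopeSketch and inScope are hypothetical names.

```java
/** Illustrative helper only; not part of the Norconex API. */
final class SitemapScopeSketch {

    /**
     * Returns true if pageUrl may be accepted from the sitemap at sitemapUrl.
     * In lenient mode every URL is accepted; otherwise the page URL must
     * start with the directory that contains the sitemap file.
     */
    static boolean inScope(String sitemapUrl, String pageUrl, boolean lenient) {
        if (lenient) {
            return true;
        }
        // Directory of the sitemap, including its trailing slash.
        String dir = sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
        return pageUrl.startsWith(dir);
    }
}
```

With a sitemap at http://example.com/catalog/sitemap.xml, this check accepts http://example.com/catalog/show?item=1 and rejects http://example.com/images/a.png unless lenient is true.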
Paths relative to URL roots can be specified, and an attempt will be made to load and parse any sitemap found at those locations for each URL root encountered (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml.
Passing null or an empty path array to setSitemapPaths(String...) will prevent attempts to locate sitemaps; only sitemaps found in robots.txt or defined as start URLs will be considered.
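To make the effect of the sitemap paths concrete, here is a minimal sketch of how relative paths could be expanded against a URL root, and how a null or empty array yields no candidates. It is written under the assumption that expansion is plain concatenation; SitemapPathsSketch and candidateUrls are hypothetical names, and DEFAULT_PATHS merely mirrors the defaults named above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper only; not part of the Norconex API. */
final class SitemapPathsSketch {

    // Mirrors the default paths named in the class description.
    static final String[] DEFAULT_PATHS = {"/sitemap.xml", "/sitemap_index.xml"};

    /** Builds one candidate sitemap URL per path; null or empty paths give an empty list. */
    static List<String> candidateUrls(String urlRoot, String... paths) {
        List<String> urls = new ArrayList<>();
        if (paths == null) {
            return urls;
        }
        for (String path : paths) {
            urls.add(urlRoot + path);
        }
        return urls;
    }
}
```

For the root https://example.com and the default paths, this produces https://example.com/sitemap.xml and https://example.com/sitemap_index.xml.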
Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the URL root they represent (sitemap paths or sitemaps defined in robots.txt won't apply).
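The precedence described above can be sketched as a small selection function. This is an interpretation of the description, not the class's actual logic: the names are hypothetical, and the order in which robots.txt and path-based sitemaps are combined is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper only; not part of the Norconex API. */
final class SitemapPrecedenceSketch {

    /**
     * Picks the sitemap locations to resolve for one URL root.
     * Start-URL sitemaps, when present, are used exclusively; otherwise
     * robots.txt sitemaps and path-based candidates are both considered.
     */
    static List<String> locationsFor(List<String> startSitemaps,
            List<String> robotsSitemaps, List<String> pathSitemaps) {
        List<String> result = new ArrayList<>();
        if (startSitemaps != null && !startSitemaps.isEmpty()) {
            result.addAll(startSitemaps);
            return result;
        }
        // Assumed order: robots.txt entries first, then path-based candidates.
        if (robotsSitemaps != null) {
            result.addAll(robotsSitemaps);
        }
        if (pathSitemaps != null) {
            result.addAll(pathSitemaps);
        }
        return result;
    }
}
```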
Sitemaps are first stored in a local temporary file before being parsed. The tempDir constructor argument is the location where those files are stored. When null, the system temporary directory is used, as returned by FileUtils.getTempDirectoryPath().
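The temporary-file behavior can be approximated with the JDK alone. The sketch below assumes the system temporary directory is the java.io.tmpdir system property (which is what Commons IO's FileUtils.getTempDirectoryPath() reports); the class name, method names and the "sitemap-" file prefix are hypothetical and not taken from this class.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Illustrative helper only; not part of the Norconex API. */
final class SitemapTempFileSketch {

    /** Returns tempDir, or the system temporary directory when tempDir is null. */
    static File effectiveTempDir(File tempDir) {
        if (tempDir != null) {
            return tempDir;
        }
        return new File(System.getProperty("java.io.tmpdir"));
    }

    /** Writes sitemap content to a new temporary file so it can be parsed from disk. */
    static File store(File tempDir, String sitemapXml) {
        try {
            File dir = effectiveTempDir(tempDir);
            Files.createDirectories(dir.toPath());
            File file = File.createTempFile("sitemap-", ".xml", dir);
            Files.write(file.toPath(), sitemapXml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would normally delete the returned file once parsing is complete.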
Modifier and Type | Field and Description |
---|---|
static String[] | DEFAULT_SITEMAP_PATHS |
Constructor and Description |
---|
StandardSitemapResolver(File tempDir, SitemapStore sitemapStore) |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
long | getFrom() |
String[] | getSitemapLocations() Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs() |
String[] | getSitemapPaths() Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
File | getTempDir() Gets the directory where temporary sitemap files are written. |
int | hashCode() |
boolean | isEscalateErrors() |
boolean | isLenient() |
void | resolveSitemaps(HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs) Resolves the sitemap instructions for a URL "root" (e.g. |
void | setEscalateErrors(boolean escalateErrors) |
void | setFrom(long from) |
void | setLenient(boolean lenient) |
void | setSitemapLocations(String... sitemapLocations) Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[]) |
void | setSitemapPaths(String... sitemapPaths) Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
void | setTempDir(File tempDir) Sets the directory where temporary sitemap files are written. |
void | stop() Stops any ongoing sitemap resolution. |
String | toString() |
public static final String[] DEFAULT_SITEMAP_PATHS
public StandardSitemapResolver(File tempDir, SitemapStore sitemapStore)
public String[] getSitemapPaths()
public void setSitemapPaths(String... sitemapPaths)
sitemapPaths - sitemap paths
public void resolveSitemaps(HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs)
Specified by: resolveSitemaps in interface ISitemapResolver
httpClient - the HTTP client to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLAdder - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
@Deprecated public String[] getSitemapLocations()
Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs()
@Deprecated public void setSitemapLocations(String... sitemapLocations)
Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[])
sitemapLocations - sitemap locations
public boolean isLenient()
public void setLenient(boolean lenient)
public long getFrom()
public void setFrom(long from)
public boolean isEscalateErrors()
public void setEscalateErrors(boolean escalateErrors)
public File getTempDir()
public void setTempDir(File tempDir)
tempDir - directory
public void stop()
Specified by: stop in interface ISitemapResolver
Copyright © 2009–2020 Norconex Inc. All rights reserved.