# Anemone
## Syntax
- Use `Anemone::Core.new(url, options)` to initialize the crawler
- Use an `on_every_page` block to run code on every page visited
- Use the `.run` method to start the crawl; no HTTP requests are made until `.run` is called
## Parameters

| Parameter | Details |
|---|---|
| url | the URL (including protocol) to be crawled |
| options | optional hash of crawler options |
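A sketch of what the options hash might look like. The keys shown (`:threads`, `:depth_limit`, `:obey_robots_txt`, `:user_agent`) are common Anemone options, but check the gem's own documentation for the authoritative list; the values here are illustrative only.

```ruby
# Illustrative options hash for Anemone::Core.new(url, options).
options = {
  threads: 4,                 # number of concurrent crawl threads
  depth_limit: 2,             # do not follow links deeper than this
  obey_robots_txt: true,      # respect the site's robots.txt
  user_agent: 'MyCrawler/1.0' # User-Agent header sent with requests
}

url = 'http://example.com'    # starting URL, including protocol
# crawler = Anemone::Core.new(url, options)
```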
## Remarks
- The crawler will only visit links on the same domain as the starting URL. This is important to know when dealing with content subdomains such as media.domain.com, since they will be ignored when crawling domain.com
- The crawler is HTTP/HTTPS aware: by default it stays on the starting URL's protocol and will not follow links on the same domain that use the other protocol
- The `page` object in the `on_every_page` block has a `.doc` method which returns the Nokogiri document for the HTML body of the page. This means you can use Nokogiri selectors inside the `on_every_page` block, such as `page.doc.css('div#id')`
- Other information to get started can be found in the Anemone project's documentation
## Basic Site Crawl

```ruby
# Collect the URL of every page visited on the site
pages = []
crawler = Anemone::Core.new(url, options)
crawler.on_every_page do |page|
  pages << page.url
end
crawler.run
```