The standard WDS crawler supports the filesystem, SharePoint, and databases.
Since HTML is one of the document types WDS can ingest, it would make sense for the standard crawler to also support websites accessible over HTTP and HTTPS.
On a recent project, the customer had documents on the filesystem, in SharePoint, and on their intranet and internet websites, all of which they wanted to use via WDS. To support this we had to write a custom crawler in Apache Nutch, which was a lot of effort for what seems like a common requirement.
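To illustrate the kind of work a custom crawler entails (this is a minimal sketch, not the actual project code), even the most basic building block — extracting links from a fetched HTML page to expand the crawl frontier — has to be written by hand. In Python it might look like this; the sample page and URLs are made up for the example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags -- the first building
    block of a web crawler's frontier expansion."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A hardcoded sample page stands in for an HTTP fetch
# (e.g. via urllib.request.urlopen) so the sketch is self-contained.
sample_html = """
<html><body>
  <a href="https://intranet.example.com/policies.html">Policies</a>
  <a href="/docs/handbook.html">Handbook</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)
```

A production crawler additionally needs politeness delays, robots.txt handling, deduplication, and re-crawl scheduling, which is why frameworks like Nutch exist and why building on one is still non-trivial effort.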
Why is it useful?
Who would benefit from this IDEA?
As a customer, I want to use WDS to query documents and HTML pages from my intranet and internet sites.
How should it work?