Nutch plugin - selective crawling based on keywords

Closed, posted Feb 1, 2014, paid on delivery

I need a Nutch plugin that enables Nutch (version 1.7) to crawl ONLY webpages that contain specific words, which the Nutch administrator can set and reset.

For example, I want to crawl only webpages that contain a set of keywords like ("job" & "apply") or ("job" & "submit"), or webpages that contain words like ("education" & "requirements" & "experience" & "benefits").
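The keyword groups above read as an OR of AND-groups: a page qualifies if it contains every word of at least one group. A minimal sketch of that rule in plain Java follows. The class name `KeywordRules` and the rule syntax (`&` inside a group, `|` between groups) are assumptions made for illustration; they are not part of Nutch, and the wiring into a Nutch 1.7 plugin extension point is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): keyword rules as an OR of AND-groups. */
final class KeywordRules {
    // Each inner set is one AND-group; a text matches if any group is fully present.
    private final List<Set<String>> groups;

    private KeywordRules(List<Set<String>> groups) {
        this.groups = groups;
    }

    /** Parses an assumed rule syntax such as "job&apply|job&submit". */
    static KeywordRules parse(String spec) {
        List<Set<String>> groups = new ArrayList<Set<String>>();
        for (String group : spec.split("\\|")) {
            Set<String> words = new HashSet<String>();
            for (String word : group.split("&")) {
                String w = word.trim().toLowerCase(Locale.ROOT);
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            if (!words.isEmpty()) {
                groups.add(words);
            }
        }
        return new KeywordRules(groups);
    }

    /** Returns true if the text contains every word of at least one group. */
    boolean matches(String text) {
        // Whole-word, case-insensitive matching: split on anything that is
        // not a letter or a digit, so "jobs" does not count as "job".
        Set<String> tokens = new HashSet<String>(
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        for (Set<String> group : groups) {
            if (tokens.containsAll(group)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, the administrator's rule would be a single string such as `job&apply|job&submit|education&requirements&experience&benefits`, which could be kept in a configuration property and re-parsed whenever it changes.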

The point is to achieve selective crawling at the source and thus significantly reduce the number of webpages that we have to crawl. Pages will be selected by comparison against these words at the source, before they are fetched, because the goal of the plugin is to significantly decrease the number of webpages that are actually crawled. (This is a MUST HAVE!!!)
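Before a page is fetched, its body text is not yet available to the crawler, so a pre-fetch decision can only use what is already known about a link. One possible approximation, sketched below, checks the link's URL and its anchor text for hint words. This is an assumption about how the requirement could be met, not a description of an existing Nutch feature; the class `PreFetchLinkFilter` and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-fetch check (not a Nutch class): decide from URL and anchor text only. */
final class PreFetchLinkFilter {
    private final Set<String> hintWords = new HashSet<String>();

    PreFetchLinkFilter(String... words) {
        for (String w : words) {
            hintWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /**
     * Returns true if any hint word appears as a whole token in the URL
     * or in the anchor text of the link that points to it.
     */
    boolean shouldFetch(String url, String anchorText) {
        String anchor = (anchorText == null) ? "" : anchorText;
        String combined = (url + " " + anchor).toLowerCase(Locale.ROOT);
        // Split on non-alphanumeric characters, so "/careers/open-positions"
        // yields the tokens "careers", "open" and "positions".
        for (String token : combined.split("[^\\p{L}\\p{N}]+")) {
            if (hintWords.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `new PreFetchLinkFilter("job", "jobs", "careers")` would accept a link to `http://example.com/careers/open-positions` and reject `http://example.com/about`. A check like this is a set lookup per token, so it adds little work per link, but it can miss relevant pages whose URL and anchor text carry none of the hint words.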

The plugin must not slow down the Nutch crawler, as speed is an important objective.

The plugin should be installed on my Nutch crawler, and it has to remain stable with a large amount of data crawled daily (more than 5,000,000 selected and crawled webpages per day).

Apache

Project ID: #5382795

About the project

3 proposals, remote project

3 freelancers are bidding on this job, with an average bid of $1162

PauloAngeloCOM

Hi, Can you send more information about the pre-requisites of your project? What is the database engine, for example? Regards.

$2777 USD in 30 days
(1 review)
1.3
solutnprovider

A proposal has not yet been provided

$155 USD in 3 days
(0 reviews)
0.0
LogicalError

Hi chrf2006, I have written a customized parser plugin for Nutch 2.2.1. I can write it for your 1.7 version as well. If you are interested, please write back (I can show you the 2.2.1 parser plugin working). Thanks.

$555 USD in 10 days
(0 reviews)
0.0