541387 Universal Website Data Scraper

In progress. Posted Dec 19, 2011. Paid on delivery.

I want a Google Chrome plugin that will let me scrape data from a user-visible website (one that is essentially a front end for a database of some sort).

If you think you can do this as a website rather than as a Chrome extension, please let me know. In that case, you'll need to load the frames client-side with JavaScript, so that my server's IP does not get blocked for overusing a website (loading hundreds of pages in a very short amount of time).

I want to highlight text items and images on one page, highlight the same items on a second page, and then have the plugin run through every item on a list, scraping the fields I chose from each one.

For example, I go to [url removed, login to view], and search for "plumbers" in "Pleasanton, CA".

I see this page come up:

[url removed, login to view]

Then, I click on the plugin, and click "Capture Data".

I am asked to highlight "first level page 2". I highlight the "2" at the bottom of the page, which (if clicked) would bring me to the next page of results.

I am asked to click on the first result in the list (and change pages), which in this example, would be "Dan the Handyman".

Once on the page, I want to be able to select sections of text and images, and create database columns based on them. For example, on the "Dan the Handyman" page, I would want to be able to click "Add Field", then select "PO Box 1651" and label it "Address Line 1".

I would do this for many items on the page, then I would click "Back to Page Level 1" on the plugin, and I would be brought back to the search results page.

I would then be asked to click on the second result in the list, which would bring me to that result page. In the example, that would be "Valley Plumbing Home Center Inc". On that page, I will be asked to click each database column created on the first page, and re-select those items from the current page.

I then click "Go!", and the plugin will load a new window and run through all result entries from all pages of the search results, and will gather all requested columns of information from each page.
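To make the "Go!" step concrete, here is a minimal sketch of the crawl loop, written in Python purely for illustration (the plugin itself would be JavaScript). Everything named here is an assumption and not part of the posting: `crawl`, the three callbacks that stand in for "read the result links", "follow the page 2 link", and "read the selected fields", and the `pause_seconds` delay that keeps the run from loading hundreds of pages in a very short time.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


def crawl(
    first_page: str,
    get_result_links: Callable[[str], Iterable[str]],
    get_next_page: Callable[[str], Optional[str]],
    extract_fields: Callable[[str], Dict[str, str]],
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Visit every result on every results page and collect one row per result.

    All callbacks are hypothetical placeholders for whatever the plugin
    learned during the "Capture Data" walkthrough.
    """
    rows: List[Dict[str, str]] = []
    seen_pages = set()
    page: Optional[str] = first_page
    # Stop when there is no next page, or when pagination loops back on itself.
    while page is not None and page not in seen_pages:
        seen_pages.add(page)
        for link in get_result_links(page):
            rows.append(extract_fields(link))
            sleep(pause_seconds)  # throttle so the site is not hammered
        page = get_next_page(page)
    return rows
```

Passing `sleep` in as a parameter is only a convenience for trying the loop without real delays; a real run would leave the default.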

I believe that this could be done using regular expressions and div selectors; however, it will vary from site to site.
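As a rough illustration of how two selections of the same field (one on each sample page) might be turned into a reusable selector, here is a small Python sketch. It assumes, hypothetically, that each selection has already been recorded as a CSS-like path such as `div#main > div.listing > span.addr`; the path format, the function names, and the sample paths in the comments are all made up for the example. It keeps only what the two paths share and gives up if they differ in depth or tag.

```python
import re
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str], List[str]]  # (tag, id, classes)


def parse_step(step: str) -> Optional[Step]:
    """Split one path step like 'div#main.box' into tag, id, and classes."""
    tag = re.match(r"[a-z][a-z0-9]*", step)
    if tag is None:
        return None
    ids = re.findall(r"#([\w-]+)", step)
    classes = re.findall(r"\.([\w-]+)", step)
    return tag.group(0), (ids[0] if ids else None), classes


def generalize(path_a: str, path_b: str) -> Optional[str]:
    """Return a selector matching both sample paths, or None if they disagree."""
    steps_a = path_a.split(" > ")
    steps_b = path_b.split(" > ")
    if len(steps_a) != len(steps_b):
        return None  # different nesting depth: too different to merge
    merged = []
    for raw_a, raw_b in zip(steps_a, steps_b):
        a, b = parse_step(raw_a), parse_step(raw_b)
        if a is None or b is None or a[0] != b[0]:
            return None  # unparseable step or different tag
        selector = a[0]
        if a[1] is not None and a[1] == b[1]:
            selector += "#" + a[1]  # keep the id only if both pages share it
        # keep classes present on both pages, dropping per-record ones
        selector += "".join("." + c for c in a[2] if c in b[2])
        merged.append(selector)
    return " > ".join(merged)


# Example with made-up paths: the per-record class is dropped.
# generalize("div#main > div.listing.id-101 > span.addr",
#            "div#main > div.listing.id-202 > span.addr")
# gives "div#main > div.listing > span.addr"
```

When this returns `None`, the site probably needs a hand-written rule, which fits the expectation that the approach will vary from site to site.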

I expect to be able to use this, at a minimum, on [url removed, login to view] (business search), [url removed, login to view] (people search), [url removed, login to view] (item search), [url removed, login to view] (item search), [url removed, login to view] (professor search), and [url removed, login to view] (people search).

The algorithm will likely need to be modified based on the formatting of the content on each site. My goal is to be able to slowly add scraping capability for any major web-visible database. Yes, that means that you'll have more work once the first version of this is complete.

Results, after scraping, should be exportable in CSV, SQL, or XLS format. An error log should also be presented to the user. If a piece of information is unavailable from one of the pages, just leave the field blank in the database.
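A minimal sketch of the CSV case, in Python for illustration only: missing fields are written as blanks, and each gap is also recorded in an error log. The function name `export_csv` and the log message format are assumptions made for the example; SQL and XLS export are not shown.

```python
import csv
import io
from typing import Dict, List, Tuple


def export_csv(columns: List[str], records: List[Dict[str, str]]) -> Tuple[str, List[str]]:
    """Return (csv_text, error_log). Missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    error_log: List[str] = []
    for number, record in enumerate(records, start=1):
        row = {}
        for column in columns:
            value = record.get(column) or ""
            if not value:
                # Leave the cell blank, but tell the user about it.
                error_log.append(f"record {number}: missing '{column}'")
            row[column] = value
        writer.writerow(row)
    return buffer.getvalue(), error_log
```

For instance, if the "Dan the Handyman" record has "Address Line 1" set to "PO Box 1651" and a second record (hypothetically) has no address, the second row ends with an empty cell and the log gets one entry for record 2.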

The user interface should be very simple. I'll leave it up to you to determine what it will look like. Remember though: simplicity is key. This is supposed to be used by people who have no idea how to code anything (mainly researchers who need to scrape data from national databases).

If you are interested, please provide an example of some scraping work you have done before. I will not choose a bidder who has not sent an example of previous work.

My budget is very limited up front; however, as this product is developed further, I will be able to pay significant amounts. I have interest from a number of graduate- and PhD-level researchers, and all of them have research budgets in the $XX,XXX range. As soon as we have a working V1, I'll be able to get investment to continue development.

Thank you for looking - feel free to ask any questions you'd like.

Data Entry, Database Administration, eBay, HTML, JavaScript, Odd Jobs, Research, SEO, SQL, Web Scraping, Website Management

Project ID: #2287328

About the project

Remote project