The task is simple: install and set up Scrapy and the required supporting libraries (Scrapyd etc.), then set it to crawl the internet continuously, cleaning each page (stripping out code and markup) and storing the text in database rows grouped by the day it was scraped.
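For reference, a minimal sketch of the crawl-and-clean step, assuming Scrapy with its bundled w3lib helpers for stripping tags; the spider name, seed URLs, and item fields are placeholders to be swapped for the real top-one-hundred list.

    import scrapy
    from w3lib.html import remove_tags, remove_tags_with_content

    class GenericSpider(scrapy.Spider):
        name = "generic"
        # Placeholder seeds; swap in the top-one-hundred site list.
        start_urls = ["https://example.com/", "https://example.org/"]

        def parse(self, response):
            # Skip non-HTML responses (images, PDFs, etc.).
            if b"text/html" not in response.headers.get(b"Content-Type", b""):
                return
            # Strip scripts/styles and then every remaining tag, leaving plain text.
            body = remove_tags_with_content(
                response.text, which_ones=("script", "style", "noscript"))
            text = " ".join(remove_tags(body).split())
            yield {"url": response.url, "text": text}
            # Follow every link so the crawl keeps widening and never stops on its own.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)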
On reboot, clear any backlog of queued scrape cron jobs and set Scrapy scraping again.
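One possible way to handle the reboot step, assuming the crawl is relaunched from a cron "@reboot" entry that runs a small bootstrap script; the process pattern, project path, and spider name below are assumptions.

    #!/usr/bin/env python3
    # Bootstrap run at reboot (e.g. from a crontab "@reboot" line):
    # kill any stale scrape jobs left from before the reboot, then restart the crawl.
    import subprocess

    # Assumption: leftover crawls show up as "scrapy crawl ..." processes.
    subprocess.run(["pkill", "-f", "scrapy crawl"], check=False)

    # Relaunch the generic spider (spider name and project path are placeholders).
    subprocess.run(["scrapy", "crawl", "generic"], cwd="/opt/crawler", check=False)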
Start with a top-one-hundred website list and then go from there.
Simple task, should be quick to set up.
The Scrapy process is broken down into two parts.
One - Generic crawl: scrape, clean, and store the contents in the database (in their columns), in rows per day (see the pipeline sketch after this list).
- New table every 24 hours.
- Content is stored timestamped.
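A sketch of that storage step as a Scrapy item pipeline; SQLite is used here only to keep the example self-contained, and the table-naming scheme and column names are assumptions.

    import sqlite3
    from datetime import datetime, timezone

    class DailyTablePipeline:
        """Store each scraped item in a table named for the current day, timestamped."""

        def open_spider(self, spider):
            self.conn = sqlite3.connect("scraped.db")  # swap for the real database

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            now = datetime.now(timezone.utc)
            table = "scrape_" + now.strftime("%Y_%m_%d")  # new table per 24 hours
            self.conn.execute(
                f'CREATE TABLE IF NOT EXISTS "{table}" '
                "(url TEXT, content TEXT, scraped_at TEXT)"
            )
            self.conn.execute(
                f'INSERT INTO "{table}" (url, content, scraped_at) VALUES (?, ?, ?)',
                (item["url"], item["text"], now.isoformat()),
            )
            self.conn.commit()
            return item

It would be enabled through Scrapy's standard ITEM_PIPELINES setting; the module path depends on how the project is laid out.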
Two - Strategic crawl for keywords (a script will need to be created, or something sourced from GitHub): crawl, scrape, clean, and store in the database with the keyword in the table name; all content scraped and cleaned afterwards for the same keyword goes into the same table, again timestamped.
Keywords will come dynamically from our own tables (so these will need to feed in); the script will have to keep pulling in newly entered keywords and run autonomously forever. A sketch of this keyword-driven part follows.
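A sketch of how the keyword side could work, assuming keywords sit in a local table that is polled between runs; the table name, schema, seed URLs, and the idea of keeping only pages that mention the keyword are all assumptions.

    import sqlite3
    import scrapy
    from w3lib.html import remove_tags, remove_tags_with_content

    def pending_keywords():
        # Assumption: keywords live in a "keywords" table in a local SQLite file;
        # adapt the connection and query to the real keyword tables.
        conn = sqlite3.connect("keywords.db")
        rows = conn.execute("SELECT term FROM keywords").fetchall()
        conn.close()
        return [row[0] for row in rows]

    class KeywordSpider(scrapy.Spider):
        name = "keyword"
        # Placeholder seeds; in practice these could come from a search step.
        start_urls = ["https://example.com/"]

        def __init__(self, keyword="", *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.keyword = keyword

        def parse(self, response):
            if b"text/html" not in response.headers.get(b"Content-Type", b""):
                return
            cleaned = remove_tags(remove_tags_with_content(
                response.text, which_ones=("script", "style", "noscript")))
            # Keep only pages that actually mention the keyword.
            if self.keyword.lower() in cleaned.lower():
                yield {"keyword": self.keyword, "url": response.url,
                       "text": " ".join(cleaned.split())}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Each keyword would be launched as its own run (e.g. scrapy crawl keyword -a keyword=example), rescheduled by cron or Scrapyd so that newly entered keywords keep feeding in; storage can reuse the pipeline pattern above with the keyword in the table name instead of the date.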