Create a multi-threaded web tool to scrape via proxies + save visual/text output
$750-1500 USD
Closed
Posted about 4 years ago
Paid on delivery
We want to bulk check thousands of domain names per day in Wayback Machine to look for irregularities and spam factors. ([login to view URL])
** P.S - we've included a very detailed request of what we want to do, but the project is relatively simple scraping and output. So don't be put off :) **
We want to create a web script that can connect and scrape the content + take screenshots of the pages in bulk, via hundreds of proxies.
On average, we will want to check 10,000 domains per day, with an average of 20-50 captures per domain. Wayback Machine has a rate limit of roughly 10-15 requests per minute per IP.
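As a rough capacity check, the throughput and rate-limit figures above imply a minimum proxy count. This is our own back-of-the-envelope arithmetic using the midpoints of the stated ranges; the constants are illustrative, not part of the brief:

```python
# Back-of-the-envelope proxy capacity estimate (illustrative midpoints
# of the figures in the brief; adjust to the real workload).
DOMAINS_PER_DAY = 10_000
CAPTURES_PER_DOMAIN = 35          # midpoint of the 20-50 range
REQS_PER_MIN_PER_IP = 12          # midpoint of Wayback's 10-15/min limit

requests_per_day = DOMAINS_PER_DAY * CAPTURES_PER_DOMAIN          # 350,000
capacity_per_ip_per_day = REQS_PER_MIN_PER_IP * 60 * 24           # 17,280
proxies_needed = -(-requests_per_day // capacity_per_ip_per_day)  # ceiling division

print(proxies_needed)  # about 21 proxies at full 24h utilisation
```

In practice you would want a healthy margin above this floor, since Wayback's slow page loads mean each proxy rarely sustains its full per-minute quota.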
The strategy behind this project is to check for spam/adult content on the websites across the history of the domain.
We will have a list of blacklist keywords we will use to mark the domains as good/bad. If a blacklisted word appears on the page in wayback history, we will have the tool mark the domain as bad.
This tool will be used to remove the bulk of the spam and cut down 10,000 domains to 500-1000 that we can manually analyze.
We will want to extract the text output of each capture page so we can check for blacklisted keywords, and also a screenshot of the page once it’s fully loaded.
Wayback is quite slow and takes a while to load, so we’ll need to factor that in as well. We will ensure we have fast dedicated proxies.
Layout Interface
We will want to utilize a web-based interface where we can have each project and individual domains along with their associated data.
We can use a web template like this: [login to view URL]
We currently have a tool that does a % of what we want to build here, but relies on third party solutions. This is the current layout > [login to view URL]
As you can see, we can create a project, and paste the domains in that we want to analyze, then we click inside a project and load each domain individually.
We will want to show each capture as a screenshot like this > [login to view URL]
But with each screenshot, we want to also include the following.
A link direct to the page URL on wayback machine (as pictured)
A list of what blacklisted keywords have appeared on the page.
Essentially we want to see a screenshot of the entire page plus a list of the keywords that appear on it, so we can click through to a larger image of the screenshot and check that the blacklisted keyword matches were not mistakes.
This process will not be bulletproof, and the tool will show false positives, which is to be expected. For example, we will want to filter any adult content and will have a list of adult keywords in the blacklist, but a political or legal blog may mention certain words that trigger a red flag. We will be able to manually vet that such a domain is fine and not spammy.
Currently we use [login to view URL], which gives us screenshots but does not deliver a full-length website screenshot, which is what we need. So you will need to locate and implement a full-page screenshot solution.
It will be fine to show a square screenshot of the top of the page in the interface, to make it all look uniform, but clicking the screenshot should load a full-screen scrollable version. We can discuss the specifics of this later.
In the settings page we will want to be able to load in proxies, as well as an updateable blacklist keyword list.
So in summary, it’s a fairly simple process.
Import a list of domains
Process them all via wayback machine, scraping 1 capture per month per domain
Pull the plain text (not HTML) + a screenshot of each capture
Output the screenshot + link to direct page + list which blacklisted keywords show up
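The "one capture per month" step maps naturally onto the Wayback Machine CDX API: collapsing on the first six timestamp digits (YYYYMM) returns at most one capture per month. A sketch of building the query and the direct capture links for the UI; no network call is made here, and the exact filter choices are our assumptions:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def monthly_captures_query(domain: str) -> str:
    """CDX query URL listing at most one capture per month for a domain.

    collapse=timestamp:6 collapses rows that share the same YYYYMM
    timestamp prefix; filtering on status 200 skips redirects and errors.
    """
    params = {
        "url": domain,
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "timestamp:6",
        "filter": "statuscode:200",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def capture_page_url(timestamp: str, original: str) -> str:
    """Direct link to a capture on Wayback Machine, as shown in the interface."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

Each row returned by the CDX query feeds one scrape/screenshot task, and `capture_page_url` gives the "link direct to the page URL on wayback machine" requested above.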
We need this to be reliable, robust tech that performs well as a multi-threaded process at speed.
We will provide whatever server/tech/hardware is necessary; speed is the highest priority here.
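The multi-threaded process could be structured as a worker pool drawing from a shared, rotating proxy pool. A minimal sketch, where `process_capture` is a hypothetical placeholder for the real fetch-text-and-screenshot step:

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

class ProxyPool:
    """Round-robin proxy rotation, safe to share across worker threads."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)

def process_capture(capture_url: str, pool: ProxyPool) -> tuple[str, str]:
    # Placeholder for the real work: fetch the page text and take a
    # full-length screenshot through the assigned proxy.
    proxy = pool.next()
    return capture_url, proxy

def run(capture_urls, proxies, workers=20):
    pool = ProxyPool(proxies)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda u: process_capture(u, pool), capture_urls))
```

A real implementation would also need a per-proxy rate limiter to stay under Wayback's 10-15 requests/minute ceiling; round-robin rotation alone only spreads load evenly.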
Hi there,
I am a scraping expert and have completed more than 500 scraping projects; please check my feedback and you will see.
Can we discuss the project details further? Then I will provide example data and a script for you.
Thanks,
Lin
Hi there, I am a web scraping expert from Bosnia & Herzegovina, Europe.
I have carefully gone through your requirements and I would like to help you with this project! I can start immediately and finish within the agreed deadline.
Check out my profile and former clients' feedback - that will tell you everything about me.
Please feel free to initiate the chat so that we can discuss further details.
Thank you for taking the time to read my proposal. I look forward to hearing from you.
Best regards,
Miljan
I am a professional web data scraper specializing in Python, PHP, .NET, crawlers, and bots.
My tool can search any domain and extract data from A to Z against an existing list of English words.
I have over 10 years of experience in data mining, web scraping, scraping bots, and Chrome/Opera extensions; I have done it all. Tell us your source and we will put it in Excel for you, or we can give you filtered results per your requirements, in the format you want. You can also ask for data in a particular format: Excel, JSON, MySQL databases, XML, you name it. We can also help with integrating it into your databases and creating JSON outputs. We are not only good with scraping but also with the tools you may need afterward; we can help you build software around the data.
We have 99% data accuracy.
We have a duplicate finder, etc.
We can help with statistics on the data.
We can help with creating APIs from the data.
We can create software to manage the data.
We can build sites around the data.
Hi,
I am an expert in web scraping.
Drawing on my past experience, I can create a multi-threaded web tool to scrape via proxies and save visual/text output.
I have worked on projects like yours.
I will finish more quickly than others; it will take me less than 4 days.
I await your reply and can start work now.
Hello,
I'm a Senior Python Developer with strong expertise in web scraping and data mining.
I'm interested to discuss the project in detail.
Regards,
Alex.
Dear Sir, I am currently working on such a project, and it is almost finished. Allow me to give you some ideas.
I am an electrical engineer with 20 years of experience in this area.
I am going through your requirements and will have some questions.
I have gone through your requirement to scrape lots of websites. I am an expert in building scraping tools and scripts, so I can surely work on your project. I have 4 years of experience developing PHP and Python (Scrapy, Selenium) web scrapers as well as Windows-based web scraping software, with which I have crawled many sites such as Craigslist, Amazon, Yelp, and others. I have also worked on complex sites to bypass CAPTCHAs using proxy IP rotation techniques. Let's work together :) Have a great day! I am glad to see your work history and positive reviews from other freelancers. I am really excited to work with you and would love to have a long-term business association for any of your data-related tasks.
Hey there,
Hope you are doing fine. I am a full-stack web developer with more than 7 years of experience. I can create a multi-threaded web tool that scrapes via proxies and saves visual and text output; the script will work via hundreds of proxies. I have checked all your details and can do this job perfectly for you, but I need more information about it.
Let's discuss it over chat in more detail.
Waiting for your response.
Hi,
I can build a web application based on PHP 7, which will do the whole task. But we have to discuss some details like proxy list, url list, keyword list, etc...
Hi there,
We reviewed your requirements with our full-stack developer team, skilled in PHP, JavaScript, and Python. We would like to work on the web scraping as per your needs and are willing to discuss the website requirements in detail.
We are a reputable IT company with experienced professionals and have been web scraping for 7+ years.
Warm Regards,
Naveen
Hello
I have 5+ years of experience in scraping with Python, and I am very interested in your project.
I am always focused on producing well-structured, easy-to-read projects based on efficient and clean code.
I can start right now and will complete the project perfectly, as you wish, by your deadline.
I'd like to discuss further details via chat
Thanks!
Hi,
I'm an expert in highly responsive websites with optimal web technologies; please check my feedback and you will see.
I will build this project as a C# desktop application (not Python) with buttons, a progress bar, and multithreading; this is a demo version to show you the performance and speed of the bot.
I'm at your disposal for any further information.
Waiting for your response
Hi
I can develop it.
I have 10 years of web scraper developing experience.
I will develop it with python.
I use proxy servers to avoid firewalls.
I would like to discuss the details with you via chat.
Regards
Valery
Dear Emp,
Thank you for providing detailed information on the requirement. I would like to build the web tool to scrape, extract details, and filter per your requirements, with fast multitasking and, ideally, 100% accuracy.
Thank you for giving me the chance to bid on this project.
With Regards,
Anuj
Hi,
I've read your job description very carefully.
It's my pleasure to associate with you on this project.
I have rich experience with Python scraping.
I will use proxies and multithreading in Python to capture many Wayback screenshots concurrently.
If you supply me with reliable proxies, I can complete your job perfectly.
My scraping speed will not waste your valuable time.
Kindly send me a message to discuss further details.
Best Regards.
Rory F.
Hey, team!
I have built similar solutions leveraging AWS lambdas to reach "infinite" scaling - this will allow for the FASTEST scraping and allow for near-real-time analysis to be implemented.
Basically, each Lambda runs for about 10 seconds and launches a custom headless browser that does the full-length screenshot capture and text scraping. Screenshots will be saved in S3 (Amazon's file storage), and text can be saved in text files in S3 or sent to a full-text search engine.
I can also proxy those requests through a web proxy using http/s and basic authentication.
I can build the full deployment on AWS as well as the web interface that will execute all operations.
All you need is to set up an AWS account and provide a proxy list. The operational cost will be minimal and strictly calculated, so the cost of each scrape will be fixed.