I am looking for web-scraping code written in Python (Windows 10 PC) to pull results from Google Scholar, scrape data from those results, and put the data into a .csv file readable by Excel. The data should also be written to a local database running on Windows 10 (MySQL or MongoDB).
I need the following:
1. A simple GUI that allows a "topic" to be entered, which is then used to scrape Google Scholar Case Law, along with the start and end years. The GUI should have a "Submit" button.
2. Ability to automatically (internally) create the correct search URL based on Federal or State case selection, the topic keywords, and the start year. It then "steps" through each year one by one until it reaches the end year.
The URLs always follow a standard format, so this should not be difficult to implement.
For example, if the topic is "Trademark" and the year range is 2012 - 2015, the program creates a URL for a Trademark search covering 2012-2012 (a single-year sub-range keeps the data set manageable) and performs the steps below. Once done, it steps to Trademark for 2013-2013 and repeats, advancing one year at a time until it reaches 2015-2015. This prevents far too many search results from showing up at once and becoming unmanageable.
In the above example, four separate URLs would be created (one per year sub-range), each run separately.
3. Navigate to each URL (no visual display required for this)
4. Grab the entire list of sub-URLs from the search results, navigate to each one individually, and scrape and save the following data in its own auto-named CSV file. Each sub-URL then has its own filename and data.
5. Ability to turn off saving .CSV files (mainly needed for testing/debugging, to make it easier to verify the program is working properly).
SAVE:
a) Exact URL of the case
b) Header info - contains the name of the case, court name, district, and year
c) *** If the page at the URL contains the phrase "NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS", this must be reflected in the CSV naming convention by appending _NOT to the end of the filename.
d) Every sub-URL will have a section called "DISCUSSION." Within this section we need to search for and save the following:
Sub-titles inside the DISCUSSION section (write each sub-title to the file along with its associated text)
aa) Sentences within a sub-title section that are followed by citations (text inside parentheses) - save the preceding sentence and the citation
bb) Any text within a sub-title section that is inside double quotes, including the citation afterwards - save the entire quoted text and the citation.
Continue the above until the end of the page, then get the next URL and repeat. Then step forward one year (if the date range allows) and repeat until all results are parsed.
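For item 1, a minimal tkinter sketch of the GUI: a topic field, start/end year fields, and a Submit button. The field layout and the print placeholder inside on_submit are illustrative only; validate_years is a hypothetical helper that sanity-checks the year range before any scraping starts.

```python
import tkinter as tk

def validate_years(start_text, end_text):
    """Parse and sanity-check the year fields; raise ValueError if bad."""
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError("start year must not be after end year")
    return start, end

def main():
    root = tk.Tk()
    root.title("Scholar Case Law Scraper")

    tk.Label(root, text="Topic").grid(row=0, column=0)
    topic = tk.Entry(root)
    topic.grid(row=0, column=1)

    tk.Label(root, text="Start year").grid(row=1, column=0)
    start = tk.Entry(root)
    start.grid(row=1, column=1)

    tk.Label(root, text="End year").grid(row=2, column=0)
    end = tk.Entry(root)
    end.grid(row=2, column=1)

    def on_submit():
        years = validate_years(start.get(), end.get())
        # hand off to the scraper here (placeholder)
        print("scraping", topic.get(), years)

    tk.Button(root, text="Submit", command=on_submit).grid(row=3, column=1)
    root.mainloop()
```

Calling main() opens the window; the scraper itself would be wired into on_submit.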
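Step 2 can be sketched as a small URL generator. This is a sketch, not a final implementation: as_ylo/as_yhi are Google Scholar's year-range parameters, but the exact as_sdt court codes (federal vs. state case law) are an assumption here and should be confirmed against real Scholar URLs before use.

```python
from urllib.parse import urlencode

BASE = "https://scholar.google.com/scholar"

def build_urls(topic, start_year, end_year, court_code="4"):
    """Yield one case-law search URL per single-year sub-range.

    court_code is passed as Scholar's as_sdt parameter; the value "4"
    (and any federal/state variants) is an assumption to be verified
    against real Scholar case-law URLs.
    """
    for year in range(start_year, end_year + 1):
        params = {
            "q": topic,
            "as_sdt": court_code,  # court selection (assumed code)
            "as_ylo": year,        # sub-range start: this single year
            "as_yhi": year,        # sub-range end: same year
        }
        yield f"{BASE}?{urlencode(params)}"
```

For "Trademark" with the range 2012 - 2015 this yields four URLs, one per year, matching the example above.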
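Items aa) and bb) can be approximated with regular expressions over the DISCUSSION text. This is a rough sketch only: real opinions contain nested parentheses, curly quotes, and abbreviations, so these patterns would need hardening against actual Scholar HTML.

```python
import re

def extract_citation_sentences(text):
    """aa) Find sentences immediately followed by a parenthesised citation."""
    # sentence = capitalised run up to a period, then whitespace, then (citation)
    pattern = re.compile(r'([A-Z][^.()]*\.)\s*\(([^)]+)\)')
    return pattern.findall(text)

def extract_quotes(text):
    """bb) Find double-quoted passages, capturing a trailing citation if present."""
    pattern = re.compile(r'"([^"]+)"\s*(?:\(([^)]+)\))?')
    return pattern.findall(text)
```

Both helpers return (text, citation) tuples, ready to be written out as CSV rows under the current sub-title.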
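For item c), a small helper can derive the CSV filename from the case name and year, appending _NOT when the page carries the unpublished-opinion notice. The sanitisation rule here is an assumption; any scheme that yields unique, Windows-safe filenames would do.

```python
import re

UNPUBLISHED_PHRASE = "NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS"

def csv_filename(case_name, year, page_text):
    """Build a Windows-safe CSV filename; append _NOT for unpublished opinions."""
    # collapse anything that is not a word character or hyphen into "_"
    safe = re.sub(r"[^\w-]+", "_", case_name).strip("_")
    name = f"{safe}_{year}"
    if UNPUBLISHED_PHRASE in page_text:
        name += "_NOT"
    return name + ".csv"
```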
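The database write (MySQL or MongoDB, per the brief) follows the same shape regardless of engine. The sketch below uses sqlite3 purely to illustrate a possible schema and insert flow without needing a server; swapping in mysql-connector-python or pymongo changes only the connection setup and placeholder syntax. Table and column names are assumptions, not part of the brief.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    id       INTEGER PRIMARY KEY,
    url      TEXT NOT NULL,        -- exact URL of the case
    header   TEXT,                 -- case name, court, district, year
    subtitle TEXT,                 -- DISCUSSION sub-title
    sentence TEXT,                 -- cited or quoted sentence
    citation TEXT                  -- parenthesised citation, if any
)
"""

def save_rows(conn, rows):
    """Insert scraped (url, header, subtitle, sentence, citation) tuples."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO cases (url, header, subtitle, sentence, citation) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```

With MySQL the placeholders become %s and the connection comes from mysql.connector.connect(); the rest is unchanged.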
I have attached a scraped HTML image of a Google Scholar article for reference.
Hi sir,
I am a scraping expert; I have completed more than 350 scraping projects. Please check my feedback and you will see.
Can we discuss the details of this project? I will then provide example data and a script for you.
Thanks,
Kimi
I wrote a Google Scholar bot more than a year ago in C#. It should be easy for me to rewrite that code in Python.
Relevant Skills and Experience
Already wrote a Google Scholar bot
Proposed Milestones
$250 USD - Milestone
Hello sir! I've just seen your job offer, and thanks to my skills in web design/development I'm confident I can get it done with the best quality and price!
Relevant Skills and Experience
Web design/development
Proposed Milestones
$111 USD - default
I am highly interested in working on your project. I have excellent experience in web scraping, research, data mining, and extracting email addresses and other contact information for businesses.
Relevant Skills and Experience
Data scraping
Proposed Milestones
$138 USD - Google Scholar data scrape