Commercial Real Estate Web Scraping

Team 27

CS426 (Senior Projects)
University of Nevada, Reno
Department of Computer Science and Engineering
Spring 2018

Instructors:
Devrin Lee
Sergiu Dascalu

External Advisor:
Ben Lucchesi



About


The commercial real estate web scraper is being developed for our sponsor, Capstak. It is an autonomous scraper that scrapes various commercial real estate sites and intelligently consolidates the scraped data into a database for Capstak's use. The scraper benefits not only Capstak but also Capstak's users, since it helps Capstak provide recent and relevant real estate data to the end user.

The scraper is designed to run in AWS, with its scripts written primarily in Groovy. The scripts intelligently handle the data pulled in from the broker sites by matching it against previously pulled data and, where needed, updating the stored records to reflect the most recent changes. The data will be stored in a MySQL database, where the user can easily access and manipulate it. A front-end GUI will be developed to give the user an easy way to operate the web scraper.
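The match-and-update step described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the project's Groovy, and it is not the team's actual implementation. Several things here are assumptions made for illustration only: the field names (address, broker, price), the rule that a listing is identified by its normalized address and broker, and the use of an in-memory dictionary in place of the MySQL database.

```python
from typing import Any, Dict, List, Tuple

Listing = Dict[str, Any]
Key = Tuple[str, str]


def listing_key(listing: Listing) -> Key:
    """Build a match key from address and broker.

    Hypothetical matching rule: lowercase both fields and collapse
    repeated whitespace in the address, so small formatting differences
    between scrapes still map to the same stored record.
    """
    address = " ".join(listing["address"].lower().split())
    return (address, listing["broker"].lower())


def merge_listings(stored: Dict[Key, Listing],
                   scraped: List[Listing]) -> Dict[str, int]:
    """Merge freshly scraped listings into the stored records.

    `stored` stands in for the database table and is modified in place.
    Returns how many records were inserted, updated, or left unchanged.
    """
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for listing in scraped:
        key = listing_key(listing)
        existing = stored.get(key)
        if existing is None:
            # First time this listing has been seen: store a copy.
            stored[key] = dict(listing)
            counts["inserted"] += 1
        elif any(existing.get(field) != value
                 for field, value in listing.items()):
            # At least one scraped field differs: keep the newest values.
            existing.update(listing)
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    return counts
```

Calling merge_listings once per scrape run keeps a single record per listing while refreshing any fields that changed on the broker site. With a real database, the same idea would typically be expressed as an insert-or-update keyed on a unique index.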

The scraper is currently under development by Alex Sanchez and Austin Ogilvie, with advisement from Ben Lucchesi. The project will be completed in May 2018.




Team Members

Alex Sanchez
Alex is a senior in the CSE program of the University of Nevada, Reno.

Austin Ogilvie
Austin is also a senior in the CSE program of the University of Nevada, Reno.

Resources
Groovy Programming Language:
https://www.tutorialspoint.com/groovy/
This gave the team a great overall introduction to a new language. Luckily, Groovy is not drastically different from the languages the team has used at UNR or in their non-academic experience.

Web Scraping in Python using Scrapy:
https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
This article is an in-depth look at different tools that can be used with Python. It is clear and concise and provides some good tips for web scraping. Though its examples use Python, it parallels what we are looking to accomplish and is a great resource for learning how to create a foolproof design that meets Capstak's needs.

Is Web Scraping Illegal? Depends on What the Meaning of the Word Is Is.
https://resources.distilnetworks.com/all-blog-posts/is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is
An article that gave great insight to the team in keeping the scraper legal/ethical.

Exploiting web scraping in a collaborative filtering-based approach to web advertising
by Eloisa Vargiu, Mirko Urru
This paper was helpful and interesting because it details the techniques its authors used to scrape data, which can then be used to suggest ads that would benefit someone's website. While this business practice differs from the one our product supports, the two are similar in that a company uses data to help its clients make more informed choices.

Webbots, Spiders, and Screen Scrapers, 2nd Edition
by Michael Schrenk
This book was useful to us and our project because it is loaded with information about different tools, and ways to use them, for scraping information off of the Internet. The new edition was published a few years ago but still has plenty of relevant information on creating fault-tolerant, autonomous scrapers.

Technologies Used