Airbnb Scraper

This project is designed to scrape Airbnb listings data. The main entry point for this project is main.py.

Prerequisites

  • Python 3.8 or higher
    python3 -m textblob.download_corpora
    python3 -m spacy download en_core_web_sm
    
  • Git
  • A virtual environment tool (e.g., venv)
  • Install Homebrew
  • Install the following:
    brew install enchant
    brew install openssl
    brew install ca-certificates
    

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/cpeters008/scrape_airbnb.git
    cd scrape_airbnb

  2. Create a virtual environment:

    python3 -m venv venv

  3. Activate the virtual environment:

  • On macOS and Linux:

    source venv/bin/activate
    
  • On Windows:

    .\venv\Scripts\activate
    
  4. Install the required packages:

    ./install_requirements.sh

If you encounter permission issues, try running chmod +x install_requirements.sh and then run the script again.

Usage

To run the Airbnb scraper, use the following command:

python3 main.py

You can customize the scraper's behavior by modifying the parameters in main.py or the configuration settings in config.ini.
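
As a rough illustration of how settings in an INI file can be read, here is a minimal sketch using Python's standard configparser module. The `[scraper]` section and the `max_conversations` and `output_format` keys are hypothetical and chosen only for this example; check config.ini in the repository for the actual section and key names.

```python
import configparser

# Hypothetical config.ini contents; the real file's sections and keys may differ.
SAMPLE_CONFIG = """
[scraper]
max_conversations = 50
output_format = csv
"""


def load_settings(text: str) -> dict:
    """Parse INI text and return the (hypothetical) scraper settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["scraper"]
    return {
        # Fallbacks apply only when a key is missing from the section.
        "max_conversations": section.getint("max_conversations", fallback=100),
        "output_format": section.get("output_format", fallback="csv"),
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings)
```

To read from a file on disk instead of a string, `parser.read("config.ini")` can replace the `read_string` call.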

Testing

To run the unit tests for the Airbnb scraper, navigate to the test directory

cd test

and run the following command:

python3 -m unittest discover

All tests should pass.

Outputs

This script can produce two outputs:

  1. A CSV file that contains the conversationId, sender, content, and timestamp of each message in all scraped conversations.
  2. A JSON file that holds an array of JSON objects, each containing only the sender and content. It is formatted for use in fine-tuning an OpenAI model.
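
The shape of the two outputs described above can be sketched as follows. This is an illustration only, not the project's actual writer code: the sample messages are made up, the helper names `to_csv` and `to_json` are hypothetical, and the exact JSON layout the project emits may differ beyond the sender and content fields.

```python
import csv
import io
import json

# Made-up sample messages; the field names follow the CSV description above.
messages = [
    {"conversationId": "c1", "sender": "guest",
     "content": "Is early check-in possible?", "timestamp": "2024-01-01T10:00:00"},
    {"conversationId": "c1", "sender": "host",
     "content": "Yes, from 1 pm.", "timestamp": "2024-01-01T10:05:00"},
]

CSV_FIELDS = ["conversationId", "sender", "content", "timestamp"]


def to_csv(rows):
    """Render all four fields of every message as CSV text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Render an array of objects holding only sender and content (hypothetical helper)."""
    slim = [{"sender": r["sender"], "content": r["content"]} for r in rows]
    return json.dumps(slim, indent=2)


print(to_csv(messages))
print(to_json(messages))
```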

The content is always scrubbed to remove PII, and the script attempts to fix typos.
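
To give a feel for what PII removal can look like, here is a deliberately simple regex-based sketch that masks email addresses and phone-like numbers. It is not the project's implementation (the prerequisites suggest the project relies on spaCy, TextBlob, and enchant), the `scrub_pii` name and the placeholder tokens are hypothetical, and typo correction is not covered here.

```python
import re

# Simplified patterns for illustration; real-world PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(scrub_pii("Email me at jane@example.com or call 555-123-4567."))
```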

Troubleshooting

If you encounter any issues, please refer to the error messages for guidance or consult the project documentation. If you need further assistance, feel free to open an issue on the GitHub repository.