A site crawler built with Scrapy that stores the generated data in MongoDB.

samparsky/web-crawler


Simple Web Crawler


This is a simple web crawler that crawls links and parses the result pages. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the application/ld+json content of pages to retrieve basic content, and XPath to query the rest of the page content.
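As a rough sketch of what the JSON-LD parsing step amounts to: the project uses Extruct for this, but the stdlib approximation below (with a made-up HTML snippet) shows the idea of pulling `application/ld+json` blocks out of a page and decoding them.

```python
# Sketch: what parsing application/ld+json amounts to. The project uses
# Extruct for this robustly; this stdlib version only illustrates the idea.
import json
import re

html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Event",
 "name": "Sample Event", "startDate": "2024-01-01"}
</script>
</head></html>
"""

# Pull out each JSON-LD block and decode it into a Python dict.
pattern = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)
events = [json.loads(m) for m in pattern.findall(html)]
print(events[0]["name"])  # Sample Event
```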

To start, install the dependencies:


pip install -r requirements.txt

To run the crawler:

 cd <directory>
 scrapy crawl wizard

MongoDB

The MongoDB collection schema is as follows:

    event_name
    description
    age_group
    location
    price
    link
    event_link
    date

The MongoDB database is `mommy` and the collection is `crawl`. To view the crawled data, run the following commands in the mongo shell:

 > use mommy
 > db.crawl.find()
