A site crawler built with Scrapy that stores the generated data in MongoDB.

samparsky/web-crawler


Simple Web Crawler


This is a simple web crawler that crawls links and parses the result pages. It is built on the [Scrapy](https://scrapy.org/) crawling engine. It uses Extruct to parse the application/ld+json content of pages to retrieve basic content, and XPath to query the rest of the page content.
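As a rough sketch of what the JSON-LD parsing step amounts to: the project uses Extruct for this, but the stdlib approximation below (with a made-up HTML snippet) shows the idea of pulling `application/ld+json` blocks out of a page and decoding them.

```python
# Sketch: what parsing application/ld+json amounts to. The project uses
# Extruct for this robustly; this stdlib version only illustrates the idea.
import json
import re

html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Event",
 "name": "Sample Event", "startDate": "2024-01-01"}
</script>
</head></html>
"""

# Pull out each JSON-LD block and decode it into a Python dict.
pattern = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)
events = [json.loads(m) for m in pattern.findall(html)]
print(events[0]["name"])  # Sample Event
```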

To start, install the dependencies:


pip install -r requirements.txt

To run the crawler:

 cd <directory>
 scrapy crawl wizard

MongoDB

The MongoDB collection schema is as follows:

    event_name
    description
    age_group
    location
    price
    link
    event_link
    date

The MongoDB database is `mommy` and the collection is `crawl`. To view the crawled data, run the following commands in the mongo shell:

 > use mommy
 > db.crawl.find()
