ImageSearch is an image scraper and search engine using Clarifai's image recognition APIs.
The crawler is set up to crawl National Geographic Photography, though the base site can be changed in imagespider/imagespider/settings.py. With this seed, National Geographic-style queries such as safari, lake, water, and mountain yield good results.
Keyword search ranks images by the relevance scores returned by the Clarifai API and returns results in descending order of relevance. Keywords are entered in the search box or passed as a query parameter called keywords to the /search endpoint.
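As a rough illustration of relevance-ordered results, here is a minimal sketch in plain Python. It assumes each stored tag is an (image URL, keyword, relevance) triple and that scores for multiple query keywords are summed. Both are assumptions for illustration, not necessarily how the app combines keywords, and `rank_images` is a hypothetical name.

```python
from collections import defaultdict


def rank_images(keyword_relevances, query_keywords):
    """Return image URLs ordered by total relevance for the query keywords.

    keyword_relevances: iterable of (image_url, keyword, relevance) tuples.
    Summing scores across keywords is an assumption made for this sketch.
    """
    wanted = {keyword.lower() for keyword in query_keywords}
    scores = defaultdict(float)
    for image_url, keyword, relevance in keyword_relevances:
        if keyword.lower() in wanted:
            scores[image_url] += relevance
    # Highest total first; ties are broken by URL so the order is stable.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Made-up sample data, not real Clarifai output.
rows = [
    ("http://example.com/a.jpg", "lake", 0.98),
    ("http://example.com/a.jpg", "mountain", 0.91),
    ("http://example.com/b.jpg", "mountain", 0.99),
    ("http://example.com/c.jpg", "safari", 0.97),
]
print(rank_images(rows, ["mountain", "lake"]))
```

With this sample data, a.jpg ranks ahead of b.jpg because it matches both keywords.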
Make sure that xcode-select update has been run.
Install Python requirements with pip install -r requirements.txt
Inside the imagespider directory, run scrapy crawl image-spider -o items.json
This will crawl National Geographic Photography and save page urls and image urls to a file called items.json.
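To inspect the crawl output before tagging, something like the following sketch may help. It assumes items.json is a JSON array of objects and that each object carries an `image_urls` list. The field name is an assumption; check your own items.json. `load_items` and `unique_image_urls` are hypothetical helpers.

```python
import json


def load_items(path="items.json"):
    """Load the JSON array written by the crawl."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unique_image_urls(items):
    """Collect image URLs in first-seen order, dropping duplicates.

    Assumes each item may have an "image_urls" list (assumed field name).
    """
    seen = set()
    ordered = []
    for item in items:
        for url in item.get("image_urls", []):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

For example, `unique_image_urls(load_items("items.json"))` run from the imagespider directory would list each scraped image URL once.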
Create a Postgres database called image_search_engine. This can be done with:
psql
create database image_search_engine;
Now run the migrations on the database. Inside the ImageSearch directory, run: python manage.py db upgrade
Next, we will pass the scraped image urls to Clarifai's API to tag and judge relevance. First, create a Clarifai account and export your tokens:
export CLARIFAI_APP_ID=<an_application_id_from_your_account>
export CLARIFAI_APP_SECRET=<an_application_secret_from_your_account>
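A script that depends on these variables can fail early with a clear message if they are missing. The sketch below shows one way to do that; `clarifai_credentials` is a hypothetical helper and is not taken from get_tags.py.

```python
import os

REQUIRED_VARS = ("CLARIFAI_APP_ID", "CLARIFAI_APP_SECRET")


def clarifai_credentials(environ=os.environ):
    """Return (app_id, app_secret), or raise if either variable is unset."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ["CLARIFAI_APP_ID"], environ["CLARIFAI_APP_SECRET"]
```

Calling `clarifai_credentials()` with no arguments reads the current shell environment.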
Inside the ImageSearch directory, run python get_tags.py. This reads the URLs saved in imagespider/items.json, passes them to Clarifai's API to obtain tags and relevance scores, and persists the results in the Postgres database.
In the ImageSearch directory, run python app.py. The keyword search engine is then available at http://localhost:5000/. Example queries are http://localhost:5000/search?keywords=safari and http://localhost:5000/search?keywords=mountain+lion.
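Query URLs like the ones above can be built with the standard library, which takes care of encoding spaces as `+`. This is a small sketch; `search_url` is a hypothetical helper, and joining multiple keywords with spaces is an assumption based on the mountain+lion example.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000/search"


def search_url(keywords, base_url=BASE_URL):
    """Build a /search URL with the keywords joined into one query parameter."""
    return base_url + "?" + urlencode({"keywords": " ".join(keywords)})


print(search_url(["mountain", "lion"]))
# prints: http://localhost:5000/search?keywords=mountain+lion
```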
- There are two database tables: Image and KeywordRelevance. Image has_many KeywordRelevance, with a foreign_key stored on KeywordRelevance. This is better than a single flat, denormalized table because it reduces duplication and improves query performance.
- Scrapy starts at the start_url and follows links outward in a breadth-first search to a DEPTH_LIMIT depth (defined in imagespider/settings.py).
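The breadth-first, depth-limited traversal described above can be sketched without Scrapy over an in-memory link graph. This illustrates the idea only, not Scrapy's scheduler; `crawl_order` and the sample graph are hypothetical.

```python
from collections import deque


def crawl_order(links, start_url, depth_limit):
    """Return pages in breadth-first order, going at most depth_limit links deep.

    links: dict mapping a page URL to the list of URLs it links to.
    """
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= depth_limit:
            continue  # Do not follow links beyond the depth limit.
        for next_url in links.get(url, []):
            if next_url not in visited:
                visited.add(next_url)
                queue.append((next_url, depth + 1))
    return order


# Made-up link graph: "f" is three links away from "a".
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"]}
print(crawl_order(graph, "a", depth_limit=2))
# prints: ['a', 'b', 'c', 'd', 'e']
```

Page f is never visited because reaching it would require a third hop.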
Because this project is an MVP built in a matter of hours, there are improvements to be made. Here are a few:
- Tests: Unit and integration tests are necessary before an application can go to production.
- Improved Scraping: Duplicates can appear when the same image occurs more than once on a page or on different pages. This can be mitigated by enforcing a uniqueness constraint on image_urls. A useful feature would also be to keep track of different image sizes and filter by image size.
- Database Tables: Keyword can be broken off into its own table where the keyword column is unique to improve performance. Relevance can hold the foreign key to Keyword in a has_many relationship. Keyword then deduplicates keywords, and Postgres can take advantage of an index on Keyword's primary key.
- Database Constraints: Database constraints should be added to enforce data integrity before committing, for example that Image and KeywordRelevance's foreign keys are not null, which is a reasonable assertion in this case.
- Database Sharding: As this grows to web scale, the database will need to be scaled up. Sharding is one way to do this, for example by keyword.
- Caching: A production system should take advantage of caching for performance, for example caching images for the most popular queries.
- Parallelize requests: To make crawling web scale, crawlers should be distributed across many machines and many threads.
- Pagination: Pagination of serialized JSON search results is necessary when the scope of results is the entire web.
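Several of the database ideas above (a separate keyword table, uniqueness and not-null constraints, and paginated results) can be tried out in a small sketch. It uses an in-memory SQLite database as a stand-in for Postgres so that it runs without setup. All table names, column names, and helper functions here are hypothetical.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the project's migrations.
SCHEMA = """
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    image_url TEXT NOT NULL UNIQUE
);
CREATE TABLE keyword (
    id INTEGER PRIMARY KEY,
    keyword TEXT NOT NULL UNIQUE
);
CREATE TABLE relevance (
    image_id INTEGER NOT NULL REFERENCES image(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    score REAL NOT NULL,
    PRIMARY KEY (image_id, keyword_id)
);
"""


def connect():
    """Open an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_tag(conn, image_url, keyword, score):
    """Store one tag, reusing existing image and keyword rows."""
    conn.execute("INSERT OR IGNORE INTO image (image_url) VALUES (?)", (image_url,))
    conn.execute("INSERT OR IGNORE INTO keyword (keyword) VALUES (?)", (keyword,))
    conn.execute(
        "INSERT INTO relevance (image_id, keyword_id, score) "
        "SELECT image.id, keyword.id, ? FROM image, keyword "
        "WHERE image.image_url = ? AND keyword.keyword = ?",
        (score, image_url, keyword),
    )


def search_page(conn, keyword, page=0, page_size=2):
    """Return one page of image URLs for a keyword, most relevant first."""
    rows = conn.execute(
        "SELECT image.image_url FROM relevance "
        "JOIN image ON image.id = relevance.image_id "
        "JOIN keyword ON keyword.id = relevance.keyword_id "
        "WHERE keyword.keyword = ? "
        "ORDER BY relevance.score DESC, image.image_url "
        "LIMIT ? OFFSET ?",
        (keyword, page_size, page * page_size),
    )
    return [row[0] for row in rows]
```

In Postgres, `INSERT OR IGNORE` would be written as `INSERT ... ON CONFLICT DO NOTHING`. LIMIT/OFFSET is the simplest form of pagination. For very large result sets, keyset pagination may be a better fit.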