Organizing Image Collections with Python

Python can be a practical tool for managing creative content. Spreadsheets and databases are valuable for aggregating and filtering data for insights, but when the actual files in a content collection need to be grouped, separated, or copied, Python is an efficient way to organize them using that spreadsheet data.

The Project

This project started with approximately 120,000 JPEG images in one directory and a CSV of metadata for the pictures.

The goal was to gain insights into where (geographically) the pictures in this collection were photographed and what subjects they cover.

Python is well suited to parsing the original metadata to find the subject categories, and to copying the images, based on lists of filenames, into separate collection directories for viewing or sharing.

Using Tableau to visualize the geographic data allows us to better understand the breadth of the collection, and where specific topics were photographed.

The Data

One CSV exists for the whole collection of images.

The attributes studied in this project are locations, captions, and keywords.

Python Scripts Gather Instances of Selected Keywords

Python 3 scripts identify images with geographic attributes and assign them to separate sub-collections based on the keywords and captions in the original metadata CSV.

The six collections selected are beaches, cities, forests, landmarks, mountains, and parks.

A Python script reads the original metadata CSV, locates any occurrence of specified keyword strings, and then writes a new CSV with only those filenames and their locations. The script was adapted for each collection to isolate the specific image category.

For example, the mountains script finds all image filenames that are tagged with ‘mountain’ and ‘landscape’ in the keywords and ‘scenic’ in the caption. Searching for strings in both the keywords and caption fields raises the probability that the newly written CSV contains the specific type of images desired.
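The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the column names Filename, Location, Caption, and Keywords and the function name find_matches are assumptions, and matching is done with simple case-insensitive substring tests.

```python
import csv


def find_matches(metadata_csv, output_csv, keyword_terms, caption_terms):
    """Write filename and location for rows that contain every search term.

    Column names (Filename, Location, Caption, Keywords) are assumed here;
    adjust them to match the real metadata CSV.
    """
    matches = []
    with open(metadata_csv, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            keywords = row["Keywords"].lower()
            caption = row["Caption"].lower()
            # Keep the row only if all keyword terms and all caption terms appear
            if (all(term in keywords for term in keyword_terms)
                    and all(term in caption for term in caption_terms)):
                matches.append((row["Filename"], row["Location"]))

    # The new CSV holds only the filename and the location
    with open(output_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["Filename", "Location"])
        writer.writerows(matches)

    print(f"{len(matches)} images found")
    return len(matches)
```

For the mountains collection, a call might look like find_matches("metadata.csv", "mountains.csv", ["mountain", "landscape"], ["scenic"]), where both file names are placeholders.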

Search strings used in the Caption and Keywords columns for the six sub-collections:

Beaches: ‘beach’, ‘sand’, ‘ocean’

Cities: ‘city’, ‘street’, ‘urban’

Forests: ‘forest’, ‘tree’, ‘beauty in nature’

Landmarks: ‘landmark’, ‘historic place’

Mountains: ‘mountain’, ‘landscape’, ‘scenic’

Parks: ‘beauty in nature’, ‘park’, ‘outdoors’
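One way to avoid maintaining six nearly identical scripts is to keep the search strings in a single lookup table. The sketch below assumes that every string for a collection must appear somewhere in the combined caption and keywords text; the source does not say which field each string was searched in (except for mountains), so treat that rule, the column names, and the function names as assumptions.

```python
# Search strings per sub-collection, copied from the list above.
SEARCH_TERMS = {
    "beaches": ["beach", "sand", "ocean"],
    "cities": ["city", "street", "urban"],
    "forests": ["forest", "tree", "beauty in nature"],
    "landmarks": ["landmark", "historic place"],
    "mountains": ["mountain", "landscape", "scenic"],
    "parks": ["beauty in nature", "park", "outdoors"],
}


def matches_collection(row, terms):
    """Assumed rule: every term must occur in the caption or the keywords."""
    text = f"{row['Caption']} {row['Keywords']}".lower()
    return all(term in text for term in terms)


def assign_collections(rows):
    """Map each collection name to a list of (filename, location) pairs."""
    found = {name: [] for name in SEARCH_TERMS}
    for row in rows:
        for name, terms in SEARCH_TERMS.items():
            if matches_collection(row, terms):
                found[name].append((row["Filename"], row["Location"]))
    return found
```

Because this uses plain substring matching, a term such as ‘tree’ also matches inside ‘street’; requiring all of a collection's terms together limits, but does not remove, that kind of false positive.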

The script also prints the number of images found that meet the criteria.

The new CSVs for the collections contain only the filename and the location where the image was photographed.

This script moves JPEG images from the main collection directory into a new folder based on the CSV files created above. If an image is listed in the CSV that is not in the folder of images, that filename will be printed as missing when the script attempts to move it.
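A minimal sketch of such a move step is shown here. It is not the project's script, so any line numbers cited in this article do not apply to it; the Filename column and the function name move_listed_images are assumptions.

```python
import csv
import shutil
from pathlib import Path


def move_listed_images(collection_csv, source_dir, target_dir):
    """Move every image named in the CSV; return the filenames not found."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    with open(collection_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["Filename"]  # assumed column name
            try:
                shutil.move(str(source_dir / name), str(target_dir / name))
            except FileNotFoundError:
                # Listed in the CSV but absent from the image folder
                print(f"Missing: {name}")
                missing.append(name)
    return missing
```

Returning the list of missing filenames, in addition to printing them, makes it easy to log or recheck them later.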

Moving and Copying JPEGs with Python

If the original collection needs to stay intact and the images should be copied rather than moved, use shutil.copy instead of shutil.move on line 19 of the script.

If an image listed in the CSV is not in the folder of images, lines 23-25 of the script report that filename as missing when the script attempts to move it.
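To illustrate the copy variant on its own, here is a short sketch that takes a plain list of filenames and leaves the originals in place. The function name copy_listed_images is hypothetical, and the line numbers mentioned above refer to the project's script, not to this example.

```python
import shutil
from pathlib import Path


def copy_listed_images(filenames, source_dir, target_dir):
    """Copy each named image, leaving the source collection untouched.

    Returns the filenames that were not found in source_dir.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in filenames:
        try:
            # shutil.copy duplicates the file; shutil.move would relocate it
            shutil.copy(source_dir / name, target_dir / name)
        except FileNotFoundError:
            print(f"Missing: {name}")
            missing.append(name)
    return missing
```

If preserving file timestamps matters, shutil.copy2 is a drop-in alternative that also attempts to copy file metadata.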

Visualizing Results with Tableau: Mapping the Collection by Subject

Python was used to append all of the new collection CSVs into one CSV.
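The append step might look something like the sketch below. Adding a Collection column taken from each file name (for example, "beaches" from beaches.csv) is an assumption made here so that the combined file can be split by subject later; the column names and the function name combine_collections are likewise hypothetical.

```python
import csv
from pathlib import Path


def combine_collections(csv_paths, combined_csv):
    """Append the rows of each collection CSV into one file; return row count."""
    total = 0
    with open(combined_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["Filename", "Location", "Collection"])
        for path in csv_paths:
            collection = Path(path).stem  # e.g. "beaches" from "beaches.csv"
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    writer.writerow([row["Filename"], row["Location"], collection])
                    total += 1
    return total
```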

The geographic location names were cleaned using OpenRefine (formerly Google Refine). The location data contained a range of spelling errors and language discrepancies, as well as values entered in the wrong field.

Once cleaned, location names were converted from strings to latitude and longitude coordinates using the Google Geocoding API.
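As a rough sketch of the geocoding step, the functions below build a request to the Geocoding API's JSON endpoint and pull the first latitude and longitude pair out of the response. The function names are hypothetical, the API key is a placeholder you must supply, and real use would also need rate limiting and error handling that are omitted here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(location, api_key):
    """Build the request URL for one location string."""
    return f"{GEOCODE_ENDPOINT}?{urlencode({'address': location, 'key': api_key})}"


def parse_coordinates(response):
    """Return (lat, lng) from a decoded API response, or None if no match."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    point = response["results"][0]["geometry"]["location"]
    return point["lat"], point["lng"]


def geocode(location, api_key):
    """Look up one location; performs a network request."""
    with urlopen(build_geocode_url(location, api_key)) as reply:
        return parse_coordinates(json.load(reply))
```

Keeping the URL building and response parsing separate from the network call makes both easy to check without spending API quota.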

An interactive map in Tableau allows the viewer to click on each collection to see where each category is represented geographically. Functionality includes the option to turn different collections on and off to understand the breadth of each sub-collection, as well as the breadth of all the geographic collections together. Points on the map are sized according to the number of filenames at the location and shaded according to collection subject. On hover, the viewer can see the location (city, state, country) and the number of images represented at that location.
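Tableau computes the point sizes itself, but the underlying aggregation is simple to reproduce for a quick sanity check of the numbers shown on hover. This sketch assumes rows with Collection and Location fields; both field names and the function name are hypothetical.

```python
from collections import Counter


def count_by_location(rows):
    """Count images per (collection, location) pair."""
    return Counter((row["Collection"], row["Location"]) for row in rows)
```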

Interact with this Tableau visualization here.

For further information: all scripts, instructions, and project details may be found in a GitHub repository here.