Organizing Image Collections with Python
Python can be a useful tool for managing creative content. Spreadsheets and databases are valuable for aggregating and filtering data for insights. But when the physical assets of a content collection need to be grouped, separated, or copied, Python offers an efficient way to organize the digital files using that spreadsheet data.
The Project
This project started with approximately 120,000 JPEG images in one directory and a CSV of metadata for the pictures.
The goal was to gain insights into where (geographically) the pictures in this collection were taken and what subjects they cover.
Python is well suited both to parsing the original data to see the categories of subjects and to copying the images, based on lists of filenames, into separate collection directories for viewing or sharing.
Using Tableau to visualize the geographic data allows us to better understand the breadth of the collection, and where specific topics were photographed.
The Data
Python Scripts Gather Instances of Selected Keywords
Using Python 3, scripts identify images with geographic attributes and assign them to separate sub-collections, based on the keywords and captions in the original metadata CSV.
The six collections selected are beaches, cities, forests, landmarks, mountains, and parks.
A Python script reads the original metadata CSV, locates any occurrence of specified keyword strings, and then writes a new CSV with only those filenames and their locations. The script was adapted for each collection to isolate the specific image category.
The search strings applied to the Caption and Keywords columns for the six sub-collections are:
Beaches: ‘beach’, ‘sand’, ‘ocean’
Cities: ‘city’, ‘street’, ‘urban’
Forests: ‘forest’, ‘tree’, ‘beauty in nature’
Landmarks: ‘landmark’, ‘historic place’
Mountains: ‘mountain’, ‘landscape’, ‘scenic’
Parks: ‘beauty in nature’, ‘park’, ‘outdoors’
The script also prints the number of images found that meet the criteria.
The new CSVs for the collections contain only the filename and the location where the image was photographed.
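The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name filter_collection and the column names "Filename" and "Location" are assumptions (only the Caption and Keywords columns are named in the text), and matching is a simple case-insensitive substring test.

```python
import csv


def filter_collection(source_csv, output_csv, terms,
                      filename_col="Filename", location_col="Location",
                      text_cols=("Caption", "Keywords")):
    """Write filename/location for rows whose text columns contain any term.

    Column names other than Caption and Keywords are assumed for illustration.
    Returns the number of matching images.
    """
    terms = [t.lower() for t in terms]
    matches = 0
    with open(source_csv, newline="", encoding="utf-8") as src, \
            open(output_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow([filename_col, location_col])
        for row in reader:
            # Join the searchable columns into one lowercase string.
            text = " ".join((row.get(col) or "") for col in text_cols).lower()
            if any(term in text for term in terms):
                writer.writerow([row[filename_col], row[location_col]])
                matches += 1
    return matches
```

For the beaches collection, a call such as filter_collection("metadata.csv", "beaches.csv", ["beach", "sand", "ocean"]) would write the new CSV and return the count to print (the file names here are placeholders). Because this is a substring match, a term like "sand" would also match "thousand"; whether that matters depends on the metadata.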
Moving and Copying JPEGs with Python
If the original collection needs to stay intact and the images should be copied rather than moved, use shutil.copy instead of shutil.move on line 19.
If an image listed in the CSV is not in the folder of images, that filename is reported as missing when the script attempts to move it (lines 23-25 of the script).
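A minimal sketch of the move-or-copy step is shown here for illustration. It is not the original script, so its line numbers will not match the ones mentioned above. The function name transfer_images, the copy flag, and the "Filename" column name are assumptions; shutil.copy and shutil.move are the standard-library calls named in the text.

```python
import csv
import shutil
from pathlib import Path


def transfer_images(list_csv, source_dir, dest_dir, copy=True,
                    filename_col="Filename"):
    """Copy (or move) each file named in list_csv from source_dir to dest_dir.

    Returns two lists: filenames transferred and filenames not found.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy leaves the original in place; shutil.move does not.
    action = shutil.copy if copy else shutil.move
    done, missing = [], []
    with open(list_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row[filename_col]
            src = source_dir / name
            if src.is_file():
                action(str(src), str(dest_dir / name))
                done.append(name)
            else:
                # Report filenames listed in the CSV but absent from the folder.
                print(f"Missing: {name}")
                missing.append(name)
    return done, missing
```

Defaulting to copy=True keeps the original directory intact unless a move is requested explicitly.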
Visualizing Results with Tableau: Mapping the Collection by Subject
Python was used to append all of the new collection CSVs into one CSV.
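One way the append step might look is sketched below. The function name combine_collections and the "Filename" and "Location" column names are assumptions, as is the added "Collection" column, which is included here only so that each row keeps track of the sub-collection it came from.

```python
import csv


def combine_collections(collection_csvs, output_csv):
    """Append several collection CSVs into one file.

    collection_csvs maps a collection name (e.g. "beaches") to its CSV path.
    Returns the total number of rows written.
    """
    total = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["Collection", "Filename", "Location"])
        for name, path in collection_csvs.items():
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    writer.writerow([name, row["Filename"], row["Location"]])
                    total += 1
    return total
```

An image that matched more than one set of search strings appears once per collection in the combined file, which suits a map that shows each collection separately.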
The geographic location names were cleaned using OpenRefine (formerly Google Refine). The data contained a range of location spelling errors and language discrepancies, as well as values entered in the wrong field.
Once cleaned, location names were converted from strings to latitude and longitude coordinates using the Google Geocoding API.
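The geocoding step could be approached along these lines. This is a hedged sketch using only the standard library, not the code used for the project: the helper names build_geocode_url, parse_lat_lng, and geocode are hypothetical, and an API key (not shown) is assumed. The request goes to the Geocoding API's JSON endpoint, and the coordinates are read from the first result's geometry.location field.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for one location string."""
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def parse_lat_lng(payload):
    """Return (lat, lng) from a decoded response, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Look up one address; requires network access and a valid API key."""
    with urllib.request.urlopen(build_geocode_url(address, api_key)) as resp:
        return parse_lat_lng(json.load(resp))
```

Since many images share a location, geocoding each distinct location name once and reusing the result keeps the number of requests small.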
An interactive map in Tableau allows the viewer to click on each collection to see where that category is represented geographically. The viewer can turn individual collections on and off to understand the breadth of each sub-collection, as well as the breadth of all the geographic collections together. Points on the map are sized according to the number of filenames at the location and shaded according to collection subject. On hover, the viewer can see the location (city, state, country) and the number of images represented at that location.