Introduction

The most common image types in Python computer vision are Pillow and OpenCV images, and many machine learning applications train on large image datasets. Sometimes we want to save processed Pillow or OpenCV images to an HDF5 file for later use, and the size of that HDF5 file can vary significantly depending on how the images are converted. When working with a large image dataset, we want to minimize the size of the HDF5 file.

In this post, I will use different methods to save Pillow and OpenCV images to an HDF5 file. In short, the best way to minimize file size is to convert the Pillow or OpenCV image to bytes and then save the bytes to HDF5. Image resizing and HDF5 chunking are not considered in this post; the test images have different resolutions.

Testing Setup

100 images (7.99 MB, or 8,382,912 bytes) with different resolutions are used for testing; they live in the images folder. The project directory structure is shown below, followed by a small sketch of how the file sizes can be measured. Execute the script in the project folder with the command python3 save_to_hdf5.py.

├── project/
│   ├── save_to_hdf5.py
│   ├── xxxxxx.h5
│   └── images/
│       ├── xxxxx1.JPEG
│       ├── xxxxx2.JPEG
│       ├── ......
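
The image and HDF5 file sizes quoted in this post are simply the on-disk sizes; a minimal sketch of how you might sum them (the folder_size helper name is my own):

import os

def folder_size(folder):
    # sum the size in bytes of every regular file directly inside the folder
    return sum(
        os.path.getsize(os.path.join(folder, name))
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )

print(folder_size(os.path.join(os.getcwd(), 'images')))                 # 8382912 for the test set
print(os.path.getsize(os.path.join(os.getcwd(), 'opencv_direct.h5')))   # size of one HDF5 file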

Directly Save OpenCV and PIL Images to HDF5

import os
import io

import cv2
from PIL import Image
import numpy as np
import h5py

# get all the images path
image_folder = os.path.join(os.getcwd(), 'images')
image_paths_list = [os.path.join(image_folder, image_path) for image_path in os.listdir(image_folder)]

# directly save opencv images to hdf5
opencv_direct_h5 = h5py.File(os.path.join(os.getcwd(), 'opencv_direct.h5'), 'w')

for i, image_path in enumerate(image_paths_list):
    # resize images to the same resolution and add chunks/compression if you want to reduce file size
    image = cv2.imread(image_path)
    opencv_direct_h5.create_dataset('image_{}'.format(i), data=image)

opencv_direct_h5.close()

# directly save pil images to hdf5
pil_direct_h5 = h5py.File(os.path.join(os.getcwd(), 'pil_direct.h5'), 'w')

for i, image_path in enumerate(image_paths_list):
    image = Image.open(image_path)
    pil_direct_h5.create_dataset('image_{}'.format(i), data=image)

pil_direct_h5.close()

Images file size: 8,382,912 Bytes (7.99MB)

opencv_direct.h5 HDF5 file size: 56,484,466 Bytes (53.8MB)

pil_direct.h5 HDF5 file size: 56,484,466 Bytes (53.8MB)

Both HDF5 files have the same size, because the OpenCV and Pillow images are stored here as the same data type (a NumPy ndarray of raw pixels). Compared with the original JPEG images, the HDF5 files are much bigger, since the pixel data is written uncompressed.
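
If you want to keep raw pixel arrays in HDF5 but still shrink the file, h5py can apply chunking and a lossless filter per dataset; a minimal sketch for the OpenCV case, reusing image_paths_list from the script above (compression level 4 chosen only for illustration):

# store raw arrays with chunking and lossless gzip compression
opencv_gzip_h5 = h5py.File(os.path.join(os.getcwd(), 'opencv_gzip.h5'), 'w')

for i, image_path in enumerate(image_paths_list):
    image = cv2.imread(image_path)
    # chunks=True lets h5py pick a chunk shape; gzip is lossless, so the pixels are unchanged
    opencv_gzip_h5.create_dataset('image_{}'.format(i), data=image,
                                  chunks=True, compression='gzip', compression_opts=4)

opencv_gzip_h5.close()

Lossless gzip usually cannot match JPEG's lossy compression on photographs, which is why the bytes approach below still gives a much smaller file.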

Save OpenCV and PIL Images to HDF5 with Bytes Conversion ※

# convert opencv images to bytes and save bytes to hdf5
opencv_to_bytes_h5 = h5py.File(os.path.join(os.getcwd(), 'opencv_to_bytes.h5'), 'w')

for i, image_path in enumerate(image_paths_list):
    image = cv2.imread(image_path)
    image_array = cv2.imencode(".JPEG", image)[1] # image_array is a numpy array
    image_buffer = io.BytesIO(image_array) # convert array to bytes
    image_bytes = image_buffer.getvalue() # retrieve bytes string
    image_np = np.asarray(image_bytes)
    opencv_to_bytes_h5.create_dataset('image_{}'.format(i), data=image_np)

opencv_to_bytes_h5.close()

# convert pil images to bytes and save bytes to hdf5
pil_to_bytes_h5 = h5py.File(os.path.join(os.getcwd(), 'pil_to_bytes.h5'), 'w') 

for i, image_path in enumerate(image_paths_list):
    image = Image.open(image_path)
    image_buffer = io.BytesIO()
    image.save(image_buffer, format='JPEG') # encode the PIL image as JPEG into the buffer
    image_bytes = image_buffer.getvalue() # retrieve bytes string
    image_np = np.asarray(image_bytes)
    pil_to_bytes_h5.create_dataset('image_{}'.format(i), data=image_np)

pil_to_bytes_h5.close()

Images file size: 8,382,912 Bytes (7.99MB)

opencv_to_bytes.h5 HDF5 file size: 6,811,576 Bytes (6.49MB)

pil_to_bytes.h5 HDF5 file size: 2,812,969 Bytes (2.68MB)

Both HDF5 files are much smaller than the original image files, especially pil_to_bytes.h5. The gap between the two comes largely from the JPEG re-encoding: Pillow's default JPEG quality (75) is lower than OpenCV's (95), so the Pillow output is smaller but also more lossy. If we want the smallest HDF5 file size, we can convert Pillow images to bytes and then save them to HDF5. A sketch for setting the JPEG quality explicitly follows.
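
Both libraries let you set the JPEG quality explicitly when encoding, which is what actually drives the file size here; a minimal sketch (quality 75 chosen only for illustration, image_paths_list as defined in the script above):

import io

import cv2
from PIL import Image

image_path = image_paths_list[0]

# OpenCV: pass IMWRITE_JPEG_QUALITY to imencode (OpenCV's default is 95)
image_cv = cv2.imread(image_path)
ok, encoded = cv2.imencode('.jpg', image_cv, [int(cv2.IMWRITE_JPEG_QUALITY), 75])
opencv_bytes = encoded.tobytes()

# Pillow: pass quality= to save (Pillow's default is 75)
image_pil = Image.open(image_path)
buffer = io.BytesIO()
image_pil.save(buffer, format='JPEG', quality=75)
pil_bytes = buffer.getvalue()

Either bytes object can then be written to HDF5 exactly as in the code above.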

Read Back the Image from HDF5 (Bytes)

pil_to_bytes_h5 = h5py.File(os.path.join(os.getcwd(), 'pil_to_bytes.h5'), 'r') # same for opencv_to_bytes.h5
for key, value in pil_to_bytes_h5.items():
    image_array = np.array(value[()]) # value[()] reads the stored bytes into memory
    image_buffer = io.BytesIO(image_array)
    image_pil = Image.open(image_buffer)

    # process the pillow image image_pil ...

pil_to_bytes_h5.close()
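
If you want an OpenCV image back instead of a Pillow one, the same stored JPEG bytes can be decoded with cv2.imdecode; a minimal sketch:

import os

import cv2
import numpy as np
import h5py

opencv_to_bytes_h5 = h5py.File(os.path.join(os.getcwd(), 'opencv_to_bytes.h5'), 'r')

for key, value in opencv_to_bytes_h5.items():
    buffer_array = np.frombuffer(value[()], dtype=np.uint8) # stored bytes -> uint8 array
    image_cv = cv2.imdecode(buffer_array, cv2.IMREAD_COLOR) # decode JPEG into a BGR ndarray

    # process the opencv image image_cv ...

opencv_to_bytes_h5.close()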

–END–