MSR-CORE12 Project: Contents-based assessment of the aesthetics of photography

(Funded period: April 2016 - March 2017)


The aesthetics of photography and artwork have been studied for a long time. The so-called "Rule of Thirds", often described as an approximation of the golden ratio, is a well-known basic rule for deciding the framing. However, in reality, other constraints often take precedence over this basic rule. These constraints include the purpose of photographing and the nature of the target contents-of-interest in the scene. In most situations, it is preferable to include certain contents rather than others, given the purpose of photographing. The aesthetics of photography should therefore be assessed according to the contents visible in the image, in addition to general rules. Since the purpose of photographing varies case by case and in many cases cannot even be described explicitly, and since it is nearly impossible to describe the nature of each content in the scene beforehand, it is very difficult to solve this problem in a general framework. The proposed project therefore aims to assess the aesthetics of food images in particular, whose purpose of photographing is clear (i.e., the target food should look delicious) and whose contents are restricted and usually annotated (i.e., accompanied by dish names and/or ingredients).

Food was selected as the target domain because there is also real-world demand. In recent years, Web services related to cooking and eating activities have become very popular. For example, there are user-contributed cooking recipe sites such as US-based and Japan-based Cookpad, and restaurant review sites such as US-based Yelp and Japan-based Tabelog, which contain millions of entities. In these Web services, food images play a very important role. In fact, the more entities a Web site contains, the more important the role of images becomes, since general users searching and scanning through the entities tend to be attracted by the visual appearance of food images before going further into detailed text descriptions.

Fig. 1: Presenting more attractive images attracts more attention.

In this respect, for a user publishing a cooking recipe, accompanying it with an attractive food image is important to expose it to a larger number of viewers, while for a restaurant owner, having the most attractive food images shown out of those posted by reviewers is important to attract more customers. However, it is not necessarily easy for an amateur to photograph a visually attractive food image, and current Web services do not provide ranking functions based on the visual attractiveness of food images. In order to support such real-world demands, this project aims to establish a method that quantitatively assesses the aesthetics of food photography.

Establishing such technology should also contribute to the understanding of the aesthetics of photography in general. The analysis of what attracts us, given the nature of the target dishes and/or ingredients, should yield facts and intuitions drawn from various aspects: not only visual information, but also the preparation steps and ingredient list in a cooking recipe, and even other sensory information in our memory, such as taste, smell, and texture, acquired through personal experience.

Expected results

Once the aesthetics of food photography can be assessed quantitatively, the assessment could be used as a reference to provide, for example, a smartphone/tablet app with an interactive interface that helps a user photograph a visually attractive food image, or a function in a Web service that selects/ranks visually attractive food images posted by its reviewers. In this project, I will develop a prototype interface for the former application.

Although the proposed project focuses on food images, the proposed approach could also be applied to obtain similar results and applications in other domains as long as the purpose of the photography could be assumed and the nature of the contents-of-interest is known and can be analyzed beforehand.

Research plan

Step 1: Data acquisition

To facilitate the capture of training samples, a 360-degree image-capturing system will be constructed. The system will consist of a programmable turntable and multiple arms mounted with programmable Web cameras, so that food images can be captured systematically under various conditions by automatically changing the framing conditions (yaw, pitch, zoom level, and focus). To ensure reproducibility, plastic food samples will be used instead of actual food.
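The systematic sweep over framing conditions described above could be enumerated along the following lines. This is only a sketch: the class name `CaptureCondition`, the function `enumerate_conditions`, and all step sizes and value ranges are hypothetical choices for illustration, not the parameters of the actual capturing system.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CaptureCondition:
    """One framing condition (all fields and units are illustrative assumptions)."""
    yaw_deg: int      # turntable rotation
    pitch_deg: int    # camera elevation, selected by the arm
    zoom_level: int   # discrete zoom setting
    focus_step: int   # discrete focus setting


def enumerate_conditions(yaw_step_deg=30,
                         pitches_deg=(0, 30, 60, 90),
                         zoom_levels=(1, 2),
                         focus_steps=(0, 1)):
    """Return every combination of the given framing parameters as a list."""
    yaws = range(0, 360, yaw_step_deg)
    return [CaptureCondition(y, p, z, f)
            for y, p, z, f in product(yaws, pitches_deg, zoom_levels, focus_steps)]


conditions = enumerate_conditions()
print(len(conditions), conditions[0])
```

Enumerating the full grid up front makes it easy to check how many images a capture session will produce before any hardware is driven.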

The dataset composed of the captured food images will be provided to the research community shortly after the conclusion of the project.

Step 2: Measurement of visual attractiveness

The attractiveness of each captured food image in the dataset will be measured through manual annotation. This is needed as the ground truth both for training the software that assesses attractiveness (hereafter, the assessor) and for evaluating it.

Since it is nearly impossible to measure the absolute attractiveness of each individual food image, we take a relative approach by comparing pairs of food images. Concretely, human subjects are presented with randomly chosen pairs of food images captured under different conditions on a large high-resolution monitor, and are asked to judge which of the two is more attractive. After multiple judgments are obtained for all possible pairs, a scale is obtained by applying Thurstone’s pairwise comparison method. As a result, each food image will be assigned a value between 0 and 1 that represents its relative attractiveness.
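As a rough illustration of the scaling step, the sketch below applies Thurstone's Case V model to a matrix of pairwise judgment counts. The choice of Case V, the clipping constant `eps`, and the min-max rescaling to the range 0 to 1 are assumptions made here for illustration; the project description does not state which variant or normalization is used.

```python
from statistics import NormalDist


def thurstone_case_v(wins, eps=0.01):
    """Scale items from pairwise counts (Thurstone Case V, a sketch).

    wins[i][j] is the number of times image i was judged more
    attractive than image j. Returns one value in [0, 1] per image.
    """
    n = len(wins)
    norm = NormalDist()  # standard normal distribution
    raw = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue
            total = wins[i][j] + wins[j][i]
            p = wins[i][j] / total if total else 0.5
            # Clip unanimous proportions so the z-score stays finite (assumption).
            p = min(max(p, eps), 1.0 - eps)
            z_sum += norm.inv_cdf(p)
        # Case V scale value: mean z-score over all n items (self-comparison counts as 0).
        raw.append(z_sum / n)
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.5] * n
    # Min-max rescaling to [0, 1] (assumed normalization).
    return [(s - lo) / (hi - lo) for s in raw]


# Toy judgments for three images: image 0 usually wins, image 2 usually loses.
print(thurstone_case_v([[0, 8, 9], [2, 0, 7], [1, 3, 0]]))
```

With this normalization, the most and least preferred images receive 1 and 0 respectively, and the remaining images fall in between according to how consistently they were preferred.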

Step 3: Training of the attractiveness assessor

Once a ground-truth attractiveness value is assigned to each food image, an attractiveness assessor for a given food image is constructed with an off-the-shelf machine learning framework using various image features. The performance of the trained assessor is then evaluated by n-fold cross-validation.
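The evaluation protocol could look roughly like the sketch below. A plain k-nearest-neighbour regressor stands in for the off-the-shelf learner, which the description does not name; the function names, the mean absolute error metric, and the toy data are likewise assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def knn_predict(train_x, train_y, query, k=3):
    """Placeholder regressor: mean score of the k nearest training samples."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validate(features, scores, n_folds=5, k=3):
    """Return the mean absolute error on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(features), n_folds):
        held = set(held_out)
        train_x = [features[i] for i in range(len(features)) if i not in held]
        train_y = [scores[i] for i in range(len(features)) if i not in held]
        abs_err = [abs(knn_predict(train_x, train_y, features[i], k) - scores[i])
                   for i in held_out]
        errors.append(sum(abs_err) / len(abs_err))
    return errors


# Toy data: one feature per image, attractiveness equal to that feature.
toy_x = [[i / 10] for i in range(10)]
toy_y = [i / 10 for i in range(10)]
print(cross_validate(toy_x, toy_y, n_folds=5, k=1))
```

Each image is held out exactly once, so the per-fold errors together describe how well the assessor generalizes to images it was not trained on.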

Selecting and/or designing features appropriate for assessing attractiveness is important not only for the performance of the assessor, but also for analyzing and gaining insight into what attracts us, given the inherent nature of the contents that appear in the image. For example, in a food image (e.g., stew), major ingredients (e.g., meat) and/or the results of major cooking procedures (e.g., broiled) should be visible and presented as attractively as possible to characterize the food, regardless of the criteria assessed by low-level image features (e.g., colorful). For this reason, the use of high-level features such as food categories and major ingredients will be considered in addition to low-level image features. These could be obtained from the food image itself, but text information from user annotations and cooking recipes will also be used.

(a) Attractive as a food (stew) image. (b) Attractive as a general image (colorful), but not attractive as a food (stew) image, where the major cooking procedure and ingredient (broiled meat) are important.
Fig. 2: Difference of attractiveness considering low-level and high-level image features.
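One simple way to combine low-level and high-level features, as discussed in Step 3, is to concatenate them into a single vector. In the sketch below, mean color saturation stands in for a low-level "colorful" cue and one-hot encodings stand in for the food category and ingredient annotations; these particular features, the function names, and the vocabularies are illustrative assumptions, not the features actually used in the project.

```python
import colorsys


def mean_saturation(pixels):
    """Low-level cue: average HSV saturation of (r, g, b) pixels in 0..255."""
    total = 0.0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s
    return total / len(pixels)


def one_hot(labels, vocabulary):
    """High-level cue: 1.0 for each vocabulary term present in labels, else 0.0."""
    present = set(labels)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def build_feature_vector(pixels, category, ingredients, categories, ingredient_vocab):
    """Concatenate the low-level and high-level cues into one feature vector."""
    return ([mean_saturation(pixels)]
            + one_hot([category], categories)
            + one_hot(ingredients, ingredient_vocab))


# Toy example: two pixels (pure red and mid gray) annotated as stew with meat.
vec = build_feature_vector(
    pixels=[(255, 0, 0), (128, 128, 128)],
    category="stew",
    ingredients=["meat"],
    categories=["stew", "salad"],          # hypothetical category vocabulary
    ingredient_vocab=["meat", "carrot"],   # hypothetical ingredient vocabulary
)
print(vec)
```

Keeping the two groups of features in separate, identifiable positions of the vector makes it possible to inspect afterwards how much each group contributes to the assessed attractiveness.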

Step 4: Implementation of the interactive interface

As an application of the proposed assessor, an interactive interface that helps a user photograph an attractive food image will be implemented on a tablet. The interface will indicate to the user the attractiveness of the food image currently captured in the frame. While interactively changing conditions by moving the tablet around the food, the user can decide on the framing condition he/she prefers with the support of the interface. A possible extension of the interface is an additional function that recommends that the user change certain conditions for a better framing.

Fig. 3: The user decides the best framing he/she prefers with the support of the interactive interface.
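The possible recommendation function mentioned in Step 4 could, for instance, compare the assessed attractiveness of the current framing with that of a few neighbouring framings and suggest the most promising change. The sketch below assumes a hypothetical `assess(yaw, pitch)` callable returning a score between 0 and 1; the hint texts, the step size, and the restriction to yaw and pitch are also assumptions for illustration.

```python
def recommend_adjustment(assess, current, step_deg=10):
    """Return (current_score, hint) for a framing given as (yaw, pitch) in degrees.

    assess is a hypothetical callable: assess(yaw, pitch) -> score in [0, 1].
    """
    yaw, pitch = current
    candidates = {
        "rotate right": ((yaw + step_deg) % 360, pitch),
        "rotate left": ((yaw - step_deg) % 360, pitch),
        "raise camera": (yaw, min(pitch + step_deg, 90)),
        "lower camera": (yaw, max(pitch - step_deg, 0)),
    }
    current_score = assess(yaw, pitch)
    best_hint, best_score = "hold position", current_score
    for hint, (y, p) in candidates.items():
        score = assess(y, p)
        # Only recommend a change that strictly improves the score.
        if score > best_score:
            best_hint, best_score = hint, score
    return current_score, best_hint


def toy_assess(yaw, pitch):
    """Toy assessor that prefers a 45-degree pitch regardless of yaw."""
    return 1.0 - abs(pitch - 45) / 90


print(recommend_adjustment(toy_assess, (0, 20)))
print(recommend_adjustment(toy_assess, (0, 45)))
```

The final decision remains with the user, as in the described interface; the hint only points toward a neighbouring framing that the assessor rates higher.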



Publications

  1. Kazuma Takahashi, Keisuke Doman, Takatsugu Hirayama, Yasutomo Kawanishi, Ichiro Ide, Daisuke Deguchi, Hiroshi Murase:
    "A study on estimating the attractiveness of food photography",
    Proc. First Int. Workshop on Attractiveness Computing in Multimedia (ACM 2016) in conjunction with IEEE BigMM2016, pp.444-449
    (At: Howard Civil Service International House (Taipei, Taiwan), Apr. 2016)
  2. Kazuma Takahashi, Keisuke Doman, Yasutomo Kawanishi, Takatsugu Hirayama, Ichiro Ide, Daisuke Deguchi, Hiroshi Murase:
    "Estimation of the attractiveness of food photography focusing on main ingredients",
    Proc. Ninth Workshop on Cooking and Eating Activities (CEA2017) in conjunction with IJCAI2017, pp.1-6
    (At: RMIT Univ. (Melbourne, VIC, Australia), Aug. 2017)

Invited talks

  1. Ichiro Ide:
    "Contents-based assessment of the aesthetics of photography",
    Microsoft Research Japan - Korea Academic Day 2016
    (At: Microsoft Japan, Shinagawa Headquarters (Tokyo), May 2016)
  2. Ichiro Ide:
    "Assessment of the aesthetics of food photography",
    ACM Multimedia 2016 TPC Workshop at ICMR2016
    (At: Columbia University (New York, NY, USA), June 2016)



Ichiro IDE