Following up on previous posts (available on my research blog here) in which I presented algorithms rooted in information theory that identify structure in images, I’ve come up with a series of algorithms that together allow for very fast image categorization.
Using the algorithms presented in previous posts, the process starts with breaking an image into a set of equally sized rectangular regions. These regions are sized so as to maximize the standard deviation of the entropies of the regions. That is, we size the regions so that they contain maximally different amounts of information. This process alone uncovers quite a bit of structure within an image, and is extremely fast, generally taking only a few seconds on an ordinary laptop. See the previous posts for more on this process.
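To make the region-sizing step concrete, here is a minimal Python sketch (the attached code is in Octave; the function names, the 16-bin histogram, and the square-region simplification are all my own illustration): for a grayscale image, it tries several candidate region sizes and keeps the one that maximizes the standard deviation of the per-region entropies.

```python
import numpy as np

def region_entropy(block, bins=16):
    """Shannon entropy (base 2) of a block's intensity histogram."""
    hist, _ = np.histogram(block, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_region_size(img, candidate_sizes):
    """Pick the square region size that maximizes the standard
    deviation of the per-region entropies, i.e. the size at which
    regions contain maximally different amounts of information."""
    best_size, best_spread = None, -1.0
    h, w = img.shape
    for s in candidate_sizes:
        ents = [region_entropy(img[r:r + s, c:c + s])
                for r in range(0, h - s + 1, s)
                for c in range(0, w - s + 1, s)]
        spread = float(np.std(ents))
        if spread > best_spread:
            best_size, best_spread = s, spread
    return best_size, best_spread
```

An image with a flat half and a noisy half, for instance, will reward a size whose regions separate cleanly into low-entropy and high-entropy groups.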
Next, the algorithm assembles these regions into larger features, using an automaton that “crawls” around the image, grouping similar regions together into larger contiguous features. The question arises as to how similar two regions need to be in order to be considered part of a larger contiguous feature. Previously, I answered this question by grouping regions that are within one standard deviation of each other (as measured using a metric I describe in previous posts). This method works very well in general, and is very fast. However, it misses smaller structures in more complex images, such as busy city streets.
So instead, this set of algorithms iterates through multiples of the standard deviation, beginning with a small fraction of one standard deviation and working up to one full standard deviation. The algorithm then chooses the multiple at which the structure of the features begins to “crystallize”: that is, it stops partitioning the image at the point where the structure of the resulting features is most stable. Stability is tested by measuring how much the entropy of the distribution of features changes from one multiple to the next; once that distribution begins to stabilize, the algorithm stops partitioning the image.
For example, if between .5*std_dev and .6*std_dev, the number and distribution of features is roughly constant, and is the most stable when compared to all other intervals, then the algorithm will choose delta = .5*std_dev as the optimal value for purposes of testing how similar neighboring regions must be in order to be grouped into a single contiguous feature.
In general, the algorithm measures the difference between H(i), the entropy of the distribution of features using a delta of (i*alpha*std_dev), and H(i+1), the entropy of the distribution of features using a delta of ((i+1)*alpha*std_dev), and chooses the value of i that minimizes,
|H(i) – H(i + 1)|.
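The delta-selection rule above can be sketched as follows. This is a simplified one-dimensional stand-in for the two-dimensional crawler (the function names, the adjacent-run grouping, and the use of feature sizes as the distribution are my own illustration, not the attached Octave code):

```python
import numpy as np

def feature_distribution_entropy(region_vals, delta):
    """Group adjacent regions whose values differ by <= delta into
    features (a 1-D stand-in for the 2-D crawling automaton), then
    return the entropy of the distribution of feature sizes."""
    sizes, run = [], 1
    for a, b in zip(region_vals, region_vals[1:]):
        if abs(b - a) <= delta:
            run += 1
        else:
            sizes.append(run)
            run = 1
    sizes.append(run)
    _, counts = np.unique(sizes, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def choose_delta(region_vals, alpha=0.1, steps=10):
    """Scan delta = i*alpha*std_dev for i = 1..steps and return the
    delta at which |H(i) - H(i+1)| is smallest, i.e. where the
    feature structure is most stable."""
    sd = np.std(region_vals)
    H = [feature_distribution_entropy(region_vals, i * alpha * sd)
         for i in range(1, steps + 2)]
    i_best = int(np.argmin([abs(H[i] - H[i + 1]) for i in range(steps)]))
    return (i_best + 1) * alpha * sd
```

The returned delta is then used as the similarity threshold when grouping neighboring regions into features.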
The theory underlying this approach is that when we don’t distinguish between any of the regions within an image (by using a delta of infinity), the information content of the resultant partition (i.e., the entropy of the distribution of features) is zero, since we are in this case treating the image as one giant feature. That is, the difference between any two regions is finite under any method of comparison, so if delta is set to infinity, all of the regions will be indistinguishable, producing a single feature: the entire image. This feature appears exactly once (since it is simply the entire image), so its probability is 1, meaning H = -1*log(1) = 0.
As we begin to distinguish between regions, by using a smaller delta, we’ll end up partitioning the image into more and more features. As a result, the entropy of the distribution of features will increase (since it increases as a function of the number of features), reaching the maximum at the point where we distinguish between each region within the image, thereby maximizing the number of features. Though there could be local maxima along the way, as a general matter, the entropy of the distribution of features will increase as we reduce delta, thereby distinguishing between more and more regions within the image, generating a finer partition of the image.
There should, however, be a point along the way at which we actually uncover the “real” structure of the image, which means that the partition should be roughly stable for some interval near that point. Put loosely, the real structure of the image should be “in focus” within some small window around the objectively correct partition.
These algorithms attempt to find that correct partition by choosing the value of delta for which the derivative of the entropy of the distribution of features, as a function of delta, is minimized. This is the point at which the structure of the features, and the number of the features, changes the least as we alter delta, and where we assume the structure of the image is most likely to be “in focus”.
I have attached examples showing the structure identified by these algorithms. The images attached to this post have been weighted by their S-score, so that the most notable features are the brightest, and the least notable the darkest. The S-score is, in short, a measure of how notable a region is in terms of its information content. The shading here “paints” a feature according to the average S-score over all regions within the feature.
Finally, once that pre-processing is complete, another algorithm goes through the features that have been identified in one image and compares them to the features identified in another, using six statistical properties of the features, each of which is invariant under rotation:
(1) The partition score N, which is the point at which the “information contrast” of the feature is maximized. See previous posts for more on this.
(2) The standard deviation of the entropy of the feature, as calculated using the partition score N. This is a measure of how much variation there is in the statistical color structure of the feature. A “boring” feature will have a low score, and a “complex” feature that changes color distributions often will have a high score.
(3) The average entropy of the feature, as calculated using the partition score N. This is a measure of how evenly distributed the color information is, with features that have a very concentrated color distribution scoring low, and features that have a very wide distribution of colors scoring high.
(4), (5), and (6) The average colors across the R, G, and B channels of the feature. These are used to capture the “actual” average color of the feature.
The norm of the difference of the resultant 6-vector for the two regions being compared is then calculated, and averaged over all features, weighted by the importance of the region, as measured using the “S-score” I presented in previous posts. The larger the resultant score, the less similar the two images are. Categorization can then be achieved by comparing the scores generated when one image is compared to a series of images from two or more different categories, allowing us to predict to which category the input image most likely belongs.
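A sketch of this comparison step in Python, assuming the features in the two images have already been matched up pairwise (the matching procedure and the exact weighting in the attached Octave code may differ; all names here are illustrative):

```python
import numpy as np

def feature_vector(N, ent_std, ent_avg, r, g, b):
    """Assemble the six rotation-invariant statistics described
    above into a single signature vector for one feature."""
    return np.array([N, ent_std, ent_avg, r, g, b], dtype=float)

def image_difference(features_a, features_b, s_scores):
    """Average the norms of the differences between matched feature
    vectors, weighted by each feature's S-score. Larger scores mean
    less similar images."""
    norms = [np.linalg.norm(a - b) for a, b in zip(features_a, features_b)]
    w = np.asarray(s_scores, dtype=float)
    return float(np.average(norms, weights=w))
```

Categorization then amounts to comparing an input image against labeled examples and picking the category with the smallest average score.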
Pre-processing an image using the attached algorithms takes anywhere between a few seconds and a few minutes (on a very low-fi laptop), depending upon the size of the image, and its complexity.
Comparing two pre-processed images takes fractions of a second, making processing time for comparing pre-processed images generally negligible. This suggests that these algorithms would likely be able to process video in real time, or close to it, on an industrial machine. This would allow, for example, live video to be compared to pre-processed databases of images, in order to determine the most likely subject matter of the features within the video.
The code should be called from the Octave command line as follows:
I = imread("[FILE PATH]");
[region_matrix N S_mat feature_score_vector] = identify_features(I);
To compare two images that have been processed, use the following:
average_score = IMG_Comparison(region_matrix_1, N_1, S_mat_1, feature_score_vector_1, region_matrix_2, N_2, S_mat_2, feature_score_vector_2)