Expected Intersection, Missing Data, and Compression

In a previous research note, I introduced a method of quickly calculating an expectation value for the intersection count between any two given sets randomly selected from a family of sets. I’ve been doing work on the MovieLens dataset, and though I’ll soon post a more fulsome research note on the matter, I realized that the expected intersection calculation can be used to measure how much mutual information is contained on average between vectors in a dataset.

For example, the MovieLens dataset consists of around 200,000 movies rated by 610 users. However, a given user generally rates only a small fraction of those movies. We could simply take the entire dataset, but because it’s 200,000 dimensions, this will impact performance, and because each user doesn’t generally rate all of the movies, we’re probably doing far more work than is necessary. That is, we can almost certainly find a subset of those 200,000 dimensions that will give us enough information to meaningfully analyze the dataset.

The solution I’ve come up with is to sort the movies in the order of how many ratings they receive. That is, the top of this list is the movie that was rated by the most users. This is not necessarily the most popular movie, but rather, the movie that the most users took the time to rate.

As we proceed down this list, the probability that a given user rated the movie in question declines. As a result, each movie contributes less and less to the information about the dataset as we move down the list, simply because, by definition, fewer users have reviewed it.

We can then, for example, take a subset of the original dataset that corresponds to the top 100 movies in terms of their review count. Again, these are not necessarily the 100 most popular movies, but are instead the top 100 movies in terms of how many people reviewed them.

Even though these are the most rated movies, it’s still possible that the resultant subset of the dataset contains missing data. That is, just because a movie is frequently rated, doesn’t mean that all users rated it.

If we shape the dataset as a data matrix where each row corresponds to a user, and each column in a given row contains the rating the user gave to a particular movie, then we can then test which column entries are non-zero (i.e., non-empty), which will produce a binary matrix, which we can then feed to the expected intersection algorithm. This will calculate the expected number of overlapping, non-empty entries between any two rows from the dataset.

This is, therefore, a measure of the number of overlapping dimensions we’re expected to have between any two vectors from the dataset (i.e., any two rows from the data matrix).

We can then iterate the number of top movies sampled, and then measure the expected intersection count for each resultant subset of the dataset, which will allow us to measure how much information we’re getting in return for adding dimensions to our dataset. Below is a chart showing the expected intersection count for the MovieLens dataset as a function of the number of dimensions, where dimensions were added in the order of their rating count (i.e., dimension one is the most rated movie).


Dimension Compression

There is, however, no objective stopping point to this process. That is, the expected intersection between dimensions will continue to increase as we add more dimensions, and though we can measure the derivative of this function, there’s no obvious point at which we can say, this is the “right” amount of overlap between dimensions.

We can, however, apply my categorization algorithm (or some other categorization, or clustering technique) at each iteration, measure the entropy of the categorization, and find the number of dimensions that generates the greatest change in the structure of the categorization.

This is exactly what my core categorization algorithm does, except in that case, entropy is measured as a function of the level of distinction, \delta. In this case, we’d be measuring the rate of change in the entropy of the categorization as a function of the number of dimensions used to generate the categorization.

So in short, we sort the dimensions of a dataset by frequency as described above (i.e., the most commonly used dimensions go first), and begin by categorizing the data using only the most common dimension (or some other small number of dimensions), and measure the entropy of that categorization. Then, we add the next dimension, categorize the dataset using those dimensions, and measure the entropy of the resultant categorization. We repeat this process up to a predefined maximum, which will presumably depend upon performance constraints. So if the machine in question can handle the full dimension of the dataset, then we can probably iterate up to that full number of dimensions.

Note that at each iteration, we let the categorization algorithm (or other clustering technique) run to fruition, producing a complete categorization of the dataset using the number of dimensions in question.

By selecting the number of dimensions that generates the greatest change in the entropy of the categorization structure, we will select the number of dimensions that unlocked the greatest amount of structure in the dataset – i.e., the most relevant dimensions of the dataset.

Now, it would be fair to say that this process actually requires more work than simply categorizing the data once, using all of the dimensions in the dataset, and that is correct. That is, by definition, this process will repeatedly categorize the dataset for every dimension from 1 up to and including N (i.e., the actual number of dimensions). This is obviously more work than simply categorizing the dataset once using all N dimensions.

However, we can use the expected intersection to select a representative subset of the dataset, and apply this process to that subset, rather than the entire dataset. Specifically, we can randomly select vectors from the dataset, and test the expected intersection for that subset of vectors, and iteratively increase the number of vectors selected until we hit an expected intersection count that is sufficiently close to the expected intersection count for the entire dataset (and sufficiently stable as we add more vectors). That is, we add vectors until the expected intersection for the subset is close to, and stable around, the expected intersection for the entire dataset.

We can also test the individual frequencies of the dimensions from this subset of vectors during this process, which is also produced by the expected intersection function. Specifically, we can calculate the norm of the difference between the frequencies of the subset and the frequencies of the entire dataset.

This ensures that we have a number of vectors with dimensions that overlap approximately as much as the entire dataset (as measured by the expected intersection count), and dimensions with approximately the same frequencies of usage as the original dataset (as measured by the frequencies of the dimensions calculated by the expected intersection function in the “weight vector”).

Put together, this will allow us to find the optimum dimension for the dataset without having to repeatedly categorize the entire dataset. Finally, note that we could apply this process to any resultant data structure capable of generating a measure of entropy. For example, we could apply this process to my  “generate data tree algorithm“, which generates a tree of subcategories of a dataset, measuring the entropy of the resultant data tree after each iteration over the dimensions.

Measuring Missing Information

We can also use these ideas to get a sense of how complete a dataset is by measuring the expected intersection on the full dataset, and dividing by the number of dimensions in the full dataset. For example, if the dimension of our dataset is 100, and the expected intersection over the full dataset (as described above) is 10, then the measure would in this case be .1, or 10%. This means that, on average, only 10% of the dimensions of the dataset will be relevant when comparing two vectors from the dataset, since only 10% of them are expected to overlap.

In the extreme case when the expected intersection is 0, then in some sense, we don’t really have a dataset – we have collection of independent observations. For example, if our dataset consists of two vectors (1,2,{}) and ({}, {}, 1), where {} indicates a missing entry, then there is no overlap between the dimensions of those two vectors. This means that without any other information about what the vectors represent, we can’t compare them. In contrast, if the two vectors were (1,2,3) and (4,5,1), then we can of course compare every dimension of the vectors.

Since the expected intersection allows us to form an expectation value as to the number of overlapping dimensions between any two vectors in a dataset, we can in turn measure the completeness of the dataset.

In the language of the theory of AI that I’ve been developing, I would say that this method allows us to measure the extent to which a dataset provides its own context. When the ratio of the expected intersection and the dimension of the dataset is 1, then the dataset provides the maximum context possible without any exogenous information about the dataset. When the expected intersection is zero, then the dataset provides no context, and is instead, ex ante, assuming no other prior information, a collection of unrelated observations.


Dynamic Matrices in Matlab and Octave

For other users of Matlab / Octave, I’ve noticed that dynamically increasing the size of a matrix in Matlab is slower than initializing the matrix to a size that is certain to be larger than necessary, and then deleting the unused entries.

I work in Octave, so perhaps Matlab is different, but the impact on runtime is measurable in my experience, so I thought I’d share this observation.

Categorizing and Predicting Discrete Data

In this article, I’m going to introduce a set of algorithms that can form categorizations on, and predict, discrete data comprised of sets. Their efficiency appears to be unprecedented: running on an iMac, they categorized a dataset of 200,000 vectors, with a dimension of 100 items per vector, in 13 seconds, with 100% accuracy.

I’m in the process of looking for a more formal dataset (e.g., Netflix preferences) to test them on, but nonetheless, using my own datasets, the results are impressive.

The code and training data is available on my researchgate blog.

There’s already another algorithm in my code archive that can form categorizations on datasets that consist of discrete sets, (see, “optimize_categories_intersection_N”). In terms of how it operates, this “intersection algorithm” is no different than the Euclidean algorithm that I introduced in my main research paper on AI, but the intersection algorithm uses discrete set intersection, rather than the Euclidean norm operator, to compare the difference between two elements of a dataset. That is, the Euclidean algorithm assumes the dataset is comprised of vectors, and so it compares two elements by taking the norm of the difference between those two elements. In contrast, the intersection algorithm compares two elements by treating each element as a set, and taking the intersection of those two sets. For example, if A = \{1, 2, 3\} and B = \{2, 3, 4\}, then the “distance” measure on those two sets would be |A \cap B| =| \{2, 3\}| = 2.

However, the “optimize_categories_intersection_N” algorithm makes use of a novel form of intersection that I’ve developed called the \delta-intersection operator that allows for the intersection between sets that contain vectors that are close in value, as opposed to exactly the same symbol. So, for example, the \delta-intersection operator allows us to calculate the intersection between sets A = \{1.001, 2.0002\} and B = \{1.004, 2.002, 3.0005\}. If \delta = .1, then |A \cap_{\delta} B| = 2, since |1.001 - 1.004| < \delta and |2.0002 - 2.002| < \delta. I developed this operator in the context of color processing, where you might want to take the intersection count between two sets of colors, and say close enough is good enough to constitute a match. For an explanation of how we determine the appropriate value of \delta, see my research note on vectorized image partitioning.

In this article, I’m going to present an otherwise identical categorization algorithm that makes use of ordinary intersection. But unlike the vectorized implementation for the \delta-intersection operator, which I’m quite proud of, the implementation of the intersection operator used in the algorithm I’ll present in this note is trivial and almost certainly widely used, which I’ll nonetheless explain below. As a result, in terms of code, there’s really nothing new in this algorithm when compared to the original categorization and prediction algorithms that I discuss in my main research paper on AI.

However, as a theoretical matter, using the intersection operator creates a completely different measure on the dataset when compared to Euclidean distance, since in Euclidean space, two elements are similar if they’re near each other when represented as points, which implies that the norm of their difference will be small. In contrast, if two sets are similar, then they have a lot of elements in common, which will produce a large intersection count. So even though the Euclidean categorization algorithm and intersection algorithm both operate in the exact same manner, using almost identical code, as a theoretical matter, the intersection algorithm is worth discussing, since it generates some interesting mathematics if we think about the intersection operator as a measure of similarity, or distance, on sets.

Moreover, because of the way these algorithms work, we need to have an expectation value for the intersection count between any two sets. We can of course actually calculate the intersection count between every pair of sets in a given dataset and take the average, but this requires O( {{N} \choose {2}} ) calculations, where N is the number of items in the dataset. The calculation of ordinary intersection can be vectorized trivially, so this is by no means intractable. But, using set theory, we can approximate this number using O(N) calculations, which on a low-end device, could mean the difference between an algorithm that runs in a few seconds, and one that runs in a few minutes.

Measuring the Difference Between Sets

There is no obvious geometric space in which we can visualize sets that would allow us to say, for example, that two sets are “close” to each other. This implies that the ordinary intuition for categorization, which comes from clustering points that are near each other in space, doesn’t work for categorizing sets. Instead, the intersection algorithm relies upon the intersection count to measure similarity between sets, with no notions of distance at all.

Consider a dataset that consists of consumer purchases at a local supermarket, where each element of the dataset represents the items purchased by a particular consumer. This type of data doesn’t need a location in Euclidean space to be represented. Instead, all we need is a set of distinct but otherwise arbitrary symbols that are sufficient in number to represent the items in question.  So if the supermarket sells 20 items, then we could use the names of the actual products (assuming they’re distinct), the numbers 1 through 20, the letters A through T, \log(20) bits, or any other set of 20 distinct symbols, or any other system that can be in 20 distinct states.

Assume that one of the elements is A = \{apple, banana, cereal\}. Any dataset that consists of sets like A would be a family of sets, and as a result, we can compare elements within the dataset using the set intersection operator. For example, if B = \{apple, banana, bread\}, then |A \cap B| = 2, whereas if C = \{apple, orange, tomato\}, then |A \cap C| = 1, and so we can reasonably say that A and B are more similar to each other than A and C. In general, the larger the intersection count is between two sets, the more similar the two sets are.

If a dataset consists of N individual sets, each of which could contain any one of M items, then we can represent the dataset as a binary matrix of N rows, and M columns, where if entry (i,j) = 1, then set i contains item j, and otherwise, the entries are 0.

We can then calculate the union of any two sets by simply taking the sum of their rows, and testing which entries are greater than zero. Similarly, the intersection is simply the product of corresponding entries. So if A = \{1,2,5\} and B = \{2,7\}, and there are 7 seven items in the dataset, then we would represent A as the row vector (1,1,0,0,1,0,0) and B as the row vector (0,1,0,0,0,0,1).

We can then calculate their union as,

A \cup B = ((1,1,0,0,1,0,0) + (0,1,0,0,0,0,1)) > 0

= (1,2,0,0,1,0,1) > 0

= (1,1,0,0,1,0,1) = \{1,2,5,7\}.

Similarly, we can calculate their intersection as,

A \cap B = ((1,1,0,0,1,0,0).*(0,1,0,0,0,0,1))

= (0,1,0,0,0,0,0) = \{2\}.

Note that in Matlab, the (.*) operator takes the product between the corresponding entries of two vectors.

Generating Categories

The fundamental approach used in the intersection algorithm is exactly the same as the approach used in all of my algorithms, which is to iterate through different levels of distinction, measuring the entropy of the resultant structure, and then choosing the level of distinction at which the rate of change in entropy is maximized. The intuition for this process is easier to understand in the context of image processing, which actually served as the impetus for all of these algorithms in the first instance.

Consider the image of the red brick wall below, and assume that we want to find the boundaries of the objects in the image. We could accomplish this by identifying significant changes in colors, which in turn requires answering the question of how different two colors have to be in order to distinguish between them. Intuitively, this should depend upon the context of the image. For example, if an image consists solely of very similar colors, then a small change in color could indicate a boundary. In contrast, if an image consists of wildly different colors, then it’s probably only a large change in color that indicates a boundary.


The solution to this problem, which underlies nearly all of my work in AI, is to pick an initial level of distinction \delta, and then gradually increase that value, measuring the rate of change in the entropy of the structure in question as a function of \delta. The point at which the entropy changes the most is the point at which the level of distinction produces the greatest change in structure. Therefore, my algorithms treat this value of \delta as the level of distinction that is best suited for analyzing an object, or dataset. In the case of boundary detection, we’d increase our level of distinction between colors, and measure the entropy of the structure of the boundaries as a function of our level of distinction \delta. As we iterate, there will be a point where the complexity of the boundaries “pops”, causing a spike in entropy, and it is at this value of \delta that the boundary detection algorithm effectively terminates.

In the case of the intersection algorithm, the goal is to generate categories of sets, grouping together sets that are sufficiently similar to warrant co-categorization. Further, our measure is an intersection count, so rather than asking whether the difference between two sets is less than some \delta, we will instead ask whether the intersection count between two sets is at least \delta. That is, the question is restated as whether two sets have enough elements in common to warrant co-categorization.

The method used by the intersection algorithm is otherwise the same as the Euclidean algorithm, in that it picks an initial anchor set from the dataset, and then iterates through the entire dataset, testing whether the intersection count between each set and that anchor set is at least \delta. If the intersection count between a given set and the anchor set is at least \delta, then the set in question is added to the category associated with that anchor set. In short, if a given set has at least \delta elements in common with the anchor set of a category, then we add that set to the category. We do this until all sets have been assigned to exactly one category, test the entropy of the resultant categorization, and then repeat this process for a new and increased minimum intersection value \delta, selecting the categorization associated with the value of \delta that produces the greatest change in entropy.

The Expected Set

We will of course need to determine the range over which the values of \delta iterate. The Euclidean algorithm uses a function of the standard deviation of the dataset as the maximum \delta, and the minimum \delta is the maximum \delta divided by the number of iterations. By analogy, I’ve used the average of the intersection counts over the entire dataset with what I call the expected set, which is a mathematical construct, and not an actual set of elements. This calculation tells us, on average, what the intersection count would be between a given set, and another set that contains a typical collection of elements from the dataset.

For example, assume that the dataset consists of the family of sets F = \{ \{1, 2, 3 \}, \{2,14 \}, \{1, 2, 5 \}, \{2, 6\} \}. The union over all of the sets in F is U = \{1, 2, 3, 5, 6, 14 \}, and every item in the union occurs once in the dataset, except for 1, which occurs twice, and 2, which occurs four times. If we randomly select a set from F, the probability of selecting a set that contains 1 is \frac{1}{2}, since 1 appears in two of the four sets. As a result, if a set contains a 1, then the expectation value of that item’s contribution to the intersection count between that set and a random set chosen from F is \frac{1}{2}. If a set contains a 2, then the expectation value of that item’s contribution to the intersection count is 1, since 2 occurs in every set.

More generally, we can take the intersection between a given set and the union over F, and weight each item by the weights just described, where the weight of an item is the number of instances of that item in the dataset, divided by the number of sets in the dataset. Returning to the example above, the union over F is   U = \{1, 2, 3, 5, 6, 14 \}, and the corresponding weights are W = \{ \frac{1}{2}, 1, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4} \}. The intersection count between A = \{1,2\} and the expected set over F is the intersection count between A and U, as weighted by the elements of W, which is in this case \frac{1}{2} + 1 =  \frac{3}{2}.

Formally, the intersection count between a given set and the expected set is really a function on the elements generated by the ordinary intersection between that set and the union over the applicable family. For example, first we calculate the intersection between A and U, which is the set \{1, 2\}. Then, we apply a function that maps the elements of that set to the elements of W, which are in this case \{ \frac{1}{2}, 1 \}. Then, we take the sum over that set, which I’ve defined as the intersection count between the set in question and the expected set, which is in this case \frac{3}{2}.

If we actually take the intersection count for every pair of sets in F, and then take the average, it comes out to \frac{7}{6} = 1.1667 elements. If we calculate the average intersection with the expected set for each of the four sets, it comes out to \frac{3}{2} = 1.5 elements. On a percentage-wise basis, there is a significant difference between these two values, but nonetheless, this method allows us to generate a reasonable approximation of the average intersection count between every pair of elements with only O(N) calculations (note that we can generate the union by taking the sum over the rows of the entire dataset, which requires O(N) calculations).

So to summarize, we begin with a dataset comprised of sets, calculate the expected intersection count using the method outlined above (or actually calculate the average intersection), and then iterate \delta from the expected intersection divided by the number of iterations, up to the expected intersection, constructing categories for each value of \delta, and then select the categorization that is the result of the value of \delta that generated the greatest change in the entropy of the categorization.

Predicting Data

The steps outlined above will allow us to categorize data comprised of sets. We can also find the category of best fit for a new input set by testing the intersection count between that input set and each of the anchor sets generated by the categorization algorithm, and then selecting the category corresponding to the anchor set with which the input set had the greatest intersection count. As a practical matter, this would allow us to make predictions given a discrete set that we know to be part of a larger dataset that we’ve already categorized.

Returning to the example above, if we’ve categorized a dataset of prior consumer purchases, and a new set of purchases by a particular consumer fits best in category A, then it’s reasonable to conclude that the consumer in question might also, at some future time, consider purchasing the items contained within the other sets in category A.

But we can do better, by using the same technique that I used for the Euclidean algorithm, which is to repeatedly reapply the categorization algorithm to each of the categories generated by the first application, which will generate subcategories for each category. We keep doing this until the categorization algorithm no longer finds any further opportunities for subdivision. This process will generate a tree of subcategories, where the depth represents the number of times the categorization algorithm was applied.

We can then attempt to place a new input set into the tree, and the advantage of this approach is that the subcategories are by definition smaller than the categories within the top level categorization, which will allow us to search through a larger number of smaller categories for a match, thereby generating more accurate predictions. Because a match is determined by the intersection count, and the intersection operator is vectorized, this can be done quickly in at most O(M) calculations, where M is the number of categories in the tree, which is generally a low multiple of N, the number of items in the dataset.

Note that this method is distinguishable from testing whether the presence of one item in an input set implies that another item is likely to be in the input set (perhaps as a piece of missing information). So for example, if a consumer purchases kale, then perhaps it is the case, that as a general matter, that consumer will also probably purchase quinoa. We could look for correlations like this in a dataset of sets using vectorized processes. Specifically, recall that in this case, each row corresponds to a consumer, and each column corresponds to an item. Each row will contain a sequence of  0/1’s that will indicate what items were purchased by each consumer. We can then find, for example, all the rows where column 1 contains a 1, which corresponds to the set of consumers that purchased item 1. We can easily take the sum over any other column we’d like from that subset of rows, which will tell us how many times a consumer that bought item one also bought some other item. This will allow us to produce a vectorized measure of correlation between items in the dataset.

But that’s not what this prediction algorithm does.

Instead, this prediction algorithm takes a set of consumer purchases, and says, as a general matter, to what category does this consumer belong based upon the entirety of the consumer’s purchases. This will allow us to generate an entire set of predictions, as opposed to testing whether one particular item implies that some other item is likely.

Specifically, if an input set maps to a particular category, then all of the items in that category are reasonable predictions for the input set, and we can of course sort them by frequency of occurrence, and as a general matter, use and present these predictions in a manner that is best suited for the particular application at hand. So, for example, if we know that a particular consumer likes a particular set of musical groups, then we can generate a set of suggestions based upon the entirety of the consumer’s preferences.

We could conceivably take into account the complete profile of a consumer, across food, music, zip code, credit range, political party, etc., and then categorize consumers based upon the entirety of the data that is available about them. We then can form expectation values for any missing information based upon the statistics of the category to which the consumer is mapped by the prediction function.

Vectorized Image Classification

In this post, I’m going to introduce a set of algorithms that can quickly classify images. Running on an iMac, it took just 1.363 seconds per image (on average) from start to finish (including all pre-processing time) to classify the dataset of images below. Note that this is the amount of time it took per image to train the classification algorithm, as opposed to the amount of time it would have taken to classify a new image after the algorithm had already been trained. Like all of my algorithms, these algorithms have a low-degree polynomial runtime.

Vectorized Boundary Detection

In a previous paper, I introduced a set of algorithms that detect the boundaries in an image by first partitioning an image into rectangular regions, and then scanning those regions from left to right, and top to bottom, and marking the locations of any likely boundaries based upon changes in color. How significant of a change in color is necessary to mark a boundary is discussed in that paper.

At a high level, the only difference between the algorithm I’ll present in this note (the code for which I’ve already posted here) and the algorithm I previously presented, is that this algorithm relies only upon the average color of each region, whereas the previous algorithm sampled about 100 colors from each region. As a result, this algorithm makes use of far less information, and is therefore much faster. However, because it is much faster, it can make use of a more fine-grained partition, which actually increases accuracy when compared to the previous version. The images below show a photograph of a wall that I took in Williamsburg, Brooklyn, together with the boundaries detected by the algorithm, which are drawn in black.

Image Classification

The first step of the classification process is to run the boundary detection algorithm on an image. This will produce two binary matrices: one representing the left to right boundaries, and another representing the top to bottom boundaries. If the (i,j) region of the image contains a boundary, then the (i,j) entry of one of the matrices will contain a 1. As a result, the boundary matrices contain location information about the boundaries in the image. A second function is then called that extracts that location information from the boundary matrices, which produces a collection of coordinates that represent the locations of the boundaries in the underlying image.

In summary, we run a boundary detection algorithm, and then extract the coordinates of those boundaries as a dataset of points, which provides shape information about the underlying image. We can visualize the boundary coordinates by simply removing the image, leaving only the boundaries, which is shown below.

Screen Shot 2019-07-15 at 7.22.13 PM

The MNIST dataset images are too small to be analyzed by this algorithm, so instead, I’ll present two datasets I’ve used previously: a dataset of 40 photos of handwritten A’s and B’s (20 of each), and a dataset of 40 photos of speakers and headphones (20 of each), each taken with an iPhone 6. As is evident, each dataset consists of two classes of images.

The classification is accomplished by first running the boundary detection algorithm on each image, extracting the point information from the boundaries, and then categorizing the resultant two-dimensional shapes as collections of points. In this case, I used the same categorization algorithm that I introduced in my main research paper on AI. Samples of the images are shown below, together with the boundaries detected by the algorithm.

I then test whether the categories generated are consistent with the hidden classification data. Specifically, if a category contains more than one class of image, then I treat that category as a fail. So, for example, if an image of a speaker is categorized together with four images of headphones, then that entire category of five images is treated as a fail. As a result, a category must be perfectly consistent with the hidden classification data in order to constitute a success. The accuracy is then the number of successes divided by the total number of categories. In this case, for the headphones / speakers dataset, the categorization algorithm was 100\% accurate, meaning all categories were perfectly consistent with the hidden classifiers. For the handwritten character dataset, the categorization algorithm had an accuracy of \frac{22}{23} = 95.65\%, since exactly 1 category out of 23 was inconsistent with the hidden classifiers.

All of the code necessary to run these algorithms is available on my researchgate page.

Discrete Data Categorization

Note that there’s an algorithm in my code archive that can form categorizations on datasets that consist of discrete sets, as opposed to Euclidean vectors (see, “Optimize_categories_intersection_N”).

This algorithm would be useful for categorizing data that consists of items with arbitrary labels, rather than positions. For example, datasets of consumer preferences (i.e., foods, music, movies, etc.).

I’m working on a research note explaining some theoretical aspects of the algorithm, but it’s no different than the “Optimize_categories_N” algorithm that I introduced in my main research paper. The only material distinction is that the intersection algorithm uses the intersection operator to compare items, rather than the Euclidean norm operator. That is, the “Optimize_categories_N” algorithm compares two elements by taking the norm of the difference between those two elements, whereas the intersection algorithm uses the intersection count between two sets.

I can see that the vectorized intersection image algorithm is quite popular (based upon the stats), so I thought I’d call attention to this algorithm as well, since it uses the exact same operator I discussed in the vectorized image partition article.

Chamber Music

I’ve just completed two original chamber music pieces, and though I don’t ordinarily post about my compositions, I’m extremely proud of this work, which you can find on my soundcloud page:

Fast Boundary Detection Algorithm

I’ve developed an extremely fast boundary detection algorithm, the code for which you can find on my researchgate page.

It’s based upon the same “big delta” concept I introduced in previous research notes here and here.

The algorithm operates in just a few seconds (often 2 to 3 seconds), and produces awesome results. I’ll follow up with a more formal research note explaining how it works in the next few days.

Here are a few examples: