Data Collections Functions

JavaTools also provides four functions to analyze and process the data in the JavaTools collections directly in Java: JMin[], JMax[], JDisjoint[], and JFrequency[]. They compute the minimum of a collection, the maximum of a collection, whether two collections are disjoint, and the number of occurrences of a specified element in a collection. Although these results can easily be determined with Mathematica functions when working with Mathematica lists, the JavaTools functions operate directly in Java on the Java data structures (they use java.util.Collections internally). This can mean an enormous performance improvement for very large data collections, as

- the operation happens entirely in the Java virtual machine and not in the kernel

- the Java collections framework was written to handle data sizes in the Terabyte magnitude for entire databases
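
Since the text above states that these functions rely on java.util.Collections, the following is a minimal Java sketch of the underlying calls that JMin[], JMax[], and JFrequency[] presumably delegate to. The class name CollectionStats, its helper methods, and the sample data are hypothetical and chosen only for illustration; how JavaTools actually wraps these calls is not shown in this document.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class; only the java.util.Collections calls are standard API.
class CollectionStats {
    // Smallest element according to the natural ordering of the elements.
    static <T extends Comparable<? super T>> T minOf(List<T> data) {
        return Collections.min(data);
    }

    // Largest element according to the natural ordering of the elements.
    static <T extends Comparable<? super T>> T maxOf(List<T> data) {
        return Collections.max(data);
    }

    // Number of elements in the list that are equal to the given element.
    static int countOf(List<?> data, Object element) {
        return Collections.frequency(data, element);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(7, 3, 9, 3, 1, 3); // illustrative data
        System.out.println(minOf(data));      // 1
        System.out.println(maxOf(data));      // 9
        System.out.println(countOf(data, 3)); // 3
    }
}
```

All three calls traverse the collection inside the Java virtual machine, so no data has to be transferred to the kernel to obtain the result.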

JavaTools effectively allows you to manage data structures that are too large for the kernel to handle or store, while using only top-level Mathematica functions to perform these operations. For small data structures, using the Stack, Queue, Set, Bimap, MultiSet, and MultiMap from JavaTools doesn't increase performance at all (in fact, it may even be slightly slower than plain Mathematica), but for large data sizes the performance increase can be enormous, because all operations happen in the Java virtual machine rather than in the kernel, and the Java Collections framework was written with very large data sets in mind.

Note that by default the JavaTools functions that return lists or collections return them as Mathematica lists. To process the data further with the underlying Java functions (rather than with Mathematica functions), the collections must be returned as Java object references; otherwise the functions from the Collections class have no reference to operate on. For this purpose, these functions accept an optional second Boolean argument: if it is set to True, they return a Java object reference instead of a Mathematica list containing the data.

Here we store all US cities in a (Mathematica) list:

In[2]:=

This creates a new Multimap:

In[3]:=

Out[3]=

Next we store all {state, city} pairs of the US cities in the Multimap:

How many cities named Franklin are there in the US?

How many cities named Miami are there in the US?

All cities in Rhode Island:

Of course we can also reverse the mapping and store the states in which a city with a specified name occurs:

In[4]:=

Out[4]=

In which states are cities called Franklin?

In which states are cities called Miami?
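
The forward (state to cities) and reversed (city to states) mappings described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class SimpleMultimap is hypothetical and is not the JavaTools Multimap implementation, and the six {state, city} pairs are a small hand-picked sample, not the full list of US cities used in this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multimap; not the JavaTools implementation.
class SimpleMultimap {
    private final Map<String, List<String>> entries = new HashMap<>();

    // Append a value to the list stored under the key, creating the list on first use.
    void put(String key, String value) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // All values stored under the key, or an empty list if the key is unknown.
    List<String> get(String key) {
        return entries.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        // Small hand-picked sample of {state, city} pairs, not the full data set.
        String[][] pairs = {
            {"Tennessee", "Franklin"},
            {"Massachusetts", "Franklin"},
            {"Florida", "Miami"},
            {"Oklahoma", "Miami"},
            {"Rhode Island", "Providence"},
            {"Rhode Island", "Newport"}
        };

        SimpleMultimap citiesByState = new SimpleMultimap();
        SimpleMultimap statesByCity = new SimpleMultimap();
        for (String[] pair : pairs) {
            citiesByState.put(pair[0], pair[1]); // forward mapping: state -> cities
            statesByCity.put(pair[1], pair[0]);  // reversed mapping: city -> states
        }

        System.out.println(citiesByState.get("Rhode Island")); // [Providence, Newport]
        System.out.println(statesByCity.get("Franklin"));      // [Tennessee, Massachusetts]
        System.out.println(statesByCity.get("Miami").size());  // 2
    }
}
```

With the reversed mapping, counting the cities that share a name reduces to asking for the size of the list stored under that name.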

The JavaTools function JDisjoint[] returns True if two collections are disjoint, i.e. have no elements in common, and False if they are not disjoint, i.e. share at least one element.
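
As a rough sketch of the behavior just described, the following Java example calls java.util.Collections.disjoint directly, which is presumably what JDisjoint[] uses given that these functions are based on java.util.Collections. The class name DisjointDemo, the helper method, and the sample sets are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; only Collections.disjoint is standard API.
class DisjointDemo {
    // True if the two sets have no element in common.
    static boolean noCommonElements(Set<String> a, Set<String> b) {
        return Collections.disjoint(a, b);
    }

    public static void main(String[] args) {
        // Illustrative sample sets of state names.
        Set<String> northeast = new HashSet<>(Arrays.asList("Maine", "Vermont", "Rhode Island"));
        Set<String> south = new HashSet<>(Arrays.asList("Florida", "Georgia"));
        Set<String> mixed = new HashSet<>(Arrays.asList("Florida", "Maine"));

        System.out.println(noCommonElements(northeast, south)); // true
        System.out.println(noCommonElements(northeast, mixed)); // false ("Maine" is in both)
    }
}
```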

In[5]:=

Out[5]=

In[6]:=

Out[6]=

In[7]:=

Out[7]=

In[8]:=

Out[8]=

In[9]:=

Out[9]=

Comparison with Mathematica

Of course, these examples could have been written with an efficient Mathematica program as well. However:

- These collection functions are convenient top-level functions that let the user work with these data structures without any Mathematica programming: data can simply be stored, retrieved, accessed, and evaluated using the familiar paradigms of data structures such as Sets, MultiSets, BiMaps, Multimaps, Stacks, and Queues.

- The collections in JavaTools are references to the data structures in the Java virtual machine, thereby making it easy to exchange them between Mathematica and other Java applications in both ways. One could create (or process) these data structures with external Java applications, eliminating any "cross-border" barriers between Mathematica and Java (as a Java symbol in Mathematica is merely a reference to that object in the Java virtual machine, not the actual data!).

- The user can specify the amount of memory to be used by the Java virtual machine (JavaTools supports this through the optional command-line setting in the configuration file; see the documentation). When the kernel is "getting full" and no longer releases memory, but the computer system still has plenty of memory available, it makes sense to "outsource" the storage of the data to the Java virtual machine.

- Depending on the quality of the Mathematica program written to replicate these structures, executing the operations in the Java virtual machine may result in enormous performance gains, because code that runs in the Java virtual machine runs at compiled speed. If the structures were written in Mathematica code, the data would only be accessible from within the kernel, and Mathematica would interpret the statements rather than execute them as compiled code.
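
To check how much memory the Java virtual machine is actually allowed to use (the memory setting mentioned in the list above), a small Java sketch like the following can help. It assumes the standard JVM option -Xmx for the maximum heap size; how JavaTools passes such an option through its configuration file is not shown here, and the class name HeapInfo is hypothetical.

```java
// Hypothetical helper class; Runtime.maxMemory() is standard Java API.
class HeapInfo {
    // Maximum amount of heap memory this JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as, for example, `java -Xmx4g HeapInfo` should report a value close to 4096 MiB, while running it without the option reports the default chosen by the JVM for the machine.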