History
First of all, we know that there are many good relational databases available. “Why did you create another one?” is a fair and valid question. The answer is somewhat involved, but it has nothing to do with us not understanding what is on the market.
BlackRay was designed in 2004 to solve a very specific problem. We were facing a project in which more than 100 million records, each about 50 columns wide, had to be searched. The data in question was yellow-pages and white-pages data for all of Germany.
The basic non-functional requirements were a sustained load of at least 100 queries per second per server, and no query taking longer than 1 second. We tried just about every database we could get our hands on, including MySQL, ORACLE, DB2, Fast Search and Transfer, Thunderstone…
After many painful mistakes, it became obvious that the parameters for this project required an in-memory index. Again, we went back to evaluate the available options. ORACLE had just acquired TimesTen, and memCacheD, mcObject, and ExorByte were available, plus some more obscure ones that we could not get into a useful comparison test.
Not wanting to lose the project, we decided to build exactly what the existing systems were lacking. In addition to the performance criteria already mentioned, the basic design parameters were the following:
- Lightweight administration
- Relational structure, to enable a somewhat normalized data model
- Full data load (100 million records) in less than three hours
- Periodic updates (half a million per day) in less than half an hour
- Data must be persisted to disk, so that operations can resume after a failure
- Scalability in terms of throughput, without the need for a complicated setup
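To put the two load requirements into perspective, the following back-of-the-envelope calculation derives the sustained write rates they imply. It uses only the figures from the list above; the function and variable names are made up for this illustration.

```python
# Figures taken from the design parameters listed above.
FULL_LOAD_RECORDS = 100_000_000      # full data load
FULL_LOAD_SECONDS = 3 * 60 * 60      # "less than three hours"
DAILY_UPDATE_RECORDS = 500_000       # periodic updates per day
DAILY_UPDATE_SECONDS = 30 * 60       # "less than half an hour"


def required_rate(records, seconds):
    """Minimum sustained records per second needed to finish in time."""
    return records / seconds


full_load_rate = required_rate(FULL_LOAD_RECORDS, FULL_LOAD_SECONDS)
update_rate = required_rate(DAILY_UPDATE_RECORDS, DAILY_UPDATE_SECONDS)

print(f"full load: at least {full_load_rate:,.0f} records/s")
print(f"updates:   at least {update_rate:,.0f} records/s")
```

The full load therefore needs roughly 9,300 records per second sustained, including index construction, while the daily update needs only about 280 records per second.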
It turned out that the data load requirements were the hardest to fulfill. For some reason, this type of operation does not receive much attention in most database designs. However, bulk data updates are a common issue, and we have seen operators who need to do monthly full index rebuilds. It does no good if the index rebuild takes several days and puts the entire system out of order.
Regarding search features, directory searches have some requirements that are difficult to meet with a regular index. First, token-sensitive search is quite important. People, and especially companies, frequently have names with multiple tokens. Descriptions (like full-text data) also need to be tokenized, and possibly stemmed, to make them searchable.
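As a rough illustration of token-sensitive search, the sketch below builds a small inverted index over multi-token names, so that a query matches regardless of token order. This is not BlackRay code; the function names and sample records are invented for the example, and stemming is left out entirely.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def build_token_index(records):
    """Map each token to the set of record IDs whose name contains it."""
    index = defaultdict(set)
    for record_id, name in records.items():
        for token in tokenize(name):
            index[token].add(record_id)
    return index


def search_all_tokens(index, query):
    """Return the IDs of records that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data, for illustration only.
records = {
    1: "Müller & Söhne Bäckerei GmbH",
    2: "Bäckerei Schmidt",
    3: "Müller Autohaus",
}
index = build_token_index(records)

print(search_all_tokens(index, "bäckerei müller"))
print(search_all_tokens(index, "Bäckerei"))
```

With an untokenized index on the whole name, the first query would find nothing, because no name starts with "bäckerei müller".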
In our use case, where directory searches are carried out in a call center with live agents, all searches are abbreviated. Good agents will type two to three letters into each of the name, street, and city fields and run a query. In most databases this results in a serious problem. Typical databases will perform an index scan for the first (probably permuterm-indexed) column, but will perform a table scan for the other conditions of the query. Even worse, if the query spans two tables, the entire cross-product of possible row IDs either has to be pre-computed or again be derived from a full table scan. The complexity of this query is quadratic in the size of the smaller table.
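The abbreviated query described above can be answered without a table scan if every searched column has its own prefix-searchable index and the resulting row-ID sets are intersected. The following is a minimal single-table sketch of that idea, assuming a sorted in-memory term list per column. It is not BlackRay's actual index structure, and `PrefixIndex`, `search`, and the sample rows are hypothetical names for this illustration.

```python
from bisect import bisect_left


class PrefixIndex:
    """Sorted (term, row_id) pairs for one column, searchable by prefix."""

    def __init__(self, values):
        # values: dict mapping row_id -> column value
        self.entries = sorted(
            (value.lower(), row_id) for row_id, value in values.items()
        )
        self.terms = [term for term, _ in self.entries]

    def lookup(self, prefix):
        """Return the row IDs whose value starts with the given prefix."""
        prefix = prefix.lower()
        rows = set()
        # All terms sharing the prefix are contiguous in the sorted list.
        i = bisect_left(self.terms, prefix)
        while i < len(self.terms) and self.terms[i].startswith(prefix):
            rows.add(self.entries[i][1])
            i += 1
        return rows


def search(indexes, query):
    """Intersect the per-column prefix matches; query maps column -> prefix."""
    result = None
    for column, prefix in query.items():
        rows = indexes[column].lookup(prefix)
        result = rows if result is None else result & rows
    return result if result is not None else set()


# Hypothetical sample data, for illustration only.
rows = {
    1: {"name": "Schmidt", "street": "Hauptstrasse", "city": "Berlin"},
    2: {"name": "Schneider", "street": "Hauptstrasse", "city": "Bremen"},
    3: {"name": "Schulz", "street": "Bahnhofstrasse", "city": "Berlin"},
}
indexes = {
    column: PrefixIndex({row_id: row[column] for row_id, row in rows.items()})
    for column in ("name", "street", "city")
}

print(search(indexes, {"name": "sch", "street": "hau", "city": "ber"}))
```

Each condition costs one binary search plus a walk over its matching terms, so no column falls back to scanning the table. A real engine would additionally need compact row-ID sets and a strategy for very short prefixes, which match large parts of the index.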
It is, however, important to note that we originally had no intention whatsoever of making this into a database. In fact, even today, we call it a data-engine. In design and features, we are much closer to a storage engine than to a real database. The query API is object-oriented, and connections use a special high-speed toolkit (ZeroC ICE). The index is essentially a combination of the main topics discussed in the excellent book “Introduction to Information Retrieval”. (Too bad we did not have this book in 2004…)
Eventually, further project requirements forced us to add more and more of the features found in all databases today. Regular SQL, command line tools, and SNMP monitoring were all added in the final two years of the project. In late 2008, we decided to open source the entire effort.


