Lily documentation

1 Lily Documentation (1.2)

1.1 What is Lily?

Lily is a scalable repository for storing, searching and retrieving records (or content items, documents, objects, ...). It is a distributed server application that fuses Apache HBase and Solr, and is designed to be used by front-end applications (CMS, DMS, DAM, ...) through the Lily API (Java or REST).

Getting started

To install Lily and give it a quick spin, see Running Lily. To get an overview of all available documentation, have a look at our sitemap.

This is the documentation for Lily 1.2. The documentation for other releases can be found through our documentation service.

2 Running Lily

2.1 About

This guide takes you through a first Lily experience, with a sample schema about books and authors. It only takes a few minutes, but it makes use of built-in versions of Hadoop/HBase, which means your data won't be saved between server restarts. It's a good way to familiarize yourself with the deployment of Lily before running it on a real install of Hadoop/HBase/ZooKeeper.

If at any point you run into problems, please let us know on the Lily mailing list.

2.2 Linux, Mac OS X, Windows

Linux is the only supported production platform for Hadoop.

For development purposes, you can also use other Unix-variants like Mac OS X.

Windows is not supported.

2.3 Java 1.6

You need to have Sun/Oracle Java 1.6 installed. An environment variable JAVA_HOME should point to where it is installed.

If everything is fine, you should be able to execute:

$JAVA_HOME/bin/java -version

and it should show something like:

java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) Server VM (build 17.0-b16, mixed mode)
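
If you want to script this check, the version banner can be parsed. The following is a minimal Python sketch and not part of Lily: the function names are illustrative, and only the 1.x banner format shown above is handled.

```python
import re

def parse_java_version(banner):
    """Extract (major, minor, patch, update) from a `java -version` banner.

    Only the classic 1.x banner format shown above is handled.
    """
    match = re.search(r'version "(\d+)\.(\d+)\.(\d+)(?:_(\d+))?', banner)
    if match is None:
        raise ValueError("unrecognized java -version output: %r" % banner)
    major, minor, patch, update = match.groups()
    return int(major), int(minor), int(patch), int(update or 0)

def is_java_16(banner):
    """Return True when the banner reports a Java 1.6 runtime."""
    return parse_java_version(banner)[:2] == (1, 6)

sample = 'java version "1.6.0_21"'
print(parse_java_version(sample))  # (1, 6, 0, 21)
print(is_java_16(sample))          # True
```

Note that java -version writes its banner to standard error, so a script that runs the command has to capture stderr rather than stdout.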

2.4 Downloading Lily

Download the Lily binary distribution: lily-1.2.1.tar.gz.

2.5 Starting Lily

For testing purposes, Lily ships with a command called launch-test-lily which starts Lily and all its dependent services in one JVM. The started services are: HDFS, HBase, MapReduce's JobTracker and TaskTracker, ZooKeeper, Solr and Lily-server itself.

So start this now as follows:

bin/launch-test-lily -s samples/books/books_sample_solr_schema.xml -c 5

The -s option specifies the Solr schema we need for our demo. The -c option specifies that the Solr index will be auto-committed every 5 seconds.

Wait a few moments for it to be started completely, until you see this:

-----------------------------------------------
Lily is running

This setup will store its data in a temporary directory which is lost each time you stop or restart launch-test-lily.

See further on for running against a 'real' HBase & co.

2.6 Create Field & Record Types

Before putting content in Lily, you need to create some field types and record types.

For the purpose of this first run, we will upload some types for managing books and authors using the import tool:

bin/lily-import -s samples/books/books_sample.json

The -s option specifies that we only want to upload the schema at this point (the JSON file contains records too).

Behind the scenes, this command connects to ZooKeeper to find out which Lily servers are available and picks one of them at random to talk to.

2.7 Define An Index

Define an index using:

bin/lily-add-index -n books -c samples/books/books_sample_indexerconf.xml -s shard1:http://localhost:8983/solr

The books_sample_indexerconf.xml file is the configuration for the indexer: it describes what records should be indexed and how the fields of the records should be mapped to Solr fields.

The lily-add-index command modifies the index configuration stored in ZooKeeper. In response, the Lily server(s) set up everything needed to keep the index up to date: they register a message queue subscription and start the indexing processes.

2.8 Loading Records Into Lily

Use the import tool to upload some records into Lily:

bin/lily-import samples/books/books_sample.json

2.9 Querying The Solr Index

Browse to

http://localhost:8983/solr/admin/

Type 'frankenstein' in the input box and press search. You should get a result with one document in it. In some browsers you need to view the page source to see the XML result.

As mentioned above, it can take up to 5 seconds for the new records to be visible in the index, so if you were very fast you may have to retry.
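
Because of this commit delay, a script that queries Solr right after an import may need to retry. The following is a minimal Python sketch of such a polling loop. The helper name wait_until and the stand-in search function are hypothetical, and no real Solr call is made.

```python
import time

def wait_until(check, timeout=10.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the timeout expires.

    Returns the truthy result, or None when the timeout is reached.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)

# Stand-in for a Solr query: the document only shows up on the third attempt.
attempts = []

def fake_search():
    attempts.append(1)
    return ["frankenstein"] if len(attempts) >= 3 else []

hits = wait_until(fake_search, timeout=10.0, interval=0.0)
print(hits, len(attempts))
```

With the 5 second auto-commit used in this guide, a timeout somewhat above 5 seconds is a reasonable starting point.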

2.10 REST interface

There are two protocols available to talk to Lily: an RPC-style binary one based on Avro, which is used when you use the client Java API, and a REST-style API (HTTP+JSON).

The port on which the REST interface listens is printed on repository startup. By default it is 12060:

Protocol [HTTP/1.1] listening on port 12060

For example, here is how you can access one of the records created earlier by the import:

http://localhost:12060/repository/record/USER.mary_shelley
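
The same URL can be built and fetched from a script. The following is a minimal Python sketch using only the standard library. The function names are illustrative, percent-encoding the record ID is an assumption for IDs that contain special characters, and fetch_record assumes the response body is a UTF-8 encoded JSON document.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def record_url(record_id, host="localhost", port=12060):
    """Build the REST URL of a record, percent-encoding the record ID."""
    return "http://%s:%d/repository/record/%s" % (
        host, port, quote(record_id, safe=""))

def fetch_record(record_id, host="localhost", port=12060):
    """Fetch a record over the REST interface (needs a running Lily)."""
    with urlopen(record_url(record_id, host, port)) as response:
        return json.loads(response.read().decode("utf-8"))

print(record_url("USER.mary_shelley"))
```

Only record_url is exercised here. Calling fetch_record requires the test setup from the previous sections to be running.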

2.11 Rebuilding The Index

Usually an index is kept up to date incrementally by listening to repository events. Sometimes it is useful to rebuild the index, for example when the indexer configuration has changed, when the index was defined after content was already loaded into Lily, or when the Solr index is lost. It is also possible to disable incremental index updating completely, and only update the index through batch rebuilds.

Let's quickly run through how to trigger a batch index build.

A batch index build is triggered by changing the batch build state of an index to BUILD_REQUESTED, as follows:

bin/lily-update-index -n books --build-state BUILD_REQUESTED

In response to this state change, Lily will launch a Hadoop job to perform the index build, and change the batch build state to BUILDING. This can be observed by running lily-list-indexes:

bin/lily-list-indexes

which shows output like this:

books
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: BUILDING
  + Queue subscription ID: IndexUpdater_books
  + Solr shards: 
    + shard1: http://localhost:8983/solr
  + Active batch build:
    + Hadoop Job ID: job_20101105103522869_0001
    + Submitted at: 2010-11-05T10:38:33.913+01:00
    + Tracking URL: http://localhost:45989/jobdetails.jsp?jobid=job_20101105103522869_0001

Notice that it also shows the ID of the Hadoop job and a tracking URL, which takes you to a web UI that displays more information about the progress of the job.

After a little while the job will be finished, and when you run lily-list-indexes again, the batch build state will be INACTIVE and information about the last run batch build will be available:

books
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_books
  + Solr shards: 
    + shard1: http://localhost:8983/solr
  + Last batch build:
    + Hadoop Job ID: job_20101105103522869_0001
    + Submitted at: 2010-11-05T10:38:33.913+01:00
    + Success: true
    + Job state: succeeded
    + Tracking URL: http://localhost:45989/jobdetails.jsp?jobid=job_20101105103522869_0001
    + Map input records: 2
    + Launched map tasks: 1
    + Failed map tasks: 0
    + Index failures: 0
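
To wait for a batch build from a script, you could read the batch build state from the lily-list-indexes output. The following Python sketch parses the text layout shown above. It assumes that layout as-is, which is not a documented stable interface, and the function name is illustrative.

```python
def batch_build_states(listing):
    """Return {index name: batch build state} from lily-list-indexes output."""
    states = {}
    current = None
    for line in listing.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            current = line.strip()  # an unindented line names an index
            continue
        # Indented lines look like "+ Key: value"; split on the first colon.
        key, _, value = line.strip().lstrip("+").partition(":")
        if key.strip() == "Batch build state" and current is not None:
            states[current] = value.strip()
    return states

sample = """books
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: BUILDING
  + Solr shards:
    + shard1: http://localhost:8983/solr
"""
print(batch_build_states(sample))
```

A build has finished once the state for the index is back to INACTIVE, at which point the "Last batch build" section reports whether it succeeded.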

2.12 Next steps

Now you know the basics of running Lily. Next steps include:

As mentioned before, the HBase, Hadoop, ZooKeeper and Solr instances started by launch-test-lily store their data in a temporary directory, which is lost when you stop them.

2.13 Installing A Lily Cluster

For instructions on how to install HBase, Hadoop, ZooKeeper and Solr, we refer to the installation guides of these individual products. Below we give advice on what versions to use and how to configure Lily to connect to your installation.

Lily Enterprise includes comprehensive tools for installation, administration and cluster deployments, and Debian/RPM-packaged versions of Lily and related software. It also is extensively tested against the Cloudera Distribution of Hadoop.

To provide the comfort of tested and supported releases of the Hadoop stack, we have chosen Cloudera's Hadoop distribution. Similar HBase (0.90+) and Hadoop versions should also work, as long as the RPC interface is compatible.

2.13.1 Network configuration

Make sure name resolution between hosts is set up correctly. On each server, the local hostname should resolve to the IP address of the network interface (eth0), and reverse-resolving that IP address should give the same hostname again (and not localhost, or the hostname with some domain suffix appended to it).
It is OK to fix this using /etc/hosts instead of changing DNS, but in that case it should be done consistently on each node so that the nodes know each other by name.
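
The round trip described above can be verified with a few lines of Python. This is a minimal sketch, not a Lily tool: the function name is illustrative, treating any 127.x address as a misconfiguration is our reading of the rule above, and the resolver functions are parameters so that the example runs without touching the network.

```python
import socket

def check_name_resolution(hostname=None,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Check that hostname -> IP -> hostname round-trips cleanly.

    Returns a list of problem descriptions, empty when all is well.
    """
    hostname = hostname or socket.gethostname()
    problems = []
    address = forward(hostname)
    if address.startswith("127."):
        problems.append("%s resolves to loopback address %s"
                        % (hostname, address))
    reverse_name = reverse(address)[0]
    if reverse_name != hostname:
        problems.append("%s reverse-resolves to %s, expected %s"
                        % (address, reverse_name, hostname))
    return problems

# Fake resolvers standing in for DNS or /etc/hosts lookups.
good = check_name_resolution(
    "node1",
    forward=lambda name: "10.0.0.5",
    reverse=lambda addr: ("node1", [], [addr]))
bad = check_name_resolution(
    "node1",
    forward=lambda name: "127.0.1.1",
    reverse=lambda addr: ("node1.example.com", [], [addr]))
print(good)
print(bad)
```

Calling check_name_resolution() without arguments on a cluster node uses the real resolver and the local hostname.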

2.13.2 Installing Hadoop, HBase and ZooKeeper

We recommend using the versions from Cloudera CDH3u3, available from Cloudera downloads.

We refer to the Cloudera documentation, and the generic Hadoop, HBase and ZooKeeper documentation, for more information on how to set up a Hadoop/HBase cluster.

HBase: deploy extra jar

The following jar should be copied from the Lily distribution to the hbase lib directory on each of the HBase nodes:

lib/org/lilyproject/lily-hbase-ext/<version>/lily-hbase-ext-<version>.jar

In principle, this should only be necessary if you make use of Lily's blob fields.

2.13.3 Installing Solr

We have developed against Solr [unresolved variable: solrVersion]. Other versions should work as long as the REST interface and the javabin format are compatible. In particular, for Solr 1.4 you will need to switch to the XML format, as the javabin format is not compatible. This is explained in Solr Versions.

Download from the Solr website.

2.13.4 The Lily Server Process

Lily consists of a lily-server process which you can run on any number of nodes. The Lily clients will talk to this Lily server process, which in turn will make use of HBase and Solr. In contrast to HBase and Solr, the Lily server process is lightweight in terms of memory requirements. Typically, you will run a lily-server along with each HBase region server.

2.13.4.1 Configuring Lily to connect to your HBase, Hadoop & ZooKeeper

To configure Lily you need to know:

Then adjust the following files:

Note that you have to specify the ZooKeeper information twice: once for HBase, and once for Lily. You can use the same or different ZooKeeper installations for them.

As for Hadoop/HBase, you need to make sure these configuration changes are deployed to all your Lily nodes.

2.13.4.2 Running The Lily Server Process

Lily's server process is launched either by using the following shell script:

bin/lily-server

or by using the Java service wrapper (recommended):

service/lily-service start

The very first time you start Lily, startup will be a bit slower since the tables still need to be created.

When Lily is started, you will see a line like this logged in logs/lily-server:

[INFO   ] <2011-10-14 16:33:22,386> (org.kauriproject.runtime.info): Kauri Runtime started [October 14, 2011 4:33:22 PM CEST]

In case you started Lily using the shell script, this line will also be printed to standard out.

If Lily fails to start when using the service wrapper, be sure to check logs/lily-wrapper.log.

If the lily-server JVM is running and the last line printed in the log is the following

[INFO   ] <2011-10-14 16:33:19,281> (org.kauriproject.runtime.info): Starting module general - /../lily-general-module-<version>.jar

then Lily is trying to connect to ZooKeeper. At startup it will retry this for an extended amount of time (configurable through conf/zookeeper/zookeeper.xml) to cope with services not being started in order.

2.13.4.2.1 Identifying the lily-server process

Using the Java command jps you can see an overview of the running Java processes (you might have to run this via sudo).

Depending on whether you start Lily via the shell script or the service wrapper, you will see a different class name.

$ jps -l

# in case of the lily-server shell script
24044 org.kauriproject.launcher.RuntimeCliLauncher

# in case of the service wrapper
23431 org.tanukisoftware.wrapper.WrapperSimpleApp

2.14 Upgrade from Lily 1.1

ATTENTION: the upgrade tool doesn't upgrade the linkindex table. If you use the Lily LINK value type and want to upgrade between Lily 1.1 and 1.2, please contact us.

2.14.1 During upgrading

After installing the Lily 1.2 software, but before launching Lily 1.2 for the first time, the following should be done.

Convert record IDs

The way record IDs are encoded into HBase row keys has changed slightly since Lily 1.1. This change is backwards incompatible and requires an upgrade of the record table. The upgrade works by copying all your existing records into a new table (in the new format), after which the old record table is dropped and the new one is renamed to take its place.

The creation of the new table is done using the upgrade tool called lily-upgrade-from-1.1. Renaming/dropping the record table is performed by yourself using commands on the HBase shell, as described below.

Use the -h option to list all options and the usage of the command.

Performs upgrade of the HBase storage format from Lily 1.1 to Lily 1.2

Be sure to read the Lily documentation on how to use this tool!


usage: lily-upgrade-from-1.1 [-confirm] [-dumplog] [-h] [-log <config>]
       [-tn <tablename>] [-to <filename>] [-v] [-wtw] [-z
       <connection-string>]
 -confirm                             Confirm you want to start the
                                      upgrade.
 -dumplog                             Dump default log4j configuration
 -h,--help                            Shows help
 -log <config>                        log4j config file (.properties or
                                      .xml)
 -tn,--table-name <tablename>         Destination table name, default
                                      record_lily_1_2
 -to,--table-options <filename>       Table creation options file, like
                                      conf/general/tables.xml
 -v,--version                         Shows the version
 -wtw,--write-to-wal                  Enable write to WAL, off by default.
 -z,--zookeeper <connection-string>   ZooKeeper connection string:
                                      hostname1:port,hostname2:port,...

WARNING! Lily should not be running when executing this upgrade tool.
         Only HBase, Hadoop and Zookeeper should be running.

In the instructions below all the default settings are used: table-name = record_lily_1_2, write-to-wal = off, zookeeper = localhost:2181.

Now you can restart your Lily servers.

You can run a small test that outputs a number of your records to make sure everything ran properly:

$LILY_HOME/apps/scan-records/target/lily-scan-records -p -l 10

3 Architecture

3.1 Distribution

Lily has a distributed architecture. This distribution is manifested in two ways. First, there are nodes (= systems, servers) that perform different functions, causing a functional layering. Second, there are multiple nodes that perform the same function, for purposes of scalability and fault-tolerance.

This is illustrated in the following figure.

In this diagram, the Lily node serves as a black box node for different components, which are described further on.

Not every box in this diagram necessarily corresponds to a physical server. While multiple processes of the same kind should be run on different servers, you can run e.g. a Lily node, an HBase region node and an HDFS data node on the same server.

While the diagram shows three nodes of every kind, the actual numbers can differ for each type of node, depending on the needs.

For some kinds of nodes, it does not matter which node you connect to. For example, each client can connect to any arbitrary Lily node. For others, the node to connect to depends on which one hosts the data. For example, a Lily node that wants to read a row from HBase has to connect to the HBase node that hosts this row.

A Lily client does not connect to one fixed endpoint. It decides itself which Lily node to connect to, and talks directly to HDFS and HBase nodes when appropriate.

3.2 Main components

The diagram below shows the main components of the Lily content repository and the connections between them. For clarity, this figure shows only one instance of each component, but remember that there can be any number of them.

What we referred to as “Lily node” in the above section on distribution consists of different independent components such as the repository, the indexer, and the message queue. These could be run as different processes or in one process; this is of little importance for our discussion here.

3.2.1 HBase

Lily uses Apache HBase for the storage of fine-grained data. HBase is modeled after Google's BigTable. HBase has little in common with the SQL databases everyone knows: it offers neither rich querying nor transactions. Instead, it scales to very large amounts of data (billions of rows) by adding more hardware as needed. No manual repartitioning of the data is necessary. It also handles failing nodes automatically.

HBase has a special data model, whereby rows can contain very large numbers of columns, and columns without a value do not take up space, so it is ideal for sparse data structures. The BigTable people concisely call it “a sparse, distributed, persistent multidimensional sorted map”. HBase does not know data types: it handles everything as bytes.
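
To make the "sparse, multidimensional sorted map" idea concrete, here is a toy in-memory Python model. It is a sketch for illustration only and says nothing about how HBase is implemented: the class and method names are ours, and distribution and persistence are left out entirely.

```python
from bisect import insort

class SparseTable:
    """Toy model of a sparse sorted map: (row, column, timestamp) -> bytes."""

    def __init__(self):
        self._cells = {}     # row key -> {column -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys, kept in sorted order

    def put(self, row, column, timestamp, value):
        if not isinstance(value, bytes):
            raise TypeError("values are opaque bytes")
        if row not in self._cells:
            self._cells[row] = {}
            insort(self._row_keys, row)
        # Only cells that were written take up space.
        insort(self._cells[row].setdefault(column, []), (timestamp, value))

    def get(self, row, column):
        """Return the most recent value of a cell, or None if never set."""
        versions = self._cells.get(row, {}).get(column)
        return versions[-1][1] if versions else None

    def scan(self, start, stop):
        """Return row keys in sorted order with start <= key < stop."""
        return [key for key in self._row_keys if start <= key < stop]

table = SparseTable()
table.put(b"book-2", b"title", 1, b"Dracula")
table.put(b"book-1", b"title", 1, b"Frankenstein")
table.put(b"book-1", b"title", 2, b"Frankenstein (revised)")
print(table.get(b"book-1", b"title"))
print(table.get(b"book-2", b"pages"))
print(table.scan(b"book-1", b"book-3"))
```

The three dimensions are the row key, the column and the timestamp. Rows come back in key order regardless of insertion order, and a column that was never written costs nothing.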

HBase stores its data on HDFS, described next.

3.2.2 HDFS

HDFS is the Hadoop distributed file system, thus a file system that spans across nodes. It is modeled after GFS, the Google File System. A file in HDFS is stored multiple times in the cluster (by default 3 times), so that if a node fails, the data is still available elsewhere. HDFS is best used for the storage of larger files. The namespace of the file system (the link between the names of the files and where they are stored) is maintained on one system, called the name node. The number of files one can store in HDFS is limited by the amount of memory in that system. Practically this means you can still store millions of files on it, but in Lily we will store smaller blobs in HBase, to avoid quickly hitting this limit. HDFS has a focus on high throughput rather than low latency.

3.2.3 The repository

The repository provides the basic record CRUD functionality. Clients connect to the repository using an Avro-based protocol. Avro is an efficient binary serialization system. The repository connects to HBase using the HBase Java API, which talks HBase RPC, also based on an efficient binary serialization.

The basic entity managed by the repository is called a record, see the repository model description. When reading a record, a client can specify to read only some fields, and when updating a record, a client only needs to communicate the changed fields.

The Java API exposed by the repository is based on simple data objects and service-style interfaces. This API-approach makes “playing” with the data objects straightforward.

The ID of the record can either be assigned by the user when creating the record, or is automatically assigned by the repository, in which case it is a UUID.

Fields in a record can be blobs. These blobs are stored either in HBase or on HDFS, depending on a size-based strategy. Smaller blobs like HTML pages can be stored in HBase, while bigger blobs that should be handled as streams are stored on HDFS (see also discussion on HDFS above).
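
A size-based strategy of this kind boils down to a threshold comparison. The Python sketch below illustrates the idea. The threshold value and the names are hypothetical and should not be read as Lily's defaults or API.

```python
HBASE_LIMIT_BYTES = 200000  # hypothetical threshold, not a Lily default

def choose_blob_store(size_in_bytes, hbase_limit=HBASE_LIMIT_BYTES):
    """Pick a store for a blob: small blobs in HBase, large ones on HDFS."""
    if size_in_bytes < 0:
        raise ValueError("blob size cannot be negative")
    return "HBASE" if size_in_bytes <= hbase_limit else "HDFS"

print(choose_blob_store(10000))     # a small HTML page
print(choose_blob_store(50000000))  # a large media file
```

Keeping small blobs out of HDFS limits the number of files the name node has to track, which is the constraint discussed in the HDFS section.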

One record, which can contain multiple versions, maps onto one row in HBase. This makes a record the unit of atomic manipulation.

3.2.4 The Write Ahead Log

When creating or updating a record, often secondary actions (= post-update actions) will need to happen, the most common example of which is keeping indexes up to date.

If we naively updated the row in HBase and then updated the corresponding indexes, the indexes might not be updated if the repository process died in between.

In more traditional architectures, transactions are used to assure that multiple actions happen as one atomic operation. For our use-cases, full transaction support is not needed. We do not need atomicity, nor do we need rollback. All secondary actions are considered to be subordinate actions which should succeed, and if they fail, they should not invalidate the operation on the record.

The solution we use in Lily is a write-ahead log, or WAL for short. Before performing an action on the repository, we write our intention to do this to the WAL. Then we update the repository, and confirm this to the WAL. Then the secondary actions are performed, each time confirming to the WAL. If the process is interrupted at any point, the WAL can be checked upon restart to see how far we got and to perform any remaining actions.

Lily's WAL is unrelated to the HBase WAL. It is also conceptually different, since Lily does not write the data of a record update to its WAL. Its only purpose is to guarantee the execution of the secondary actions.
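
The sequence described above (log the intention, update, confirm, run the secondary actions, recover after a crash) can be sketched as a toy in-memory model. This Python sketch is not Lily's implementation: all names are illustrative, the crash is simulated, and dropping entries whose update was never confirmed is an assumption we make because the WAL does not hold the record data.

```python
class WriteAheadLog:
    """Toy WAL: stores intentions and confirmations, not record data."""

    def __init__(self):
        self.entries = {}  # entry id -> {"record_id", "updated", "pending"}
        self._next_id = 0

    def log_intent(self, record_id, action_names):
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = {"record_id": record_id, "updated": False,
                                  "pending": list(action_names)}
        return entry_id

    def confirm_update(self, entry_id):
        self.entries[entry_id]["updated"] = True
        self._drop_if_done(entry_id)

    def confirm_action(self, entry_id, action_name):
        self.entries[entry_id]["pending"].remove(action_name)
        self._drop_if_done(entry_id)

    def _drop_if_done(self, entry_id):
        entry = self.entries[entry_id]
        if entry["updated"] and not entry["pending"]:
            del self.entries[entry_id]

def update_record(wal, store, record_id, fields, actions, crash_before=None):
    """Update a record and run secondary actions, confirming each step."""
    entry_id = wal.log_intent(record_id, actions)   # 1. write the intention
    store.setdefault(record_id, {}).update(fields)  # 2. update the repository
    wal.confirm_update(entry_id)                    # 3. confirm the update
    for name, action in actions.items():            # 4. secondary actions
        if name == crash_before:
            return                                  # simulated process crash
        action(record_id)
        wal.confirm_action(entry_id, name)

def recover(wal, actions):
    """After a restart, finish whatever the WAL says is still pending."""
    for entry_id, entry in list(wal.entries.items()):
        if not entry["updated"]:
            del wal.entries[entry_id]  # assumption: the update never happened
            continue
        for name in list(entry["pending"]):
            actions[name](entry["record_id"])
            wal.confirm_action(entry_id, name)

indexed, queued = [], []
actions = {"index": indexed.append, "queue": queued.append}
wal, store = WriteAheadLog(), {}
update_record(wal, store, "USER.mary_shelley", {"name": "Mary Shelley"},
              actions, crash_before="queue")
print(indexed, queued)  # the crash happened before the queue action ran
recover(wal, actions)
print(indexed, queued, wal.entries)
```

After recovery both secondary actions have run exactly once and the WAL is empty. A real implementation also has to cope with a crash between running an action and confirming it, which is why secondary actions are best made safe to repeat.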

3.2.5 The Message Queue

Updating indexes does not need to happen synchronously with the update of the record, while the client is waiting for its response. Rather, this can be done asynchronously. The usual solution to this is to make use of a message queue. Pushing a message onto the queue is a kind of secondary action that needs to happen when updating a record, and our WAL will assure that the message will surely be pushed to the queue even if the repository dies before it gets to that.

Rather than bringing an existing queue technology into our system, which would have its own persistence, admin needs, failover solution, etc., Lily uses a lightweight message queue that reuses HBase for persistence.

3.2.6 The Indexer

The role of the Indexer is to keep the Solr-index up to date when records are created, updated or deleted. For this purpose, the Indexer listens to the message queue.

The indexer maps Lily records onto Solr documents, by deciding (based on configuration) which records and what fields of the record need to be indexed. For blob fields, it can perform content extraction using the Tika library.

3.2.6.1 Denormalization

Lily records can contain link fields. Link fields are links to other records. During indexing, you can include information from linked records within the index of the current record. This is called denormalization. Information can be denormalized by following links multiple levels deep. Denormalization at index time is an alternative to SQL-join-like functionality at query time. Join queries are not available in Lucene, and are complicated to do with sharded databases in general. Denormalization makes querying faster and easier, but complicates indexing. It assumes you know beforehand (= when indexing) what sort of queries you will want to do on linked content.

A consequence of denormalization is that when a record is updated, the index entries of other records might also become invalid, when they contain information from the updated record. The Lily Indexer will automatically update such index entries. For this, it makes use of another component, the LinkIndex, which maintains an index (based on the HBase indexing library) of all links between records.
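
The role of the LinkIndex can be illustrated with a toy reverse index in Python. This is only a sketch of the idea: the class and function names and the in-memory sets are ours, whereas the real LinkIndex is stored in HBase.

```python
from collections import defaultdict

class LinkIndex:
    """Toy link index: remembers which records link to which."""

    def __init__(self):
        self._forward = defaultdict(set)   # record -> records it links to
        self._backward = defaultdict(set)  # record -> records linking to it

    def update_links(self, source, targets):
        """Replace the outgoing links of source with targets."""
        for old_target in self._forward.pop(source, set()):
            self._backward[old_target].discard(source)
        for target in targets:
            self._forward[source].add(target)
            self._backward[target].add(source)

    def referrers(self, target):
        """Return the records that link to target."""
        return set(self._backward.get(target, set()))

def records_to_reindex(link_index, updated, depth=1):
    """Return the updated record plus every record that may have
    denormalized its data, following links backwards up to depth levels."""
    seen = {updated}
    frontier = {updated}
    for _ in range(depth):
        frontier = {referrer for record in frontier
                    for referrer in link_index.referrers(record)} - seen
        seen |= frontier
    return seen

links = LinkIndex()
links.update_links("book:frankenstein", {"author:mary_shelley"})
links.update_links("book:the_last_man", {"author:mary_shelley"})
print(sorted(records_to_reindex(links, "author:mary_shelley")))
```

When the author record changes, both book records are found through the backward links, so their index entries can be refreshed along with the author's own.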

3.2.7 Solr

Solr is a search server based on Lucene, the well-known text-search library. It provides powerful search functionality including full-text search (with spell check, search suggestions, and so on), fielded search and faceted navigation. The configuration can be tweaked, e.g. with regard to text analysis, to provide an optimal search experience. It supports distributed querying across a set of Solr nodes (to support data sets that do not fit on a single server), and Solr nodes can be replicated (to support many concurrent search requests).

3.2.8 ZooKeeper

ZooKeeper provides some basic services for the coordination of distributed applications, like distributed synchronization, leader election and configuration. As these things are hard to get right, it is a good thing that many applications reuse ZooKeeper for this purpose. ZooKeeper is used by Lily and HBase, and is also starting to make an appearance in Solr.

4 Repository

4.1 Repository Model

Lily's repository model is designed for content management applications. Compared to more data-oriented applications, this means we offer rich field types like multi-value fields, versioning, a flexible schema, and variants (such as for different languages).

4.1.1 Basic concepts and terminology

Lily manages records. A record is a set of fields. Records adhere to a record type which specifies the field types that are allowed within the record. Field types define the kind of value that can be stored in the field (string, long, decimal, link, ...) and the scope of the field. The scope determines if the field is versioned or not. Versioned fields are immutable: upon each change of a versioned field a new version is created within the record.

The diagram below shows the relation between these concepts, and some more that we will discuss further on in detail.

4.1.2 No hierarchy

Many content repositories use a file system metaphor for the structure of their repository, whereby content is put in a hierarchical namespace. For example, the Java Content Repository (JCR) API uses such a hierarchical model. Such models force users to think about a primary organization of the content, and require them to decide where in the hierarchy to store each created entity.

In Lily, there is no such hierarchy. The repository is one big bag of records. Users therefore do not need to think about where to store things in a primary hierarchy.

Lily does not have tables either: there is just one set of records.

4.1.3 Record identification

A record is uniquely identified by its ID. The ID can be assigned by the user, or can be generated by the system, in which case it is a UUID.

In case you choose to assign record IDs yourself, be sure to adjust the initial table region settings!

More precisely, the record ID consists of two components: the master record ID and a set of variant properties. It is the combination of these two which uniquely identifies a record. However, the variant properties are optional, and we will discuss them in detail later on.
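
The following Python sketch models this two-part identity. It is illustrative only: the class is ours, not Lily's RecordId interface, and it makes no claim about how IDs are written as strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecordId:
    """Toy record ID: a master record ID plus optional variant properties."""
    master: str
    variant_properties: frozenset = frozenset()

    @classmethod
    def of(cls, master, **variant_properties):
        return cls(master, frozenset(variant_properties.items()))

master = RecordId.of("mary_shelley")
english = RecordId.of("mary_shelley", lang="en")
also_english = RecordId.of("mary_shelley", lang="en")

print(master == english)        # same master ID, different variant
print(english == also_english)  # both components match
print(len({master, english, also_english}))
```

Two IDs are equal only when both the master record ID and the full set of variant properties are equal, so the master record and its English variant are distinct records.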

Record re-creation

When a record is deleted in Lily, a deleted marker flag is set to true and all historical data (record type, record type version, field data) that existed for the record is cleared. The current version number is kept, however. When a record is later created with the same record ID, this is regarded as a record re-creation. The record is created as for a normal create, but its version numbering continues from where it was when the record was deleted (e.g. if the version number was 4 when the record was deleted, the re-created record gets version number 5). For more information on the reasoning behind this, see Repository Model To HBase Mapping.
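
The version numbering across a delete and re-create can be traced with a toy model. The Python sketch below is a simplification under stated assumptions: every create or update is assumed to touch a versioned field and thus produce a new version, and the class names are ours.

```python
class RecordRow:
    """State kept for one record ID, including after deletion."""

    def __init__(self):
        self.deleted = False
        self.version = 0
        self.fields = {}

class ToyRepository:
    def __init__(self):
        self._rows = {}

    def create(self, record_id, fields):
        row = self._rows.get(record_id)
        if row is None:
            row = self._rows[record_id] = RecordRow()
        elif not row.deleted:
            raise KeyError("record already exists: %s" % record_id)
        row.deleted = False
        row.version += 1  # continues from the kept version number
        row.fields = dict(fields)
        return row.version

    def update(self, record_id, fields):
        row = self._live_row(record_id)
        row.version += 1
        row.fields.update(fields)
        return row.version

    def delete(self, record_id):
        row = self._live_row(record_id)
        row.deleted = True  # marker flag; the row itself stays
        row.fields = {}     # data is cleared, the version number is kept

    def _live_row(self, record_id):
        row = self._rows.get(record_id)
        if row is None or row.deleted:
            raise KeyError("no such record: %s" % record_id)
        return row

repo = ToyRepository()
repo.create("book-1", {"title": "Frankenstein"})
for pages in (280, 281, 282):
    last_version = repo.update("book-1", {"pages": pages})
repo.delete("book-1")
recreated_version = repo.create("book-1", {"title": "Frankenstein"})
print(last_version, recreated_version)
```

The record was at version 4 when it was deleted, so the re-created record starts at version 5, matching the example in the text.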

4.1.4 Records, field scopes, versions

4.1.4.1 Records

A record is the core entity managed by the Lily repository. All data you store in Lily is in the form of records.

A record is the unit of atomic modification in Lily, thus the granularity of a read, update or delete operation. Since no concurrent operations can happen on a row, the number of updates to a row in a unit of time has a limit.

A record contains a set of fields. A field is a pair {field type id, value}.

Besides a pointer to its record type, a record has no built-in properties (like “last modified”, “owner”, ...), so there is no unwanted overhead of these.

4.1.4.2 Field scopes & versions

Records can have versions, so that older data stays available, but versioning is optional.

Fields can reside in three scopes: the non-versioned scope, the versioned scope, and the versioned-mutable scope. We respectively speak of non-versioned fields, versioned fields and versioned-mutable fields.

Fields that belong to the non-versioned scope are, as the name implies, not versioned. If a record has only fields in the non-versioned scope, the record will have no versions. If the record does have versions while it also has non-versioned fields, then you can consider the non-versioned fields as fields whose value counts for any version (= cross-version fields). If for such records, you modify only a non-versioned field, no new version will be created.

Fields that belong to the versioned scope are (obviously) versioned: each time a record is updated with new values for such fields, a new version will be created in the record (the fields are not versioned individually). Fields in the versioned scope are immutable after creation: you cannot modify their value in existing versions.

Fields that belong to the versioned-mutable scope are somewhat special: these fields are part of versions like the versioned fields, but they stay mutable (modifiable) in existing versions. They are ideal for metadata about a version, like the version's review status, a version comment, and the like.

Typically, you will either choose to use versioning or not to use versioning, and most fields will fall in one of these scopes. Still it can be useful to have non-versioned fields when using versioning, e.g. for a field which determines the access permissions to the record, as you will want this to affect all versions.

Versions can currently not be deleted.
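
The interplay of the three scopes can be sketched with a small Python model. It is a simplification and not Lily's API: the names are illustrative, versioned-mutable fields are only written through a separate update_mutable call, and a new version starts without versioned-mutable values. Both of these choices are assumptions of the sketch.

```python
NON_VERSIONED = "non-versioned"
VERSIONED = "versioned"
VERSIONED_MUTABLE = "versioned-mutable"

class ScopedRecord:
    def __init__(self, scopes):
        self.scopes = dict(scopes)  # field name -> scope
        self.non_versioned = {}     # cross-version fields
        self.versions = []          # versions[0] is version 1

    def update(self, fields):
        """Apply non-versioned and versioned fields.

        Returns the latest version number (0 if there are no versions).
        """
        for name in fields:
            if self.scopes[name] == VERSIONED_MUTABLE:
                raise ValueError("use update_mutable() for field %r" % name)
        versioned = {name: value for name, value in fields.items()
                     if self.scopes[name] == VERSIONED}
        for name, value in fields.items():
            if self.scopes[name] == NON_VERSIONED:
                self.non_versioned[name] = value
        if versioned:
            # Fields are not versioned individually: the new version is a
            # full snapshot of the versioned fields.
            snapshot = dict(self.versions[-1]["versioned"]) if self.versions else {}
            snapshot.update(versioned)
            self.versions.append({"versioned": snapshot, "mutable": {}})
        return len(self.versions)

    def update_mutable(self, version, fields):
        """Change versioned-mutable fields inside an existing version."""
        for name in fields:
            if self.scopes[name] != VERSIONED_MUTABLE:
                raise ValueError("field %r is not versioned-mutable" % name)
        self.versions[version - 1]["mutable"].update(fields)

record = ScopedRecord({"title": VERSIONED, "acl": NON_VERSIONED,
                       "review_status": VERSIONED_MUTABLE})
v_a = record.update({"acl": "public"})           # no version created
v_b = record.update({"title": "Frankenstein"})   # version 1
v_c = record.update({"acl": "private"})          # still version 1
record.update_mutable(1, {"review_status": "approved"})
v_d = record.update({"title": "Frankenstein (1831 edition)"})  # version 2
print(v_a, v_b, v_c, v_d)
print(record.versions[0])
```

Changing only the non-versioned acl field never creates a version, while the review status of version 1 can still be set after version 1 exists.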

4.1.5 Field types

The fields in a record are not free name-value pairs: each field in a record has to be defined by a corresponding field type. For each field type, there can be at most one value in a (version of a) record.

The field type fixes some important aspects of a field:

Except for the name, a field type is immutable after creation.

The name of a field type should be unique within the repository.

To illustrate what it means for the name of a field type to be unique, let's compare this with SQL databases. In these, the name of a field is unique within the table, but not across tables. In Lily, field types are defined independently from record types. The same field type can be added to many different record types. This has the advantage that all records which have a field of some type can be treated in a uniform way. For example, if we add a field type "name" to all record types, we will be able to use that name in listings containing records of different types. The same could have been achieved through mixins (see later) or a record type hierarchy. However, this is not the primary reason we made field types independent entities. We did so mainly to have a fixed {field name, scope} relation, and because a record can have a different record type per scope.

The name (namespace + simple name) can be changed after creation. It is the name users (= developers) will use for identifying fields. However, we expect name changes to be rare: a rename will typically be part of a redesign/refactoring or the correction of a typo.

If Lily allowed changing the value type of a field, it would fail on reading existing field values. Allowing the scope to change would also lead to difficulties reading and writing records.

If you would like to change the scope or type of a field, the solution is to make a new field. You could then run a task which converts all existing records by copying the value from the old field to the new one. Or, sometimes better, you make the application cope with both the old and new field when reading a record, and perform the conversion when a record is updated. Note that since field types can change name, you can rename the old field type and give the new field type the name of the old one, so that it is virtually replaced.

Field types can currently not be deleted.

4.1.5.1 Value types

The value type indicates the (Java) type of the values that can be stored in the field. Lily has some built-in value types which are listed in the javadoc of the TypeManager, method getValueType.

4.1.5.1.1 Basic value types

The basic value types include:

Name         Java Type
STRING       java.lang.String
INTEGER      java.lang.Integer
LONG         java.lang.Long
DOUBLE       java.lang.Double
DECIMAL      java.math.BigDecimal
BOOLEAN      java.lang.Boolean
DATE         org.joda.time.LocalDate
DATETIME     org.joda.time.DateTime
BLOB         org.lilyproject.repository.api.Blob
URI          java.net.URI
BYTEARRAY    org.lilyproject.bytes.api.ByteArray

4.1.5.1.2 Parametrized value types

Some value types are more complex and can be parametrized with extra information. When referring to these value types, their name is extended with a parameter between angle brackets: < >

These parametrized value types include:

LIST: a list value type represents a java.util.List

PATH: a path value type represents an org.lilyproject.repository.api.Hierarchy

LINK: a link value type represents an org.lilyproject.repository.api.Link

RECORD: a record value type represents an org.lilyproject.repository.api.Record

4.1.6 Record types

A record type is a named set of field types. Each record is associated with a record type, and in this way it is defined what fields a record should contain (the 'should' is explained later on).

A record type consists of the following:

4.1.6.1 Record type versioning

In contrast to field types, record types can be modified: you can, for example, add and remove field types. On each such change, a new version of the record type is created. Records always point to a specific version of a record type. This way, the state of the record type at the time of record creation or update is preserved.

The name of a record type is not versioned, thus changing the name affects all its versions.

When a record is updated, it will by default move to the last version of the record type.

4.1.6.2 Mixins

Mixins allow easy reuse of a set of fields in various record types.

The mixins of a record type are a list of references to other record types, or more correctly to a {record type, record type version} (references to record types are always to a specific version of the record type). In other words, mixins provide a way to include or import record types within other record types.

Different mixin record types might contain the same field types. This is no problem: the duplicates will be ignored. The association attributes are merged. For the mandatory attribute, a field is mandatory as soon as it is mandatory in one of the mixins.

Mixins work recursively: we can mix in a record type which itself mixes in other record types. If there is a loop within the mixins, this will be detected and the recursion will stop.
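
To make the merge rules concrete, here is a minimal sketch of how the effective field set of a record type could be computed. It is not Lily's implementation: record types are modeled as dictionaries, record type versions are ignored, and the function and key names are assumptions made for the example.

```python
def collect_fields(name, record_types, seen=None):
    """Return {field_name: mandatory} for a record type, including its mixins."""
    seen = set() if seen is None else seen
    if name in seen:
        # Loop (or repeated mixin) detected: stop recursing here.
        return {}
    seen.add(name)
    record_type = record_types[name]
    fields = dict(record_type["fields"])
    for mixin in record_type["mixins"]:
        for field, mandatory in collect_fields(mixin, record_types, seen).items():
            # Duplicates collapse into one entry; mandatory wins if set anywhere.
            fields[field] = fields.get(field, False) or mandatory
    return fields


# Hypothetical schema with a duplicate field ("name") and a mixin loop.
record_types = {
    "Book": {"fields": {"title": True}, "mixins": ["Named", "Dated"]},
    "Named": {"fields": {"name": False}, "mixins": ["Dated"]},
    "Dated": {"fields": {"name": True, "created": False}, "mixins": ["Book"]},
}
effective = collect_fields("Book", record_types)
```

Here the field "name" ends up mandatory because it is mandatory in one of the mixins, and the loop back to "Book" is simply skipped.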

4.1.7 The record – record type relationship

We mentioned earlier that each record is associated with a (version of a) record type.

This is only part of the truth. Each record is actually associated with three record types, one per scope: non-versioned, versioned, versioned-mutable.

The record type of the non-versioned scope is the main record type of a record.

When a version is created, as part of the version we store the reference to the current record type at the moment of version creation. This way, when older versions are consulted, we can know what their record type was at that time (= the reference to record type itself is also like an immutable, versioned field). New versions are always created with the same record type as the one of the non-versioned scope.

Lastly, the versioned-mutable scope also stores its own pointer to the record type, corresponding to the non-versioned record type at the moment the versioned-mutable data was modified.

4.1.8 Record type as a guide rather than a straightjacket

A record type defines the fields that should be used within a record. When saving a record, the record is validated with respect to the record type: all mandatory fields should be present, fields that are not in the record type are not allowed, and the value of the fields should correspond to the value type of the field types.

However, this validation is optional and can be disabled (when storing a record). When it is disabled, you can add any field you like to a record, and the repository will store it. As such, technically a record is just a set of fields, and the record type an optional guide defining the structure of a record. The repository does not need the record type to be able to read or write the record.

Disabling validation is currently not yet implemented.

Let's contrast this with XML. An XML document is self-describing and can be parsed without a schema (if we forget about DTDs for a moment). An XML Schema can be used as an optional layer to be sure the XML document conforms to a certain structure. Lily records are somewhat the same but also somewhat different. While Lily does not need access to the record type, Lily does need access to the field types to be able to read and write records. This could have been avoided if we stored the value type along with each value. The reason we did not go this way is because of the scopes. Without a fixed {field name, scope} relation the user would have to specify the scope each time she wants to get or set a field, since the name alone would not uniquely identify a field across scopes. Now this is enforced in Lily because the scope is part of the field type, and field types have a globally unique name.

This being said, usually validation will be left enabled. Disabling it can be useful e.g. for system processes that do not want to care about the structural validity of records.
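
The three validation rules listed at the start of this section can be sketched as follows. This is an illustration only, not the repository's validation code: the record, the record type and the field types are plain dictionaries, and only two value types are mapped to Python types.

```python
# Assumed mapping from Lily value type names to Python types, for illustration.
VALUE_TYPES = {"STRING": str, "LONG": int}


def validate(record, record_type, field_types):
    """Return a list of validation errors; an empty list means the record is valid.

    record:      {field_name: value}
    record_type: {field_name: mandatory}
    field_types: {field_name: value_type_name}
    """
    errors = []
    for name, mandatory in record_type.items():
        if mandatory and name not in record:
            errors.append(f"missing mandatory field: {name}")
    for name, value in record.items():
        if name not in record_type:
            errors.append(f"field not in record type: {name}")
        elif not isinstance(value, VALUE_TYPES[field_types[name]]):
            errors.append(f"wrong value type for field: {name}")
    return errors
```

Note that only the second rule (fields outside the record type) depends on the record type alone being a guide; the value type check relies on the field types, which Lily always needs.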

4.1.9 Variants

As mentioned in the section on record identification, the record ID consists of two components:

The most common use-case for variants is to maintain different language variants of the same record. They can also be useful for other purposes, such as for source-control-like branches.

The variant properties can be empty, in which case the record ID is equal to the master record ID. A record which has such an ID is called a master record.

The variant properties are a free set of name-value pairs. For example: {lang=en, branch=dev} (this syntax is just an informal notation used here).

The names of the name-value pairs are sometimes called the variant dimensions.

4.1.9.1 Why variants?

If we did not have variants, different languages of the same document would need to be created as different records, and hence with different IDs, in the repository. The problem is that these would then not have a shared identity. This can be annoying with respect to the links between these records. Suppose you have some records in one language, which have links between them (in link fields or in HTML blobs). These links point to other records by means of their ID. If you now want to translate these records to some other language, you would create a new set of records, and these will have different IDs. When copying the content from the original records to the new records (as a start for translating them), you will have to adjust all the links to point to the new record IDs.

When using variants, the links can be based on just the master record ID, and the variant properties can be resolved from the context (= are the same as those of the document that contains the link). For more information about context-dependent resolving of links, see the javadoc of the Link class, especially its resolve method.
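
The idea can be sketched with a toy model of record IDs. This is a simplification for illustration and not the behavior of the Link class, whose resolve method supports more than is shown here; the names RecordId, record_id and resolve are assumptions of this example.

```python
from typing import NamedTuple


class RecordId(NamedTuple):
    master: str
    variant: tuple  # sorted (name, value) pairs; empty for a master record


def record_id(master, **props):
    """Build a record ID from a master ID and free-form variant properties."""
    return RecordId(master, tuple(sorted(props.items())))


def is_master(rid):
    """A record whose variant properties are empty is a master record."""
    return not rid.variant


def resolve(link_master, context):
    """Resolve a link stored as a bare master ID, taking the variant
    properties from the record that contains the link."""
    return RecordId(link_master, context.variant)


english = record_id("record1", lang="en")
target = resolve("record2", english)
```

The same stored link, resolved from the French variant of record1, would point to the French variant of record2, so nothing needs rewriting when content is copied for translation.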

4.1.9.2 Cross-variant data

It is possible to have variants of a record for the different languages, while at the same time also having a variant without the language dimension. So, supposing a master ID of 'record1', we could have these variants:

In such cases, the master record can be used to store fields that should not be translated, such as numbers or dates. If these would be stored within each language, they would need to be updated in each of them when they change. The same ideas can of course be applied to variants with more dimensions.

4.1.10 Operations

The granularity of CRUD operations is on the level of a record. So one record at a time can be atomically created, read, updated or deleted. A single update operation can update fields in all the scopes. A record read can be limited to the fields you are interested in. When updating, only fields that are modified need to be communicated.

4.2 How To Create A Schema

To create a schema (field types and record types) in Lily you have several options:

The last option, using the JSON file, is often the most convenient. Not only does it save you from writing lots of API calls, it is also smarter: it will detect when a field type or record type already exists and update it if necessary.

When you write an application that needs to set up some fixed schema, you can also use the import tool as a library within your application. For example, the mbox import tool contains a JSON file describing its schema and calls the import tool programmatically to create or update the schema. Have a look at its source code to see how this can be done.

4.3 How To Create Records

To add, update or delete records in Lily you have several options:

In contrast to the creation of a schema, to create records you will typically use one of the APIs rather than the import tool.

5 Indexer

The Indexer is the component responsible for keeping the Solr index up to date. In essence, it takes records from the Lily repository and puts them into Solr. It does this in reaction to asynchronously-processed events produced by the repository.

The mapping of repository records onto Solr is more than just forwarding data: the Indexer offers features such as denormalization, indexing of multiple versioned views, and blob content extraction.

Denormalization: data from one record can be stored in the index entry of another record. This is useful since Solr is not able to do joins like a SQL database, and since the index can be partitioned over many nodes. Denormalization makes searching simpler, but indexing more difficult: when a record is updated, possibly the index entry of other records needs updating too. The Indexer takes care of this.

Indexing of multiple versions of one record: tags can be assigned to versions, and you can configure for which tags an index should be maintained. A version tag is like a snapshot of the record state across records.

Incremental index updating: upon each record change, Lily produces a message queue event (see the rowlog component). The indexer can subscribe to these events to incrementally update an index as changes are happening. If you run multiple Lily nodes, the indexers will run in each of the nodes and each perform a part of the work. Also within one Lily node, the indexer will run on multiple threads.

Batch index building: when you create a new index, change the configuration of an existing index, or for some reason your index got lost, you can trigger a batch index build job. This job executes as a map-phase-only MapReduce task which runs over all records in Lily and re-indexes them. The number of map tasks is equal to the number of HBase regions.

Blob content extraction, using the Tika library. Many common formats are supported such as HTML, PDF, Microsoft Office, OpenOffice and OpenDocument format, RTF, and more.

Sharding towards multiple Solr instances: if you have too much data to fit into one Solr instance, you can shard it over multiple ones.

5.1 Setting Up A Generic Index

The quickest way to see your content indexed in Solr is to avoid having to write configuration files first. For this purpose, Lily comes with a sample configuration based on dynamic field rules which will index any content.

This is a quick way to get started, but after that you'll soon want to customize things. For this, the Indexer Tutorial will help you get started.

There are basically two steps:

  1. Start Solr with the generic schema

  2. Define the index in Lily

5.1.1 Start Solr with the dynamic_solr_schema.xml

The Solr schema to be used can be found in:

{lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml

5.1.1.1 Using standalone Solr

Assuming you have just downloaded Solr, you can put the schema in place using:

cp {lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml \
   {solr}/example/solr/conf/schema.xml

And then start Solr using

cd {solr}/example
java -jar start.jar

5.1.1.2 Using launch-test-lily

When using launch-test-lily, you can specify the schema for the Solr instance using the -s parameter:

launch-test-lily -s {lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml

Tip: you might also want to use the -c argument to auto-commit the index, e.g. "-c 60" will commit it every minute.

5.1.2 Define the index in Lily

The indexer configuration to be used can be found in:

{lily}/samples/dynamic_indexerconf/dynamic_indexerconf.xml

To define the index in Lily, execute the following command. If you're using a real cluster rather than running everything on localhost, you will need to adjust the host name of ZooKeeper (-z option) and of Solr (-s option).

lily-add-index \
  -z localhost \
  -c samples/dynamic_indexerconf/dynamic_indexerconf.xml \
  -n genericindex \
  -s shard1:http://localhost:8983/solr

5.1.3 And we are done

If you add any new content now, it will be indexed. If you have existing content in Lily, you can launch a batch index build to re-index it.

If you made any errors in the parameters to lily-add-index, you can change them using lily-update-index.

Before you will find your content in Solr, you need to commit the index, e.g. using:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'

You can perform queries via Solr's admin console:

http://localhost:8983/solr/admin/

5.2 Indexer Tutorial

5.2.1 Overview

Getting documents indexed into Solr requires the following steps:

  1. write an indexer configuration, which specifies which records to index and how to map the Lily fields onto Solr fields

  2. write a matching Solr schema and launch a Solr instance that makes use of this schema

  3. declare an index in Lily that makes use of this configuration

5.2.2 The Lily schema

Before setting up an index, you should already have a schema with field types and record types, since in your indexer configuration you will refer to these types. We assume you are already familiar with this part.

5.2.3 Version Tags

A record in Lily can have one or more versions, or it can have no versions at all. This depends on the scope (versioned, non-versioned) of the fields in the record. A record which has only non-versioned fields will have no versions.

To index records, we need some way to identify what version(s) of the records we want to be indexed. The mechanism for this is version tags. A version tag (often shortened to vtag) is a named pointer to a specific version of a record. For example, you could define a tag called 'live' which points to the version that contains the ready-for-publishing content. In one record, this live tag could point to version 5, for another record, it could point to version 3, etc.

You can have multiple version tags, and have the versions corresponding to all those tags indexed. When searching, you can then limit your search to the versions having some tag.

To make records without versions fit in this system, a special version '0' is supported: version 0 is essentially a pointer to the set of non-versioned fields of a record. This also works for records that do have versions.

Technically, a version tag is just another field in the record. A version tag field should be non-versioned, single-valued, of type long integer. Version tag fields should be in the namespace org.lilyproject.vtag.

The vtag 'last'

To make things easier, Lily comes with a built-in virtual vtag that is automatically defined for all records. This vtag is called 'last' and always points to the last version of the record, or to the '0' version for records without versions. This vtag is not actually stored as a field in the record.

So in case you simply want to index the last content, or when you are not using versioning at all, then all you need is the 'last' vtag.
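
As a rough model of how a vtag maps to a version, consider the sketch below. The record layout (a dictionary with a list of version numbers and a dictionary of fields keyed by namespace and name) is an assumption made for the example and is not how the repository stores records.

```python
VTAG_NAMESPACE = "org.lilyproject.vtag"


def resolve_vtag(record, vtag):
    """Return the version number a vtag points to, or None if the tag is not set."""
    if vtag == "last":
        # Virtual vtag: the highest version, or 0 for records without versions.
        return max(record["versions"], default=0)
    # Ordinary vtags are non-versioned LONG fields in the vtag namespace.
    return record["fields"].get((VTAG_NAMESPACE, vtag))


versioned = {"versions": [1, 2, 3, 4, 5], "fields": {(VTAG_NAMESPACE, "live"): 3}}
unversioned = {"versions": [], "fields": {}}
```

For the versioned record, 'live' resolves to version 3 and 'last' to version 5; for the record without versions, 'last' resolves to the special version 0.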

5.2.4 Indexer configuration sample

Here is a sample indexer configuration:

<?xml version="1.0"?>
<indexer xmlns:b="org.lilyproject.bookssample"
         xmlns:sys="org.lilyproject.system">

  <records>
    <record matchNamespace="b" matchName="Book"   vtags="last"/>
    <record matchNamespace="b" matchName="Author" vtags="last"/>
  </records>

  <fields>
    <field name="title"      value="b:title"/>
    <field name="authors"    value="b:authors=>b:name"/>
    <field name="name"       value="b:name"/>
    <field name="recordType" value="sys:recordType"/>
  </fields>

</indexer>

There are two parts to this configuration: the 'records' and the 'fields'.

Records

The 'records' section defines what records should be indexed. This decision is based on the record type of the record, which is specified using the matchNamespace and matchName attributes. As the names of these attributes suggest, they can contain wildcard expressions; refer to the reference documentation for full details on this. If a record matches one of these rules, then the vtags attribute is used to define what versions of the record should be indexed. This can contain a comma-separated list; here we only used the built-in vtag 'last'.

Fields

The 'fields' section defines all the fields that can be sent to Solr, and their binding to Lily record fields. The fields are all global, they are not grouped per record type or so. If an index field has no value for some record, it will obviously not be added to the index. For example, the author records have a name but no title field, so for authors no title will be added to the Solr document. If for some record, there are no index fields that produce a value, the record will not be added to the index.

In the example above, the title and name field map straight to the Lily field of the same name. For the authors field we do something special. The authors field is a LIST<LINK> field pointing to the authors of a book. The "=>" symbol is the dereference operator. The expression "b:authors=>b:name" tells the indexer this: follow the link(s) in the b:authors field to the author records they point to, and from those records, take the b:name field. These dereference expressions work both for single-valued and multi-valued (LIST) links, and can follow links multiple levels deep.
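
The dereference operator can be pictured with the following sketch. It is a simplified model and not the Indexer's evaluation code: records are dictionaries, links are plain record IDs, the repository is a dictionary from ID to record, and vtags and variants are ignored.

```python
def evaluate(expr, record, repository):
    """Evaluate 'field' or 'linkField=>field=>...' and return a list of values."""
    first, _, rest = expr.partition("=>")
    value = record.get(first)
    if value is None:
        return []
    values = value if isinstance(value, list) else [value]
    if not rest:
        return values
    results = []
    for target_id in values:
        # Follow each link and evaluate the remainder on the linked record.
        target = repository.get(target_id)
        if target is not None:
            results.extend(evaluate(rest, target, repository))
    return results


# Hypothetical sample data in the spirit of the books sample.
repository = {
    "author1": {"b:name": "Jane Doe"},
    "author2": {"b:name": "John Roe"},
}
book = {"b:title": "A Title", "b:authors": ["author1", "author2"]}
names = evaluate("b:authors=>b:name", book, repository)
```

Because the names are copied into the book's index entry, the Indexer has to re-index the book whenever one of the linked author records changes; this is the denormalization bookkeeping mentioned earlier.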

Record type indexing

A common need is to index the record type of the record, so that you can limit your queries to records of a certain type. The record type information can be addressed like any other field, through a special system namespace. Notice how the sys prefix maps to the namespace org.lilyproject.system. The sys:recordType will index the record type in the format "{namespace}name". There are other possibilities: to index the namespace and name separately, to index the mixins, etc. This is explained in the indexer configuration reference.

Dynamic field mappings

If you have many fields or frequently change the schema, you might want some way to define generic field mapping rules. This is possible, and is again covered in the indexer configuration reference (look for 'dynamic fields').

5.2.5 Solr configuration sample

The following is a snippet from a Solr configuration that matches the above indexer configuration:

<schema name="example" version="1.2">

 <types>
   [snipped: see Solr's example schema]
 </types>

 <fields>
   <!-- Fields which are required by Lily -->
   <field name="lily.key" type="string" indexed="true" stored="true" required="true"/>
   <field name="lily.id" type="string" indexed="true" stored="true" required="true"/>

   <!-- Fields which are required by Lily, but which are not required to be indexed or stored -->
   <field name="lily.vtagId" type="string" indexed="true" stored="true"/>
   <field name="lily.vtag" type="string" indexed="true" stored="true"/>
   <field name="lily.version" type="long" indexed="true" stored="true"/>

   <!-- Your own fields -->
   <field name="title" type="text" indexed="true" stored="true" required="false"/>
   <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
   <field name="name" type="text" indexed="true" stored="true" required="false"/>
   <field name="recordType" type="string" indexed="true" stored="true"/>
 </fields>

 <!-- Lily requires the uniqueKey to be set to lily.key -->
 <uniqueKey>lily.key</uniqueKey>

 <defaultSearchField>title</defaultSearchField>

 <solrQueryParser defaultOperator="OR"/>

</schema>

We have left out the Solr field type definitions, as the ones we use here are those from Solr's example schema.

There are two sets of fields you need:

Note that after changing the Solr configuration, Solr should be restarted. For some kinds of changes, the full index might need to be rebuilt.

It is required to set the uniqueKey property to lily.key.

5.2.6 Declaring an index

Once you have launched Solr with an appropriate schema configured, and have written an indexer configuration, you can add an index in Lily with the lily-add-index command:

lily-add-index \
  -n indexName \
  -c indexerconf.xml \
  -s shard1:http://localhost:8983/solr \
  -z zookeeperhost

The lily-add-index command has three required arguments:

The -z option specifies the ZooKeeper connection string. By default 'localhost' is used. Since this and other indexer CLI commands are short-running, it is not really required to specify the full ZooKeeper connection string, just one host name will do.

Once defined, you can update or delete the index. This is described in more detail in managing indexes.

While Lily supports adding multiple indexes, many users will only need one index. The ability to have multiple indexes is not for functional separation, but rather for technical reasons, as explained in managing indexes. You should (typically) not have multiple indexes that point to the same Solr instance!

5.2.7 Triggering indexing

Indexing is triggered by events generated by the repository. Thus when you create, update or delete records the Indexer will be triggered.

You can also re-index the existing records in the repository through a batch index build. It is also possible to disable incremental indexing and only use batch index building. Or you can temporarily pause incremental indexing (the events will be queued). All this is described in managing indexes.

5.2.8 Committing the index

Suppose you have defined an index, added some records, and now try to find them in Solr. This will not give any results, unless the Solr index has first been committed. This is because Solr buffers updates and only after a while flushes this buffer into a new, searchable, index segment.

You can configure Solr to commit the index automatically at the interval of your choice, or you can also trigger the commit manually, as follows:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'

5.2.9 Querying

To query the index, directly make use of Solr. Consult the Solr documentation or book for more information on this.

For example, a simple query on the word 'something' is done like this:

curl 'http://localhost:8983/solr/select/?q=something'

If you prefer to work with JSON, like in Lily's REST interface, use:

curl 'http://localhost:8983/solr/select/?q=something&wt=json'

If you use multiple vtags, you will most often want to limit your search to one vtag-view. This can be done by adding a condition on the lily.vtag field. For example, we could enforce this condition through Solr's filter query feature, as follows:

curl 'http://localhost:8983/solr/select/?q=something&fq=%2Blily.vtag%3Alast'

in which %2B is a plus sign and %3A a colon, so the filter query is "+lily.vtag:last".
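
If you build such URLs from code, the escaping can be left to a library. The sketch below uses only the Python standard library; the helper name solr_select_url is made up for this example, and it only constructs the URL without contacting Solr.

```python
from urllib.parse import urlencode


def solr_select_url(base, query, vtag=None, wt=None):
    """Build a Solr select URL, optionally restricted to one vtag view."""
    params = [("q", query)]
    if vtag is not None:
        # Filter query on the lily.vtag field; urlencode escapes '+' and ':'.
        params.append(("fq", f"+lily.vtag:{vtag}"))
    if wt is not None:
        params.append(("wt", wt))
    return f"{base}/select/?{urlencode(params)}"


url = solr_select_url("http://localhost:8983/solr", "something", vtag="last")
```

The resulting URL is the same as the one passed to curl above.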

5.2.10 Debugging indexing

By enabling debug logging for the category org.lilyproject.indexer.engine you will see information about what the Indexer is doing.

If you are simply launching Lily from the command line (e.g. in a development setup), you can enable logging to standard out with the -l and -m options:

lily-server -l debug -m org.lilyproject.indexer.engine

You can as well edit the lily-log4j.properties file. The above is just a shortcut to temporarily enable logging to stdout for some category. You can also change the logging configuration at runtime through JMX (jconsole).

Among other things, this will output lines like this when a record gets actually pushed to the index:

[Thread-8] DEBUG org.lilyproject.indexer.engine.Indexer - Record UUID.6ce28c20-bcb4-41f9-af97-63a774242208, vtag live: indexed

5.2.11 Further information

With the above you should have a basic understanding of the Indexer. You can also read about:

5.3 Managing Indexes

5.3.1 About multiple indexes

Lily allows you to define multiple indexes. Each of these indexes should point to a different Solr instance (or to a different Solr core). For many uses, having just one index will suffice.

When is it useful to have more than one index?

5.3.2 Index states

Each index has three kinds of states:

The states are read-write, though certain values can only be assigned by the system. In other words, certain state transitions can only be performed by the system.

5.3.2.1 The general state

The general state can be one of:

The ACTIVE and DISABLED states are not used by Lily at this time; you can use them to indicate whether some index is still intended to be used.

When you want to delete an index, you change its general state to DELETE_REQUESTED. The system will pick this up by moving the state to DELETING. After this, the index will either disappear or change to DELETE_FAILED. Deleting an index only deletes the definition of the index in Lily; the actual Solr instance is left untouched.

5.3.2.2 The update state

The update state is about the incremental updating of the index. It can be one of:

5.3.2.3 The batch build state

The batch build state can be one of:

By default the batch build state is INACTIVE. When you want to launch a batch build job for this index, you change its state to BUILD_REQUESTED. The system will react to this state change and launch the batch build job, it will then move the state to BUILDING. The system will then follow up on the state of this batch build job and move the state back to INACTIVE when done. Information such as the ID of the job, and whether it succeeded or not, are stored in other index properties, as described further on.
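
The life cycle described above can be summarized as a small transition table. This is a reading of the prose in this document, not code taken from Lily; in particular, which actor may perform each transition is inferred from the description, and the failed-to-start transition is the one described in the section on performing a batch build.

```python
# (from_state, to_state) -> actor allowed to perform the transition (assumed).
TRANSITIONS = {
    ("INACTIVE", "BUILD_REQUESTED"): "user",
    ("BUILD_REQUESTED", "BUILDING"): "system",
    ("BUILD_REQUESTED", "INACTIVE"): "system",  # job failed to start
    ("BUILDING", "INACTIVE"): "system",         # job finished
}


def move(state, target, actor):
    """Return the new state, or raise ValueError if the transition is not allowed."""
    if TRANSITIONS.get((state, target)) != actor:
        raise ValueError(f"{actor} may not move {state} -> {target}")
    return target


state = "INACTIVE"
state = move(state, "BUILD_REQUESTED", "user")
state = move(state, "BUILDING", "system")
state = move(state, "INACTIVE", "system")
```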

5.3.3 Performing common index actions

5.3.3.1 General notes

5.3.3.1.1 Command line clients and programmatic access

All index-related actions are performed through a set of command line utilities. These utilities internally make use of the API provided by the lily-indexer-model project. You could write your own clients using this API, for example to provide another user interface or to integrate certain actions as part of a bigger system. If you would like to do this, we recommend looking into the source code of the indexer admin utilities (in the source tree: cr/indexer/admin-cli).

Information about the indexes can also be retrieved from the REST interface, though at the time of this writing the changes that could be made to it were limited to changing the index states.

5.3.3.1.2 ZooKeeper connect string

All command line utilities need to know one common setting: the ZooKeeper connect string. This is specified using the -z option, something like:

lily-list-indexes -z zookeeper1:2181,zookeeper2:2181,zookeeper3:2181

By default, localhost:2181 is used. Since the CLI utilities are short-running, you can get away with specifying just one of the ZooKeeper hosts rather than the full connection string.

5.3.3.1.3 Getting help

Use the -h option to get information on the full set of available options of each utility.

5.3.3.1.4 Forgot the name of an index

Most commands require you to specify the name of an index. If you forgot the name, use lily-list-indexes to get a list of the defined indexes.

5.3.3.2 Knowing what indexes exist

Perform the following command:

lily-list-indexes

If you have three indexes, this will show something like this:

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards: 
    + shard1: http://solr:8983/solr/core1
index2
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index2
  + Solr shards: 
    + shard1: http://solr:8983/solr/core2
index3
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index3
  + Solr shards: 
    + shard1: http://solr:8983/solr/core3
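
If you want to check these states from a script, the listing is simple enough to parse. The sketch below assumes the output has exactly the layout shown above (an unindented index name followed by indented "+ key: value" lines); this layout is not a documented interface, so treat the parser as a convenience that may need adjusting. Nested entries such as the shards are flattened into the same dictionary.

```python
def parse_index_listing(text):
    """Parse lily-list-indexes style output into {index_name: {key: value}}."""
    indexes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # An unindented line starts a new index.
            current = indexes.setdefault(line.strip(), {})
        elif current is not None and line.lstrip().startswith("+ "):
            # Split on the first colon only, so URLs in the value stay intact.
            key, _, value = line.lstrip()[2:].partition(":")
            current[key.strip()] = value.strip()
    return indexes


sample = (
    "index1\n"
    "  + General state: ACTIVE\n"
    "  + Batch build state: INACTIVE\n"
    "  + Solr shards: \n"
    "    + shard1: http://solr:8983/solr/core1\n"
    "index2\n"
    "  + General state: ACTIVE\n"
)
parsed = parse_index_listing(sample)
```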

5.3.3.3 Creating an index

Creating an index is done through the lily-add-index command:

lily-add-index -n indexName -c indexerconf.xml -s shard1:http://localhost:8983/solr

The above shows the three arguments minimally required:

5.3.3.3.1 Incremental indexing is enabled by default

When a new index is added, it is by default created with the state SUBSCRIBE_AND_LISTEN, which means incremental updating will be immediately enabled. You can create it with a different initial state through the option --update-state.

5.3.3.3.2 Using multiple shards

If you want to use more than one shard, specify a comma-separated list of Solr URLs, prefixing each with a name:

-s shard1:http://solr1:8983/solr,shard2:http://solr2:8983/solr,shard3:http://solr3:8983/solr

Lily has a built-in default strategy for assigning records to shards, but you can provide a custom configuration too. Sharding is explained in more detail in Solr index sharding.
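
When generating the -s argument from a script, it can help to validate the shard list first. The following sketch mirrors the name:URL format shown above; it is not the parser used by lily-add-index, and the function name parse_shards is made up for this example.

```python
def parse_shards(spec):
    """Parse 'name1:url1,name2:url2' into an ordered {name: url} dictionary."""
    shards = {}
    for entry in spec.split(","):
        # Split on the first colon only: the URL itself contains colons.
        name, sep, url = entry.partition(":")
        if not sep or not name or not url:
            raise ValueError(f"expected name:url, got {entry!r}")
        if name in shards:
            raise ValueError(f"duplicate shard name: {name}")
        shards[name] = url
    return shards


shards = parse_shards("shard1:http://solr1:8983/solr,shard2:http://solr2:8983/solr")
```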

5.3.3.4 Updating the indexer configuration of an index

If you make a change to the indexer configuration, you can update an existing index with:

lily-update-index -n indexName -c indexerconf.xml

Do not forget that when you have added new index fields, you need to add them to the Solr schema too. Also, do not forget that existing content will not be automatically re-indexed: you need to start a batch build job for that.

If you no longer have the indexerconf.xml file, you can retrieve it as follows:

lily-get-indexerconf -n indexName -o indexerconf.xml

5.3.3.5 Updating other index properties

Similar to the indexer configuration, you can also update other index properties such as the Solr shard URLs.

5.3.3.6 Deleting an index

Deleting an index is done by updating its general state to DELETE_REQUESTED:

lily-update-index -n indexName --state DELETE_REQUESTED

This will remove the message queue subscription (if any). If a batch build job is running, it will be killed. If all of this succeeds, the index will be deleted. Otherwise, it will move to the state DELETE_FAILED. You can check up on this using lily-list-indexes. In case of failures, check the log file of the Lily server that is running the indexer master.

Note that this only deletes the definition of the index in Lily; the Solr index itself is not dropped, as it is not managed by Lily.

5.3.3.7 Performing a batch build (rebuilding an index)

A batch index build will (re-)index all records in the repository. A batch build can be, but is not required to be, run concurrently with incremental index updating, so that any changes happening after the batch build is started are also reflected in the index.

A batch index build will not first delete the Solr index, so if you want to re-index from a blank slate, you first have to delete the Solr index yourself. See also this Solr FAQ entry which suggests doing a query-based deletion. Alternatively, you can simply clear out the Solr index directory while Solr is shut down.

To start a batch (re)build of an index, execute:

lily-update-index -n nameOfYourIndex --build-state BUILD_REQUESTED

This change in state will be picked up by Lily which will launch a MapReduce job.

You can follow the progress via lily-list-indexes. Its output will be similar to this:

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: BUILDING
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards: 
    + shard1: http://localhost:8983/solr
  + Active batch build:
    + Hadoop Job ID: job_20101021170619294_0001
    + Submitted at: 2010-10-21T18:41:38.677+02:00
    + Tracking URL: http://localhost:43835/jobdetails.jsp?jobid=job_20101021170619294_0001

Note that the batch build state is now BUILDING and that a section 'Active batch build' appeared. Following the tracking URL will bring you to the Hadoop JobTracker web ui.

If the batch job fails to start, for example because the job tracker is unreachable, the batch build state will immediately move to INACTIVE, and the last batch build information will indicate that the job failed to start. It will also mention on which Lily node you should check the log files to see what the error was. In the log file, pay attention to messages for the log category org.lilyproject.indexer.master.IndexerMaster.

Once finished, the batch build state will become INACTIVE and the information about the last batch build run is shown under 'Last batch build':

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards: 
    + shard1: http://localhost:8983/solr
  + Last batch build:
    + Hadoop Job ID: job_20101021170619294_0001
    + Submitted at: 2010-10-21T18:41:38.677+02:00
    + Success: true
    + Job state: succeeded
    + Tracking URL: http://localhost:43835/jobdetails.jsp?jobid=job_20101021170619294_0001
    + Map input records: 163
    + Launched map tasks: 1
    + Failed map tasks: 0
    + Index failures: 0

It is important to check that the 'Index failures' property is 0. There can be indexing errors even when the MapReduce job as a whole succeeded: when a record fails to be indexed, the map task is not aborted, and only this counter is incremented. More details about the errors that occurred can be found in the Hadoop log files.
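The check on 'Index failures' can be automated. The following is a minimal sketch that parses text laid out like the lily-list-indexes sample output above and reports the indexes with a non-zero failure counter. The function names are hypothetical, and the parsing relies only on the sample shown here (nested sections are flattened into one dictionary per index), so the real output may need adjustments.

```python
import re

def parse_index_listing(text):
    """Parse 'lily-list-indexes' style output into {index name: {property: value}}.

    Nested sections such as 'Last batch build' are flattened into the same dict.
    """
    indexes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # An unindented line starts a new index entry.
            current = indexes.setdefault(line.strip(), {})
            continue
        match = re.match(r"\s*\+\s*([^:]+):\s*(.*)$", line)
        if match and current is not None:
            current[match.group(1).strip()] = match.group(2).strip()
    return indexes

def failed_indexes(text):
    """Return the names of indexes whose listing reports index failures."""
    return [name for name, props in parse_index_listing(text).items()
            if int(props.get("Index failures", "0")) > 0]

SAMPLE = """index1
  + General state: ACTIVE
  + Batch build state: INACTIVE
  + Last batch build:
    + Success: true
    + Index failures: 2
"""

if __name__ == "__main__":
    print(failed_indexes(SAMPLE))
```

Note that 'Success: true' and a non-zero 'Index failures' can occur together, which is exactly the case this check is meant to catch.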

5.3.3.8 Interrupting a batch build

To stop a batch build prematurely, kill it directly in Hadoop:

hadoop job -kill {id}

Depending on the configuration of Hadoop, you can also kill it through the JobTracker web ui. See the property webinterface.private.actions in your Hadoop's core-site.xml.

Lily will notice that the job was killed, and update the index state accordingly.

5.4 Indexer Configuration

The indexer configuration defines how a Lily record should be mapped to a Solr document. You can configure what records, and what variants and versions of those records, need to be indexed. You can use link dereferencing to denormalize data in the index.

Besides Lily's indexer configuration, you also need to configure Solr's schema.xml. This might seem like double work, but the two files serve different purposes, and allowing manual Solr configuration gives maximum flexibility. Of course, nothing prevents you from generating both configurations from a common definition. Lily may offer this as a feature at some point.

It is possible to have generic rules in the configuration, so that not every record type or field type needs to be mapped individually. You can even go so far as to make an indexer configuration that basically tells the indexer to index everything; see Setting Up A Generic Index.

The listing below gives an overview of the syntax of the indexer configuration. More details are given in the next sections (online version: see navigation or click the links in the listing).

<indexer xmlns:prefix="...">

  <records>

    <record matchNamespace="..." matchName="..." matchVariant="..." vtags="..."/>

  </records>

  <formatters default="...">

    <formatter
      name="{unique name}"
      class="{name of formatter class that can format this kind of value}"/>

  </formatters>

  <fields>

    <field name="{solr field name, not necessarily unique}"
           value="{prefix:name or dereference expression}"
           [formatter="{formatter name}"]
           [extractContent="true|false"]>
    </field>

  </fields>

  <dynamicFields>

    <dynamicField matchNamespace="..."
                  matchName="..."
                  matchType="{type pattern}"
                  matchScope="versioned|non_versioned|versioned_mutable"
                  name="{solr field name}"
                  extractContent="true|false"
                  continue="true|false"
                  formatter="{formatter name}"/>

  </dynamicFields>

</indexer>

5.4.1 Indexerconf: Version Tag Based Views

Version tags are used to determine what versions of a record should be indexed.

A record in Lily can have one or more versions, or it can have no versions at all. This depends on the scope (versioned, non-versioned) of the fields in the record. A record which has only non-versioned fields will have no versions.

Typically, it is not necessary to index all versions of a record, since many versions will be draft versions or old, archived versions. In some cases, it is fine to simply index the last version. But in other cases, versions need to undergo some review workflow, and hence the published version might not be the last one.

The solution offered by Lily's indexer is based on version tags. A version tag is a label assigned to a version. The set of records having a version with a particular version tag attached to it forms a particular view on the repository.

An alternative to the tag-based system is to use time-based views. In this case, a 'point in time' determines the version used for each record, which makes it possible to query the state of the repository as it was at some point in time. Lily does not currently support point-in-time based views. If you are interested in this, please contact us.

Technically, a version tag is just another field in the record. The value of the field is a version number, the name of the field is the tag. A version tag field should be non-versioned, single-valued, and of type long. Version tag fields should be in the namespace org.lilyproject.vtag.

Within a record, there can only be one version having a particular version tag (in other words, a particular vtag can point to at most one version). However, multiple version tags can point to the same version.

You can request any number of version tags to be indexed. Thus the same record might be indexed multiple times, in multiple version-tag-views.

To make records without versions fit in this system, a special version '0' is supported: version 0 is essentially a pointer to the set of non-versioned fields of a record. This also works for records that do have versions.

The built-in version tag: last

To make things easier, Lily comes with a built-in virtual vtag that is automatically defined for all records. This vtag is called 'last' and always points to the last version of the record, or to the '0' version for records without versions. This vtag is not actually stored as a field in the record.

So in case you simply want to index the last content, or when you are not using versioning at all, then all you need is the 'last' vtag.
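To make the vtag rules concrete, here is a minimal sketch of how a version tag could be resolved to a version number. The record model (a dictionary of non-versioned fields keyed by namespace and name) and the function name are hypothetical and are not Lily's API. Only the namespace, the 'last' behavior and the special version 0 come from the description above.

```python
VTAG_NAMESPACE = "org.lilyproject.vtag"

def resolve_vtag(non_versioned_fields, last_version, vtag):
    """Return the version number a vtag points to, or None if the record lacks it.

    non_versioned_fields: dict mapping (namespace, name) to a value (hypothetical model)
    last_version: highest version number of the record, or None if it has no versions
    """
    if vtag == "last":
        # 'last' is virtual: it is never stored and is defined for every record.
        # Records without versions resolve to the special version 0.
        return last_version if last_version is not None else 0
    # Any other vtag is an ordinary non-versioned field holding a version number.
    return non_versioned_fields.get((VTAG_NAMESPACE, vtag))

if __name__ == "__main__":
    fields = {(VTAG_NAMESPACE, "live"): 3}
    print(resolve_vtag(fields, 5, "live"), resolve_vtag(fields, 5, "last"))
```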

Version tags & denormalization

As described further on, denormalization (= retrieving fields from linked records to store them within the index entry of the current record) also honors the version-tag views.

Non-versioned fields & version tags

A record can contain both versioned and non-versioned fields at the same time. When non-versioned fields are indexed, they are stored within the index entry of each indexed version. When a non-versioned field changes, the index entries for all indexed versions will be rebuilt.

5.4.2 Indexerconf: Records

<records>
  <record matchNamespace="..." matchName="..." matchVariant="..." vtags="..."/>
</records>

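The 'records' section determines whether a particular record should be indexed or not. This is done based on the record type of the record (matchNamespace and matchName) and on its variant properties (matchVariant).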

5.4.2.1 Evaluation of the record rules

The list of <record> rules is evaluated in order. The first rule whose record type and variant expressions match the record is used to determine the vtags to be indexed; later rules are ignored.

If there is no matching rule for a record, it will not be indexed. However, it might be that information from this record is denormalized into the index entries of other records. Even if this record itself is not indexed, its denormalized information will still be updated.
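The first-match evaluation can be sketched as follows. This is only an illustration: each rule is reduced to a hypothetical predicate plus its list of vtags, and the actual matching on matchNamespace, matchName and matchVariant is deliberately left out.

```python
def vtags_for_record(rules, record):
    """Return the vtags of the first matching rule, or None when no rule matches.

    rules: ordered list of (predicate, vtags) pairs (hypothetical model of <record> rules)
    """
    for matches, vtags in rules:
        if matches(record):
            # The first matching rule wins; later rules are not consulted.
            return vtags
    # No matching rule: the record itself is not indexed.
    return None

RULES = [
    (lambda r: r["type"] == "Book", ["last", "live"]),
    (lambda r: True, ["last"]),
]

if __name__ == "__main__":
    print(vtags_for_record(RULES, {"type": "Book"}))
    print(vtags_for_record(RULES, {"type": "Author"}))
```

Because the order matters, a catch-all rule such as the second one above only makes sense at the end of the list.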

5.4.2.2 matchVariant expression

The matchVariant expression is quite simple, and best explained with some samples:

5.4.2.3 Version tags

The version tags are specified as a comma-separated list of version tag names. Each name is the field type name of the version tag, without the namespace. For example: vtags="last,live,in-review".

5.4.3 Indexerconf: Formatters

Currently it is not possible to register custom formatter implementations, so you can ignore the formatters for now.

All values transmitted to Solr are strings. This means that non-string values need to be formatted (serialized) as strings. This is made possible by the formatters.

Lily has a built-in formatter for all kinds of values, making the configuration of formatters completely optional.

The available formatters are declared in a section as follows:

<formatters default="...">
  <formatter
      name="{unique name}"
      class="{name of formatter class that can format this kind of value}"/>
</formatters>

The default attribute should match the name of one of the formatters. It is optional: Lily's built-in default formatter is used as a fallback.

A formatter needs to implement the following interface (part of lily-indexer-model):

org.lilyproject.indexer.model.indexerconf.Formatter

To use a specific formatter for some field, specify the formatter attribute on the field tag:

<field name="..." value="..." formatter="{name of formatter}"/>

A formatter can return one or more strings, irrespective of whether the Lily field is a list or not. Likewise, the formatter handling a list value might return just one string.

While string values do not need to be formatted as string anymore, they are still passed through the same mechanics.

5.4.3.1 Built-in formatter

When no formatters are configured or no formatter matches the attributes specified, a built-in default formatter is used. For most kinds of values it uses the "toString()" representation.

Date-times are formatted as ISO8601 in UTC time zone, which Solr is able to handle. Date fields are formatted the same, but with "/DAY" appended to them (cfr. Solr date math).

PATH-type values are formatted with slashes between the elements, for example: "value1/value2/value3".

The first level of LIST-type values maps onto Solr multi-valued fields. For deeper nested lists, the individual items are formatted and then concatenated into one space-separated string.

RECORD-type values are formatted by formatting each of their individual fields and then concatenating everything together in one space-separated string. This works recursively.
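The rules for PATH, LIST and RECORD values can be illustrated with a small sketch. It is not Lily's formatter: Python lists stand in for LIST values, dictionaries for RECORD values, and the Path class is a hypothetical stand-in for PATH values. Date handling is left out.

```python
class Path(tuple):
    """Hypothetical stand-in for a Lily PATH value."""

def _to_text(value):
    """Format one value as a single string."""
    if isinstance(value, Path):
        # PATH: elements separated by slashes.
        return "/".join(_to_text(v) for v in value)
    if isinstance(value, dict):
        # RECORD: format each field and concatenate, space-separated (recursive).
        return " ".join(_to_text(v) for v in value.values())
    if isinstance(value, list):
        # Nested LIST: items are formatted and joined into one string.
        return " ".join(_to_text(v) for v in value)
    return str(value)

def format_for_solr(value):
    """Return the list of strings sent to Solr for one field value."""
    if isinstance(value, list):
        # Only the first list level becomes a Solr multi-value.
        return [_to_text(item) for item in value]
    return [_to_text(value)]

if __name__ == "__main__":
    print(format_for_solr(Path(["value1", "value2", "value3"])))
    print(format_for_solr([["a", "b"], ["c"]]))
```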

5.4.4 Indexerconf: Fields

The 'fields' section of the indexerconf defines the fields that should end up in the index (= the fields that are sent to Solr). Let's call these index fields, to avoid confusion with record fields.

<fields>

  <field name="{solr field name, not necessarily unique}"
         value="{prefix:name or dereference expression}"
         [formatter="{formatter name}"]
         [extractContent="true|false"]>
  </field>

</fields>

The field mapping is independent of the records to which the fields belong. This matches well with the fact that field types in Lily are independent of record types: record types are only sets of field types.

Each index field is bound to some record field.

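The value of an index field can be taken from the current record, from a nested or linked record, or from a less-scoped variant of the same record. Each of these cases is described in the sections below.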

5.4.4.1 Correspondence between Lily LIST-type fields and Solr multi-value fields

As you can guess, Lily LIST-type fields map to Solr multi-value fields.

In Lily you can nest LIST fields, for example LIST<LIST<STRING>>. In such case, the first list level maps onto the Solr multi-value, while further nested lists will be formatted as a string (space separated).

5.4.4.2 Index field name

There can be multiple index fields with the same name. If several of them produce a value for a certain record, a multi-value will be sent to Solr.

There are other ways of producing multi-values towards Solr: the most obvious is an index field mapped to a Lily LIST-type field. A formatter can also produce multiple values from a single input value.

Index field names starting with 'lily.' are reserved for internal uses.

5.4.4.3 Order is important

The index fields will be added to the Solr document in the order specified in the indexer configuration. This can be important for multi-valued fields, for which Solr maintains the order.

5.4.4.4 Determination of the relevant index Fields for an input record

Not all fields will be sent to Solr for all records, but (obviously) only fields for which the value is non-null.

For non-deref values, this will usually be the case if the field exists within the record, though the formatter might strip it to null.

For deref values, if the first field in the expression exists in the current record, it might still very well be that further on something evaluates to null. In such case, the index field will not be added to the Solr document.

5.4.4.5 Content extraction

Content extraction is performed using the Tika library. While Tika can extract both content and metadata, it is only the content we are interested in here. Metadata extraction should probably be handled when storing content in Lily, and mapped onto Lily fields.

<field name="..." value="..." extractContent="true">

When extractContent is true, no formatter will be used to format the field value (if one is specified, it will be ignored). Instead, content extraction will be performed. The field value has to be a blob.

If the value is a blob but extractContent is not true, the blob value will be handled by a formatter instead (the default formatter will not do anything useful).

LIST<BLOB> and nested lists are supported.

Tika's AutoDetectParser is used with its default configuration. The amount of data extracted from a single blob is limited to 500K. If a blob contains more content, an info-level message is logged and the first 500K will be sent to Solr. It should be noted that Solr also has a maxFieldLength (see solrconfig.xml), which by default is 10000 tokens (not characters).

5.4.4.6 Index fields that use a value from the current record

The syntax is straightforward:

<field name="..." value="prefix:value"/>

5.4.4.7 Index fields that use a value from a nested record or that dereference links towards other records

RECORD and LINK type fields are actually quite similar: they both lead to another record. Therefore they are handled in the same way in the indexer. You can navigate through them using the symbol "=>", which is called the dereference operator.

Examples:

<field name="..." value="prefix:value1=>prefix:value2"/>
<field name="..." value="prefix:value1=>prefix:value2=>prefix:value3"/>

Each field to the left of the '=>' operator should be a LINK-type or a RECORD-type field. The expression is evaluated from left to right, and the dereferencing can go multiple levels deep. The last field in the chain can be of any type.

Dereferencing through LIST<LINK> or LIST<RECORD> fields also works, at any level in the follow-field chain. The order of the values is maintained. Dereferencing through nested lists, such as LIST<LIST<LINK>>, is not supported.

If somewhere in the chain a field evaluates to null (either the link field has no value or it points to a non-existing record), the deref expression as a whole is null and hence no index field will be added. In the LIST case, an entry that points to a non-existing record is dropped from the list. If all entries are dropped from the list, evaluation of the dereference stops.

The actual value from the field at the end of the chain will be handled as for non-dereferenced field values: a formatter will be applied, or content extraction will be performed.

Dereferencing happens in a certain vtag-based view. So when we are indexing vtag X of a record, any information dereferenced from other records will also be taken from the version bearing vtag X. If the target record does not have a version with tag X, then the deref evaluates to null. This holds even if the deref'ed field is a non-versioned field, for which the version does not really matter.
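The null handling of dereference expressions can be sketched as follows. This is a simplified model, not Lily's implementation: records are plain dictionaries, link values are record ids, a list of ids stands in for LIST<LINK>, and both embedded RECORD values and vtag-based views are ignored. All names are hypothetical.

```python
def deref(records, record, chain):
    """Evaluate a dereference chain such as ["ns:author", "ns:name"].

    records: dict of record id -> dict of field name -> value (hypothetical model)
    Returns the final value(s), or None if the expression evaluates to null.
    """
    current = [record]
    is_list = False
    for field in chain[:-1]:
        targets = []
        for rec in current:
            link = rec.get(field)
            if link is None:
                continue
            ids = link if isinstance(link, list) else [link]
            is_list = is_list or isinstance(link, list)
            # Links to non-existing records are dropped, order is kept.
            targets.extend(records[i] for i in ids if i in records)
        if not targets:
            # Nothing left to follow: the whole expression is null.
            return None
        current = targets
    values = [rec[chain[-1]] for rec in current if rec.get(chain[-1]) is not None]
    if not values:
        return None
    return values if is_list else values[0]

RECORDS = {"a1": {"ns:name": "Ann"}, "a2": {"ns:name": "Bob"}}
BOOK = {"ns:author": "a1", "ns:coauthors": ["a2", "missing", "a1"]}

if __name__ == "__main__":
    print(deref(RECORDS, BOOK, ["ns:author", "ns:name"]))
    print(deref(RECORDS, BOOK, ["ns:coauthors", "ns:name"]))
```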

5.4.4.8 Index fields that dereference towards less-scoped variants of the same record

Next to dereferencing via link fields, it is also possible to dereference towards less-dimensioned variants of a record. The general idea behind this is that the data which is specified on the less-dimensioned variants applies to all the more-dimensioned ones (for example, imagine the case of non-translatable content when working with language variants).

The syntax still uses the dereference operator, =>, but instead of a field name you can use:

Examples:

<field name="..." value="master=>prefix:field1"/>
<field name="..." value="-x,-y=>prefix:field"/>

This can also be combined with field dereferencing:

<field name="..." value="prefix:field1=>master=>prefix:field2"/>

5.4.4.9 Denormalized information and index updating

Denormalizing information in the index is a powerful feature but you should be aware of what is involved in maintaining the denormalized information.

On each change of a record, regardless of whether the record itself needs indexing, the Indexer needs to check for each index field that uses a deref-value if it is possible that the deref-value points to the current record. This is done by querying an index of links between records, we call this index the link index. If you have ten index fields that use a deref-value, this means at least ten queries on the link index for each record create, update or delete operation.

Deref-values should not be used for many-to-one links where the many is a large number. Suppose you have a million records that all have a link to the same record, and all these records store a field from that record in their index entry. When this field is updated, all million records will have to be re-indexed.

5.4.5 Indexerconf: Dynamic Index Fields

If you have lots of fields, or when you often make changes to the schema, it would be impractical to map each field individually in the indexer configuration.

Therefore, it is also possible to define dynamic rules, similar to the dynamicFields in Solr.

Let's repeat the syntax:

  <dynamicFields>

    <dynamicField matchNamespace="..."
                  matchName="..."
                  matchType="{type pattern}"
                  matchScope="versioned|non_versioned|versioned_mutable"
                  name="{solr field name}"
                  extractContent="true|false"
                  continue="true|false"
                  formatter="{formatter name}"/>

  </dynamicFields>

You can define any number of dynamic fields.

The evaluation is as follows:

5.4.5.1 Matching fields

Each of the match attributes defines a condition that the field must satisfy to match this rule. All of the match attributes are optional: a <dynamicField> rule without any match attribute will match any field, so it only makes sense to have such a rule as the last one (unless it has continue="true").

Here is what you can do with each of the match attributes:

The matchType pattern: basics

In its simplest form, this contains one, or a comma-separated list, of type names. For example:

matchType="STRING,LONG,INTEGER"

The type name can contain a wildcard at the start or the end of the expression, so you could write:

matchType="STR*"

to match all types which start with STR, which is only STRING.

The matchType pattern: matching types with arguments

If you would specify this:

matchType="LIST"

this will never match any type, since LIST has an obligatory type parameter. Specifying the literal type name, including its type argument, does work:

matchType="LIST<STRING>"

Note that the above is not valid XML, because the less-than symbol needs to be escaped:

matchType="LIST&lt;STRING>"

Since this is rather unreadable, you can replace the angle brackets by round ones:

matchType="LIST(STRING)"
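If you generate the indexer configuration from a script, the escaping can be left to an XML library. The sketch below uses quoteattr from Python's standard library; the helper name match_type_attribute is hypothetical. It shows that a pattern with angle brackets gets escaped, while the round-bracket form passes through unchanged.

```python
from xml.sax.saxutils import quoteattr

def match_type_attribute(type_pattern):
    """Render a matchType attribute with the pattern safely XML-escaped."""
    # quoteattr escapes &, < and > and adds the surrounding quotes.
    return "matchType=" + quoteattr(type_pattern)

if __name__ == "__main__":
    print(match_type_attribute("LIST<STRING>"))
    print(match_type_attribute("LIST(STRING)"))
```

The library also escapes the greater-than symbol, which XML permits but does not require.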

You might want to match any kind of list. This can be done with:

matchType="LIST(*)"

This will match LIST<STRING> or LIST<INTEGER>, but not nested lists such as LIST<LIST<STRING>>.

The following special constructs are available for matching the type argument:

<*>

matches types without argument or with one argument, but not deeper nested arguments. In case the pattern is LIST<*>, this matches LIST, LIST<STRING> but not LIST<PATH<STRING>>

<**>

matches types without argument or with arguments nested to any depth. In case the pattern is LIST<**>, this matches LIST, LIST<STRING> and also LIST<PATH<STRING>>

<+>

same as <*> but the type argument is required

<++>

same as <**> but the type argument is required

The distinction between <*> and <+> does not matter for types like LIST which always have a type argument. It can be useful for RECORD, where the type argument (a record type name) is optional.

In the type pattern, you can of course also list the type argument in full (as already shown above with LIST(STRING)), and you are allowed to use the star wildcard in both the name of the type and in its argument (the wildcard only works at the start or end of the string).

The following matches all RECORD types with a record type in the namespace "foo":

RECORD<{foo}*>

Other examples to show what is syntactically possible:

LIST<STR*>
LIST<LIS*<**>>

5.4.5.2 The name

The name of the Solr field can be defined using an expression (a template). This expression is a string in which the following constructs can be embedded:

Expression: Notes

${namespace}

${name}

${baseType}: gives the type name without parameters, in lowercase. For "STRING" this gives "string". For "LIST<STRING>" this gives "list".

${nestedBaseType}: gives the type name of the nested type, without parameters, in lowercase. If there is no nested type, gives the base name of the type itself. For "LIST<STRING>" this gives "string". For "STRING" this gives "string".

${type}: the type name, followed by the names of any nested types, separated by underscores. For "LIST<STRING>" this gives "list_string". For "LIST<LIST<STRING>>" this gives "list_list_string". For "RECORD<{dc}Title>" this gives "record" (the argument of RECORD is not a nested type).

${nestedType}: similar to type, but for the nested type, falling back to the current type if there is no nested type.

${deepestNestedBaseType}: gives the base name of the deepest nested type. For "LIST<LIST<LIST<STRING>>>" this gives "string", while nestedBaseType would give "list" in this case.

${list}: true or false, depending on whether the field is a LIST (regardless of its type argument).

${nameMatch}: if the name expression contained a wildcard, this is the text matched by that wildcard.

${namespaceMatch}: similar to ${nameMatch}.

${list?yesvalue:falsevalue}: conditionally inserts a string depending on whether the field is of type LIST. The falsevalue is optional: ${list?yesvalue}

Examples:

<!-- The name is a literal string -->
<dynamicField matchNamespace="my.namespace" matchName="field1" name="field1"/>

<!-- Use the text matched by the wildcard -->
<dynamicField matchNamespace="my.namespace" matchName="f*" name="something_${nameMatch}"/>

<!-- A dynamic field without any match attribute: will match anything. We embed the
     type in the name, so that we can have matching dynamicField rules in
     Solr. List (multi-value) fields are suffixed with '_mv'. -->
<dynamicField name="${name}_${nestedBaseType}${list?_mv}"/>
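To get a feel for how such name templates expand, here is a rough sketch of the substitution for a subset of the constructs (nameMatch, namespaceMatch and nestedType are left out). It is an illustration only: the function names and the way the type string is split are assumptions, not Lily's implementation.

```python
import re

def type_names(type_string):
    """Split e.g. "LIST<LIST<STRING>>" into ["list", "list", "string"]."""
    names = [part.lower() for part in re.split(r"[<>()]", type_string) if part]
    # The argument of RECORD is a record type name, not a nested type.
    return names[:names.index("record") + 1] if "record" in names else names

def expand_name(template, namespace, name, type_string):
    """Expand ${...} constructs in a dynamic field name template."""
    names = type_names(type_string)
    is_list = names[0] == "list"
    values = {
        "namespace": namespace,
        "name": name,
        "baseType": names[0],
        "nestedBaseType": names[1] if len(names) > 1 else names[0],
        "deepestNestedBaseType": names[-1],
        "type": "_".join(names),
        "list": "true" if is_list else "false",
    }

    def substitute(match):
        expr = match.group(1)
        if expr.startswith("list?"):
            # ${list?yesvalue:falsevalue}; the falsevalue part is optional.
            yes, _, no = expr[len("list?"):].partition(":")
            return yes if is_list else no
        return values[expr]

    return re.sub(r"\$\{([^}]*)\}", substitute, template)

if __name__ == "__main__":
    template = "${name}_${nestedBaseType}${list?_mv}"
    print(expand_name(template, "my.namespace", "title", "STRING"))
    print(expand_name(template, "my.namespace", "tags", "LIST<STRING>"))
```

With the last rule of the examples above, a STRING field named title would map to title_string and a LIST<STRING> field named tags to tags_string_mv, which lines up with dynamicField declarations such as *_string and *_string_mv on the Solr side.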

On the dynamic field you can also specify the formatter and extractContent attributes. The extractContent attribute may be specified even if the dynamic field might also match non-blob fields: the attribute only has significance when the field is a blob field.

Dynamic fields do not support link dereferencing.

Dynamic fields are evaluated after the classic, static field mappings. The only significance of this is for the order of multi-values, in case the same Solr field name occurs in both. A field that has been used in a static field mapping is still considered in the evaluation of dynamic fields.

5.4.6 Indexerconf: Indexing The RecordType

Lily does not by default index the record type of a record (e.g. as one of the built-in 'lily.' fields), because there are many options for indexing the record type: you might want to index only the record type or also the mixins, you might want to index the namespace separately to be able to search across everything in a namespace, etc.

To make it possible to index record type information, a set of system fields is available in the indexer configuration. These can be used in normal <field> mappings.

To use these fields, define the following namespace:

xmlns:sys="org.lilyproject.system"

And then refer to them like any other field:

<field name="recordType" value="sys:recordType"/>
<field name="recordTypeWithVersion" value="sys:recordTypeWithVersion"/>

The table below contains the full list of available system fields.

System field (data type): Notes

recordType (string): The namespace and the name of the record type, in the following format: {namespace}name

recordTypeName (string): Just the name of the record type.

recordTypeNamespace (string): Just the namespace of the record type.

recordTypeVersion (long)

recordTypeWithVersion (string): Namespace, name, and version of the record type, in the following format: {namespace}name:version

mixins (mv string): The mixins of the record type, without the record type itself, in the same syntax as recordType

mixinsWithVersion (mv string)

mixinNames (mv string)

mixinNamespaces (mv string): If there are duplicates (likely), they are not indexed.

recordTypes (mv string): The mixins and the record type, in the same syntax as recordType

recordTypesWithVersion (mv string)

recordTypeNames (mv string)

recordTypeNamespaces (mv string)

Technically it is possible to index the record type of some other record by following a link field:

<field name="recordType_deref" value="ns:linkfield=>sys:recordType"/>

However, you should be aware that this information will not be updated automatically when the type of the other record changes (which is a rare case anyway).

When the name of a record type changes (which should be an infrequent event, except during project development), the index will not be automatically updated, since this could affect a large number of records. Instead, perform a manual batch index build.

When a record type is updated in another way, e.g. its mixins change, no index updating is needed, since each record points to a specific version of a record type.

5.5 Required Fields In The Solr Schema

The following field declarations MUST be included in every Solr schema file:

<!-- Fields which are required by Lily -->
<field name="lily.key" type="string" indexed="true" stored="true" required="true"/>
<field name="lily.id" type="string" indexed="true" stored="true" required="true"/>

<!-- Fields which are required by Lily, but which are not required to be indexed or stored -->
<field name="lily.vtagId" type="string" indexed="true" stored="true"/>
<field name="lily.vtag" type="string" indexed="true" stored="true"/>
<field name="lily.version" type="long" indexed="true" stored="true"/>

The unique key field MUST be set as follows:

 <uniqueKey>lily.key</uniqueKey>

This is the meaning of each of the built-in fields:

Field: Notes

lily.key: the unique identification of the Solr document, it is the combination of the Lily record id and the id of the version tag (the id of the field type of the version tag)

lily.id: the record id

lily.vtagId: the id of the version tag

lily.vtag: the name of the version tag (without namespace). For example, the string 'last'.

lily.version: the version of the record, thus the version the vtag pointed to at the time the record was indexed.

5.6 Solr Index Sharding

5.6.1 Introduction

When your index is too large to be managed by a single Solr instance on one node, then you can shard your index.

Index sharding is not the solution to handle high traffic in case of many users: then you should rather use Solr replication. The same holds for high availability: use replication rather than sharding.

To shard an index, you need to set up multiple Solr instances. Typically these should all have the same configuration (especially the schema.xml).

When you add an index to Lily, you can specify multiple Solr shards by specifying their URLs. You also give each shard a logical name (which does not have to be unique across indexes). For example you could name them “shard1”, “shard2”, and so on. See managing indexes.

Shards cannot be added or removed on the fly: if you decide you want more or fewer shards, you need to define a new index and re-index your content into that new index. Nonetheless, Lily allows changing the sharding configuration of existing indexes on the fly without complaining. When doing this, working indexers will be restarted to take the new configuration into account (a running index rebuilding job is unaffected). It is up to you to judge whether the changes you make are meaningful without rebuilding the index.

5.6.2 Shard selection

Index updates for a certain record should always go towards the same Solr shard. The decision of what shard to use for what record can only be based upon the record ID.

While it might be interesting to allow selecting a shard based on the value of a field of a record, this is difficult in case the record has been deleted. The value of the field on which the sharding is based should also never change its value, something which Lily does not help with, leaving more responsibility to the user.

Lily can use a default sharding strategy (based on a hash of the master record id, modulo the number of available shards), or you can customize it through a configuration specified when creating the index.
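The idea behind the default strategy can be sketched in a few lines. How Lily turns the MD5 digest into a number is not described here, so the conversion below (the first 8 bytes read as an unsigned big-endian integer) is an assumption, as is the function name. The sketch only illustrates that the same master record id always yields the same sharding key.

```python
import hashlib

def default_sharding_key(master_record_id, shard_count):
    """Hash the master record id and reduce it modulo the number of shards."""
    digest = hashlib.md5(master_record_id.encode("utf-8")).digest()
    # Assumption: interpret the first 8 bytes as an unsigned big-endian integer.
    as_long = int.from_bytes(digest[:8], "big")
    return as_long % shard_count

if __name__ == "__main__":
    print(default_sharding_key("USER.book-1", 2))
```

Since the key depends on the number of shards, changing the number of shards changes where records end up.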

5.6.2.1 Sharding configuration (shard selection configuration)

In many situations the default sharding behavior will suffice. You only need a custom configuration if you want to control which record goes to which shard, probably based upon a variant property.

Below you find the structure of the sharding configuration. It consists of two main parts: the definition of the value to shard on (the sharding key) and the mapping of this sharding key onto a shard.

{
  shardingKey: {
    value: {
      source: "recordId|masterRecordId|variantProperty",
      property: "prop name" /* only if source = variantProperty */
    },
    type: "long|string",
    hash: "md5", /* optional, only if you want the value to be hashed */
    modulus: 3 /* optional, only possible if type is long */
  },
 
  mapping: {
    type: "list|range",

    in case of list:

    entries: [
      { shard: "shard1", values: [0, 1, 2] }, /* values in array should be long or */
      { shard: "shard2", values: [3, 4, 5] }  /* string according to type */
    ]

    in case of range:

    entries: [
      { shard: "shard1", upTo: 1000 }, /* upTo value is exclusive */
      { shard: "shard2" } /* upTo is optional for last shard */
    ]
  }
}

The "shard1" and "shard2" are the logical shard names specified when creating the index.

Suppose you have a variant property "language" and want to shard based upon language, then you could use something like the following configuration:

{
  shardingKey: {
    value: {
      source: "variantProperty",
      property: "language"
    },
    type: "string"
  },

  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: ["en", "it"] },
      { shard: "shard2", values: ["nl", "de", "es"] }
    ]
  }
}
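The list mapping of this example can be sketched in Python as follows. This is only an illustration of the lookup: the function name select_shard_by_list is hypothetical, and returning None for an unmapped value is an assumption made here, since the behavior of Lily for values that are not listed is not described in this section.

```python
def select_shard_by_list(entries, key):
    """Return the shard whose values list contains key, or None if none does."""
    for entry in entries:
        if key in entry["values"]:
            return entry["shard"]
    return None  # assumption: unmapped values are simply reported as None here


entries = [
    {"shard": "shard1", "values": ["en", "it"]},
    {"shard": "shard2", "values": ["nl", "de", "es"]},
]
shard_for_dutch = select_shard_by_list(entries, "nl")
shard_for_english = select_shard_by_list(entries, "en")
print(shard_for_dutch, shard_for_english)
```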

5.6.3 Example usage

If you want to play a bit with multiple shards, here is how to get started. These instructions are only for playing on your local machine.

First set up multiple Solr instances. For example, using the launch-solr tool you know from Running Lily, you can do:

launch-solr -s schema.xml -p 8984
launch-solr -s schema.xml -p 8985

If you are starting Solr via its start.jar, make two copies of the Solr home dir and start with:

java -Djetty.port=8984 -Dsolr.solr.home=solr1 -Dsolr.data.dir=solr1/data -jar start.jar
java -Djetty.port=8985 -Dsolr.solr.home=solr2 -Dsolr.data.dir=solr2/data -jar start.jar

Create a two-sharded index without specifying a sharding configuration (here using the mbox-import sample):

lily-add-index \
  -n mail \
  -s shard1:http://localhost:8984/solr/,shard2:http://localhost:8985/solr/ \
  -c samples/mail/mail_indexerconf.xml

This will use the default sharding configuration, which is generated on the fly depending on the number of shards you have. For our situation here, the configuration will be like the following:

{
  shardingKey: {
    value: {
      source: "masterRecordId"
    },
    type: "long",
    hash: "md5",
    modulus: 2
  },

  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: [0] },
      { shard: "shard2", values: [1] }
    ]
  }
}

Suppose you save this configuration in a file called shardingconfig.json. You can then specify it as follows when creating the index:

lily-add-index \
  -n mail \
  -s shard1:http://localhost:8984/solr/,shard2:http://localhost:8985/solr/ \
  -c samples/mail/mail_indexerconf.xml \
  -p shardingconfig.json

Now you are ready to start creating records. If you keep an eye on the consoles of your Solr instances, you will see both of them being called.
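As a rough sketch of why both instances get called: the default strategy hashes the master record ID and takes the result modulo the number of shards, so records spread over all shards. How Lily turns the MD5 digest into a long is not described here, so using the first eight bytes of the digest is an assumption for illustration, and the function name default_shard is hypothetical.

```python
import hashlib


def default_shard(master_record_id, shard_names):
    """Pick a shard from the MD5 hash of the master record ID, modulo the shard count."""
    digest = hashlib.md5(master_record_id.encode("utf-8")).digest()
    # Assumption: interpret the first 8 bytes of the digest as an unsigned integer.
    key = int.from_bytes(digest[:8], "big")
    return shard_names[key % len(shard_names)]


shard_names = ["shard1", "shard2"]
used_shards = {default_shard(f"USER.record{i}", shard_names) for i in range(1000)}
print(sorted(used_shards))
```

The same record ID always maps to the same shard, which is what allows a later update or delete to reach the Solr instance that holds the document.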

5.7 Solr Versions

Lily is built against the client libraries of Solr [unresolved variable: solrVersion], and uses by default the javabin format (rather than XML) to communicate with Solr.

5.7.1 Using Solr 1.4(.1)

Since Solr's javabin format changed in incompatible ways, you have to configure Lily to use the XML format in case you want Lily to talk to Solr 1.4.

This is done by editing the configuration file conf/indexer/indexer.xml, and adjusting the value of the following two properties to the values shown here:

<requestWriter>org.apache.solr.client.solrj.request.RequestWriter</requestWriter>
<responseParser>org.apache.solr.client.solrj.impl.XMLResponseParser</responseParser>

This configuration change has to be done on each of the Lily nodes.

When using Lily Enterprise, be sure to edit the central template configuration and redeploy the configuration in order to apply it across the cluster.

5.8 Indexer Error Handling

5.8.1 Solr unreachable

When Solr, or one of the Solr instances when using sharding, is unreachable, the incremental indexers will block until that Solr instance becomes reachable again. The operation is retried at regular intervals; each time it fails, an error message is logged to the category org.lilyproject.indexer.solrconnection.

The following kinds of errors are all in category 'Solr unreachable':

To make administrators aware of this, the following metric is incremented: solrClient.{indexname}_{shardname}.retries. It is recommended that administrators are notified when this metric changes, especially if it keeps increasing for anything longer than a short time.

When these kinds of errors happen, the indexers retry until the indexing succeeds. This means that no index updates are lost (in contrast to, for example, simply logging the error and processing the next message, as happens for unexpected errors as explained below).
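The retry behavior described above can be sketched as follows. This is a simplified model, not the indexer's code: the names send_with_retry and flaky_sender are hypothetical, a plain dictionary stands in for the metrics system, and ConnectionError stands in for whatever errors count as 'Solr unreachable'.

```python
import time


def send_with_retry(send, update, metrics, retry_interval=0.0, sleep=time.sleep):
    """Keep retrying the same update until it succeeds, counting each failure."""
    while True:
        try:
            return send(update)
        except ConnectionError as error:
            # Each failed attempt is logged and counted, then retried after a pause.
            metrics["retries"] = metrics.get("retries", 0) + 1
            print(f"Solr unreachable, will retry: {error}")
            sleep(retry_interval)


def flaky_sender(failures):
    """Hypothetical sender that fails a fixed number of times before succeeding."""
    remaining = {"count": failures}

    def send(update):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("connection refused")
        return f"indexed {update}"

    return send


metrics = {}
result = send_with_retry(flaky_sender(2), "record-1", metrics)
print(result, metrics)
```

The update is never dropped: the loop only ends once the send succeeds, which is why the retries counter is the signal to watch.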

5.8.2 Solr misconfiguration

When there is an error in the Solr configuration, for example a missing field in the schema.xml, the error will be logged and the same metric as for generic unexpected errors will be incremented, see below.

The indexing of the record is therefore skipped, so the index will not be up to date.

These kinds of errors usually happen during project development. After the situation has been corrected, a batch index build can be performed to make sure the index is up to date.

5.8.3 Indexerconf misconfiguration

When using lily-add-index or lily-update-index, various checks are performed on the 'indexerconf.xml' configuration to minimize the possibility of runtime errors: the structure of the configuration is validated, as well as the existence of all referenced field types. These validations can however be skipped with the '--force' option.

If loading the configuration still fails for some reason, the message queue listener(s) that should perform the indexing will fail to start, and hence no indexing will be performed. An error will be logged. (At this time, no metric is incremented for this failure.)

Note that once they have failed to start, the MQ listeners (indexers) will not try to start again until the index definition changes. To trigger this without actually changing anything, use the 'lily-touch-index' command.

5.8.4 General indexer errors

The indexer gracefully handles all sorts of errors that are bound to happen, such as receiving a message for a record which has meanwhile been deleted, or link fields pointing to non-existing records. These kinds of problems are not logged.

When an unexpected error occurs, the error will be logged, and the following metric will be incremented: indexUpdater.{indexname}.errors. It is recommended that administrators are notified when there is a change in the value of this metric.

The indexing of the record is skipped, and the indexer continues with the next message from the message queue. So in such cases, the index might not be completely up to date.
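In contrast to the 'Solr unreachable' case, an unexpected error does not block the queue. A minimal sketch of that skip-and-continue behavior, with hypothetical names (process_queue, index_record) and a dictionary standing in for the indexUpdater.{indexname}.errors metric:

```python
def process_queue(messages, index_record, metrics):
    """Index each message; on an unexpected error, count it and move on."""
    indexed = []
    for message in messages:
        try:
            index_record(message)
            indexed.append(message)
        except Exception as error:
            # The failing record is skipped, so the index may lag behind for it.
            metrics["errors"] = metrics.get("errors", 0) + 1
            print(f"skipping {message}: {error}")
    return indexed


def index_record(message):
    # Hypothetical indexing step that fails for one particular record.
    if message == "record-2":
        raise ValueError("unexpected problem")


metrics = {}
indexed = process_queue(["record-1", "record-2", "record-3"], index_record, metrics)
print(indexed, metrics)
```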

5.9 Indexer Architecture

Here we briefly discuss the main components of the indexer. This can be helpful for a better understanding of how things work or as an introduction for people who want to dive into the source code.

5.9.1 The indexer model

This is a library that offers an API to query and modify the definition of the indexes. Other components that want access to the definition of the indexes always go through this library. This includes components within the Lily server (those discussed further on) as well as, for example, command line utilities such as lily-add-index.

Basically, the information managed by the indexer model is what you see when you execute lily-list-indexes.

When you want access to the definition of the indexes, you do not need to talk to one of the Lily nodes, but only need to make use of this library, which only needs access to ZooKeeper. This means the indexer model can also be manipulated while no Lily nodes are running.

The Lily nodes register change listeners on the indexer model to react dynamically as the model changes (this is implemented through ZooKeeper watchers).

All information about an index, including the indexer configuration, is stored within the data of one znode (ZooKeeper node). Storing it within one znode makes it easy to atomically modify it and watch it.

5.9.2 The indexer engine

The indexer engine contains two parts:

5.9.3 The indexer worker

The indexer worker is a component that runs on each Lily node and that registers one or more message queue listeners (whose implementation is provided by the indexer engine) for each index for which incremental indexing is enabled (update state: SUBSCRIBE_AND_LISTEN).

This happens dynamically: the message queue listeners are added or removed as indexes are added, removed, or when their update state changes.

In the future, we might add the possibility to enable or disable the indexer worker for selected Lily nodes. You could for example have some Lily nodes which are dedicated to indexing, and others which serve client CRUD requests. Let us know if you are interested in this.

5.9.4 The indexer master

The indexer master is a component which is active on only one of the Lily nodes, based on ZooKeeper-based leader election. If the Lily node on which it runs dies, another node will take over the role.

The tasks of the indexer master include:

All these tasks are very lightweight and hence should not have much influence on the Lily node on which the indexer master runs.

5.9.5 The batch build MapReduce job

The batch build MR job is a map task for the MapReduce programming model that takes as input the row keys of all records stored in HBase, and calls the indexer engine for each of these records.

It makes use of the HBase-provided MapReduce support, which means that the input will be split into as many parts as there are HBase regions in the records table.

There is no reduce part to this job, nor does the map task produce any output key-values. It simply calls Solr directly. This approach is used because it allows running the batch build concurrently with an ongoing incremental update of the index.

The map task does not talk to external Lily nodes to retrieve the records, but rather uses an embedded repository.

Since the map task spends time waiting on IO (as it reads records from HBase and sends them to Solr), it uses multiple threads to perform the indexing.
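The idea of overlapping IO waits with multiple threads can be sketched like this. It is a conceptual illustration in Python rather than the actual MapReduce job: index_split and fake_index_record are hypothetical names, and the thread count of 4 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def index_split(row_keys, index_record, threads=4):
    """Index all row keys of one input split using a small thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the result iterator so any exception surfaces here.
        list(pool.map(index_record, row_keys))


indexed = []
lock = Lock()


def fake_index_record(row_key):
    # Hypothetical stand-in for "read the record from HBase, send it to Solr".
    with lock:
        indexed.append(row_key)


index_split([f"row-{i}" for i in range(20)], fake_index_record)
print(len(indexed))
```

While one thread waits on a read or a Solr call, the others keep working, which is where the gain comes from for an IO-bound task.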

5.9.6 The link index

Conceptually, the link index is unrelated to the indexer, but as its main use is currently for the indexer we discuss it here too.

The link index is an index based on the hbaseindex library, a generic library (that is part of Lily) for creating HBase-based secondary indexes.

The index is maintained by a secondary action, that is, an action which is guaranteed to run after each update to a record. It is executed before the message related to this update is put onto the message queue, so it is guaranteed that the link index is updated before any indexer receives events about the related change (putting the message onto the message queue is itself also performed as a secondary action).

6 Tools

6.1 Import Tool

The import tool loads a JSON file describing field types, record types and records into Lily.

For basic usage options, execute

lily-import -h

6.1.1 The import JSON format

The JSON format is basically the same as that of Lily's REST interface, but allows for multiple field types, record types and records to be described within one JSON structure.

The general structure of the JSON import file is as follows:

{
  namespaces: {

  },
  fieldTypes: [

  ],
  recordTypes: [

  ],
  records: [

  ]
}

The import tool accepts relaxed JSON without quoted property names and with comments in /* ... */ format.

For the format of the field types, record types and records to be embedded within the arrays, we refer to the documentation of the REST Interface. The only difference is that the namespaces are declared once at the top, instead of repeating them within each individual object.

The order of the sections in the import file is important: first namespaces, then fieldTypes, then recordTypes, then records. This is because the file is processed in order, to avoid having to read it entirely into memory.
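If you generate import files programmatically, a quick order check before running the import can save a failed run. The following Python sketch is an assumption-laden helper, not part of Lily: the name sections_in_order is hypothetical, and it only handles strict JSON (quoted property names, no comments), not the relaxed format the import tool accepts.

```python
import json

EXPECTED_ORDER = ["namespaces", "fieldTypes", "recordTypes", "records"]


def sections_in_order(import_text):
    """Check that the top-level sections appear in the order the import tool expects."""
    # json.loads keeps the key order of the file in the resulting dict.
    sections = list(json.loads(import_text))
    positions = [EXPECTED_ORDER.index(s) for s in sections if s in EXPECTED_ORDER]
    return positions == sorted(positions)


good = '{"namespaces": {}, "fieldTypes": [], "recordTypes": [], "records": []}'
bad = '{"namespaces": {}, "records": [], "fieldTypes": []}'
print(sections_in_order(good), sections_in_order(bad))
```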

The import tool works in a "create or update" mode, basically the same as when you do a PUT in the REST-interface. For the field types and record types, the identification is always performed based on their name, for records based on their ID. For example if a recordType with the given name already exists, it will be updated (if necessary). If there would be some conflict (e.g. a field type with a different scope), an error will occur. If records do not specify an ID, then they will be recreated with a different ID upon each import.

Below is a sample import file describing a Person record type. For more examples see also the samples directory of the Lily distribution.

{
  namespaces: {
    "org.sample.person": "p"
  },
  fieldTypes: [
    {
      name: "p$name",
      valueType: "STRING",
      scope: "versioned"
    },
    {
      name: "p$birthDay",
      valueType: "DATE",
      scope: "non_versioned"
    }
  ],
  recordTypes: [
    {
      name: "p$Person",
      fields: [
        {name: "p$name", mandatory: true },
        {name: "p$birthDay", mandatory: true }
      ]
    }
  ],
  records: [
    {
      type: "p$Person",
      fields: {
        "p$name": "Anonymous Coward",
        "p$birthDay": "1978-10-13"
      }
    }
  ]
}

6.2 mbox Import Tool

6.2.1 About

The mbox import tool imports mbox mail archive files into Lily. This provides an easy way to load some 'real' content into Lily.

The import uses a simple model: for each mail message, one "Message" record is created, and for each part in the MIME message, a "Part" record is created. The content of each part is stored in a blob field of the Part records. The Message record only holds global fields like from, to and subject. The import tool currently handles all the parts equally, and does not attempt to select one as the main body of the mail.

+----------------------+                  +------------------+
|                      | 1              * |                  |
|       Message        |------------------|       Part       |
|                      |                  |                  |
+----------------------+                  +------------------+

Usage instructions are included within the mbox tool itself, execute:

lily-mbox-import -h

Below we run through the concrete steps to get it working, including indexing.

6.2.2 Mail usage run-through

6.2.2.1 Get some mbox files

One source of mbox files is the Apache mailing list archives, which can be found at:

http://{top level project}.apache.org/mail/{list name}

You can for example download them using curl:

curl -f http://hadoop.apache.org/mail/mapreduce-user/[2008-2010][01-12].gz -o "#1#2.gz"
curl -f http://cocoon.apache.org/mail/dev/[2000-2010][01-12].gz -o "#1#2.gz"
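The bracket ranges in these commands are expanded by curl itself, and "#1#2.gz" names each download after the matched year and month. For readers unfamiliar with this syntax, the Python sketch below lists the URL and file name pairs that the first command stands for; the function name expand_archive_urls is hypothetical.

```python
def expand_archive_urls(base_url, first_year, last_year):
    """Map each output file name (year + zero-padded month) to its archive URL."""
    downloads = {}
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            file_name = f"{year}{month:02d}.gz"
            downloads[file_name] = f"{base_url}/{file_name}"
    return downloads


downloads = expand_archive_urls(
    "http://hadoop.apache.org/mail/mapreduce-user", 2008, 2010
)
print(len(downloads))
```

Not every month necessarily has an archive; the -f option makes curl fail quietly on those instead of saving an error page.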

Other mbox sources:

6.2.2.2 Run HBase & Lily

As explained in the Running Lily guide, you can run a test HBase instance with the command below, or you can use your own HBase installation.

bin/launch-hadoop

Start the Lily server:

bin/lily-server

6.2.2.3 Create the schema

If you run the import tool with the -s option, it will just create the schema.

bin/lily-mbox-import -s

If you need to connect to a ZooKeeper different from 'localhost:2181', use the -z option to specify the connection string.

6.2.2.4 Run Solr and define an index

This step is optional and can be skipped.

A sample Solr schema configuration is provided in the file samples/mail/mail_solr_schema.xml

To run a test Solr instance with this configuration, use:

bin/launch-solr -s samples/mail/mail_solr_schema.xml

Now define an index that uses this Solr instance:

bin/lily-add-index -n mail -s shard1:http://localhost:8983/solr/ -c samples/mail/mail_indexerconf.xml

6.2.2.5 Run the import

You can import one file at a time or a complete directory. Files ending in ".gz" will be decompressed on the fly.

lily-mbox-import -f {file name or directory name}

Again, use -z to specify the ZooKeeper connection string:

lily-mbox-import -z localhost:2181 -f {file name or directory name}

6.3 Tester Tool

The tester tool runs a configurable scenario of CRUD operations against Lily.

It features the following:

Performance metrics are generated while the tester is running; run "tail -f Tester-metrics" to see them. These metrics can include Lily and HBase system metrics if you launch the tester with the -lm and -hm options. This requires that you have enabled JMX access for Lily and HBase on all nodes. For this, comment out the respective lines in hbase-env.sh and lily/service/wrapper.conf. Afterwards you can generate charts from these metrics using the lily-metrics-report tool.

If errors occur, they are logged to the file failures.log.

A default configuration can be generated by running:

lily-tester -d

Here's an example configuration file, config.json, which also contains explanations of the different configuration settings.

To run this configuration, execute:

lily-tester -c config.json

More usage information is available via:

lily-tester -h

7 REST (HTTP+JSON) API

7.1 REST Interface Tutorial

7.1.1 Abstract

This is a quick introduction to the REST interface. The full details are described in the reference documentation.

For demo purposes we will use the curl tool and assume the use of a Unix-like shell. The URIs used in the samples assume you have a Lily node running on localhost, listening on port 12060.

7.1.2 Creating a schema

Before we can create any records in Lily, we need to define our schema. For the purpose of this example, let's create two field types called name and price and combine them into a record type called product.

7.1.2.1 Creating the name field type

You can create the field type by entering (or copy-pasting) the following command on the shell. Since we end the first line with a quote, the shell will ask for more lines until we close the quote.

curl -XPOST localhost:12060/repository/schema/fieldType -H 'Content-Type: application/json' -d '
{
  action: "create",
  fieldType: {
    name: "n$name",
    valueType: "STRING",
    scope: "versioned",
    namespaces: { "my.demo": "n" }
  }
}' -D -

To create a field type we POST to the resource representing the collection of field types.

The server needs to know what kind of content we are submitting; this is specified using the -H option, which sets the Content-Type header.

The JSON we submit follows a structure that recurs in every use of the POST method: it is an object specifying an action and the actual object, here a fieldType. The REST interface is liberal in what it accepts: the submitted JSON does not need to have its property names quoted, even though the JSON specification requires this.

For the field type, we specify its essential properties: the name, the value type and the scope.

Names of field types are namespaced. Similar to XML, the namespace is not embedded directly into the name but associated with a prefix. So in this example the namespace is "my.demo" and the associated prefix is "n". In contrast to XML, the prefix and local name are not separated with a colon but rather with a dollar sign. The reason for this is that the same syntax is used in URIs, where the colon is a reserved character. This saves us from escaping it each time.

The namespace mapping is declared such that namespaces are mapped onto prefixes. It is done this way because when you read an entity (like a field type or a record), you are usually interested in finding out what prefix is used for a particular namespace, rather than the other way around. However, the map can be easily reversed, since each namespace occurs only once and is bound to a different prefix.

Finally, we specify the option "-D -" to dump the response headers to standard out. This is useful to see things like the status code and the Location header.

The response you get when executing the above command will be similar to this:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 13:43:12 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/schema/fieldType/n$name?ns.n=my.demo
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "04359728-6824-4e64-8e41-f7e496148c03",
  "name": "ns1$name",
  "scope": "versioned",
  "valueType": "STRING",
  "namespaces": {"my.demo": "ns1"}
}

The JSON will however not contain any whitespace and newlines, but rather appear as one long line. We added the whitespace here for readability.
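Note that the response uses the prefix ns1 where the request used n. This is harmless, because a prefix only has meaning together with the namespaces map that accompanies it. The Python sketch below, with the hypothetical helper resolve_name, reverses the map to turn a prefixed name into its namespace and local name, and shows that both forms denote the same field type.

```python
def resolve_name(prefixed_name, namespaces):
    """Split 'prefix$local' and look up the namespace bound to the prefix."""
    prefix, local_name = prefixed_name.split("$", 1)
    # The map goes from namespace to prefix, so reverse it for the lookup.
    prefix_to_namespace = {p: ns for ns, p in namespaces.items()}
    return prefix_to_namespace[prefix], local_name


submitted = resolve_name("n$name", {"my.demo": "n"})
returned = resolve_name("ns1$name", {"my.demo": "ns1"})
print(submitted, returned)
```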

The Location response header shows where the newly created field type can be retrieved from. You can try that as well:

curl -XGET http://localhost:12060/repository/schema/fieldType/n\$name?ns.n=my.demo

Next to the resource /repository/schema/fieldType, under which field types are addressed by name, there is also the resource /repository/schema/fieldTypeById, under which field types are addressed by ID. This resource behaves the same: we could as well have created the field type by POSTing to this resource. The only difference is that the Location header in the response would then be set to:

http://localhost:12060/repository/schema/fieldTypeById/04359728-6824-4e64-8e41-f7e496148c03

7.1.2.2 Creating the price field type

Just for illustration, we will create the price field type in a different way: using the PUT method. PUT will either update or create the field type, depending on whether it already exists.

The command is as follows:

curl -XPUT localhost:12060/repository/schema/fieldType/n\$price?ns.n=my.demo -H 'Content-Type: application/json' -d '
{
  name: "n$price",
  valueType: "DECIMAL",
  scope: "versioned",
  namespaces: { "my.demo": "n" }
}' -D -

Here you see how a namespaced name is represented in a URI: again using a prefix, which is mapped to a namespace using a request parameter starting with "ns.". The \ before the $ sign is only necessary here because $ has a special meaning in the shell.
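When building such URIs from code instead of the shell, no backslash is needed. A small Python sketch, in which the function name field_type_uri is hypothetical and the base address is the one assumed throughout this tutorial:

```python
from urllib.parse import quote, urlencode


def field_type_uri(base, prefix, local_name, namespace):
    """Build the by-name field type URI with its ns.{prefix} request parameter."""
    # Keep the dollar sign literal, as in the tutorial's URIs.
    name = quote(f"{prefix}${local_name}", safe="$")
    query = urlencode({f"ns.{prefix}": namespace})
    return f"{base}/repository/schema/fieldType/{name}?{query}"


uri = field_type_uri("http://localhost:12060", "n", "price", "my.demo")
print(uri)
```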

In contrast to when using POST, we now submit just the field type, without the wrapper object specifying an action.

The field type name occurs in both the URI and the submitted entity, which might leave you wondering which one will be used. The one specified in the URI is used to retrieve the existing field type, if any. The one specified in the body is used to update the name of the field type, or when creating the field type.

When you execute this curl command the first time, the response status will be "201 Created". If you execute it a second time, the status will be "200 OK" since the field type already exists. The field type will have been updated if necessary to correspond with the submitted JSON. So the PUT operation behaves as "create or update". In contrast, if you retry the POST operation that we used to create the name field type, it will respond with "409 Conflict". Hence, you would use POST if you want to avoid updating an existing field type.
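The difference between the two methods can be summarized with a toy in-memory model. This is not how the server is implemented; the dictionary store and the two function names are hypothetical and only mirror the status codes described above.

```python
def post_create(store, name, entity):
    """POST semantics: create only, refuse if the name already exists."""
    if name in store:
        return 409  # Conflict
    store[name] = entity
    return 201  # Created


def put_create_or_update(store, name, entity):
    """PUT semantics: create if missing, otherwise update."""
    status = 200 if name in store else 201
    store[name] = entity
    return status


store = {}
first_put = put_create_or_update(store, "n$price", {"valueType": "DECIMAL"})
second_put = put_create_or_update(store, "n$price", {"valueType": "DECIMAL"})
first_post = post_create(store, "n$name", {"valueType": "STRING"})
second_post = post_create(store, "n$name", {"valueType": "STRING"})
print(first_put, second_put, first_post, second_post)
```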

7.1.2.3 Creating the product record type

Creating a record type is done in the same way as a field type. You again have the choice between using POST (if you want to be sure to be creating something) or PUT (if you want to either update or create the record type).

The submitted JSON format is of course a bit different: for a record type we specify the list of fields it should contain.

The command is as follows:

curl -XPOST localhost:12060/repository/schema/recordTypeById -H 'Content-Type: application/json' -d '
{
  action: "create",
  recordType: {  
    name: "n$product",
    fields: [
      { name: "n$name", mandatory: true},
      { name: "n$price", mandatory: true}
    ],
    namespaces: { "my.demo": "n" }
  }
}' -D -

Just for illustration, this time we posted to the 'ById' resource.

The response is:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 14:03:20 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac",
  "name": "ns1$product",
  "fields": [
    {"id": "0d096b72-826b-481b-970a-0097e987b066", "mandatory": true},
    {"id": "04359728-6824-4e64-8e41-f7e496148c03", "mandatory": true}],
  "version": 1,
  "mixins":[],
  "namespaces": {"my.demo": "ns1"}
}

Again, you can retrieve this record type using the URI found in the Location header:

curl -XGET http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac

Record types are versioned. Specific versions can be retrieved as follows:

curl -XGET http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac/version/1

7.1.3 Creating records

7.1.3.1 Create record using POST, server assigns record ID

Let's create a product record:

curl -XPOST localhost:12060/repository/record -H 'Content-Type: application/json' -d '
{
  action : "create",
  record: {
    type: "n$product",
    fields: {
      n$name: "Bread",
      n$price: 2.11
    },
    namespaces: { "my.demo": "n" }
  }
}' -D -

The response is:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 14:05:33 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/record/UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0",
  "version": 1,
  "type": {"name": "ns1$product", "version": 1},
  "versionedType": {"name": "ns1$product", "version": 1},
  "fields": {
    "ns1$name": "Bread",
    "ns1$price": 2.11
  },
  "namespaces": {"my.demo": "ns1"}
}

The response JSON is more extensive than what we submitted:

7.1.3.2 Creating a record using PUT, assigning the record ID yourself

Another way to create a record is to PUT to the resource /repository/record/{id}. This is different from the example with POST above in two important ways:

The PUT method has the advantage that if an update would fail because of some IO error (a network problem, or when the Lily node died while handling the request), you can simply retry the operation. The end result will be the same: there will be a record in the repository with the given ID and with the specified field values.

So the first thing we now have to do is decide on a record ID. Lily lets you either invent your own custom record ID, which can be an arbitrary string, or use a UUID. To use a custom record ID, simply use a string of the form "USER.something". To use a UUID, the string should be of the form "UUID.{valid uuid string following rfc 4122}".

For this example, let's use a UUID. In Linux, you can generate one with the command uuidgen:

$ uuidgen -r
a7166289-eb7a-4715-8c8e-3c997d752926
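If you create records from a program rather than from the shell, both ID forms are easy to build. A minimal Python sketch, using the standard uuid module to generate a random (version 4) UUID as uuidgen -r does; the two function names are hypothetical:

```python
import uuid


def uuid_record_id():
    """Return a record ID of the form UUID.{random RFC 4122 UUID}."""
    return f"UUID.{uuid.uuid4()}"


def user_record_id(name):
    """Return a custom record ID of the form USER.{name}."""
    return f"USER.{name}"


generated_id = uuid_record_id()
custom_id = user_record_id("my_article")
print(generated_id, custom_id)
```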

Now let's PUT our record:

curl -XPUT localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926 -H 'Content-Type: application/json' -d '
{
  type: "n$product",
  fields: {
    n$name: "Butter",
    n$price: 4.25
  },
  namespaces: { "my.demo": "n" }
}' -D - 

The response is as before:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Fri, 29 Apr 2011 08:30:05 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
  "version": 1,
  "type": {"name": "ns1$product", "version": 1},
  "versionedType": {"name": "ns1$product", "version": 1},
  "fields": {
    "ns1$name": "Butter",
    "ns1$price": 4.25
  },
  "namespaces": {"my.demo": "ns1"}
}

7.1.4 Reading records

Reading a record is very simple: just perform a GET operation on its URL. The following URL was simply copied from the 'Location' header in the response of the previous example:

curl http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926 | json_reformat

You can of course also use your web browser to view the record at this URL.

The GET operation supports a request parameter schema=true to include the schema information of each of the requested fields. This can be useful for generic applications that have no baked-in knowledge about the field types:

curl http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926?schema=true | json_reformat

This gives:

{
  "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
  "version": 1,
  "type": {
    "name": "ns1$product",
    "version": 1
  },
  "versionedType": {
    "name": "ns1$product",
    "version": 1
  },
  "fields": {
    "ns1$name": "Butter",
    "ns1$price": 4.25
  },
  "schema": {
    "ns1$name": {
      "id": "2f03a71a-1c94-4005-b56a-12db8d58c1e6",
      "scope": "versioned",
      "valueType": "STRING"
    },
    "ns1$price": {
      "id": "f2c4dedb-145a-4a7e-9580-980cf07c5928",
      "scope": "versioned",
      "valueType": "DECIMAL"
    }
  },
  "namespaces": {
    "my.demo": "ns1"
  }
}

If you are only interested in a subset of the fields of a record, you can specify the fields to return with a request parameter called fields. So suppose we only want the name:

curl 'http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926?fields=n$name&ns.n=my.demo' | json_reformat

The URL has been put between quotes so the shell would ignore the special characters like $.

Other things you can do are retrieving a specific version of a record, retrieving the list of versions, etc. All this should be straightforward: see the reference.

7.1.5 Creating a record with a blob field

Something which might take a bit more time to figure out is how to create a record with a blob field.

Before we can do this, we need a blob field type and a record type containing this field. For the purpose of this sample, we will create a field called data and a record type called file.

The command to create the data field type is:

curl -XPOST localhost:12060/repository/schema/fieldType -H 'Content-Type: application/json' -d '
{
  action : "create",
  fieldType: {
    name: "n$data",
    valueType: "BLOB",
    scope: "versioned",
    namespaces: { "my.demo": "n" }
  }
}' -D -

The command to create the file record type is:

curl -XPOST localhost:12060/repository/schema/recordType -H 'Content-Type: application/json' -d '
{
  action : "create",
  recordType: {  
    name: "n$file",
    fields: [
      { name: "n$data", mandatory: true}
    ],
    namespaces: { "my.demo": "n" }
  }
}' -D -

Creating a record with a blob field happens in two steps:

  1. upload the blob(s)
  2. create the record with a reference to the blob

A blob is uploaded by POSTing it to the /repository/blob resource. It is required to specify the Content-Length header, which curl does automatically for you, and the Content-Type header. In the following command, I am uploading a file which I had lingering on my disk: zookeeper-3.3.1.tar.gz:

curl -XPOST localhost:12060/repository/blob --data-binary @zookeeper-3.3.1.tar.gz -H 'Content-Type: application/x-gzip' -D -

In response, this gives some JSON:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Date: Mon, 30 Aug 2010 14:48:15 GMT
Accept-Ranges: bytes
Server: Restlet-Framework/2.0snapshot
Content-Length: 91

{
  "value": "AAAAA0RGU07g2WL00E0Ol7Fg3xroOWo",
  "mimeType": "application/x-gzip",
  "size": 10279804
}

It is exactly this piece of JSON that you need to use as the value for the record field, as follows:

curl -XPOST localhost:12060/repository/record -H 'Content-Type: application/json' -d '
{
  action : "create",
  record: {
    type: "n$file",
    fields: {
      n$data: {
        "value": "AAAAA0RGU07g2WL00E0Ol7Fg3xroOWo",
        "mimeType": "application/x-gzip",
        "size": 10279804
      }
    },
    namespaces: { "my.demo": "n" }
  }
}' -D -

That's it: we have created a record with a blob field.
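In a client program, the second step amounts to parsing the blob upload response and embedding it unchanged as the field value. A Python sketch of that step, without any HTTP calls; the function name record_with_blob is hypothetical and the response text is the one shown earlier:

```python
import json


def record_with_blob(record_type, field_name, blob_response, namespaces):
    """Build the create-record request body from a blob upload response."""
    blob_value = json.loads(blob_response)
    return {
        "action": "create",
        "record": {
            "type": record_type,
            # The blob JSON is used as-is as the value of the blob field.
            "fields": {field_name: blob_value},
            "namespaces": namespaces,
        },
    }


blob_response = (
    '{"value": "AAAAA0RGU07g2WL00E0Ol7Fg3xroOWo", '
    '"mimeType": "application/x-gzip", "size": 10279804}'
)
request_body = record_with_blob("n$file", "n$data", blob_response, {"my.demo": "n"})
print(json.dumps(request_body, indent=2))
```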

To download the blob, you can access it via the resource /repository/record/{id}/field/{name}/data, in this case (you can find the record ID in the output of the previous command):

curl localhost:12060/repository/record/UUID.1248605b-0c8d-40a9-a684-7c01c94a5c0c/field/n\$data/data?ns.n=my.demo --output download.tar.gz

7.1.6 Creating a record with a complex field

Sometimes you might want to store a more complex value in a field: not a simple value like a string, but a value which is itself composed of multiple fields. In Lily this is possible by creating fields of type RECORD. These are fields in which you can put Record objects. These are not real records with their own identity; the top-level Record data structure is simply re-used as the value of a field of another record. Since any record object can have fields which by themselves can again contain records (or lists of records), this allows for modeling arbitrarily complex structures.

Before you use complex fields, you should always ask yourself whether you want complex fields or rather link fields (which contain pointers to other records). Both enable you to store the same kinds of nested/complex structures. In the case of complex fields, the nested structures (nested records) are all stored within one record, so they don't have their own identity and are hence not separately retrievable or indexable. Link fields pointing to other records give each part of the nested structure its own identity, but at the cost of having to create/read multiple records, and losing the atomicity of the create operation.

Since complex fields are modeled in Lily by creating field types whose value type is RECORD, they are also called record-type fields.

In the following example, we will create articles which have authors. Each author has a name and email attribute. For the sake of this example, we are going to store the authors within the article, in a complex field. So there will be no re-use of the same author records across articles.

For this example, we will create the schema using the import tool. Save the following in a file called schema.json:

{
  namespaces: {
    "article": "a"
  },
  fieldTypes: [
    { name: "a$name", valueType: "STRING" },
    { name: "a$email", valueType: "STRING" },
    { name: "a$title", valueType: "STRING" },
    { name: "a$authors", valueType: "LIST<RECORD<{article}author>>" },
    { name: "a$body", valueType: "STRING" }
  ],
  recordTypes: [
    {
      name: "a$author",
      fields: [
        {name: "a$name", mandatory: true },
        {name: "a$email", mandatory: true }
      ]
    },
    {
      name: "a$article",
      fields: [
        {name: "a$title", mandatory: true },
        {name: "a$authors", mandatory: true },
        {name: "a$body", mandatory: true }
      ]
    }
  ]
}

And then import it using:

lily-import schema.json

Now we can create an article, with authors nested in it, as follows:

curl -XPUT localhost:12060/repository/record/USER.my_article -H 'Content-Type: application/json' -d '
{
  type: "a$article",
  fields: {
    a$title: "Title of the article",
    a$authors: [
      {
        type: "a$author",
        fields: {
          a$name: "Author X",
          a$email: "author_x@authors.com"
        }
      },
      {
        type: "a$author",
        fields: {
          a$name: "Author Y",
          a$email: "author_y@authors.com"
        }
      }
    ],
    a$body: "Body text of the article"
  },
  namespaces: { "article": "a" }
}' -D - 

The authors field contains a list in which each entry again follows the same structure as for a top-level record: you specify its type and its fields.

7.1.7 Scanning Over Records

Scanners let you run sequentially over all or part of the records stored in the repository. For an introduction to scanners, see Scanning Records And Record Locality.

To start, you need to create a scanner, giving the parameters for the scanner in the body:

curl -XPOST localhost:12060/repository/scan -H 'Content-Type: application/json' -d '
{
  recordFilter: {
    "@class": "org.lilyproject.repository.api.filter.RecordTypeFilter",
    recordType: "{my.demo}product"
  }
}' -D -

In the above example, we run over all records and use a filter so that only records of the desired type are returned. If you just want to run over all records, post an empty JSON object, { }. Many other options are available; see JSON Formats.

The response will contain the URL of the created scanner in the Location header:

HTTP/1.1 201 Created
Content-Length: 0
Content-Type: application/octet-stream; charset=UTF-8
Date: Thu, 15 Mar 2012 13:54:43 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/scan/7832142591753684320

We can now query this scanner to return the next record(s). By default just one record is returned; use the batch parameter to retrieve multiple records.

curl 'http://localhost:12060/repository/scan/7832142591753684320?batch=10' | json_reformat

This gives our two products:

{
  "results": [
    {
      "id": "UUID.0a6e8ca6-ab06-4c7e-bcc5-33c1f048a4d9",
      "version": 1,
      "type": {
        "name": "ns1$product",
        "version": 1
      },
      "versionedType": {
        "name": "ns1$product",
        "version": 1
      },
      "fields": {
        "ns1$name": "Bread",
        "ns1$price": 2.11
      },
      "namespaces": {
        "my.demo": "ns1"
      }
    },
    {
      "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
      "version": 1,
      "type": {
        "name": "ns1$product",
        "version": 1
      },
      "versionedType": {
        "name": "ns1$product",
        "version": 1
      },
      "fields": {
        "ns1$name": "Butter",
        "ns1$price": 4.25
      },
      "namespaces": {
        "my.demo": "ns1"
      }
    }
  ]
}

You can repeatedly call GET on the scanner resource, until the scanner has reached the end. At that point, it will respond with '204 No Content':

curl --dump-header - 'http://localhost:12060/repository/scan/7832142591753684320'
HTTP/1.1 204 No Content
Content-Type: application/json; charset=UTF-8
Date: ...

When done with the scanner, delete it to free up the resources:

curl -XDELETE 'http://localhost:12060/repository/scan/7832142591753684320'

Scanners only live in the server where you created them. So all requests related to a single scanner should go to the same Lily server.
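The request sequence described in this section (GET until '204 No Content', then DELETE) can be sketched in Java as follows. This is only an illustration of the control flow: the ScanLoopSketch class, the Page holder and the ScannerResource interface are hypothetical names introduced here, not part of Lily. The actual HTTP calls are left to whatever client you bind to the interface.

```java
import java.util.ArrayList;
import java.util.List;

class ScanLoopSketch {
    /** Outcome of one GET on the scanner resource: HTTP status and JSON body. */
    static final class Page {
        final int status;
        final String body;

        Page(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Hypothetical transport abstraction; bind it to the HTTP client of your choice. */
    interface ScannerResource {
        Page next(int batch); // GET {scanner-url}?batch=N

        void delete();        // DELETE {scanner-url}
    }

    /** Collects the JSON bodies of all batches, then deletes the scanner. */
    static List<String> drain(ScannerResource scanner, int batch) {
        List<String> pages = new ArrayList<String>();
        try {
            while (true) {
                Page page = scanner.next(batch);
                if (page.status == 204) {
                    break; // 204 No Content: the scanner has reached the end
                }
                if (page.status < 200 || page.status >= 300) {
                    throw new IllegalStateException("Unexpected status " + page.status);
                }
                pages.add(page.body);
            }
        } finally {
            scanner.delete(); // always free the server-side resources
        }
        return pages;
    }
}
```

Since a scanner only lives in the server where it was created, an implementation of ScannerResource should send every request to the host named in the Location header.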

7.2 REST API Reference

7.2.1 About the REST interface

For an introduction to the REST interface, see REST Interface: Getting Started.

The REST API reference documentation consists of two parts: JSON Formats and REST Protocol.

7.2.2 JSON Formats

7.2.2.1 About JSON

Lily's REST interface is liberal in the JSON it accepts: it supports unquoted property names and comments.

7.2.2.2 Content-Type

The REST interface supports only JSON as content type. Requests that submit JSON should have a header “Content-Type: application/json”.

7.2.2.3 Namespaces

7.2.2.3.1 Namespaced names

You have two options for specifying namespaced names: either you specify them in full, or you use a namespace prefix.

7.2.2.3.1.1 Specify namespaced names in full

To specify the name in full, use the following syntax:

{namespace}name

Thus, the namespace is specified between curly braces, followed by the name.

7.2.2.3.1.2 Use prefixes

Alternatively, for shorter typing when your namespaces are long, you can bind them to prefixes.

The syntax for the names then becomes:

prefix$name

Thus a prefix, followed by a dollar sign, followed by the non-namespaced name.

The prefix can be freely chosen, and is bound to the actual namespace as described next.

The prefixes used in entities retrieved from Lily will usually be different from those you use when submitting entities: Lily does not remember the prefixes, only the namespaces.

7.2.2.3.1.3 Declaring namespaces

In each format, a property called namespaces can be present containing namespace declarations.

The format for namespaces is as follows:

{
  "namespace1": "prefix1",
  "namespace2": "prefix2"
}

Since the namespace is used as the key, each namespace can be mapped to just one prefix. This makes it easier to read e.g. fields in a record: just find out what the prefix for the namespace is, and use that to retrieve the name. This would be more complicated in case different prefixes could map to the same namespace.

Obviously, each namespace should be mapped to a different prefix.
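To make the relation between the two notations concrete, here is a minimal Java sketch that expands a prefix$name into the {namespace}name form, using a declaration map in the orientation described above (namespace as key, prefix as value). The QNameSketch class and its expand method are hypothetical helpers written for this illustration; they are not part of the Lily API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QNameSketch {
    /**
     * Expands "prefix$name" to "{namespace}name". The map uses the namespace
     * as key and the prefix as value. Names that are already in the
     * "{namespace}name" form are returned unchanged.
     */
    static String expand(String name, Map<String, String> namespaces) {
        if (name.startsWith("{")) {
            return name;
        }
        int dollar = name.indexOf('$');
        if (dollar < 0) {
            throw new IllegalArgumentException("Not a namespaced name: " + name);
        }
        String prefix = name.substring(0, dollar);
        String localName = name.substring(dollar + 1);
        // Look up the namespace that was bound to this prefix.
        for (Map.Entry<String, String> entry : namespaces.entrySet()) {
            if (entry.getValue().equals(prefix)) {
                return "{" + entry.getKey() + "}" + localName;
            }
        }
        throw new IllegalArgumentException("Undeclared prefix: " + prefix);
    }

    public static void main(String[] args) {
        Map<String, String> namespaces = new LinkedHashMap<String, String>();
        namespaces.put("my.demo", "n");
        System.out.println(expand("n$data", namespaces)); // prints {my.demo}data
    }
}
```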

7.2.2.4 Field type format

{
  id: "string", [not required upon submit]
  name: "prefix$name",
  valueType: "STRING|INTEGER|LONG|...",
  scope: "versioned|non_versioned|versioned_mutable", [default = non_versioned],
  namespaces: { ... }
}

The full list of available value types can be found in the section on the record format.

7.2.2.5 Record type format

{
  id: "string", [not required upon submit]
  name: "prefix$name",
  version: long,
  fields: [
    {
      id: "string",    [upon submit, you can specify either id or name]
      name: "prefix$name", [not present upon retrieval]
      mandatory: true|false [default = false]
    }
  ],
  mixins: [
    {
      id: "string", [upon submit, you can specify either id or name]
      name: "prefix$name", [not present upon retrieval]
      version: long
    }
  ],
  namespaces: { ... }
}

7.2.2.6 Record format

{
  id: "string",
  type: "prefix$name" or { name: "prefix$name", version: long},
  versionedType: {name: "prefix$name", version: long}, [only when applies, ignored upon submit]
  versionedMutableType: {name: "prefix$name", version: long}, [only when applies, ignored upon submit] 
  version: long, [only when the record has versions, ignored upon submit]
  fields: {
    "prefix$name": value [format for the value: described below]
  },
  fieldsToDelete: [ 'prefix$name', ...],
  schema: { [ ignored upon submit]
    "prefix$name": { field type json }
  },
  namespaces: { ... }
}

This format needs some further explanation.

7.2.2.6.1 The record ID

The record ID can be either a UUID-based ID (UUID.something) or a user-specified ID (USER.something).

Besides the core ID, the record ID can also contain variant properties. The format for these is described in the Javadoc of RecordId.toString().

7.2.2.6.2 Formatting of value types

The following table shows the names of the value types, and what JSON type should be used for their values.

| Value type name | JSON type | Example / details |
|---|---|---|
| STRING | string | |
| INTEGER | number | |
| LONG | number | |
| DOUBLE | number | |
| DECIMAL | number | |
| BOOLEAN | boolean | |
| DATE | string | |
| DATETIME | string | |
| URI | string | The string should be acceptable by the constructor of the java.net.URI class. |
| LINK | string | A Lily record ID, as obtained by, and described in the Javadoc of, RecordId.toString(). The LINK type can optionally be qualified with a record type name: LINK<{namespace}name> |
| BLOB | object | A JavaScript object with the following properties: size (integer number), mimeType (string), value (as returned in the response when creating the blob, POST on /repository/blob), name (string, optional) |
| LIST<sometype> | array | An array of values, which can in turn be arrays in case the nested value type is a LIST or PATH. The LIST type needs to be qualified with the kind of types in the list, for example: LIST<STRING> or LIST<LINK>. |
| PATH<sometype> | array | An array of values, which can in turn be arrays in case the nested value type is a LIST or PATH. The PATH type needs to be qualified with the kind of types in the path, for example: PATH<STRING> |
| RECORD, RECORD<recordtypename> | record json | JSON representation of a record, though some properties will be ignored: version, versionedType, versionedMutableType, fieldsToDelete. The RECORD type can optionally be qualified with a record type name: RECORD<{namespace}name> |
| BYTEARRAY | base64 | Base64 encoded representation of the bytes |

7.2.2.7 List format

Some resources return a list of records, record types or field types.

The format for these is:

{
  results: [
    { json of a record, record type or field type },
    ...
  ]
}

7.2.2.8 POST format

For POST requests, we use a generic format which is as follows:

{
  action: "update|create|...",
  entityName: { JSON object for the kind of entity}
}

in which entityName is one of: fieldType, recordType, record.

7.2.2.9 Record Scan Format

For a general introduction to scanners, see Scanning Records And Record Locality. Scanners let you run sequentially over all or part of the records stored in the repository. A scanner can efficiently jump to the specified startRecordId, but from there it accesses each record sequentially until the scan stops at the stopRecordId, or a filter indicates that scanning should stop, or the end of the table is reached. The filtering ability of scanners is not based on indexes: when you specify a filter, the scan still runs over each record and evaluates the filter for it.

The syntax:

{
  startRecordId: "UUID.something or USER.something",
  stopRecordId: "...",
  rawStartRecordId: "...",
  rawStopRecordId: "...",
  recordFilter: { /* see syntax below */ },
  returnFields: { /* see syntax below */ },
  caching: integer,
  cacheBlocks: true|false
}

All properties are optional: if none are specified, the scan runs over all records (and without caching, which is off by default).

Some further explanation of the properties:

returnFields

The syntax for the returnFields property is:

{
  type: "NONE|ALL|ENUM",
  fields: [ "field qname" ]
}

The fields property is only relevant when the type is ENUM.

7.2.2.10 Filter Format

7.2.2.10.1 General

Each filter contains an attribute @class identifying the type of filter. The other properties are filter dependent.

{
  "@class": "..."
}

7.2.2.10.2 Record Type Filter

Only lets through records of the given record type.

{
  "@class": "org.lilyproject.repository.api.filter.RecordTypeFilter",
  recordType: "record type qname",
  version: integer (optional)
}

7.2.2.10.3 Field Value Filter

Only lets through records for which the given field equals (or does not equal) the given value.

{ 
  "@class": "org.lilyproject.repository.api.filter.FieldValueFilter",
  field: "field qname",
  fieldValue: ...,
  compareOp: "EQUAL|NOT_EQUAL",
  filterIfMissing: true|false
}

The field value should be specified in the same syntax as used in records.

filterIfMissing: if false (default is true), and the record does not have the field, it will let the record through.

7.2.2.10.4 Record ID Prefix Filter

Only lets through records whose ID starts with the given record ID. For example, specifying "USER.a" will let through "USER.afoo" and "USER.abar" but not "USER.b". This filter causes the scanning process to stop as soon as a key is encountered which is larger than the given prefix, since no further record could then again be a match.

When using this filter, you will usually set the startRecordId of the scan to the same record ID.

{
  "@class": "org.lilyproject.repository.api.filter.RecordIdPrefixFilter",
  recordId: "..."
}

7.2.2.10.5 Filter List

Combines multiple filters. A filter list can itself contain another filter list, which allows arbitrary hierarchies of filters to be created.

{
  "@class": "org.lilyproject.repository.api.filter.RecordFilterList",
  operator: "MUST_PASS_ALL|MUST_PASS_ONE",
  filters: [ ]
}

7.2.3 REST Protocol

7.2.3.1 Nodes / connecting / load balancing

The REST interface is exposed by each individual Lily node. Ideally clients should load-balance their requests over the set of available Lily nodes. Right now, Lily does not offer a standard solution for this.

The port to which the REST interface listens is configured in conf/kauri/connectors.xml.

7.2.3.2 Error responses

Whether a request was successful or not can be detected through the HTTP status code: all status codes in the 2xx range indicate success.

Failures can be due to the client (e.g. a syntax error in the URI or the submitted JSON), or can be due to failures in Lily itself.

The most used error responses are:

400 Bad Request

404 Not Found

500 Internal Server Error

For most errors, the entity is a JSON with the following format (look at the Content-Type header of the response).

{
  status: long, [the HTTP status code repeated]
  description: "description of the status code",
  causes: [
    {
      message: "string",
      type: "fully qualified java class name"
    }
  ],
  stackTrace: "complete java stack trace"
}

The causes array contains the message and type of the exception that happened, and of all its causes.

Here is a sample error response. The request tried to submit a record containing a non-defined field name. The stack trace has been snipped for the most part.

{
  "status": 500,
  "description": "Internal Server Error",
  "causes": [
    {
      "message": "Error reading submitted JSON.",
      "type": "org.lilyproject.rest.ResourceException"
    },
    {
      "message": "FieldType '{my.demo}someNonExistingField' could not be found.",
      "type": "org.lilyproject.repository.api.FieldTypeNotFoundException"
    }
  ],
  "stackTrace": "org.lilyproject.re  [snipped] odyReader.java:63)\n\t... 65 more\n"
}

For some errors we currently have no control over the formatting, and the response will depend on the framework. See issue 104.

7.2.3.3 Method tunneling

Some HTTP clients are not able to perform methods like PUT and DELETE. In such cases, you can tunnel these methods over the POST method. There are two ways to do this.

Use a request header X-HTTP-Method-Override

With curl you would do it like this:

curl -XPOST -H 'X-HTTP-Method-Override: PUT' ...

Use a request parameter 'method'

Example:

curl -XPOST 'localhost:12060/repository/record/USER.foobar?method=PUT'
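For Java clients, the header-based variant could look roughly like the sketch below, which uses only the JDK's HttpURLConnection. The MethodTunnelSketch class and its tunnel method are hypothetical names for this illustration, and the host and port in the usage line are assumptions about your installation. The method only prepares the request; nothing is sent until you write the body or read the response.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class MethodTunnelSketch {
    /**
     * Prepares (but does not send) a POST request that asks the server to
     * treat it as the given method, via the X-HTTP-Method-Override header.
     */
    static HttpURLConnection tunnel(String url, String realMethod) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("X-HTTP-Method-Override", realMethod);
            return conn;
        } catch (IOException e) {
            throw new IllegalArgumentException("Cannot prepare request for " + url, e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn =
                tunnel("http://localhost:12060/repository/record/USER.foobar", "PUT");
        System.out.println(conn.getRequestMethod() + " with override "
                + conn.getRequestProperty("X-HTTP-Method-Override"));
    }
}
```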

7.2.3.4 Resources for field types

7.2.3.4.1 /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace
7.2.3.4.1.1 GET

Gets a field type by name.

As a reminder, the name should be a namespaced name and the namespace should be bound to a prefix declared in a request parameter. Example:

http://myhost/repository/schema/fieldType/p$title?ns.p=my.namespace

If the namespace is a URL, it must be properly escaped (URL-encoded).
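As an illustration of the escaping, the following Java sketch builds such a URI with the JDK's URLEncoder. The FieldTypeUrlSketch class, its fieldTypeUrl method and the example namespace http://example.org/ns are made up for this example. Note that URLEncoder applies form encoding (a space becomes '+'), which is assumed to be acceptable for the query string here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FieldTypeUrlSketch {
    /** Builds the field type URI, escaping the namespace in the ns.prefix parameter. */
    static String fieldTypeUrl(String baseUrl, String prefix, String name, String namespace) {
        try {
            return baseUrl + "/repository/schema/fieldType/" + prefix + "$" + name
                    + "?ns." + prefix + "=" + URLEncoder.encode(namespace, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fieldTypeUrl("http://myhost", "p", "title", "http://example.org/ns"));
        // prints http://myhost/repository/schema/fieldType/p$title?ns.p=http%3A%2F%2Fexample.org%2Fns
    }
}
```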

7.2.3.4.1.2 PUT

Create or update a field type.

The field type name specified in the URI is used to determine what field type to update (if it already exists). After update, the name of the field type will be changed to what is in the submitted JSON (if it is different).

In case of a created or a renamed field type, the response Location header will point to /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace.

In case a field type is renamed, the response will be "301 Moved Permanently" rather than "200 OK".

The only property of a field type that you can update is its name. If you try to change other properties such as the value type or the scope, you will get a response status of 409 Conflict.

7.2.3.4.2 /repository/schema/fieldTypeById/{id}
7.2.3.4.2.1 GET

Gets a field type by ID.

7.2.3.4.2.2 PUT

Update a field type. You cannot create a field type this way, since the ID is assigned by the system.

If you try to update immutable properties, you will get a 409 Conflict response.

7.2.3.4.3 /repository/schema/fieldType
7.2.3.4.3.1 GET

Get the list of all field types. The returned entity is in the list format.

7.2.3.4.3.2 POST

Creates a new field type. The advantage of this method (over PUT on /repository/schema/fieldType/{name}) is that you are sure you are performing a create, not an update.

The posted entity should be a field type embedded in the following structure:

{
  action: "create",
  fieldType: {}
}

The Location header in the response will point to /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace.

7.2.3.4.4 /repository/schema/fieldTypeById
7.2.3.4.4.1 GET

Same as for /repository/schema/fieldType

7.2.3.4.4.2 POST

Same as for /repository/schema/fieldType.

The Location header in the response will point to /repository/schema/fieldTypeById/{id}.

7.2.3.5 Resources for record types

7.2.3.5.1 /repository/schema/recordType/{prefix$name}?ns.prefix=namespace
7.2.3.5.1.1 GET

Gets a record type by its name, returns the latest version of the record type.

7.2.3.5.1.2 PUT

Creates or updates a record type.

The record type name specified in the URI is used to determine what record type to update (if it already exists). After update, the name of the record type will be changed to what is in the submitted JSON (if it is different).

In case of a created or a renamed record type, the response Location header will point to /repository/schema/recordType/{prefix$name}?ns.prefix=namespace.

In case a record type is renamed, the response will be "301 Moved Permanently" rather than "200 OK".

7.2.3.5.2 /repository/schema/recordTypeById/{id}
7.2.3.5.2.1 GET

Gets a record type by its ID, returns the latest version of the record type.

7.2.3.5.2.2 PUT

Update a record type. You cannot create a record type this way, since the ID is assigned by the system.

7.2.3.5.3 /repository/schema/recordType
7.2.3.5.3.1 GET

Gets the list of all record types. The returned entity is in the list format.

7.2.3.5.3.2 POST

Creates a new record type. The advantage of this method (over PUT on /repository/schema/recordType/{name}) is that you are sure you are performing a create, not an update.

The posted entity should be a record type embedded in the following structure:

{
  action: "create",
  recordType: {}
}

The response Location header will point to /repository/schema/recordType/{prefix$name}?ns.prefix=namespace.

7.2.3.5.4 /repository/schema/recordTypeById
7.2.3.5.4.1 GET

Same as for /repository/schema/recordType.

7.2.3.5.4.2 POST

Same as for /repository/schema/recordType.

The response Location header will point to /repository/schema/recordTypeById/{id}.

7.2.3.5.5 /repository/schema/recordType/{prefix$name}/version/{version}?ns.prefix=namespace
7.2.3.5.5.1 GET

Gets a specific version of a record type.

7.2.3.5.6 /repository/schema/recordTypeById/{id}/version/{version}
7.2.3.5.6.1 GET

Gets a specific version of a record type.

7.2.3.6 Resources for records

7.2.3.6.1 Common stuff
7.2.3.6.1.1 Specify fields to return

For operations which return a record, you can specify the fields which should be returned using a request parameter fields. For example:

/repository/record/{id}?fields=p$field1,p$field2&ns.p=namespace

7.2.3.6.2 /repository/record
7.2.3.6.2.1 POST

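Creates a record. Create a record this way (rather than using PUT on /repository/record/{id}) when you want Lily to assign the record ID, or when you want to be sure you are performing a create and not an update.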

The posted entity should be a record embedded in the following structure:

{
  action: "create",
  record: {}
}

7.2.3.6.3 /repository/record/{id}
7.2.3.6.3.1 GET

Gets a record.

7.2.3.6.3.2 PUT

Creates or updates a record. For a create, this assumes you assign the ID yourself. You can use the POST method on the /repository/record resource if you want Lily to assign the ID. If you want to update a record without the risk of creating it, use the POST method on this resource.

The set of submitted fields can be sparse: you only need to specify the fields you want to update. Missing fields will not be deleted; to delete fields, specify them in the fieldsToDelete property.

TODO: the returned record currently contains the same fields as the submitted record. See issue 100.

An update might cause the creation of a new version. The response to a successful update is however always 200.

7.2.3.6.3.3 POST
7.2.3.6.3.3.1 Using POST to update a record

The posted entity should be a record embedded in the following structure:

{
  action: "update",
  record: {}
}

7.2.3.6.3.3.2 Using POST to conditionally update a record

It is possible to update a record only if certain conditions are satisfied. This is typically used for optimistic concurrency control.

The conditions are specified as an extra property 'conditions' next to the record itself.

Example syntax:

{
  action: "update",
  record: {},
  conditions: [
    { field: 'prefix$name',
      value: value or null,
      operator: 'less|less_or_equal|equal|not_equal|greater_or_equal|greater',
      allowMissing: true|false
    },
    [ more conditions ]
  ],
  namespaces: {}
}

You can specify one or more conditions; all conditions must be satisfied for the update to go through.

For each condition, you can specify:

Since you need to use qualified field names in the conditions, the namespaces must be visible at that level, and hence declared outside of the record (they do not need to be repeated inside the record).

If the update cannot be performed because one of the conditions is not satisfied, the response status will be 409 Conflict. The response body will contain the stored record state.

Below is a full example of an update with two conditions. One of the conditions checks on the record version through the special system namespace.

{
  action: "update",
  record: {
    fields: {
      'p$field1': 'value2'
    }
  },
  conditions: [
    {
      field: 'p$field1',
      value: 'value1'
    },
    {
      field: 's$version',
      value: 1
    }
  ],
  namespaces: {
    'my.namespace': 'p',
    'org.lilyproject.system': 's'
  }
}
7.2.3.6.3.3.3 Using POST to delete a record

To do a normal delete of a record, use the DELETE method on this resource.

Deleting via POST lets you specify conditions, and thus do a conditional delete, similar to conditional updates.

The posted entity should follow this syntax:

{
  action: "delete",
  conditions: [],
  namespaces: {}
}

Conditions are optional, and hence namespaces too.

A successful delete reports 204 No Content. In case the conditions are not satisfied, then the response status is 409 Conflict, and the body will contain a record snapshot containing the fields of the record that were used in the conditions (as far as they exist in the record).

7.2.3.6.3.4 DELETE

Deletes a record.

In case of success, this will report 204 No Content.

7.2.3.6.4 /repository/record/{id}/version/{version}
7.2.3.6.4.1 GET

Retrieves a specific version of a record.

The returned entity is a normal record JSON with the version attribute set to the specific version.

7.2.3.6.4.2 PUT

Use this to update the versioned-mutable fields of an existing version. It cannot be used to update versioned or non-versioned fields; any such fields will be ignored.

7.2.3.6.5 /repository/record/{id}/vtag/{vtag}
7.2.3.6.5.1 GET

Retrieves a version of a record identified by version tag.

The vtag namespace (org.lilyproject.vtag) is implied, the vtag should not contain a prefix.

For example:

http://myhost/repository/record/USER.foobar/vtag/last

7.2.3.6.6 /repository/record/{id}/version
7.2.3.6.6.1 GET

Gets information from multiple versions of a record in one call.

The following request parameters are used to specify the set of versions to be returned:

As when retrieving individual records, you can use the request parameter fields to specify what fields to return.

The returned entity is in the list format.

7.2.3.6.7 /repository/record/{id}/variant
7.2.3.6.7.1 GET

Gets the list of variants of this record.

The response format is the list format. The only record property that will be assigned is the id, no fields are returned.

7.2.3.7 Resources for blobs

7.2.3.7.1 Introduction

To create a record with blobs, you first need to upload the blobs by POSTing them to the /repository/blob resource. This gives you back a piece of JSON, which is exactly the value you should provide for the blob field in the record.

7.2.3.7.2 /repository/blob
7.2.3.7.2.1 POST

Creates a new blob.

The request must specify the headers Content-Type and Content-Length, as these might be used to determine the storage location for the blob.

The blob content itself should be the submitted entity (without any wrapping or encoding).

If successful, this will respond with 200 OK. It does not respond with "201 Created" and a Location header because no accessible resource is created at this point. You need to associate the blob with a record in order for it to become accessible.

The response body will contain something of the following form:

{
  value: "string (encoded byte array, identifying the blob)",
  size: long,
  mimeType: "string"
}

This JSON is what you need to put in the value of the blob field when creating or updating a record.

7.2.3.7.3 /repository/record/{id}/field/{fieldName}/data
7.2.3.7.3.1 GET

Retrieves the blob from the specified field, from the latest version of the record.

Remember that the fieldName is a namespaced name:

http://myhost/repository/record/USER.foobar/field/n$myBlobField/data?ns.n=org.my.namespace

If the blob field is of value type LIST or PATH, you can specify which blob to retrieve using the request parameter indexes, which is a comma-separated list of integers. The indexes are zero-based.

7.2.3.7.4 /repository/record/{id}/version/{version}/field/{fieldName}/data
7.2.3.7.4.1 GET

Same as the previous resource, but retrieves the blob from the specified version of the record.

7.2.3.8 Resources for scanners

Scanners allow to run sequentially over all or part of the records stored in the repository. For an introduction to scanners, see Scanning Records And Record Locality.

In contrast to other resources in Lily's REST interface, scanners are stateful (in the sense that the scan resource encapsulates runtime application state). They only exist in the server where they are created. This means that all requests for a given scanner need to go to the same server, and thus cannot be arbitrarily load-balanced.

7.2.3.8.1 /repository/scan
7.2.3.8.1.1 POST

Creates a new scanner.

The body should contain the definition of the scanner as described in the record scan format.

The response Location header will point to the created scan: /repository/scan/{scan-id}.

When it is no longer needed, a scan should be deleted. An expiry mechanism cleans up scans after some delay (default 1 hour), but for regular use you should not rely on this.

7.2.3.8.2 /repository/scan/{scan-id}
7.2.3.8.2.1 GET

Gets the next record(s) from the scanner. You can retrieve multiple records at once using the request parameter batch:

/repository/scan/{scan-id}?batch=100

The returned entity is in the list format.

If there are no more (or simply no) records, a 204 No Content response is given.

If the scan does not exist, a 404 Not Found response is given.

7.2.3.8.2.2 DELETE

Deletes (cleans up) a scan.

7.2.3.9 Resources for index management

These resources give access to the definition of the SOLR indexes, similar to what you can do with commands like lily-list-indexes (see Managing Indexes).

The update functionality is currently limited to updating the index state flags.

7.2.3.9.1 /index
7.2.3.9.1.1 GET

Gives the information of all the indexes.

Sample output, here just one index is defined:

[
  {
    "name": "index1",
    "configuration": "PD94bWwgdmVy...",
    "generalState": "ACTIVE",
    "batchBuildState": "INACTIVE",
    "updateState": "SUBSCRIBE_AND_LISTEN",
    "activeBatchBuildInfo": null,
    "lastBatchBuildInfo": null,
    "solrShards": {
      "shard1": "http://localhost:8983/solr"
    },
    "shardingConfiguration": null,
    "queueSubscriptionId": "IndexUpdater_index1",
    "zkDataVersion": 1
  }
]

The configuration, which is cut off here, consists of the XML bytes encoded as base64.
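To read the configuration, decode the base64 string back to XML. The sketch below is one way to do that in Java, under two assumptions: that java.util.Base64 is available (Java 8 or later on the client side), and that the XML bytes are UTF-8 encoded. The IndexConfigSketch class is a hypothetical name, and the input is only the visible start of the truncated sample value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class IndexConfigSketch {
    /** Decodes the base64 "configuration" property into XML text (UTF-8 assumed). */
    static String decodeConfiguration(String base64) {
        byte[] xmlBytes = Base64.getDecoder().decode(base64);
        return new String(xmlBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Only the part of the sample value that is visible before the "...".
        System.out.println(decodeConfiguration("PD94bWwgdmVy")); // prints <?xml ver
    }
}
```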

7.2.3.9.2 /index/{name}
7.2.3.9.2.1 GET

Gives information about one specific index.

7.2.3.9.2.2 PUT

Updates the state flags of the index. This is done by submitting the same JSON as returned by the GET operation, though in fact it only needs to contain one or more of these attributes: generalState, updateState, buildState.

7.2.3.9.3 /index/{name}/config
7.2.3.9.3.1 GET

Returns just the indexer configuration for this index (still embedded within the json structure).

7.2.3.10 Resources for the rowlog

7.2.3.10.1 /rowlog
7.2.3.10.1.1 GET

Gives the list of rowlogs, with information about their subscriptions and listeners.

7.2.3.10.2 /rowlog/{id}
7.2.3.10.2.1 GET

Same as /rowlog, but only returns the information of one specific rowlog.

8 Java Developers

8.1 Repository API Tutorial

8.1.1 Before reading this

Before reading this, it is recommended to first go through the repository model documentation.

8.1.2 API design

In the design of Lily's repository API we chose to use dumb data objects (objects which are pure data structures) in combination with a few service-style interfaces. Because of these data objects, there is no difference between Record objects that you instantiate yourself and those that you retrieve from the repository.

The repository API consists mostly of interfaces, even for the data objects. As you will see in the examples below, the consequence is that these objects are instantiated via factory methods.

The API classes are defined in a separate project, lily-repository-api, independent from any implementation.

8.1.3 API tutorial code

All the code used in this tutorial can also be found in the class TutorialTest of the project repository-api-tutorial.

8.1.4 API reference

See the Javadoc-based API documentation.

8.1.5 API run-through

8.1.5.1 Project set-up

For programming against the API, you only need a dependency on the project lily-repository-api.

For actually talking to Lily, you need a bunch of implementation classes too. Basically, you need the lily-client project and all its dependencies. If you use Maven to build your project and take a dependency on lily-client, everything you need is automatically pulled in.

Below we show a Maven pom you can use to get started. Note that this assumes you have actually built Lily from source so that the Lily artifacts are installed in your local Maven repository.

<project
    xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>org.mydomain</groupId>
  <artifactId>myproject</artifactId>
  <version>1.0-dev</version>
  <name>My Lily-based project</name>

  <build/>

  <dependencies>
    <dependency>
      <groupId>org.lilyproject</groupId>
      <artifactId>lily-client</artifactId>
      <version>[unresolved variable: artifactVersion]</version>
    </dependency>
  </dependencies>

</project>

8.1.5.2 Connecting to Lily

In all the examples below, we will assume you have already obtained access to a Lily Repository object.

Here we explain how you can get access to that.

The Lily nodes publish their availability and address to ZooKeeper. The class LilyClient uses this information to provide you with Repository objects.

In your code, you can get access to the Repository as follows:

import org.lilyproject.client.LilyClient;
import org.lilyproject.repository.api.Repository;

...

LilyClient lilyClient = new LilyClient("localhost:2181", 20000);
Repository repository = lilyClient.getRepository();

The first argument to the LilyClient constructor is the ZooKeeper connection string.

For each method you call on the Repository object, it picks one of the available Lily servers at random to perform the operation against. If a method fails due to an IO-related exception (for example, a Lily node went down or there was a temporary network hiccup), it is automatically retried.

When an IO exception occurs, we cannot know whether the server already received our request and performed it. For an operation like 'create record', this means that a retried operation could fail because the record was already created by the previous, 'failed', request. Or, if you let the server assign record IDs, it can mean that two records are created. Therefore, create operations are by default only retried when we are sure the request was not yet initiated. This behavior can be configured by manipulating the RetryConf object obtained via Repository.getRetryConf().
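The retry rule described above can be sketched in plain Java. This is not the LilyClient implementation: the class RetrySketch, the marker exception NotYetInitiatedException and the maxAttempts parameter are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class RetrySketch {

    /** Hypothetical marker: the failure happened before the request left the client. */
    static class NotYetInitiatedException extends IOException {
        NotYetInitiatedException(String message) { super(message); }
    }

    /**
     * Runs the operation, retrying on IO failures. For create operations a retry
     * only happens when the request is known not to have been initiated.
     */
    static <T> T call(Callable<T> operation, boolean isCreate, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                boolean safeToRetry = !isCreate || e instanceof NotYetInitiatedException;
                if (!safeToRetry) {
                    throw e; // the server may already have created the record
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger attempts = new AtomicInteger();
        String result = call(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IOException("node unavailable");
            }
            return "read ok";
        }, false, 5);
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```

A read is retried freely because repeating it has no side effects, while a create that fails with an ordinary IOException is surfaced to the caller immediately.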

8.1.5.3 Prerequisites

To avoid a bit of boilerplate code in the code listings, we make the following assumptions.

A variable typeManager is available, which is obtained from the Repository as follows:

TypeManager typeManager = repository.getTypeManager();

A variable BNS (book namespace) is available, which is the namespace for the schema types, and can be declared as follows:

String BNS = "book";

8.1.5.4 Creating a record type

Before we can create any records in the repository, we need to create a schema: a record type and some field types. For the purpose of this tutorial, we will make a Book record type.

// (1)
ValueType stringValueType = typeManager.getValueType("STRING");

// (2)
FieldType title = typeManager.newFieldType(stringValueType, new QName(BNS, "title"), Scope.VERSIONED);

// (3)
title = typeManager.createFieldType(title);

// (4)
RecordType book = typeManager.newRecordType(new QName(BNS, "Book"));
book.addFieldTypeEntry(title.getId(), true);

// (5)
book = typeManager.createRecordType(book);

// (6)
PrintUtil.print(book, repository);

It is useful to explain this piece of code in detail, as the same patterns will be repeated in the remainder of the code samples.

Steps (1) to (3) can actually be done in just one statement; we will do that in the next example.

Output:

Name = {book}Book
ID = d716e794-213c-4ffe-be11-359cb52e017b
Version = 1
Fields:
  Versioned scope:
    Field
      Name = {book}title
      ID = 93bca82e-0b93-496f-9a67-19ded2b3740b
      Mandatory = true
      ValueType = STRING

8.1.5.5 Updating a record type

We will now update the previously created record type with some more fields. We use a variety of value types. The full list of built-in value types can be found in the Javadoc of the TypeManager, method getValueType.

FieldType description = typeManager.createFieldType("BLOB", new QName(BNS, "description"), Scope.VERSIONED);
FieldType authors = typeManager.createFieldType("LIST<STRING>", new QName(BNS, "authors"), Scope.VERSIONED);
FieldType released = typeManager.createFieldType("DATE", new QName(BNS, "released"), Scope.VERSIONED);
FieldType pages = typeManager.createFieldType("LONG", new QName(BNS, "pages"), Scope.VERSIONED);
FieldType sequelTo = typeManager.createFieldType("LINK", new QName(BNS, "sequel_to"), Scope.VERSIONED);
FieldType manager = typeManager.createFieldType("STRING", new QName(BNS, "manager"), Scope.NON_VERSIONED);
FieldType reviewStatus = typeManager.createFieldType("STRING", new QName(BNS, "review_status"), Scope.VERSIONED_MUTABLE);

RecordType book = typeManager.getRecordTypeByName(new QName(BNS, "Book"), null);

// The order in which fields are added does not matter
book.addFieldTypeEntry(description.getId(), false);
book.addFieldTypeEntry(authors.getId(), false);
book.addFieldTypeEntry(released.getId(), false);
book.addFieldTypeEntry(pages.getId(), false);
book.addFieldTypeEntry(sequelTo.getId(), false);
book.addFieldTypeEntry(manager.getId(), false);
book.addFieldTypeEntry(reviewStatus.getId(), false);

// Now we call updateRecordType instead of createRecordType
book = typeManager.updateRecordType(book);

PrintUtil.print(book, repository);

Output:

Name = {book}Book
ID = d716e794-213c-4ffe-be11-359cb52e017b
Version = 2
Fields:
  Non-versioned scope:
    Field
      Name = {book}manager
      ID = bd9e6764-222f-4b82-bb4d-d1bc72a2c0bb
      Mandatory = false
      ValueType = STRING
  Versioned scope:
    Field
      Name = {book}authors
      ID = 9cfb5c07-dec5-469e-bdcd-436c299badd1
      Mandatory = false
      ValueType = LIST<STRING>
    Field
      Name = {book}description
      ID = 5ce3bdc0-319d-4a57-b052-54bedc77b145
      Mandatory = false
      ValueType = BLOB
    Field
      Name = {book}pages
      ID = cfb90755-f78e-43f8-8c5e-e46397b10296
      Mandatory = false
      ValueType = LONG
    Field
      Name = {book}released
      ID = 50f283d5-30b3-45d3-92d4-0245d8068902
      Mandatory = false
      ValueType = DATE
    Field
      Name = {book}sequel_to
      ID = 57431a4a-bab9-41c5-a340-19c5c2bf537c
      Mandatory = false
      ValueType = LINK
    Field
      Name = {book}title
      ID = 93bca82e-0b93-496f-9a67-19ded2b3740b
      Mandatory = true
      ValueType = STRING
  Versioned-mutable scope:
    Field
      Name = {book}review_status
      ID = 295c9568-43bf-406d-b241-e9582a62d5b0
      Mandatory = false
      ValueType = STRING

The version of the Book record type is now 2.

8.1.5.6 Creating a record

Now that we have a record type, let's create a record.

// (1)
Record record = repository.newRecord();

// (2)
record.setRecordType(new QName(BNS, "Book"));

// (3)
record.setField(new QName(BNS, "title"), "Lily, the definitive guide, 3rd edition");

// (4)
record = repository.create(record);

// (5)
PrintUtil.print(record, repository);

This calls for some more explanation:

In the PrintUtil output for records, the namespaces of the fields are listed once at the top, and in the remainder of the output a prefix is used, like n1, n2, ... This is only a feature of PrintUtil; the record itself knows nothing about these prefixes.

Output:

ID = UUID.dc799aca-bb4b-4e02-8fc0-8569d26368b5
Version = 1
Non-versioned scope:
  Record type = {book}Book, version 2
Versioned scope:
  Record type = {book}Book, version 2
  {book}title = Lily, the definitive guide, 3rd edition

8.1.5.7 Creating a record with a user-specified ID

In the previous example, the record ID was assigned by the repository. You can also assign it yourself. If you assign an ID that already exists within the repository, a RecordExistsException will be thrown.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setDefaultNamespace(BNS);
record.setRecordType("Book");
record.setField("title", "Lily, the definitive guide, 3rd edition");
record = repository.create(record);

PrintUtil.print(record, repository);

Self-assigned record IDs will never clash with those generated by the repository, because they are in a different namespace.

8.1.5.7.1 Use setDefaultNamespace to avoid QName

In the example above you will notice another difference: the setField() and setRecordType() methods take a simple string as argument instead of a QName object. This is possible because we first called setDefaultNamespace() on the record. Internally, the QName object is still created. The default namespace is just a volatile helper attribute on the record: it is not stored in the repository.

Output:

ID = USER.lily-definitive-guide-3rd-edition
Version = 1
Non-versioned scope:
  Record type = {book}Book, version 2
Versioned scope:
  Record type = {book}Book, version 2
  {book}title = Lily, the definitive guide, 3rd edition
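To make the role of the default namespace concrete, here is a minimal sketch of how a short field name can be resolved into a qualified one. The class DefaultNamespaceSketch is a hypothetical stand-in, not the Lily Record interface, and it writes qualified names as plain strings in the {namespace}name notation used in the output above instead of using QName objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a record object; not the Lily Record interface.
final class DefaultNamespaceSketch {

    private String defaultNamespace; // helper state only, never persisted
    private final Map<String, Object> fields = new LinkedHashMap<String, Object>();

    void setDefaultNamespace(String namespace) {
        this.defaultNamespace = namespace;
    }

    /** Fully qualified form: the key is written as {namespace}name. */
    void setField(String namespace, String name, Object value) {
        fields.put("{" + namespace + "}" + name, value);
    }

    /** Short form: the qualified name is still built, using the default namespace. */
    void setField(String name, Object value) {
        if (defaultNamespace == null) {
            throw new IllegalStateException("no default namespace set");
        }
        setField(defaultNamespace, name, value);
    }

    Map<String, Object> getFields() {
        return fields;
    }

    public static void main(String[] args) {
        DefaultNamespaceSketch record = new DefaultNamespaceSketch();
        record.setDefaultNamespace("book");
        record.setField("title", "Lily, the definitive guide, 3rd edition");
        System.out.println(record.getFields());
    }
}
```

Both forms end up storing the same qualified key, which is why the default namespace does not need to be saved with the record.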

8.1.5.8 Updating a record

Updating a record consists of calling repository.update() with a record object whose ID has been set to that of an existing record. If the record does not exist, a RecordNotFoundException will be thrown.

We use the repository.newRecord() method, even though what we are doing is updating an existing record. Remember that this method is used to instantiate a record object, not to create a record. When updating a record, you only need to set the fields in the record that you actually want to change. Fields that are not set will not be deleted; deleting fields is done by calling record.delete(fieldName, true) or record.addFieldsToDelete().

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setDefaultNamespace(BNS);
record.setField("title", "Lily, the definitive guide, third edition");
record.setField("pages", Long.valueOf(912));
record.setField("manager", "Manager M");
record = repository.update(record);

PrintUtil.print(record, repository);

When updating a record, its record type will automatically move to the latest version of the record type, unless you specify a specific version. The record type of each scope in which fields were modified will be set to this record type, in addition to the record type of the non-versioned scope, which is always updated since it is considered to be the reference record type.

In the output, you will notice that the version has been incremented to 2:

ID = USER.lily-definitive-guide-3rd-edition
Version = 2
Non-versioned scope:
  Record type = {book}Book, version 2
  {book}manager = Manager M
Versioned scope:
  Record type = {book}Book, version 2
  {book}pages = 912
  {book}title = Lily, the definitive guide, third edition

8.1.5.9 Updating a record via read

Besides updating a record by creating a record object via newRecord and setting the updated field values on it, you can also read an existing record and modify that object to supply it to the repository.update() method.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.read(id);
record.setDefaultNamespace(BNS);
record.setField("released", new LocalDate());
record.setField("authors", Arrays.asList("Author A", "Author B"));
record.setField("review_status", "reviewed");
record = repository.update(record);

PrintUtil.print(record, repository);

The authors field is a LIST-type field, so its value should be specified as a List object.

Output:

ID = USER.lily-definitive-guide-3rd-edition
Version = 3
Non-versioned scope:
  Record type = {book}Book, version 2
  {book}manager = Manager M
Versioned scope:
  Record type = {book}Book, version 2
  {book}authors = 
    [0] Author A
    [1] Author B
  {book}pages = 912
  {book}released = 2012-01-10
  {book}title = Lily, the definitive guide, third edition
Versioned-mutable scope:
  Record type = {book}Book, version 2
  {book}review_status = reviewed

As you can see, this record now has 3 versions. Each time one or more versioned fields are updated, a new version is created. If in a certain update operation you only change non-versioned fields, then no new version will be created. If you create a new record with only non-versioned fields, it will not have any versions (TODO: at the time of this writing, this is not true, a dummy version 1 is created).
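The versioning rule above, together with the partial-update behavior from the previous sections, can be modelled in a few lines. The class VersioningSketch is a hypothetical in-memory model, not how Lily stores records. It makes two simplifying assumptions: any versioned field present in the change set counts as an update, and a record without versioned fields has version number 0 (the intended behavior, not the dummy version 1 mentioned in the TODO).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the versioning rule; not how Lily stores records.
final class VersioningSketch {

    private final Set<String> versionedFields;
    private final Map<String, Object> nonVersioned = new HashMap<String, Object>();
    private final List<Map<String, Object>> versions = new ArrayList<Map<String, Object>>();

    VersioningSketch(Set<String> versionedFields) {
        this.versionedFields = new HashSet<String>(versionedFields);
    }

    /**
     * Applies a partial update and returns the resulting version number (0 = no versions).
     * Fields missing from the change set keep their current value.
     */
    int update(Map<String, Object> changes) {
        Map<String, Object> next = versions.isEmpty()
                ? new HashMap<String, Object>()
                : new HashMap<String, Object>(versions.get(versions.size() - 1));
        boolean versionedFieldTouched = false;
        for (Map.Entry<String, Object> change : changes.entrySet()) {
            if (versionedFields.contains(change.getKey())) {
                next.put(change.getKey(), change.getValue());
                versionedFieldTouched = true;
            } else {
                nonVersioned.put(change.getKey(), change.getValue());
            }
        }
        if (versionedFieldTouched) {
            versions.add(next); // earlier versions stay as they were
        }
        return versions.size();
    }

    /** Reads a versioned field as it was in the given version (1-based). */
    Object readVersioned(String field, int version) {
        return versions.get(version - 1).get(field);
    }

    Object readNonVersioned(String field) {
        return nonVersioned.get(field);
    }

    public static void main(String[] args) {
        Set<String> versioned = new HashSet<String>();
        versioned.add("title");
        VersioningSketch record = new VersioningSketch(versioned);

        Map<String, Object> first = new HashMap<String, Object>();
        first.put("title", "3rd edition");
        System.out.println(record.update(first));  // 1: a versioned field was set

        Map<String, Object> second = new HashMap<String, Object>();
        second.put("manager", "Manager M");
        System.out.println(record.update(second)); // still 1: only a non-versioned field changed
    }
}
```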

8.1.5.10 Updating versioned-mutable fields

Normal versioned fields are immutable after creation. After all, the purpose of versions is to see the history of previous edits, and hence it should not be possible to rewrite that history. Versioned-mutable fields are versioned fields which can be updated for existing versions. This is useful for meta-data about the version.

[TODO: example of this.]

8.1.5.11 Updating a record conditionally

It is possible to let an update of a record only go through if the current record state satisfies some conditions. This is useful for optimistic concurrency control.

The example below shows how to update the manager field to "Manager P", but only if the current value is "Manager Z" (which it is not).

List<MutationCondition> conditions = new ArrayList<MutationCondition>();
conditions.add(new MutationCondition(new QName(BNS, "manager"), "Manager Z"));

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.read(id);
record.setField(new QName(BNS, "manager"), "Manager P");
record = repository.update(record, conditions);

System.out.println(record.getResponseStatus());

When the conditions are not satisfied, as is the case here, the update() method will not throw an exception, but rather the responseStatus field of the record object will be set to ResponseStatus.CONFLICT.

If you supply multiple MutationCondition objects, they all need to be satisfied for the update to go through. MutationCondition also supports operators other than a simple equals check, for example for checking whether a field is null or not null, for checking the record version, etc.
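The check-then-write behavior behind a conditional update can be modelled as follows. This is a hypothetical in-memory sketch, not the Lily implementation: the class ConditionalUpdateSketch and its Status enum are invented for this example, and only equality conditions are modelled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory model of a conditional update; not the Lily implementation.
final class ConditionalUpdateSketch {

    enum Status { UPDATED, CONFLICT }

    private final Map<String, Object> fields = new HashMap<String, Object>();

    /** Applies the changes only if every expected field value matches the current state. */
    synchronized Status update(Map<String, Object> changes, Map<String, Object> expected) {
        for (Map.Entry<String, Object> condition : expected.entrySet()) {
            Object current = fields.get(condition.getKey());
            if (!Objects.equals(current, condition.getValue())) {
                return Status.CONFLICT; // nothing is written
            }
        }
        fields.putAll(changes);
        return Status.UPDATED;
    }

    synchronized Object get(String field) {
        return fields.get(field);
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch record = new ConditionalUpdateSketch();
        Map<String, Object> noConditions = new HashMap<String, Object>();
        Map<String, Object> initial = new HashMap<String, Object>();
        initial.put("manager", "Manager M");
        record.update(initial, noConditions);

        Map<String, Object> change = new HashMap<String, Object>();
        change.put("manager", "Manager P");
        Map<String, Object> expected = new HashMap<String, Object>();
        expected.put("manager", "Manager Z");
        System.out.println(record.update(change, expected)); // CONFLICT
        System.out.println(record.get("manager"));           // Manager M
    }
}
```

Checking the conditions and applying the changes happen inside one synchronized method, so no other writer can slip in between the check and the write. That atomicity is what makes the pattern usable for optimistic concurrency control.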

8.1.5.12 Reading a record

Let's have a look at the different options for reading a record.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");

// (1)
Record record = repository.read(id);
String title = (String)record.getField(new QName(BNS, "title"));
System.out.println(title);

// (2)
record = repository.read(id, 1L);
System.out.println(record.getField(new QName(BNS, "title")));

// (3)
record = repository.read(id, 1L, Arrays.asList(new QName(BNS, "title")));
System.out.println(record.getField(new QName(BNS, "title")));

Output:

Lily, the definitive guide, third edition
Lily, the definitive guide, 3rd edition
Lily, the definitive guide, 3rd edition

8.1.5.13 Working with blob fields

Blob fields ("binary large object") are fields for storing arbitrary binary data. Since this could be a large amount of data, the content of blobs is not simply transported as part of the repository.read() or repository.update() calls. Instead, blobs are read and written as streams.

On the level of the Record object, the value of a blob field is a Blob object. This object holds some metadata such as a mime-type (this identifies the type of content such as "text/html" or "image/png"), the size of the blob, and an optional name which is often used as a suggestion for filename in case a user would download the blob to the desktop.

The actual data of a blob can be stored in different ways, depending upon configuration:

As repository API user, you are not really aware of these different stores.

Below is some example code:

//
// Write a blob
//

String description = "<html><body>This book gives thorough insight into Lily, ...</body></html>";
byte[] descriptionData = description.getBytes("UTF-8");

// (1)
Blob blob = new Blob("text/html", (long)descriptionData.length, "description.xml");
OutputStream os = repository.getOutputStream(blob);
try {
    os.write(descriptionData);
} finally {
    os.close();
}

// (2)
RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setField(new QName(BNS, "description"), blob);
record = repository.update(record);

//
// Read a blob
//
InputStream is = null;
try {
    is = repository.getInputStream(record, new QName(BNS, "description"));
    System.out.println("Data read from blob is:");
    Reader reader = new InputStreamReader(is, "UTF-8");
    char[] buffer = new char[20];
    int read;
    while ((read = reader.read(buffer)) != -1) {
        System.out.print(new String(buffer, 0, read));
    }
    System.out.println();
} finally {
    if (is != null) is.close();
}

(1) To store a blob in the repository, you first create a Blob object. You need to specify the size of the blob; the repository will use this to determine where to store the blob. Then you request an output stream to upload the blob via repository.getOutputStream(blob), and write all the data to it. Finally, the output stream is closed. At that moment the repository will update the Blob object with a reference to the storage location that sits behind the output stream.

(2) Once the blobs are uploaded, you can create the record object as usual, setting the blob field (here description) with the blob object, and then call repository.update() to store the change in the repository (for a new record you would call repository.create() instead).

If the operation were abandoned between the previous two steps, there would be an orphan blob in the repository. You do not need to worry about this: it will automatically expire and be removed (by default, after 1 hour).

Reading the blob is done by using the repository.getInputStream() method, specifying the record and field from which to read the blob. Instead of passing the record object to the getInputStream method, you could as well specify the record id, so it is not required to first retrieve the record. But if you have already retrieved the record anyway, then passing the record object will allow for optimized retrieval of blobs which are stored inline in the record (which is the case for small blobs).

Above we wrote a custom while loop to retrieve the data from the InputStream, but we recommend using the IOUtils class from the Apache commons-io project instead.
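With commons-io on the classpath, a call such as IOUtils.toString(is, "UTF-8") replaces the loop. Where that library is not available, the loop can be wrapped in a small helper like the one sketched below. The class name StreamText is made up for this example, and a ByteArrayInputStream stands in for the stream that the repository would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class StreamText {

    /** Reads the whole stream as text in the given charset; the caller closes the stream. */
    static String readFully(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for the stream returned by the repository.
        InputStream is = new ByteArrayInputStream("<html><body>...</body></html>".getBytes("UTF-8"));
        try {
            System.out.println(readFully(is, "UTF-8"));
        } finally {
            is.close();
        }
    }
}
```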

8.1.5.14 Creating variants

Creating a variant record is the same as creating a record; you just have to use an ID that contains variant properties.

In the example below we use variants to create records about the same book in two languages (en - English, nl - Dutch). The two records will share the same master record ID.

// (1)
IdGenerator idGenerator = repository.getIdGenerator();
RecordId masterId = idGenerator.newRecordId();

// (2)
Map<String, String> variantProps = new HashMap<String, String>();
variantProps.put("language", "en");

// (3)
RecordId enId = idGenerator.newRecordId(masterId, variantProps);

// (4)
Record enRecord = repository.newRecord(enId);
enRecord.setRecordType(new QName(BNS, "Book"));
enRecord.setField(new QName(BNS, "title"), "Car maintenance");
enRecord = repository.create(enRecord);

// (5)
RecordId nlId = idGenerator.newRecordId(enRecord.getId().getMaster(), Collections.singletonMap("language", "nl"));
Record nlRecord = repository.newRecord(nlId);
nlRecord.setRecordType(new QName(BNS, "Book"));
nlRecord.setField(new QName(BNS, "title"), "Wagen onderhoud");
nlRecord = repository.create(nlRecord);

// (6)
Set<RecordId> variants = repository.getVariants(masterId);
for (RecordId variant : variants) {
    System.out.println(variant);
}

Some more explanation:

Output:

UUID:d947dda0-cadb-4e84-b1bc-38567d05fb56VARIANT:language,nl
UUID:d947dda0-cadb-4e84-b1bc-38567d05fb56VARIANT:language,en

While not shown in this example, it is also possible to create the record that corresponds to the plain master record ID, which could be used to store information shared by all the variants. In this example that could be for information that does not need to be translated.

Other than the shared identity between variant records, the repository itself does not have special functionality around variants. Rather, it is the indexer and the front end that add this, for example by aggregating information from different variants.

8.1.5.15 Link fields

One of the field value types supported by Lily is the link type. We usually simply speak of link fields (just as we use 'string fields', 'long fields', etc. for the other value types). A link field allows you to store a link to another record in a field.

The following example illustrates this.

// (1)
Record record1 = repository.newRecord();
record1.setRecordType(new QName(BNS, "Book"));
record1.setField(new QName(BNS, "title"), "Fishing 1");
record1 = repository.create(record1);

// (2)
Record record2 = repository.newRecord();
record2.setRecordType(new QName(BNS, "Book"));
record2.setField(new QName(BNS, "title"), "Fishing 2");
record2.setField(new QName(BNS, "sequel_to"), new Link(record1.getId()));
record2 = repository.create(record2);

PrintUtil.print(record2, repository);

// (3)
Link sequelToLink = (Link)record2.getField(new QName(BNS, "sequel_to"));
RecordId sequelTo = sequelToLink.resolve(record2.getId(), repository.getIdGenerator());
Record linkedRecord = repository.read(sequelTo);
System.out.println(linkedRecord.getField(new QName(BNS, "title")));

In this example, we created a record about a book "Fishing 2" which is a sequel to the book "Fishing 1". We link them via the sequel_to field. The value that should be assigned to a link field is a Link object. In its simplest form, a link is basically a RecordId. The RecordId of a record can be obtained via the Record.getId() method.

Now suppose we had read record2 outside of this context, so without knowing what it was a sequel to. In that case, we could find out what book preceded Fishing 2 by reading its sequel_to field. This gives a Link object, which needs to be resolved in the context of the record it occurs in, see the resolve call. The resolve method returns a RecordId which can be used to fetch the record from the repository, as shown in step (3).

The output is obviously:

Fishing 1

As with variants, the repository itself does not do anything fancy with link fields, but the indexer, for example, can denormalize information from linked documents to search on it.

8.1.5.15.1 Link versus RecordId

In the example above, the Link could as well have been the RecordId, and the resolve step was not really necessary. However, it is also possible to have relative links which need to be resolved against the record they occur in. For example, a link can inherit the variant properties of the record it occurs in. For more information on this, see the javadoc of the Link class.

8.1.5.16 Complex Fields

Sometimes you might want to store a more complex value in a field: not a simple value like a string, but a value which is itself composed of multiple fields. In Lily this is possible by creating fields of type RECORD. These are fields in which you can put Record objects. These are not real records with their own identity; the top-level Record data structure is simply re-used as a value within the field of another record. Since any record object can have fields which themselves contain records (or lists of records), this allows for modeling arbitrarily complex structures.

Before you use complex fields, you should always ask yourself whether you want complex fields or rather link fields (which contain pointers to other records). Both enable you to store the same kinds of nested/complex structures. In the case of complex fields, the nested structures (nested records) are all stored within one record, so they do not have their own identity and are hence not separately retrievable or indexable. Link fields pointing to other records give each part of the nested structure its own identity, but at the cost of having to create/read multiple records, and losing the atomicity of the create operation.

Since complex fields are modeled in Lily by creating field types whose value type is RECORD, they are also called record-type fields.

In the following example, we will create articles which have authors. Each author has a name and email attribute. For the sake of this example, we are going to store the authors within the article, in a complex field. So there will be no re-use of the same author records across articles.

Here's the code:

final String ANS = "article";

// (1)
FieldType name = typeManager.createFieldType("STRING", new QName(ANS, "name"), Scope.NON_VERSIONED);
FieldType email = typeManager.createFieldType("STRING", new QName(ANS, "email"), Scope.NON_VERSIONED);

RecordType authorType = typeManager.newRecordType(new QName(ANS, "author"));
authorType.addFieldTypeEntry(name.getId(), true);
authorType.addFieldTypeEntry(email.getId(), true);
authorType = typeManager.createRecordType(authorType);

// (2)
FieldType title = typeManager.createFieldType("STRING", new QName(ANS, "title"), Scope.NON_VERSIONED);
FieldType authors = typeManager.createFieldType("LIST<RECORD<{article}author>>",
        new QName(ANS, "authors"), Scope.NON_VERSIONED);
FieldType body = typeManager.createFieldType("STRING", new QName(ANS, "body"), Scope.NON_VERSIONED);

RecordType articleType = typeManager.newRecordType(new QName(ANS, "article"));
articleType.addFieldTypeEntry(title.getId(), true);
articleType.addFieldTypeEntry(authors.getId(), true);
articleType.addFieldTypeEntry(body.getId(), true);
articleType = typeManager.createRecordType(articleType);

// (3)
Record author1 = repository.newRecord();
author1.setRecordType(authorType.getName());
author1.setField(name.getName(), "Author X");
author1.setField(email.getName(), "author_x@authors.com");

Record author2 = repository.newRecord();
author2.setRecordType(new QName(ANS, "author"));
author2.setField(name.getName(), "Author Y");
author2.setField(email.getName(), "author_y@authors.com");

// (4)
Record article = repository.newRecord();
article.setRecordType(articleType.getName());
article.setField(new QName(ANS, "title"), "Title of the article");
article.setField(new QName(ANS, "authors"), Lists.newArrayList(author1, author2));
article.setField(new QName(ANS, "body"), "Body text of the article");
article = repository.create(article);

PrintUtil.print(article, repository);

Explanation:

Finally, we dump the record, which gives the following output:

ID = UUID.141a11c3-66b8-4c2a-a0d7-c01aa38c33fa
Version = null
Non-versioned scope:
  Record type = {article}article, version 1
  {article}authors = 
    [0] 
      Record of type {article}author, version null
      {article}name = Author X
      {article}email = author_x@authors.com
    [1] 
      Record of type {article}author, version null
      {article}name = Author Y
      {article}email = author_y@authors.com
  {article}body = Body text of the article
  {article}title = Title of the article

We see the authors field is a list containing two entries, each of which is a record of type author.

Suppose that authors had been a link field, and that each author was stored as a separate record in its own right. Then the dump would have looked like this:

ID = UUID.141a11c3-66b8-4c2a-a0d7-c01aa38c33fa
Version = null
Non-versioned scope:
  Record type = {article}article, version 1
  {article}authors = 
    [0] UUID.fa1cd18b-ab5b-43f5-95ce-1c1bcced603a
    [1] UUID.95f2cf92-814a-46a6-a815-7dc26e1b3b52
  {article}body = Body text of the article
  {article}title = Title of the article

8.2 Creating Records And Schema Using The Builder API

8.2.1 Introduction

Besides the core repository API, Lily offers an alternative API using builder objects. This API makes use of method call chaining to make for a more fluent way of writing code, a small internal DSL if you like. It avoids having to declare intermediate variables to keep references to things. It also allows more combinations for setting the parameters of the objects to be created, since each parameter is typically set with a different method.

Here we provide a tutorial on getting started with the builder API. You are free to choose between Lily's core API and the builder API; just use what best fits your situation and taste. One disadvantage of the builder API is that method chaining produces very long statements, which sometimes makes it harder to track down which part of the statement caused an error.

If you have a custom object model that you want to map onto Lily, you might want to check out FrogPond, a pojo-Lily mapper.

If you already have a schema and just want to create records, you can directly skip to the section Creating Records.

8.2.2 Creating A Schema

Before we start

If your schema is static, then rather than writing code statements to create the schema, you are better off describing it in the JSON format and importing that. You can also do the import programmatically. Having the schema in JSON rather than code has its advantages: it can be easily transformed, it is isolated, it can be easily shared with non-Java programmers, etc.

Classic API

Let's start with a very simple schema, and first look how it is created using the classic API:

String NS = "my_namespace";

FieldType field1 = typeManager.createFieldType("STRING", new QName(NS, "field1"), VERSIONED);
FieldType field2 = typeManager.createFieldType("STRING", new QName(NS, "field2"), VERSIONED);

RecordType recordType = typeManager.newRecordType(new QName(NS, "recordtype1"));
recordType.addFieldTypeEntry(field1.getId(), false);
recordType.addFieldTypeEntry(field2.getId(), true);
recordType = typeManager.createRecordType(recordType);

System.out.println(recordType.getId());

Create the record type using a builder

Now assuming the field types are already created, let's change the creation of the record type to make use of the builder API:

(1) RecordType recordType = typeManager
(2)        .recordTypeBuilder()
(3)        .defaultNamespace(NS)
(4)        .name("recordtype1")
(5)        .field(field1.getId(), false)
(6)        .field(field2.getId(), true)
(7)        .create();

Let's discuss this code in some detail:

(1) and (2): we create a builder by calling TypeManager.recordTypeBuilder().

(3) we set a default namespace. This namespace will be used for all further names, removing the need to supply QName objects, though you can still use QName as well.

(4) we set the name for the record type, simply as a string. The default namespace set on line (3) will be used to construct the QName.

(5) and (6) we add field type entries to the record type. We refer to the previously created field type objects to fetch their ID. The boolean argument is the mandatory flag.

(7) we create the record type in the repository. This method returns a RecordType object, while all the previous methods returned the builder itself. You can also use other operations: update(), createOrUpdate(), or build(). The build() method will just create the RecordType object without modifying anything in the repository.

Add the field entries using a builder

Some more flexibility in adding field type entries is available through a sub-builder, as illustrated in the next example.

(1) RecordType recordType = typeManager
(2)        .recordTypeBuilder()
(3)        .defaultNamespace(NS)
(4)        .name("recordtype1")
(5)        .fieldEntry().name("field1").add()
(6)        .fieldEntry().name("field2").mandatory().add()
(7)        .create();

(5) and (6) The method fieldEntry() returns a different builder object, on which you set the properties for the field type entry. Calling add() on it will add the field type entry to the record type and return the record type builder.

The field entry builder lets you identify the field in different ways: by its name (either relying on the default namespace or supplying a QName), by its ID, or by supplying the field type object. The mandatory flag is set by calling a separate method, mandatory(); by default, mandatory is false.

Create the fields while creating the record

While previously we relied on the field types already being created, you can also create them inline, as shown in the following example.

( 1) RecordType recordType = typeManager
( 2)        .recordTypeBuilder()
( 3)        .defaultNamespace(NS)
( 4)        .name("recordtype1")
( 5)
( 6)        .fieldEntry()
( 7)            .defineField()
( 8)                .name("field1").type("STRING").scope(VERSIONED)
( 9)            .create()
(10)        .add()
(11)
(12)        .fieldEntry()
(13)            .defineField()
(14)                .name("field2").type("STRING").scope(VERSIONED)
(15)            .create()
(16)        .mandatory()
(17)        .add()
(18)
(19)        .create();

(7) Calling defineField() returns a different builder object that lets you create a new field type. On (8) we set the options for the field, and on (9) we call create(), which creates the field type in the repository and returns the field entry builder.

The second field is very similar, except that we also set the mandatory option (16).

This code seems longer than how we created field types before, but that is partly because we deliberately spread it over multiple lines. Since each option for the field type is set using a separate method, there is more freedom in how parameters are specified. Specifying the type and scope is optional: the default type is STRING, and the default scope is NON_VERSIONED, though that can be changed by calling defaultScope() on the record type builder.

Make the schema code re-executable through createOrUpdate()

If we ran any of the above examples twice against the same repository, the second run would fail because the types would already exist.

What you really want is to create the schema only if it does not exist yet, to update it if it is different, and to get an error if it is incompatible (cf. the immutable value type and scope properties of a field type). This behavior is obtained by calling createOrUpdate() instead of create(), both for the field types and for the record types, as in the example below.

RecordType recordType = typeManager
        .recordTypeBuilder()
        .defaultNamespace(NS)
        .name("recordtype1")

        .fieldEntry()
            .defineField()
                .name("field1").type("STRING").scope(VERSIONED)
            .createOrUpdate()
        .add()

        .fieldEntry()
            .defineField()
                .name("field2").type("STRING").scope(VERSIONED)
            .createOrUpdate()
        .mandatory()
        .add()

        .createOrUpdate();

Switching the default namespace

At any time, you can switch the default namespace.

( 1) RecordType recordType = typeManager
( 2)        .recordTypeBuilder()
( 3)        .defaultNamespace("namespace1")
( 4)        .name("recordtype1")
( 5)        .fieldEntry().name("field1").add()
( 6)        .fieldEntry().name("field2").add()
( 7)        .defaultNamespace("namespace2")
( 8)        .fieldEntry().name("field1").add()
( 9)        .fieldEntry().name("field2").add()
(10)        .createOrUpdate();

On line (3) we set the default namespace to 'namespace1'. This namespace will be used for the record type name and the first two fields that are added. Then on line (7) we change the default namespace to 'namespace2'. We then again add fields called field1 and field2, but now these will be in namespace2, so they are different fields from the ones added on lines (5) and (6).
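The way a default namespace applies only to the names that follow it can be sketched without Lily at all. The class below is a hypothetical stand-in, not Lily's record type builder: the class name DefaultNamespaceSketch and the "{namespace}name" notation are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: not Lily's builder, just the name-resolution idea.
final class DefaultNamespaceSketch {
    private String defaultNamespace = "";
    private final List<String> qualifiedNames = new ArrayList<String>();

    // Changes the namespace applied to names added from now on.
    DefaultNamespaceSketch defaultNamespace(String namespace) {
        this.defaultNamespace = namespace;
        return this;
    }

    // Qualifies the name with whatever default namespace is current at this point.
    DefaultNamespaceSketch field(String name) {
        qualifiedNames.add("{" + defaultNamespace + "}" + name);
        return this;
    }

    // Returns a copy of the qualified names collected so far.
    List<String> build() {
        return new ArrayList<String>(qualifiedNames);
    }

    public static void main(String[] args) {
        List<String> names = new DefaultNamespaceSketch()
                .defaultNamespace("namespace1").field("field1").field("field2")
                .defaultNamespace("namespace2").field("field1").field("field2")
                .build();
        // Four distinct qualified names: two per namespace.
        System.out.println(names);
    }
}
```

Names added before the switch keep the namespace that was current when they were added, which is why the same local name can appear twice without clashing.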

Field type builder

Besides the record type builder, there is also a field type builder. Since creating a field is already a one-liner with the classic API, its use is somewhat limited, though it allows you to work in the same style as for creating record types.

8.2.3 Creating Records

Classic API

Let's first look at how a record is created using the classic Lily API.

Record record = repository.newRecord();
record.setRecordType(new QName(NS, "recordtype1"));
record.setField(new QName(NS, "field1"), "value 1");
record.setField(new QName(NS, "field2"), "value 2");
record = repository.create(record);

Instead of instantiating all those QName objects, you can also set a default namespace, as shown in the following example. The default namespace is just an ephemeral attribute of Record: it is not stored in the repository, but is just an aid when setting fields or the record type.

Record record = repository.newRecord();
record.setDefaultNamespace(NS);
record.setRecordType("recordtype1");
record.setField("field1", "value 1");
record.setField("field2", "value 2");
record = repository.create(record);

Create a record using the builder API

Now let's look at how the same record is created using the builder API.

(1) Record record = repository
(2)        .recordBuilder()
(3)        .defaultNamespace(NS)
(4)        .recordType("recordtype1")
(5)        .field("field1", "value 1")
(6)        .field("field2", "value 2")
(7)        .create();

(2) we obtain the builder by calling repository.recordBuilder()

(3) we set the default namespace. This is optional; you can also use QName objects.

(4-6) the record type and fields are set

(7) we call create(). This creates the record in the repository. This method returns a Record object, while all the previous methods returned the builder itself. Other operations are also available: update(), createOrUpdate() and build(). Calling build() will just instantiate the record object without modifying anything in the repository.

Switching the default namespace

If you need to create fields in several namespaces, then it is useful to know you can switch the default namespace at any time, as illustrated in the following example.

( 1) Record record = repository
( 2)        .recordBuilder()
( 3)        .defaultNamespace("namespace1")
( 4)        .recordType("recordtype1")
( 5)        .field("field1", "value 1")
( 6)        .field("field2", "value 2")
( 7)        .defaultNamespace("namespace2")
( 8)        .field("field1", "value 1")
( 9)        .field("field2", "value 2")
(10)        .create();

In this example we set fields named field1 and field2 twice, but because they are in different namespaces, they are different fields.

Using createOrUpdate

Lily offers a "create-or-update" operation, which is useful if you don't care whether the record already exists. More importantly, because the operation is idempotent, it can be retried automatically in case of IO exceptions. For this to work, the record ID must be assigned by the client.

(1) Record record = repository
(2)        .recordBuilder()
(3)        .assignNewUuid()
(4)        .defaultNamespace(NS)
(5)        .recordType("record_type")
(6)        .field("field1", "value 1")
(7)        .field("field2", "value 2")
(8)        .createOrUpdate();

On line (3) we assign the ID. You can also use a user-defined id using the method id(String).

On line (8) we call createOrUpdate().
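Why a client-assigned ID makes retrying safe can be sketched with an in-memory map. This is a minimal illustration under stated assumptions, not Lily's implementation: the class CreateOrUpdateRetrySketch, its methods, and the simulated "lost response" failure are all made up for this example.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustration only: an in-memory map stands in for the repository.
final class CreateOrUpdateRetrySketch {
    private final Map<String, String> records = new HashMap<String, String>();
    private boolean failNextCall = true;

    // Stores the value under the client-supplied ID. The first call simulates
    // a response that gets lost after the write has already been applied.
    void createOrUpdate(String id, String value) throws IOException {
        records.put(id, value);
        if (failNextCall) {
            failNextCall = false;
            throw new IOException("simulated: response lost after the write was applied");
        }
    }

    // Retries on IOException. Repeating the call is harmless because the same
    // ID and value always lead to the same stored state.
    void createOrUpdateWithRetry(String id, String value, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                createOrUpdate(id, value);
                return;
            } catch (IOException e) {
                last = e;
            }
        }
        throw new IllegalStateException("giving up after " + maxAttempts + " attempts", last);
    }

    String get(String id) {
        return records.get(id);
    }

    int recordCount() {
        return records.size();
    }

    public static void main(String[] args) {
        CreateOrUpdateRetrySketch repository = new CreateOrUpdateRetrySketch();
        String id = UUID.randomUUID().toString(); // ID chosen by the client
        repository.createOrUpdateWithRetry(id, "value 1", 3);
        System.out.println(repository.recordCount());
    }
}
```

If the ID were assigned by the server on each call instead, the retried call would produce a second record, because the client could not tell the server that both calls refer to the same record.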

Creating a nested record

In Lily, the type of a field can be RECORD, which means that within the field value of a record, you can store another record. To create such a nested record, you could again use repository.recordBuilder() to construct it, but for this specific case there is a shortcut.

( 1) Record record = repository
( 2)        .recordBuilder()
( 3)        .defaultNamespace(NS)
( 4)        .recordType("some_record_type")
( 5)        .field("field1", "value 1")
( 6)        .recordField("record_field")
( 7)            .recordType("embbed_record_type")
( 8)            .field("field_r", "value r")
( 9)            .field("field_s", "value s")
(10)            .set()
(11)        .create();

On line (5) we set a normal field as usual.

On line (6) we use the method recordField() to create a nested record. As argument we give the name of the field. The important difference is that this method does not return the same record builder, but a new one intended to create the nested record. The nested record builder is initialized with the default namespace from the current builder, so you do not need to repeat that.

When you are done creating the nested record, you call set(), see line (10). Calling set() returns the original record builder.

Creating a LIST<RECORD> field

Similar to the previous case, there is also a convenient way for filling up LIST<RECORD> fields. The following example illustrates this.

( 1) Record record = repository
( 2)        .recordBuilder()
( 3)        .defaultNamespace(NS)
( 4)        .recordType("some_record_type")
( 5)        .field("field1", "value 1")
( 6)        .recordListField("list_of_records")
( 7)            .recordType("embbed_record_type")
( 8)            .field("field_r", "value r1")
( 9)            .field("field_s", "value s1")
(10)            .add()
(11)            .field("field_r", "value r2")
(12)            .field("field_s", "value s2")
(13)            .add()
(14)            .field("field_r", "value r3")
(15)            .field("field_s", "value s3")
(16)            .endList()
(17)        .create();

You start by calling recordListField(), see line (6).

Then, each time you have finished a record, you call add(), see lines (10) and (13). After each add() call, a new record builder is returned to create the next record. However, this builder is already initialized with the default namespace and the record type of the previous one, so you do not need to repeat that.

After the last item, you call endList() instead of add(), see line (16).

Creating a record with link fields

There is no special support yet for creating linked records, as there is for nested records, so you need to call repository.recordBuilder() again to create the other record, as shown in the following example.

Record record = repository
        .recordBuilder()
        .defaultNamespace(NS)
        .recordType("some_record_type")
        .field("link_field", new Link(repository
                .recordBuilder()
                .defaultNamespace(NS)
                .recordType("some_other_record_type")
                .field("field1", "value1")
                .create().getId()))
        .create();

Creating records with common fields -- reusing the builder

There is nothing that prohibits you from reusing the same builder to create several records. This could be useful if you want to create several records that share some common fields.

In the following example, we create 5 records which each have the same value for field1 and field2, but a different value for field3.

RecordBuilder builder = repository
        .recordBuilder()
        .defaultNamespace("namespace")
        .recordType("record_type")
        .field("field1", "value1")
        .field("field2", "value2");

for (int i = 0; i < 5; i++) {
    // Long.valueOf avoids allocating a new Long object for each value
    builder.field("field3", Long.valueOf(i)).create();
}

The RecordBuilder also has a reset() method to clear its state, which is equivalent to creating a new RecordBuilder.
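The idea of a builder that keeps its state between create() calls can be sketched as follows. This is a hypothetical stand-in, not Lily's RecordBuilder: the class ReusableBuilderSketch is made up for this illustration, and it assumes that create() hands out a snapshot while the builder keeps its fields until reset() is called.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a builder whose state survives each create() call.
final class ReusableBuilderSketch {
    private final Map<String, Object> fields = new LinkedHashMap<String, Object>();

    // Sets or overwrites a field and returns the builder for chaining.
    ReusableBuilderSketch field(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    // Returns a snapshot of the current fields; the builder state is kept.
    Map<String, Object> create() {
        return new LinkedHashMap<String, Object>(fields);
    }

    // Clears all state, like starting with a fresh builder.
    ReusableBuilderSketch reset() {
        fields.clear();
        return this;
    }

    public static void main(String[] args) {
        ReusableBuilderSketch builder = new ReusableBuilderSketch()
                .field("field1", "value1")
                .field("field2", "value2");

        List<Map<String, Object>> created = new ArrayList<Map<String, Object>>();
        for (int i = 0; i < 5; i++) {
            // Only field3 changes; field1 and field2 are carried over.
            created.add(builder.field("field3", Long.valueOf(i)).create());
        }
        System.out.println(created);
    }
}
```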

8.3 Scanning Records And Record Locality

8.3.1 Records are stored in order of record ID

In Lily (as in HBase), you can influence which records are stored next to each other. This is possible because records are stored sorted by their record ID. A typical use is to group records by giving their record IDs a common prefix.

For example, consider the following record IDs:

USER.bar-1
USER.bar-2
USER.foo-1
USER.foo-2

Since the records are stored in order of their record IDs, it is not possible for, e.g., one of the 'bar' records to be stored in between the 'foo' records.
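The effect of ordered storage can be sketched with a plain sorted set. This is only an illustration: the TreeSet stands in for the repository's sorted storage, plain string comparison stands in for the actual binary ordering, and the class name RecordIdOrderSketch is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet plays the role of storage sorted by record ID.
final class RecordIdOrderSketch {

    // Returns the IDs in the order a sorted store would keep them.
    static List<String> storedOrder(List<String> idsInInsertionOrder) {
        return new ArrayList<String>(new TreeSet<String>(idsInInsertionOrder));
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("USER.foo-2", "USER.bar-1", "USER.foo-1", "USER.bar-2");
        // Whatever the insertion order, the 'bar' records end up adjacent,
        // followed by the 'foo' records.
        System.out.println(storedOrder(ids));
    }
}
```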

When you are using the default UUID record IDs, the order will be random. Still, in that case variant record IDs let you influence record locality. Variant record IDs (or simply variants) are record IDs that share a common master record ID and are extended with a set of free key-value pairs.

For example, the following 3 records share the same master record ID and are extended with an 'item' property:

UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=1
UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=2
UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=3

The properties are separated from the master record ID by a dot. If there are multiple key-value pairs, they are separated by a comma and always sorted by their key (this is the string syntax; on the storage level a binary encoding is used).
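The string syntax just described can be sketched in a few lines. This is a hypothetical helper for illustration, not part of the Lily API: the class VariantIdSketch is made up, it does no escaping of special characters, and real code would construct record IDs through Lily's own ID generator instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: formats the string form of a variant record ID.
final class VariantIdSketch {

    // Appends the properties to the master ID: a dot first, commas between
    // pairs, and pairs sorted by key (the TreeMap takes care of the sorting).
    static String variantId(String masterId, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder(masterId);
        char separator = '.';
        for (Map.Entry<String, String> e : new TreeMap<String, String>(properties).entrySet()) {
            sb.append(separator).append(e.getKey()).append('=').append(e.getValue());
            separator = ',';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<String, String>();
        properties.put("lang", "en");
        properties.put("branch", "dev");
        // The keys come out sorted regardless of the order they were put in.
        System.out.println(variantId("USER.doc1", properties));
    }
}
```

Because all variants start with the same master record ID, they sort next to each other in storage.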

Variants can be used for both USER and UUID record IDs. As we have seen, when using USER record IDs, you can also bring structure and grouping into record IDs yourself. Using variants has the advantage that Lily has knowledge of this internal record ID structure, which it can exploit in the indexer for dereferencing between variants.

8.3.2 Scanning over records

Scanning is sequentially running over the records stored in the repository. Since records are stored ordered by record ID, this means a scan will run over the records in order of their record ID.

The alternative to scanning would be an ordinary (multi-)read operation. However, a scan is more efficient for retrieving multiple records than read or multi-read operations. Also, with a scan you don't have to know the record IDs up front.

Although scans run sequentially over records, they are not only useful in batch scenarios: the record ID can be exploited as a primary index to jump straight to the relevant subset of records.

In the sections below we will give Java code examples. The REST interface supports scanners as well, see its tutorial and reference documentation for more details.

8.3.2.1 Full table scan

The following example shows how to scan over all records in the repository. This is something you will only do in batch settings, often through MapReduce, since your repository can contain a massive amount of records.

RecordScan scan = new RecordScan();
RecordScanner scanner = repository.getScanner(scan);
try {
    for (Record record : scanner) {
        PrintUtil.print(record, repository);
    }
} finally {
    // Always release the scanner's resources, even if printing fails
    scanner.close();
}

Note that scanners should be closed when you're done with them in order to release resources.

8.3.2.2 Start and stop record ID

A scan can run over all the records in the repository (a “full table scan”) or a subset. To run over a subset, you can specify a start and stop record ID. Lily (relying on HBase) is able to efficiently jump to the record specified by the start record ID. The start record ID does not have to really exist: if it doesn't exist, the scan will position itself at the first record with a larger record ID. The scan then runs sequentially over the records, until it reaches the stop record ID (exclusive) or until the very last record, whichever condition is reached first. Both start and stop record ID are optional.

// Scan from the records whose ID starts with K up to, but not including,
// the records whose ID starts with M
RecordScan scan = new RecordScan();
scan.setStartRecordId(idGen.newRecordId("USER.K"));
scan.setStopRecordId(idGen.newRecordId("USER.M"));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}
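The start/stop semantics described above (inclusive start, exclusive stop, a start ID that need not exist, both optional) can be sketched with a sorted set. This is an illustration only, with no Lily or HBase API involved; the class StartStopScanSketch and the sample IDs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustration only: a TreeSet stands in for records stored in ID order.
final class StartStopScanSketch {

    // Scans from start (inclusive) to stop (exclusive); null means "not set".
    static List<String> scan(TreeSet<String> store, String start, String stop) {
        // tailSet jumps to the first ID >= start, even if start itself is absent.
        NavigableSet<String> candidates = (start == null) ? store : store.tailSet(start, true);
        List<String> result = new ArrayList<String>();
        for (String id : candidates) {
            if (stop != null && id.compareTo(stop) >= 0) {
                break; // reached the stop ID: end the scan
            }
            result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> store = new TreeSet<String>(Arrays.asList(
                "USER.Jane", "USER.Kate", "USER.Lee", "USER.Mike", "USER.Nora"));
        // "USER.K" and "USER.M" do not exist as records, yet they bound the scan.
        System.out.println(scan(store, "USER.K", "USER.M"));
    }
}
```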

8.3.2.3 Filters

A scan can skip certain records server-side, never returning them to the client, based on some conditions. This is called filtering. For example, you can filter records based on record type or field value. A filter is not a search however: the repository will still run over each record and evaluate the filter for each record. Additionally, a filter is able to direct the repository to stop the scan.

8.3.2.3.1 Example: record type filter

The following example will only return records of type Book.

RecordScan scan = new RecordScan();
scan.setRecordFilter(new RecordTypeFilter(new QName(NS, "Book")));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}

8.3.2.3.2 Example: record ID prefix filter

The RecordIdPrefixFilter passes through all records whose record ID has a given prefix. Recall the earlier example in the section about record locality, with records whose IDs start with a prefix such as "foo-" or "bar-". If you would like all records starting with "foo-", you would do it like this:

RecordScan scan = new RecordScan();
scan.setStartRecordId(idGenerator.newRecordId("foo-"));
scan.setRecordFilter(new RecordIdPrefixFilter(idGenerator.newRecordId("foo-")));
RecordScanner scanner = repository.getScanner(scan);
...

We set the start record ID to jump efficiently to the first relevant record. We don't have a stop record ID (what would we set it to?), but rather the RecordIdPrefixFilter will abort the scan once it encounters a record ID with a larger prefix.
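The combination of jumping to the start ID and aborting at the first non-matching ID can be sketched on a sorted set. This illustrates the idea only and is not how RecordIdPrefixFilter is implemented; the class PrefixScanSketch and its examined counter are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: prefix scan over IDs kept in sorted order.
final class PrefixScanSketch {
    // Number of IDs the last scan looked at, kept only to show the early stop.
    static int examined;

    static List<String> scanPrefix(TreeSet<String> store, String prefix) {
        examined = 0;
        List<String> result = new ArrayList<String>();
        // Jump straight to the first ID >= prefix instead of starting at the beginning.
        for (String id : store.tailSet(prefix, true)) {
            examined++;
            if (!id.startsWith(prefix)) {
                break; // sorted order guarantees no later ID can match
            }
            result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> store = new TreeSet<String>(Arrays.asList(
                "alpha-1", "bar-1", "bar-2", "foo-1", "foo-2", "zed-1", "zed-2"));
        System.out.println(scanPrefix(store, "foo-"));
        System.out.println("IDs examined: " + examined);
    }
}
```

In this sketch the scan never touches the IDs before "foo-" and looks at only one ID past the last match, which is the behavior the start record ID and the filter's abort achieve together.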

8.3.2.4 Returning a subset of fields

By default a scan will return full record objects, that is, records with all fields loaded. If you don't need all fields, you can gain performance by specifying the fields you are interested in. This is done via setReturnFields. It is possible to read no fields at all using ReturnFields.NONE, in which case only the record ID and record type will be loaded.

RecordScan scan = new RecordScan();
scan.setReturnFields(new ReturnFields(qname1, qname2));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}

8.3.2.5 Scanner Caching

By default, each time the next record is requested from a scanner, a call to the server is made. It is more efficient to request a batch of records from the server at once. This can be done using the caching setting. The following example instructs the scanner to retrieve up to 100 records at once:

scan.setCaching(100);
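To get a feeling for what the caching setting saves, the following sketch counts simulated server calls for different batch sizes. It is an illustration under simplifying assumptions, not Lily or HBase code: the class ScannerCachingSketch is made up, and it ignores any extra call a real scanner may need to detect the end of the data.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: counts how many "server calls" a full read needs.
final class ScannerCachingSketch {
    private final List<String> serverRecords;
    private final int caching;
    private int serverCalls;

    ScannerCachingSketch(List<String> serverRecords, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be at least 1");
        }
        this.serverRecords = serverRecords;
        this.caching = caching;
    }

    // Simulates one call to the server: up to 'caching' records from 'offset'.
    private List<String> fetch(int offset) {
        serverCalls++;
        return serverRecords.subList(offset, Math.min(offset + caching, serverRecords.size()));
    }

    // Reads all records and returns the number of server calls that took.
    int readAll() {
        serverCalls = 0;
        int offset = 0;
        while (offset < serverRecords.size()) {
            offset += fetch(offset).size();
        }
        return serverCalls;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<String>();
        for (int i = 0; i < 250; i++) {
            records.add("record-" + i);
        }
        System.out.println("caching=1:   " + new ScannerCachingSketch(records, 1).readAll() + " calls");
        System.out.println("caching=100: " + new ScannerCachingSketch(records, 100).readAll() + " calls");
    }
}
```

The trade-off is memory: a larger caching value means more records are held on the client at once.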

8.3.2.6 Scanners directly read from HBase region servers

This is an implementation detail, but interesting nonetheless. When using the Lily Java API, the LilyClient executes scans directly on the HBase region server, without going through a Lily server node. This avoids an extra hop and avoids pulling all data through one Lily node.

8.3.2.7 Scanners: summary

Here are some important things to remember about scanners:

8.3.2.8 Using the CLI tool lily-scan-records

You can execute a scan without any programming using the lily-scan-records tool. This tool can work in two modes: count or print. In count mode it only counts the records; in print mode it dumps them to standard out. The lily-scan-records tool also lets you configure all options, such as start record ID and filters. Run 'lily-scan-records -h' for more information.

8.3.2.9 Variants and scanners

The Repository method getVariants retrieves all the variants of a master record ID. Internally, this is based on scanners, using a technique similar to the record ID prefix filter. You could use a custom scan operation as well, which offers more flexibility.

8.3.3 Record ID as your primary index

As you have learned from all the above, Lily (by means of HBase) offers much more than a "distributed hash map" kind of storage: by storing the records in record ID order and offering a scan operation, the record ID can be exploited as a powerful index for accessing your data.

8.3.4 Scanners And MapReduce

Scanners can be used as input for MapReduce jobs. See MapReduce Integration.

8.4 Setup New Maven Project From Archetype

You can quickly set up the structure for a new Lily-based project by executing the following archetype. This command lets you change some settings, such as the artifactId. It will then create a subdirectory named after this artifactId and put all files below that.

mvn archetype:generate \
  -DarchetypeGroupId=org.lilyproject \
  -DarchetypeArtifactId=lily-archetype-basic \
  -DarchetypeVersion=[unresolved variable: artifactVersion] \
  -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

If you are using a Lily version whose artifacts are not deployed in Lily's Maven repository, you can change the pointer to the repository as follows: -DarchetypeRepository=file:///path_to_lily_home/lib

Before making any changes to it, you might want to verify that it compiles by changing to the created directory and executing:

mvn install

8.5 Importing A Schema From JSON Programmatically

If you want to set up a schema for your application, a good approach is to use the import tool's JSON format to describe the schema.

Besides running the import tool manually, you might want to create the schema from your application or testcase.

Here are the steps to do that.

Add a dependency on lily-import to the pom.xml:

<dependency>
  <groupId>org.lilyproject</groupId>
  <artifactId>lily-import</artifactId>
  <version>${version.lily}</version>
</dependency>

Add the schema.json file itself to the resources of your application, thus in src/main/resources/{package-name}

Then use the following code to import the schema:

import org.lilyproject.tools.import_.cli.JsonImport;
...
System.out.println("Importing schema");
InputStream is = YourClass.class.getResourceAsStream("schema.json");
JsonImport.load(repository, is, false);
is.close();
System.out.println("Schema successfully imported");

8.6 Writing Test Cases Against Lily

When developing a project on top of Lily, you will want to write tests that exercise Lily. For this purpose, Lily offers the ability to launch an embedded Lily stack within your test case, to make it independent of any external setup. The data of this embedded Lily stack is stored in temporary directories.

Launching Lily embedded is rather slow, however, so we also offer an alternative: an easy way to launch a standalone Lily stack and let the test cases talk to that. At the start of each test case, the state of that Lily will be cleared, which can take a few seconds, but this should be much faster than launching everything embedded. Your test cases are written in a way that is agnostic of which Lily they talk to.

We will refer to these two cases as embed mode and connect mode.

We don't support running the test cases against an arbitrary, custom cluster setup. This is because the lily-test-launcher offers some specific features such as the ability to reset the state and the ability to change the Solr schema.

8.6.1 First Steps

In these instructions, we will assume Maven is used. There is nothing Maven-specific to the approach, so you should be able to translate it to other environments as well.

8.6.1.1 Maven Settings

We will start by adding some stuff to the pom of your project.

When you generate a project from the project archetype, these settings will already be set up for you.

8.6.1.1.1 Add Dependencies

Make sure you have these dependencies:

<project>
  <dependencies>

    <dependency>
      <groupId>org.lilyproject</groupId>
      <artifactId>lily-server-test-fw</artifactId>
      <version>[unresolved variable: artifactVersion]</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
      <scope>test</scope>
    </dependency>

  </dependencies>
</project>

8.6.1.1.2 Configure Surefire Plugin

For reasons that will be explained later, we need to enable forkMode=always for the surefire plugin. Also, we pass on a system property "lily.lilyproxy.mode", that is set via the connect profile defined in the next section and the properties "lily.config.customdir" and "lily.plugin.dir" which are discussed further on.

<project>
  <build>
    <plugins>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <forkMode>always</forkMode>
          <systemPropertyVariables>
            <lily.lilyproxy.mode>${lily.lilyproxy.mode}</lily.lilyproxy.mode>
            <lily.config.customdir>${lily.config.customdir}</lily.config.customdir>
            <lily.plugin.dir>${lily.plugin.dir}</lily.plugin.dir>
          </systemPropertyVariables>
        </configuration>
      </plugin>

    </plugins>
  </build>
</project>

8.6.1.1.3 Configure "connect" Profile

The connect profile will be used to easily switch between embedded launching of Lily, or connecting to an external Lily.

<project>
  <profiles>

    <profile>
      <id>connect</id>
      <properties>
        <lily.lilyproxy.mode>connect</lily.lilyproxy.mode>
      </properties>
    </profile>

  </profiles>
</project>

8.6.1.1.4 Add lily-kauri-plugin, resolve-project-dependencies goal

This plugin will make sure that all the dependencies required to launch Lily are available in your local Maven repository.

The reason why this is needed is as follows. Lily runs on top of the Kauri runtime. Kauri launches all modules listed in the conf/kauri/wiring.xml. It finds these modules from a Maven-style repository. When launched in test cases, this is your local Maven repository. Kauri is not able to download dependencies itself; it assumes access to a file-system based repository that contains them. Not all modules listed in the wiring.xml are Maven dependencies, so Maven will not download them. In addition, Kauri module jars contain a classpath definition in the file KAURI-INF/classloader.xml, listing the classpath needs of that module, again searched for in Maven-style repositories. Even if we listed all modules in the pom too, the version resolution of the jars might differ, causing jars not to be found. The following plugin will make sure that all listed dependencies are downloaded and available in your local Maven repository.

<project>
  <build>
    <plugins>

      <!-- This plugin makes sure that all Lily/Kauri runtime dependencies
          are available in the local repository (required for lily-server-test-fw) -->
      <plugin>
        <groupId>org.lilyproject</groupId>
        <artifactId>lily-kauri-plugin</artifactId>
        <version>[unresolved variable: artifactVersion]</version>
        <configuration>
          <wiringXmlResource>org/lilyproject/lilyservertestfw/conf/kauri/wiring.xml</wiringXmlResource>
        </configuration>
        <executions>
          <execution>
            <phase>compile</phase>
            <goals>
              <goal>resolve-project-dependencies</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>
</project>

We need to tell Maven where it can download this plugin using:

<project>

  <pluginRepositories>
    <pluginRepository>
      <id>lilyproject-plugins</id>
      <name>Lily Maven repository</name>
      <url>http://lilyproject.org/maven/maven2/deploy/</url>
    </pluginRepository>
  </pluginRepositories>

</project>

8.6.1.1.5 Configuring Repositories

If you have an existing project, you will usually already have this. If not, make sure to add the Lily repository:

<project>
  <repositories>

    <repository>
      <id>lilyproject</id>
      <name>Lily Maven repository</name>
      <url>http://lilyproject.org/maven/maven2/deploy/</url>
    </repository>

  </repositories>
</project>

8.6.1.2 Write A Test Class

As usual in Maven, put your test classes below src/test/java

Example:

import org.junit.Test;
import org.lilyproject.lilyservertestfw.LilyProxy;
import org.lilyproject.repository.api.*;

public class MyTest {
    @Test
    public void testMe() throws Exception {
        LilyProxy proxy = new LilyProxy();

        // Depending on mode, this will:
        //  - start Hadoop, ZooKeeper, HBase, Solr, Lily embedded
        //  - connect to locally running instance launched by launch-test-lily and clear its state
        proxy.start(); 

        // getClient() returns a LilyClient; the Repository is obtained from it
        Repository repository = proxy.getLilyServerProxy().getClient().getRepository();

        // Do stuff with the repository
        // ...
        
        proxy.stop();
    }
}

8.6.1.3 Run The Test With Lily Stack Embedded

Execute

mvn install

Because of all the services which are launched on the fly, this will take some time, like half a minute.

If all is well, it will end with a BUILD SUCCESSFUL message.

8.6.1.4 Create LilyProxy On The Class Level

In the previous test class example, we created the LilyProxy within a test method. This approach is not recommended for two reasons:

Therefore, the usual approach is to create LilyProxy on the test class level, and to set the forkMode of the Maven surefire plugin (the plugin which executes test cases) to 'always', causing a new JVM to be launched per test class.

Of course, this approach requires the tests to be written in such a way that they don't conflict with each other: each test method in a test class should be able to run independently of the others, without assumptions about state.

Here is an example of how to start LilyProxy at the class level:

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.lilyproject.client.LilyClient;
import org.lilyproject.lilyservertestfw.LilyProxy;

public class MyTest {
    private static LilyProxy LILY_PROXY;

    @BeforeClass
    public static void setUpBeforeClass() throws Exception {
        LILY_PROXY = new LilyProxy();
        LILY_PROXY.start();
    }

    @AfterClass
    public static void tearDownAfterClass() throws Exception {
        LILY_PROXY.stop();
    }

    @Test
    public void testOne() throws Exception {
        LilyClient lilyClient = LILY_PROXY.getLilyServerProxy().getClient();
        // Do stuff
    }

    @Test
    public void testTwo() throws Exception {
        LilyClient lilyClient = LILY_PROXY.getLilyServerProxy().getClient();
        // Do stuff
    }
}

You can again check with "mvn install" that this runs.

8.6.1.5 Connect To Independently Launched Lily

Up to now when launching tests, the Lily stack was launched within the test JVM, which is rather slow.

We can speed this up by launching an independent Lily instance, and letting the testcases connect to that one. When LilyProxy.start() is called, a reset trigger will be sent to this Lily instance, causing it to clear all Lily state in HBase, HDFS and ZooKeeper, to delete all documents in Solr, and to restart (within the JVM) the Lily server.

To use this, we first need to launch the standalone Lily stack using this command:

./bin/launch-test-lily

Wait until it is completely started. This will be clearly visible from a series of informational (non-log) messages printed to standard out.

Then start the build using:

mvn -Pconnect install

The -Pconnect flag activates the connect profile we added to the pom.xml earlier, which then causes the system property lily.lilyproxy.mode=connect to be passed to the JVMs executing the tests, putting LilyProxy in connect mode.

You cannot run tests without -Pconnect while launch-test-lily is running, as these start the same services listening on the same port numbers.

When you stop launch-test-lily (use Ctrl+C), the temporary directory in which the data was stored will be deleted. If you want to retain the data for a future run, specify a custom storage directory using the -d option.

8.6.2 Service Configuration

8.6.2.1 General remarks

We provide limited control over the configuration of the various services.

All services use the default TCP port numbers and this cannot be changed. This means that if you are running any of these services locally, the embedded variants will clash with them.

8.6.2.2 Solr Schema

By default, Solr will be launched with the example schema that ships with Solr. This is not very useful, so you will typically want to specify your own schema.

You can specify your own schema by passing an argument to LilyProxy.start(). For example, if the schema is available as a resource in the same package as your test class, you can do it like this:

import org.apache.commons.io.IOUtils;

...

byte[] solrSchemaData = IOUtils.toByteArray(MyTest.class.getResourceAsStream("solr_schema.xml"));
LilyProxy.start(solrSchemaData);

If LilyProxy is in embedded mode, this schema will be used directly when Solr is started. When LilyProxy is in connect mode, and the current schema is different from the one supplied, the current schema will be overwritten and Solr will be reloaded.

You can also change the schema after LilyProxy has been started, using this method:

LILY_PROXY.getSolrProxy().changeSolrSchema(solrSchemaData);

8.6.2.3 Lily Conf & Plugins

Because LilyProxy supports switching between an embedded or externally launched Lily stack, we do not support dynamically defining the configuration of Lily in the testcase. Thus, typically every launched Lily instance in a testcase in your project should use the same configuration and plugins. This is because when using connect mode, we can't change the configuration of the externally launched Lily.

8.6.2.3.1 Connect

When running in connect mode, Lily should be launched using the ./bin/launch-test-lily command as described before. Any changes to configuration files or adding additional configuration files should be done in the $LILY_HOME/conf folder before starting launch-test-lily.

Similarly, plugins should be put in the $LILY_HOME/plugins folder.

8.6.2.3.2 Embedded

In the embedded mode, by default the standard Lily configuration is used. This configuration is included within the test framework jar and extracted to a temporary folder so that Kauri/Lily can read it.

An additional configuration folder (which will take precedence over the default configuration) can be given by setting its path in the system property lily.conf.customdir.

Similarly for the plugins, the plugin folder can be set with the system property lily.plugin.dir.

Example:

mvn -Dlily.conf.customdir=/home/user/custom/conf \
    -Dlily.plugin.dir=/home/user/custom/plugins install

Since these are properties you will typically not modify on a run-to-run basis, you can configure them directly in the pom.xml (see the surefire plugin).

8.6.3 Utilities

8.6.3.1 Index Schema

For data to be indexed into your Solr index, an index should be defined first. To be able to do this from your tests, some utility methods are provided on the LilyServerProxy.

The methods addIndexFromFile(String indexName, String indexerConf, long timeout) and addIndexFromResource(String indexName, String indexerConf, long timeout) add an index, defined in a file or a resource respectively, to the indexer under the given indexName. These methods wait until the information about the new index has propagated to both the indexer and the rowlog. Only then can one be sure that events about record creates or updates will be picked up by the MQ rowlog and passed to the Indexer, which puts the necessary data in the Solr index. The timeout is the maximum amount of time the methods will wait. If the timeout is exceeded, they return false.

Example:

Assert.assertTrue("Adding index took too long",
    LILY_PROXY.getLilyServerProxy().addIndexFromResource("testIndex",
        "org/lilyproject/mylilyproject/my_indexerconf.xml", 60000L));

Variants of these methods are also available with booleans that indicate whether the call should wait for the information about the new index to propagate to the indexer and the rowlog: addIndexFromFile(String indexName, String indexerConf, long timeout, boolean waitForIndexerModel, boolean waitForMQRowlog). If these are set to false, it is possible that a record create (for example) does not result in an update of the Solr index, since the indexer and/or the rowlog were not yet aware of the newly defined index.

8.6.3.2 WAL and MQ processed and Solr Index committed

With the above calls to add an index and wait for it to be fully operational, one can be sure that any changes to records will eventually be reflected in the Solr index. This does not mean that these changes will be visible immediately. First, messages need to be processed by the WAL and MQ before an update is performed on the Solr index, and then the Solr index needs to be committed for its changes to become visible. When writing a test, it is useful to know whether all record updates are reflected in the Solr index. We've provided some utility methods to help with this.

On LilyProxy, the waitWalAndMQMessagesProcessed(long timeout) method waits for all messages of the WAL and MQ to be processed and then commits the Solr index. When the given timeout expires before all messages have been processed, the call returns false. A variant of this method with a boolean to indicate whether the Solr index should be committed is also available.

Example:

Assert.assertTrue("Processing messages took too long",
    LILY_PROXY.waitWalAndMQMessagesProcessed(60000L));
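The boolean-returning wait methods described in this section follow a common "poll until a deadline" contract. The following is only a minimal sketch of that contract in plain Java, not the framework's implementation; the class WaitSketch, the method waitUntil and the polling interval parameter are hypothetical names introduced for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating the "wait up to a timeout, return false
// when it expires" contract. Not part of the Lily test framework.
final class WaitSketch {

    // Polls the condition until it holds (returns true) or the timeout
    // expires (returns false). Also returns false if the thread is interrupted.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition held
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

Wrapping such a call in Assert.assertTrue, as in the examples above, turns an expired timeout into a test failure with a readable message instead of a test that hangs.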

It is also possible to explicitly commit the Solr index by calling commit() on the SolrProxy.

8.6.3.3 Launching A Batch Index Build

A convenience method is available to perform a batch index build. This method launches the build and blocks until it is finished. If the build does not finish successfully, an exception is thrown. If it does not finish within the given timeout, the method returns false.

Example:

Assert.assertTrue("Batch index build took too long",
    LILY_PROXY.getLilyServerProxy().batchBuildIndex("testIndex", 60000L * 10));

8.6.4 Advanced

8.6.4.1 User defined storage directory

By default the embedded mode will create a temporary directory in which to store the data and log files. This directory is cleared at shutdown. The parent directory in which the temporary directory is created is defined by the system property java.io.tmpdir.

Instead of creating a temporary directory, it is possible to use a fixed directory location. This directory can be set by using the system property lily.lilyproxy.dir. By default this directory is still cleared at shutdown. If the data stored in this directory should be kept in order to use it at a next run the system property lily.lilyproxy.clear should be set to false.

Example:

mvn -DargLine="-Dlily.lilyproxy.dir=/home/user/mydir -Dlily.lilyproxy.clear=false" install 
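The defaults described above can be summarized in a few lines of Java. This is a hedged sketch of the documented behavior only, not the framework's code: the class StorageDirSketch and the "lily-test-" directory prefix are hypothetical, and it assumes that lily.lilyproxy.clear only has an effect when a fixed directory is configured.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical illustration of the documented storage directory defaults.
final class StorageDirSketch {
    final File dir;                // where data and log files would be stored
    final boolean fixedDir;        // true when lily.lilyproxy.dir was supplied
    final boolean clearOnShutdown; // whether the directory is cleared at shutdown

    StorageDirSketch(Properties props) {
        String custom = props.getProperty("lily.lilyproxy.dir");
        this.fixedDir = custom != null;
        // Without a fixed directory, a temporary one is created below java.io.tmpdir.
        // The "lily-test-" prefix is an invented name for this sketch.
        this.dir = fixedDir
                ? new File(custom)
                : new File(props.getProperty("java.io.tmpdir"), "lily-test-" + System.nanoTime());
        // Assumption: only a fixed directory can be kept, and only when
        // lily.lilyproxy.clear is explicitly set to false.
        this.clearOnShutdown = !(fixedDir && "false".equals(props.getProperty("lily.lilyproxy.clear")));
    }
}
```

Passing System.getProperties() to the constructor would reflect the -D options given on the command line.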

8.6.5 More On The Lily Test Framework

Lily's test framework consists of three separate projects.

Maven Project Name | Services | Class For Embedded Launching | Abstraction between embedded/connect mode | System property to set connect mode
hadoop-test-fw | HDFS, MapReduce, ZooKeeper, HBase | HBaseTestingUtility (this is part of HBase) | HBaseProxy | lily.hbaseproxy.mode
solr-test-fw | Solr inside Jetty | SolrTestingUtility | SolrProxy | lily.solrproxy.mode
lily-server-test-fw | Lily Server Node | LilyServerTestingUtility | LilyServerProxy | lily.lilyserverproxy.mode
 | | | LilyProxy | lily.lilyproxy.mode

If you write a project that only needs HBase and/or Solr, you can immediately use the corresponding projects, without having to launch Lily as well.

Switching between connect and embed mode

The system properties in the last column can be set to the value 'connect' or 'embed'. If not specified, embed is the default.
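As a rough sketch of this defaulting rule (not the framework's actual code), the mode lookup could look as follows in Java. The class ProxyModeSketch and the method resolveMode are hypothetical; only the property names and the values 'connect' and 'embed' come from the text above.

```java
import java.util.Properties;

// Hypothetical helper, not part of the Lily test framework.
final class ProxyModeSketch {
    enum Mode { EMBED, CONNECT }

    // Returns the mode configured under the given property name,
    // falling back to embed when the property is not specified.
    static Mode resolveMode(Properties props, String propertyName) {
        String value = props.getProperty(propertyName);
        if (value == null || value.trim().isEmpty()) {
            return Mode.EMBED; // documented default
        }
        switch (value.trim()) {
            case "embed":
                return Mode.EMBED;
            case "connect":
                return Mode.CONNECT;
            default:
                throw new IllegalArgumentException(
                        "Unsupported value for " + propertyName + ": " + value);
        }
    }

    public static void main(String[] args) {
        // Try: java -Dlily.lilyproxy.mode=connect ProxyModeSketch
        System.out.println(resolveMode(System.getProperties(), "lily.lilyproxy.mode"));
    }
}
```

Rejecting unknown values, rather than silently falling back to embed, is a design choice of this sketch; it makes a typo in a -D option visible immediately.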

LilyProxy

LilyProxy combines HBaseProxy, SolrProxy and LilyServerProxy. When using LilyProxy, the single property lily.lilyproxy.mode sets the embed/connect mode for all of the proxies. Mixing different modes is not possible, since the reset state functionality requires all services to run together.

launch-test-lily (LilyLauncher)

The launch-test-lily script (LilyLauncher class) basically creates, in one JVM, an HBaseTestingUtility, a SolrTestingUtility and a LilyServerTestingUtility.

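LilyLauncher exposes through JMX the operation "resetLilyState". As described earlier for connect mode, this reset clears all Lily state in HBase, HDFS and ZooKeeper, deletes all documents in Solr, and restarts (within the JVM) the Lily server.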

When in connect mode, each time LilyProxy.start() is called, this resetLilyState operation will be called.

The launch-test-lily script opens JMX access on port 10102.

8.7 MapReduce Integration

8.7.1 Using Lily As Input For MapReduce Jobs

Lily has an InputFormat for Hadoop which makes it possible to run efficiently over the records in the repository.

The InputFormat is based on the Lily scanner feature, thus (within each input split) runs sequentially over all or a subset of the records, possibly with some filter(s), and with all or a selection of fields loaded.

This InputFormat is conceptually quite similar to HBase's TableInputFormat. The number of splits, thus the number of map tasks launched, equals the number of regions of the record table.

Lily scanners access HBase directly, bypassing the Lily server nodes, and hence should be fast. A hint is passed to Hadoop so that the map task for a certain input split can be co-located with the region server where the corresponding region is hosted, reducing network traffic.

To see some example code, generate a project using the archetype, as described further on.

8.7.2 Using Lily As Output For MapReduce Jobs

To write to Lily from MapReduce jobs, just use the usual LilyClient class.

There would be little added value in providing a Hadoop OutputFormat for Lily. Having access to the repository from within the map or reduce method gives more flexibility: you can choose which method to use (create, update or createOrUpdate) and you could read before write, use conditional updates, etc.

As with any MapReduce task which has side-effects, be sure to be careful with the behavior of re-execution of failed tasks, of multiple reducers, and of speculative execution.

Create the LilyClient in the setup method, and close it in the cleanup method. We provide utility functions:

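import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;
import org.lilyproject.client.LilyClient;
import org.lilyproject.mapreduce.LilyMapReduceUtil;
import org.lilyproject.repository.api.Repository;
import org.lilyproject.util.io.Closer;

...

// Shown for a Mapper; a Reducer uses the same setup/cleanup pattern.
// Add your own key/value type parameters to Mapper.
public class YourClass extends Mapper {
    private LilyClient lilyClient;
    private Repository repository;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Create the client once per task, not once per record.
        this.lilyClient = LilyMapReduceUtil.getLilyClient(context.getConfiguration());
        this.repository = lilyClient.getRepository();
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Release the client's connections when the task ends.
        Closer.close(lilyClient);
        super.cleanup(context);
    }
}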

8.7.3 Getting Started Writing A Lily MapReduce Job

The quickest way to get started writing a MapReduce job is to set up a project using the Maven archetype:

mvn archetype:generate \
  -DarchetypeGroupId=org.lilyproject \
  -DarchetypeArtifactId=lily-archetype-mapreduce \
  -DarchetypeVersion=[unresolved variable: artifactVersion] \
  -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

This generates a classic word-count style MapReduce job based on Lily. See the README.txt in the generated project for more information on how to try it out.

9 Repository (lily-server) plug-ins

9.1 Repository Decorators

9.1.1 Overview

9.1.1.1 What

Repository decorators are hooks added to Lily server nodes that allow you to do things before or after any (CRUD) operation on the repository. A typical use case is auto-assignment of record state, such as generated metadata or calculated fields.

Client requests arrive at the first decorator in the chain. Clients are unaware of the existence of decorators; they don't know that their requests pass through them. A decorator will then typically call the next in the chain, called the delegate, until the request arrives at the repository itself. Then the call returns through the call chain, which allows decorators to do things after the operation as well.

Since decorators are put in front of the repository, they don't influence the internal behavior of the repository; they can only manipulate what goes in and out.

The term "interceptor" is often used for this kind of component. We opted for decorator instead, since this is the same terminology as used by CDI (the Java Contexts and Dependency Injection specification), where the term interceptor is used for orthogonal concerns.

Decorators are not applied when records are read by the batch index builder.

9.1.1.2 Deployment

Repository decorators are packaged in a jar and have to be deployed on all Lily server nodes. They also have to be activated explicitly through configuration. This makes it possible to control the order in which the decorators are called, is a safeguard against lingering jar files, and allows the extension jar to stay loaded in case it also offers other functionality.

There is no smart distributed deployment or management of decorators within Lily itself. Configuration and setup of nodes is usually managed in a central location (cf. Lily Enterprise), making this a non-issue. This approach also has the advantage that it is possible to have differently configured nodes, such as during a rolling upgrade.

9.1.1.3 The Interface

A repository decorator needs to implement the interface RepositoryDecorator, which is defined as follows:

public interface RepositoryDecorator extends Repository {
    void setDelegate(Repository repository);
}

As you can see, this extends from Repository, so all Repository methods can be decorated. The setDelegate() method is called by the framework to provide your implementation with the delegate it should call.
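To make the delegate mechanism concrete, here is a minimal, self-contained sketch of the pattern. It deliberately does not use the real Lily interfaces: MiniRepository and MiniRepositoryDecorator are simplified, hypothetical stand-ins for Repository and RepositoryDecorator with a single method, and CoreRepository and LoggingDecorator are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Repository; the real interface has many more methods.
interface MiniRepository {
    String create(String record);
}

// Simplified stand-in for RepositoryDecorator.
interface MiniRepositoryDecorator extends MiniRepository {
    void setDelegate(MiniRepository delegate);
}

// Plays the role of the actual repository at the end of the chain.
final class CoreRepository implements MiniRepository {
    @Override
    public String create(String record) {
        return "stored:" + record;
    }
}

// A decorator that notes what happens before and after the delegate call.
final class LoggingDecorator implements MiniRepositoryDecorator {
    final List<String> log = new ArrayList<>();
    private MiniRepository delegate;

    @Override
    public void setDelegate(MiniRepository delegate) {
        this.delegate = delegate; // in Lily, the framework makes this call
    }

    @Override
    public String create(String record) {
        log.add("before create " + record);
        String result = delegate.create(record); // pass the call down the chain
        log.add("after create " + result);
        return result;
    }
}
```

Because a decorator is itself a repository, several decorators can be chained by making each one the delegate of the previous one, with the core repository as the last delegate.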

9.1.2 Creating A Repository Decorator

The steps to make a repository decorator and have it in production are:

  1. making an implementation of the RepositoryDecorator interface
  2. writing code to register your RepositoryDecorator implementation with the PluginRegistry
  3. packaging it as a Kauri module (a jar file)
  4. deploying it to your Lily server node(s)
  5. activating it in the Lily configuration
  6. restarting the Lily server(s)

Steps 2 and 3 are specific to the mechanics of how to get an extension running in Lily. Fortunately, you don't have to worry too much about them, since we have a template project that takes care of these. We will first walk through the steps to get your first decorator running; afterwards we'll provide some more background on steps 2 and 3.

9.1.3 Your First Decorator

9.1.3.1 Generate A Project

Open a shell, go to a directory where you want the project to be located (a single sub-directory will be created for you), and generate a project using the following command:

mvn archetype:generate \
  -DarchetypeGroupId=org.lilyproject \
  -DarchetypeArtifactId=lily-archetype-lily-server-plugin \
  -DarchetypeVersion=[unresolved variable: artifactVersion] \
  -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

This will ask you to confirm the settings for some parameters:

Confirm properties configuration:
groupId: com.mycompany
artifactId: my-lily-server-plugin
version: 1.0-SNAPSHOT
package: com.mycompany
Y: 

It is recommended to answer N and change the values appropriately. The version number (1.0-SNAPSHOT) is the version of your decorator, not of Lily.

9.1.3.2 Implement RepositoryDecorator

The generated project contains a decorator implementation at the following path:

src/main/java/com/mycompany/SampleRepositoryDecorator.java

This sample decorator prints a message before and after record creation calls. For now, we can just continue with this sample decorator. You can come back to it later and adjust it to implement the desired functionality.

9.1.3.3 Disable other sample plugins

Edit the following file:

src/main/kauri/spring/services.xml

Remove or comment out the sections related to samples of other types of plugins. At the time of this writing, this was only the SampleRecordUpdateHook:

<!-- Comment this bean out or delete it
  <bean id="updateHook" class="com.mycompany.SampleRecordUpdateHook">
    <constructor-arg ref="pluginRegistry"/>
  </bean>
-->

9.1.3.4 Build

To build the project, execute:

mvn assembly:assembly

9.1.3.5 Deploy

The build will have created a tarball at

target/my-lily-server-plugin-1.0-SNAPSHOT.tar.gz

This bundles the plugin (jar file and wiring.xml), together with all its dependencies, in a format which can be deployed with Lily Enterprise.

Here we are just interested in local testing, so we extract this again:

tar xvzf target/my-lily-server-plugin-1.0-SNAPSHOT.tar.gz

And then we copy the wiring.xml file to the plugins directory:

cp my-lily-server-plugin-1.0-SNAPSHOT/plugins/load-before-repository/wiring.xml \
   $LILY_HOME/plugins/load-before-repository

And copy the library files to Lily's lib dir:

cp -r my-lily-server-plugin-1.0-SNAPSHOT/lib/* \
      $LILY_HOME/lib

Tip: It is not very tidy to copy our own extensions directly into Lily's lib dir. This is in fact not necessary: when starting Lily using the 'bin/lily-server' script, you can define an environment variable LILY_MAVEN_REPO to point to additional lib dirs. You could make this point to the lib dir of the plugin. When using the service wrapper, see wrapper.conf.

Be productive: during plugin development it is even easier to let LILY_MAVEN_REPO point directly to ~/.m2/repository. Then each time you rebuild using "mvn install", you don't have to deploy anything; just restart lily-server.

9.1.3.6 Edit Lily Configuration

Edit the file

$LILY_HOME/conf/repository/repository.xml

In that file, you will see a <decorators> element. Within that element, you need to list all the decorators that should be active:

  <decorators>
    <decorator>com.mycompany.my-lily-decorator</decorator>
  </decorators>

The decorator name can be found in the SampleRepositoryDecorator.java file mentioned above, in the NAME member.

9.1.3.7 Restart Lily Server

Now restart the Lily server.

During startup, two lines related to the decorator will be logged (this depends on the log configuration, but should be the case by default).

First you will see a line indicating that the decorator plugin jar is being loaded:

[INFO ][snipped] Starting module plugin-my-lily-decorator-1.0-SNAPSHOT - /[snipped path]/plugins/load-before-repository/my-lily-decorator-1.0-SNAPSHOT.jar

A bit later a line will be printed showing the active repository decorators, which corresponds exactly to those configured in repository.xml:

[INFO ][snipped] The active repository decorators are: [com.mycompany.my-lily-decorator]

If you now create records, messages will be printed to standard output.

9.1.3.8 Next Steps

Now that you know how to get a decorator running, you can adjust the decorator implementation to suit your own needs.

For some more insight into how the plugins are packaged and deployed, see Lily Server Plugin Mechanism.

9.2 Record Update Hooks

9.2.1 Overview

9.2.1.1 What

A record update hook is an extension mechanism of the lily-server process. It is called before a record is updated but after the record has been locked for updating and the original record state has been read.

Compared to a Repository Decorator, you would use an update hook when decorating the update method would require you to read the original record. Since the repository implementation reads the previous record state anyway, the hook avoids doing this HBase read twice. Possibly more important, since the record is locked, you can be sure the record state won't change between the read and the update.

The hook is called before the conditional update checks are performed.
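To illustrate why access to the original record state is useful, the sketch below derives a field of the updated record from the value stored in the original one. It is only an illustration under simplifying assumptions: records are modeled as plain maps rather than Lily Record objects, and MiniUpdateHook, EditCounterHook and the field name editCount are hypothetical.

```java
import java.util.Map;

// Simplified stand-in for a record update hook; records are modeled as plain maps.
interface MiniUpdateHook {
    void beforeUpdate(Map<String, Object> record, Map<String, Object> originalRecord);
}

// Increments an edit counter based on the value found in the original record state.
final class EditCounterHook implements MiniUpdateHook {
    static final String FIELD = "editCount"; // hypothetical field name

    @Override
    public void beforeUpdate(Map<String, Object> record, Map<String, Object> originalRecord) {
        Object previous = originalRecord.get(FIELD);
        // Treat a missing or non-integer value as "never edited before".
        int count = previous instanceof Integer ? (Integer) previous : 0;
        record.put(FIELD, count + 1);
    }
}
```

In a decorator, the same logic would need its own read of the stored record, and nothing would prevent another update from slipping in between that read and the write.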

9.2.1.2 The Interface

public interface RecordUpdateHook {
    void beforeUpdate(Record record, Record originalRecord, Repository repository, FieldTypes fieldTypes)
            throws RepositoryException, InterruptedException;
}

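The hook is provided with the arguments shown in the signature above: the record as submitted for update (record), the original record state as read by the repository (originalRecord), the Repository, and a FieldTypes object.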

9.2.2 Creating a RecordUpdateHook

The steps to create a RecordUpdateHook are very similar to those for creating a Repository Decorator, so have a look over there. The archetype which is used there to generate a sample project also contains a sample RecordUpdateHook.

9.3 Lily Server Plugin Mechanism

TODO: this was cut and paste from the decorators document, needs some rewording.

Lily runs on a platform called Kauri. What Kauri basically does is start a number of modules. A module is a jar file with two things added to it: a Spring container definition and a classpath definition, both of which are described below.

Besides this, modules can also export or import services, which allows for wiring services between modules.

The decorator we created above is also such a Kauri module.

The Spring container definition is in the source tree at src/main/kauri/spring/services.xml. The maven build is configured such that the file ends up in the jar at KAURI-INF/spring/services.xml.

The classpath definition is generated as part of the build by a Maven plugin called kauri-genclassloader-plugin. It ends up in the jar at KAURI-INF/classloader.xml.

Let's have a closer look at what is in the Spring container definition:

  <kauri:import-service
      id="pluginRegistry"
      service="org.lilyproject.plugin.PluginRegistry"/>

  <bean id="decorator" class="com.mycompany.SampleRepositoryDecorator">
    <constructor-arg ref="pluginRegistry"/>
  </bean>

The special tag kauri:import-service will make the PluginRegistry service (provided by another Kauri module) available within this Spring container.

The <bean> tag causes the SampleRepositoryDecorator to be instantiated when the module is started.

Inside its constructor, the decorator will register itself with the PluginRegistry:

    public SampleRepositoryDecorator(PluginRegistry pluginRegistry) {
        this.pluginRegistry = pluginRegistry;
        pluginRegistry.addPlugin(RepositoryDecorator.class, NAME, this);
    }

To deploy our module, we copied it to the directory $LILY_HOME/plugins/load-before-repository. Kauri knows what modules to start due to the configuration in conf/kauri/wiring.xml. In that file, you will see a line that tells Kauri to load all the jar files inside that directory:

<directory id="plugin" path="${lily.plugin.dir}${file.separator}load-before-repository"/>

As you can see, the actual plugin directory location is provided by a system property, lily.plugin.dir.

The subdirectory load-before-repository holds modules that are started before the actual repository. For the decorators, it is important that they are registered before the repository is created, so that there is no window during startup in which the repository can get called without the decorators being active.

The above should have given you basic insight into how this all works (if not, don't hesitate to ask questions on the mailing list).

If you want to register multiple decorators, it is not necessary to put each of them in a separate Kauri module. You can just add more implementations to the same project, and add a <bean> tag for each of them to the Spring container.

10 Bulk Imports

Lily has no special support for bulk uploads, but below we provide some tips.

Disable indexes during import

Disabling incremental index updating during import will usually give a significant performance advantage, especially if you make use of link dereferencing. You can then batch-build the index once the import is done.

If you have not already defined an index, simply wait to create your index until after the import. Otherwise, you can disable the incremental updating using:

lily-update-index -n nameOfTheIndex --update-state DO_NOT_SUBSCRIBE

Afterwards, re-enable it using:

lily-update-index -n nameOfTheIndex --update-state SUBSCRIBE_AND_LISTEN

Trigger a batch index build using:

lily-update-index -n nameOfTheIndex --build-state BUILD_REQUESTED

And follow up on its status using:

lily-list-indexes

See managing indexes for more details.

Run multiple clients in parallel

Be sure to run multiple clients in parallel, or write a multi-threaded client, even if your "cluster" would only contain a single node.

Configure initial region splits

When starting out on a blank Lily install and planning to do some bulk loading, be sure to increase the number of initial table splits for these tables: records, links-forward, links-backward. For example, set each to 10 times the number of servers you have (e.g. 60 for 6 nodes).

See Table creation settings for more details. Note that these initial region split settings only take effect upon initial creation of the table. If you use custom record IDs, you will have to assign appropriate split keys yourself or, if unsure, leave it at 1 initial split. Also with custom record IDs, make sure they are not monotonically increasing, or you will be hitting the same region of the record table all the time.

Disable link index maintenance

Lily keeps an index of all links between records. This is used to keep denormalized data in the SOLR index up to date, or it can also be used for custom purposes.

If you are not interested at all in the link index, either because you don't have any link-type fields in your records, or because you have no need for denormalized data in the index, then you can disable the updating of the link index. This can improve performance quite a bit, since otherwise this index has to be kept up to date for each record create/update, which involves reading the record and querying the existing state of the index.

To disable the link index, edit the configuration file rowlog/rowlog.xml and change the enabled attribute of the following element from true to false:

<linkIndexUpdater enabled="true"/>

This needs to be done on all Lily nodes, and the Lily server needs to be restarted after this change.

Change HBase flush settings

While not recommended for general Lily use, you could temporarily relax the HBase flush settings.

This is done with the following properties:

More information on these properties can be found in HBase's hbase-default.xml

General HBase tuning

Reduce HDFS replication

On small clusters (say, < 8 nodes), it is recommended to reduce the HDFS replication factor (dfs.replication property) to 2. Don't make the replication factor equal to or larger than the number of nodes, else HDFS/HBase will complain it can't reach the needed replication factor. The replication setting should be configured in hbase-site.xml rather than in HDFS's configuration, as it is the HBase client which sets the replication level for each file it creates.

Other

See the HBase Book on the HBase website for other tuning tips, including memory configuration, GC settings, LZO compression, etc. Be sure to keep an eye on the metrics.

11 Admin

11.1 Table creation settings

The very first time Lily is launched, it will create the necessary tables on HBase. Some settings for these tables can be configured through the configuration file conf/general/tables.xml. Once the tables are created, modifying this file will not have any effect anymore.

Initial region splits

A table in HBase is divided into a number of partitions, called regions. Initially each table starts out with one region. When a region reaches a certain size, it is split into two.

On an empty cluster, there will be only one region for each table, so all updates go to that one region, and hence the load will be unevenly spread among the servers in your cluster. Therefore, HBase allows you to define initial table splits when creating a table.

Lily creates certain tables, such as the records table, with initial splits. Each split is defined by a start key and an end key. These need to be selected such that the created records will spread more or less evenly over the various regions.

If you create records with UUIDs as record IDs, then Lily can automatically calculate the appropriate start and end keys, given a certain number of regions. In case you assign the record IDs yourself, you will need to define the splits yourself or, simpler, set the number of initial regions to 1.
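For keys that are spread uniformly over a fixed-width binary key space, evenly spaced split points are a natural choice. The sketch below shows one way to compute such boundaries; it is an illustration only and not necessarily the calculation Lily performs. The class SplitKeySketch and its methods are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: evenly spaced split points over a fixed-width binary key space.
final class SplitKeySketch {

    // Returns regionCount - 1 boundary keys of keyLength bytes each, in ascending order.
    // Boundary i is the end key of region i and the start key of region i + 1.
    static List<byte[]> splitKeys(int regionCount, int keyLength) {
        if (regionCount < 1 || keyLength < 1) {
            throw new IllegalArgumentException("regionCount and keyLength must be positive");
        }
        BigInteger keySpace = BigInteger.ONE.shiftLeft(8 * keyLength); // 256^keyLength
        List<byte[]> result = new ArrayList<>();
        for (int i = 1; i < regionCount; i++) {
            BigInteger point = keySpace.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(regionCount));
            result.add(toFixedWidth(point, keyLength));
        }
        return result;
    }

    // Converts a non-negative number below 256^length to exactly 'length' big-endian bytes.
    private static byte[] toFixedWidth(BigInteger value, int length) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[length];
        int copy = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - copy, out, length - copy, copy);
        return out;
    }
}
```

With 4 regions and 1-byte keys, this yields the boundaries 0x40, 0x80 and 0xC0. With self-assigned record IDs that are not uniformly distributed, boundaries like these would not balance the load, which is why the splits then have to be defined by hand.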

11.2 Optimizing HBase Request Load Balancing

To get the maximum out of your cluster, the request load of each of the HBase region servers should be similar. For example, if one server would be processing 300 requests/sec and another one 2000 requests/sec, then you are making far from optimal use of your cluster's resources.

While we won't explain HBase regions in detail here, the important thing is that the regions of one table should be spread as equally as possible over all the region servers. For example, if you have two tables with 10 regions each, and you have two region servers, then rather than putting all 10 regions of one table on one region server, it is better to put 5 regions of each table on each region server.

11.2.1 Record & linkindex tables

All Lily records are stored in one big table called records. If you are making use of Lily-generated UUID record IDs, then load balancing will be optimal. If you are making use of your own IDs, make sure to choose them such that they are not sequentially increasing.

On an empty cluster, it might take a while for the record table to grow to a good number of splits. Therefore, it is possible to pre-split the record table. See Table creation settings.

11.2.2 Rowlog tables (rowlog-mq and rowlog-wal)

The rowlog tables are system tables used by Lily. They contain the time-ordered sequence of events happening to the records. Since it is time-ordered, normally load balancing would be bad since we would always be touching the same region server. However, the rowlog tables can be created with a number of splits, and Lily will 'salt' the timestamps so they are equally divided over the splits. We call this "rowlog sharding".
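The general idea of salting a time-ordered key can be sketched as follows. This is a generic illustration, not Lily's actual row key layout, which is not described here: the class ShardedKeySketch, the use of a hash of the record ID to pick the shard, and the one-byte prefix are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative salting scheme for time-ordered row keys.
final class ShardedKeySketch {

    // Builds a row key consisting of one shard byte followed by the 8-byte timestamp.
    // The shard is derived from the record ID, so consecutive timestamps for
    // different records end up in different parts of the key space.
    static byte[] rowKey(byte[] recordId, long timestamp, int shardCount) {
        if (shardCount < 1 || shardCount > 256) {
            throw new IllegalArgumentException("shardCount must be between 1 and 256");
        }
        int shard = Math.floorMod(Arrays.hashCode(recordId), shardCount);
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put((byte) shard)
                .putLong(timestamp)
                .array();
    }
}
```

Within one shard the keys are still ordered by time, so a consumer can read each shard sequentially. The shard byte depends on the shard count, which is consistent with the warning below that the count should not be changed once data has been written.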

To configure the number of splits, see the shardCount parameter in conf/rowlog/rowlog.xml. By default, 1 split is created. Important: this parameter should not be changed after the initial Lily startup, and it should have the same value on all Lily nodes.

There is no dynamic way to change the shardCount after initial Lily startup, though if necessary you can do it with the procedure described next. This procedure involves dropping the rowlog-mq & rowlog-wal tables, so only do this if either they are empty or there is nothing in them that you care about (e.g. you will do a full index rebuild and you don't need the linkindex). The procedure is: stop all Lily servers, drop the rowlog-mq & rowlog-wal tables, change the shardCount setting (on all servers), start the Lily servers. On startup, Lily will create the rowlog-mq & rowlog-wal tables again, with the newly configured number of splits.

11.2.3 Fixing bad region assignment

In case for some reason the regions in your cluster are not well-balanced, you can tell HBase to reassign the regions in a round-robin fashion by adding the following configuration to the hbase-site.xml of the HBase master:

<property>
 <name>hbase.master.startup.retainassign</name>
 <value>false</value>
</property>

After changing this setting, you need to restart your HBase cluster. The usual HBase startup behavior is that it will try to redeploy each region on the same server, as this assures good data locality with the HDFS data nodes, but the above setting will enable a fresh assignment. Don't forget to remove it again afterwards.

11.3 Metrics

Lily makes available some metrics by making use of Hadoop's metrics package. Metrics give information about the average time a certain operation takes, or the number of operations done per second, and the like.

Some of the tools like the tester and the mbox-import also report metrics.

The metrics can be consulted via JMX or can be reported to Ganglia. Ganglia can collect metrics data from multiple nodes, and uses RRDtool to store the data and make graphs of it.

11.3.1 JMX

The JMX metrics are enabled by default.

You can for example consult them using jconsole. For local processes, look for the class name org.kauriproject.launcher.RuntimeCliLauncher.

The values are updated every 15 seconds. To modify this, see conf/general/metrics.xml.

11.3.2 Ganglia

For Ganglia you can use either version 3.0.x or 3.1.x.

The Ganglia metrics need to be enabled by editing the file conf/general/metrics.xml

For example in that file you will see:

    <attribute name="rowlog.class" value="org.apache.hadoop.metrics.spi.NullContextWithUpdateThread"/>
    <attribute name="rowlog.period" value="15"/>
    <!--
    <attribute name="rowlog.class" value="org.apache.hadoop.metrics.ganglia.GangliaContext31"/>
    <attribute name="rowlog.servers" value="localhost:8649"/>
    -->

For Ganglia you would then change this to:

    <attribute name="rowlog.period" value="15"/>
    <attribute name="rowlog.class" value="org.apache.hadoop.metrics.ganglia.GangliaContext31"/>
    <attribute name="rowlog.servers" value="localhost:8649"/>

If you use Ganglia 3.0.x, drop the "31" at the end of the class name.

11.4 ZooKeeper Connectionloss And Session Expiration Behavior

ZooKeeper is the central service for coordination among and configuration of the Lily processes.

An application such as Lily that makes use of ZooKeeper needs to decide how it deals with situations where the connection with ZooKeeper is lost or its ZooKeeper session has expired.

For Lily, it works as follows:

Per Lily server, there are two connections with ZooKeeper:

12 Glossary

12.1 index entry

In the context of Lily's indexer, an index entry is the entry in the index for a certain Lily record, thus the Solr document corresponding to a certain Lily record, or more correctly, to a specific version of a Lily record. Thus there can be multiple index entries for each record, in case multiple versions of a record are indexed.

13 Lily Hackers

This section of the documentation contains information intended for people working on (rather than with) Lily.

13.1 Getting Started

13.1.1 Lily Source Code

13.1.1.1 Getting the sources

Use:

svn co http://dev.outerthought.org/svn_public/outerthought_lilyproject/trunk/ lily-trunk

13.1.1.2 Building Lily

See the README.txt in the root of the source tree.

In short, if you have Maven installed, do:

mvn -Pfast install

The -Pfast option skips the test cases. Some of the tests will by default launch an embedded Hadoop/HBase, which takes time. This can be sped up by running against an existing HBase install, as explained in the README.txt.

13.1.1.3 Running Lily

During development, you can run Lily similarly to how you run the binary distribution (see Running Lily), the only difference is that the commands are in different locations.

To run launch-test-lily, you do

cd cr/standalone-launcher
./target/launch-test-lily

To run Lily, you do

cd cr/process/server
./target/lily-server

Tip: when you make changes to the Lily source code, after building with Maven you can directly restart Lily. There is no packaging or deploying to do. This is because the Kauri Runtime platform on which Lily runs directly loads the project dependencies (= constructs the classpath) using your local Maven repository (~/.m2/repository).

To run the indexer related commands like lily-add-index, lily-list-indexes:

cd cr/indexer/admin-cli
./target/lily-add-index
./target/lily-list-indexes
...

To run the import tool:

cd apps/import
./target/lily-import

13.1.1.4 Building a binary distribution

To build a .tar.gz like you can download from the Lily website, see the instructions in dist/README.txt

13.1.2 Repository Model To HBase Mapping

Here we describe how Lily stores records, record types and field types in HBase.

We assume you are familiar with HBase: you know about tables, rows, row keys, column families, column qualifiers, timestamps.

13.1.2.1 Records

13.1.2.1.1 One table for all records

All records are stored within one HBase table.

13.1.2.1.2 One record = one HBase row

A record, including all its versions, is stored in one row.

13.1.2.1.3 Row key = Record ID

The row key is the binary representation of the ID of the record as produced by the RecordId.toBytes() method. This byte encoding is such that it starts with the master record ID, so a search for all variants of a record can be done by prefix-scanning on the master record ID.
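
The prefix-scan idea can be illustrated without HBase. The sketch below is only an in-memory stand-in: a TreeMap sorted on unsigned bytes plays the role of the table, and the key layout (master ID followed by ".lang=..") is a hypothetical one chosen for readability, not the actual output of RecordId.toBytes().

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class PrefixScanSketch {

    // Orders keys the way HBase orders rows: lexicographically on unsigned bytes.
    static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    static boolean startsWith(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Seeks to the first key >= prefix and reads on until a key no longer starts with it.
    static List<String> scanPrefix(TreeMap<byte[], String> table, byte[] prefix) {
        List<String> rows = new ArrayList<>();
        for (byte[] key : table.tailMap(prefix, true).keySet()) {
            if (!startsWith(key, prefix)) {
                break;
            }
            rows.add(table.get(key));
        }
        return rows;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static List<String> demo() {
        TreeMap<byte[], String> table = new TreeMap<>(UNSIGNED_BYTES);
        // Hypothetical key layout: master record ID first, then the variant properties.
        for (String id : new String[] {"A0", "A1", "A1.lang=en", "A1.lang=fr", "B2"}) {
            table.put(bytes(id), id);
        }
        return scanPrefix(table, bytes("A1"));
    }
}
```

Because all keys sharing a prefix are adjacent in sort order, the scan touches only the master record and its variants and stops at the first foreign key.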

13.1.2.1.4 Column families and version numbering

Lily uses two column families:

The system and user fields are distinguished by means of a prefix-byte in the column key: see LilyHBaseSchema.RecordColumn.SYSTEM_PREFIX and DATA_PREFIX.

The data column family is configured to keep all versions (by default, HBase only keeps the 3 most recent versions and throws away the others).

For versioned data we make use of the time dimension of HBase. As timestamp we use the version number: 1, 2, 3, ... Non-versioned data is always stored at timestamp 1.

If a value is not changed from one version to another, it is not stored a second time but the value is 'inherited' from the previous version (cfr. sparseness of the HBase tables). If a field is deleted in a version, its value should not be inherited, this is done by storing a 'deleted' marker as value. This also brings the advantage that we can do a delete as part of a HBase Put, so that all updates to the row are done as one atomic unit (in HBase, Put and Delete are both atomic, but separate actions).
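
The inheritance rule and the 'deleted' marker can be sketched for a single field as follows. This is an illustrative in-memory model, not Lily's code: the class name, the DELETED sentinel and the use of a TreeMap in place of an HBase column are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseVersionSketch {

    // Sentinel standing in for the 'deleted' marker value.
    static final Object DELETED = new Object();

    // Version number -> value written at that version, for one field (one column).
    private final TreeMap<Long, Object> cells = new TreeMap<>();

    void put(long version, Object value) {
        cells.put(version, value);
    }

    // A delete is written as an ordinary value, so it can travel in the same Put.
    void delete(long version) {
        cells.put(version, DELETED);
    }

    // Returns the value visible at the given version, or null if absent or deleted.
    Object read(long version) {
        // The newest cell at or below the requested version is the one that is 'inherited'.
        Map.Entry<Long, Object> entry = cells.floorEntry(version);
        if (entry == null || entry.getValue() == DELETED) {
            return null;
        }
        return entry.getValue();
    }
}
```

With values written at versions 1 and 3 and a delete at version 5, a read at version 2 or 4 returns the inherited value, while reads at version 5 and later return nothing.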

13.1.2.1.4.1 Version numbering and record re-creates

This section gives some more background information on the version numbering wrt record deletes and re-creates.

When a record is deleted in Lily, the deleted marker flag is set to true and all historical data (record type, record type version, field data) that existed for the record is cleared. The current version number is however kept. When a record is later created with the same record ID, this is regarded as a record re-create. The record is created (as for a normal create), but the version numbering of the record continues from where it was when it was deleted (e.g. if the version number was 4 when the record was deleted, the re-created record will get version number 5).
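
As a rough illustration of this numbering, the sketch below models only the version counter and the deleted flag of one row. It assumes, for simplicity, that every create and update produces a new version; the class and method names are invented for the example.

```java
final class RecordVersionSketch {

    private boolean deleted = true;   // The row starts out without a live record.
    private long currentVersion = 0;  // Survives deletes on purpose.

    long create() {
        if (!deleted) {
            throw new IllegalStateException("record already exists");
        }
        deleted = false;
        return ++currentVersion;
    }

    long update() {
        if (deleted) {
            throw new IllegalStateException("record does not exist");
        }
        return ++currentVersion;
    }

    void delete() {
        // Field data would be cleared here; the version counter is kept.
        deleted = true;
    }
}
```

A record that reached version 4 before being deleted therefore comes back at version 5 when it is re-created.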

There are a number of reasons why this has been designed and implemented like this and not, for instance, with an HBase row delete:

  1. First of all there is the way HBase behaves with respect to row deletes. When a row is deleted in HBase, a tombstone is written. When a major compaction happens (which can take as long as 24 hours), the tombstone and everything older than the tombstone is removed. As long as the tombstone is present, reads from HBase ignore everything older than the tombstone. However, if we write data after the row was deleted, while the tombstone is still present, and with a timestamp (version) older than the tombstone (e.g. our non-versioned data), this data will also be ignored and even removed when a major compaction happens. If the major compaction had already happened (and thus the tombstone was removed), then writing and reading the new data would succeed. This is inconsistent (non-idempotent) behaviour. Issues HBASE-2847, HBASE-2256 and HBASE-2856 relate to this, and as long as those are not solved, this is a problem.
  2. The row in HBase representing our record does not only contain our record's data, but also rowlog-related information like the row-local table (see HBase Rowlog Library). This information is for instance used to update the link index and is still needed even after the record has been deleted. Removing the whole HBase row would thus also remove this information.
  3. If we were to hide from the Lily user that the version numbering continues from where it was when the record was deleted, some mapping would be needed from record version numbers onto an internal (increasing) version numbering of the record. This would introduce more complex (and thus slower) read and write paths, which we like to avoid as much as possible.
13.1.2.1.5 Fields = columns

The fields are stored as columns, thus one column per field. This is also true for LIST or RECORD fields: these are encoded into one column's value. The byte-encoding of a field value is provided by the ValueType interfaces.

The column qualifier (= the name of the column) is the system-generated field type ID.

13.1.2.2 Record types & field types

The repository schema, thus the record types and field types, is also stored in an HBase table.

The details of their mapping onto HBase are currently not documented here.

13.1.3 Blobstore

In this document we describe the API and design of the blob store: how blobs are stored in the repository and how they relate to records and fields.

13.1.3.1 General

The general idea is that, to enable introducing record-level access control in the future, blobs should only be accessed through the record they are used in (via the repository API) and not directly using their blob key.

Only in the very initial phase, where blobs are uploaded to the blobstore, can they exist without being part of a record. Before a blob can be used in a record it must have been uploaded to the blobstore. During a certain amount of time (e.g. 1 hour) the uploaded blob can then be used in a record. If after that time the blob was not used in a record it will become unavailable and will be removed from the blobstore.

Blobs can be re-used, but only within different versions (also non-sequential ones) of the same field of the same record. Blobs cannot belong to multiple records or multiple fields at the same time.

13.1.3.2 API and usage

13.1.3.2.1 Writing
Repository : OutputStream getOutputStream(Blob blob) throws BlobException, InterruptedException;

To upload a blob to the blobstore, an OutputStream must be requested from the Repository. After uploading the blob and closing the OutputStream, the blob will be updated with information that allows the repository to find and retrieve the blob's data in the blobstore.

13.1.3.2.2 Reading
Repository : BlobInputStream getInputStream(RecordId recordId, QName fieldName, Long version, Integer multivalueIndex, Integer hierarchyIndex);
(+ variants of this method with only the essential parameters)

To retrieve a blob, an InputStream must be requested from the Repository. The blob is identified by its 'location' within a record: the record's ID, the field name, the version of the record (or null to use the latest version, or when versions do not apply, as for non-versioned fields), and, if the field is multivalue, hierarchical, or both, the multivalueIndex and/or hierarchyIndex of the blob (e.g. 0 for the first position).

When finished reading, this InputStream must be closed just like any other InputStream.

The returned stream is a BlobInputStream, a subclass of InputStream which offers one additional method to return the Blob metadata object:

Blob BlobInputStream.getBlob()

The purpose is to get access to this metadata (size, content type) without having to do an additional record read, which already happens as part of the getInputStream implementation. One immediate application is the ability to set the appropriate response headers in the REST interface.

13.1.3.2.3 Referring

A blob can be referred to by using it in a record create or update operation. If the record is not allowed to refer to the blob the create or update operation will throw an InvalidRecordException.

13.1.3.2.4 Deleting

A blob cannot be removed explicitly.

A blob will however be removed from the blobstore in three situations.

  1. The blob was uploaded to the blobstore, but was not used in a record within the defined timeout. In other words, the blob upload was not followed by a create or update operation of a record referring to the blob.
  2. An update or delete operation (of a non-versioned field or a mutable field) on a record can cause the blob to be no longer referred to. The blob will then be removed from the blobstore. Note that if an older version of the record (or a newer version, in case of an update of a mutable field) still refers to the blob, the blob will not be deleted.
  3. When a record is deleted, all its fields will be cleared. As a consequence any referred blobs will be deleted as well.

13.1.3.3 Design

13.1.3.3.1 Repository

The Repository provides the methods getOutputStream and getInputStream to write and read blobs (cfr API above). The usage of blobs within records is managed through the normal record CRUD operations of the repository.

13.1.3.3.2 BlobManager

The Repository uses a BlobManager component to manage the state of blobs. The BlobManager manages the HBase table: BlobIncubatorTable.

13.1.3.3.3 Blob Incubator Table

The BlobIncubatorTable is used to store references to the blobs that have just been uploaded. When a blob is then used in a record create or update operation, this table is checked to see if the blob is indeed available to be used in a record. Before a blob is used, it is 'reserved' so that no other records can use the blob at the same time. Reserving a blob is done by adding the recordId next to the blob reference with a checkAndPut on HBase.

After a blob has been used in a record, its reference is removed from the BlobIncubatorTable.
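
The reservation step is essentially a compare-and-set. The sketch below shows the idea with an in-memory map, where ConcurrentMap.replace(key, expected, new) stands in for HBase's checkAndPut; the class, the method names and the empty-string "unreserved" marker are assumptions made for illustration and do not reflect the real table layout.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BlobIncubatorSketch {

    private static final String UNRESERVED = "";

    // Blob key -> ID of the record that reserved it (or UNRESERVED).
    private final ConcurrentMap<String, String> incubator = new ConcurrentHashMap<>();

    // Called when an upload has completed.
    void blobUploaded(String blobKey) {
        incubator.put(blobKey, UNRESERVED);
    }

    // Atomically claims the blob for one record; fails if it is unknown or claimed by another record.
    boolean reserve(String blobKey, String recordId) {
        if (incubator.replace(blobKey, UNRESERVED, recordId)) {
            return true;
        }
        // A retry by the same record may re-use its earlier reservation.
        return recordId.equals(incubator.get(blobKey));
    }

    // Called once the record create or update has succeeded.
    void blobUsed(String blobKey) {
        incubator.remove(blobKey);
    }
}
```

Because the check and the write happen in one atomic step, two records racing for the same blob cannot both obtain the reservation.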

13.1.3.3.3.1 Table layout:
13.1.3.3.4 Blob Incubator Monitor

When a blob is uploaded, a reference is put in the BlobIncubatorTable and it can be used in a record. If the blob were never used in a record, it would remain in the blobstore forever, and a reference to it would stay in the BlobIncubatorTable forever. To avoid this, a BlobIncubatorMonitor scans the table on a regular basis and removes any blobs that were uploaded at least a minimal amount of time ago. This minimal time ('minimalAge') can be configured (default = 1 hour).

This monitor is a process that runs on only one Lily node (cfr leader election), and it should run at a sufficiently low pace ('monitorDelay') so as not to use too many system resources and influence the other operations (default = 1 check / minute).
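
A single pass of such a monitor could look roughly like the sketch below. It is an in-memory approximation with invented names; the real monitor scans the BlobIncubatorTable and also removes the blob data from the blobstore, which is left out here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IncubatorMonitorSketch {

    // Blob key -> upload time in milliseconds.
    private final Map<String, Long> uploadTimes = new ConcurrentHashMap<>();
    private final long minimalAgeMillis;

    IncubatorMonitorSketch(long minimalAgeMillis) {
        this.minimalAgeMillis = minimalAgeMillis;
    }

    void blobUploaded(String blobKey, long uploadedAtMillis) {
        uploadTimes.put(blobKey, uploadedAtMillis);
    }

    // One scan: drops every entry that is at least minimalAge old and returns the removed keys.
    List<String> sweep(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = uploadTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= minimalAgeMillis) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }
}
```

In a running server, a pass like this could be scheduled at the 'monitorDelay' interval, for example with a ScheduledExecutorService on the node that won the leader election.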

13.1.3.3.5 Workflows
13.1.3.3.5.1 Create record
  1. A blob is uploaded using the OutputStream received from the repository.
  2. Upon closing the outputstream:
    1. The BlobManager is requested to put the reference to the blob in the BlobIncubatorTable
    2. A reference to the blob (blobId + where it is stored) is generated and stored in the value of the blob
  3. The repository is asked to create a record containing blobs
    1. For each blob in the record :
      1. The BlobManager is requested to reserve the blob
        1. A check is done if the blob reference is available in the BlobIncubatorTable (cfr checkAndPut). If not, the create is not allowed.
        2. The blob reference in the blobIncubator is updated to include the recordId of the record where it is going to be used.
          1. The blob is reserved for this record
          2. No other records can use the blob
    2. Create record in HBase
      1. For each blob the BlobManager is requested to remove the blob reservation
13.1.3.3.5.2 Update record
  1. Either a new blob is uploaded using the OutputStream (see above), or a blob will be used that is already used by another version of the same field in the same record.
  2. The repository is asked to update a record containing blobs
    1. For each (to-be-updated) blob in the record :
      1. The BlobManager is requested to reserve the blob
        1. A check is done if the blob reference is available in the BlobIncubatorTable, and it is reserved. If not, a check is done if the blob was already used in another version of the same field in the record. If not, the update is not allowed.
          Note: for the common case where a blob field was not changed with respect to the previous version, the field will already have been removed since it was not modified, and hence there is no additional overhead.
    2. Perform the update of the record
    3. The BlobManager is requested to remove any reservations made
    4. In case of non-versioned fields or an update of a mutable field (either by putting a new blob or deleting a field) it is possible that some blobs are no longer referred to by the record
      1. For each of these blobs the BlobManager is requested to remove these blobs.
        1. The BlobManager will put a reference of these blobs in the BlobDeleteTable and the BlobDeleteMonitor will pick these up and delete them

Note that for inline blobs no incubation or reservation is done. Inline blobs can always be used in a record, no matter if another field or record uses the 'same' inline blob.

13.1.3.3.5.3 Delete record

When a record is deleted, all fields are cleared. For each blob that was referred to by  the record the BlobManager is requested to delete the blob.

13.1.3.3.6 Failure scenarios
13.1.3.3.6.1 Failure after blob reservation

If a failure occurs in a record create or update operation after the step where the blob has been reserved, the blob would remain marked as reserved. If the create or update operation is retried this reservation can be re-used, but only if the record id is known and corresponds to the record id in the existing reservation.

If the operation is not retried, there will be a reservation that refers to a record that either does not exist, or does not use the blob. The BlobIncubatorMonitor will clean up the reservation after the defined timeout, but before removing the blob from the blobstore an extra check is done to see if the blob has indeed not been used by the record, cfr next failure scenario.

13.1.3.3.6.2 Failure before removing the blob reservation

If the record create or update succeeded but removing the blob reservation failed, the reservation will remain on the BlobIncubatorTable.
The BlobIncubatorMonitor will encounter this reservation and clean it up. Before removing the blob from the blobstore it will check if the referred record exists and if it indeed uses the blob.

13.1.3.3.6.3 Failure before removing the blob

If a blob needs to be removed due to an update or delete operation, the BlobManager is requested to remove it. If a failure occurs just before (or during) doing this, the blob will never be removed.
To avoid this we could introduce a secondary action on the WAL. We chose not to do this, however, since it would slow down all operations for a corner case which should occur very infrequently.

If needed a BlobJanitor can still be implemented which scans all blobs in the blobstore and checks if they are still used in some record or not.

13.2 Releasing

13.2.1 Building A Lily Release

These are the steps to perform an official Lily release.

13.2.1.1 Pre-release checks

13.2.1.1.1 Verify a clean Maven build works

This is to verify a "real clean build" scenario would work, thus that the pom's don't reference something which is only available in your local repository.

Depending on how much you value your local Maven repository, you can either just throw away your local repository or temporarily use another location as the local repository:

EM2R=/tmp/EMPTY_MAVEN2_REPO; rm -rf ${EM2R}; mkdir -p ${EM2R}
echo "<settings><localRepository>${EM2R}</localRepository></settings>" > emptyrepo.xml

mvn install -s emptyrepo.xml

13.2.1.2 Change versions

13.2.1.2.1 in wiring.xml

Edit the following file:

cr/process/server/conf/kauri/wiring.xml

and remove the "-SNAPSHOT" suffix from the versions. Commit this.

13.2.1.2.2 in README.txt

Adjust the documentation link to point to the correct version of the docs.

13.2.1.2.3 in dist README.txt

Ditto in dist/src/main/resources-filtered/README.txt

13.2.1.2.4 in scm pom section

When using git, no scm configuration changes are needed.
However, make sure you have read http://maven.apache.org/scm/git.html

In the root pom.xml, check the scm section.

For releases from trunk, it should contain:

  <scm>
    <connection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/trunk</connection>
    <developerConnection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/trunk</developerConnection>
    <url>https://dev.outerthought.org/svn/outerthought_lilyproject</url>
  </scm>

For releases from a branch, it should contain something similar to this (modify branch name as applicable):

  <scm>
    <connection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/branches/BRANCH_1_1_X</connection>
    <developerConnection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/branches/BRANCH_1_1_X</developerConnection>
    <url>https://dev.outerthought.org/svn/outerthought_lilyproject</url>
  </scm>

13.2.1.3 Configure Lily repository access

See Lily Maven repository access.

13.2.1.4 Run Maven release:prepare

Maven release:prepare performs the steps documented here, most importantly:

This does not yet deploy anything.

It is strongly recommended (read: official releases: obliged) to do this on a fresh git checkout to avoid non-clean situations:

rm -rf lilyproject
git clone git@github.com:NGDATA/lilyproject.git
cd lilyproject

Then first do a dry run of release:prepare:

mvn -Pfast release:prepare -DautoVersionSubmodules=true -DpreparationGoals="clean install" -DdryRun=true

As long as the effective mvn release:prepare has not been performed, you can back out with mvn release:clean

We do not currently need the preparationGoals parameter; it was taken from the Kauri build instructions, and since we use Kauri in Lily it is likely that we will need it eventually. The original reason for it: by default the release plugin only executes the 'verify' phase, not install, but Kauri requires the artifacts to be installed in the local repository for Kauri Runtime based test cases to run.

Maven will interactively ask for:

If this finished successfully, you can proceed for real:

mvn -Pfast release:prepare -DautoVersionSubmodules=true -DpreparationGoals="clean install"

If the above would fail with a build failure like "The svn tag command failed. ... File ... already exists." then do an "svn up" and run the above command again. Apparently this is a problem starting from subversion 1.5.1.

To deploy the artifacts to the repository:

export MAVEN_OPTS=-Xmx2048m
mvn release:perform -Dgoals=deploy -Pfast
cd ..
rm -rf lilyproject

13.2.1.5 Building the distribution

See Outerthought-internal procedure (lily-packages repository).

13.2.1.6 Post-release work

13.2.1.6.1 Change versions in wiring.xml

Edit the following file:

cr/process/server/conf/kauri/wiring.xml

and change the version numbers to those of the current development release: x.y-SNAPSHOT.

+ reverse the changes done to the README.txt's earlier.

13.2.1.6.2 Deploy javadoc

Checkout the tagged sources (this is important so that the version in the pom would be correct, as this determines the directory in which the javadoc will be deployed)

git clone git@github.com:NGDATA/lilyproject.git lily-release
cd lily-release
git checkout RELEASE_X_Y_Z # NOTE: this will put you in 'detached head' state. Use git checkout -b release-x-y-z if you want to.
mvn site-deploy

Once I had the problem that this gave the error "ArtifactNotFoundException: The skin does not exist: Unable to determine the release version". This was solved by bringing the versions of maven-site-plugin and maven-project-info-reports-plugin in Lily's root pom.xml in sync with the versions listed on http://maven.apache.org/plugins/

Verify the result is ok by surfing to:

http://lilyproject.org/maven-site/X.Y(.Z)/

And then relink the 'current' javadoc:

ssh lilyproject.org
cd /var/www/lilyproject.org/maven-site
rm current
ln -s {current version} current
13.2.1.6.3 Make new doc site

We need to make a Daisy site for the documentation of the new release.

Check out from outerthought svn the directory projects/outerthought/ot_dpt/trunk

Make a directory for the new site based on the lily-docs-trunk

cd site/src/main/dsy-wiki/sites
cp -r lily-docs-trunk lily-docs-{version}
cd lily-docs-{version}
find . -name .svn -type d -prune -exec rm -rf {} \;

Have a look at siteconf.xml & skinconf.xml to change version dependent things.

The branch can stay at lilydocs-trunk until actual work on the docs for the next version starts. This avoids having to make edits in two versions for changes that happen shortly after the release. But do not forget to branch it and change the branch configuration in lily-docs-{version} once necessary, see Branching the docs.

When done, commit to svn:

svn add lily-docs-{version}
svn commit -m "Adding docs site for new Lily release" lily-docs-{version}

Retarget the link lily-docs-current: (The version should be in the same style as the others, with underscore, e.g. 1_2)

(I don't know how to retarget links in subversion, the below is my quick hack)
svn delete lily-docs-current
svn commit -m "retargetting lily-docs-current link: remove existing link" lily-docs-current
ln -s lily-docs-1_0 lily-docs-current
svn add lily-docs-current
svn commit -m "retargetting lily-docs-current link: link to new target" lily-docs-current

Log in on lilyproject.org

ssh lilyproject.org
sudo su - daisy
cd ot_dpt/site
svn up
mvn daisy:init-wiki

Verify the new site works (can take up to 10 seconds for Daisy to refresh the site information):

http://docs.outerthought.org/lily-docs-{version}

Check that the current now shows the documentation of this new release:

http://docs.outerthought.org/lily-docs-current/

With the 0.2 release, it seemed that Daisy was not able to detect that lily-docs-current had changed, even though the timestamp of the siteconf.xml had certainly changed. This was solved by restarting Daisy: /etc/init.d/ot-sites-wiki restart

13.2.1.6.4 Other things

13.2.2 Publishing The Lily Maven Site (javadocs)

The Maven-generated site (containing the javadoc) is available on http://lilyproject.org/maven-site.

Set up the Lily Maven repository access if not already done.

Then execute the following command in the root of the source tree:

MAVEN_OPTS="-Xmx2500m" mvn site-deploy

The memory increase is because it seems to make site-deploy run much faster (ymmv).

13.2.3 Branching the docs

The release instructions cover how to set up a documentation site for the release, but assume the docs are not branched immediately. Here we describe how to branch the docs.

Branch the docs

Log in to Daisy, switch to Administrator role and go to Administration screen.

Create a branch called lilydocs-M_m (where M_m is the version number, e.g. 1_4)

Choose Tools, Document Tasks and create a new document task

select documents using a query:

select name where collections='lilydocs' and branch='lilydocs-trunk'

Move to next screen

Enter as description "Branching lilydocs for M.m"

As Type of task, choose Simple actions

Choose as task 'Create variant', choose branch lilydocs-M_m and as language en

Start the task, verify it finished successfully.

Update the site definition

Check out from outerthought svn the directory projects/outerthought/ot_dpt/trunk

Edit the siteconf file for the release:

cd site/src/main/dsy-wiki/sites
vi lily-docs-M_m/siteconf.xml

Change content of branch tag from lilydocs-trunk to lilydocs-M_m

Now apply the update to the live site:

ssh lilyproject.org
sudo su - daisy
cd ot_dpt/site
svn up

Go to the site and check that the correct branch is used (by looking at the Variants menu or using the info icon).

Edit the homepage of the trunk site to update references of the old version number to 'nextversion-dev' (or 'trunk' if the next version number would be unknown).

Modify variables

Go to http://docs.outerthought.org/lily-docs-trunk/variables

Choose the Edit link next to "Lily Documentation Variables". Edit the content of this document appropriately.

13.2.4 Pre-Release Verifications

The goal of this section is to collect things that are useful to verify before doing a release. Some of these could be automated, though it is often useful to do some manual observations too.

So, the things to check:

The real basics.
Are there no HBase/ZooKeeper connection leaks?
Can LilyClient survive restarts of the lily-server process?
Can the lily-server be stopped within a reasonable time?

Stopping lily-server (e.g. using ctrl+c if started in a console) should finish within a reasonable amount of time, and not take many dozens of seconds or minutes or hang forever.

Verify this also while a client process is continuously doing create/update operations, and with an index defined (to be sure the rowlog and indexer processes are interrupted correctly).

Are there no important memory leaks, and especially thread leaks, when calling resetLilyState repeatedly?

You need to run the whole lily stack with cr/standalone-launcher/target/launch-test-lily.sh. (The resetLilyState operation only exists in that case).

There is a script in cr/standalone-launcher/resetLilyState_duration_test to help with this. Observe thread counts and memory with jconsole, and let it run for at least 200 resetLilyState iterations.

Run integration tests

Make sure to also run the integration tests:

mvn -Pintegration

Batch index build on real clusters

Batch index build should be tested on a real cluster, not only in combination with launch-hadoop or launch-test-lily, since there can be classpath differences in the launched task VMs.

13.3 Guidelines

13.3.1 Code Style

13.3.1.1 Java Code style

The goal of the code style guidelines is that the code looks the same throughout the code base. This improves both readability and writeability (you don't have to decide how to write your code).

The style proposed here is one that is followed by many open source projects.

13.3.1.1.1 Whitespace
13.3.1.1.1.1 Indenting

Indenting is done using 4 spaces, not tabs. The tab character should not occur in source files.

Indenting should, obviously, increase as nesting increases. Each increment should be 4 spaces.

Bad:

         for (int i = 0; i < 10; i++) {
    System.out.println(i);
                 total += i;
         }
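
Good (a minimal illustrative sketch; the class and method names are invented for the example, and each nesting level is indented by exactly 4 spaces):

```java
final class IndentingSketch {

    // Sums the integers from 0 up to, but not including, limit.
    static int sumBelow(int limit) {
        int total = 0;
        for (int i = 0; i < limit; i++) {
            System.out.println(i);
            total += i;
        }
        return total;
    }
}
```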
13.3.1.1.1.2 In-line indenting
13.3.1.1.1.3 Spacing

Expressions are written with whitespace between them, rather than sticking everything together.

Bad:

String foo="bar";
int y=3*5+(64/8);
void doSomething( String x,String y ){

Good:

String foo = "bar";
int y = 3 * 5 + (64 / 8);
void doSomething(String x, String y) {

Casts are written without space after them:

Object object = "hello";
String hello = (String)object;
13.3.1.1.1.4 Trailing spaces

Configure your editor or IDE to drop trailing spaces.

13.3.1.1.1.5 Newlines

Between methods there should be one blank line. Between instance variables there should be no blank lines, except for grouping related variables.

13.3.1.1.2 Bracket placement

Opening brackets are not placed on a new line.

Good:

if (x < 3) {
    ...
} else {
    ...
}
13.3.1.1.3 Line length

The maximum line length should be (about) 120 characters. Code should not be written such that everything is chopped at 80 characters.

13.3.1.1.4 Names

Names are important, think about them.

13.3.1.1.4.1 Use camel-case

Follow the Java guidelines.

Static final variables should be all uppercase.

13.3.1.1.4.2 Use descriptive names

Thus use “image” rather than “im”, especially for method names & arguments.

13.3.1.1.4.3 Single-letter variables

Do not use single-letter variable names except for loop indices and maybe in short, complex algorithms where long names would complicate reasoning about the code.

13.3.1.1.5 Comments

Besides documenting APIs, comments should certainly be used for anything unusual, so that people including yourself do not end up wondering a few months later why something was done in a particular way.

13.3.1.1.5.1 Write HTML-formatted Javadoc

When writing Javadoc comments, make sure they contain the necessary markup to be readable in the generated javadoc. Most importantly, start new paragraphs with a <p>. In HTML, it is not necessary to add the closing </p>. The first sentence up to the first dot is used by Javadoc to show in overviews, so make sure it exists and is meaningful.

Example:

/**
 * Thing to do stuff.
 *
 * <p>Blah blah blah ...
 *
 * <p>Blah blah blah...
 */
13.3.1.1.5.2 Drop meaningless comments

Sometimes IDEs generate standard javadoc with @param declarations for all parameters. Source files are then sometimes full of empty comments just listing these parameters. These are extra lines to read, and are never maintained anyway as parameters are added and removed, so it is better to drop them altogether. In summary, only leave meaningful comments.

13.3.1.1.5.3 Do not use designer comments

Do not use things like:

// =============================================

// ~~~~~~~ begin methods ~~~~~~~~~~~

/****************************************************
*
*/
13.3.1.1.5.4 TODO and FIXME comments

TODO and FIXME comments can be used.

TODO comments can be useful markers during development, but we encourage you to fix as many of them as possible before committing a change set, since otherwise many of these TODOs stay around for a long time. Have the discipline to write production-quality code immediately.

13.3.1.1.6 For loops

The new-style (enhanced, "for-each") for loops are preferred over old-style indexed loops.
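As a brief illustration of the difference, the sketch below computes the same value with both loop styles. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ForLoopExample {
    // Old style: explicit index bookkeeping that the loop body never needs.
    static int totalLengthOldStyle(List<String> titles) {
        int total = 0;
        for (int i = 0; i < titles.size(); i++) {
            total += titles.get(i).length();
        }
        return total;
    }

    // New style (preferred): iterate over the elements directly.
    static int totalLength(List<String> titles) {
        int total = 0;
        for (String title : titles) {
            total += title.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Dune", "Emma");
        // Both variants compute the same sum of title lengths.
        System.out.println(totalLength(titles) == totalLengthOldStyle(titles));
    }
}
```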

13.3.1.2 Non-Java source files

13.3.1.2.1 XML

XML is indented with 2 spaces.

The XML declaration should always be present on the first line: <?xml version="1.0"?>

No spaces are used around the = sign of attributes.

Empty tags are written with the closing marker attached directly to the tag name or last attribute:

Bad:

<foo />
<foo x="y" />

Good:

<foo/>
<foo x="y"/>

13.3.2 Programming Guidelines

13.3.2.1 InterruptedException

If you get an InterruptedException, after handling it if necessary, always throw it further. If you can't throw it further because you are implementing an interface which does not have it declared (such as Runnable), set the thread's interrupted flag again by calling Thread.currentThread().interrupt(). Note that the static method Thread.interrupted() does the opposite: it reads the flag and clears it.

By adding InterruptedException to the throws clause of a method, you are indicating to the caller that it is an operation which can be interrupted (typically because it is blocking/waiting, but it could also be an interruptible loop).

See this article by Brian Goetz.
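A minimal sketch of both cases follows, assuming a worker that blocks on a queue. The class and method names are hypothetical; only the java.util.concurrent and Thread calls are standard JDK API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class InterruptExample {
    // A blocking operation: declare InterruptedException and let it propagate.
    static String takeNext(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    // Runnable.run() cannot declare InterruptedException, so the flag is set again.
    static class Worker implements Runnable {
        private final BlockingQueue<String> queue;

        Worker(BlockingQueue<String> queue) {
            this.queue = queue;
        }

        public void run() {
            try {
                takeNext(queue);
            } catch (InterruptedException e) {
                // Restore the interrupted status so code higher up the stack can see it.
                Thread.currentThread().interrupt();
            }
        }
    }

    // Demo helper: runs the worker on an already-interrupted thread and
    // reports whether the interrupted status is still set afterwards.
    static boolean flagRestoredAfterRun() {
        Thread.currentThread().interrupt();
        new Worker(new LinkedBlockingQueue<String>()).run();
        return Thread.interrupted(); // reads and clears the flag
    }

    public static void main(String[] args) {
        System.out.println(flagRestoredAfterRun());
    }
}
```

Without the call in the catch block, the interrupt request would be silently lost once run() returns.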

13.3.2.2 ZooKeeper

13.4 Lily Maven Repository Access

Here we explain what to set up to be able to deploy artifacts to the Lily Maven repository.

Maven settings

Configuring your Maven settings is important so that the permissions of the deployed files are correct, otherwise you'll have to fix them manually afterward (or most likely, you won't notice it, and the next person trying to deploy might have problems).

In the following file (create it if it does not exist):

~/.m2/settings.xml

make sure the following three server entries are included:

<settings>
  <servers>
    <server>
      <id>org.lilyproject.maven-deploy</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>

    <server>
      <id>org.lilyproject.maven-snapshot</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>

    <server>
      <id>org.lilyproject.website</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>
  </servers>
</settings>

Passwordless login

To avoid entering your password many times during the deployment of the artifacts to the public repository, you should add your public key to the ~/.ssh/authorized_keys2 file on lilyproject.org. If you are unfamiliar with this, stop reading here and find out how to do this. It will take you less time than entering your password a gazillion times.

13.5 Incompatible changes (by commit)

Here we list incompatible changes made to Lily. These can be changes to data format, configuration, API, scripts, etc. The changes are listed by commit so that when using Lily trunk, you can check whether any incompatible changes happened between now and the last time you fetched the sources.

Revision 5114 (October 11, 2011)

It is no longer allowed to use 'null' as namespace in a QName. An upgrade tool for existing repositories is available.

Revision 5096 (October 4, 2011)

The configuration for dynamic fields in the indexerconf changed to cope with the refactored value types.

Changes concerning the matching of fields:

Changes to the expression for producing the Solr field name:

Revision 5082 (October 3, 2011)

The syntax for declaring formatters in the indexerconf.xml changed, as well as how this configuration is interpreted. The Formatter interface changed as well.

These changes were done to cope with the change from primitive value type to the generified value types.

Since it was not possible to register custom formatters and since there was only one default formatter available, this change should not affect you.

Revision 5073 (September 26, 2011)

In the REST interface, the request parameters mvIndex and hIndex for getting blobs are replaced by the 'indexes' parameter, which is a comma-separated list of integers.

Revision 5071 (September 26, 2011)

Changed the JSON format for the lily-tester in the same way as for the lily-importer and REST interface in revisions 5066 and 5067 (see below).

Revision 5066 and 5067 (September 22, 2011)

The JSON format for the lily-importer and REST interface has changed to support the new value types: List, Path, Record and Link.

In a FieldType, the ValueType should be represented by just a string and no longer by an object with the primitive, multivalue and hierarchical properties.

The string for the ValueType represents the full name of the value type: valuetype = BLOB | BOOLEAN | DATETIME | DATE | DECIMAL | DOUBLE | INTEGER | LONG | STRING | URI | LIST<valuetype> | PATH<valuetype> | LINK[<rtNamespace$rtName>] | RECORD[<rtNamespace$rtName>]
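To make the grammar concrete, here is a small sketch that checks whether a string matches it. This is not Lily's own parser: the class and method names are hypothetical, and it assumes that the square brackets in the grammar mark the record type argument as optional and that the angle brackets are literal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ValueTypeNameCheck {
    private static final Set<String> SIMPLE_TYPES = new HashSet<String>(Arrays.asList(
            "BLOB", "BOOLEAN", "DATETIME", "DATE", "DECIMAL", "DOUBLE",
            "INTEGER", "LONG", "STRING", "URI"));

    static boolean isValid(String name) {
        if (SIMPLE_TYPES.contains(name)) {
            return true;
        }
        // Assumption: LINK and RECORD may appear without a record type argument.
        if (name.equals("LINK") || name.equals("RECORD")) {
            return true;
        }
        int open = name.indexOf('<');
        if (open < 0 || !name.endsWith(">")) {
            return false;
        }
        String outer = name.substring(0, open);
        String argument = name.substring(open + 1, name.length() - 1);
        if (outer.equals("LIST") || outer.equals("PATH")) {
            // The argument is itself a value type, so check it recursively.
            return isValid(argument);
        }
        if (outer.equals("LINK") || outer.equals("RECORD")) {
            // Assumption: namespace and name are both non-empty, separated by '$'.
            int separator = argument.indexOf('$');
            return separator > 0 && separator < argument.length() - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid("LIST<PATH<STRING>>"));
        System.out.println(isValid("LINK<org.example$Book>"));
    }
}
```

Under these assumptions, nested names such as LIST<PATH<STRING>> are accepted, while LIST without an argument or STRING<X> are rejected.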

13.6 Creating Snapshots Of 3rd Party Projects

13.6.1 Building HBase Snapshot

Here we explain how to deploy an HBase SVN snapshot version to Lily's Maven repository. This is used when we want to make use of a non-released HBase version in Lily.

13.6.1.1 Check out HBase

13.6.1.1.1 Existing checkout
13.6.1.1.2 No existing checkout

Fetch a copy of the source using:

svn export http://svn.apache.org/repos/asf/hbase/trunk hbase-trunk

When this finishes, a revision number will be printed; write it down.

13.6.1.2 Change HBase version number

HBase trunk has a version number of the style X.Y.Z-SNAPSHOT. As we want to know exactly what sources we are using, we will rename this to something that includes the SVN revision number.

It seems that a newer, unreleased version of the Maven release plugin has a special command for this (release:update-versions). Since we cannot use it yet, we fall back on a simpler mechanism: find and sed. The following command relies on the HBase version number being unique, i.e. no other dependency uses the same version number. Adjust the two version strings in the command (0.89.0-SNAPSHOT and 0.89.0-r917988 in this example) to match the current HBase version and the SVN revision number determined earlier.

find -name pom.xml -exec sed -i 's/<hbase.version>0.89.0-SNAPSHOT<\/hbase.version>/<hbase.version>0.89.0-r917988<\/hbase.version>/g' {} \;

Do an 'svn diff' to verify that this made the correct changes.

13.6.1.3 Build

Execute:

mvn -DskipTests clean install

13.6.1.4 Test

At this point, before going on with the deploy, you will probably want to change the HBase (and Hadoop) version in Lily to check whether this HBase build works.

13.6.1.5 Deploy

First set up Lily Maven repository access if not already done.

Execute:

mvn deploy -DaltDeploymentRepository=org.lilyproject.maven-deploy::default::scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy

Note: For now, use Maven 2. Using Maven 3 here gives a "No connector available to access repository" error.

13.6.1.6 Make binary build available

In case someone wants to run Lily against an installed HBase cluster of this same version, make a binary distribution of HBase available like this:

mvn -DskipTests=true package assembly:assembly

scp target/hbase-0.89.0-r{revision number}-bin.tar.gz youruser@lilyproject.org:/var/www/lilyproject.org/files/hbase

The matching Hadoop version should also be provided.

Currently (June 30, 2010) HBase trunk uses the Hadoop "branch-0.20-append" which can be obtained as follows:

svn co -r {revision found in hbase pom} http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ hadoop-common

(please update the above if not correct anymore)

And then to build:

In theory:

ant tar

In practice, to succeed, I used:

ant -Djava5.home=/usr/lib/jdk1.5/ -Dforrest.home=/path/to/apache-forrest-0.8 tar

Note: Forrest requires Java 5 (it gives sitemap validation errors with Java 6), and it seems that building the docs cannot be skipped (even though the build appears intended to skip them if forrest.home is not set).

At the end of the build, the path to the created hadoop tar file will be printed.

scp hadoop-0.20.3-append-r{revision}.tar.gz youruser@lilyproject.org:/var/www/lilyproject.org/files/hadoop

13.6.1.7 Revert version number changes

If you have an HBase checkout (rather than an export), revert the changed version numbers using:

svn revert -R .

13.6.2 Building Kauri Snapshot

These are the instructions to build a versioned Kauri release from Subversion. This is for the case where we want to use a non-released Kauri version in Lily.

13.6.2.1 Check out Kauri

13.6.2.1.1 Existing checkout
13.6.2.1.2 No existing checkout

Fetch a copy of the source using:

svn export https://dev.outerthought.org/svn/outerthought_kauri/trunk kauri-trunk

When this finishes, a revision number will be printed; write it down.

13.6.2.2 Change Kauri version number

Look in Kauri's pom.xml for the current version number.

Execute the following command (from within the kauri-trunk directory) to replace the version numbers. Adapt the two version strings in the command: the first (0.4-dev-SNAPSHOT in this example) should be equal to Kauri's current development version number as found in the pom.xml; the second (0.4-r1538 in this example) should be the same but with the word 'SNAPSHOT' replaced with the Subversion revision number noted above.

find -name pom.xml -exec sed -i 's/<version>0.4-dev-SNAPSHOT<\/version>/<version>0.4-r1538<\/version>/g' {} \;

13.6.2.3 Deploy

Make sure the repository org.lilyproject.maven-deploy is configured in your ~/.m2/settings.xml, as described in Lily Maven Repository Access.

Execute:

mvn deploy -DaltDeploymentRepository=org.lilyproject.maven-deploy::default::scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy

13.6.2.4 Revert version number changes

If you have a Kauri checkout (rather than an export), revert the changed version numbers using:

svn revert -R .

13.6.3 Deploying SOLR war To Maven

SOLR does not publish its war in Maven (though see SOLR-1218), but for use in Lily's testcases it is convenient to have it available in Maven. Here we describe how to publish the SOLR war to Lily's Maven repository.

First, because Maven 3 does not have scp support by default, you need to create a dummy pom.xml file containing:

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>dummy</groupId>
  <artifactId>dummy</artifactId>
  <version>1.0-SNAPSHOT</version>
  <build>
    <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ssh</artifactId>
        <version>2.0</version>
      </extension>
    </extensions>
  </build>
</project>

Then, it can be published into Lily's repository as follows:

mvn deploy:deploy-file \
  -Dfile=/path/to/apache-solr-1.4.1/dist/apache-solr-1.4.1.war \
  -Durl=scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy \
  -DgroupId=org.apache.solr \
  -DartifactId=solr-webapp \
  -Dversion=1.4.1 \
  -Dpackaging=war