Clustering in Jackrabbit works as follows: content is shared between all cluster nodes. That means all Jackrabbit cluster nodes need access to the SAME persistent storage (persistence manager and data store).

The persistence manager must be clusterable (e.g. a central database that allows concurrent access, see PersistenceManagerFAQ). Any DataStore (file or DB) is clusterable by its very nature, as it stores content by unique hash IDs. However, each cluster node needs its own (private) FileSystem and search index.

Every change made by one cluster node is recorded in a journal, which can be either file-based or stored in a database.

Requirements

In order to use clustering, the following prerequisites must be met:

Each cluster node must have its own repository configuration
Every cluster node must be assigned a unique ID
A journal type must be chosen, either based on files or stored in a database
The persistence managers must store their data in the same, globally accessible location (see PersistenceManagerFAQ)
A DataStore must always be shared between nodes, if used
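As a rough illustration of how these requirements map onto one node's repository configuration, the abbreviated sketch below marks which parts are private to the node and which are shared. It is not a complete repository.xml: the elements left out are only indicated by comments, and the data store path /nfs/myserver/mydatastore is a hypothetical shared location chosen for this example.

```xml
<Repository>
  <!-- Private to this node: the repository-level file system -->
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
    <param name="path" value="${rep.home}/repository"/>
  </FileSystem>

  <!-- Shared by all nodes: the data store, if one is used
       (the path is an assumed shared mount, not taken from this page) -->
  <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
    <param name="path" value="/nfs/myserver/mydatastore"/>
  </DataStore>

  <!-- Security, Workspaces and Versioning elements omitted in this sketch -->

  <Workspace name="${wsp.name}">
    <!-- Shared by all nodes: a clusterable PersistenceManager goes here
         (see Persistence Manager Configuration) -->

    <!-- Private to this node: the search index -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
    </SearchIndex>
  </Workspace>

  <!-- Unique per node: the cluster node ID; the journal itself is shared -->
  <Cluster id="node1">
    <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
      <param name="revision" value="${rep.home}/revision.log"/>
      <param name="directory" value="/nfs/myserver/myjournal"/>
    </Journal>
  </Cluster>
</Repository>
```

A second node would typically use the same shared locations, with its own ${rep.home} and a different cluster node ID.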

Unique Cluster Node ID

Every cluster node needs a unique ID. This ID can be specified either in the cluster configuration as the id attribute or as the value of the system property org.apache.jackrabbit.core.cluster.node_id. When copying repository configurations, do not forget to adapt the cluster node IDs if they are hardcoded. See below for some sample cluster configurations. A cluster node ID can be freely defined; the only requirement is that it is different on each cluster node.
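One way to keep a single configuration file identical across nodes is to leave the ID out of the file and pass it through the system property. The sketch below assumes the id attribute is simply omitted in that case; the value node2 is illustrative only.

```xml
<!-- Sketch: no id attribute is set here. The node ID is expected to come
     from the system property instead, for example by starting the JVM with
     -Dorg.apache.jackrabbit.core.cluster.node_id=node2 -->
<Cluster>
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <param name="revision" value="${rep.home}/revision.log" />
    <param name="directory" value="/nfs/myserver/myjournal" />
  </Journal>
</Cluster>
```

With this approach, only the JVM start-up options differ between nodes, which avoids forgetting to change a hardcoded ID after copying a configuration.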

Sync Delay

By default, cluster nodes read the journal and update their state every 5 seconds (5000 milliseconds). To use a different value, set the attribute syncDelay in the cluster configuration.

Journal Type

The cluster nodes store information identifying the items they modified in a journal. Like the persistent storage, this journal must be globally available to all nodes in the cluster. It can be either a folder in the file system or a standalone database.

File Journal

The file journal is configured through the following properties:

revision: location of the cluster node's revision file
directory: location of the journal folder

Database Journal

The database journal is configured through the following properties:

revision: location of the cluster node's revision file
driver: JDBC driver class name
url: JDBC URL
user: user name
password: password

Sample Cluster Configuration

This section contains some sample cluster configurations. The first uses a file-based journal implementation, where the journal files are created in a share exported by NFS:

<Cluster id="node1">
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <param name="revision" value="${rep.home}/revision.log" />
    <param name="directory" value="/nfs/myserver/myjournal" />
  </Journal>
</Cluster>

In the next configuration, the journal is stored in an Oracle database, using a sync delay of 2 seconds (2000 milliseconds):

<Cluster id="node1" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.OracleDatabaseJournal">
    <param name="revision" value="${rep.home}/revision.log" />
    <param name="driver" value="oracle.jdbc.driver.OracleDriver" />
    <param name="url" value="jdbc:oracle:thin:@myhost:1521:mydb" />
    <param name="user" value="scott"/>
    <param name="password" value="tiger"/>
  </Journal>
</Cluster>

In the following configuration, the journal is stored in a PostgreSQL database, accessed via JNDI (see also UsingJNDIDataSource):

<Cluster id="node1" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision.log" />
    <param name="driver" value="javax.naming.InitialContext"/>
    <param name="url" value="java:jdbc/Workspace"/>
    <param name="schema" value="postgresql"/>
  </Journal>
</Cluster>

Note: the journal implementation classes have been refactored in Jackrabbit 1.3. In earlier versions, journal implementations resided in the package org.apache.jackrabbit.core.cluster.

Persistence Manager Configuration

For performance reasons, only information identifying the modified items is stored in the journal. This implies that all cluster nodes must have access to the items' actual content. The persistence manager needs to be transactional and needs to support concurrent access from multiple processes. When using Jackrabbit, one option is to use a database persistence manager together with a database that supports concurrent access. The file system based persistence managers in Jackrabbit are not transactional and don't support concurrent access; Apache Derby doesn't support concurrent access in embedded mode. The following sample shows a workspace's persistence manager configuration using an Oracle database:

<PersistenceManager class="org.apache.jackrabbit.core.persistence.db.OraclePersistenceManager">
  <param name="url" value="jdbc:oracle:thin:@myhost:1521:mydb" />
  <param name="user" value="scott"/>
  <param name="password" value="tiger"/>
  <param name="schemaObjectPrefix" value="${wsp.name}_"/>
  <param name="externalBLOBs" value="false"/>
</PersistenceManager>

Since the file system BLOB store uses a repository local directory and is not transactional, one should set the parameter externalBLOBs to false in order to store BLOBs in the database as well.

last edited 2009-07-21 04:36:00 by KevinJansz

[linuxkernelnewbies] Clustering - Jackrabbit Wiki