
Sunday, 28 June 2015

AWS - Liferay Cluster configuration (auto scaling capable)


Default Liferay clusters communicate over multicast, which enables dynamic member discovery (EhCache peer discovery over multicast groups). In fact, if you are running several Liferay clusters on the same network, you should make sure they aren't sharing multicast groups (configured in portal.properties under the multicast.group.address and multicast.group.port keys).
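For example, a second cluster sharing the same network could be moved to its own multicast groups with overrides like these in portal-ext.properties (the key names are the ones defined in the 6.x portal.properties; the addresses and ports below are just hypothetical values):

# Hypothetical multicast groups for a second cluster on the same network
multicast.group.address["cluster-link-control"]=239.255.10.1
multicast.group.port["cluster-link-control"]=24301
multicast.group.address["cluster-link-udp"]=239.255.10.2
multicast.group.port["cluster-link-udp"]=24302
multicast.group.address["multi-vm"]=239.255.10.5
multicast.group.port["multi-vm"]=24305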

By contrast, cloud platforms don't allow multicast traffic between server instances. As a consequence, Liferay's default Cluster Link setup doesn't work (well, the instances will run, but cache and index changes won't be replicated across nodes). AWS, Azure, Google Cloud and the rest all behave the same way in this respect.

Lots of articles on the Internet explain how to configure a Liferay cluster, and probably one of the most detailed has become part of the Liferay documentation. There are even articles explaining how to configure a Liferay cluster using unicast traffic. But I haven't found any article explaining how to configure a real auto-scaling group with a single cluster configuration shared by all nodes, so that new instances can be cloned and started automatically in any order.

As explained in those articles, several Liferay modules need to be specifically configured to work in a clustered environment:

  • Database: there are no special database requirements for a Liferay cluster beyond pointing every node at the same database instance (a minimal JDBC example is shown right after this list). But if you are looking for an unbreakable Liferay you can explore database cluster options, and there are also horizontal scaling options supported by Liferay such as database sharding. For our purpose, we have set up an Amazon RDS server with a Multi-AZ deployment to reduce maintenance downtime.
  • Document Library: must be shared across cluster instances. In a typical self-hosted cluster an NFS volume would be used, but Liferay provides the perfect fit for our Amazon deployment: it supports Amazon S3 buckets. Even if you already have a running Liferay instance and want to migrate its data to S3, this article explains the process.
  • Quartz: since Liferay 6.1 there is no need for an explicit cluster configuration: when Liferay detects that Cluster Link is enabled, it creates all the required database tables automatically.
  • Index: Lucene indexes must be updated across the cluster, usually by enabling lucene.replicate.write=true when Cluster Link is enabled. A more refined and scalable option is using Solr as an external indexing server.
  • Cluster Link: Liferay's cluster communication mechanism. It's enabled with cluster.link.enabled=true and it shares EhCache's JGroups-based peer discovery mechanism to find cluster members.
  • EhCache: also uses JGroups to discover peers, and needs to be configured to work with unicast-only channels.
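Regarding the shared database, all that really matters is that every node uses the same connection settings. Here is a minimal sketch of the JDBC section of portal-ext.properties, assuming a MySQL RDS instance (the endpoint, schema and credentials below are hypothetical placeholders):

# Every node points at the same (hypothetical) RDS endpoint
jdbc.default.driverClassName=com.mysql.jdbc.Driver
jdbc.default.url=jdbc:mysql://YOUR_RDS_ENDPOINT:3306/lportal?useUnicode=true&characterEncoding=UTF-8
jdbc.default.username=YOUR_DB_USER
jdbc.default.password=YOUR_DB_PASSWORD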

All these topics have already been covered in the linked articles, but there is a flaw in all of them: the JGroups configuration. Its TCPPING section needs a static list of the cluster's initial members. However, JGroups supports other implementations of the PING operation that don't need to know about that list of members, and two of them fit our case perfectly: S3_PING and JDBC_PING.
Both implementations use a shared resource (an S3 bucket or a database table, respectively) to let cluster members register themselves and discover the other members. As the database is the only non-clustered system in our platform and we want to save as much CPU time as possible, we've chosen S3_PING with an S3 bucket created ad hoc.
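For contrast, this is roughly what the discovery section of those articles looks like with TCPPING (the private IPs below are hypothetical): every node has to be listed up front on every member, which is exactly what breaks auto scaling.

    <!-- Static discovery: one hypothetical entry per known node -->
    <TCPPING timeout="3000"
             initial_hosts="10.0.1.10[7800],10.0.1.11[7800]"
             port_range="1"
             num_initial_members="2"/>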

With that in mind, by adding the following lines to every node's portal-ext.properties (replacing the database server values and the path to the Liferay installation with environment-specific values) ...

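# Cluster Link enabled and pointed at the unicast JGroups channel configuration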
cluster.link.enabled=true
cluster.link.autodetect.address=YOUR_DB_HOST:YOUR_DB_PORT
cluster.link.channel.properties.control=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml
cluster.link.channel.properties.transport.0=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml

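# EhCache replication and peer discovery over JGroups instead of the default multicast discovery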
ehcache.bootstrap.cache.loader.factory=com.liferay.portal.cache.ehcache.JGroupsBootstrapCacheLoaderFactory
ehcache.cache.event.listener.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheReplicatorFactory
ehcache.cache.manager.peer.provider.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory

ehcache.multi.vm.config.location.peerProviderProperties=/jgroups/tcp.xml
ehcache.multi.vm.config.location=/ehcache/liferay-multi-vm-clustered.xml

net.sf.ehcache.configurationResourceName.peerProviderProperties=/jgroups/tcp.xml
net.sf.ehcache.configurationResourceName=/ehcache/hibernate-clustered.xml

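# Replicate Lucene index writes to the other cluster nodes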
lucene.replicate.write=true
index.search.writer.max.queue.size=9999999

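# Document Library stored in a shared S3 bucket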
dl.store.impl=com.liferay.portlet.documentlibrary.store.S3Store
dl.store.s3.access.key=S3_ACCESS_KEY
dl.store.s3.secret.key=S3_SECRET_KEY
dl.store.s3.bucket.name=S3_DL_BUCKET_NAME


... and creating a file tcp.xml under PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/ with this content ...

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">

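    <!-- TCP transport: every node binds port 7800, which must be reachable from the
         other nodes (e.g. opened in the cluster's EC2 security group) -->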
    <TCP singleton_name="liferay"
         bind_port="7800"
         loopback="false"
         recv_buf_size="${tcp.recv_buf_size:5M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"

         timer_type="old"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

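    <!-- Discovery: instead of TCPPING's static host list, members register themselves
         in a shared S3 bucket and read it to find the other nodes -->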
    <S3_PING location="S3_JGROUPS_BUCKET_NAME" access_key="S3_ACCESS_KEY"
             secret_access_key="S3_SECRET_KEY" timeout="2000"
             num_initial_members="2"/>


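    <!-- Failure detection and merging: FD_SOCK/FD/VERIFY_SUSPECT detect dead members,
         MERGE2 re-merges subgroups after a network partition -->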
    <MERGE2  min_interval="10000"
             max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
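    <!-- Reliable delivery: NAKACK2 and UNICAST retransmit lost messages, STABLE purges
         messages already delivered to all members -->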
    <pbcast.NAKACK2 use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
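    <!-- Group membership, flow control, fragmentation and state transfer for joining nodes -->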
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER/>

</config>


... Liferay should be able to run in an AWS EC2 cluster without any instance-specific configuration, making these instances auto-scaling capable.

The main drawback of this solution is that S3 traffic will cost around $4 a month per cluster node: each node reads the bucket roughly every 3 seconds, which is about 20 requests per minute and adds up to around 900K S3 requests a month.

Luckily there is a third option: if a Solr cluster based on SolrCloud is under consideration, we have developed an implementation of the JGroups PING operation over ZooKeeper, faster and more reliable than the other implementations, available on our GitHub and ready to be downloaded from the Maven Central Repository. Our next article will dig deeper into how it works.

Finally, Liferay must be marked as <distributable/> in web.xml, the application server can be configured to replicate HttpSessions between nodes (it isn't strictly necessary), and the Elastic Load Balancer should be configured with session affinity based on the JSESSIONID cookie. By the way, this configuration should run in both Liferay 6.1 and 6.2, CE and EE.
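For reference, the web.xml change is just the empty distributable element inside the web-app root (sketched here against a generic descriptor; the rest of the file stays as shipped):

<web-app ...>
    ...
    <!-- tells the application server that sessions of this webapp may be replicated -->
    <distributable/>
    ...
</web-app>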