Sunday, 28 June 2015

AWS - Liferay Cluster configuration (auto-scaling capable)


Default Liferay clusters communicate over multicast, which enables dynamic member discovery through EhCache peer discovery over multicast groups. In fact, if you run several Liferay clusters on the same network you should make sure they aren't sharing multicast groups (configured in portal.properties under the multicast.group.address and multicast.group.port keys).
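
For reference, the stock multicast defaults look roughly like the snippet below (values quoted from memory of the 6.x portal.properties, so double-check them against your version); overriding these per cluster in portal-ext.properties is what keeps two clusters on the same network from talking to each other:

# Illustrative defaults from Liferay 6.x portal.properties -- verify against your version.
# Each channel (Cluster Link control/transport, Hibernate cache, multi-VM cache)
# uses its own multicast group; two clusters sharing a network must not reuse them.
multicast.group.address["cluster-link-control"]=239.255.0.1
multicast.group.port["cluster-link-control"]=23301
multicast.group.address["cluster-link-udp"]=239.255.0.2
multicast.group.port["cluster-link-udp"]=23302
multicast.group.address["hibernate"]=239.255.0.4
multicast.group.port["hibernate"]=23304
multicast.group.address["multi-vm"]=239.255.0.5
multicast.group.port["multi-vm"]=23305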

By contrast, cloud platforms don't allow multicast traffic between server instances. As a consequence, Liferay's default Cluster Link setup doesn't work (the instances will run, but cache and index changes will not be replicated across nodes). In that respect there is no difference between AWS, Azure, Google, etc.

Lots of articles on the Internet explain how to configure a Liferay cluster, and probably one of the most detailed has become part of the Liferay documentation. There are even articles explaining how to configure a Liferay cluster using unicast traffic. But I haven't found any article explaining how to configure a real auto-scaling group with a single cluster configuration shared by all nodes, so that new instances can be cloned and started automatically in any order.

As explained in those articles, several Liferay modules need to be specifically configured to work in a clustered environment:

  • Database: there are no special requirements on the database setup to enable a Liferay cluster, apart from using the same database instance across all Liferay nodes. But if you are looking for an unbreakable Liferay you can explore database cluster options, and there are also horizontal scaling options supported by Liferay such as database sharding. For our purpose, we have set up an Amazon RDS server with a Multi-AZ deployment to reduce maintenance downtime (a minimal JDBC sketch follows right after this list).
  • Document Library: must be shared across cluster instances. In a typical hosted cluster an NFS volume would be used, but Liferay provides the perfect fit for our Amazon deployment: it supports Amazon S3 buckets. Even if you already have a running Liferay instance and want to migrate its data to S3, this article explains the process.
  • Quartz: since Liferay 6.1 there is no need for an explicit cluster configuration; when Liferay detects that Cluster Link is enabled it creates all the required database tables automatically.
  • Index: Lucene indexes must be kept in sync across the cluster, usually by enabling lucene.replicate.write=true when Cluster Link is enabled. A more refined and scalable option is using Solr as an external indexing server.
  • Cluster Link: Liferay's cluster communication mechanism. It's enabled with cluster.link.enabled=true and it shares EhCache's JGroups-based peer discovery mechanism to discover cluster members.
  • EhCache: also uses JGroups for peer discovery, so it too needs to be configured to use unicast-only channels.
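
As mentioned in the database point above, the only cluster-related requirement is that every node points to the same database. A minimal sketch of that part of portal-ext.properties, assuming a MySQL-flavoured RDS endpoint (driver class, URL and credentials below are placeholders to adapt to your engine), could look like this:

# Hypothetical JDBC settings pointing every node to the same RDS instance (MySQL assumed)
jdbc.default.driverClassName=com.mysql.jdbc.Driver
jdbc.default.url=jdbc:mysql://YOUR_RDS_ENDPOINT:3306/lportal?useUnicode=true&characterEncoding=UTF-8
jdbc.default.username=YOUR_DB_USER
jdbc.default.password=YOUR_DB_PASSWORD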

All these topics have already been covered in the linked articles, but there is a flaw in all of them: the JGroups configuration. In its TCPPING section the configuration needs a static list of the initial members of the cluster. However, JGroups supports other implementations of the PING operation that don't need to know that list of members in advance. Two of them fit our case perfectly: S3_PING and JDBC_PING.
Both implementations use a shared resource (an S3 bucket or a database table) to let cluster members register and become aware of the other members. Since the database is the only non-clustered system in our platform and we want to save as much CPU time as possible, we've chosen S3_PING with an S3 bucket created ad hoc.
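
For completeness, had we gone with JDBC_PING instead of S3_PING, the discovery element in the tcp.xml shown further down would be replaced by something along these lines (a sketch based on the JGroups 3.x JDBC_PING attributes; the connection values are placeholders to adapt to your database):

<!-- Hypothetical JDBC_PING alternative: members register themselves in a shared database table -->
<JDBC_PING connection_url="jdbc:mysql://YOUR_DB_HOST:YOUR_DB_PORT/lportal"
           connection_username="YOUR_DB_USER"
           connection_password="YOUR_DB_PASSWORD"
           connection_driver="com.mysql.jdbc.Driver"/>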

With that in mind, by adding the following lines to every node's portal-ext.properties (replacing the database server values and the path to the Liferay installation with your environment-specific values) ...

# Cluster Link: enable Liferay's cluster communication and point both the control
# and transport channels to the unicast JGroups configuration below. The autodetect
# address is any host:port reachable from every node (the database server is a
# convenient choice); Liferay pings it to pick the right network interface to bind to.
cluster.link.enabled=true
cluster.link.autodetect.address=YOUR_DB_HOST:YOUR_DB_PORT
cluster.link.channel.properties.control=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml
cluster.link.channel.properties.transport.0=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml

# EhCache replication over JGroups, for both the multi-VM and the Hibernate caches
ehcache.bootstrap.cache.loader.factory=com.liferay.portal.cache.ehcache.JGroupsBootstrapCacheLoaderFactory
ehcache.cache.event.listener.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheReplicatorFactory
ehcache.cache.manager.peer.provider.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory

ehcache.multi.vm.config.location.peerProviderProperties=/jgroups/tcp.xml
ehcache.multi.vm.config.location=/ehcache/liferay-multi-vm-clustered.xml

net.sf.ehcache.configurationResourceName.peerProviderProperties=/jgroups/tcp.xml
net.sf.ehcache.configurationResourceName=/ehcache/hibernate-clustered.xml

# Replicate Lucene index writes across the cluster
lucene.replicate.write=true
index.search.writer.max.queue.size=9999999

# Store the Document Library in a shared S3 bucket
dl.store.impl=com.liferay.portlet.documentlibrary.store.S3Store
dl.store.s3.access.key=S3_ACCESS_KEY
dl.store.s3.secret.key=S3_SECRET_KEY
dl.store.s3.bucket.name=S3_DL_BUCKET_NAME

... and creating a file  tcp.xml  under   PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/ with this content ...

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">

    <TCP singleton_name="liferay"
         bind_port="7800"
         loopback="false"
         recv_buf_size="${tcp.recv_buf_size:5M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"

         timer_type="old"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <!-- Member discovery over a dedicated S3 bucket instead of TCPPING's static
         member list; replace the bucket name and credentials with your own values -->
    <S3_PING location="S3_JGROUPS_BUCKET_NAME"
             access_key="S3_ACCESS_KEY"
             secret_access_key="S3_SECRET_KEY"
             timeout="2000"
             num_initial_members="2"/>


    <MERGE2  min_interval="10000"
             max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false"
                   discard_delivered_msgs="true"/>
    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER/>

</config>


... Liferay should be able to run in an AWS EC2 cluster without any instance-specific configuration, making the instances auto-scaling capable (as long as the instances' security group allows TCP traffic between cluster members on the JGroups bind port, 7800 in the configuration above).

The main drawback of this solution is that S3 traffic will cost around $4 a month per cluster node: each node reads the bucket roughly every 3 seconds, which is about 86,400 / 3 = 28,800 requests a day, or roughly 864,000 (around 900K) S3 requests a month.

Luckily there is a third option: if a Solr cluster based on SolrCloud is under consideration, we have developed an implementation of the JGroups PING operation on top of ZooKeeper, faster and more reliable than any other implementation, available on our GitHub and ready to be downloaded from the Maven Central Repository. Our next article will dig deeper into how it works.

Finally, Liferay must be marked as <distributable/> in web.xml, the application server can be configured to replicate HttpSessions between nodes (it's not strictly necessary), and the Elastic Load Balancer should be configured with session affinity based on the JSESSIONID cookie. By the way, this configuration should work on both Liferay 6.1 and 6.2, CE and EE.
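
For reference, the web.xml change is just the standard Servlet <distributable/> marker, added as a direct child of <web-app> in the Liferay web application's WEB-INF/web.xml (a minimal fragment; the rest of the descriptor stays untouched):

<!-- Tells the servlet container this webapp is cluster-safe, so it may
     replicate HttpSessions between nodes if configured to do so -->
<distributable/>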


4 comments:

  1. Hi,

    Great document; quick question: how do you handle Liferay EE licensing on an auto-scaling cluster? I was under the impression that EE licences were tied to IP addresses and hence couldn't be deployed in an auto-scaling group?

    Thanks,
    Matt

    Replies
    1. Hi Matt,

      It's a good point. Actually we haven't had that problem because our cluster was a CE one. I think licenses aren't linked to the IP address but to some other per-machine unique id. Anyway, if you are an EE holder my suggestion is to ask the Liferay support team for a solution. You are probably not their first client running into this problem, and if you are running an auto-scaling cluster of EE nodes that means they are charging you a lot of money for support.

      Probably the best solution would be some kind of floating license server, but I've never heard of one for Liferay EE.

    2. We're the infrastructure partner, so we don't have a direct relationship with Liferay, but the implementation partner doesn't seem hopeful. The reason for needing auto-scaling is less about scale-out performance and more about failure management (using auto-scaling to replace failed nodes). Unfortunately the nodes come up with DHCP-assigned IP addresses, so we're having to manage the Liferay servers manually with static IPs (which introduces its own set of problems within CloudFormation).

      Anyway, thanks for the quick response - if I come up with an answer I'll post a follow-up!

  2. Hi Corne, thanks for your reply.

    We've been a little disconnected from the Liferay world over the last year but, as far as I know, LCS is a monitoring and patching (EE only) platform, and it doesn't provide a way of deploying floating licenses to Liferay EE.

    I've been doing a little Google research and I can't see any new Liferay EE license model for cloud and auto-scaling clusters, which is the problem highlighted by Matt J.

    Is your comment based on your own experience deploying Liferay EE in this kind of dynamic cloud cluster?

    Cheers!!
