Default Liferay clusters run using multicast communication, which enables dynamic member discovery through EhCache peer discovery over multicast groups. In fact, if you are running several Liferay clusters on the same network, you should make sure they aren't sharing multicast groups (configured in portal.properties under the multicast.group.address and multicast.group.port keys).
By contrast, cloud platforms don't allow multicast traffic between server instances. As a consequence, Liferay's default Cluster Link setup doesn't work (well, instances will start, but cache and index changes will not be replicated across nodes). There is no difference in that respect between AWS, Azure, Google, etc.
Lots of articles on the Internet explain how to configure a Liferay cluster, and probably the most detailed one has become part of the Liferay documentation. There are even articles explaining how to configure a Liferay cluster using unicast traffic. But I haven't found any article explaining how to configure a real autoscaling group with a single cluster configuration shared by all nodes, so that new instances can be cloned and started automatically in any order.
As explained in those articles, several Liferay modules need to be specifically configured to work in a clustered environment:
- Database: there are no special database requirements to enable a Liferay cluster, apart from pointing all Liferay nodes at the same database instance (see the connection sketch after this list). But if you are looking for an unbreakable Liferay you can explore database cluster options. There are also horizontal scaling options supported by Liferay, such as database sharding. For our purpose, we have set up an Amazon RDS server with a Multi-AZ deployment to reduce maintenance downtime.
- Document Library: must be shared across cluster instances. In a typical self-hosted cluster an NFS volume would be used, but Liferay provides the perfect fit for our Amazon deployment: it supports Amazon S3 buckets. Even if you already have a running Liferay instance and want to migrate its data to S3, this article explains the process.
- Quartz: since Liferay 6.1 there is no need for an explicit cluster configuration. When Liferay detects that Cluster Link is enabled, it creates all the required database tables automatically.
- Index: Lucene indexes must be kept in sync across the cluster, usually by enabling lucene.replicate.write=true when Cluster Link is enabled. A more refined and scalable option is using Solr as an external indexing server.
- Cluster Link: Liferay's cluster communication mechanism. It is enabled with cluster.link.enabled=true and it shares EhCache's peer discovery mechanism, based on JGroups, to discover cluster members.
- EhCache: also uses JGroups to discover peers, which needs to be configured to work with unicast-only channels.
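As a reference for the database point above, here is a minimal sketch of the JDBC block that every node would share in portal-ext.properties. The driver, RDS endpoint, schema name and credentials are placeholders for your own environment, not values taken from our setup:

# All nodes point at the same (Amazon RDS) database instance
jdbc.default.driverClassName=com.mysql.jdbc.Driver
jdbc.default.url=jdbc:mysql://YOUR_RDS_ENDPOINT:3306/lportal?useUnicode=true&characterEncoding=UTF-8
jdbc.default.username=YOUR_DB_USER
jdbc.default.password=YOUR_DB_PASSWORD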
All these topics have already been covered in the linked articles, but there is a flaw in all of them: the JGroups configuration. Its TCPPING section needs a static list of the initial members of the cluster. But JGroups supports other implementations of the PING operation that don't need to know that list of members beforehand. Two of them, S3_PING and JDBC_PING, fit our case perfectly.
Both implementations use a shared resource (an S3 bucket or a database table) to let cluster members register themselves and become aware of the other members. As the database is the only non-clustered system in our platform and we want to save as much CPU time on it as possible, we've chosen S3_PING with an S3 bucket created ad hoc.
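For comparison, if we had picked JDBC_PING instead, the discovery element in tcp.xml (replacing the S3_PING element shown further down) would look roughly like this sketch; the connection values are placeholders and this is not the configuration we deployed:

<!-- Cluster members register themselves in a table that JDBC_PING creates on first use -->
<JDBC_PING connection_driver="com.mysql.jdbc.Driver"
           connection_url="jdbc:mysql://YOUR_DB_HOST:3306/lportal"
           connection_username="YOUR_DB_USER"
           connection_password="YOUR_DB_PASSWORD"/>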
With that in mind, add the following lines (replacing the database server values and the path to the Liferay installation with environment-specific values) to every node's portal-ext.properties ...
cluster.link.enabled=true
cluster.link.autodetect.address=YOUR_DB_HOST:YOUR_DB_PORT
cluster.link.channel.properties.control=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml
cluster.link.channel.properties.transport.0=PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/tcp.xml
ehcache.bootstrap.cache.loader.factory=com.liferay.portal.cache.ehcache.JGroupsBootstrapCacheLoaderFactory
ehcache.cache.event.listener.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheReplicatorFactory
ehcache.cache.manager.peer.provider.factory=net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory
ehcache.multi.vm.config.location.peerProviderProperties=/jgroups/tcp.xml
ehcache.multi.vm.config.location=/ehcache/liferay-multi-vm-clustered.xml
net.sf.ehcache.configurationResourceName.peerProviderProperties=/jgroups/tcp.xml
net.sf.ehcache.configurationResourceName=/ehcache/hibernate-clustered.xml
lucene.replicate.write=true
index.search.writer.max.queue.size=9999999
dl.store.impl=com.liferay.portlet.documentlibrary.store.S3Store
dl.store.s3.access.key=S3_ACCESS_KEY
dl.store.s3.secret.key=S3_SECRET_KEY
dl.store.s3.bucket.name=S3_DL_BUCKET_NAME
... and creating a file named tcp.xml under PATH_TO_LIFERAY_WEBAPP/WEB-INF/classes/jgroups/ with the following content ...
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.1.xsd">
    <TCP singleton_name="liferay"
         bind_port="7800"
         loopback="false"
         recv_buf_size="${tcp.recv_buf_size:5M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer_type="old"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>
    <S3_PING location="S3_JGROUPS_BUCKET_NAME"
             access_key="S3_ACCESS_KEY"
             secret_access_key="S3_SECRET_KEY"
             timeout="2000"
             num_initial_members="2"/>
    <MERGE2 min_interval="10000" max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <BARRIER/>
    <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
    <UNICAST/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
    <UFC max_credits="2M" min_threshold="0.4"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K"/>
    <pbcast.STATE_TRANSFER/>
</config>
The main drawback of this solution is that the S3 traffic will cost around $4 a month per cluster node: each node polls the S3 bucket every 3 seconds, which generates around 900K requests a month against S3.
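The request count behind that estimate is straightforward: one poll every 3 seconds means 20 requests per minute, so

20 requests/min × 60 min × 24 h × 30 days = 864,000 ≈ 900K S3 requests per month per node

and the dollar figure then depends on the per-request price of your S3 region.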
Luckily there is a third option: if you are considering a Solr cluster based on SolrCloud, we have developed an implementation of the JGroups PING operation over ZooKeeper, faster and more reliable than any other implementation, available on our GitHub and ready to be downloaded from the Maven Central Repository. Our next article will dig deeper into how it works.
Hi,
Great document; quick question: how do you handle Liferay EE licensing on an auto-scaling cluster? I was under the impression that EE licences were tied to IP addresses and hence couldn't be deployed in an auto-scaling group?
Thanks,
Matt
Hi Matt,
It's a good point. Actually we haven't had that problem because our cluster was a CE one. I think licenses aren't linked to the IP address but to some other per-machine unique ID. Anyway, if you are an EE holder my suggestion is to ask the Liferay support team for a solution. You probably aren't their first client running into this problem, and if you are running an autoscaling cluster of EE nodes that means they are charging you a lot of money for support.
Probably the best solution would be some kind of floating license server, but I've never heard of one for Liferay EE.
We're the infrastructure partner, so we don't have a direct relationship with Liferay, but the implementation partner doesn't seem hopeful. The reason for needing the auto-scaling is less about scale-out performance and more about failure management (using auto-scaling to replace failed nodes). Unfortunately the nodes come up with DHCP-assigned IP addresses, so we're having to manage the Liferay servers manually with static IPs (which introduces its own set of problems within CloudFormation).
Anyway, thanks for the quick response - if I come up with an answer I'll post a follow-up!
Hi Corne, thanks for your reply.
We've been a little bit disconnected from the Liferay world in the last year but, as far as I know, LCS is a monitoring and patching (EE only) platform, and doesn't provide a way of deploying floating licenses to Liferay EE.
I've been doing a little Google research and I can't see any new Liferay EE license model for cloud and autoscaling clusters, which is the problem highlighted by Matt J.
Is your comment based on your own experience deploying Liferay EE in this kind of dynamic cloud cluster?
Cheers!!