Thursday, September 27, 2012

Duplicates with Tibco EMS XA in FT mode

2012-09-27 08:47:47.733 [MSG:252] sent to user='anonymous': connID=5 consID=7 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
FT switch
2012-09-27 08:47:49.330 [MSG:252] sent to user='anonymous': connID=5 consID=3 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.377 [MSG:252] sent to user='anonymous': connID=5 consID=2 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.440 [MSG:252] sent to user='anonymous': connID=5 consID=7 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.455 [MSG:252] sent to user='anonymous': connID=5 consID=9 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.564 [MSG:252] sent to user='anonymous': connID=5 consID=8 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.814 [MSG:252] sent to user='anonymous': connID=5 consID=11 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:49.876 [MSG:252] sent to user='anonymous': connID=5 consID=2 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'
2012-09-27 08:47:50.126 [MSG:252] acknowledged by user='anonymous': connID=5 consID=2 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'



After the FT switch the same message is delivered to different EMS sessions and the transactions are committed. Redelivery is by design (see the Shared State Failover chapter):

'For queue receivers, any messages that have been sent to receivers, but have not
been acknowledged before the failover, may be sent to other receivers
immediately after the failover.
A receiver trying to acknowledge a message after a failover may receive the
javax.jms.IllegalStateException. This exception signifies that the attempted
acknowledgement is for a message that has already been sent to another queue
receiver. This exception only occurs in this scenario, or when the session or
connection have been closed.',


but not throwing an error during transaction commit is a bug in BW (at first I suspected the EMS server).
Setting the exclusive attribute on the queue can prevent this disaster (at the cost of performance), but pending transactions can still affect the overall message count consistency (workaround: all prepared transactions should be committed; rollback is harmful).
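
To spot such redeliveries without reading the trace by eye, the "sent to" lines can be grouped by msgID. This is only a rough sketch: the class name RedeliveryScan and the regular expression are my own assumptions, derived from the shape of the trace lines at the top of this post, not anything shipped with EMS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups "sent to" trace lines by msgID. */
class RedeliveryScan {
    // Assumed line shape: ... sent to user='..': connID=N consID=N msgID='..' ...
    private static final Pattern SENT = Pattern.compile(
            "sent to user='[^']*': connID=(\\d+) consID=(\\d+) msgID='([^']+)'");

    /** Returns msgID -> consumer IDs in delivery order (repeats are kept). */
    static Map<String, List<Integer>> deliveries(List<String> lines) {
        Map<String, List<Integer>> byMsg = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = SENT.matcher(line);
            if (m.find()) {
                byMsg.computeIfAbsent(m.group(3), k -> new ArrayList<>())
                     .add(Integer.valueOf(m.group(2)));
            }
        }
        return byMsg;
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
            "2012-09-27 08:47:47.733 [MSG:252] sent to user='anonymous': connID=5 consID=7 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'",
            "2012-09-27 08:47:49.330 [MSG:252] sent to user='anonymous': connID=5 consID=3 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'",
            "2012-09-27 08:47:50.126 [MSG:252] acknowledged by user='anonymous': connID=5 consID=2 msgID='ID:EMS-SERVER.15905063F50823:111' queue='test.bw1'");
        // Any msgID with more than one entry was delivered more than once.
        deliveries(trace).forEach((id, consumers) ->
            System.out.println(id + " -> consumers " + consumers));
    }
}
```

Acknowledgement lines are ignored on purpose, so the result shows only how often and to which consumers a message was handed out.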

Update 28.09.2012: Tibco Support is not very helpful so I'm digging deeper.

I assume xa_end should come right after the JMS Send, so I take the XID from xa_end and then follow what happens to that transaction (I correlate xa_end with the send using conn+sess):

2012-09-27 08:47:49.892 [735917889 JobCourier2] [TIBCO EMS]: [J] xa start conn=6 sess=35 xid=< 131075, 29, 27, 494545535110110010248561005899514857585348545110254535458525457455351101100102485610058995148575853485451102545354585210257 > ({formatID=131075 gtrid_length=29 bqual_length=27 data=1--53edf08d:c309:5063f656:469-53edf08d:c309:5063f656:4f9}) flags=0
2012-09-27 08:47:49.892 [735917889 JobCourier2] [TIBCO EMS]: [J] QueueSender Send conn=6 sess=35 prod=8 dest=test.fw1 msgid=ID:EMS-SERVER.15905063F50823:159 dlvmode=2 pri=4 ttl=0
2012-09-27 08:47:49.892 [735917889 JobCourier2] [TIBCO EMS]: [J] xa end conn=6 sess=35 xid=< 131075, 29, 27, 494545535110110010248561005899514857585348545110254535458525457455351101100102485610058995148575853485451102545354585210257 > ({formatID=131075 gtrid_length=29 bqual_length=27 data=1--53edf08d:c309:5063f656:469-53edf08d:c309:5063f656:4f9}) flags=4000000
2012-09-27 08:47:50.126 [735917889 JobCourier2] [TIBCO EMS]: [J] xa commit conn=6 xid=< 131075, 29, 27, 494545535110110010248561005899514857585348545110254535458525457455351101100102485610058995148575853485451102545354585210257 > ({formatID=131075 gtrid_length=29 bqual_length=27 data=1--53edf08d:c309:5063f656:469-53edf08d:c309:5063f656:4f9}) onephase=false

Wow. The XID doesn't enclose the JMS Receive. So we've got a transaction only for the send, and it is committed successfully. That's the reason we've got duplicates.
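
A side note on reading the trace: the long number inside xid=< ... > appears to be the gtrid and bqual bytes printed as concatenated decimal ASCII codes, and the data= field is the same thing in readable form. Below is a minimal sketch of decoding it, assuming printable ASCII only. The class name XidDecoder is hypothetical.

```java
/** Hypothetical helper: decodes concatenated decimal ASCII codes from the xid trace. */
class XidDecoder {
    /** Assumes every code is printable ASCII (32..126). */
    static String decode(String digits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < digits.length()) {
            // Codes 100..126 start with '1'; codes 32..99 never do,
            // so the first digit tells us how many digits to consume.
            int width = digits.charAt(i) == '1' ? 3 : 2;
            if (i + width > digits.length()) {
                throw new IllegalArgumentException("truncated code at offset " + i);
            }
            out.append((char) Integer.parseInt(digits.substring(i, i + width)));
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The gtrid+bqual digits copied from the xa start/end/commit lines above.
        String raw = "494545535110110010248561005899514857585348545110254535458525457"
                   + "455351101100102485610058995148575853485451102545354585210257";
        System.out.println(decode(raw));
    }
}
```

The printed string should match the data= field in the trace, with the first 29 characters forming the gtrid and the remaining 27 the bqual.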


Why is the JMS Receive not enclosed in the transaction? Because it was enlisted once, in a transaction that wasn't committed successfully, and it's not visible to other transactions (which is why we observe transactions only for sending). The message is acked after being redelivered to the session it originated from. So this is a BW bug.

Update 01.10.2012: There is a bug in com.tibco.plugin.share.jms.impl.JMSReceiver.SessionController: the flag requesting a new transaction is not cleared in error handling.

Wednesday, September 26, 2012

BW XA examined


We sent 128 messages to the queue test.fw1, each with a sequence number as its body:
2012 Sep 26 10:11:55:974 GMT +2 BW.xa-test User [BW-User] - Job-10000 [testcase/trigger.process/Group/Log]: SENT ID:EMS-SERVER.CB75062B89D4:1 SEQ 1
2012 Sep 26 10:11:55:990 GMT +2 BW.xa-test User [BW-User] - Job-10000 [testcase/trigger.process/Group/Log]: SENT ID:EMS-SERVER.CB75062B89D4:2 SEQ 2
...

Then we moved them back and forth between the queues test.bw1 and test.fw1, writing the preceding JMSMessageID (PREV) as the body of each new message:
2012 Sep 26 10:15:37:884 GMT +2 BW.xa-test User [BW-User] - Job-11023 [testcase/BW.process/Log]: FROM ID:EMS-SERVER.CB75062B89D26:135 TO ID:EMS-SERVER.CB75062B89D25:147 PREV 5
2012 Sep 26 10:15:37:884 GMT +2 BW.xa-test User [BW-User] - Job-11017 [testcase/FW.process/Log]: FROM ID:EMS-SERVER.CB75062B89D4:18 TO ID:EMS-SERVER.CB75062B89D24:146 PREV 18

While the messages were being moved, the active EMS server was killed.

user@WAR-LAP-510 /cygdrive/c/Users/user/Desktop
$ cat log_recv.txt | awk '{print $16}' | sort | wc -l
143

user@WAR-LAP-510 /cygdrive/c/Users/user/Desktop
$ cat log_recv.txt | awk '{print $16}' | sort | grep -v NaN | wc -l
120

Why is the JMS body NaN?

user@WAR-LAP-510 /cygdrive/c/Users/user/Desktop
$ cat log_recv.txt | awk '{print $16}' | sort -u | grep -v NaN | wc -l
111

duplicate ID:EMS-SERVER.CB75062B89D24:416
duplicate ID:EMS-SERVER.CB75062B89D25:406
duplicate ID:EMS-SERVER.CB75062B89D25:414
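
The counts above only show that duplicates exist. Listing them is the same idea as sort | uniq -d on the 16th column. Here is a small sketch of that in Java, assuming the log lines have the same whitespace-separated layout as the awk commands rely on. The class name DuplicateIds is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: lists values of the 16th field that occur more than once. */
class DuplicateIds {
    static List<String> duplicates(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Field 16 (index 15) is the PREV value, the same column awk prints as $16.
            if (fields.length < 16 || "NaN".equals(fields[15])) {
                continue;
            }
            counts.merge(fields[15], 1, Integer::sum);
        }
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "2012 Sep 26 10:19:28:633 GMT +2 BW.xa-test User [BW-User] - Job-12085 [testcase/cleanQueuesByConsuming/consumeFW.process/Log]: CURRENT ID:EMS-SERVER.CB75062B89D23:468 PREV ID:EMS-SERVER.CB75062B89D24:416",
            "2012 Sep 26 10:19:28:773 GMT +2 BW.xa-test User [BW-User] - Job-12091 [testcase/cleanQueuesByConsuming/consumeFW.process/Log]: CURRENT ID:EMS-SERVER.CB75062B89D20:484 PREV ID:EMS-SERVER.CB75062B89D24:416");
        for (String id : duplicates(log)) {
            System.out.println("duplicate " + id);
        }
    }
}
```

NaN bodies are skipped here, just as grep -v NaN does in the shell commands.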

Let's see ID:EMS-SERVER.CB75062B89D24:416:

2012 Sep 26 10:15:44:354 GMT +2 BW.xa-test User [BW-User] BW-XATM-100005 Job-11316 [testcase/BW.process/Log-1]: ERR FROM ID:EMS-SERVER.CB75062B89D24:416 TO ID:EMS-SERVER.CB75062B89D26:443 PREV NaN
2012 Sep 26 10:15:44:775 GMT +2 BW.xa-test User [BW-User] - Job-11343 [testcase/BW.process/Log]: FROM ID:EMS-SERVER.CB75062B89D24:416 TO ID:EMS-SERVER.CB75062B89D23:468 PREV NaN
2012 Sep 26 10:15:45:009 GMT +2 BW.xa-test User [BW-User] - Job-11354 [testcase/BW.process/Log]: FROM ID:EMS-SERVER.CB75062B89D24:416 TO ID:EMS-SERVER.CB75062B89D20:484 PREV NaN
2012 Sep 26 10:19:28:633 GMT +2 BW.xa-test User [BW-User] - Job-12085 [testcase/cleanQueuesByConsuming/consumeFW.process/Log]: CURRENT ID:EMS-SERVER.CB75062B89D23:468 PREV ID:EMS-SERVER.CB75062B89D24:416
2012 Sep 26 10:19:28:773 GMT +2 BW.xa-test User [BW-User] - Job-12091 [testcase/cleanQueuesByConsuming/consumeFW.process/Log]: CURRENT ID:EMS-SERVER.CB75062B89D20:484 PREV ID:EMS-SERVER.CB75062B89D24:416

The same message ID:EMS-SERVER.CB75062B89D24:416 is delivered three times, twice to a successful process!
The timestamps suggest that the duplication is due to the failed transaction.

ERR FROM ID:EMS-SERVER.CB75062B89D24:416 TO ID:EMS-SERVER.CB75062B89D26:443 PREV NaN

NaN in the body may indicate that the message was received damaged because the s11601.dc server was killed, and that a new message was then sent to s11660.dc2.
Next we see an attempt to finish the transaction on s11601.dc, which is dead and no longer available, ending in XAER_NOTA:
2012 wrz 26 10:15:44:354 CEST BW.xa-test Warn [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackxaerror] [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackxaerror] XAResourceRecord.rollback - xa error XAException.XAER_NOTA

XA handling is not aware of the active-passive failover of the EMS connection, and as a result messages are lost or duplicated.

https://access.redhat.com/knowledge/docs/en-US/JBoss_Enterprise_Web_Platform/5/html/Transactions_JTA_Development_Guide/chap-Transactions_JTA_Programmers_Guide-Transaction_Recovery.html

"Because XAResource objects are not persistent across system failures, the Transaction Manager needs the ability to acquire the XAResource objects that represent the resource managers which might have participated in the transactions prior to a system failure. For example, a Transaction Manager might use the JNDI look-up mechanism to acquire a connection from each of the transactional resource factories, and then obtain the corresponding XAResource object for each connection. The Transaction Manager then invokes the XAResource.recover method to ask each resource manager to return the transactions that are currently in a prepared or heuristically completed state."

"When running XA recovery, you must tell JBoss Transaction Service which types of Xid it can recover. Each Xid that JBoss Transaction Service creates has a unique node identifier encoded within it, and JBoss Transaction Service only recovers transactions and states that match the requested node identifier. The node identifier to use should be provided to JBoss Transaction Service in a property that starts with the name com.arjuna.ats.jta.xaRecoveryNode. Multiple values are allowed. A value of * forces recovery, and possibly rollback, of all transactions, regardless of their node identifier. Use it with caution. "

The Xid contains the JMS connection ID; however, an Xid created on the first EMS server can be accepted by the second EMS server because information about connections is persisted to disk. During the recovery phase the Xid list is not rebuilt properly on the second server, so transactions are not carried over to it.

Thursday, September 20, 2012

JBoss 7 UserTransaction via JNDI

import java.util.Hashtable;
import java.util.Properties;

import javax.naming.Context;
import javax.naming.Name;
import javax.naming.NamingException;
import javax.naming.spi.InitialContextFactory;

import org.jboss.ejb.client.ContextSelector;
import org.jboss.ejb.client.EJBClient;
import org.jboss.ejb.client.EJBClientConfiguration;
import org.jboss.ejb.client.EJBClientContext;
import org.jboss.ejb.client.PropertiesBasedEJBClientConfiguration;
import org.jboss.ejb.client.remoting.ConfigBasedEJBClientContextSelector;

// JNDI context factory whose only resolvable name is the remote UserTransaction.
public class JBossUserTransactionContextFactory implements InitialContextFactory, Context {
    private Properties p = new Properties();
    private String host = "localhost";

    public Object lookup(Name name) throws NamingException {
        return lookup(name != null ? name.toString() : null);
    }

    public Object lookup(String name) throws NamingException {
        System.setProperty("jboss.node.name", host);
        if (name != null && name.contains("UserTransaction"))
            return EJBClient.getUserTransaction(host);
        else
            throw new NamingException("not found: " + name);
    }

    private static String getVal(Hashtable<?, ?> h, String key) {
        return h.get(key) != null ? h.get(key).toString() : null;
    }

    private boolean isNumber(String s) {
        try {
            Integer.valueOf(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public Context getInitialContext(Hashtable<?, ?> h) throws NamingException {
        p.put("endpoint.name", "client-endpoint");
        p.put("remote.connectionprovider.create.options.org.xnio.Options.SSL_ENABLED", "false");
        p.put("remote.connections", "default");
        p.put("remote.connection.default.connect.options.org.xnio.Options.SASL_POLICY_NOANONYMOUS", "false");
        String url = getVal(h, PROVIDER_URL);
        String user = getVal(h, SECURITY_PRINCIPAL);
        String pass = getVal(h, SECURITY_CREDENTIALS);
        if (url == null)
            url = "remote://localhost:4447";

        // Accepted forms: scheme://host:port, host:port, scheme://host, host
        String[] tokens = url.split(":");
        String port = "4447";
        if (tokens.length == 3) {
            host = tokens[1];
            port = tokens[2];
        } else if (tokens.length == 2) {
            if (isNumber(tokens[1])) {
                host = tokens[0];
                port = tokens[1];
            } else {
                host = tokens[1];
            }
        } else if (tokens.length == 1) {
            host = tokens[0];
        }
        // Strip the "//" left over from the scheme; safe for an empty host too.
        while (host.startsWith("/"))
            host = host.substring(1);

        p.put("remote.connection.default.port", port);
        p.put("remote.connection.default.host", host);
        // Properties rejects null values, so set credentials only when given.
        if (user != null)
            p.put("remote.connection.default.username", user);
        if (pass != null)
            p.put("remote.connection.default.password", pass);

        EJBClientConfiguration cc = new PropertiesBasedEJBClientConfiguration(p);
        ContextSelector<EJBClientContext> selector = new ConfigBasedEJBClientContextSelector(cc);
        EJBClientContext.setSelector(selector);
        return this;
    } // ....
}

Wednesday, September 19, 2012

Distributed transactions in Tibco

JNDI Context Factory = org.jboss.naming.NamingContextFactory. Resources can be used only through the JNDI of an external application server.

How to run CERBToken Java Micro Edition on Windows with a GUI

Microemulator: File -> Open MIDlet File -> select the JAR.

Sunday, September 09, 2012

DRBD on Oracle Linux 6.3 with OEK 2.6.39

The situation differs from Red Hat in that DRBD 8.3.11 is already in the Oracle Enterprise Kernel, so only the userspace tools are needed. The resource configuration looks like this:

resource eai-store-01 {
    protocol C;
    startup {
        become-primary-on both;
    }
    syncer {
        verify-alg sha1;
        csums-alg md5;
        rate 100M;
    }
    net {
        allow-two-primaries;
        data-integrity-alg sha1;

        connect-int 20;
        ping-int 15;
        timeout 10;
        cram-hmac-alg sha1;
        shared-secret 7812352gsd7t327r56;

        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        on-io-error pass_on;
        fencing dont-care;
    }
    on eai1 {
        device minor 1;
        disk /dev/sdb1;
        meta-disk internal;
        address 192.168.100.71:7789;
    }
    on eai2 {
        device minor 1;
        disk /dev/sdb1;
        meta-disk internal;
        address 192.168.100.72:7789;
    }
}

Thursday, September 06, 2012

Cluster on RHEL 6

ccs_tool create eai-cluster-01 -2
cat /etc/cluster/cluster.conf


<?xml version="1.0"?>
<cluster name="eai-cluster-01" config_version="1">
  <dlm enable-fencing="0"/>

  <clusternodes>
    <clusternode name="eai1" votes="1" nodeid="1">
      <fence>
        <!--method name="single">
        </method-->
      </fence>
    </clusternode>
    <clusternode name="eai2" votes="1" nodeid="2">
      <fence>
        <!--method name="single">
        </method-->
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
  </fencedevices>

  <rm>
    <resources/>
    <service autostart="1" name="store">    
        <fs device="/dev/drbd1"
        mountpoint="/mnt/eai-store-01" fstype="gfs2"
        name="drbd-eai-store-01 " options="noatime"/>   
    </service>
    <failoverdomains/>    
  </rm>
</cluster>

cat /etc/corosync/corosync.conf

compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.7.1
        mcastaddr: 226.94.1.1
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: on
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

cat /etc/drbd.d/eai-store-01.res

resource eai-store-01 {
    startup {
      become-primary-on both;

      wfc-timeout 35;
      degr-wfc-timeout 35; 
      outdated-wfc-timeout 35;     
    }
    net {
      protocol C;
      allow-two-primaries yes;
      verify-alg sha1;

#    data-integrity-alg sha1;
      csums-alg md5;
   
      connect-int 20;
      ping-int 15;
      timeout 10;
      cram-hmac-alg sha1;
      shared-secret b21372y3egsg613;
   
      after-sb-0pri discard-zero-changes;
      after-sb-1pri violently-as0p;
      after-sb-2pri violently-as0p;
    }
    disk {
     on-io-error pass_on;
     fencing dont-care;

     disk-timeout 5000;
    }
    on eai1 {
     volume 0 {
        device minor 1;
        disk /dev/sdb1;
        meta-disk internal;       
     }
     address 192.168.7.71:7789;
    }
    on eai2 {
     volume 0 {
        device minor 1;
        disk /dev/sdb1;
        meta-disk internal;       
     }
     address 192.168.7.72:7789;
    }
}


chkconfig iptables off
chkconfig cman on
/etc/init.d/cman start
modprobe drbd

mv /usr/etc/* /etc/ #drbd --prefix=/usr

drbdadm create-md eai-store-01
drbdadm up eai-store-01
mkfs.gfs2 -j3 -t eai-cluster-01:eai-store-01 /dev/drbd1
mount /dev/drbd1 /mnt/eai-store-01
chkconfig gfs2 on
chkconfig drbd on
chown -R nfsnobody:nfsnobody /mnt/eai-store-01/ems/config/datastore
 

Monday, September 03, 2012

IPMP on RHEL 6

cat /etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0
IPADDR=192.168.7.71
NETMASK=255.255.255.0
NETWORK=192.168.7.0
BROADCAST=192.168.7.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
GATEWAY=192.168.7.1
TYPE=Ethernet
BONDING_OPTS="mode=balance-alb 802.3ad broadcast miimon=100 use_carrier=0 debug=1 xmit_hash_policy=layer3+4"

ln -s /etc/sysconfig/network-scripts/ifup-eth /etc/sysconfig/network-scripts/ifup-bond0
ln -s /etc/sysconfig/network-scripts/ifdown-eth /etc/sysconfig/network-scripts/ifdown-bond0
echo "alias bond0 bonding" > /etc/modprobe.d/bonding.conf


cat /etc/sysconfig/network-scripts/ifcfg-ethX

DEVICE=ethX
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
HWADDR=08:00:27:01:02:7X

chkconfig --del NetworkManager
chkconfig --add network
chkconfig network on

IPMP in Solaris is more advanced. The Linux implementation was backed by IBM.