Friday, November 15, 2013

Oracle Please Stop Shipping Ask.com Toolbar

Recently I configured a JRE on a Windows machine. It was not a developer machine and the goal was just to run some Java, so I installed the JRE. Normally I download everything from oracle.com, but I was lazy this time and Google presented java.com as the first result. I had never used java.com. I know Java is from Oracle, but the paranoia routine in my brain has not yet white-listed this site as having anything to do with Oracle, and as such I refuse to associate it with them. But hey, as I said, I was lazy and the site seemed legit, so I started downloading.

While going through the install wizard, I suddenly got this screen:



I was baffled to see this. The first thing that went through my mind was: f***! I must have downloaded a compromised installer! hAx0orz in my interwebs! So I immediately executed evasive maneuvers theta 3 for Windows users: no time for checksums! Pulled out the power, ripped out the RAM and HDD and fried them instantly.

After I had calmed down, I was curious to find out which compromised installer I had downloaded. So I used a working machine to go out on the Internet. It turned out I was in for an even bigger surprise: it's just part of the official installer! http://www.java.com/en/download/faq/ask_toolbar.xml

Even more, this is not even breaking news, as it has been around for quite some time (and I'm not the first one writing about this). It always escaped me since it's only in the JRE, and only in the one you download from java.com *sigh*. The JDK and JRE exe's coming from oracle.com don't contain this crap.

This is bad. Really bad. Yes, you can disable it during install, true, but that won't cut it. The fact that it is only in JRE installs via java.com makes it even worse. People working professionally with Java (requiring a JDK install instead) could see through this if it were in that type of installer (but then it would probably not generate profit either). However, systems using JREs are end user systems (or end user systems managed by other professionals). These are the people working with the applications we write. If these people start to associate Java with crapware, then we have a problem.

It might well be that this was a decision taken by Sun and that Oracle now has to live with the contract, but then it's time they ended it. This should stop, as it's damaging the reputation of Java altogether!

I kindly invite you to join me in signing this petition, at least if you also agree this should stop: https://www.change.org/petitions/oracle-corporation-stop-bundling-ask-toolbar-with-the-java-installer

Tuesday, October 29, 2013

Building a SOAP Webservices Proxy Module using Spring Webservices

Some time ago I wanted to see how easy it would be to write a web services proxy (wsproxy) using Spring Web Services. So, I thought I'd share the result on GitHub. Feel free to use it (Apache v2 license) or let it serve as a basis for your own development. The remainder of the article explains the idea, how I used Spring Web Services to build it, and gives a short guide on how to use the current implementation.

A wsproxy in this context is a central soap-aware access layer which relays messages between systems. This relaying works in two directions: for services hosted internally (inbound mode), but also for services hosted externally where we are the client (outbound mode). One can compare this with a traditional http forward/reverse proxy, but instead of operating on the transport level it goes one level higher in the stack and deals with application messages; in this case soap.

In outbound mode our internal services will (normally) use the wsproxy as an http forward proxy. It will then deal with delivering the received message to the actual target. In inbound mode the module will act as a reverse proxy, accepting incoming messages from external clients and relaying them to our internal services.

Before we continue, a few words about forward/reverse proxies:

A forward proxy is something the client explicitly configures in addition to specifying a target URL. The http stack of the client will send the message to the configured proxy and send the host/port of the actual desired target via http headers (the Host header). The proxy then deals with forwarding the request to the desired target host/port and returning the response to the client. Thus the proxy composes the host and port of the target URL out of the http Host header; for the path it uses the path of the request as received from the client. A reverse proxy, on the other hand, behaves (from the point of view of the client) as the actual target server. The client is not aware of any proxies and does not have to configure anything special: what the client uses as target URL is the URL of the reverse proxy. The reverse proxy will intercept the message from the client and forward it to the actual target within the network. The reverse proxy requires extra configuration, for example the URL (host, port and possibly path) of the target service, as host/port cannot be deduced from the request or http headers.
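To make the difference concrete, the two styles produce different raw http requests (hostnames hypothetical). Via a forward proxy the client sends something like this to the proxy's host/port:

POST http://service.example.com/orders HTTP/1.1
Host: service.example.com

while via a reverse proxy the client simply targets the proxy itself and has no idea there is a backend behind it:

POST /orders HTTP/1.1
Host: reverseproxy.example.com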

When using this module, one can still put an actual http reverse proxy on the network boundaries for TLS offloading or other transport purposes. For external inbound traffic (messages coming from external clients destined for internal services) the wsproxy module will simply be the second reverse proxy in line for incoming requests. For internal outbound traffic (messages from internal clients destined for external endpoints) the URL of the http reverse proxy is configured as the target URL for that specific service on the wsproxy module.

One of the benefits of having such a soap-aware wsproxy is centralizing concerns. The goal is that soap traffic passing through can be intercepted. This way we can implement things such as audit logging, monitoring, access control, message security, and so on. Doing so, we create a centralized place to deal with these requirements instead of having to re-implement them per application.

I chose Spring Web Services because there is no complicated infrastructure or design to understand. It is also very extensible and offers re-use at the right level for our requirements.
However, at this point I should also mention that there are existing solutions out there which can do this as well, and more. They often go under the name of xml security gateways and come as software packages or fully equipped appliances. As always, you should weigh the benefits of these existing solutions against writing something yourself. Fact is that they don't come for free (to say the least) and you still need someone with the right skills to configure and maintain them. As we will see, our requirements are easy to fulfill with a bit of code (and the help of Spring Web Services), giving us all the control we need.

For the requirements, I had an easy-to-extend design in mind with the following out-of-the-box features:

  • Outbound mode needs to be configuration free for standard usage. This means that when we need to access a new external service, our proxy should relay messages without requiring extra configuration
  • Messages passing the gateway need to be logged. Without extra configuration the entire message should be logged by default. Optionally, we need to be able to configure more fine-grained which parts need to be logged for specific services
  • For (mostly external) outbound communication we should be able to configure a forward proxy or override the target with a pre-configured host (an existing reverse proxy or the actual endpoint)
  • The module must be able to forward messages over secure transport in case there is no external reverse proxy present for offloading outbound secure transport. Inbound secure transport is handled by the container the module runs on, so this is out of scope for the module
  • Be able to apply and handle message integrity/confidentiality
The component design looks like this:

There are 3 main components. The endpoint (the catch-all endpoint in the drawing), which acts as the receiver for the message being sent to the wsproxy. The forwarder, which relays the message to the target. Finally, the interceptor chains are the hooks where we are able to intercept the messages being sent/received and do something with them.

These 3 components are offered by Spring Web Services: the endpoint is an org.springframework.ws.server.endpoint.annotation.Endpoint implementing org.springframework.ws.server.endpoint.MessageEndpoint to be able to receive the raw payload. The forwarder uses org.springframework.ws.client.core.WebServiceTemplate, and the interceptor chains are org.springframework.ws.client.support.interceptor.ClientInterceptor and/or org.springframework.ws.server.EndpointInterceptor, depending on which side they need to function (more on that later). For message security we will use WSS4J, but this is just an implementation of an interceptor, so not a new component.

It is important to realize that there are two interceptor chains. From the point of view of the wsproxy, we'll call the first one the "inbound chain"; this is the one operating between the client and the wsproxy. The "outbound chain" is the one operating between the wsproxy and the target endpoint. So, if we have an internal client accessing an external endpoint via our wsproxy, the inbound chain is invoked when the message is received by the wsproxy, and the outbound chain is invoked from the moment the wsproxy relays the message to the target endpoint. Spring has two interfaces to distinguish on which "side" an interceptor operates (an interceptor can also implement both interfaces, making it able to function on both sides). The org.springframework.ws.server.EndpointInterceptor operates on the endpoint side; for the wsproxy this is inbound. The org.springframework.ws.client.support.interceptor.ClientInterceptor operates on the client side; for the wsproxy this is outbound. By the way, we're using inbound and outbound rather than the original Spring naming (client/endpoint) to avoid confusion: as you've noticed by now, the wsproxy is itself both an endpoint and a client. When we refer to the "client" we mean the actual service client, and the "endpoint" is the actual target service.

The module itself runs on a standard JEE servlet container as your typical Spring application. For all inbound traffic the http (or https) connector of the container is used. For all outbound traffic the WebServiceTemplate is used, configured with commons httpclient under the hood, which we'll be able to use for both http and https if required. Service identification is done "doclit" style: we take the first element of the body, including its namespace, represented as a QName. This identification is important, as we'll have configuration on a per-service basis, such as the forward proxy, forwarding protocol, endpoint URL mapping, specific loggers, etc.
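As an illustration of this "doclit" identification, a minimal sketch (not the actual module code) that derives the service QName from the first element inside the soap Body could look like this:

import java.io.StringReader;

import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ServiceIdSketch {

 // Returns the QName of the payload root element (the first element inside the soap Body)
 public static QName serviceId(String soapMessage) throws XMLStreamException {
  XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(soapMessage));
  boolean inBody = false;
  while (reader.hasNext()) {
   if (reader.next() == XMLStreamReader.START_ELEMENT) {
    if (inBody) {
     return reader.getName(); // for example {http://wsproxy.error.be/}getCurrentDate
    }
    inBody = "Body".equals(reader.getLocalName());
   }
  }
  throw new XMLStreamException("No payload root element found");
 }
}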

OK, enough of this, let's take this baby for a spin! Import the project into your IDE of choice; make sure you import it as a Maven project, as Maven has to filter the file active_environment.properties (this is done automatically via the default profile). Then we will:

  • Set up a normal standalone soap-based endpoint
  • Deploy wsproxy
  • Use a web services client to access the endpoint via the proxy module
For bootstrapping a simple endpoint there is a class SimpleEndpoint provided in the test sources, which uses the JDK-internal JAX-WS and http server to bootstrap a webservice endpoint:
import java.util.Date;

import javax.jws.WebService;
import javax.xml.ws.Endpoint;

public class SimpleEndpoint {

 public static void main(String args[]) {
  Endpoint.publish("http://localhost:9999/simple", new SimpleWebServiceEndpoint());
 }

 @WebService
 public static class SimpleWebServiceEndpoint {
  public Date getCurrentDate(String randomParameter) {
   return new Date();
  }
 }
}
Just run this as a Java application; it will keep running until the process is killed. Boot the wsproxy by deploying the project to your server of choice (I'll be using Tomcat 7); there is no extra configuration required. As for the client we'll be using soap-ui (you can also use cURL if you like). In soap-ui we first have to create a project. We do this based on the WSDL exposed by our test service (accessible under http://localhost:9999/simple?WSDL). Next, we'll have to configure our wsproxy module as the http forward proxy in soap-ui:
The soap-ui projects are also available in the project if you want. Don't forget to enable the proxy settings as explained above; they are not saved as part of the project.

Important: don't forget to disable the proxy settings again if you are starting a new project. soap-ui uses the proxy settings for standard http traffic and not only for soap/http. For example, when creating a new project based on a WSDL URL, soap-ui will use the http proxy settings to retrieve the WSDL as well. Since the wsproxy module is not a pure http proxy (but a soap proxy instead) it will not allow non-soap traffic through.

The last thing we need to configure is the target URL in soap-ui. The wsproxy module is by default (on Tomcat at least) deployed under the context root named after the filename. In our case this means the module is reachable under: http://localhost:8080/ws-proxy/
There are two options:

  • Deploy the module under the root of the application server (/) instead. In that case nothing needs to be changed in the target URL; it remains the same URL you would use without the proxy module
  • Use a context root of choice, but in that case you'll have to prefix the context root to the target URL
In our case we are in the second scenario; this means that we'll have to change the proposed target URL from "http://localhost:9999/simple" to "http://localhost:9999/ws-proxy/simple".

What happens is that soap-ui sends the request to the host/port specified in the proxy settings (it will thus not send the request to localhost:9999 but to localhost:8080 instead). The path, however, is retained: the request is actually sent to localhost:8080 with path "ws-proxy/simple". With the module deployed under "ws-proxy" you can now see why this path prefix has to be there; if the path started with "simple" we would get a 404. The remainder of the path is not important for the infrastructure, as the Spring dispatcher servlet (configuration can be found in WsProxyWebApplicationInitializer) is bound to "/*", so every subsequent path is handled by the servlet in every case.
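For reference, the same request fired with cURL instead of soap-ui would look something like this (request.xml here is a hypothetical file containing the soap envelope; -x sets the forward proxy):

curl -x http://localhost:8080 \
     -H "Content-Type: text/xml;charset=UTF-8" \
     --data-binary @request.xml \
     http://localhost:9999/ws-proxy/simple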

To be able to forward the message to the actual target, the module will calculate the target URL:

  • First check if there is a pre-configured target URL for the given endpoint, based upon the service identification (payload root element + namespace). This is configured in EndpointTargetUrlMapping, as we will see later.
  • If nothing is found, check if there is an http Host header present and use its host:port as the target server. For the path, use the path as present in the request, but subtract the context root under which this module is deployed (if any)
The latter means that in our case, with the module deployed under "ws-proxy" and the request path being "ws-proxy/simple", the resulting target URL is "http://localhost:9999/simple". When executing the request, we'll get this answer:


In the wsproxy log file we can see the intercepted request and response being logged:
51948 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.interceptors.internalchain.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate INBOUND SIDE Request:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsp="http://wsproxy.error.be/">
   <soapenv:Header/>
   <soapenv:Body>
      <wsp:getCurrentDate>
         <!--Optional:-->
         <arg0>?</arg0>
      </wsp:getCurrentDate>
   </soapenv:Body>
</soapenv:Envelope>

51949 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.core.ForwardingClient  - Using information from Host header as hostname/port
51949 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.core.ForwardingClient  - Got webservice forwarding request, sending to:http://localhost:9999/simple
51981 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.core.ForwardingClient  - Using interceptors:[class be.error.wsproxy.interceptors.externalchain.HttpRequestHeaderTransfererInterceptor]
51981 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.core.ForwardingClient$3  - Opening [org.springframework.ws.transport.http.HttpComponentsConnection@1dd5e19a] to [http://localhost:9999/simple]
51991 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.core.ForwardingClient  - Forwarding (http://localhost:9999/simple) done.
51994 [http-bio-8080-exec-5] DEBUG be.error.wsproxy.interceptors.internalchain.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate INBOUND SIDE Response:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
  <S:Body>
    <ns2:getCurrentDateResponse xmlns:ns2="http://wsproxy.error.be/">
      <return>2013-10-28T15:55:29.717+01:00</return>
    </ns2:getCurrentDateResponse>
  </S:Body>
</S:Envelope>
In the default setup, logging happens on the inbound side. The inbound interceptors are configured here:
@Configuration
public class InboundInterceptors {

 @Autowired
 private PayloadRootAnnotationMethodEndpointMapping catchAllEndpointMapping;
 @Autowired
 private MessageDispatcher messageDispatcher;

 @Configuration
 public static class FirstInlineInterceptors {
  @Bean
  public DelegatingSmartSoapEndpointInterceptor loggingInterceptor() {
   return new DelegatingSmartSoapEndpointInterceptor(new LoggingInterceptor());
  }
 }

 @Configuration
 public static class ServiceSpecificInterceptors {

 }

 @Configuration
 public static class LastInLineInterceptors {

 }
}
If you want this logging interceptor on the outbound side as well, you can add it in OutboundInterceptors; LoggingInterceptor implements both EndpointInterceptor and ClientInterceptor. To continue with our requirements: there is also an interceptor which is able to log fragments based on XPath expressions. The LoggingXPathInterceptor is service specific, and hence we'll be adding this one to the ServiceSpecificInterceptors. The difference is that service-specific interceptors use PayloadRootSmartSoapEndpointInterceptor, which we need to give a namespace and payload root element to identify the service; the configured interceptor will only be invoked for that service. The first-in-line and last-in-line interceptors use DelegatingSmartSoapEndpointInterceptor, which is invoked for any request.
 @Configuration
 public static class ServiceSpecificInterceptors {
  @Bean
  public PayloadRootSmartSoapEndpointInterceptor getCurrentDateLoggingInterceptor() {
   LoggingXPathInterceptor loggingXPathInterceptor = new LoggingXPathInterceptor();
   loggingXPathInterceptor.addRequestXPaths(new WebServiceMessageXPathExpressionMetaData(
     "//*[local-name()='arg0']", "requestParameter"));
   loggingXPathInterceptor.addResponseXPaths(new WebServiceMessageXPathExpressionMetaData(
     "//*[local-name()='return']", "responseParameter"));
   return new PayloadRootSmartSoapEndpointInterceptor(loggingXPathInterceptor, "http://wsproxy.error.be/",
     "getCurrentDate");
  }
 }
When we execute the request again in soap-ui, we can see that the request argument and response value are extracted and logged to our log file:
DEBUG be.error.wsproxy.interceptors.internalchain.LoggingXPathInterceptor - SID:{http://wsproxy.error.be/}getCurrentDate XPATHID:requestParameter VALUE:?
DEBUG be.error.wsproxy.interceptors.internalchain.LoggingXPathInterceptor - SID:{http://wsproxy.error.be/}getCurrentDate XPATHID:responseParameter VALUE:2013-10-28T16:50:29.537+01:00
The WebServiceMessageXPathExpressionMetaData operates by default on the soap body (payload) and treats the given XPaths as mandatory (but non-blocking). For other options, check the javadoc of WebServiceMessageXPathExpressionMetaData.

The properties that can be configured are located in the package be.error.wsproxy.configuration.properties. The following classes exist: EndpointProtocolMapping, EndpointTargetUrlMapping, ForwardProxy and Keystores.

The default Spring profile "local", enabled via the default Maven profile and its filtering, resolves them from the properties file wsproxy_local_demo.properties. The configuration is always stored as a simple string, allowing for easy externalization in, for example, a JNDI environment. The first three properties determine how messages are forwarded, starting with EndpointProtocolMapping:

In the scenario above, the target URL was automatically deduced from the Host header, as our internal client was using the module as a forward proxy. Since the Host header does not contain any notion of protocol, the wsproxy assumes http as the forwarding protocol by default. If you don't have a reverse proxy which takes care of offloading TLS, you can ask the proxy module to forward over https instead. You can do this by setting the protocol mapping to https for a specific service:

endpoint.protocol.mapping={namespace}payloadRootElementLocalName=https,...
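For the demo service used in this article, that would for example be:

endpoint.protocol.mapping={http://wsproxy.error.be/}getCurrentDate=https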
EndpointTargetUrlMapping allows you to define the target URL directly. This is required in the scenario where an external client accesses our internal service. In that case the target URL can no longer be deduced: external clients will not use our module as a forward proxy, but the message will simply end up on our module as if it were the actual service. The module then needs to know where it should forward the message to:
endpoint.target.url.mapping={namespace}payloadRootElementLocalName=http(s)://host:port/path,....
This can also be used to override the target URL altogether. The forwarder first looks if there is an explicit URL defined for the given service; if so, that one takes precedence.
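For example, a concrete mapping for the demo service (host and port as used in this article) could look like:

endpoint.target.url.mapping={http://wsproxy.error.be/}getCurrentDate=http://localhost:9999/simple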

The ForwardProxy can be configured when the wsproxy module in its turn needs to communicate via an http forward proxy in order to reach the target. This is also configurable on a per-service basis.

Remember that the forward proxy does not change anything about how the target URL is calculated. Instead of accessing the target URL directly, the message will be forwarded to the configured proxy if the setting is used:
forward.proxy={namespace}payloadRootElementLocalName=host:port,...

The Keystores class points to the keystore configuration containing the store location, store password, key alias and key password. These are used when we want to apply message security, which we'll cover next.
keystores.location=${project.root.dir}/config/test-keystores
keystore=${keystores.location}/keystore.jks
keystore.password=changeme
key.alias=mykey
key.password=changeme
truststore=${keystores.location}/truststore.jks
truststore.password=changeme
To satisfy the last requirement (integrity/confidentiality) we'll be using WSS4J via Spring's Wss4jSecurityInterceptor. This interceptor needs to be configured on the outbound side in our example, where we have an internal client accessing an external service. The steps we will perform:
  • Set up a secured standalone soap-based endpoint
  • Configure the wsproxy with message security for the given service
  • Deploy wsproxy
  • Use a web services client to access the endpoint via the proxy module
For the secured endpoint, SimpleSecuredEndpoint is provided, using JAX-WS and WSIT. The WSIT configuration can be found in META-INF/wsit-be.error.wsproxy.SimpleSecuredEndpoint$SimpleWebServiceEndpoint.xml, enabling message integrity on our endpoint.
import java.io.IOException;
import java.util.Date;

import javax.jws.WebService;
import javax.xml.ws.Endpoint;
import javax.xml.ws.soap.Addressing;

import org.springframework.core.io.ClassPathResource;

public class SimpleSecuredEndpoint {

 public static void main(String args[]) throws IOException {
  // Set WSIT_HOME manually, we're only using this for testing purposes. This way we can have a dynamic path based
  // on the project location in the filesystem to resolve the keystores via the WSIT configuration in META-INF
  System.setProperty("WSIT_HOME", new ClassPathResource("").getFile().getParent() + "/../config/test-keystores/");
  Endpoint.publish("http://localhost:9999/simple", new SimpleWebServiceEndpoint());
 }

 @WebService(serviceName = "SimpleEndpoint")
 @Addressing(enabled = false, required = false)
 public static class SimpleWebServiceEndpoint {
  public Date getCurrentDateSecured(String randomParameter) {
   return new Date();
  }
 }
}
Important: the JAX-WS implementation shipped with the JDK does not contain WSIT; it is just the JAX-WS RI. In order for this to work you'll have to download the latest Metro release yourself, which bundles everything together (see the Metro home page). When you've downloaded Metro, run SimpleSecuredEndpoint passing along the endorsed system property: -Djava.endorsed.dirs=/path_to_metro/lib. This ensures the entire JAX-WS implementation is used from the external libraries. When everything is running fine you'll see a line: INFO: WSP5018: Loaded WSIT configuration from file: file:/home/koen/.....

The configuration of the WSS4J interceptor enabling message integrity in OutboundInterceptors:

@Bean
 public Map<QName, List<ClientInterceptor>> customClientInterceptors() throws Exception {
  Map<QName, List<ClientInterceptor>> mapping = new HashMap<>();

  List<ClientInterceptor> list = new ArrayList<>();
  list.add(getCurrentDateServiceSecurityInterceptor());
  list.add(new LoggingInterceptor());
  mapping.put(new QName("http://wsproxy.error.be/", "getCurrentDateSecured"), list);

  return mapping;
 }

 private Wss4jSecurityInterceptor getCurrentDateServiceSecurityInterceptor() throws Exception {
  Wss4jSecurityInterceptor interceptor = new Wss4jSecurityInterceptor();

  // Outgoing
  interceptor.setSecurementActions("Signature Timestamp");
  interceptor
    .setSecurementSignatureParts("{}{http://schemas.xmlsoap.org/soap/envelope/}Body;{}{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd}Timestamp");
  interceptor.setSecurementSignatureKeyIdentifier("IssuerSerial");
  Pair<String, String> key = keystore.getKeyAliasPasswords().get(0);
  interceptor.setSecurementUsername(key.getLeft());
  interceptor.setSecurementPassword(key.getRight());
  interceptor.setSecurementSignatureAlgorithm("http://www.w3.org/2000/09/xmldsig#rsa-sha1");
  interceptor.setSecurementSignatureDigestAlgorithm("http://www.w3.org/2000/09/xmldsig#sha1");
  interceptor.setSecurementTimeToLive(700);
  interceptor.setValidationTimeToLive(700);
  interceptor.setSecurementSignatureCrypto(keystoreCrypto);

  // Incoming
  interceptor.setValidationActions("Timestamp Signature");
  interceptor.setValidationSignatureCrypto(truststoreCrypto);

  return interceptor;
 }

In customClientInterceptors we add the custom security interceptor to the list of interceptors used by the ForwardingClient for this service. We've also added the LoggingInterceptor on the outbound side so we can see the secured messages going out and coming in. To test the message security configuration, deploy the wsproxy and use soap-ui to fire the request. The soap-ui setup is no different than the setup for the non-secured endpoint.

Important: there seems to be an issue with C14N. When the request is sent as normal, WSIT complains that the calculated digest does not match the one in the message. I'm going to investigate this further, but this appears to be a problem of WSIT and not WSS4J, since the same problem also occurs when soap-ui is configured as a secured client and communicates directly with the endpoint rather than going through the wsproxy module. To get around this and see the test working, remove the line feed between the soap Body start element and the payload root start element. Also remove the line feed between the payload root end element and the soap Body end element:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsp="http://wsproxy.error.be/">
   <soapenv:Header/>
   <soapenv:Body><wsp:getCurrentDateSecured>
         <!--Optional:-->
         <arg0>?</arg0>
 </wsp:getCurrentDateSecured></soapenv:Body>
</soapenv:Envelope>
The soap-ui projects are also available in the project if you want. Don't forget to enable the proxy settings as explained before; they are not saved as part of the project.

The result:

<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:exc14n="http://www.w3.org/2001/10/xml-exc-c14n#" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <S:Header/>
   <S:Body wsu:Id="_5002">
      <ns2:getCurrentDateSecuredResponse xmlns:ns2="http://wsproxy.error.be/">
         <return>2013-10-29T14:26:25.789+01:00</return>
      </ns2:getCurrentDateSecuredResponse>
   </S:Body>
</S:Envelope>
Nothing spectacular: the wsproxy added message security when forwarding the request and removed it when returning the response. If we look at the wsproxy log files, we first see the request entering on the inbound side:
34   [http-bio-8080-exec-3] DEBUG be.error.wsproxy.interceptors.logging.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate INBOUND SIDE Request:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsp="http://wsproxy.error.be/">
   <soapenv:Header/>
   <soapenv:Body>
    <wsp:getCurrentDateSecured>
         <!--Optional:-->
         <arg0>?</arg0>
 </wsp:getCurrentDateSecured>
  </soapenv:Body>
</soapenv:Envelope>
The request is secured and forwarded to the endpoint:
394  [http-bio-8080-exec-3] DEBUG be.error.wsproxy.interceptors.logging.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate OUTBOUND SIDE Request:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:wsp="http://wsproxy.error.be/">
   <soapenv:Header>
    <wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" soapenv:mustUnderstand="1">
      <wsu:Timestamp wsu:Id="TS-518848887F924441AB13830540361321">
        <wsu:Created>2013-10-29T13:40:36.130Z</wsu:Created>
        <wsu:Expires>2013-10-29T13:45:36.130Z</wsu:Expires>
      </wsu:Timestamp>
      <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="SIG-518848887F924441AB13830540361916">
        <ds:SignedInfo>
...
The secured response is received from the endpoint:
524  [http-bio-8080-exec-3] DEBUG be.error.wsproxy.interceptors.logging.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate OUTBOUND SIDE Response:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:exc14n="http://www.w3.org/2001/10/xml-exc-c14n#" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <S:Header>
    <wsse:Security S:mustUnderstand="1">
      <wsu:Timestamp xmlns:ns15="http://www.w3.org/2003/05/soap-envelope" xmlns:ns16="http://docs.oasis-open.org/ws-sx/ws-secureconversation/200512" wsu:Id="_3">
        <wsu:Created>2013-10-29T13:40:36Z</wsu:Created>
        <wsu:Expires>2013-10-29T13:45:36Z</wsu:Expires>
      </wsu:Timestamp>
      <ds:Signature xmlns:ns15="http://www.w3.org/2003/05/soap-envelope" xmlns:ns16="http://docs.oasis-open.org/ws-sx/ws-secureconversation/200512" Id="_1">
        <ds:SignedInfo>
...
The security information is processed and validated. If OK, the security information is stripped and the response returned (to our client, soap-ui in this case):
567  [http-bio-8080-exec-3] DEBUG be.error.wsproxy.interceptors.logging.LoggingInterceptor  - SID:{http://wsproxy.error.be/}getCurrentDate INBOUND SIDE Response:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:exc14n="http://www.w3.org/2001/10/xml-exc-c14n#" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <S:Header/>
  <S:Body wsu:Id="_5002">
    <ns2:getCurrentDateSecuredResponse xmlns:ns2="http://wsproxy.error.be/">
      <return>2013-10-29T14:40:36.357+01:00</return>
    </ns2:getCurrentDateSecuredResponse>
  </S:Body>
</S:Envelope>
Until now all tests have been done simulating an internal client accessing an external service. In case you want to use the module for the inverse, serving external clients accessing internally hosted services, it is more of the same. For each internally hosted service to which the module will forward, you have to register a target URL using the endpoint.target.url.mapping configuration parameter. Interceptors keep working the same way, but remember that, for message security for example, you probably want the Wss4jSecurityInterceptor to be configured on the inbound side, as in this scenario the inbound side is the external-facing side. There is no issue in configuring the Wss4jSecurityInterceptor on both the inbound and outbound side for different services; all configuration is on a per-service basis.

For example: service x (identified by namespace and payload root element) is an internally hosted service. Service y is an external service which internal clients want to access. For securing our internal service x we would add the Wss4jSecurityInterceptor as a service-specific inbound interceptor in the InboundInterceptors configuration. This interceptor will thus only be active on the wsproxy endpoint (only serving the inbound side, in this example the external-facing side) and only for service x. For securing the calls to service y, we would register a Wss4jSecurityInterceptor in OutboundInterceptors, adding message security for messages being sent to the external service y by the wsproxy module.

Ok, that's about it! Feel free to drop me a message if this was somehow useful or if you have ideas for improvement!

Sunday, September 8, 2013

WS-Security: using BinarySecurityToken for authentication

As we all know, one of the goals of WS-Security is to enforce integrity and/or confidentiality on SOAP messages. In the case of integrity, the signature which is added to the SOAP message is the result of a mathematical process involving the private key of the sender, resulting in an encrypted message digest.
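Conceptually (leaving XML canonicalization and the XML Signature envelope format aside), that signing step boils down to something like this sketch:

import java.security.GeneralSecurityException;
import java.security.PrivateKey;
import java.security.Signature;

public class SignatureSketch {

 // An XML signature is, at its core, a digest of the (canonicalized) content,
 // encrypted with the private key of the sender
 public static byte[] sign(byte[] canonicalizedContent, PrivateKey privateKey) throws GeneralSecurityException {
  Signature signature = Signature.getInstance("SHA1withRSA"); // digest + private key encryption in one step
  signature.initSign(privateKey);
  signature.update(canonicalizedContent);
  return signature.sign();
 }
}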

Most frameworks, such as WSS4J, will by default only sign the body. If you're adding extra headers, such as a Timestamp header, you'll have to indicate explicitly that they must be signed. Using the Spring support for WSS4J, for example, you can set a list of parts, each containing the local element name and the corresponding namespace, via the securementSignatureParts property.

Below is an example of how to instruct it to sign both the Body and Timestamp elements (and their content). This will result in two digital signatures being appended to the message.
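A sketch of such a configuration, using Spring's Wss4jSecurityInterceptor (the same calls as used in the proxy article above):

Wss4jSecurityInterceptor interceptor = new Wss4jSecurityInterceptor();
interceptor.setSecurementActions("Signature Timestamp");
// sign both the Body and the Timestamp element
interceptor.setSecurementSignatureParts(
  "{}{http://schemas.xmlsoap.org/soap/envelope/}Body;"
  + "{}{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd}Timestamp");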


Eventually the SOAP message will be sent together with the XML digital signature data and, in most cases, a BinarySecurityToken (BST) containing the certificate.

Nothing new so far. However, what struck me is that it does not seem widely understood what the goal of the BST is, nor how authentication is controlled using it. Let me try to shed some light on this:

The certificate of the sender which is sent along with the SOAP message plays the role of identification. You can compare it to being the username++. It should be clear that the certificate inside the message cannot be trusted, just as a username cannot be trusted without verifying the password. So far everyone agrees on that: "yeah, of course, certificates need to be validated in order to be trusted and then you're set!"

But that is not the entire story. Validating the certificate is not the same as authentication. The fact that the certificate in the message is valid and is signed by a known CA is not enough to consider the sender authenticated.

For example: I, in my most malicious hour, could have intercepted the message, changed the content, created a new signature based on my private key and replaced the BST in the message with my certificate. My certificate could perfectly well be an official CA-signed certificate (even signed by the same CA as the one you're using), so it would pass the validation check. If the framework simply validated the certificate inside the message, we would have no security at all.

Note: if you're sending the message over secure transport instead, chances are that I was not able to intercept the message. But secure transport is mostly terminated before the actual endpoint, leaving a small piece of the transport "unsecured". Albeit that this part will mostly be internal to your company, the point is that no matter how secure your transport is, the endpoint has the final responsibility of verifying the identity of the sender. For example, in an asynchronous system the SOAP message could have been placed on a message queue to be processed later. When processing starts at the endpoint, the trace of the secure transport is long gone. You'll have to verify the identity using the information contained in the message.

In order to close this loophole we have two solutions:

The first solution builds further on what we already described: the certificate in the message is verified against the CA root certificates in the truststore. In this scenario it is advised to first narrow the set of trusted CAs. You could, for example, agree with your clients on a limited list of CAs to get your certificates from. Doing so, you have already lowered the risk of trusting "gray zone" CAs which might not take the rules for handing out certificates so strictly (like, for example, properly checking the identity of their clients). Secondly, because *every* certificate handed out by your trusted CA would be considered "authenticated", we close the loophole by performing some extra checks.

Using WSS4J you can configure a matching pattern based on the Subject DN property of the certificate. There is a nice blog entry on this here: http://coheigea.blogspot.ie/2012/08/subject-dn-certificate-constraint.html. We could specify that the DN of the certificate must match a given value like this:

Wss4jHandler handler = ... 
handler.setOption(WSHandlerConstants.SIG_SUBJECT_CERT_CONSTRAINTS, "CN = ...");
Note that there is currently no setter for this in the Spring support for WSS4J (Wss4jSecurityInterceptor), so you'll have to extend it in order to enable this!

To conclude, these are the steps being performed:

  1. The certificate contained in the message is validated against the trusted CAs in your truststore. When this validation succeeds, it tells the application that the certificate is still valid and has actually been handed out by a CA that you consider trusted.
    • This check gives us the guarantee that the certificate really belongs to the party that the certificate claims to belong to.
  2. Optionally the certificate can also be checked for revocation, so that we don't continue trusting certificates that have explicitly been revoked.
  3. WSS4J will check if certain attributes of the certificate match the required values for the specific service (Subject DN Certificate Constraint support).
    • This is the authentication step: once the certificate has been found valid, we check if the owner of the certificate is the one we want to give access to.
  4. Finally, the signature in the message is verified by creating a new digest of the message, comparing it with the decrypted digest from the message, and so forth.
It should be noted that this check (at least when using WSS4J) is not done by default! If you don't specify it and simply add your CAs to the truststore, you'll be leaving a security hole!

The second solution requires no extra configuration and depends on ONLY the certificate of the sender being present in the truststore.
The certificate contained in the message is matched against the certificate in the truststore. If they match, the sender is authenticated. There is no need to validate certificates against a CA, since the certificates imported into the truststore are explicitly trusted (WSS4J will still check that the certificate is not expired and possibly check it for revocation). Again: there are no CA certificates (or CA intermediate certificates) in the truststore, only the certificates of the senders that you want to give access to. Access is hereby controlled by adding (or removing) their certificate from the truststore.
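Managing access then comes down to standard keytool operations on the truststore, for example (alias and file names hypothetical):

keytool -importcert -alias client1 -file client1.cer -keystore truststore.jks
keytool -delete -alias client1 -keystore truststore.jks

The first command grants access by importing the sender's certificate; the second revokes it again.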

This requires you to be cautious when initially importing the certificates, since you'll have to make sure they actually represent the sender. But this is something you're always obliged to do when adding certificates to your truststore, including when adding CA certificates as in the first solution.

Conclusion: assuming you can limit the trusted CAs, the first solution is in most cases the preferred one and also the most scalable. For new clients there are no changes required to the truststore. The attributes to match can be stored externally, so they are easy to change or add. Also, when a client certificate expires or gets revoked, you don't need to do anything special: the new certificate will be used by the sender at a given moment and will directly be validated against the CA in your truststore. In the second solution you would have to add the new certificate to the truststore and leave the old one in there for a while until the switch is performed.

Overall lessons learned: water-tight security is hard. The #1 rule in IT (assumption is the mother of all f***ups) is certainly true here. Be skeptical and make sure you fully understand what is going on. Never trust default settings until you are sure what they do. The default code on your house alarm (e.g. 123456) is not a good idea either. Neither is the default admin password on a Tomcat installation.

Friday, July 26, 2013

Splitting Large XML Files in Java

Last week I was asked to write something in Java that is able to split a single 30GB XML file into smaller parts of configurable file size. The consumer of the file is a middleware application that has problems with the large size of the XML. Under the hood it uses some kind of DOM parsing technique that causes it to run out of memory after a while. Since it's vendor-based middleware, we are not able to correct this ourselves. Our best option is to create a pre-processing tool that first splits the big file into multiple smaller chunks before they are processed by the middleware.

The XML file comes with a corresponding W3C schema, consisting of a mandatory header part followed by a content element which has 0..* data elements nested. For the demo code I re-created the schema in simplified form.
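Reconstructed here from the element names used in the demo code (see the GitHub repository for the actual XSD), the simplified schema boils down to something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.error.be/bigxmltest"
           elementFormDefault="qualified">
 <xs:element name="BigXmlTest">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="Header">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="SomeHeaderElement" type="xs:string"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
    <xs:element name="Content">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="Data" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>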

The header is negligible in size. A single data element repetition is also pretty small, let's say less than 50kB. The XML is so big because of the number of repetitions of the data element. The requirements are that:
  • Each part of the split XML should be syntactically valid XML, and each part should also validate against the original schema
  • The tool should validate the XML against the schema and report any validation errors. Validation must not be blocking, and non-validating elements or attributes must not be skipped in the output
  • For the header it was decided that, rather than copying it to each of the new output files, the header will be re-generated for each new output file with some information about the processing and some defaults
So, using binary split tools such as Unix split is out of the question. They split after a fixed number of bytes, leaving the XML corrupt for sure. As far as I know, tools such as split don't know anything about encoding either, so splitting after byte 'x' could not only result in splitting in the middle of an XML element (for example), but even in the middle of a character encoding sequence (when using Unicode that is UTF-8 encoded, for example). It's clear we need something more intelligent.

XSLT as the core technology is a no-go as well. At first sight one could be tempted: using XSLT 2.0 it is possible to create multiple output files from a single input file. It should even be possible to validate the input file while transforming. However, the devil is, as always, in the details. Operations that are simple in Java, such as writing the validation errors to a separate file or checking the size of the current output file, would require custom Java extension code. Such extensions are certainly possible with Xalan and Saxon, but Xalan is not an XSLT 2.0 implementation, so that only leaves us with Saxon. Last but not least, XSLT 1.0/2.0 are non-streaming, meaning that they read the entire source document into memory, so this clearly excludes XSLT from the possibilities.

This leaves us with Java XML parsing as the only option left. The ideal candidate is in this case of course StAX. I'm not going into the SAX ↔ StAX comparison here; fact is that StAX is able to validate against schemas (at least some parsers can) and can also write XML. Furthermore, the API is a lot easier to use than SAX: because it is pull based, it gives more control over iterating the document and works more pleasantly than the push style of SAX. Alright, what do we need:

  • A StAX implementation capable of validating XML
    • Oracle's JDK ships by default with SJSXP as the StAX implementation, but this one is non-validating, so I ended up using Woodstox. As far as I could find, validation with Woodstox is only possible using the StAX cursor API
  • Preferably some Object/XML mapping technique for (re)creating the header, instead of manually fiddling with elements and having to look up the correct datatypes/formats
    • Clearly JAXB. It has support for StAX, so you can create your object model and then let it write directly to a StAX output stream
The code is a bit too large to show here as a whole. Both the source files, XSD and test XML can be accessed here on GitHub. It has a Maven pom file, so you should be able to import it into your IDE of choice. The JAXB binding compiler will automatically compile the schema and put the generated sources on the classpath.

 public void startSplitting() throws Exception {
  XMLStreamReader2 xmlStreamReader = ((XMLInputFactory2) XMLInputFactory.newInstance())
    .createXMLStreamReader(BigXmlTest.class.getResource("/BigXmlTest.xml"));
  PrintWriter validationResults = enableValidationHandling(xmlStreamReader);

  int fileNumber = 0;
  int dataRepetitions = 0;
  XMLStreamWriter xmlStreamWriter = openOutputFileAndWriteHeader(++fileNumber); // Prepare first file
The first line creates our StAX stream reader, which means we are using the cursor API (the iterator API uses the XMLEventReader class instead). There is also a strange "2" in the class name, which refers to the StAX2 features of Woodstox; one of them is the support for validation. From here:
StAX2 is an experimental API that is intended to extend basic StAX specifications 
in a way that allows implementations to experiment with features before they 
end up in the actual StAX specification (if they do). As such, it is intended 
to be freely implementable by all StAX implementations same way as StAX, but 
without going through a formal JCP process. Currently Woodstox is the only 
known implementation.
"enableValidationHandling" can be seen in the source file if you want. I'll highlight the important pieces. First, load the XML schema:
  XMLValidationSchema xmlValidationSchema = xmlValidationSchemaFactory.createSchema(BigXmlTest.class
    .getResource("/BigXmlTest.xsd"));
The callback for writing possible validation problems to the output file:
   public void reportProblem(XMLValidationProblem validationError) throws XMLValidationException {
    validationResults.write(validationError.getMessage()
      + "Location:"
      + ToStringBuilder.reflectionToString(validationError.getLocation(),
        ToStringStyle.SHORT_PREFIX_STYLE) + "\r\n");
   }
The "openOutputFileAndWriteHeader" will create a XMLStreamWriter (which is again part form the cursor API, the iterator API has XMLEventWriter) to which we can output or part of the original XML file. It will also use JAXB to create our header and let it write to the output. The JAXB objects are generated default by using the Schema compiler (xjc).
 private XMLStreamWriter openOutputFileAndWriteHeader(int fileNumber) throws Exception {
  XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
  xmlOutputFactory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, true);
  XMLStreamWriter writer = xmlOutputFactory.createXMLStreamWriter(new FileOutputStream(new File(System
    .getProperty("java.io.tmpdir"), "BigXmlTest." + fileNumber + ".xml")));
  writer.setDefaultNamespace(DOCUMENT_NS);
  writer.writeStartDocument();
  writer.writeStartElement(DOCUMENT_NS, BIGXMLTEST_ROOT_ELEMENT);
  writer.writeDefaultNamespace(DOCUMENT_NS);

  HeaderType header = objectFactory.createHeaderType();
  header.setSomeHeaderElement("Something something darkside");
  marshaller.marshal(new JAXBElement<HeaderType>(new QName(DOCUMENT_NS, HEADER_ELEMENT, ""), HeaderType.class,
    HeaderType.class, header), writer);

  writer.writeStartElement(CONTENT_ELEMENT);
  return writer;
 }
With the xmlOutputFactory.setProperty call we enable "repairing namespaces". The specification has this to say:
javax.xml.stream.isRepairingNamespaces:
Function: Creates default prefixes and associates them with Namespace URIs.
Type: Boolean
Default Value: False
Required: Yes
What I understand from this is that it's required for handling default namespaces. Fact is that if it is not enabled, the default namespace is not written in any way. Next, we set the default namespace with setDefaultNamespace. Setting it does not actually write it to the stream; for that one needs writeDefaultNamespace, but that can only be called after a start element has been written. So, you have to define the default namespace before writing any elements, but you need to write the default namespace after writing the first element. The rationale is that StAX needs to know whether it has to generate a prefix for the root element you are going to write.

With writeStartElement we write the root element; it is important to indicate to which namespace this element belongs. If you do not specify a prefix, a prefix will be generated for you, or, in our case, no prefix is generated at all because StAX knows we already set the default namespace. If you removed the setDefaultNamespace call, the root element would be prefixed (with a random prefix) like: <wstxns1:BigXmlTest xmlns:wstxns1="http://www... Next we write our default namespace, which is written to the element started previously (by the way, for some deeper understanding of this ordering, see this nice article). Finally, we use our JAXB-generated model to create the header and let our JAXB marshaller write it directly to our StAX output stream.

Important: the JAXB marshaller is initialized in fragment mode; otherwise it would add an XML declaration, as would be required for standalone documents, and that is of course not allowed in the middle of an existing document:

   marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
On a side note: the JAXB integration is not really useful in this example; it creates more complexity and takes more lines of code than simply adding the elements using the XMLStreamWriter. However, if you have a more complex structure which you need to create and merge into the document, it is pretty handy to have automatic object mapping.
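For comparison, writing the same header without JAXB, directly via the XMLStreamWriter from the listing above, would look roughly like this fragment (writer and DOCUMENT_NS as in openOutputFileAndWriteHeader):

  // plain StAX alternative for the JAXB marshalling step
  writer.writeStartElement(DOCUMENT_NS, "Header");
  writer.writeStartElement(DOCUMENT_NS, "SomeHeaderElement");
  writer.writeCharacters("Something something darkside");
  writer.writeEndElement(); // SomeHeaderElement
  writer.writeEndElement(); // Header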

So, we have our reader, which is enabled for validation. From the moment we start iterating over the source document it will validate and parse at the same time. Then we have our writer, which already has an initialized document and header written and is ready to accept more data. Finally, we have to iterate over the source and write each part to the output file. If the output file becomes too big, we swap in a new one:

 while (xmlStreamReader.hasNext()) {
    xmlStreamReader.next();

    if (xmlStreamReader.getEventType() == XMLEvent.START_ELEMENT
      && xmlStreamReader.getLocalName().equals(DATA_ELEMENT)) {

      if (dataRepetitions != 0 && dataRepetitions % 2 == 0) { // %2 = just for testing: replace this by for example checking the actual size of the current output file
       xmlStreamWriter.close(); // Also closes any open Element(s) and the document
       xmlStreamWriter = openOutputFileAndWriteHeader(++fileNumber); // Continue with next file
       dataRepetitions = 0;
      }
      // Transform the input stream at current position to the output stream
 transformer.transform(new StAXSource(xmlStreamReader), new StAXResult(
 new FragmentXMLStreamWriterWrapper(new AvoidDefaultNsPrefixStreamWriterWrapper(xmlStreamWriter, DOCUMENT_NS))));
      dataRepetitions++;
    }
}
The important bits are that we keep iterating over the source document and check for the presence of the start of a Data element. If found, we stream the corresponding element and its children to the output. In our simple example there are no child elements, just a text value, but if the structure were more complex, all underlying nodes would automatically be copied to the output. Every two Data elements we cycle the output file: the writer is closed and a new one is initialized (this check can of course be replaced by checking file size instead of % 2). When the writer is closed it automatically takes care of closing open elements and finally closing the document itself; no need to do this yourself. As for the mechanism used to stream the nodes from the input to the output:
  • Because validation forces us to use the cursor API, we have to use XSLT to transfer the node and its children to the output. XSLT has default templates which are invoked if you do not supply an XSL yourself; in this case they simply copy the input to the given output.
  • A custom FragmentXMLStreamWriterWrapper is needed; I documented this in the JavaDoc. This wrapper is again wrapped in an AvoidDefaultNsPrefixStreamWriterWrapper. The reason for the latter is that the default XSLT template does not recognize the default namespace in our source document. More on that in a minute (or search for AvoidDefaultNsPrefixStreamWriterWrapper).
  • The transformer that you use needs to be the Oracle JDK's internal version. Where we initialize the transformer, we directly reference the internal TransformerFactory: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl, which then creates the correct transformer: transformer = new TransformerFactoryImpl().newTransformer(); Normally you would use TransformerFactory.newInstance() and pick up the transformer available on the classpath. However, parsers and transformers can install themselves by supplying META-INF/services. If another transformer (such as plain Xalan, not the repackaged JDK version) were on the classpath, the transformation would fail; apparently only the JDK-internal version has the ability to transform from a StAXSource to a StAXResult.
  • The transformer will actually let our XMLStreamReader continue in the iteration process. So after a Data element has been processed, the cursor of the reader will in theory be positioned at the next Data element. In theory, that is, since the next event type might be whitespace if your XML is formatted. So it might still need some iterations of xmlStreamReader.next() in our while loop before the next Data element is actually ready.
The result is that we have 3 output files, each compliant with the original schema and each containing 2 data elements.
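One such part then looks along these lines (content as per the demo schema):

<?xml version='1.0' encoding='UTF-8'?>
<BigXmlTest xmlns="http://www.error.be/bigxmltest">
 <Header>
  <SomeHeaderElement>Something something darkside</SomeHeaderElement>
 </Header>
 <Content>
  <Data>Data1</Data>
  <Data>Data2</Data>
 </Content>
</BigXmlTest>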

Splitting a ~30GB XML (I'm talking about my original assignment XML with a more complex structure, not the demo XSD used here) into parts of ~500MB with validation took about 25 minutes. To test the memory usage I deliberately set the Xmx to 32MB. As you can see in the graph, memory consumption is very low and there is no GC overhead:

Life is good, but not completely. There were some awkward things I discovered that one needs to be careful about.

In my real scenario the input XML had no namespaces associated with it, and I'm pretty sure it never will. That is the reason I stuck with this solution. In the demo here there is a single namespace, and that already starts to make the setup more brittle. The issue is not StAX: handling the namespaces with StAX is pretty simple. You can decide to have a default namespace (assuming your schema is elementFormDefault = qualified) corresponding to the target namespace of the schema, and maybe declare some prefixed namespaces for other namespaces that are possibly imported in the schema. The problems begin (as you might have noticed by now) when XSLT starts to interfere with the output stream; apparently it doesn't check which namespaces are already defined.

The result of this XSLT interference is that the document gets seriously cluttered: existing namespaces are re-defined with other prefixes, the default namespace gets reset, and other stuff you don't want happens. You probably need an explicit XSL if you require more namespace manipulation than the default template offers. XSLT will also trigger exceptions if the input document uses a default namespace: it tries to register a prefix named "xmlns", which is not allowed since xmlns is reserved for indicating the default namespace and cannot be used as a prefix. The fix I applied for this test was to ignore any prefix equal to "xmlns" and to ignore the addition of the target namespace in combination with the xmlns prefix (that's why we have the AvoidDefaultNsPrefixStreamWriterWrapper). The prefix and namespace both need to match in the AvoidDefaultNsPrefixStreamWriterWrapper: if the input document had no default namespace but used prefixes instead (like <bigxml:BigXmlTest xmlns:bigxml="http://...."><bigxml:Header....) then you cannot skip adding the namespace (the combination would then be the target namespace with the "bigxml" prefix), since that would yield only prefixes on the data elements without a namespace being bound to them, for example:

<?xml version='1.0' encoding='UTF-8'?>
<BigXmlTest xmlns="http://www.error.be/bigxmltest">
 <Header>
  <SomeHeaderElement>Something something darkside</SomeHeaderElement>
 </Header>
 <Content>
  <bigxml:Data>Data1</bigxml:Data>
  <bigxml:Data>Data2</bigxml:Data>
 </Content>
</BigXmlTest>
Remember that the producer of the XML is free (again, in case of elementFormDefault = qualified) to choose between using the default namespace or prefixing every element. The code should transparently be able to deal with both scenarios. The AvoidDefaultNsPrefixStreamWriterWrapper code for convenience:
public class AvoidDefaultNsPrefixStreamWriterWrapper extends XMLStreamWriterAdapter {
...

 @Override
 public void writeNamespace(String prefix, String namespaceURI) throws XMLStreamException {
  // Skip re-declaring the default namespace under the reserved "xmlns" prefix
  if (defaultNs.equals(namespaceURI) && "xmlns".equals(prefix)) {
   return;
  }
  super.writeNamespace(prefix, namespaceURI);
 }

 @Override
 public void setPrefix(String prefix, String uri) throws XMLStreamException {
  // "xmlns" is reserved and cannot be used as a prefix; ignore attempts to register it
  if ("xmlns".equals(prefix)) {
   return;
  }
  super.setPrefix(prefix, uri);
 }
}
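For orientation, this is roughly how the two wrappers are chained around the raw output writer. Consider it a hedged sketch: the constructor signatures are assumptions on my part, the real ones are in the GitHub sources.

// Hypothetical wiring; see the GitHub sources for the actual constructors
XMLStreamWriter raw = XMLOutputFactory.newInstance()
  .createXMLStreamWriter(new FileOutputStream("part-1.xml"), "UTF-8");
XMLStreamWriter fragmentWriter = new AvoidDefaultNsPrefixStreamWriterWrapper(
  new FragmentXMLStreamWriterWrapper(raw), "http://www.error.be/bigxmltest");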
Finally, I also wrote a version (click here for GitHub) that does the exact same thing, but this time with the StAX iterator API. You will notice that the cumbersome XSLT is no longer required for streaming to the output: each event of interest is simply added to the output. The lack of validation could be solved by first validating the input using the cursor API and then parsing it with the iterator API (see the validation sketch below). It will take longer, but that might still be acceptable in most situations. The most important piece:
 while (xmlEventReader.hasNext()) {
    XMLEvent event = xmlEventReader.nextEvent();

    if (event.isStartElement() && event.asStartElement().getName().getLocalPart().equals(CONTENT_ELEMENT)) {
     event = xmlEventReader.nextEvent();

     while (!(event.isEndElement() && event.asEndElement().getName().getLocalPart()
       .equals(CONTENT_ELEMENT))) {

      if (dataRepetitions != 0 && event.isStartElement()
        && event.asStartElement().getName().getLocalPart().equals(DATA_ELEMENT)
        && dataRepetitions % 2 == 0) { // %2 = just for testing: replace this by for example checking the actual size of the current
                // output file
       xmlEventWriter.close(); // Also closes any open Element(s) and the document
       xmlEventWriter = openOutputFileAndWriteHeader(++fileNumber); // Continue with next file
       dataRepetitions = 0;
      }
      // Write the current event to output
      xmlEventWriter.add(event);
      event = xmlEventReader.nextEvent();

      if (event.isEndElement() && event.asEndElement().getName().getLocalPart().equals(DATA_ELEMENT)) {
       dataRepetitions++;
      }
     }
    }
   }
Note that xmlEventReader.nextEvent() returns an XMLEvent, which contains all information about the current node. Checking for the element type is also easier in this form: instead of comparing event codes against constants you can use the object model (isStartElement(), asStartElement() and so on). Finally, to copy an element from input to output we simply add the event to the XMLEventWriter.
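As for the validate-first idea mentioned above, a minimal sketch of such an upfront validation pass with the cursor API could look like this (file names are made up):

import java.io.File;
import java.io.FileInputStream;

import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class ValidateFirst {
 public static void main(String[] args) throws Exception {
  Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new File("bigxmltest.xsd"));
  // Pass 1: stream the entire document through a validator (constant memory)
  XMLStreamReader reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(new FileInputStream("bigxmltest.xml"));
  schema.newValidator().validate(new StAXSource(reader));
  // Pass 2: re-open the file with the iterator API and do the actual splitting
 }
}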

Monday, April 1, 2013

JMS and Spring: Small Things Sometimes Matter

JmsTemplate and DefaultMessageListenerContainer are Spring helpers for accessing JMS-compatible MOM. Their main goal is to form a layer on top of the JMS API, dealing with infrastructure such as transaction management/message acknowledgement and hiding some of the repetitive and clumsy parts of the JMS API (hang in there: JMS 2.0 is on its way!). To use either of these helpers you have to supply them with (at least) a JMS ConnectionFactory and a valid JMS destination.

When running your app on an application server, the ConnectionFactory will most likely be defined using the JEE architecture. This boils down to adding the ConnectionFactory and its configuration parameters on the server, allowing it to be published in the directory service under a given alias (e.g. jms/myConnectionFactory). Within your app you might for example use "jndi-lookup" from the jee namespace, or the JndiTemplate/JndiObjectFactoryBean beans if more configuration is required, to look up the ConnectionFactory and pass it along to your JmsTemplate and/or DefaultMessageListenerContainer.
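For illustration, a minimal programmatic equivalent of that lookup using JndiTemplate (the JNDI alias is the example from above):

import javax.jms.ConnectionFactory;
import javax.naming.NamingException;

import org.springframework.jndi.JndiTemplate;

public class ConnectionFactoryLookup {
 // Resolve the container-managed ConnectionFactory by its JNDI alias
 static ConnectionFactory lookupConnectionFactory() throws NamingException {
  return new JndiTemplate().lookup("jms/myConnectionFactory", ConnectionFactory.class);
 }
}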

The latter, the JMS destination, identifies a JMS Queue or Topic to which you want to produce messages or from which you want to consume them. However, both JmsTemplate and DefaultMessageListenerContainer have two different properties for injecting the destination: there is a method taking the destination as a String and one taking it as a JMS Destination type. This is nothing invented by Spring; the JMS specification mentions both approaches:

4.4.4 Creating Destination Objects
Most clients will use Destinations that are JMS administered objects that they have looked up via JNDI. This is the most portable approach.
Some specialized clients may need to create Destinations by dynamically manufacturing one using a provider-specific destination name. 
Sessions provide a JMS provider-specific method for doing this.
If you pass along a destination as a String, the helpers will hide the extra steps required to map it to a valid JMS Destination. In the end, createConsumer on a JMS Session expects you to pass along a Destination object to indicate where to consume messages from before returning a MessageConsumer.
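To make the two flavours concrete, a small sketch against JmsTemplate (the destination name is made up):

import javax.jms.ConnectionFactory;
import javax.jms.Destination;

import org.springframework.jms.core.JmsTemplate;

public class DestinationFlavours {
 // Destination supplied as a String: resolved via the DestinationResolver
 static JmsTemplate byName(ConnectionFactory connectionFactory) {
  JmsTemplate template = new JmsTemplate(connectionFactory);
  template.setDefaultDestinationName("MY.PHYSICAL.QUEUE");
  return template;
 }

 // Destination supplied as a javax.jms.Destination, e.g. looked up from JNDI
 static JmsTemplate byDestination(ConnectionFactory connectionFactory, Destination destination) {
  JmsTemplate template = new JmsTemplate(connectionFactory);
  template.setDefaultDestination(destination);
  return template;
 }
}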

When destinations are configured as Strings, the Destination is looked up by Spring using the JMS API itself. By default JmsTemplate and DefaultMessageListenerContainer have a reference to a DestinationResolver, which is DynamicDestinationResolver by default (more on that later). The code below is an extract from DynamicDestinationResolver; the createQueue calls show the usage of the JMS API to transform the String into a Destination (in this example a Queue):

 protected Queue resolveQueue(Session session, String queueName) throws JMSException {
  if (session instanceof QueueSession) {
   // Cast to QueueSession: will work on both JMS 1.1 and 1.0.2
   return ((QueueSession) session).createQueue(queueName);
  }
  else {
   // Fall back to generic JMS Session: will only work on JMS 1.1
   return session.createQueue(queueName);
  }
 }
The other way mentioned by the spec (the JNDI approach) is to configure destinations as administered objects on your application server. This follows the same principle as with the ConnectionFactory: the destination is published in the application server's directory and can be looked up by its JNDI name (e.g. jms/myQueue). Again you can look up the JMS Destination in your app and pass it along to JmsTemplate and/or DefaultMessageListenerContainer, this time using the property taking a JMS Destination as parameter.

Now, why do we have those two options?

I always assumed that it was a matter of choice between convenience (the dynamic approach) and environment transparency/configurability (the JNDI approach). For example: in some situations the name of the physical destination might differ depending on the environment in which your application runs. If you configure your physical destination names inside your application you obviously lose this benefit, as they cannot be altered without rebuilding your application. If you configure them as administered objects on the other hand, altering the physical destination name is merely a simple change in the application server configuration.

Remember: having physical destination names configurable can make sense. Besides the destination type, applications dealing with messaging are agnostic to its details. A messaging destination has no functional contract, and none of its properties (physical destination name, persistence, and so forth) are of importance for the code you write; the actual contract is inside the messages themselves (the headers and the body). A database table, on the other hand, is an example of something that does expose a contract by itself and is tightly coupled with your code. In most cases renaming a database table does impact your code, hence making something like that configurable normally has no added value, in contrast to a messaging destination.

Recently I discovered that my understanding of this is not the entire truth. The specification (from "4.4.4 Creating Destination Objects" as quoted a few paragraphs above) already gives a hint: "Most clients will use Destinations that are JMS administered objects that they have looked up via JNDI. This is the most portable approach." Basically this tells us that the other approach (the dynamic one, where we work with the destination as a String) is the least portable way. This was never really clear to me, as each provider is required to implement both methods; however, "portable" has to be looked at in a broader context.

When destinations are configured as Strings, Spring will by default transform them into JMS Destinations whenever it creates a new JMS Session. When using the DefaultMessageListenerContainer for consuming messages, each message you process occurs in a transaction, and by default the JMS Session and consumer are not pooled; they are re-created for each receive operation. This results in transforming the String into a JMS Destination each time the container checks for new messages and/or receives a new message. The "non-portable" aspect comes into play as it also means that the details and costs of this transformation depend entirely on your MOM's driver/implementation. In our case we experienced this with Oracle AQ as MOM provider: each time a destination transformation happens, the driver executes a specific query:

select /*+ FIRST_ROWS */ t1.owner, t1.name, t1.queue_table, t1.queue_type, t1.max_retries, t1.retry_delay, t1.retention, t1.user_comment, t2.type, t2.object_type, t2.secure
from all_queues t1, all_queue_tables t2
where t1.owner=:1 and t1.name=:2 and t2.owner=:3 and t1.queue_table=t2.queue_table
Forum entry can be found here.

Although this query was improved in the latest drivers (as mentioned in the bug report), it was still causing significant overhead on the database. There are two options to solve this:

  • Do what the specification advises you to do: configure destinations as resources on the application server. The application server hands out the same instance on each lookup, so they are effectively cached there. Moreover, when using JndiTemplate (or JndiDestinationResolver, see below) the result is also cached on the application side, so even the lookup itself will only happen once.
  • Enable session/consumer caching on the DefaultMessageListenerContainer (see the sketch below). When caching is set to consumer level, it indirectly also re-uses the Destination, as the consumer holds a reference to it. This pooling is functionality added by Spring; the JavaDoc says it is safe with resource-local transactions and "should" be safe with XA transactions (except when running on JBoss 4).
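The consumer-caching option from the list, as a minimal sketch:

import org.springframework.jms.listener.DefaultMessageListenerContainer;

public class CachingListenerConfig {
 static void enableConsumerCaching(DefaultMessageListenerContainer container) {
  // CACHE_CONSUMER re-uses the Connection, Session and MessageConsumer,
  // and with them the resolved Destination the consumer holds on to
  container.setCacheLevel(DefaultMessageListenerContainer.CACHE_CONSUMER);
 }
}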
Of these two, the first is probably the best. However, in our case all destinations are already defined inside the application (and there are plenty of them) and there is no need to have them configurable; refactoring them merely for this technical reason would generate a lot of overhead with no other advantages. The second solution is the least preferred one, as it would imply extra testing and investigation to make sure nothing breaks. It also seems to do more than needed, as there is no indication in our case that creating a Session or Consumer has a measurable impact on performance. According to the JMS specification:
4.4 Session
A JMS Session is a single-threaded context for producing and consuming
messages. Although it may allocate provider resources outside the Java virtual
machine, it is considered a lightweight JMS object.
Btw, this also holds for MessageConsumers/Producers: both are bound to a Session, so if a Session is lightweight to open, these objects will be as well.

There is however a third solution: a custom DestinationResolver. The DestinationResolver is the abstraction that takes care of going from a String to a Destination. The default (DynamicDestinationResolver) uses createQueue/createTopic on the JMS Session to transform the String (as shown in the extract above), but it does not cache the resulting Destination. However, if your destinations are configured on the application server as resources, you can (besides using Spring's JNDI support and injecting the Destination directly) also use JndiDestinationResolver. This resolver treats the supplied String as a JNDI location (instead of a physical destination name) and performs the lookup for you. By default it caches the resulting Destination, avoiding any subsequent JNDI lookups. One can also configure JndiDestinationResolver as a caching decorator for the DynamicDestinationResolver: if you set fallback to true, it will first try to use the String as a location to look up from JNDI, and if that fails it will pass the String along to DynamicDestinationResolver, using the JMS API to transform it into a Destination. In both cases the resulting Destination is cached, so the next request for the same destination is served from the cache. With this resolver there is an out-of-the-box solution without having to write any code:

 <bean id="cachingDestinationResolver" class="org.springframework.jms.support.destination.JndiDestinationResolver">
  <property name="cache" value="true"/>
  <property name="fallbackToDynamicDestination" value="true"/> 
 </bean>

 <bean id="infra.abstractMessageListenerContainer" class="org.springframework.jms.listener.DefaultMessageListenerContainer" abstract="true">
  <property name="destinationResolver" ref="cachingDestinationResolver"/>
  ...
 </bean>
The JndiDestinationResolver is thread-safe as it internally uses a ConcurrentHashMap to store the bindings. A JMS Destination is itself thread-safe according to the JMS 1.1 specification (2.8 Multithreading) and can safely be cached.

This is again a nice example of how simple things can sometimes have an important impact. This time the solution was straightforward thanks to Spring. It would however have been a better idea to make the caching behaviour the default, as this would decouple it from any provider-specific quirks in looking up the destination. The reason this isn't the default is probably that the DefaultMessageListenerContainer supports changing the destination on the fly (using JMX for example):

Note: The destination may be replaced at runtime, with the listener container picking up the new destination immediately (works e.g. with DefaultMessageListenerContainer, as long as the cache level is less than CACHE_CONSUMER). However, this is considered advanced usage; use it with care!

Monday, March 4, 2013

Bulk fetching with Hibernate

If you need to process large database result sets from Java, you can opt for JDBC to give you the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model, and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.

A simplified example: let's assume we have a table (mapped to the class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about ~2KB. The JVM is run with -Xmx250m; let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified. To start we'll try the obvious first: performing a query to simply retrieve all data:

new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
 @Override
 public Void doInTransaction(TransactionStatus status) {
   Session session = sessionFactory.getCurrentSession();
   List<DemoEntity> demoEntities = (List<DemoEntity>) session.createQuery("from DemoEntity").list();
    for (DemoEntity demoEntity : demoEntities) {
    //Process and write result
   }
   return null;
 }
});
After a couple of seconds:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Clearly this won't cut it. To fix this we will switch to Hibernate scrollable result sets, which most developers are probably aware of. The above example instructs Hibernate to execute the query, map the entire result set to entities and return them. When using scrollable result sets, records are transformed into entities one at a time:

new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
 @Override
 public Void doInTransaction(TransactionStatus status) {
  Session session = sessionFactory.getCurrentSession();
  ScrollableResults scrollableResults = session.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY);

  int count = 0;
  while (scrollableResults.next()) {
   if (++count > 0 && count % 100 == 0) {
    System.out.println("Fetched " + count + " entities");
   }
   DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0];
   //Process and write result
  }
 return null;
 }
});
After running this we get:

...
Fetched 49800 entities
Fetched 49900 entities
Fetched 50000 entities
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Although we are using a scrollable result set, every returned object is an attached object and becomes part of the persistence context (aka the session). The result is actually the same as in our first example, in which we used "session.createQuery("from DemoEntity").list()". However, with that approach we had no control: everything happens behind the scenes and you get a list back with all the data once Hibernate has done its job. Using a scrollable result set, on the other hand, gives us a hook into the retrieval process and allows us to free up memory when needed. As we have seen, it does not free up memory automatically; you have to instruct Hibernate to actually do it. The following options exist:

  • Evicting the object from the persistent context after processing it
  • Clearing the entire session every now and then
We will opt for the first. In the above example, right after the processing step (//Process and write result), we'll add:
session.evict(demoEntity);
Important:
  • If you were to perform any modifications to the entity (or entities it has associations with that are cascade-evicted alongside it), make sure to flush the session PRIOR to evicting or clearing; otherwise statements held back because of Hibernate's write-behind will not be sent to the database
  • Evicting or clearing does not remove the entities from the second-level cache. If you have enabled and are using the second-level cache and want to remove them there as well, use the desired sessionFactory.getCache().evictXxx() method (see the sketch after this list)
  • From the moment you evict an entity, it is no longer attached (no longer associated with a session). Any modification done to the entity at that stage will no longer be reflected in the database automatically. If you are using lazy loading, accessing any property that was not loaded prior to the eviction will yield the famous org.hibernate.LazyInitializationException. So basically, make sure the processing for that entity is done (or that it is at least initialized for further needs) before you evict or clear it
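The second-level cache eviction mentioned in the list, as a minimal sketch (only relevant if DemoEntity is actually cached):

import org.hibernate.SessionFactory;

public class SecondLevelCacheCleanup {
 // Drop all cached DemoEntity instances from the second-level cache
 static void evictDemoEntities(SessionFactory sessionFactory) {
  sessionFactory.getCache().evictEntityRegion(DemoEntity.class);
 }
}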
After we run the application again, we see that it now successfully executes:
...
Fetched 99800 entities
Fetched 99900 entities
Fetched 100000 entities

Btw, you can also make the query read-only, allowing Hibernate to perform some extra optimizations:

 ScrollableResults scrollableResults = session.createQuery("from DemoEntity").setReadOnly(true).scroll(ScrollMode.FORWARD_ONLY);
Doing this only makes a very marginal difference in memory usage: in this specific test setup it enabled us to read about 300 extra entities with the given amount of memory. Personally I would not use this feature merely for memory optimization alone, but only if it fits your overall immutability strategy. With Hibernate you have different options to make entities read-only: on the entity itself, making the overall session read-only, and so forth. Setting read-only on the individual query is probably the least preferred approach (e.g. entities loaded in the session before will remain unaffected and possibly modifiable, and lazy associations will be loaded modifiable even if the root objects returned by the query are read-only).
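For reference, the session-level alternatives as a short sketch (Hibernate 3.6+/4.x API; session and demoEntity are assumed to be in scope):

// Session-wide: every entity loaded from now on is read-only
session.setDefaultReadOnly(true);
// Or per entity, for an already-loaded instance
session.setReadOnly(demoEntity, true);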

Ok, we were able to process our 100,000 records; life is good. But as it turns out, Hibernate has another option for bulk operations: the stateless session. You can obtain a scrollable result set from a stateless session the same way as from a normal session. A stateless session sits directly on top of JDBC, and Hibernate runs in nearly "all features disabled" mode. This means no persistence context, no second-level caching, no dirty detection, no lazy loading, basically no nothing. From the JavaDoc:

/**
 * A command-oriented API for performing bulk operations against a database.
 * A stateless session does not implement a first-level cache nor interact with any 
 * second-level cache, nor does it implement transactional write-behind or automatic 
 * dirty checking, nor do operations cascade to associated instances. Collections are 
 * ignored by a stateless session. Operations performed via a stateless session bypass 
 * Hibernate's event model and interceptors.  Stateless sessions are vulnerable to data 
 * aliasing effects, due to the lack of a first-level cache. For certain kinds of 
 * transactions, a stateless session may perform slightly faster than a stateful session.
 *
 * @author Gavin King
 */
The only thing it does is transform records into objects. This might be an appealing alternative because it helps you get rid of the manual evicting/flushing:
new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
 @Override
 public Void doInTransaction(TransactionStatus status) {
  sessionFactory.getCurrentSession().doWork(new Work() {
   @Override
   public void execute(Connection connection) throws SQLException {
    StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
    try {
     ScrollableResults scrollableResults = statelessSession.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY);

     int count = 0;
     while (scrollableResults.next()) {
      if (++count > 0 && count % 100 == 0) {
       System.out.println("Fetched " + count + " entities");
      }
      DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0];
      //Process and write result 
     }
    } finally {
     statelessSession.close();
    }
   }
  });
  return null;
 }
});

Besides the fact that the stateless session has the most optimal memory usage, using it has some side effects. You might have noticed that we are opening a stateless session and closing it explicitly: there is no sessionFactory.getCurrentStatelessSession() nor (at the time of writing) any Spring integration for managing the stateless session. Opening a stateless session allocates a new java.sql.Connection by default (if you use openStatelessSession()) to perform its work, and therefore indirectly spawns a second transaction. You can mitigate these side effects by using the Hibernate work API, as in the example above, which supplies the current Connection and passes it along to openStatelessSession(Connection connection). Closing the session in the finally block has no impact on the physical connection, since that is managed by the Spring infrastructure: only the logical connection handle is closed, and a new logical connection handle was created when opening the stateless session.

Also note that you have to deal with closing the stateless session yourself, and that the above example is only good for read-only operations. From the moment you start modifying entities through the stateless session there are some more caveats. As said before, Hibernate runs in "all features disabled" mode, and as a direct consequence entities are returned in detached state. For each entity you modify, you'll have to call statelessSession.update(entity) explicitly. First I tried this for modifying an entity:

new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
 @Override
 public Void doInTransaction(TransactionStatus status) {
  sessionFactory.getCurrentSession().doWork(new Work() {
   @Override
   public void execute(Connection connection) throws SQLException {
    StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
    try {
     DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult();
     demoEntity.setProperty("test");
     statelessSession.update(demoEntity);
    } finally {
     statelessSession.close();
    }
   }
  });
  return null;
 }
});
The idea is that we open a stateless session with the existing database Connection. As the StatelessSession JavaDoc indicates that no write-behind occurs, I was convinced that each statement performed by the stateless session would be sent directly to the database, and that eventually, when the transaction (started by the TransactionTemplate) was committed, the results would become visible in the database. However, Hibernate does BATCH statements when using a stateless session. I'm not 100% sure what the difference is between batching and write-behind, but the result is the same, and thus contradictory to the JavaDoc: statements are queued and flushed at a later time. So if you don't do anything special, batched statements will not be flushed, and this is what happened in my case: the "statelessSession.update(demoEntity);" was batched and never flushed. One way to force the flush is to use the Hibernate transaction API:
StatelessSession statelessSession = sessionFactory.openStatelessSession();
statelessSession.beginTransaction();
...
statelessSession.getTransaction().commit();
...
While this works, you probably don't want to start controlling your transactions programmatically just because you are using a stateless session. Doing this also puts our stateless session work in a second-transaction scenario again, since we didn't pass along our Connection, so a new database connection is acquired. The reason we can't pass along the outer Connection here is that committing the inner transaction (the "stateless session transaction") on the same connection as the outer transaction (started by the TransactionTemplate) would break the outer transaction's atomicity: statements from the outer transaction already sent to the database would be committed along with the inner transaction. Not passing along the connection thus means opening a new connection and creating a second transaction. A better alternative is simply to trigger Hibernate to flush the stateless session. However, StatelessSession has no "flush" method to trigger this manually. A solution here is to depend a bit on the Hibernate internal API. This makes the manual transaction handling and the second transaction obsolete: all statements become part of our (one and only) outer transaction:
StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
 try {
  DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult();
  demoEntity.setProperty("test");
  statelessSession.update(demoEntity);
  ((TransactionContext) statelessSession).managedFlush();
 } finally {
  statelessSession.close();
}
Fortunately there is an even better solution, very recently posted on the Spring JIRA: https://jira.springsource.org/browse/SPR-2495. This is not yet part of Spring, but the factory bean implementation is pretty straightforward: StatelessSessionFactoryBean.java. When using it you can simply inject the StatelessSession:
@Autowired
private StatelessSession statelessSession;
It will inject a stateless session proxy which works the same way as the normal "current" session (with the minor difference that with the latter you inject a SessionFactory and need to obtain the current session each time). When the proxy is invoked it looks up the stateless session bound to the running transaction. If none exists yet, it creates one with the same connection as the normal session (like we did in the example) and registers a custom transaction synchronization for the stateless session. When the transaction is committed, the stateless session is flushed thanks to the synchronization, and finally closed. Using this you can inject the stateless session directly and use it as a current session (or the same way you would inject a JPA PersistenceContext, for that matter). This relieves you from dealing with opening and closing the stateless session and from having to make it flush one way or another. The implementation is JPA-oriented, but the JPA part is limited to obtaining the physical connection in obtainPhysicalConnection(); you can easily leave out the EntityManagerFactory and get the physical connection directly from the Hibernate session.
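To round off, a usage sketch under the assumption that the StatelessSessionFactoryBean from the JIRA issue is registered in the application context:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.StatelessSession;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BulkService {

 // Proxy that binds a StatelessSession to the running transaction
 @Autowired
 private StatelessSession statelessSession;

 @Transactional
 public void process() {
  ScrollableResults results = statelessSession.createQuery("from DemoEntity")
    .scroll(ScrollMode.FORWARD_ONLY);
  while (results.next()) {
   DemoEntity demoEntity = (DemoEntity) results.get()[0];
   demoEntity.setProperty("processed");
   statelessSession.update(demoEntity); // flushed and closed on transaction commit
  }
 }
}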

A very careful conclusion: it is clear that the best approach will depend on your situation. If you use the normal session you will have to deal with eviction yourself when reading or persisting entities. Besides the fact that you have to do this manually, it might also impact further use of the session if you have a mixed transaction in which you perform both 'bulk' and 'normal' operations. If you continue with the normal operations you will have detached entities in your session, which might lead to unexpected results (as dirty detection will no longer work, and so forth). On the other hand, you will still have the major Hibernate benefits (as long as the entity isn't evicted) such as lazy loading, caching, dirty detection and the like. Using the stateless session at the time of writing requires some extra attention in managing it (opening, closing and flushing), which can also be error prone. Assuming you can use the proposed factory bean, you get a very bare-bones session which is separate from your normal session but still participates in the same transaction. With this you have a powerful tool to perform bulk operations without having to think about memory management. The downside is that you don't have any other Hibernate functionality available.