Devoxx - Day 4

By Roland Huß | Friday, November 19, 2010

Tags:

Cloud

The last full conference day of Devoxx was again packed with very interesting talks of various kinds. It started with a keynote about the roadmap of JEE 7. In summary, we can expect some smooth refinements of the platform (except maybe the out-of-the-box support for virtualization). Here are our impressions of Thursday's talks. Please expect our summary blog post on Monday, since we are all now in a rush to get our things done and to catch trains, planes etc. We hope you enjoyed the blog flood so far ;-)

Designing Java Systems to Operate at a Cloud Scale (George Reese)

The talk focused on how to architect cloud applications in general. The main tips given were:

No hard-coding of IP Addresses
Try to avoid keeping state on the server (failover, scalability, performance)
Use caching where appropriate (static content, db)
Choose the appropriate DB (NoSQL or relational db?)
- Don’t buy into the NoSQL hype
- Where integrity of data and transactional integrity matter - consider relational db
- This is basically a tradeoff between scalability and transactional integrity
  - Relational DBs can scale, but doing so is complex
  - NoSQL can also support data integrity
Scaling
- Split DB reads from writes: typically there are many more reads than writes
- Consider a sharding strategy early on
- Take advantage of CDN
Self-contained Applications
- In a cloud environment you need to be able to replicate an application very quickly (when scaling / disaster recovery)
Network Assumptions
- Avoid assumptions about network topologies
- Keep the network architecture SIMPLE
- Some EJB servers require broadcast/multicast for clustering, but this is not supported in all clouds
Disk and Network I/O
- You need to worry about them in the cloud - especially in conjunction with virtualisation.
- Heavy I/O applications or chatty network applications have to worry in particular. That doesn’t mean they are not suited for the cloud - there are workarounds
Assume cloud is a hostile environment
- Trusting no one is key
- Handle passwords very carefully
- Encrypt network traffic
- And most important: install an IDS (intrusion detection system)
Design for failure
- Build redundant components
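
The "split DB reads from writes" tip from the list above can be sketched as a tiny router. This is only an illustration under my own assumptions: the class name ReadWriteRouter and the connection names are made up, and a real system would hand out pooled connections rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: writes go to the primary, reads are spread over replicas.
final class ReadWriteRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouter(String primary, List<String> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one read replica is required");
        }
        this.primary = primary;
        this.replicas = new ArrayList<>(replicas); // defensive copy
    }

    // Writes always go to the single primary.
    String forWrite() {
        return primary;
    }

    // Reads are distributed round-robin across the replicas.
    String forRead() {
        int index = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(index);
    }
}
```

Since reads typically far outnumber writes, adding replicas to the list is the cheap way to scale in this picture; the price is that a replica may briefly lag behind the primary.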

When it comes to Java applications, there are a few pointers to consider:

Use message queues for communication, when possible. Avoid RMI (to keep the network topology simple) and SOAP (due to its bloat).
EJBs should be avoided due to the network problems in clustered environments. EJBs are also more resource intensive.
Also consider using multiple processes and not just multiple threads. Threads cannot be split up across multiple JVMs, processes can.

My overall impression of the session was that it was very high-level - good if you haven’t had much exposure to cloud applications. It touched all the main topics, but unfortunately didn’t delve into any of the details. A lot of the topics covered were self evident.

Hadoop and NoSQL at Twitter (Dmitriy Ryaboy)

Actually, this talk was not really about Hadoop, but about scaling large data sets at Twitter. There are a lot of different kinds of scaling problems, but there are general principles which can be applied to solving yours. And there is a good chance someone has already solved your problem. Twitter has to deal with 95 million tweets per day, 3000 tweets per second.

Single master with many read slaves doesn’t work here because of write speed bottlenecks, and it does not play well with multiple data centers. Snowflake, the standalone distributed UID generator Twitter is using, is time-dominant, which means data is roughly time sorted.
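
To make "time-dominant" a bit more concrete: if the timestamp sits in the most significant bits of an id, sorting by id roughly sorts by time. Here is a minimal sketch of that idea. The bit widths (10 bits of worker id, 12 bits of sequence below the timestamp) follow the layout commonly described for Snowflake, but treat them, the class name and the use of the plain Unix epoch as my assumptions. This is not Twitter's code.

```java
// Hypothetical sketch of a time-dominant id generator (not Twitter's implementation).
final class TimeSortedIdGenerator {
    private static final int WORKER_BITS = 10;   // assumed width
    private static final int SEQUENCE_BITS = 12; // assumed width
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    TimeSortedIdGenerator(long workerId) {
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.workerId = workerId;
    }

    synchronized long nextId() {
        return nextId(System.currentTimeMillis());
    }

    // Overload that takes the clock value explicitly, which keeps the sketch testable.
    synchronized long nextId(long timestampMillis) {
        if (timestampMillis < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (timestampMillis == lastTimestamp) {
            if (sequence == MAX_SEQUENCE) {
                // A real generator would wait for the next millisecond instead.
                throw new IllegalStateException("sequence exhausted for this millisecond");
            }
            sequence++;
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestampMillis;
        // Timestamp in the high bits, then the worker id, then the per-millisecond sequence.
        return (timestampMillis << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

Ids from different workers are only roughly ordered, because their clocks are never perfectly in sync. That matches the "roughly time sorted" wording from the talk.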

Gizzard is Twitter’s sharding framework, whose key features are spreading the keyspace across many nodes and replication. Messages are mapped to shards and shards are mapped to replication trees. Shards are abstracted (MySQL, Lucene, Redis, logical shards). Ranges of keys are mapped to shards. Replication is controlled by various possible replication policies. Fault tolerance is realized by re-enqueueing failed writes, which requires writes to be commutative and idempotent. Stale reads can happen (CALM: Consistency As Logical Monotonicity).
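
Two of these ideas are easy to sketch in plain Java: mapping key ranges to shards, and a write that can safely be retried or reordered. The classes below are hypothetical illustrations and have nothing to do with Gizzard's actual API. The second one assumes that every write carries a distinct timestamp.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: ranges of keys are mapped to shards.
final class RangeShardMap {
    // Lower bound of each key range -> shard name.
    private final TreeMap<Long, String> lowerBounds = new TreeMap<>();

    void addRange(long lowerBoundInclusive, String shard) {
        lowerBounds.put(lowerBoundInclusive, shard);
    }

    // Finds the shard whose range starts at or below the key.
    String shardFor(long key) {
        Map.Entry<Long, String> entry = lowerBounds.floorEntry(key);
        if (entry == null) {
            throw new IllegalArgumentException("no shard covers key " + key);
        }
        return entry.getValue();
    }
}

// A write that is idempotent and commutative: the newest timestamp wins,
// so applying a write twice or out of order leads to the same final state.
// Assumes distinct timestamps per write.
final class LastWriteWinsCell {
    private long timestamp = Long.MIN_VALUE;
    private String value;

    synchronized void apply(long writeTimestamp, String writeValue) {
        if (writeTimestamp > timestamp) {
            timestamp = writeTimestamp;
            value = writeValue;
        }
    }

    synchronized String value() {
        return value;
    }
}
```

With writes shaped like this, re-enqueueing a failed write is harmless: if it did arrive the first time, the retry changes nothing.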

Haplocheirus is a vector cache. There are 1.2 million deliveries of posts per second, all of which would otherwise have to be queried for. Assembling the timeline is expensive if “assemble on read” is used. “Assemble on write” has high storage costs and is expensive for popular users. The latter can be fixed by async writes. For this, an LRU cache is used, which is currently memcached. In the future Twitter will use Haplo, a Redis-based timeline store. The conclusion is to use precomputing wisely.
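
The combination of "assemble on write" and an LRU cache can be sketched in a few lines of Java. Everything here (the class TimelineCache, user names as strings, tweet ids as longs) is a hypothetical in-memory stand-in for what a memcached or Redis based store would do, not a description of Twitter's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: timelines are assembled on write and kept in an LRU cache.
final class TimelineCache {
    private final Map<String, Deque<Long>> timelines;

    TimelineCache(final int maxTimelines) {
        // accessOrder = true makes LinkedHashMap track least-recently-used order.
        this.timelines = new LinkedHashMap<String, Deque<Long>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Deque<Long>> eldest) {
                return size() > maxTimelines;
            }
        };
    }

    // "Assemble on write": push the tweet id onto every follower's timeline.
    synchronized void deliver(long tweetId, List<String> followers) {
        for (String follower : followers) {
            Deque<Long> timeline = timelines.get(follower);
            if (timeline == null) {
                timeline = new ArrayDeque<>();
                timelines.put(follower, timeline);
            }
            timeline.addFirst(tweetId); // newest first
        }
    }

    // Reading is a cheap lookup; an empty result may mean the timeline was evicted.
    synchronized List<Long> read(String user) {
        Deque<Long> timeline = timelines.get(user);
        return timeline == null ? new ArrayList<Long>() : new ArrayList<>(timeline);
    }
}
```

The sketch also shows why popular users are expensive in this model: one tweet means one write per follower.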

FlockDB is a social graph store. It is realized by several tables holding relations, which are partitioned by user id. It is Twitter’s current solution for holding user relationships and calculating intersections.
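
"Calculating intersections" answers questions like "who follows both A and B?". As a toy illustration only (FlockDB works on partitioned tables, not on in-memory sets, and the class name is made up), the operation itself looks like this in Java:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a follower intersection on in-memory sets.
final class FollowerIntersection {
    // Returns the user ids present in both sets, without modifying the inputs.
    static Set<Long> common(Set<Long> followersOfA, Set<Long> followersOfB) {
        Set<Long> result = new TreeSet<>(followersOfA);
        result.retainAll(followersOfB);
        return result;
    }
}
```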

Cassandra is used by Twitter for large scale data mining, a geo database and realtime analytics. Lucene is used for searches on the geo database.

Rainbird, part of Cassandra, is used for time series analytics.

Cuckoo is used for cluster monitoring (not opensource yet).

Hadoop is used for offline processing at Twitter. 1000 machines, billions of API requests, 12 TB of ingested data and 95 million tweets per day generate a huge amount of data, for which an OLAP database is not a good fit. Hadoop scales well to large data sizes, but it is slower than a specialist OLAP DB. Twitter uses a hybrid approach with Vertica used for table aggregations, Hadoop for logs etc. Scribe (originating from Facebook) is used for logging. Hadoop ingests 12 TB of data per day.

Elephant-Bird is a library for working with data in Hadoop. Thrift, Avro and Protocol Buffers are serialization frameworks, which give a compact description of data and are backwards compatible. Very useful for logging data for later data analysis. Elephant-Bird uses Protocol Buffers for dealing with Hadoop I/O Format.

HBase and Pig (a declarative dataflow language) are used for analytics within Twitter. Howl is an abstraction to seamlessly work with Pig and Hive.

Recommendations:

Precompute results if query space is limited
Provide narrow query interfaces. Optimize them.
Staying CALM for eventual consistency.
Sharding and replication is a pattern (use a framework).
Use existing tools.

Wow, what a firework of tools, many of which I hadn’t even heard about. I guess there is quite a lot to catch up on in order to follow the latest data modeling trends. Good talk, but probably a bit too much new stuff for me.

Activiti (Tom Baeyens, Joram Barrez)

Activiti is a new BPM project led by the former jBPM head Tom Baeyens under the umbrella of Alfresco. It is licensed under the Apache License and is a BPMN 2.0 engine. Activiti can be embedded in any Java environment and is extensible. One of the technical advantages of Activiti compared to jBPM is its Spring support from the very beginning. Quite a bunch of tools surround Activiti:

Web-based BPMN 2.0 graphical editor
Activiti Explorer for task management
Activiti Probe for administrative functionality
Activiti Cycle for BPM collaboration
REST API
Activiti Eclipse designer (including BPMN 2.0 validation)
Activiti Grails integration

An example of a simple BPMN 2.0 notation used by Activiti looks like:

<?xml version="1.0" encoding="UTF-8"?>

<definitions id="definitions"
xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"
targetNamespace="http://www.activiti.org/bpmn2.0">

  <process id="helloWorld">

    <startEvent id="start" />
    <sequenceFlow id="flow1" sourceRef="start" targetRef="script" />
    <scriptTask id="script" name="HelloWorld" scriptFormat="groovy">
      <script>
        System.out.println("Hello world")
      </script>
    </scriptTask>
    <sequenceFlow id="flow2" sourceRef="script" targetRef="theEnd" />
    <endEvent id="theEnd" />

  </process>

</definitions>

This is how Activiti uses this process:

// Bootstrap
ProcessEngine processEngine = new DbProcessEngineBuilder()
  .configureFromPropertiesResource("activiti.properties")
  .buildProcessEngine();
ProcessService processService = processEngine.getProcessService();

// Deployment
processService.createDeployment()
  .addClasspathResource("hello-world.bpmn20.xml")
  .deploy();

// Run
processService.startProcessInstanceByKey("helloWorld");

A sort of real-world example (obtaining a loan from a bank) was introduced and clicked through. It included integration with Alfresco, where documents were created and managed. Excel integration is there as well.

Activiti has nice JUnit support for unit testing your processes using custom annotations. There is also a query API for querying process instances.

In a 1-minute crash movie, Joram demonstrated how easy it is to set up Activiti with a default setup along with all those nice tools.

It is really impressive what Activiti has achieved in the few months of its existence. I’m pretty sure that Activiti is (or will become) the king of open source BPM, and maybe beyond. Activiti is definitely worth a try.

BTW, I have never seen a speaker (Joram Barrez) overtaking himself while speaking that fast ;-)

Akka (Viktor Klang)

The speaker started his session by mentioning that he has to recover from 9 years of Java development, which made me crack up a bit :-) Akka is a technology which is written in both Scala and Java.

He continued by listing the vision: that it is simple to write concurrent, fault-tolerant and scalable applications using Akka.

Here is the overview he presented:

Simpler Concurrency
Event-driven architecture
True scalability
Fault tolerance
Transparent remoting
Java & Scala API

Akka is built around the actor model as its programming concept, and it indeed seems very easy to use.

Here is an example in Scala which I copied from http://akkasource.org:

// server code
class HelloWorldActor extends Actor {
 def receive = {
   case msg => self reply (msg + " World")
 }
}
RemoteNode.start("localhost", 9999).register(
 "hello-service", actorOf[HelloWorldActor])

// client code
val actor = RemoteClient.actorFor(
 "hello-service", "localhost", 9999)
val result = actor !! "Hello"

Note that !! (bang bang) is not special syntax but simply a method with a symbolic name. In the Java API the corresponding method is called “sendRequestReply”.
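
For readers who do not read Scala, the request-reply idea behind the client code above can be sketched in plain Java. This deliberately does not use Akka's API: MiniActor is a hypothetical class built only on java.util.concurrent, with a single-threaded executor playing the role of the mailbox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of the actor idea in plain Java (this is not Akka's API).
final class MiniActor<M, R> implements AutoCloseable {
    // The single thread acts as the mailbox: messages are handled strictly one at a time.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final Function<M, R> behavior;

    MiniActor(Function<M, R> behavior) {
        this.behavior = behavior;
    }

    // Sends a message and returns a Future for the reply, similar in spirit to "!!".
    Future<R> ask(final M message) {
        Callable<R> task = () -> behavior.apply(message);
        return mailbox.submit(task);
    }

    @Override
    public void close() {
        mailbox.shutdown();
    }
}
```

A "hello" actor would then be created with new MiniActor<String, String>(msg -> msg + " World"), and ask("Hello").get() blocks until the reply is available. Remoting, supervision and everything else that makes Akka interesting is of course missing here.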

A test project using Akka is online

Google Web Toolkit (David Geary)

David is obviously an expert on Java and user interfaces. He has written an impressive number of books about Swing, JavaServer Faces (JSF), Advanced JSP, the JSP Standard Tag Library, and the Google Web Toolkit.

His demo was quite enjoyable. He (re)coded on the fly a nice little web app called “Places”, containing content from Yahoo! Maps, not without some errors in Eclipse. His comment on that was: “That’s why when I’m at home I pay for IntelliJ.”

I found the slides for his demo also here

There is also a Quake demo on YouTube. Quake is running inside of a browser. This program was made with GWT.

David came up with some news about some features in GWT 2.0:

Just released (28/10/2010)
There is no fake browser any more as in version 1.0; instead there are hosted mode browser plugins (I think for Firefox, Safari and IE)
Layouting is completely new. It is very similar to Swing now (remember GridBagLayout and so on)
Event listeners (also similar to Swing/AWT/SWT), but there is no need for adapters anymore since there are EventHandlers now
“History” is also a nice feature. Using this you can browse through your web states by clicking forward or backward in your browser, as described in the GoF Memento pattern
UiBinder: widgets can now be declared in XML (because people complained about too much Java code) and can be accessed via Java annotations
Monitoring with Speed Tracer looks quite comfortable (it works for any web app, not only GWT)

I think it’s fun to play a little bit around with that technology and maybe use it with my own programs.

Android UI Development: Tips, Tricks and Techniques (Romain Guy, Chet Haase)

Let’s talk about garbage! On mobile devices garbage matters! Garbage that is generated every time an animation runs on your mobile device can cause serious problems. So keep garbage at a minimum level when dealing with mobile devices, just like you would in normal life :) Chet Haase and Romain Guy talked about tips and tools for tracking down performance problems and memory leaks on mobile devices.

Temporaries. Sometimes it is necessary to have temporary objects such as local variables, but you should always consider using a static final class member instead.
Autoboxing creates objects! If you do not need an object type, use primitive types instead so allocation is minimized.
Iterators. Enhanced for() loops are great but they create garbage, because each loop instantiates a new iterator. What can we do about this? Consider doing a size check before the enhanced for() loop, and you will prevent iterators from being created for empty collections.

if (nodeList.size() > 0) {
   for(Node node : nodeList) {
       //do something
   }
}

Image recycling. Recycle the bitmaps on mobile devices. Bitmaps are finalized and finalizers may clear the data … eventually … some time. Even setting the reference to null does not help here (myBitmap = null). What you want to do is call myBitmap.recycle(). Do not wait for the finalizer to do the work if you need that memory now. References may be all gone, but the memory is not free for new allocations when you really need it.
Varargs. Variable arguments on methods are packaged into a temporary array. Be aware of that and double check if you really need the variable arguments.
Generics. Generics only deal with objects. Primitive types get autoboxed, which causes memory allocation. Do we really need that type parameter in MyClass? Consider other ways than generics on mobile devices.

MyClass<Float> myObject = new MyClass<Float>();

Tools and Demos. The rest of the talk showed some tools for finding and checking performance issues and memory allocations. The speakers gave some demos on several very useful tools. The two most exciting tools are:
- DDMS: Allocation tracking and limiting the allocation limit. Count the allocations being made. DDMS comes as standalone version or Eclipse plugin.
- hat (Heap Analysis Tool): track down memory leaks. The demo on heap size analysis and memory leak detection showed how bitmap drawables keep a backward reference to the view port in order to be able to refresh the view. This reference causes unnecessary allocations when caching those bitmap drawables in static fields. Following from that:
  - be careful with the context
  - be careful with static fields
  - avoid non-static inner classes
  - use weak references
Responsiveness. The single-threaded UI on mobile devices requires special care! If you block the UI thread, you block user interaction. Instead use async tasks or handlers with messaging.
Overinvalidation. Render things only when they need to be rendered. Do not draw more than you really should. Custom components need to take care of invalidation. This point was followed by a very impressive demo on method profiling. Traceview shows exactly what is going on on the canvas and points to components that are unnecessarily rendered all the time. This was caused by wrong invalidation. The solution is simple: just invalidate the sections you need to refresh.
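
Several of the allocation tips above (temporaries, autoboxing, iterators, varargs) boil down to the same habit: keep inner loops free of hidden object creation. Here is a small plain-Java sketch of what that can look like. The class and method names are made up for illustration, and the reused buffer is an instance field here rather than the static final member mentioned in the talk.

```java
// Hypothetical sketch of allocation-free inner loops; not taken from the talk's slides.
final class AllocationFreeLoops {
    // Reused scratch buffer instead of a new temporary array per call (not thread-safe).
    private final float[] scratch = new float[2];

    // Primitive array plus indexed loop: no autoboxing and no Iterator object.
    float sum(float[] values) {
        float total = 0f;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Fixed parameters instead of varargs: no hidden array is created for the call.
    float[] midpoint(float x1, float y1, float x2, float y2) {
        scratch[0] = (x1 + x2) / 2f;
        scratch[1] = (y1 + y2) / 2f;
        return scratch; // callers must not keep this reference across calls
    }
}
```

The reused buffer trades safety for fewer allocations: the result is overwritten by the next call, which is fine inside an animation loop but a trap anywhere else.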

The modular Java Platform (Mark Reinhold)

Mark Reinhold is Chief Architect of the Java Platform Group at Oracle, where he works on the Java Platform, Standard Edition, and OpenJDK.

This session was about how Java 7 or later will handle application construction, packaging and publication. In other words: how do we get rid of the JAR hell we have now?

Mark explained that Project Jigsaw has already resolved a lot of these problems. These solutions will come along with Java 7.

The main solution is: The Modular Java Platform

which enables escape from JAR hell by:

eliminate the classpath
record dependencies directly in source code
package modules for automatic download & install
easily generate sensible rpm/deb/svr4/ips packages

The Module system requirements are:

fast class loading
- during startup and throughout runtime
- on all types of devices
- current class-path mechanism is too slow
Predictability
package subset: cannot (massively) refactor the existing SE API set
Substitutability: to support refactoring modules over time
Optionality
- a method can depend upon an optional module
- presence/absence of optional module detected at install
- method handles presence/absence at runtime
Self-applicability

Here are some examples of how modules can be declared:

Grouping Example

//module-info.java
module com.foo {
    class com.foo.Main;
    ...
}

//module-info.java
module com.foo {
    requires org.bar.lib;
    requires org.baz.lib;
}

Versioning Example

//module-info.java

module com.foo @ 1.0.0 {
    requires org.bar.lib @ 2.1-alpha;
    requires org.baz.lib @ 2.0;
}

Encapsulation Example

//module-info.java

module com.foo  @ 3 {
    permits org.bar.lib;
}

Optional Modules Example

//module-info.java

module com.foo {
    requires org.bar.lib;
    requires optional com.foo.extra;
}
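
The requirement that presence or absence of an optional module is handled at runtime has a rough equivalent in today's classpath world: probing for a class before using it. The sketch below is plain Java, not Jigsaw syntax. The helper class OptionalFeature is hypothetical, and any class name passed to it is just an example.

```java
// Hypothetical sketch: detect an optional dependency at runtime in plain Java.
final class OptionalFeature {
    // Returns true if the named class can be found, without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, OptionalFeature.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Code in com.foo could call such a check once at startup and switch the extra functionality on or off accordingly.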

Here is an example for how a module is packaged

$ javac -modulepath mods src/com.foo.app/...
$ ls mods
com.foo.app/
com.foo.extra/
com.foo.lib/

And this is an example of how a module can be packaged for Debian:

$ jpkg -m mods deb com.foo.app com.foo.lib

Getting things done for programmers (Kito Mann)

Kito Mann is the author of “JavaServer Faces in Action” and he runs the http://jsfcentral.com website.

The whole talk is based on the book “Getting Things Done” by David Allen.

The talk began with a description of a programmer’s daily life, being bombarded with emails, tweets, phone calls and meetings.

All of this results in too many things to do - sounds familiar to me. He uses the picture of unclosed loops for that, and he describes the goal of “GTD” as closing those loops, so that you stop constantly thinking about them and have more energy left to get things done.

GTD works like this:

collect things in an inbox by writing them down on
process your inbox with the goal of a zero inbox - put them into trash, make a task or project out of it
make tasks out of projects which you can complete - take hours or minutes
defer, delegate, delete or process tasks
put tasks in contexts such as cellphone, internet, office, whatever
filter what you can currently do by contexts
it might be useful to use project ids such as “presentations.gtd.programmers”
do a weekly review of your tasks. the outcome might be more tasks, task updates or deleting tasks
select the tasks you do based on their context, the time available, your energy level and by priority
focus on tasks: work on one task at a time, avoid distraction, box your time: pomodoro technique

Then Mr Mann started talking about tools which can be used to do GTD. I left at this point …

The talk was very good overall and worth attending. I think “Getting Things Done” has some interesting ideas in it, but it is too much of a process for me. It seems very restrictive and not flexible enough. But I will definitely try out closing my mail client from time to time to get things done…