Monday, August 5, 2013

Scala: 6 silver bullets

Bye bye Java, Hello Scala

In the JVM world, Scala is certainly the rising star. Created at EPFL in 2001, it is steadily gaining in popularity. Depending on the index, it now ranks as a "serious" language, reaching far beyond the academic world and adopted by mainstream companies (Twitter backend, eBay research, Netflix, Foursquare etc.).
For data scientists, this language is a breeze. Rising above the religious war between functional and object-oriented believers, it succeeds by merging the best of both worlds, with a strong drive towards "let's be practical."
If Grails/Groovy was a big step forward in productivity on the JVM, Scala goes even further, mixing static typing (thus efficiency) with many improvements in language structure, collection handling and concurrency, backed by solid frameworks and a very active community.
In this post, I have picked six major (and subjective) improvements, to show my hardcore Java colleagues that jumping on this train promises a great journey.

Bullet #1: Object orientation


For many, Scala means "the comeback of functional programming." That is certainly true, but its designers also totally revisited the object approach on the JVM. Thanks to static typing, the compiler can do more or less everything you can imagine.

Immutable variables

A variable declared with val is immutable: once its value has been defined, you cannot change it (var is there for the mutable cases). This is key for the functional paradigm, where a call on an object should always return the same value, as no state is stored. It can be disturbing at first, but one soon realizes that most of our code can be written with immutable variables:
  val x:Int = 42
  x=12                                //error: reassignment to val
  var i:Int = 0
  i+=1                                //OK

Type inference

Why specify the variable type wherever it can be inferred (apart from readability, on public methods for example)?
  val x = 42                          // x is an Int

Lists and maps

We'll go further on collections later on, but from the definition point of view:
  val l1 = List(1,2,5)                // List[Int]
  val l2 = List(1,2,5.0)              // List[Double]
  val l3 = List(1,2,"x")              // List[Any]
And maps follow the same approach:
  val m1 = Map("x"->3, "y"->42)       // Map[String, Int]

null is discouraged

In Java, null stands either for "does not exist" or for "actually has a null value". Scala defines a parametrized type for that purpose, Option[T] (here Option[Int]), which is either Some(value) or None. That will make full sense below with pattern matching.
  val m = Map("x"->3, "y"->42)       // Map[String, Int]
  println(m("x"))                    // -> 3
  println(m("z"))                    // -> NoSuchElementException
  println(m.get("x"))                // -> Some(3)
  println(m.get("z"))                // -> None
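To make the Option idea a little more concrete, here is a small sketch of the usual ways to consume one. The OptionDemo object and its three helper methods are names made up for this illustration; only Map.get, getOrElse, map and pattern matching come from the standard library.

```scala
object OptionDemo {
  val m = Map("x" -> 3, "y" -> 42)

  // getOrElse supplies a fallback when the key is absent
  def valueOrZero(key: String): Int = m.get(key).getOrElse(0)

  // map transforms the value only if it is present
  def doubled(key: String): Option[Int] = m.get(key).map(_ * 2)

  // pattern matching makes both cases explicit
  def describe(key: String): String = m.get(key) match {
    case Some(v) => s"$key is $v"
    case None    => s"$key is missing"
  }
}
```

None of these calls can throw on a missing key, which is the whole point compared to m("z").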


Tuples

It can be convenient to move around pairs or larger tuples without having to define temporary classes:
  val p = (1,4.0)                    // (Int, Double)
  println(p._2)                      // -> 4.0
  val l = List((1, 4.0, "x"), (2, 1.2, "y"))
                              // List[(Int, Double, String)]
  println(l(0)._1)                   // -> 1
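Beyond the _1 and _2 accessors, tuples can be unpacked into named values, which often reads better. The sketch below is only an illustration; TupleDemo and minMax are hypothetical names.

```scala
object TupleDemo {
  // a method returning two values at once, without a dedicated class
  def minMax(values: List[Double]): (Double, Double) = (values.min, values.max)

  // the tuple is unpacked into two named values
  val (lo, hi) = minMax(List(3.0, 1.5, 4.2))

  // tuples can also be unpacked inside a function literal with `case`
  val labels = List((1, "x"), (2, "y")).map { case (i, s) => s"$s$i" }
}
```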

Class

  class Peak(val x: Double, val intens: Double) {
    // parameterless method (despite its name, it returns the square)
    def sqrtIntens = intens * intens
    lazy val hundred = {              // evaluated once, when first needed
      println("hundred call")
      new Peak(x, 100 * intens)
    }

    override def toString = s"[$x, $intens]"
  }

And let's use our class:
  val p = new Peak(5, 3)
  val i2 = p.sqrtIntens               // no need for parentheses
  println(p.hundred)
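The "evaluated once, when needed" behavior of lazy val can be checked with a counter. This is a minimal sketch; LazyDemo, evaluations and expensive are invented names, and the var is there only to observe the side effect.

```scala
object LazyDemo {
  var evaluations = 0            // counts how often the body below runs

  lazy val expensive: Int = {
    evaluations += 1             // side effect, to observe the evaluation
    42
  }
}
```

Before the first access the counter stays at 0; after any number of accesses it stays at 1.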
     

Singleton object

Scala is truly object oriented. Why bother with static and MyClass.getInstance()?
  object SuperStat{
    def mean(values:List[Double]) = values.sum/values.size
    def variance(values:List[Double]) = {
        val mu = mean(values)
        val sum2 = values.map( x => math.pow(x-mu, 2))
                         .sum
        sum2/(values.size-1)         // returned value
    }
  }

  println(SuperStat.mean(List(1,2,3,4,5))) 

Companion object

If an object is defined in the same source file as a class, with the same name, it has a "special" relation to it (implicit conversions, factories etc.):
  object Peak {
    // lets a (Double, Double) tuple be used wherever a Peak is expected
    implicit def fromPair(p: (Double, Double)): Peak =
      new Peak(p._1, p._2)
  }
  val p:Peak = (1.3, 42.0)
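The factory side of companion objects deserves a sketch too. The Mass class below is hypothetical and not part of the Peak example; it only shows that a companion can reach a private constructor and expose apply as a factory.

```scala
// hypothetical class for this sketch; the constructor is private,
// so instances can only be built through the companion object
class Mass private (val value: Double) {
  override def toString = s"Mass($value)"
}

object Mass {
  // apply lets callers write Mass(12.0) instead of new Mass(12.0)
  def apply(value: Double): Mass = {
    require(value >= 0, "a mass cannot be negative")
    new Mass(value)
  }

  // a second factory that parses text, returning None on bad input
  def parse(s: String): Option[Mass] =
    scala.util.Try(s.trim.toDouble).toOption
      .filter(_ >= 0)
      .map(v => new Mass(v))
}
```

In the REPL, a class and its companion have to be entered together (paste mode) to be recognized as companions.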

Traits

Why couldn't an interface implement methods, and why couldn't a class inherit from multiple implemented parents?
While abstract classes still exist, a trait plays a more general role. It can declare methods "to be defined in the implementing class" as well as actual methods and fields (but, in Scala 2, no constructor parameters).
  trait HasIntensity{
    def intens:Double
    def sqrtIntens = intens * intens
  }
  class Peak(val x:Double, val intens:Double) extends HasIntensity{...} 
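To illustrate the "multiple implemented parents" part, here is a minimal sketch with two traits mixed into one class. The names Intense, Positioned and Signal are invented for the example so that they do not clash with the Peak code above.

```scala
trait Intense {
  def intens: Double                      // to be provided by the class
  def isStrong: Boolean = intens > 100    // implemented in the trait
}

trait Positioned {
  def x: Double                           // to be provided by the class
  def bin: Int = math.floor(x).toInt      // implemented in the trait
}

// one class, two implemented parents; the val parameters
// satisfy the abstract members of both traits
class Signal(val x: Double, val intens: Double)
  extends Positioned with Intense

object TraitDemo {
  val s = new Signal(12.7, 250.0)
}
```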

Bullet #2: Collections


Among many other variants, there are three main types of collections:
  • List[T]: to be traversed, with direct access to the head and the tail,
  • Vector[T]: with random access on integer index,
  • Map[K, V]: for dictionary or hash map structure.

Instantiation

As shown above:
  val l = List(1,2,3,4)
  val m = Map("x"->3, "y"->1984)
By default, collections are also immutable, although they also exist in a mutable flavor (there is no mutable List as such; ListBuffer plays that role):
  import scala.collection.mutable.ListBuffer

Adding values

  val l1 = List(1,2,3,4)
  val l2 = l1 :+ 5             // List(1,2,3,4,5)
  val l3 = l1 ::: List(5,6,7)  // List(1,2,3,4,5,6,7)

Access

  val l = List(1,2,3,4)
  l(2)                       // 3
  l.head                     // 1
  l.last                     // 4
  l.tail                     // List(2,3,4)
  l.take(2)                  // List(1,2)
  l.drop(2)                  // List(3,4)

And so much more

  val l = List(1,2,3,4)
  l.map(_ * 10)              // List(10,20,30,40)
  l.find(_%2 == 0)           // Some(2)
  l.filter(_%2 == 0)         // List(2,4)
  l.partition(_%2==0)        // (List(2,4), List(1,3))
  l.combinations(2)          // iterator over List(1,2), List(1,3)...
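A few more operations from the same family, as a hedged sketch (MoreListOps and its value names are made up for the illustration; the methods themselves are standard List methods):

```scala
object MoreListOps {
  val l = List(1, 2, 3, 4)

  val total      = l.foldLeft(0)(_ + _)             // accumulate from the left
  val zipped     = l.zip(List("a", "b", "c", "d"))  // pair up two lists
  val byParity   = l.groupBy(_ % 2 == 0)            // Map[Boolean, List[Int]]
  val firstPairs = l.combinations(2).take(3).toList // materialize the iterator
}
```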

A more complex example

Let's pretend we have a list of peaks as defined above Peak(val x:Double, val intens:Double). We'd like to group them by integer bin on x, sum up the intensities and keep only the 2 binned values with the highest total intensity.
  peaks.groupBy(p=>math.floor(p.x))
       .map({pl =>(pl._1, pl._2.map(_.intens).sum)})
       .toList
       .sortBy(-_._2)
       .take(2)

Each operation returns a new collection, on which the next operation can be applied. The whole succession of operations is thus described concisely.
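For readers who want to run the pipeline above, here is a self-contained sketch. The case class is a minimal stand-in for the Peak class of the post, and the sample data is invented for the illustration.

```scala
object TopBins {
  // minimal stand-in for the Peak class defined earlier in the post
  case class Peak(x: Double, intens: Double)

  // sample data, invented for this example
  val peaks = List(
    Peak(1.2, 10.0), Peak(1.8, 5.0),   // bin 1.0, total 15.0
    Peak(2.1, 3.0),                    // bin 2.0, total 3.0
    Peak(3.5, 20.0), Peak(3.9, 1.0)    // bin 3.0, total 21.0
  )

  val top2: List[(Double, Double)] =
    peaks.groupBy(p => math.floor(p.x))                      // Map[bin, peaks]
      .map { case (bin, ps) => (bin, ps.map(_.intens).sum) } // Map[bin, total]
      .toList
      .sortBy { case (_, total) => -total }                  // highest total first
      .take(2)
}
```

Using case (bin, ps) instead of pl._1 and pl._2 is a matter of taste; it gives names to the tuple members.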

for loops

Ever written for(i=0; i<arr.size();i++){arr[i]}? Well, we can do better. Let's consider two nested loops that build a list of points:
  val l1 = List(1,2,3,4)
  val l2 = List(3,4)
  for {
    i <- l1
    j <- l2 if i<j
  } yield (i,j)        // List((1,3), (1,4), (2,3), (2,4), (3,4))
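A for expression is mostly syntactic sugar over flatMap, withFilter and map. The sketch below shows the same computation written both ways; the translation given is the approximate shape of what the compiler generates, and ForDesugar is a hypothetical name.

```scala
object ForDesugar {
  val l1 = List(1, 2, 3, 4)
  val l2 = List(3, 4)

  val withFor = for {
    i <- l1
    j <- l2 if i < j
  } yield (i, j)

  // roughly the desugared form of the expression above
  val withCalls =
    l1.flatMap(i => l2.withFilter(j => i < j).map(j => (i, j)))
}
```

This is also why for works on Option, Future or any type providing these methods, not only on lists.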
  

Bullet #3: pattern matching


Think of a Java switch/case statement on steroids. And even that would be an understatement.
Pattern matching is a functional programming technique. Depending on the passed variable (its value, type or structure), a case statement is selected:
  def useless(x:Any) = x match{
    case x:String => "hello "+x
    case i:Int if i<10 => "small"
    case i:Int => "large"
    case _ => "whatever"
  }
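Matching on structure is most visible with case classes. Here is a small sketch, with a hypothetical Shape hierarchy invented for the occasion:

```scala
object ShapeDemo {
  sealed trait Shape
  case class Circle(r: Double) extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)            => math.Pi * r * r  // fields are extracted
    case Rect(w, h) if w == h => w * w            // guard: a square
    case Rect(w, h)           => w * h
  }
}
```

Because the trait is sealed, the compiler can warn when a case is forgotten.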

List structure

For example, we have the very ambitious goal of swapping the numbers two by two in a list of integers:
  def biRev(l:List[Int]):List[Int] = l match{
    case Nil => List()
    case x::Nil => List(x)
    case x1::x2::xs => List(x2, x1):::biRev(xs)
  }
  biRev(List(1,2,3,4,5))            //List(2,1,4,3,5)

With regular expressions

Let's consider a function converting to kilometers. If the argument is an integer, it is returned as is; if it's a string matching \d+miles, the number is converted into kilometers, and so on.
  val reMiles="""(\d+)miles""".r
  val reKm="""(\d+)km""".r
  def toKm(x:Any)= x match{
    case x:Int=> x
    case reMiles(v) => (v.toInt*1.609).toInt
    case reKm(v) => v.toInt
    case _ =>
      throw new IllegalArgumentException(s"cannot convert $x")
  }
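A variation on the same idea, as a sketch: a single regular expression with two capture groups, and an Option as result instead of an exception. DistanceParser and the optional whitespace in the pattern are choices made for this illustration, not part of the function above.

```scala
object DistanceParser {
  // two capture groups: the number and the unit
  private val Distance = """(\d+)\s*(km|miles)""".r

  def toKm(s: String): Option[Int] = s match {
    case Distance(v, "km")    => Some(v.toInt)
    case Distance(v, "miles") => Some((v.toInt * 1.609).toInt)
    case _                    => None   // the whole string must match
  }
}
```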
 



Bullet #4: concurrency


Easy list parallelization

Imagine we have a heavy function to be called on each member of a list (here, it will just sleep 100ms...). Having immutable variables makes it much easier to parallelize such code over the available cores, with the .par call:

  val t0 = System.currentTimeMillis()        
  def ten(i: Int) = {
    Thread.sleep(100)
    println(System.currentTimeMillis() - t0)
    i * 10
  }                                            

  (0 to 20).toList.par.map(ten).sum
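The same computation can also be expressed with futures from scala.concurrent. This is a sketch of an alternative, not a replacement for .par; FutureDemo, parallelSum and the 30 second timeout are arbitrary choices for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  def ten(i: Int): Int = {
    Thread.sleep(100)   // stands in for a heavy computation
    i * 10
  }

  def parallelSum(n: Int): Int = {
    // each call is submitted to the global thread pool
    val futures = (0 to n).toList.map(i => Future(ten(i)))
    // Future.sequence turns List[Future[Int]] into Future[List[Int]]
    val all = Future.sequence(futures)
    // blocking here is acceptable for a demo, less so in production code
    Await.result(all, 30.seconds).sum
  }
}
```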



Actors

Again, the functional trend of Scala makes it easy for actors to communicate via message passing. Here a master sends an integer to a second actor, which decreases it by 1 and sends it back at each step. Once the value reaches zero, the master sends the word "stop". Pattern matching is used to select the correct behavior.

import scala.actors.Actor
import scala.actors.Actor._

object main {
  class MinusActor(val master: Actor) extends Actor {
    def act() {
      loop {
        react {
          case (i: Int) => {
            println(s"$i--")
            master ! (i.toInt - 1)
          }
          case "stop" => {
            println("ciao")
            exit
          }
          case _ => println("whatever")
        }
      }
    }
  }
  class MasterActor extends Actor {
    val minusActor = new MinusActor(this)
    def act() {
      minusActor.start
      minusActor ! 10
      loop {
        react {
          case i: Int if i > 0 =>
            minusActor ! i
          case _ =>
            minusActor ! "stop"
            exit
        }
      }
    }
  }

  new MasterActor().start()
  Thread.sleep(1000)
}



And much more with Akka (whose actors replace scala.actors, deprecated as of Scala 2.10), futures and async.


Bullet #5: the ecosystem

No matter how brilliant, a language cannot succeed if it is not supported by a strong ecosystem, which encompasses many aspects.

Java integration

Scala code is compiled into JVM bytecode, like Java. A very good side effect is that existing Java libraries can be used transparently from Scala code. Here is a simple example using Apache Commons Math.

import org.apache.commons.math3.stat.descriptive.moment.Variance

object SD {
  val variance = new Variance()

  def apply(values: Seq[Double]): Double = {
    math.sqrt(variance.evaluate(values.toArray))
  }

  def apply(values: Seq[Double],
            weights: Seq[Double]): Double = {
    math.sqrt(variance.evaluate(values.toArray, weights.toArray))
  }
}
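The same holds for the JDK itself. As a dependency-free sketch, here is a Java collection filled, sorted and read back from Scala. JavaListDemo and sortedWithJava are hypothetical names, and the copy by index is a deliberate choice to avoid any converter import, since those have changed between Scala versions.

```scala
import java.util.{ArrayList, Collections}

object JavaListDemo {
  def sortedWithJava(values: Seq[Int]): List[Int] = {
    val javaList = new ArrayList[Integer]()
    values.foreach(v => javaList.add(v))   // Int is boxed to Integer
    Collections.sort(javaList)             // plain Java API call
    // copy back by index into an immutable Scala List
    (0 until javaList.size).map(i => javaList.get(i).intValue).toList
  }
}
```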

IDE

Typesafe supports an Eclipse package, and NetBeans is used by many. Some development environments, such as Activator/Play!, are strongly embedded into the browser and allow any text editor to be used for the source code.

Web frameworks

If Play! is the most comprehensive one, some lighter alternatives, such as Scalatra, are available.

RDBMS integration

More than an ORM, Slick is a mainstream solution: the comfort of an ORM, with the flexibility to manipulate lists in the Scala fashion. Depending on the connected database, the generated SQL is optimized.

And NoSQL

Any Java driver is usable. But some tools are natively Scala oriented, such as Spark (in-memory large-scale data processing) or ReactiveMongo.

Bullet #6: REPL


Experimenting is a key component of discovering a language. A REPL (Read-Eval-Print Loop) lets you see the code executed on the fly.

Worksheets

Even better, the Eclipse IDE (at least) offers worksheets. Enter code, and each time the file is saved, it is evaluated. This makes it possible to evaluate interactively objects defined in the source base.

This is the shortest bullet in this list. But it can sometimes be the most efficient.

Plenty more bullets


The selection of six was pretty subjective. I could have gone further, with many more bullets that make daily life much more productive:
  • xml parsing: mixing xpath and the list manipulation for a mixed SAX/DOM approach;
  • http queries;
  • file system traversal and files parsing;
  • Json serialization;
  • dynamic classes, to handle arbitrary method calls;
  • regular expression (ok, Perl is the king, but Scala is not bad);
  • macros;
  • string interpolation and reverse;
  • optional ';'
  • more Java integration;
  • abstract classes;
  • profiling with yourkit or jProfiler
  • context bound & parametric types
  • streams
  • foldLeft /:
  • for loops structure
  • map getOrElse
  • case class
  • implicit
  • Mockito, EasyMock
  • ...
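A few of these extra bullets fit in a handful of lines. The sketch below touches case classes, foldLeft, Map.getOrElse and string interpolation; MoreBullets, count and the sample values are invented for the illustration.

```scala
object MoreBullets {
  // case class: equality, toString and copy come for free
  case class Peak(x: Double, intens: Double)

  val p  = Peak(1.0, 10.0)
  val p2 = p.copy(intens = 20.0)    // a new instance, p is untouched

  // foldLeft plus Map.getOrElse: count word occurrences
  def count(words: List[String]): Map[String, Int] =
    words.foldLeft(Map.empty[String, Int]) { (acc, w) =>
      acc + (w -> (acc.getOrElse(w, 0) + 1))
    }

  // string interpolation
  val label = s"peak at ${p.x} with intensity ${p2.intens}"
}
```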

Cons


Several criticisms of Scala come up regularly. Without endorsing all of them, here are a few common ones:
  • backward incompatibilities between major versions: a jar compiled for 2.8 is not usable in 2.9. That may be the price to pay for a maturing language that is not tied to full backwards compatibility. It may be a serious issue for some, but personally, upgrading my code from 2.8 to 2.9 and 2.10, with its dependencies, has never been more than a couple of hours' work.
  • sbt is a pain: OK. But I've never made my way smoothly through Maven either. For most of the issues I faced, googling was enough, and even though I cannot claim a full understanding of the tool, it does the job. Slowly, for sure, but it does it.
  • It's hard to hire people: that should hopefully evolve, with more and more adopters and major companies making commitments to the language.
  • It's hard to learn: that is common wisdom among Scala developers. It is definitely not a language a normal developer picks up in a week (contrary to Groovy, which can be picked up much faster). Reading a book or, better, following Martin Odersky's online course could be a worthy investment. But common wisdom also says that once you have picked up the basics, the return on investment is great.

Further links

Google is your friend (well, nowadays, that's not totally true), but here are my favorites:
