A Group Project

08/08/2017

In those far-distant days when we wrestled with dinosaurs, it was the senior year of my undergraduate experience. Our AI instructor assigned a group project, and four of us met in the well air-conditioned computer lab in the warm glow of VT100s. There was an awkward bit of silence till Adam [names have been changed to protect the guilty] spoke up and said, “Well, I don’t understand how we all work together on one program.”

I spoke up and said with surety, “Simple, we just divide the work into subroutines; each person does a subroutine, and all we have to agree on is the interfaces.”

At that moment I became team leader and stepped right onto a slippery slope paved with equal parts of my own ignorance and arrogance. I grabbed a sheet of paper and quickly scribbled out three interfaces.

f(x) :: Filename -> Config

g(x) :: Config -> (Scenario -> Score) -> Scenario

h(x) :: Scenario -> Score

I added a definition of what each of those pieces was, and said, “None of these routines is too difficult to write, and if one of us needs to change an interface, just communicate with the partner who is affected. I’ll write the main loop and do the write-up, and we can make quick work of this. What do you say we meet next week, same time and place?”

There were nods of agreement, and after fishing for some nickels [we called them bees] we made copies at the machine, and everyone had a copy of the defined interfaces. Adam took f(x), Billy took g(x), and Carmen took h(x). I knew that all I had to do was write a trivial function to tie it together and fill in the blanks on a piece of paper; delegation and leadership have their perks, after all.

I wrote my routine and headed off to Tulane’s infamous Boot to congratulate myself on such a brilliant piece of work.

print g(f(ARGV[1]), h)

The next week came and we met again to stitch it all together. Adam said, “I wanted to really understand why g(x) was needed, so I wrote that, but I couldn’t figure out the proper regexes to make the file reading work, so I don’t have f(x).”

Billy spoke and said, “I didn’t like the config definition, as it didn’t work for me, so I wrote something entirely different. The core algorithm didn’t interest me, so I did something else the professor didn’t ask for. So I have J(x).”

Carmen spoke and said, “I don’t have time for this; I was hoping one of you would have done my work already out of curiosity.”

There were odd stares around the table. The realization as to why we were assigned a group project began to sink in for me. Success in development teams is greatly affected by factors other than technical prowess. It’s really about communication and following interface specifications, something none of us had learned to date. It’s also about cutting losses. We still had another week, so I said, “Everyone email me your code and I think I can make it work. Let’s meet in a week.” Being the leader also means you get to do what doesn’t get done.

I met up with Adam later and worked with him on the regexes. The scoring function was trivial, and we got his version working fairly quickly. At the next meeting we all signed off and turned it in. We got an A; many teams never made it over the finish line. I’ve seen this small exercise repeated over and over on larger and larger scales of money and resources. Small group projects with definable interfaces are an incredible learning tool.

UseR!2017 Notes

07/10/2017

Vanderbilt sponsored a trip for me to Brussels to attend UseR!2017. This is a loose collection of my notes from the conference.

I met a Wolfgang from Austria who does clinical statistics in general medicine. I showed him my tangram slides and he was very interested; he said he would be watching the project, as he thought it could be very useful to him. Later I had dinner with him. He was mortified when I described Volko’s Fire in the Lake boardgame: “Why would anyone ever want to play such a monstrosity?” I really hope to get in touch with him and get tangram to help his research.

Attended Martyn Plummer’s JAGS tutorial. It was very well done. Martyn is a natural educator, very welcoming and engaging. The slides and exercises are provided at tinyurl.com/yca2soe5. He was asked to compare it with Stan. He did, but ended with this quote: “All the cool kids are using Julia, so why not do that instead?” JAGS looks like a really nice set of tools for Bayesian modeling.
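For my own notes, the flavor of driving JAGS from R via rjags; a toy Normal-mean model I wrote from memory, not Martyn’s tutorial material (assumes the JAGS library itself is installed):

library(rjags)

# toy model: Normal likelihood with vague priors on the mean and precision
model_string <- "
model {
  for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
  mu  ~ dnorm(0, 1.0E-4)
  tau ~ dgamma(0.01, 0.01)
}"

set.seed(1)
y  <- rnorm(50, mean = 3, sd = 2)
jm <- jags.model(textConnection(model_string),
                 data = list(y = y, N = length(y)))
post <- coda.samples(jm, variable.names = c("mu", "tau"), n.iter = 2000)
summary(post)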

Met an Edvin from Leiden on the bus and we discussed financial risk models. He had a very interesting project. I shared with him my spectral ideas for treatment of a Cauchy as a sum of normals to estimate within a percentage of desired risk.

Met an Ian Lyttle from Schneider Electric. They are using R to model a good number of things about their product line, and are interfacing internally with their company via Shiny. Shiny came up many, many times for the rest of the conference. Overall an interesting presentation. He said he sometimes comes to the Smyrna plant near Nashville. He ended up out socializing with Lucy and Nic, so I think we’ll be hearing more from him.

Overall, the corporate presence was larger than last year. Microsoft and Oracle have both discovered the power of R and are doing everything they can to pull companies into their R offerings. They were given, for the most part, the main-hall lectures, which unsurprisingly were mostly empty, while the Shiny track was standing room only out the door. Shiny was the hot technology in R at this conference. I thought about why this was, and having experience with about 20 web frameworks, I think it’s because the programmatic interface for Shiny is incredibly easy to use and easy to extend. Its performance as a web server is terrible in comparison to most frameworks, but for low-volume applications it works quite well. There was an opening Shiny talk which was supposed to be about scaling but covered none of that, and the questions got tense, demanding details about scaling.
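To show what I mean by easy, here is a complete working Shiny app; a toy of my own, not from any of the talks:

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n), main = "Random draws"))
}

shinyApp(ui, server)

That is the whole program: a page description, a reactive server function, and one call to tie them together.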

There was no board for posting employment opportunities. If anyone mentioned they were in the market for a job, they got about three offers before they turned around. Demand for R programmers is very high; I got an unsolicited offer myself.

On a personal note, I was able to make the trip without a migraine, and for someone with a severe migraine disorder triggered by sleep disturbances this was no small feat. I adjusted my sleep cycle over the days going and over the days returning, and the strategy worked. I did get ill on the first day of the conference during the JAGS tutorial and feared missing more, but after a fever/etc. it passed and I was back in the saddle for Day 2. Martyn’s talk on JAGS was really nice even though I was in poor spirits. I spoke with Martyn afterward, and he was disappointed that Frank was not at the meeting.

Every person I showed my tangram slides on my phone split into two categories: unimpressed, doing non-clinical work; or “I need that NOW!”, doing clinical work. I spoke with Tom, and we need to find a conference to present at that overlaps the clinical and the statistical. I also thought hard about what I could have done to get a talk (there were some unimpressive talks), and I think the abstract needs a complete overhaul, plus some marketing over time to build up a user base. The two things people wanted most were LaTeX and Word interfaces, so the current work cycle toward a full-stack implementation should continue.

Some future ideas for tangram that I had at the meeting. The first is that the parser should support R formula syntax 100% and be usable outside the library as a formula parser. I spoke some with Uwe, and he knew of no Backus-Naur form (BNF) documentation of the formula parser. He said to read the FORTRAN source. I think I’m going to have to, and create the BNF document myself; Uwe said the core team would love to have it documented. The internal parser makes assumptions about the semantics of the formula and isn’t easily used as an abstract syntax tree the way it’s coded.
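That said, a formula is already a call object in R, so one can walk it as an AST by hand. A rough sketch of the kind of recursive descent I have in mind, not tangram’s actual parser:

# print the tree structure of a formula's right-hand side
f <- y ~ a + b:c + log(d)

walk <- function(e, depth = 0) {
  cat(strrep("  ", depth), deparse(e), "\n")
  if (is.call(e)) for (arg in as.list(e)[-1]) walk(arg, depth + 1)
}

walk(f[[3]])  # f[[1]] is `~`, f[[2]] the LHS, f[[3]] the RHS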

Secondly, talking with the RStudio folks, they were saddened to hear that Jeff had left us and wanted to contact him about job opportunities. They said that pandoc supports Word natively. I thought I had tried this and it didn’t work well, but I’m going to give it another try. What is really needed in this context is a traceability key that can be added as a comment. They thought the pandoc team would be open to adding such a small extension, and then I could modify the tangram framework to work directly with pandoc. This would be ideal! The possibilities such a tiny modification to pandoc opens up are enormous for traceability.

I’m working toward finishing out LaTeX support. I have a Unicode-to-LaTeX map for use in R, defined by merging three sources, here: https://github.com/spgarbet/tangram/blob/master/R/render-latex-map.R. I think this would be useful in a variety of contexts. In general I’m going to create support for a subset of Rmd to LaTeX for use in the library. The biggest issue I’ve had in that department is that Rmd does not support superscript/subscript, so I’m toying with including the LaTeX definitions in their raw form.
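The idea, in a hypothetical standalone sketch; this is not tangram’s actual map or interface, just the shape of it:

# map selected Unicode characters to LaTeX math equivalents
latex_map <- c("±" = "\\pm", "≥" = "\\geq", "α" = "\\alpha")

to_latex <- function(s) {
  chars <- strsplit(s, "")[[1]]
  hit   <- chars %in% names(latex_map)
  chars[hit] <- paste0("$", latex_map[chars[hit]], "$")
  paste(chars, collapse = "")
}

cat(to_latex("α ≥ 0.05 ± noise"))  # $\alpha$ $\geq$ 0.05 $\pm$ noise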

There was an interesting presentation of rags2ridges which was said to find network relationships in a more robust manner than lasso. A paper is published on the performance of the algorithm. In general a very exciting piece of work, and I need to know more about this as network discovery is something of interest to me.

There was a really nice visual presentation of clusters by Agnieszka Sitko using factorMerger. One of the methods used likelihood ratios to determine branching. See the paper.

Thomas Petzoldt had the definitive growth-rate package in R. It would take a growth model from differential equations, or any of several formulaic ones such as Gompertz or logistic, and allow a regression over data sets. Really a wonderful package. The ‘fracprolif’ package should just be folded into his.

Saw a presentation on Multiple Correspondence Analysis, which was billed as ‘PCA on categorical information’ and could prove useful at some point.
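A hedged sketch of the idea for my notes, using FactoMineR, which I am assuming here; it was not necessarily the package presented:

library(FactoMineR)

# toy all-categorical data; MCA is roughly 'PCA for factors'
set.seed(42)
df <- data.frame(
  color = factor(sample(c("red", "green", "blue"), 200, replace = TRUE)),
  size  = factor(sample(c("S", "M", "L"), 200, replace = TRUE)),
  shape = factor(sample(c("round", "square"), 200, replace = TRUE))
)

res <- MCA(df, graph = FALSE)
res$eig  # variance explained by each dimension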

I got to have a meeting with Bart Smeets and Iñaki Úcar and they are doing some wonderful work on ‘simmer’. They showed me the new package ‘simmer.optim’ which optimizes parameters from a simmer model. I need to explore this further.


There were also some psychometrics presentations that had tools in Shiny for evaluating test performances over a student body. This may be of interest to our graduate program.

As an aside, I saw this on twitter (https://twitter.com/R_Programming/status/885084780777353216), which I think tangram can already do, and if not, it can with very little modification, in all the rendering formats. They had a much larger team thrown at the problem. Note how well they marketed what they’ve done. I need to wrap up and write a short paper for the R Journal.

Julia and Parallel

03/15/2017

I’m working on an economic model to find the best strategy for patient testing with genomic panels and how it affects treatment. We used discrete event simulation (simmer in R), and it’s working well. The problem has become performance. Since most of the events we’re interested in are rare, the simulation size is up around 4,000,000 patients to tease out enough resolution to make economic forecasts of impact, as measured in expected cost and quality of life. We need to look at the sensitivity of the model in the parameter space and find where the biggest benefits are, and what makes the model efficient or not. Doing this requires around 200,000 model runs, which ends up meaning we need about a trillion patient simulations. Current speed is around 500 simulations/second, which leaves me needing around 65 years of computing time. Fire up that ACCRE cluster!
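For the record, the back-of-the-envelope behind those numbers:

patients <- 4e6                    # patients per model run
runs     <- 2e5                    # model runs for the sensitivity analysis
rate     <- 500                    # patient simulations per second

secs <- patients * runs / rate     # 1.6e9 seconds for 8e11 simulations
secs / (60 * 60 * 24 * 365.25)     # ~51 years; ~63 rounding up to a trillion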

Or alternatively, I could do something a bit more intelligent and optimize model execution. I’ve figured out that these models are all solvable via differential equations from queuing theory. I’ve gone further and found that the general equation is representable as an age-structured partial delay differential equation. It turns out these are difficult to solve, but fortunately for the project there is a subset of these problems that is solvable, and it just happens to be this set. We came up with a simple test model, implemented it in DES, and I rewrote it as a numerical delay differential equation and used deSolve to crack it. It takes 11 seconds to numerically solve the model. That’s the full solution at 1e-12 resolution; the equivalent of an infinite number of simulations. This puts it into the realm of 25 days of computing time to solve the sensitivity problem. Better, but not good enough.
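For flavor, a minimal deSolve sketch of a delay differential equation; a toy lag equation, not the project’s actual model:

library(deSolve)

# y'(t) = -y(t - tau), with history y = 1 for t < tau
derivs <- function(t, y, parms) {
  tau  <- parms["tau"]
  ylag <- if (t < tau) 1 else lagvalue(t - tau)
  list(-ylag)
}

out <- dede(y = c(y = 1), times = seq(0, 20, by = 0.1),
            func = derivs, parms = c(tau = 1))
head(out)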

That was when I discovered Julia, while looking around for a compiled solution. I recoded the model, and the resulting model code is arguably simpler than the R code. It runs the model in 0.02 seconds once the Julia code is compiled (which it does at the start of execution, so there’s a delay in getting going). Now we’re talking an hour to run the sensitivity analysis. I’ve already done this, passed it into Saltelli’s variance-based method, and validated the results. Over the last few days I was able to do around five runs and debug a few things in the flow. Julia is certainly a powerful mathematical language.

The full model is not going to be as easy as the simple one; probably closer to 0.2 seconds a run to solve. This is feasible, but I want more. I have 8 cores on my box and should be able to run the model in parallel, using all that power to make it happen in short order. I thought it’d be a simple command like mclapply from parallel, as shown in this R session:

> library(parallel)
> solve_model <- function(x) 2*x
> x <- unlist(mclapply(1:100, mc.cores=8, function(i) {solve_model(i)}))
> sum(x)
[1] 10100

This is where it got weird. Julia doesn’t follow the principle of least surprise when it comes to parallel processing. Here is a short Julia equivalent of the R example, showing what was actually required to get it to work:

@everywhere module StuffToUseParallel

  function solve_model(x)
    2*x
  end
  
  export solve_model
  
end # StuffToUseParallel

using StuffToUseParallel

x = SharedArray{Int}(100)
# documentation says use @sync @parallel if you actually 
# want it computed at the end of the loop
# But that didn't work.
futures = @parallel for i in 1:100
  x[i] = solve_model(i)
end

# @sync didn't work, but this does
for f in futures
  fetch(f)
end

# Now it's safely available to use
println(sum(x))

First, I had a lot of functions, and they didn’t exist in the parallel workers unless I put @everywhere in front of every function. To make them available to multiple workers easily, I enclosed them in a module and put the @everywhere on that. Of course, since it’s a module, it has to export the useful functions. Then you have to `using` the module, and now those functions are available in every worker.

The documentation for Julia clearly showed that it was necessary to use a SharedArray, which didn’t work for higher-order datatypes; I had to use a flat array. If you have multiple values being returned, this gets ugly fast. The real frustration was that the @sync directive didn’t actually do that. I had to write an explicit loop and fetch all the futures that were returned. Only then was the data safe to use.

That’s a lot of boilerplate and several surprising little land mines along the way. While Julia is clearly a powerful language for solving mathematical models, its parallelism is a bit difficult compared with other high-level languages. I’m sure there are good reasons for having all these constructs, since one is directing a compiler toward making thread-safe code. However, it just doesn’t feel natural. Julia exposes a lot of parallelism concepts, and this puts a lot of power in the programmer’s hands (who knows what he’s doing!). It’d really be great if there were a simple mode that just worked without all the extra directives. Still, I’m satisfied: I ran a full sensitivity analysis of the simple model in 20 minutes on 8 cores. We should be able to achieve 2 hours for the full model with these numbers.

By the way, Chris Rackauckas was incredibly helpful in getting me going. He authored the DifferentialEquations suite for Julia. In my experience this code is numerically solid, blazingly fast, and very easy to use.

Here it goes

03/15/2017

Well, I bit the bullet and I’m going with static page generation. It’s a lot easier than dealing with code updates. Hosting is now trivial from GitHub. It’ll take some time to resurrect old posts, but I’ve got the database backed up, and I just have to figure out how to decompress Zotonic assets.

Revamp Ideas

10/28/2016

Beginning a revamp of my website, once again. I saw Nick Strayer’s presentation today on how to simply put together a static website. I knew most of the parts and pieces, but the simplicity attracted me, especially considering I have broken my Erlang site and would have to invest some time to fix it. I’m going to write a PostgreSQL export from Zotonic and see how far I can get with that.

I think the real upside to all this is that a lot of my projects can get centrally hubbed through a static site. The GitHub model of hosting is quite amazing for simple static sites.

Nick used a photo of himself yawning as an example. I’m posting it here as a tribute, and because once you’ve posted something like that to the internet, it’ll probably end up as a poster in your retirement home.

Nick Yawning