Sorry about not seeing your reply yesterday, HN is not as helpful as Reddit so I...

simondanisch · on July 16, 2018

On Julia 0.7, the broadcast kernel will look much more like this (and actually the version used in CuArrays/CLArrays currently also doesn't use those macros):

https://github.com/jrevels/MixedModeBroadcastAD.jl/blob/mast...

let I = @cuda_index(shape)

    @inbounds dest[I] = bc[I]

end

There are several things to point out here:

1) The macro @cuda_index doesn't really need to be a macro, but like this it's the shortest way to check the index and return when out of range (in CUDA/OpenCL you often need to schedule a loop with more iterations than you need).

2) the `dest[I] = bc[I]` does all the unzipping of arguments and dispatch to the correct fast getindex for a certain layout of dest/bc via the type system + tuples of the arguments (not using C-style varargs). This simple function makes it possible to get all the advantages of julia's lazy fused broadcasting to work.

Feel free to checkout this great blog article to see what's possible with the new lazy broadcasting:

https://julialang.org/blog/2018/05/extensible-broadcast-fusi...

You can also look directly at the broadcasting code in Julia Base, which is already a bit bloated to deal with lots of edge cases, but is still pretty readable in parts:

https://github.com/JuliaLang/julia/blob/master/base/broadcas...

abstractcontrol · on July 17, 2018

I like how the broadcasting comes out in the end in Julia. Besides that Julia has quite many niceties that I could only wish Spiral had and I do think that they matter significantly even if they are not vital. This is definitely praiseworthy work.

Since I haven't done so and I really should have, if you or anyone else still looking at this thread are curious how GPU programming without macros looks like then take a peek at this:

https://github.com/mrakgr/The-Spiral-Language/blob/3a94ca644...

Comparing Spiral and Julia code that has been linked so far, I feel that no one would ever guess without knowing ahead of time that Spiral is a static language and Julia dynamic based on these examples.

As an example of this in action, here is how layer normalization is implemented using the 'seq_broadcast' kernel which does a sequence of reductions and broadcasts in registers. It is also used to implement softmax for example.

https://github.com/mrakgr/The-Spiral-Language/blob/3a94ca644...

simondanisch · on July 17, 2018

>GPU programming without macros looks like then take a peek at this:

I'm not sure if you missed my examples, or if there is some other missunderstanding going on ;)

GPU code in Julia works 100% without macros, but you are free to use them, so people do ;) Also, you can pretty much write the gpu kernels like pseydo code, not sure how much simpler it can get.

With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)

I wrote an article some time ago about generic programming with Julia's gpu infrastructure:

https://techburst.io/writing-extendable-and-hardware-agnosti...

If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)

Cuda only (for no reason actually) and with lots of macros, here is another article about gpu programing in Julia: https://mikeinnes.github.io/2017/08/24/cudanative.html

Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sup par, it looks much longer than a simple version.

abstractcontrol · on July 18, 2018

> With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)

Heh, I've noticed that the setup code tends to come out longer than the actual kernels. It is inside `kernel = cuda`. Spiral is indentation sensitive like Python and F#.

The stuff inside {} is just module creation, think of it like tuples with named fields.

> Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sup par, it looks much longer than a simple version.

No, I've yet to actually benchmark it. It really depends on how good of a job NVCC does with the generic sequential reduce kernel. I'll do an in depth analysis when I am done with all the neural network work that I am doing currently which might take a while.

https://github.com/mrakgr/The-Spiral-Language/blob/7ecd30bdf...

You can see how it is implement here for the forward and backward parts.

https://github.com/mrakgr/The-Spiral-Language/blob/7ecd30bdf...

In the actual cost function, it is a bit different since I fuse the forward and the backward parts.

It is a bit ugly because of all the type checking for whether the argument is a dual, but that is all done at compile time.

> https://mikeinnes.github.io/2017/08/24/cudanative.html

This is actually loop unrolling example that I had in mind when I said that Julia needs macros for.

https://github.com/mrakgr/The-Spiral-Language#3-loops-and-ar...

You can in fact get loop unrolling with just standard functions in Spiral. I go into it in the context of this chapter. It is achieved by pattern matching over tuples and recursion.

> If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)

Yes, by replacing meta programming with intensional polymorphism and inlinining guarantees. Also reflection should be done using pattern matching. I think this last one could be taken entirely seriously as it would not involve a replacement of the entire type system.

All the claims in the article about inlining and specialization that make it sound like magic is what in general makes me dubious about languages pretending to be speed kings. Yes, I am aware that GPU kernels do not require optimizers capable of having monads for breakfast and that in the context of GPU programming where they were made they are probably true, but inlining is the sort of thing that matters more the more high level a language is. For very high level languages that desire speed, there isn't much choice but to make them a part of language semantics despite the added burden it puts on the user.

Since we are still at it, I have a question I need to ask.

Recently I've been informed that Julia is capable of GCing GPU memory. If this is fully integrated that would be a major feature which is not possible in say .NET or Racket. I really wanted this in Spiral and could not get it in .NET.

By fully integrated, I mean much like for regular memory for which the GC takes note of the state of the system for when to do collection and defragmentation. If it can only make a thin wrapper with a finalizer (much like in .NET) around an unmanaged resource then it is not a big deal.

Is it fully integrated or is it a wrapper style memory management?

If it is the later, then that is too bad, but I'd suggest to Julia devs that they work on making it fully integrated as it would be a really good feature. Obviously, I can't do it in Spiral as I would need to write my own VM and I have only so many years in my life.

simondanisch · on July 18, 2018

>You can in fact get loop unrolling with just standard functions in Spiral.

We can do this too actually - and spirals approach sounds very similar to what you'd do in julia :)

https://gist.github.com/SimonDanisch/812e01d7183681ed414932c...

As a matter of fact, it's what the compiler devs keep recomending. As a developper I must say, that it's pretty relaxing to take a meta programming shortcut from time to time ;) In general, we plan to offer a tool box that employs these kind of patterns for loop unrolling, tiling etc, to make it easier to write GPU code without meta programming and perform certain optimizations/scheduling patterns based on a more trait like system.

> Also reflection should be done using pattern matching

I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch! If we needed more than that, we could put the effort into extending the language into that direction. But I think your use cases must get be pretty advanced untill you need something more complex. Can you recommend a nice article that shows off beautiful pattern matching for reflection? I see how it sounds nice, but I can't really imagine right now how it would improve my life.

Concerning inlining, I do think we actually force codegen to always inline when compiling for the GPU. I don't remember a 100% anymore, but there might have been some problems with that. But it's definitely the goal, to have a flexible compiler that you can fine tune to specific domains like GPU programming. And if we don't get it as part of the compiler, we can definitely do things like that with https://github.com/jrevels/Cassette.jl/

Appart from that, I'm pretty happy that our GPU compilation infrastructure isn't making Julia less useful as a general purpose language ;)

>Is it fully integrated or is it a wrapper style memory management?

It's wrapper style. We're playing around with different kind of hacks to make it less worse, but those are hacks you could do in any GC language. The good news is, that the Julia compiler devs take GPU computing seriously and promised to work on a full integration that is aware of the GPU hardware.

abstractcontrol · on July 19, 2018

> As a matter of fact, it's what the compiler devs keep recomending.

I definitely agree with them in this matter. Macros might have a role in language development, but they should not be a stand in for compiler optimizations. For safety and speed, the type system should be there.

https://github.com/mrakgr/The-Spiral-Language/blob/0741959ac...

Here is how the example you've shown could be done in Spiral. Maybe I should add express support for literal testing in pattern matching, but it is not a pattern that comes up too often by itself.

> I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch!

What you say is exactly right as pattern matching compiles down to those low level operations, so there no reason at all why those low level operations should be done by hand. Pattern matching does not depend on a particular type system or whether the language is dynamic or static. There are only so many good ideas in programming languages and this is one of them.

Though there is some overlap between pattern matching and multiple dispatch, the roles are different. The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.

https://stackoverflow.com/questions/2502354/what-is-pattern-...

Note the great disparity in the F# examples that do pattern matching and the C# examples that do manual reflection.

https://github.com/mrakgr/The-Spiral-Language/blob/0741959ac...

Here are a few very simple examples of it in action in Spiral. I use them as compiler tests.

https://github.com/mrakgr/The-Spiral-Language/blob/0741959ac...

Here is quite a complex example of how it is used in action. I won't go into detail of this here, but you can see how I repeatedly match on the contents of a module at different times in order to get more generic functionality for the kernel.

The particular kernel shown here is the most complex one that exists in the library right now - I am yet to get to things like generic matrix multiplication and convolution, but I'll get there eventually.

> I see how it sounds nice, but I can't really imagine right now how it would improve my life.

Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life. I could have said the same thing about first class functions back in 2015. I am sure in the future there will be such features I can't even imagine right now.

> And if we don't get it as part of the compiler, we can definitely do things like that with https://github.com/jrevels/Cassette.jl/

I'll have to watch the talk on this. Thanks for the link.

simondanisch · on July 26, 2018

>The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.

Agreed! ;) Just wanted to point out, that I was able to use multiple dispatch elegantly in the unroll example, where you are using pattern matching.

>pattern matching compiles down to those low level operations

So I guess that means Julia could indeed satisfy you, if we gave those low level operations some more syntactic sugar? Have you seen: http://kmsquire.github.io/Match.jl/latest/ ?

>Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life

True story. I'll try to be more observant when writing code and see, if I could actually solve things more elegantly with better pattern matching.

Btw, I'm writing on an article about GPU programming in Julia - do you have feature you'd like to have explained, or a killer feature you believe we can't do and that would impress you if we did?

StefanKarpinski · on July 17, 2018

You keep making assertions based on completely false assumptions about Julia and its type system. It might come across better if you posed these assertions as questions instead. A few counterpoints...

> One immediate benefit of a powerful type system that I'd like to point out is that Spiral allows reflection over tuples, which is a lot more elegant solution to a problem of a function having variable arguments than the C-style varargs mechanism that I see Julia using in the examples you've shown me.

* Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time. When tuple size and contents are statically knowable, there is no run time overhead.

* Julia's varargs are represented as tuples, which can be and are reflected on by the compiler. Julia's varargs has nothing to do with C varargs except superficial usage similarity.

> Another benefit is that Spiral has no need for separate static array machinery. Because literals can be a part of a variable's type and if not, unless explicitly prevented the literals tend to be propagated through function call boundaries (join points).

* The fact that Julia's default array types doesn't include static dimensions is a design _choice_ not a type system limitation. Specializing on the exact dimensions of an array is _not_ usually what you want to do in general computational work. If it _is_ what you want to do, you can use the StaticArrays package.

* Literals can be type parameters in Julia too—this is done all the time. Indeed, as I said elsewhere in this thread, type parameters can be other types, integers, floats, symbols, tuples, and recursive constructions thereof.

* The StaticArrays package is implemented in pure Julia, demonstrating that the type system is perfectly capable of reflecting on tuples and having dimension sizes and tuples of them as type parameters—that's how the static array types are implemented. The compiler generates completely unrolled, fully inlined code for static array operations since the dimensions are known at compile time.

> Assuming they are known at compile time the inlining of tensor dimensions in loops is actually the default behavior in Spiral. I've yet to do benchmarking to see how helpful that is, but I am sure it would help reduce register pressure.

I understand that Spiral is targeted at a context (GPUs) where static arrays are a good default. The fact that Julia does not take that approach as a default is not a type system limitation (as proven by the existence of StaticArrays), it's a choice based on the fact that static arrays are not the right default for general numerical computing. There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.

> Your examples are not quite enough to get me to apologize for my rude behavior, but are enough to get me to shift my views a little so I'll change the offending sentence so it reflects reality more accurately.

I'm unclear on why any examples are necessary to apologize for rudeness. Being right doesn't justify being rude.

As far as I can tell, the relationship between Spiral and Julia's approaches is roughly:

* Spiral is static and primarily focused on statically known, fixed sized arrays and generating good code for GPUs.

* Julia is dynamic, and primarily focused on flexibility along with C-level performance on CPUs.

* Julia partially covers Spirals target territory with StaticArrays and GPU code generation capabilities.

* Spiral does not, on the other hand address (or aim) to the more general array computing that Julia supports.

Given the substantial overlap in capabilities, it seems like dismissing Julia out of hand for the application areas you're interested is both unwarranted and unwise. You may want to be a bit less dismissive and understand it and learn from it instead. In particular, Spiral's type system is not more powerful than Julia's. They are different since Spiral's is static and forces types to be known and checked at compile time (as far as I understand). Julia's type system relaxes this, allowing types to be unknown until much later—and as a tradeoff, checked much later. I have yet to see anything that Spiral's type system can express that cannot be expressed in Julia's. Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.

abstractcontrol · on July 18, 2018

* Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time.

Then why do I nowhere see that actually being done? Reflection on tuples should be done much like pattern matching on lists in functional languages. What is the point of that VarArgs nonsense? I know that Julia has pattern matching - now it needs to take the next step and actually make use of it.

Is this Julia's way of making itself familiar to C programmers?

I see those `...` elipses used as some kind of operator in both C++ (and Racket ironically) and they are a horrible idea as they are completely implicit and non-obvious in their function.

Using type membership tests + if statements is so 90s. Is Julia also trying to draw in the Java crowd here by trying to be closer to that language?

> <all those other points>

> There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.

It is not that Spiral enforces that dimensions be static. Rather if they are known at compile time then that information is merely propagated forward including through function call boundaries.

You seem to miss what it really means to have first-class types together with staging in a language. Spiral's tensors can be arbitrarily static or dynamic in their dimension and can even allow some dimensions to be static (known at compile time) while the others are dynamic. No friction results from this.

All this does not actually require separate implementations like in Julia. Both Spiral and Julia have Turing complete type systems, but based on this I can conclude that Spiral's is more expressive.

It would be trivial to force it so all the dimensions of a tensor are dynamic, but why would one want to propagate less information during compilation? It is not like dimensions of tensors change that often or arbitrarily like common variables do.

Also let me just state for the record that Spiral is intended to be more than a GPU language. A language with the capabilities of doing it elegantly was my motivation, but Spiral featureset makes it uniquely suited for both very high level and very low level programming.

A language saying it wants to be as fast as C counts for very little, the question is how fast it would be once the code starts being really abstract? This is why the rare one benchmark that currently exists in Spiral's documentation is for parser combinators.

GPU kernels are not a good test bed for language speed because they do not use that many high level features except for the ones needed for tensors. I'd be more concerned with all the scaffolding needed to set up the kernel. And in fact that was one of the majors concerns for me back when I was doing a ML library in F#.

https://www.youtube.com/watch?v=YRoruJRmuLc&t=2537s Jeffrey M. Siskind – The tension between convenience and performance in automatic differentiation

In the field that Julia is aiming for, I'd rather see a benchmark for a CPU based AD library that works on scalars which is a use case in scientific computing. Optimizing this is actually difficult given that none of the mainstream functional languages can do it. I'd expect the same situation as with monads for Julia where the inliner just gives up.

> Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.

I do not know whether the code that was linked in these threads is a representative sample for Julia, but Julia's code to me looks much more like it was written in a static language than Spiral's does.

I think that language wars such as these are necessary in order to come to the truth. It is not that I am being rude on purpose, it is that rudeness in a battle is to be excused and seen as unavoidable. Being right is a process rather than a fact because being able to get it down to a fact is rare and getting the fact to be accepted is even rarer. Cooking a good meal requires flames.

Language semantics have indivisible algorithmic properties that set them apart from each other and hence they cannot ever be equal. The best thing therefore is to enjoy the division. I see this as different than celebrating diversity.