Heterogeneous computing
Support for heterogeneous computing is currently being worked on.
User-facing interface
GPU support in Trixi.jl is controlled via two keyword arguments to semidiscretize:
storage_type: the array type used for all internal data structures. Set toCuArray(from CUDA.jl) for NVIDIA GPUs orROCArray(from AMDGPU.jl) for AMD GPUs. Defaults tonothing, which keeps the standard CPUArray.real_type: the floating-point element type. GPU workflows typically benefit from setting this toFloat32. Defaults tonothing, which retains the type used when building the semidiscretization (usuallyFloat64).
Both arguments can be used independently. A typical GPU setup looks like:
using CUDA # or using AMDGPU
ode = semidiscretize(semi, tspan; real_type = Float32, storage_type = CuArray)The rest of the elixir (callbacks, ODE solver call) remains unchanged. See, e.g., examples/p4est_2d_dgsem/elixir_euler_source_terms.jl for a concrete example.
To use Float32 consistently, make sure to write all equations, initial conditions, boundary conditions, and source terms in a type-stable manner — avoid hard-coded Float64 literals and instead use, e.g., the f0 suffix (0.5f0) for exact values or convert(RealT, ...) for non-exact ones, where RealT = eltype(u). When using a StepsizeCallback for CFL-based step size control, also pass dt = 1 (an integer) rather than dt = 1.0 to solve. See Numeric types and type stability for detailed guidelines.
Internal use of Adapt.jl
Adapt.jl is a package in the JuliaGPU family that allows for the translation of nested data structures. The primary goal is to allow the substitution of Array at the storage level with a GPU array like CuArray from CUDA.jl.
To facilitate this, data structures must be parameterized, so instead of:
struct Container <: Trixi.AbstractContainer
data::Array{Float64, 2}
endThey must be written as:
struct Container{D<:AbstractArray} <: Trixi.AbstractContainer
data::D
endfurthermore, we need to define a function that allows for the conversion of storage of our types:
using Adapt
function Adapt.adapt_structure(to, C::Container)
return Container(adapt(to, C.data))
endor simply
Adapt.@adapt_structure(Container)additionally, we must define Adapt.parent_type.
function Adapt.parent_type(::Type{<:Container{D}}) where D
return D
endAll together we can use this machinery to perform conversions of a container.
julia> C = Container(zeros(3))
Container{Vector{Float64}}([0.0, 0.0, 0.0])
julia> Trixi.storage_type(C)
Arrayjulia> using CUDA
julia> GPU_C = adapt(CuArray, C)
Container{CuArray{Float64, 1, CUDA.DeviceMemory}}([0.0, 0.0, 0.0])
julia> Trixi.storage_type(C)
CuArrayElement-type conversion with Trixi.trixi_adapt.
We can use Trixi.trixi_adapt to perform both an element-type and a storage-type adoption:
julia> C = Container(zeros(3))
Container{Vector{Float64}}([0.0, 0.0, 0.0])
julia> Trixi.trixi_adapt(Array, Float32, C)
Container{Vector{Float32}}(Float32[0.0, 0.0, 0.0])julia> Trixi.trixi_adapt(CuArray, Float32, C)
Container{CuArray{Float32, 1, CUDA.DeviceMemory}}(Float32[0.0, 0.0, 0.0])adapt(Array{Float32}, C) is tempting, but it will do the wrong thing in the presence of SVectors and similar arrays from StaticArrays.jl.
Writing GPU kernels
Offloading computations to the GPU is done with KernelAbstractions.jl, allowing for vendor-agnostic GPU code.
Example
Given the following Trixi.jl code, which would typically be called from within rhs!:
function trixi_rhs_fct(mesh, equations, solver, cache, args)
@threaded for element in eachelement(solver, cache)
# code
end
endMove the inner code into a new inlined function
rhs_fct_per_element.@inline function rhs_fct_per_element(..., element, ...) ... endBesides the index
element, pass all required fields as arguments, but make sure to@unpackthem from their structs in advance.Where
trixi_rhs_fctis called, get the backend, i.e., the hardware we are currently running on viatrixi_backend(x). This will, e.g., work withu_ode. Internally, KernelAbstractions.jl'sget_backendwill be called, i.e., KernelAbstractions.jl has to know the type ofx.backend = trixi_backend(u_ode)Add a new argument
backendtotrixi_rhs_fctused for dispatch. Whenbackendisnothing, the legacy implementation should be used:function trixi_rhs_fct(backend::Nothing, mesh, equations, solver, cache, args) @unpack unpacked_args = cache @threaded for element in eachelement(solver, cache) rhs_fct_per_element(element, unpacked_args, args) end endWhen
backendis aBackend(a type defined by KernelAbstractions.jl), write a KernelAbstractions.jl kernel:function trixi_rhs_fct(backend::Backend, mesh, equations, solver, cache, args) nelements(solver, cache) == 0 && return nothing # return early when there are no elements @unpack unpacked_args = cache kernel! = rhs_fct_kernel!(backend) kernel!(unpacked_args, args, ndrange = nelements(solver, cache)) return nothing end @kernel function rhs_fct_kernel!(unpacked_args, args) element = @index(Global) rhs_fct_per_element(element, unpacked_args, args) end