[Image: Depiction of the bidirectional flow of data from Javascript to C++ and vice versa.]

Performance-Driven Development of Cross-Runtime Node-API Native Addons (Part 1)

By: James Deal

Est. 6 min read

With the relatively recent release of the Bun Javascript runtime, I’d been inspired to start working on a new Javascript Machine Learning library targeting Bun (read: a viable alternative to Tensorflow.js). Taking a page out of Jarred Sumner’s book, I naïvely decided to write the library from scratch in Zig and started hacking away. Now this wasn’t inherently a bad idea, but as a self-taught developer, I had definitely bitten off a bit more than I could chew.

Fortunately, I discovered shumai, a fast, network-connected, differentiable tensor library for TypeScript/Javascript, which was in its infancy at the time, and was able to connect with @bwasti and @jacobkahn to help contribute to the project.

Fast forward a few months and shumai is more fully featured/production-ready (at least, to the same extent that Bun is production-ready), but it currently only supports Bun, as the present implementation relies on Bun’s Foreign Function Interface (FFI) to call into shumai’s native bindings.

Check out the current implementation of shumai here.

Given our longer term goal of adding support for all popular Javascript runtimes (namely, Node.js and Deno) to shumai, we set our sights on implementing shumai bindings via Node-API, which is supported by Node.js, Bun, and Deno. This is the first in a series of posts on our journey to implement shumai bindings via Node-API; we’ll detail our approach to performance-driven development of cross-runtime Node-API native addons and attempt to provide some general “best practices” for optimizing the execution time and memory usage of Node-API native addons.

Identifying Areas to Optimize

While Bun’s FFI API is incredibly low overhead, we’ll still want to profile the code to look for low-hanging fruit: areas where we can optimize the logic. This will give us a good idea of where we can best leverage Node-API to improve performance.

Using Bun’s startSamplingProfiler (exported by bun:jsc), it’s straightforward to profile a simple loop and look for any performance issues:

// bun run toArrayBuffer/index.ts
import * as sm from '@shumai/shumai';
import { startSamplingProfiler } from 'bun:jsc';

startSamplingProfiler('toArrayBuffer');
const array1 = new Float32Array(128);
const array2 = new Float32Array(128);
for (let i = 0; i < 128; ++i) {
	array1[i] = Math.random();
	array2[i] = Math.random();
}
const a = new sm.Tensor(array1);
const b = new sm.Tensor(array2);

let res: Float32Array;
for (let i = 0; i < 100000; ++i) {
	const c = a.add(b);
	res = c.toFloat32Array();
}

The output indicates that the most expensive calls are:

  • _float32Buffer, which returns the underlying buffer backing the Tensor.
  • toArrayBuffer, which is called internally by toFloat32Array and is used to convert a pointer to an ArrayBuffer:
Sampling rate: 1000.000000 microseconds. Total samples: 912
Top functions as <numSamples  'functionName#hash:sourceID'>
   729    '_float32Buffer#<nil>:4294967295'
    43    'toArrayBuffer#<nil>:4294967295'
    39    '_add#<nil>:4294967295'
    39    'wrapFLTensor#<nil>:5'
    // ...

Notably, the calls to the toArrayBuffer method exported by Bun’s bun:ffi module seem potentially avoidable if we were to leverage Node-API (NAPI). This is consistent with conversations with Jarred Sumner on Bun’s Discord regarding optimizing the performance of Bun’s FFI/NAPI implementations; per Jarred, ”toArrayBuffer() is a cycle through JS & native… You should use NAPI for that.”
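To make that concrete, here’s a minimal, hypothetical sketch (the function name and signature are ours for illustration, not shumai’s actual bindings) of how Node-API lets native code hand a typed array view over existing memory straight back to Javascript, sidestepping toArrayBuffer entirely:

// Hypothetical illustration: wrap an existing native buffer as a Float32Array
// without copying. Lifetime management (e.g. a finalizer tied to the owning
// fl::Tensor) is deliberately omitted here.
Napi::Value WrapBufferAsFloat32Array(Napi::Env env, float* data, size_t length) {
  Napi::ArrayBuffer buf =
      Napi::ArrayBuffer::New(env, data, length * sizeof(float));
  return Napi::Float32Array::New(env, length, buf, 0);
}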

Given the above, the largest performance gains will likely be realized by leveraging Node-API when initializing a new Tensor from an existing Javascript TypedArray and when returning a Tensor’s underlying buffer as a Javascript TypedArray (both rely on calls to toArrayBuffer in the current implementation).

“Wrapping” Native Objects for Javascript

N.B. Given that shumai’s backend is built with Flashlight, which is written in C++, we’ll be leveraging the node-addon-api module to write our Node-API native addons. Additionally, given that building shumai’s FFI bindings is done via CMake, we’ll be employing CMake.js to build the Node-API native addons.

If you’re anything like me, you’ll briefly research how to wrap a native object using NAPI and quickly assume that best practices dictate using Napi::ObjectWrap to expose a wrapper class containing a field that holds a pointer to the native object we’re attempting to expose to Javascript; something along these lines:

class Tensor : public Napi::ObjectWrap<Tensor> {
 public:
  Tensor(const Napi::CallbackInfo&);
  static Napi::FunctionReference* constructor;
  // stores pointer to the native `fl::Tensor` being wrapped
  fl::Tensor* _tensor;
  // class methods, `Finalize`, etc. omitted for brevity...
};
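For context, the ObjectWrap route also entails registering the class and its methods with the runtime, along these lines (a hypothetical sketch; the method name toFloat32Array is illustrative, not shumai’s actual binding):

// Hypothetical sketch of the additional boilerplate `Napi::ObjectWrap` entails:
// defining the class, keeping a persistent constructor reference, and
// registering each instance method with the runtime.
Napi::Object Tensor::Init(Napi::Env env, Napi::Object exports) {
  Napi::Function func = DefineClass(env, "Tensor", {
      // illustrative method registration, not shumai's actual API surface
      InstanceMethod("toFloat32Array", &Tensor::ToFloat32Array),
  });
  constructor = new Napi::FunctionReference();
  *constructor = Napi::Persistent(func);
  exports.Set("Tensor", func);
  return exports;
}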

While this design works, it’s overkill for our use case: we ultimately only need to expose a pointer to the native fl::Tensor to Javascript and won’t be making use of any of the features unique to Napi::ObjectWrap. Instead, we can simplify the logic by directly returning a Napi::External<fl::Tensor>, which creates a Napi::Value object that carries arbitrary C++ data:

  // copy-construct a heap-allocated `fl::Tensor` owned by the addon
  auto* tensor = new fl::Tensor(t);
  // `DeleteTensor` is the finalizer invoked when the wrapping JS value is garbage collected
  Napi::External<fl::Tensor> wrapped =
      Napi::External<fl::Tensor>::New(env, tensor, DeleteTensor);
  return wrapped;
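The finalizer passed as the third argument is what reclaims the native memory once the Javascript value is collected; a minimal sketch (shumai’s actual finalizer may differ, e.g. in how it updates the external-memory accounting shown later) would look like:

// Minimal sketch of a finalizer for the `Napi::External<fl::Tensor>` above;
// it runs when the JS value is garbage collected.
static void DeleteTensor(Napi::Env env, fl::Tensor* tensor) {
  // assumption: mirror the `AdjustExternalMemory` accounting done at construction
  Napi::MemoryManagement::AdjustExternalMemory(
      env, -static_cast<int64_t>(tensor->bytes()));
  delete tensor;
}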

TODO: ADD BENCHMARKS DEMONSTRATING PERFORMANCE GAINS

Constructors

Given that the most common means of initializing a new instance of the Tensor class exported by shumai is by passing a Javascript TypedArray, we’ll need to implement functions that handle initializing a native fl::Tensor object from a Javascript TypedArray. Here’s the initial implementation of _tensorFromFloat64Array, which takes a Float64Array as an argument and returns a reference to the newly initialized fl::Tensor:

static Napi::Value _tensorFromFloat64Array(const Napi::CallbackInfo &info)
{
  Napi::Env env = info.Env();
  if (!info[0].IsTypedArray())
  {
    Napi::Error::New(env,
                     "`tensorFromFloat64Array` epects args[0] to be "
                     "instanceof `Float64Array`")
        .ThrowAsJavaScriptException();
    return env.Null();
  }
  Napi::TypedArray _tmp_typed_array = info[0].As<Napi::TypedArray>();
  if (_tmp_typed_array.TypedArrayType() != napi_float64_array)
  {
    Napi::Error::New(env,
                     "`tensorFromFloat64Array` epects args[0] to be "
                     "instanceof `Float64Array`")
        .ThrowAsJavaScriptException();
    return env.Null();
  }
  // relevant logic omitted for brevity (see next snippet)...
}

With an eye for minor optimizations at each step of the process, we’ll consider the ultimate usage of this function. Given that _tensorFromFloat64Array is only called internally in the Javascript Tensor class constructor, and that the constructor accepts arguments of types other than TypedArray, we already need to check argument types in the Javascript Tensor class constructor; this means the type checks in _tensorFromFloat64Array are, in fact, redundant and can be removed. The resulting logic is cleaner and runs slightly faster after removing the duplicated type checks:

static Napi::Value _tensorFromFloat64Array(const Napi::CallbackInfo &info)
{
  Napi::Env env = info.Env();
  Napi::TypedArray _tmp_typed_array = info[0].As<Napi::TypedArray>();
  int64_t length = static_cast<int64_t>(_tmp_typed_array.ElementLength());
  double *ptr =
      _tmp_typed_array.As<Napi::TypedArrayOf<double>>().Data();
  auto *t = new fl::Tensor(
      fl::Tensor::fromBuffer({length}, ptr, fl::MemoryLocation::Host));
  auto _out_bytes_used = static_cast<int64_t>(t->bytes());
  g_bytes_used += _out_bytes_used;
  Napi::MemoryManagement::AdjustExternalMemory(env, _out_bytes_used);
  Napi::External<fl::Tensor> wrapped = ExternalizeTensor(env, t);
  return wrapped;
}
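For completeness, functions like _tensorFromFloat64Array still need to be registered as module exports so Javascript can call them; a hedged sketch of that wiring (the export and module names here are illustrative, not necessarily shumai’s) looks like:

// Hypothetical sketch of exposing the native functions to Javascript;
// `Init` runs once when the addon is loaded by the runtime.
static Napi::Object Init(Napi::Env env, Napi::Object exports) {
  exports.Set("_tensorFromFloat64Array",
              Napi::Function::New(env, _tensorFromFloat64Array));
  // additional TypedArray constructors and tensor ops would be registered here
  return exports;
}

NODE_API_MODULE(flashlight_napi, Init)  // illustrative module name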

TODO: ADD BENCHMARKING/GRAPHS/CHARTS

Future Areas to Explore

At least in the Bun runtime, it’s feasible to use a hybrid approach where Node-API and Bun’s FFI are used in tandem to optimize performance. For example, we could use Node-API to initialize a new Tensor and to return the underlying data backing a Tensor as a Javascript TypedArray, while leveraging Bun’s FFI for Tensor operations. Such an approach allows us to escape the bottlenecks of toArrayBuffer while simultaneously getting the performance benefits provided by Bun FFI “doing as much as possible inline directly.” (source: Jarred Sumner)

Shoutouts

@Jarred-Sumner - Bun has been a joy to work with; thanks for being accessible in Bun's Discord to field questions and point us in the right direction, for quick bug fixes (#1733, #1739, #1808), for being quick to add features, and for being generally awesome!

@bwasti & @jacobkahn - Thanks a ton for helping me get up to speed working on shumai and for greatly accelerating my C++ learning curve by pushing me to work on logic a bit more outside of my comfort zone.

Copyright © 2023 - All rights reserved by James Deal