React Fast at Scale

Process More at Scale

Data Ingest at Breakthrough Speeds

Edge compute and transport platforms such as 5G MEC, autonomous vehicles and predictive maintenance need to make decisions and react at real-world speeds. Our extended heterogeneous processing support delivers JSON to Apache Arrow ingest at 100x lower latency than scaling out Xeon processors alone.

For your AI+ML workloads, this means latency-sensitive algorithms can react 100x faster than with generic Pulsar brokers alone.

A baseline analysis of both latency and throughput TCO makes the benefits of adoption plain. Value scales significantly with advanced processor support, and unlocking this capability is straightforward. SigmaX supports certified PCIe add-in cards for customers who wish to augment their existing deployments, and offers fully qualified, curated servers for building clusters from scratch or extending existing capabilities.

Streaming Analytics

The Sigmax Stack takes a modern view of data as a stream of updates

Modern data, and more specifically the characteristics of modern data, are what motivate development of the Sigmax Stack. Over the past decade, both the quantity and quality of data have changed. We no longer deal with the static data sets for which the database-driven architectures of the past were built, and performing full table scans in response to client queries is the surest way to kill performance in systems running at scale. In the real world of today, data is dynamic, voluminous and complex, and it is constantly being updated. Further, our data needs to be immune from corruption and have its state reliably accessible from any point in the past. We need real-time access to newly ingested data, and we need to be able to query and stream data independent of its locality: Edge, Enterprise or Cloud.

These characteristics of modern data at scale inform and drive the development of the Sigmax Stack. Our stack is continuously enhanced and upgraded, and we offer expert support to our customers to ensure implementation success. The open source nature of the Sigmax Stack enables our customers to leverage the brilliance of the Apache community of open source developers, to avoid being locked in to a single proprietary vendor, and to have confidence in uninterrupted access.

Featured Stack Elements

Apache Arrow

Apache Arrow is an in-memory data format that is ideal for streaming analytics and big data systems. Arrow is the first piece in building an efficient, flexible and highly performant dataflow. Its columnar in-memory structure is cache efficient and allows for extremely fast query and processing in AI+ML and analytic environments.

Apache Arrow can be:

  • Written durably to storage
  • Transmitted over networks without the need for deserialization at the point of receipt, a key difference when compared with Avro or Parquet
  • Streamed in record batches of structured data rows
  • Read with zero copies

Arrow is an excellent data structure for supporting both distributed SQL query execution and dataflow connections to heterogeneous compute acceleration technologies (such as GPU and FPGA).

https://arrow.apache.org/
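
To make the columnar layout and record batching described above more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Arrow libraries: the to_record_batches helper and the vehicle_id and speed_kph fields are assumptions made up for this illustration. Each batch holds one contiguous typed array per column, which is the layout idea that Arrow standardizes.

```python
from array import array

def to_record_batches(rows, batch_size):
    """Group row dicts into batches, storing each batch column-wise.

    Hypothetical helper for illustration only; real Arrow record batches
    also carry a schema, validity bitmaps and aligned buffers.
    """
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batches.append({
            # One contiguous typed array per column.
            "vehicle_id": array("q", (r["vehicle_id"] for r in chunk)),
            "speed_kph": array("d", (r["speed_kph"] for r in chunk)),
        })
    return batches

rows = [{"vehicle_id": i, "speed_kph": 50.0 + i} for i in range(5)]
batches = to_record_batches(rows, batch_size=2)

# A column scan reads only the column it needs, batch by batch.
max_speed = max(max(b["speed_kph"]) for b in batches)
```

Because each column is stored contiguously, a query over speed_kph never has to read the vehicle_id values, which is the cache-efficiency argument made above.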

Apache Arrow Flight

Apache Arrow Flight is a new client-server framework built for high-performance transport of large datasets over network interfaces. It is architected to be general purpose (it can talk with any gRPC-capable client, with or without knowledge of the Apache Arrow format); however, when paired with Apache Arrow data, you get extreme efficiency in data movement.

Key features of Apache Arrow Flight include:

  • An on-the-wire representation of Apache Arrow data that doesn’t need to be de-serialized at its destination – Reclaims 60%+ of wasted CPU cycles
  • Language independence (11+ supported languages)
  • gRPC access to data on the wire, at the edge and in the cloud
  • Support for parallel synchronous data transfers to and from clusters of servers (scale-out)
  • Security built in (TLS) plus ability to further customize security profile
  • Data Compression
  • Data Location Awareness (is my data local?)
  • Interoperability with any gRPC capable client

With Apache Arrow Flight, large datasets are sent a batch of rows at a time (record batching), and data streams are sequences of record batches. The protocol acts essentially as a peer-to-peer connection in which get and put operations are both initiated by the client.
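
The client-initiated get and put pattern can be sketched with a toy in-process server. This is not the Arrow Flight API: ToyFlightServer, do_put, do_get and batched are hypothetical names chosen for this illustration, and a record batch is modelled as a plain list of rows.

```python
class ToyFlightServer:
    """In-memory stand-in for a server that stores streams of record batches."""

    def __init__(self):
        self._streams = {}

    def do_put(self, name, batches):
        # Client-initiated upload: consume the stream one batch at a time.
        stored = self._streams.setdefault(name, [])
        for batch in batches:
            stored.append(list(batch))

    def do_get(self, name):
        # Client-initiated download: yield batches lazily, in order.
        for batch in self._streams.get(name, []):
            yield batch

def batched(rows, size):
    """Split a list of rows into consecutive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

server = ToyFlightServer()
server.do_put("telemetry", batched(list(range(7)), size=3))
received = [row for batch in server.do_get("telemetry") for row in batch]
```

Both directions move data as a sequence of batches rather than as one large payload, so the receiver can start processing before the whole dataset has arrived.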

Apache Pulsar

Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It offers significant benefits when compared with other messaging platforms such as Apache Kafka:

  • Apache Pulsar has native support for data geo-replication
  • Significantly lower latency
  • Larger message sizes
  • Higher throughput
  • Streaming, Queuing and Pub/Sub native
  • Tiered persistent storage native
  • Linear horizontal scale-out

Pulsar is well suited to both the latency-sensitive requirements of new 5G wireless data infrastructures and the high-bandwidth requirements of high-complexity schemas.
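
The difference between pub/sub and queuing delivery listed above can be sketched with a toy in-memory topic. This does not use the Pulsar client library; ToyTopic and its methods are hypothetical. Each subscription receives every message, while consumers attached to the same subscription take turns, which loosely mirrors a shared subscription.

```python
from collections import defaultdict
from itertools import count

class ToyTopic:
    """Toy topic: fan-out across subscriptions, round-robin within one."""

    def __init__(self):
        self._subs = defaultdict(list)    # subscription name -> consumer inboxes
        self._turns = defaultdict(count)  # subscription name -> message counter

    def subscribe(self, subscription):
        inbox = []
        self._subs[subscription].append(inbox)
        return inbox

    def publish(self, message):
        for name, inboxes in self._subs.items():
            # Every subscription sees the message exactly once; within a
            # subscription, consumers receive messages in rotation.
            turn = next(self._turns[name])
            inboxes[turn % len(inboxes)].append(message)

topic = ToyTopic()
dashboard = topic.subscribe("dashboard")  # pub/sub: sees every message
worker_a = topic.subscribe("workers")     # queuing: shares the load
worker_b = topic.subscribe("workers")
for i in range(4):
    topic.publish(i)
```

After the loop, dashboard holds all four messages while worker_a and worker_b hold two each.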

Apache Presto

Apache Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Presto is not a database, however: it understands SQL, which provides the features of a standard database, but it can operate against a modern, distributed streaming analytics environment. Apache Presto is architected to perform well in big-data-scale environments and offers users of the Sigmax Stack a simple SQL interface to query Pulsar data.
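
As an illustration of the kind of SQL interface described above, the sketch below runs an aggregate query over a small telemetry table. Python's built-in sqlite3 module stands in for the query engine purely so that the example is runnable; the table name, columns and values are assumptions, and a real deployment would submit comparable SQL to Presto rather than to SQLite.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for a queryable topic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (vehicle_id INTEGER, speed_kph REAL)")
conn.executemany(
    "INSERT INTO telemetry VALUES (?, ?)",
    [(1, 80.0), (1, 120.0), (2, 60.0), (2, 70.0)],
)

# Which vehicles ever exceeded 100 km/h, and what was their top speed?
query = """
    SELECT vehicle_id, MAX(speed_kph) AS top_speed
    FROM telemetry
    GROUP BY vehicle_id
    HAVING MAX(speed_kph) > 100
    ORDER BY vehicle_id
"""
fast_vehicles = conn.execute(query).fetchall()
conn.close()
```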

Dataflow Example

Example: FPGA coprocessor accelerated Ingest

In this example, telematic data of a complex type is produced in many cars, trucks, trailers and other vehicles moving throughout the world. The interested party wishes to execute analytics based on a stream of updates flowing from a global base of registered vehicles. Data is generated within the vehicle on a CAN bus and is converted in situ to JSON, which acts as our source for data ingest.
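
As a sketch of what ingesting a complex JSON type involves, the snippet below flattens one nested telematics message into dotted column names, a common first step before loading data into a columnar format. The message layout and the flatten helper are assumptions for illustration; they are not the Sigmax schema or implementation.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical message as it might arrive from a vehicle.
message = (
    '{"vin": "TEST123", "gps": {"lat": 52.1, "lon": 4.3}, '
    '"engine": {"rpm": 2100}}'
)
record = flatten(json.loads(message))
```

Each dotted key, such as gps.lat, then maps naturally onto one column of the ingested stream.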

The Sigmax Stack includes FPGA hardware-assisted conversion to Apache Arrow. Converted data are pushed from the FPGA SmartNIC to an Apache Pulsar message broker. This broker has the benefit of being natively higher throughput than any other open source message broker. It further benefits from underlying Intel Optane support, which increases its maximum throughput and widens its recent-data processing window. The entire solution scales linearly with data ingest, and higher data throughput means less hardware and cost in your analytics pipeline. All of these elements tie neatly together, natively support persistent storage, and support aging data into very cost-effective cloud storage (Amazon S3, for example).

Pulsar clients can now answer fundamental questions with real-time data, such as: is the user driving dangerously? Have they been in an accident? Historic data retrieved via Pulsar can be used to train and improve AI+ML models. Supported client languages and environments include high-level favorites like Python, Go, R, MATLAB, Jupyter Notebooks and even JavaScript.
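
A question such as "is the user driving dangerously?" can be answered by a consumer that applies a rule to each update as it arrives. The sketch below flags harsh braking from consecutive speed samples. The one-second sample interval, the threshold and the function name are assumptions chosen for illustration, and a plain list stands in for the subscribed stream.

```python
def harsh_braking_events(speeds_kph, threshold_kph_per_s=15.0, dt_s=1.0):
    """Yield the index of each sample where speed drops faster than the threshold.

    speeds_kph is any iterable of speed samples taken dt_s seconds apart,
    so the function works on a live stream as well as on a list.
    """
    previous = None
    for index, speed in enumerate(speeds_kph):
        if previous is not None:
            deceleration = (previous - speed) / dt_s
            if deceleration > threshold_kph_per_s:
                yield index
        previous = speed

stream = [90, 88, 85, 60, 58, 30, 29]
events = list(harsh_braking_events(stream))
```

Because the function is a generator that keeps only the previous sample, it can run indefinitely over a stream of updates without buffering history.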

Complex vs Simple Schema

In the example above, we show the processing of a fairly complex schema, with key application concerns centered on maximizing throughput and access to both edge and historic data for various AI+ML pursuits. However, this isn’t the only use case that can benefit. Consider the opposite scenario, which is common to IoT and 5G networks: a relatively simple schema that needs ingest that can keep up with, and retain, the exceptionally low latency characteristics of the 5G wireless network.

Our next example places a spotlight on Apache Arrow Flight and how it can open a second real-time path from ingest to client. We add this functionality to the Apache Pulsar datapath described above. In this use case, Apache Pulsar also benefits from Intel Optane memory, which improves latency by as much as 10x. This is done without proprietary software, and therefore keeps customers free from vendor lock-in, leverages the cutting-edge developments of the open source community and provides an assurance of availability.

Example: Apache Arrow Flight Dataflow

Apache Arrow Flight complements the Arrow co-processor to Pulsar model and adds the ability to stream data ingested at a fog platform directly to a client application, independent of data and client locality. In the example dataflow above, we have a 5G signal loaded with IoT data to ingest. This IoT data is a perfect source for a message queue based analytics pipeline: it is a constantly streaming list of updates in which various clients have an interest. Clients can subscribe to their stream and topic of interest and receive data from Pulsar with very low latency, for two reasons. First, Pulsar is innately the lowest latency open source message broker right off the shelf; second, as mentioned, the right hardware can boost this even further. Pulsar also scales linearly as fog servers are added. It does not need to reside on a single server, but rather stripes across nodes to add bandwidth easily as requirements grow. As data ages, Apache Pulsar also manages moving data off fog local stores to either enterprise-class HDFS or cloud storage (for example, Amazon S3).

While Pulsar is great in this application space for its ability to produce data streams and manage durable long-term storage, in this setup the client has another, even lower latency option for listening to ingested data. The client can choose to request data (authenticate) from an Arrow Flight translation service living at the fog node. In our diagram, we show this authentication process taking place via gRPC and HTTP/2. Once authenticated, the client receives a ticket to listen, for a limited time, to authorized ingested data streaming directly into the fog node. This connection offers a direct, very low latency way to access data at the very point of ingest.
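
The authenticate-then-listen handshake can be sketched with a signed, time-limited ticket. This is an illustration built on Python's standard hmac module, not the Arrow Flight ticket mechanism; the secret, the ticket layout and the function names are all assumptions.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # assumption: a key known only to the fog node

def issue_ticket(client_id, stream, now, ttl_s=60):
    """Return a ticket granting access to `stream` until now + ttl_s seconds."""
    expires = int(now) + ttl_s
    payload = f"{client_id}|{stream}|{expires}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"

def ticket_is_valid(ticket, stream, now):
    """Check the signature, the requested stream and the expiry time."""
    try:
        client_id, granted, expires, signature = ticket.split("|")
    except ValueError:
        return False  # wrong number of fields
    payload = f"{client_id}|{granted}|{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(signature, expected)
        and granted == stream
        and now < int(expires)
    )

ticket = issue_ticket("client-42", "ingest/iot", now=1000)
```

With these values the ticket is accepted for the ingest/iot stream at time 1030 and rejected at time 1061, after its 60-second lifetime has passed.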

Pre-loaded Sigmax Curated Hardware

Sigmax also offers our customers tailored 2U server solutions configured to take full advantage of the Sigmax Stack. These servers come with software pre-installed, including the Sigmax Stack, Intel Quartus Prime Pro development tools, NVIDIA tools and the CentOS operating system. Hardware acceleration includes the Intel Arria 10 PAC for ingest and Arrow coercion, and Intel Optane paired with Xeon Silver processors to optimize Apache Pulsar performance.

We offer curated configurations for:

  • Edge and Fog
  • AI+ML
  • Enterprise HDFS Storage
  • HPC Compute Acceleration

Servers are sourced and assembled from Tier 1 manufacturers, then tested, validated and fully supported.
