Apache Arrow with Flight IPC
In terms of industry support, Apache Arrow is currently the de facto standard for in-memory data processing and has been adopted by many modern data science frameworks.
Besides processing data, such as transforming and cleaning it, the data of course also has to be queried and transferred to a frontend for visualization. The established approach was often to aggregate the data first and import it back into a traditional warehouse, but this pattern no longer works for large datasets that need to be queried and visualized at higher granularity.
With Apache Spark, Spark-based accelerators like RAPIDS, or Apache Druid, there are multiple industry-standard solutions for querying big data directly. Recently, more modern and faster alternatives have also been gaining popularity: Apache DataFusion offers a very modern approach and a great out-of-the-box experience for task execution and for querying big data through familiar SQL-like interfaces. DataFusion also supports transferring data directly via Arrow IPC, which is an approach I really like.
So there are plenty of ready-made options available. But if you have a limited number of datasets and a fixed number of access patterns, implementing an IPC server yourself can be a real alternative. This gives you full control over query execution and can also give you very high transfer rates. I would not recommend it for enterprise-style use cases, but for smaller projects it is a straightforward way to build a modern backend with quite a small team.
In essence, just two gRPC methods are needed to retrieve datasets from the service: GetFlightInfo and DoGet. The framework already provides the stubs needed to implement the service.
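Before looking at the real thing, the two-step flow can be modeled without Arrow at all. The following is only a minimal sketch using the standard library: FixedQueryService, FlightInfoSketch, Rows and the "ticket:" prefix are hypothetical names made up for illustration and are not part of the Arrow Flight API. It shows how a fixed set of access patterns can sit behind GetFlightInfo (validate the command, hand out a ticket) and DoGet (redeem the ticket, return the data).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a stream of record batches. A real server would hand out
// Arrow RecordBatches instead of plain integers.
using Rows = std::vector<int>;

// Hypothetical, simplified counterpart of the info a client gets in step 1.
struct FlightInfoSketch {
  std::string ticket;    // opaque token the client passes back to DoGet
  std::size_t num_rows;  // metadata the client can inspect before fetching
};

class FixedQueryService {
 public:
  // Register one supported access pattern under a command name.
  void Register(const std::string& command, std::function<Rows()> query) {
    assert(query && "query must be callable");
    queries_[command] = std::move(query);
  }

  // Step 1: validate the command and describe the result.
  // For simplicity this runs the query just to count rows; a real server
  // would more likely answer from cached metadata.
  std::optional<FlightInfoSketch> GetFlightInfo(const std::string& command) const {
    auto it = queries_.find(command);
    if (it == queries_.end()) return std::nullopt;  // unknown access pattern
    return FlightInfoSketch{"ticket:" + command, it->second().size()};
  }

  // Step 2: redeem the ticket and return the data.
  std::optional<Rows> DoGet(const std::string& ticket) const {
    const std::string prefix = "ticket:";
    if (ticket.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    auto it = queries_.find(ticket.substr(prefix.size()));
    if (it == queries_.end()) return std::nullopt;
    return it->second();
  }

 private:
  std::map<std::string, std::function<Rows()>> queries_;
};
```

A client of this sketch would first call GetFlightInfo with a command name, check that it got a value back, and then pass the returned ticket to DoGet. Because only registered commands are accepted, the server never executes arbitrary queries, which is the kind of control a hand-written server gives you.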
To show you some sample code, mainly taken from the Apache Arrow documentation, here is how quickly you can implement the two methods and have an almost working server.
// Both methods override virtual functions of arrow::flight::FlightServerBase.
// MakeFlightInfo and CreateRecordBatchReader are helpers not shown in this excerpt.

// Step 1: describe the dataset that matches the descriptor's command.
arrow::Status GetFlightInfo(const arrow::flight::ServerCallContext&,
                            const arrow::flight::FlightDescriptor& descriptor,
                            std::unique_ptr<arrow::flight::FlightInfo>* info) override {
  ARROW_ASSIGN_OR_RAISE(auto flight_info, MakeFlightInfo(descriptor.cmd));
  *info = std::make_unique<arrow::flight::FlightInfo>(std::move(flight_info));
  return arrow::Status::OK();
}

// Step 2: stream the record batches back to the client.
// Note that this excerpt ignores the ticket and always streams the same reader.
arrow::Status DoGet(const arrow::flight::ServerCallContext&,
                    const arrow::flight::Ticket& request,
                    std::unique_ptr<arrow::flight::FlightDataStream>* stream) override {
  ARROW_ASSIGN_OR_RAISE(auto reader, CreateRecordBatchReader());
  *stream = std::make_unique<arrow::flight::RecordBatchStream>(reader);
  return arrow::Status::OK();
}
This is of course just an excerpt to get you started. You can find the full sample code, including a client, in my GitHub repository: https://github.com/matt-do-it/ArrowAcero