Back in 2018, I attended the Databricks Spark & AI Summit in London. 2018 was an exciting year for Big Data technology: it saw the keynote announcement of MLflow, as well as some truly mind-blowing projects like Project Morpheus by Neo4j, which integrated openCypher and persistent graphs into Spark (and which I even tried to revive myself). Unfortunately, Morpheus was ultimately abandoned because it wasn't incorporated into Spark 3.0. (The SPIP proposal suggested that Cypher support would eventually evolve into Spark's GQL support.) The announcement of Delta Lake followed soon after the conference, and my team and I were early adopters of that as well.
One of the lectures that resonated with me at the conference was Autotrader’s presentation on their use of Spark for training models and the challenges they faced in exporting those models and features—trained on very large datasets—for low-latency inference. At the time, I was encountering similar challenges in my own work and experimented with approaches quite similar to those discussed by Autotrader.
Below is their full 2018 lecture.
Fun fact: At minute 22:42, you can hear me asking the lecturer about the production challenges in inference that we both faced.
To summarize: Auto Trader faced challenges deploying their “Days to Sell” machine learning model, which predicts how long a vehicle will take to sell based on various factors. Moving models from development to production was a slow and complex process that required tight collaboration between data scientists and developers. To overcome this, they adopted MLeap, a framework for serializing and deploying machine learning pipelines built with Spark (using, for example, Spring Boot as the runtime). MLeap allowed them to package their trained models into a portable format that could be used for real-time predictions without needing a full Spark cluster. Alongside this, they used Databricks notebooks for collaborative development and Kubernetes for scalable, efficient serving of predictions. This streamlined their workflow, enabling quicker and smoother deployment of their machine learning models into production.
Big Data Preprocessing and Low-Latency Serving in 2025
Fast forward seven years from that 2018 Spark & AI Summit:
MLeap, which Auto Trader used back then, has declined in popularity in recent years. It was even deprecated in MLflow (from version 2.6.0 onward, MLflow warns that mlflow.mleap.add_to_model is deprecated and to use mlflow.onnx instead). In addition, MLeap hasn’t been updated to support Spark 3.5, so many teams are migrating to other serialization formats like ONNX or PMML. This shift highlights just how fast our ecosystem changes, requiring us to keep an eye on new standards for easier model portability.

Yet the challenge of handling big data preprocessing, and then serving those models quickly, is still very relevant. Yes, the world today is buzzing about LLMs and advanced neural networks, but classic “tabular data” pipelines remain a huge part of day-to-day machine learning in most businesses. Dr. Amitai Armon's paper (with Ravid Shwartz-Ziv), “Tabular Data: Deep Learning Is Not All You Need,” reminds us that simpler approaches are sometimes more cost-effective, easier to maintain, and sufficiently accurate (XGBoost, I'm looking at you).
Introduction to MLflow
Before diving into the two methodologies, let’s quickly look at MLflow, an open-source platform that helps track the entire machine learning lifecycle. MLflow lets you:
- Track experiments: Log parameters, metrics, artifacts (like plots), and the model itself.
- Version your models: Store them in a centralized registry.
- Package for deployment: Export models to various “flavors,” including Python functions, Docker images, and more.
While MLflow can integrate with Spark, scikit-learn, TensorFlow, and many other frameworks, we’ll primarily emphasize Spark-based pipelines here. You can log your pipeline after training, manage versions, and later pick the format you need for deployment.
Methodology A: Packaging the Model for a REST Endpoint
When we talk about packaging the model for a RESTful service, we’re essentially turning the trained pipeline (including transformations and the final model) into an artifact that can be queried over HTTP. This approach is popular because:
- It’s language-agnostic: any client that speaks HTTP/JSON can send data and receive predictions.
- It scales well with container orchestration platforms (Kubernetes, Docker Swarm, etc.).
- It’s straightforward to integrate and monitor alongside other microservices.
1. Logging the Model in MLflow
Let’s assume you have a Spark ML pipeline that’s already been trained. A first thought might be to log it with MLflow’s Spark flavor (shown here with the MLflow Python API):
import mlflow
import mlflow.spark

# `pipeline_model` is your trained pyspark.ml PipelineModel
pipeline_model = ...

with mlflow.start_run(run_name="my_spark_pipeline_run"):
    # Log the model as a Spark flavor
    mlflow.spark.log_model(
        pipeline_model,
        artifact_path="spark-model",
    )
However, be aware that logging a Spark flavor model typically requires a Spark session at inference. If you try to serve it using mlflow models serve in a simple environment (without Spark), you’ll likely face ClassNotFoundException or other classpath issues.
Workarounds to Enable a REST Service
- Run Spark at Inference:
- If you truly do want to keep Spark around for inference (e.g., in a micro-batch or structured streaming job), you can load the same Spark model in a Spark environment. But that’s not a traditional “lightweight REST” pattern.
- Convert to a PyFunc or ONNX:
- Instead of storing the raw Spark flavor, you can extract parameters and rebuild your pipeline in scikit-learn (or directly convert to ONNX/PMML).
- Then log that model as a pyfunc or onnx flavor in MLflow, which you can serve with mlflow models serve -m ... or in a container-based REST microservice, with no Spark needed at runtime.
- (Older) MLeap Approach:
- Historically, MLeap allowed Spark pipelines to be served without a full Spark cluster, but it’s no longer updated for newer Spark versions and is now deprecated in MLflow.
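The “rebuild in scikit-learn” bridge mentioned above can be as simple as copying the fitted statistics out of a Spark stage and replaying them with NumPy. Here is a minimal sketch for a standardization step, assuming the mean and std vectors were already extracted from a fitted Spark StandardScalerModel (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical statistics extracted from a fitted Spark StandardScalerModel
# (illustrative values, not from a real pipeline)
mean = np.array([10.0, 0.5])
std = np.array([2.0, 0.1])

def standardize(x: np.ndarray) -> np.ndarray:
    # Mirrors Spark's StandardScaler with withMean=True, withStd=True
    return (x - mean) / std

row = np.array([[12.0, 0.6]])
print(standardize(row))  # [[1. 1.]]
```

The same idea extends to PCA loadings, one-hot category lists, and model coefficients: once the numbers are out of the JVM, the pipeline can be replayed in any runtime.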
2. Packaging as a REST Service
Assume you’ve converted your Spark pipeline into a scikit-learn or ONNX pipeline. A simple approach for containerized REST serving could look like this (in Python with Flask):
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)

# Load the ONNX model once at startup
session = ort.InferenceSession("my_spark_pipeline.onnx")
input_name = session.get_inputs()[0].name

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["data"]  # shape: [n_samples, n_features]
    arr = np.array(data, dtype=np.float32)
    preds = session.run(None, {input_name: arr})[0].tolist()
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Containerizing this is straightforward:
FROM python:3.9-slim
RUN pip install onnxruntime flask numpy
COPY my_spark_pipeline.onnx /app/
COPY app.py /app/
WORKDIR /app
EXPOSE 5000
CMD ["python", "app.py"]
Then docker build -t spark-onnx-serve . and docker run -p 5000:5000 spark-onnx-serve. Now your Spark-trained (but ONNX-exported) pipeline is a REST microservice.
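On the client side, any HTTP library works: the service expects a JSON body with a "data" field holding [n_samples, n_features] rows. A quick sketch of building that payload with only the standard library (the URL and feature values here are illustrative, not from a real deployment):

```python
import json

# Two sample rows with three features each: shape [n_samples, n_features]
payload = {"data": [[0.5, 1.2, 3.4], [0.7, 0.9, 2.1]]}
body = json.dumps(payload)
print(body)

# Then POST it, e.g. with the standard library:
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:5000/predict",
#       data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"},
#   )
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["predictions"])
```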
3. REST Pros and Cons
- Pros:
- Easy to integrate with many client apps.
- Scales horizontally behind a load balancer.
- Cons:
- Adds network hops for each prediction request.
- Requires you to keep track of versions if you do frequent releases (though MLflow model registry helps).
Methodology B: Low Latency Without REST
Now let’s consider a scenario where you might want to avoid a REST approach—often because you need even lower latency or have continuous data ingestion through streaming.
1. Integrating the Model Into a Streaming System
In cases where you’re pulling data from Kafka or another high-velocity source, it can make sense to embed the Spark ML pipeline right in a Structured Streaming job. This way:
- You skip the extra overhead of JSON/HTTP calls.
- Predictions are done as soon as new data arrives in the stream.
- Spark can handle micro-batch or near-real-time intervals for you.
Example (simplified Scala code):
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.PipelineModel

object StreamingInferenceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LowLatencyInference")
      .getOrCreate()

    val pipelineModel = PipelineModel.load("s3://my-models/spark-pipeline")

    // Read streaming data from Kafka
    val inputStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "my_input_topic")
      .load()

    val parsedStream = inputStream
      .selectExpr("CAST(value AS STRING) as message")
      // parse your 'message' into columns like col1, col2, col3, etc.

    // Apply the pipeline
    val predictions = pipelineModel.transform(parsedStream)

    // Write predictions back to a sink (console, Kafka, DB, etc.)
    val query = predictions.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
Here, you never spin up a separate microservice; you’re just scoring in-flight events directly in Spark.
2. Pros and Cons of the No-REST Approach
- Pros:
- Minimal overhead—no separate HTTP calls.
- Fully integrated with Spark data ingestion (Kafka, Delta Lake, etc.).
- Potentially simpler operational flow if your data platform is all Spark-based.
- Cons:
- You need a Spark environment for inference (possibly expensive or complex to maintain).
- Harder to let external applications call the model on-demand.
Many production systems end up combining these approaches—Spark streaming for real-time data, plus a REST endpoint for user-facing or ad-hoc predictions.
Summarizing Both Methodologies
- Packaging for REST:
- Common in microservice architectures.
- Often done with ONNX or a scikit-learn “pyfunc” flavor to avoid needing Spark at inference.
- Containerization with Docker + a lightweight server (Flask, ONNX Runtime, or NVIDIA Triton) is straightforward.
- Low Latency Without REST:
- Embeds your model in Spark (or a similar engine like Flink).
- Eliminates network hops for real-time data streams.
- Best suited for continuous ingestion from Kafka, event hubs, or IoT streams.
Digging Deeper: Serialization and Serving Challenges
A. Direct Spark-to-ONNX vs. Manual Extraction
- Direct:
- The onnxmltools.convert_sparkml method can automatically export Spark ML pipelines to ONNX, provided your pipeline uses only supported transformations and models.
- Quick and easy: no manual copying of parameters.
- May fail or be incomplete if you have advanced/unsupported Spark stages.
- Manual:
- You extract the learned parameters (means, standard deviations, PCA loadings, etc.) from the Spark pipeline in Scala/Java.
- Rebuild an equivalent scikit-learn pipeline in Python.
- Convert that pipeline to ONNX with skl2onnx.
- This is more labor-intensive, but guarantees you can replicate any custom or cutting-edge transformation that the direct method might not support.
B. Spark Flavor in MLflow vs. PyFunc
- Spark Flavor:
- Easiest to log (just mlflow.spark.log_model), but typically requires a Spark environment to load at inference.
- Not ideal if you want a small, self-contained REST server.
- PyFunc / ONNX Flavor:
- Runs anywhere with Python or ONNX Runtime, no Spark needed.
- Usually requires a “bridge” step (parameter extraction or direct ONNX conversion).
C. Containerization Tips
- Keep your ONNX model in the container.
- Use ONNX Runtime or NVIDIA Triton for high-performance inference.
- For more advanced orchestration, consider a model server like KServe (Kubernetes) or MLflow’s built-in model serving (pyfunc).
- Validate that predictions match your Spark pipeline’s output on a test subset before deploying.
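That last validation step is worth automating: run the same held-out rows through both the Spark pipeline and the exported model, then compare the outputs numerically rather than for exact equality. A minimal sketch, with made-up prediction vectors standing in for the two outputs:

```python
import numpy as np

# Stand-ins for predictions from the Spark pipeline and its ONNX export
spark_preds = np.array([12.1, 8.7, 30.2])
onnx_preds = np.array([12.1000002, 8.6999998, 30.2])

# Tolerate tiny floating-point drift between the JVM and ONNX Runtime
if np.allclose(spark_preds, onnx_preds, rtol=1e-5, atol=1e-6):
    print("parity check passed")
else:
    diffs = np.abs(spark_preds - onnx_preds)
    raise AssertionError(f"max deviation {diffs.max()} exceeds tolerance")
```

Exact bit-for-bit equality is too strict a bar here: the two runtimes do floating-point math in different orders, so small tolerances are expected and harmless.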
Conclusion
We’ve seen two main ways to integrate Spark-based ML pipelines into production:
- REST Approach:
- Convert the Spark pipeline to ONNX/pyfunc and serve it in a container (Flask, Triton, or MLflow serving).
- Scales horizontally, easy to manage with standard DevOps.
- Embedding in Spark:
- Attach the pipeline to a streaming or micro-batch job.
- Achieve lower latency and skip overhead for continuous data flows.
Either path can be combined or tailored depending on your application’s requirements. If you’re looking for a more universal format than MLeap (now deprecated in MLflow), ONNX is a leading choice, as it’s widely supported and actively maintained.
We mainly focused on Spark-based approaches, given that Spark remains the most widely used big data processing engine in many data and ML teams who rely on open-source technologies. That said, there are other robust frameworks and methodologies worth exploring—such as Kubeflow, Ray Data with Ray Serve, and open-source feature stores (like Feast)—which can also streamline large-scale data processing and machine learning pipelines. Ultimately, the choice of tooling depends on your team’s skill set, infrastructure, and project needs.
Discover more from Avichay Marciano