What Happened

Alibaba Cloud's AI platform PAI, using the open-source DataJuicer data processing framework and the Paimon streaming data lakehouse, processed 2 million video files totaling 30,000 hours of footage in approximately 200 minutes, according to a technical post published on Juejin. The distributed job ran on 45 nodes each equipped with 8 NVIDIA 5090 GPUs and 180 CPU cores, producing over 10 million video clips as output. A smaller validation run processed 6,865 video files (52.4 hours total) on a single L20 8-GPU node in roughly 20 minutes, yielding approximately 17,000 video slices.

The benchmark used the publicly available Youku-AliceMind caption validation dataset sourced from ModelScope. The workload was launched via PAI's DLC (Deep Learning Containers) distributed job scheduler.

Why It Matters

Training-ready multimodal datasets are the primary bottleneck for teams scaling video-language models. The pipeline described here automates the full preprocessing chain — from raw ingestion to captioned video-text pairs — and demonstrates linear scaling from single-node to 45-node clusters without architectural changes. For engineering teams building video foundation models or fine-tuning multimodal LLMs, the throughput figures (roughly 220 video files per minute per node at scale, i.e. 2 million files over 200 minutes across 45 nodes) provide a concrete baseline for infrastructure planning.
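For readers doing their own capacity planning, the per-node throughput follows directly from the run sizes quoted above (a quick derivation using only the figures in this article):

```python
# Derive per-node throughput from the figures quoted in the source post.

# Large-scale run: 2 million files in ~200 minutes on 45 nodes.
large_files, large_minutes, large_nodes = 2_000_000, 200, 45
large_rate = large_files / large_minutes / large_nodes  # files/min/node

# Single-node validation run: 6,865 files in ~20 minutes on one node.
small_files, small_minutes = 6_865, 20
small_rate = small_files / small_minutes  # files/min/node

print(f"large run:  {large_rate:.0f} files/min/node")   # ~222
print(f"validation: {small_rate:.0f} files/min/node")   # ~343
```

The single-node run is somewhat faster per node, which is expected: the large run pays coordination and scheduling overhead, but stays within the same order of magnitude — consistent with the near-linear scaling claim.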

The integration of Paimon as the storage layer is also notable. Paimon, an Apache-incubating streaming lakehouse format, is positioned here as a unified store for heterogeneous binary assets including raw video blobs alongside structured metadata. This architecture sidesteps the common pain point of maintaining separate object storage and metadata databases during large-scale data curation runs.

The choice to offer both Ray and Daft as compute backends signals awareness of the emerging competition between these frameworks for ML data engineering workloads. Daft, a newer distributed DataFrame library optimized for multimodal data, is gaining traction as an alternative to Ray Data for teams that find Ray's scheduling overhead expensive on GPU clusters.
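In DataJuicer the compute backend is typically selected in the job config rather than in application code. A minimal sketch follows — the key names track DataJuicer's documented config conventions, but verify them against the installed version; in particular, the Daft option is newer, and how it is keyed in the config is an assumption here:

```yaml
# Hypothetical backend toggle in a DataJuicer job config.
project_name: 'video-curation'
dataset_path: './videos.jsonl'
executor_type: 'ray'     # standalone vs. ray; a Daft backend would be
                         # selected the same way where available
ray_address: 'auto'      # attach to an existing Ray cluster
```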

The Technical Detail

The 7-stage pipeline is implemented using DataJuicer's operator composition model and runs as follows:

  • Scene segmentation: ContentDetector algorithm splits video at frames where inter-frame difference exceeds a configurable threshold, producing semantically coherent clips.
  • Duration filtering: Clips outside acceptable length bounds are discarded before heavier compute steps.
  • Frame extraction: Uniform sampling or keyframe detection extracts representative frames; invalid or corrupted frames are dropped inline.
  • NSFW content filtering: A dedicated NSFW detection model scores each clip based on sampled keyframes; clips exceeding a safety score threshold are removed.
  • Motion scoring: Clips are sampled at 2 frames per second to compute a motion intensity score; near-static clips below a configurable threshold are filtered out.
  • Aesthetic scoring: An aesthetic quality model scores uniformly sampled frames; only clips with average aesthetic scores within a target range are retained.
  • Caption generation: Surviving clips are passed to a multimodal model (Video BLIP is cited as an example) to generate text descriptions, producing video-text pairs ready for model training.
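The stages above map naturally onto DataJuicer's declarative operator list. The condensed config below is a sketch: the operator and parameter names follow DataJuicer's published operator catalog but should be checked against the version you install, and every threshold is a placeholder, not a value from the source post. Note that frame extraction is not a standalone entry here because the filter operators sample frames internally:

```yaml
process:
  - video_split_by_scene_mapper:        # 1. scene segmentation
      detector: 'ContentDetector'
      threshold: 27.0
  - video_duration_filter:              # 2. duration bounds
      min_duration: 2
      max_duration: 30
  - video_nsfw_filter:                  # 3/4. frame sampling + safety score
      score_threshold: 0.5
      frame_sampling_method: 'uniform'
  - video_motion_score_filter:          # 5. drop near-static clips
      min_score: 0.25
      sampling_fps: 2
  - video_aesthetics_filter:            # 6. aesthetic score range
      min_score: 0.4
      max_score: 1.0
  - video_captioning_from_video_mapper: # 7. produce video-text pairs
      hf_video_blip: 'kpyu/video-blip-opt-2.7b-ego4d'
```

Ordering matters operationally: the cheap filters (duration, motion) run before the GPU-heavy scoring and captioning stages, so discarded clips never touch a model.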

The storage integration uses the pypaimon Python client. Video files are referenced via BlobDescriptor objects that store the OSS URI, offset, and file size — allowing Paimon to manage blob metadata without copying raw binary data into the lakehouse storage layer. Read paths use PyArrow via table_read.to_arrow(splits), keeping the pipeline compatible with standard ML data tooling.
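The descriptor pattern itself is easy to illustrate. The sketch below is a plain-Python stand-in for the idea — it is not the pypaimon BlobDescriptor API — showing how a table row can carry only a URI, byte offset, and length while readers fetch the range on demand:

```python
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobRef:
    """Stand-in for a blob descriptor: locates bytes without copying them."""
    uri: str      # e.g. an OSS object URI
    offset: int   # byte offset of the blob within the object
    length: int   # blob size in bytes

def read_blob(store: io.BufferedIOBase, ref: BlobRef) -> bytes:
    """Fetch only the referenced byte range (a ranged GET against OSS
    in the real pipeline; a seek+read here)."""
    store.seek(ref.offset)
    return store.read(ref.length)

# Demo with an in-memory "object store" holding two concatenated blobs.
payload = b"HEADER" + b"<video-bytes-1>" + b"<video-bytes-2>"
ref = BlobRef(uri="oss://bucket/videos.bin", offset=6, length=15)
print(read_blob(io.BytesIO(payload), ref))  # b'<video-bytes-1>'
```

The metadata database and the bytes never diverge because there is only one source of truth: the table row points into the object store rather than duplicating it.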

Installation requires two packages:

pip install py-data-juicer
pip install pypaimon

GPU utilization curves shown in the source post indicate sustained high utilization across all 8 GPUs during the large-scale run, with the Ray dashboard confirming distributed task distribution across the 45-node cluster.

What To Watch

  • DataJuicer adoption outside Alibaba: The framework is open source and the Daft backend option suggests the team is tracking adoption by non-Alibaba infrastructure users. Watch for community PRs adding support for additional storage backends beyond Paimon and OSS in the next 30 days.
  • Daft vs. Ray Data benchmarks: The post offers both backends but does not publish a head-to-head performance comparison. Independent benchmarks on GPU-heavy multimodal pipelines would clarify which framework wins on throughput vs. operational complexity.
  • NVIDIA 5090 availability: The large-scale test ran on 45 nodes of 5090s — a GPU not yet in wide cloud availability. Watch for PAI or competing clouds announcing general availability of 5090-based instances, which would make this pipeline reproducible at scale outside Alibaba's internal infrastructure.
  • Paimon 1.x roadmap: Apache Paimon's blob storage support is relatively new. The project's next release could affect API stability for the BlobDescriptor interface used here — teams building on this stack should pin pypaimon versions until the API stabilizes.
  • Competitive responses: AWS SageMaker Data Wrangler and Google Cloud Vertex AI Pipelines do not currently offer a comparable end-to-end video curation pipeline with integrated lakehouse storage. A response from either vendor in the form of a managed DataJuicer-compatible service or a competing open-source release would be significant.