
### Enhancing Machine Learning Iteration Speed through Accelerated Application Development and Packaging


In the rapidly evolving realm of AI/ML development, it is essential that our infrastructure keeps pace with the escalating demands of our ML engineers. Their workflows encompass tasks such as checking out code, writing code, building, packaging, and verification.

To uphold efficiency and productivity while empowering our ML/AI engineers to deliver innovative solutions, we had to address two primary challenges: slow builds and inefficiencies in packaging and distributing executable files.

The issue of slow builds often emerges when ML engineers operate on older revisions that our build infrastructure fails to cache effectively, necessitating repetitive rebuilding and relinking of numerous components. Furthermore, build non-determinism exacerbates this problem by triggering rebuilding due to the generation of different outputs for the same source code, rendering previously cached results obsolete.

Another significant hurdle lies in executable packaging and distribution. Traditionally, most ML Python executables were in the form of XAR files, posing challenges in leveraging OSS layer-based solutions efficiently. The computational costs associated with creating such executables, especially with a large number of files or substantial sizes, can be prohibitive. Even minor modifications to a few Python files often mandate a complete XAR file reassembly and distribution, causing delays in execution on remote machines.

Our objective in enhancing build speed was to minimize extensive rebuilding requirements. To achieve this, we optimized the build graph by reducing dependency counts, addressed build non-determinism challenges, and maximized the utilization of built artifacts.

Concurrently, our focus on packaging and distribution aimed to introduce incrementality support, eliminating the time-consuming overhead linked to XAR creation and distribution.

#### Strategies for Enhanced Build Speeds

To expedite builds, our goal was to minimize unnecessary processes by tackling non-determinism and eliminating redundant code and dependencies.

We pinpointed two sources of build non-determinism:

  • Tooling Non-determinism: Certain compilers like Clang, Rustc, and NVCC may produce varying binary files for the same input, leading to inconsistent results. Resolving these tooling non-determinism issues proved challenging, often requiring in-depth root cause analysis and time-consuming fixes.
  • Source Code and Build Rules Non-determinism: Developers inadvertently introduced non-determinism by incorporating elements like temporary directories, random values, or timestamps into build rule code. Identifying and rectifying these issues demanded significant time investment (see the sketch after this list).
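
To illustrate the second category, here is a minimal, hypothetical sketch (in Python, not our actual build-rule code) of how embedding a timestamp or temporary path in a build action makes the output differ on every run, while deriving the stamp from the input alone keeps outputs byte-identical and cache-friendly:

```python
import hashlib
import tempfile
import time


def non_deterministic_build(source: bytes) -> bytes:
    # Embedding a wall-clock timestamp and a temporary path means every run
    # produces different bytes, so previously cached artifacts are never reused.
    stamp = f"built at {time.time()} in {tempfile.mkdtemp()}".encode()
    return source + stamp


def deterministic_build(source: bytes) -> bytes:
    # Deriving the stamp purely from the input produces identical output for
    # identical sources, keeping cached results valid across rebuilds.
    stamp = hashlib.sha256(source).hexdigest().encode()
    return source + stamp


src = b"def main(): ...\n"
assert deterministic_build(src) == deterministic_build(src)
assert non_deterministic_build(src) != non_deterministic_build(src)
```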

Thanks to Buck2, which directs most build actions to the Remote Execution (RE) service, we successfully implemented non-determinism mitigation within RE. This ensures consistent outputs for identical actions, paving the way for a stable revision for ML development, thereby reducing build times significantly in many scenarios.
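
Conceptually, this mitigation amounts to keying each action by a digest of its command and inputs and serving the first recorded output for every identical action. The sketch below uses hypothetical names and an in-memory dictionary standing in for the real distributed service, purely to illustrate the idea:

```python
import hashlib
import json
from typing import Callable, Dict, List

# Hypothetical in-memory action cache; the real Remote Execution service is
# distributed, but the principle is the same: the first output recorded for an
# action digest is returned for every subsequent identical action.
_action_cache: Dict[str, bytes] = {}


def action_digest(command: List[str], input_digests: List[str]) -> str:
    payload = json.dumps({"cmd": command, "inputs": sorted(input_digests)})
    return hashlib.sha256(payload.encode()).hexdigest()


def execute(command: List[str], input_digests: List[str],
            run: Callable[[], bytes]) -> bytes:
    key = action_digest(command, input_digests)
    if key not in _action_cache:
        # Even if the underlying compiler is non-deterministic, later runs of
        # the same action see exactly the bytes produced the first time.
        _action_cache[key] = run()
    return _action_cache[key]
```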

While completely removing the build process from the critical path of ML engineers may not always be feasible, we recognize the importance of managing dependencies to control build times. As dependencies grew, we enhanced our tools to better manage them, identifying and eliminating unnecessary dependencies. These improvements streamlined build graph analysis and overall build times. For instance, we excluded GPU code from the final binary when unnecessary and devised methods to identify utilized Python modules and reduce native code using linker maps.
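
As one generic example of this kind of dependency analysis, the standard library's modulefinder can approximate which Python modules an entry point actually reaches, flagging dependencies that never need to be packaged. This is a sketch, not our internal tooling, and train.py is a hypothetical entry point:

```python
from modulefinder import ModuleFinder

# Walk the import graph reachable from a (hypothetical) entry point and list
# the modules it actually uses; anything outside this set is a candidate for
# removal from the executable.
finder = ModuleFinder()
finder.run_script("train.py")

used = sorted(finder.modules)
print(f"{len(used)} modules reachable from train.py")
for name in used[:20]:
    print(" ", name)
```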

#### Implementing Incrementality for Executable Distribution

A typical self-executable Python binary comprises numerous Python files (.py and/or .pyc), substantial native libraries, and the Python interpreter. This results in a multitude of files, often numbering in the hundreds of thousands, with a total size reaching tens of gigabytes.

Engineers spend considerable time on incremental builds, where the packaging and fetching overhead of such a large executable surpasses the build time itself. In response, we introduced a novel solution for packaging and distributing Python executables: the Content Addressable Filesystem (CAF).

CAF operates incrementally during both the packaging and fetching stages:

  • Packaging: By adopting a content-aware approach, CAF skips uploads of files already present in Content Addressable Storage (CAS), eliminating redundancy across different executables and across versions of the same executable.
  • Fetching: CAF maintains a cache on the destination host, so only new content needs to be downloaded (see the sketch after this list).
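
A minimal sketch of the idea, with plain dictionaries standing in for CAS and the host-local cache (names and structures are illustrative, not CAF's actual interfaces):

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Set


def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def package(files: Iterable[Path], cas: Dict[str, bytes]) -> Dict[str, str]:
    """Upload only content the store has not seen; return a path -> digest manifest."""
    manifest = {}
    for f in files:
        d = digest(f)
        if d not in cas:          # skip redundant uploads of already-stored content
            cas[d] = f.read_bytes()
        manifest[str(f)] = d
    return manifest


def fetch(manifest: Dict[str, str], cas: Dict[str, bytes],
          local_cache: Set[str]) -> None:
    """Download only the digests missing from the destination host's cache."""
    for path, d in manifest.items():
        if d not in local_cache:
            _ = cas[d]            # only new content is actually transferred
            local_cache.add(d)
```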

To optimize efficiency, we deploy a CAS daemon on most of Meta’s data center hosts. The CAS daemon manages the local cache on the host, materializes content into it, and organizes a P2P network with other CAS daemon instances using Owl, our high-fanout distribution system for large data objects. This network allows content to be fetched directly from other CAS daemon instances, significantly reducing latency and storage bandwidth usage.

In the context of CAF, an executable is defined by a manifest file detailing all symlinks, directories, hard links, files, digests, and attributes. This implementation allows deduplication of unique files across executables and employs a smart affinity/routing mechanism for scheduling, minimizing content downloads by maximizing local cache utilization.
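
A manifest of this kind might look roughly like the following; the field names are assumptions for illustration, not CAF's actual schema:

```python
# Illustrative manifest shape only; real manifests carry many more attributes.
manifest = {
    "directories": ["lib", "lib/python3.10"],
    "files": [
        {"path": "lib/python3.10/model.py",
         "digest": "sha256:9f2c1a...",   # content digest used for deduplication
         "size": 4096,
         "mode": 0o644},
    ],
    "symlinks": [{"path": "bin/python", "target": "runtime/bin/python3.10"}],
    "hardlinks": [{"path": "lib/python3.10/alias.py",
                   "target": "lib/python3.10/model.py"}],
}
```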

While CAF bears similarities to Docker’s OverlayFS, our approach differs significantly: the diverse dependencies of our executables make layering less efficient and more complex to organize, and direct access to individual files is crucial for P2P support.

We opted for Btrfs as our filesystem due to its compression capabilities, direct writing of compressed storage data to extents, and Copy-on-write (COW) capabilities. These features enable us to maintain executables on block devices with a size similar to XAR files, share files from cache across executables, and implement an efficient COW mechanism that only copies affected file extents when necessary.
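
On Btrfs, sharing files from the cache into an executable’s tree can be done with reflink copies, which share extents until a write occurs; only the modified extents are then duplicated. A generic illustration of the mechanism (the paths are hypothetical, and this is not CAF’s internal code):

```python
import subprocess

# A reflink copy on Btrfs shares the source file's extents with the destination;
# data is duplicated only for extents that are later modified (copy-on-write).
subprocess.run(
    ["cp", "--reflink=always",
     "/var/cache/caf/libtorch.so",            # file already materialized in the cache
     "/executables/job123/lib/libtorch.so"],  # executable tree being assembled
    check=True,
)
```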

#### LazyCAF and Uniform Revision Enforcement: Future Enhancements to ML Iteration

Our implemented improvements have significantly reduced overhead and enhanced the efficiency of our ML engineers. Faster build times and more efficient packaging and distribution of executables have led to double-digit percentage reductions in overhead.

However, our journey to minimize build overhead continues. We have identified several promising enhancements that we plan to implement soon. By optimizing the fetching of executable parts on demand, we aim to reduce materialization time and minimize the overall disk footprint in scenarios where only a fraction of the executable content is utilized.

Furthermore, enforcing uniform revisions across all ML engineers can accelerate the development process. Operating on the same revision will enhance cache hit ratios, leading to a higher percentage of incremental builds as most artifacts will be cached.
