lakeFS: Spearheading Data Version Control in the Field of Data Management

Einat Orr, Ph.D., Co-Founder and CEO
Einat Orr, Ph.D., Co-Founder and CEO of lakeFS, recalls a time when she and her co-founder, Oz Katz, were leading engineering at SimilarWeb (NYSE: SMWB). As part of the data development process for their big-data product, which relied on eight petabytes of data on S3, they ran periodic retention jobs that deleted data that was no longer required. On one occasion, a retention job mistakenly deleted one petabyte of production data that should have been kept. At that moment, Orr realized that she couldn’t easily reverse an action taken on her own data lake.

This realization highlighted the difference between the teams that delivered software applications and the teams that developed and maintained the data assets behind the company’s data-intensive products. For years, software engineers had benefited from engineering best practices such as the agile development methodology and the ALM tooling that supports it (Git, Jenkins, testing platforms, and so on). Data engineers had no such basic tools for their data, and so they struggled both to fix quality issues and to recover from them. The result was an enormous cost of error for the entire company.

This is why Orr and Katz developed lakeFS, an open source tool that transforms object storage buckets into Git-like repositories. lakeFS provides versioned data lake operations and uses them to bring a development workflow and methodology to the field of big data. It arms data engineers with simple yet powerful tooling that can increase their productivity and reduce their cost of error.

“The basic need of engineers in general, and data engineers in particular, is to be able to develop and test things freely without worrying that their changes will break things in production. Data engineers need to safely and confidently develop and test the pipelines they are building with the entire production data,” says the co-founder and CEO of lakeFS. “lakeFS has created a simple way to develop and test in isolation without needing to copy the data lake multiple times. This is done without any compromise to the performance of the data lake, as lakeFS can manage exabytes of data and allows applications accessing the data to benefit from Git-like operations.”

lakeFS can manage billions of objects and provides highly scalable, high-performance data version control for data lakes. The open source lakeFS project supports AWS S3, Azure Blob Storage, and Google Cloud Storage (GCS) as its underlying storage service, along with on-premises object stores that expose an S3 interface, such as Ceph, Vast, Weka, and Dell EMC S3. It is API-compatible with S3 and integrates with popular data frameworks such as Spark, Hive, dbt, Trino, and many others.

Through its versioning engine, lakeFS provides built-in operations inspired by Git, helping organizations apply efficient lifecycle management practices to their data engineering. The first, the branch action, is a cost-effective metadata operation that gives a business an isolated development and testing environment with a snapshot of the data lake repository, without copying any data. This drastically reduces storage costs and makes data engineers more efficient, as they can develop freely and safely against production data in isolation.
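
To see why a branch can be created without copying data, consider the toy model below. It is only an illustrative sketch and makes no claim about how lakeFS is implemented: the class ToyRepo and its methods are hypothetical names invented for this example. Objects are stored once, keyed by content hash, and each branch is just a mapping from paths to those hashes, so creating a branch copies metadata only.

```python
import hashlib


class ToyRepo:
    """Toy in-memory versioned object store (hypothetical, not the lakeFS API)."""

    def __init__(self):
        self.objects = {}              # content hash -> bytes, each stored once
        self.branches = {"main": {}}   # branch name -> {path: content hash}

    def put(self, branch, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)   # identical content is not stored twice
        self.branches[branch][path] = digest

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source):
        # Only the path -> hash mapping is copied; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.put("main", "events/day1.csv", b"a,b\n1,2\n")

repo.create_branch("experiment", "main")
stored_after_branch = len(repo.objects)   # branching added no new objects

# A write on the branch is isolated: main still sees the original object.
repo.put("experiment", "events/day1.csv", b"a,b\n9,9\n")
```

In this sketch the cost of a branch is proportional to the size of the metadata, not the data, and a change made on the experiment branch never touches what main points to.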

lakeFS also provides a revert action for cases where an error is found in production data. The revert is atomic and immediate and requires no manual effort. This delivers the sought-after “undo” functionality that is missing from the way data lakes are designed today.
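
The idea of an "undo" for a data lake can be sketched with a small commit log. This is a hedged illustration, not lakeFS's implementation: ToyBranch, commit, and revert_last are hypothetical names, and the model assumes each commit stores a full snapshot of path-to-object-id mappings. The revert is recorded as a new commit that restores the previous snapshot, so history is preserved.

```python
class ToyBranch:
    """Toy commit log for one branch (hypothetical, not the lakeFS API)."""

    def __init__(self):
        # Each commit is a snapshot {path: object id}; the first one is empty.
        self.commits = [{}]

    def head(self):
        return self.commits[-1]

    def commit(self, changes):
        """Apply changes on top of the head; a value of None deletes the path."""
        snapshot = dict(self.head())
        for path, object_id in changes.items():
            if object_id is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = object_id
        self.commits.append(snapshot)

    def revert_last(self):
        """Undo the last commit by appending a commit that restores its parent."""
        if len(self.commits) < 2:
            raise ValueError("nothing to revert")
        self.commits.append(dict(self.commits[-2]))


prod = ToyBranch()
prod.commit({"2021/jan.parquet": "obj-1", "2021/feb.parquet": "obj-2"})

# A faulty retention job deletes data that should have been kept.
prod.commit({"2021/jan.parquet": None, "2021/feb.parquet": None})
state_after_bad_job = dict(prod.head())

# One operation restores the previous state; the bad commit stays in history.
prod.revert_last()
```

Because the revert only changes which snapshot the branch head refers to, it takes effect in a single step in this model, and the faulty commit remains available for later inspection.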

In bringing these practices to data management, lakeFS addresses problems that stem from the transient nature of data. Because data is in constant flux, data version control gives teams a way to manage it with greater efficiency and agility. Adopting lakeFS can speed up companies’ development and deployment cycles, reduce the chance of incorrect data making it into production, and make recovery less painful if it does.


Santa Monica, CA, US.

lakeFS transforms object store buckets into Git-like managed repositories, enabling similar development workflows for code and data and saving organizations money and engineering effort.