This realization highlighted the gap between the teams delivering software applications and the teams developing and maintaining the data assets behind the company’s data-intensive products. For years, software engineers had benefited from engineering best practices such as the agile development methodology and the ALM tooling that supports it (Git, Jenkins, testing platforms, and so on). Data engineers had no equivalent tools for their data, and so they struggled both to fix quality issues and to recover from them. These hurdles drove up the cost of error for the entire company.
This is why Orr and Katz developed lakeFS, an open source tool that transforms object storage buckets into Git-like repositories. lakeFS provides versioned data lake operations and uses them to bring a development workflow and methodology to the field of big data. It arms data engineers with simple yet powerful tooling that can increase their productivity and reduce their cost of error.
“The basic need of engineers in general, and data engineers in particular, is to be able to develop and test things freely without worrying that their changes will break things in production. Data engineers need to safely & confidently develop and test the pipelines they are building with the entire production data”, says the co-founder and CEO of lakeFS. “lakeFS has created a simple way to develop and test in isolation without needing to copy the data lake multiple times. This is done without any compromise to the performance of the data lake, as lakeFS can manage exabytes of data and allows applications accessing the data to benefit from Git-like operations”.
Through its versioning engine, lakeFS offers built-in operations inspired by Git, helping organizations apply efficient lifecycle management practices to their data engineering. The first of these, the lakeFS branch action, is a cost-effective metadata operation that gives businesses an isolated development and testing environment with a snapshot of the data lake repository, without copying any data. This drastically reduces storage costs and makes data engineers more efficient, since they can develop freely and safely against production data in isolation.
lakeFS also provides a revert action that is atomic and immediate and requires no manual effort, for use whenever an error is found in production data. This supplies the sought-after “undo” functionality that is missing from the way data lakes are designed today.
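To make the idea of a metadata-only branch and an immediate revert concrete, here is a minimal toy sketch in Python. It is not the lakeFS API or its internal format: `ToyRepo`, `Commit`, `create_branch`, and `revert_last` are hypothetical names invented for illustration. It assumes a commit is simply a mapping from logical paths to object addresses and a branch is a named pointer to a commit. As a further simplification, the revert here just moves the branch pointer back to the parent commit.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: logical path -> address of an object in storage."""
    snapshot: Dict[str, str]
    parent: Optional["Commit"] = None


@dataclass
class ToyRepo:
    """Hypothetical toy model for illustration only; not how lakeFS is implemented."""
    objects: Dict[str, bytes] = field(default_factory=dict)    # stands in for the bucket
    branches: Dict[str, Commit] = field(default_factory=dict)  # branch name -> head commit

    def __post_init__(self) -> None:
        # Every repository starts with an empty "main" branch.
        self.branches.setdefault("main", Commit(snapshot={}))

    def create_branch(self, name: str, source: str) -> None:
        # Metadata only: the new branch points at the same commit as the source.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, path: str, data: bytes) -> None:
        # Write one new object, then advance the branch to a new snapshot.
        address = f"obj-{len(self.objects)}"
        self.objects[address] = data
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.snapshot, path: address}, parent=head)

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch].snapshot[path]]

    def revert_last(self, branch: str) -> None:
        # Simplified "undo": point the branch back at the previous commit.
        head = self.branches[branch]
        if head.parent is None:
            raise ValueError("nothing to revert")
        self.branches[branch] = head.parent


repo = ToyRepo()
repo.commit("main", "events/day1.parquet", b"good data")

objects_before = len(repo.objects)
repo.create_branch("experiment", source="main")
assert len(repo.objects) == objects_before  # branching copied no data

repo.commit("experiment", "events/day1.parquet", b"transformed")
assert repo.read("main", "events/day1.parquet") == b"good data"  # main is isolated

repo.commit("main", "events/day2.parquet", b"bad data")
repo.revert_last("main")  # a single pointer update undoes the bad write
assert "events/day2.parquet" not in repo.branches["main"].snapshot
```

In this sketch, creating a branch costs one dictionary entry no matter how much data the repository holds, which is the intuition behind testing against production data without duplicating it.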
By introducing these practices to data management, lakeFS addresses crucial problems that stem from the transient nature of data. Because data is in constant flux, lakeFS applies version control to it, bringing efficiency and agility to data management. Adopting lakeFS can speed up companies’ development and deployment cycles, reduce the chance of incorrect data making it into production, and make recovery less painful if it does.