LinkedIn is, by many measures, one of the smartest users of data in the world. The data scientists at the Mountain View, Calif.-based professional networking site take member profiles and analyze the heck out of them.

They have the world’s finest tools at their disposal — from sources and sinks like Hadoop, Teradata, Oracle, Espresso; to schedulers like EasyData, Oozie, Azkaban and Appworx; to transformers like Informatica, SQL, Spark, Hive and Cubert.

Being able to pick from such a wide variety of tools is nice, but it can also create a bit of a mess.

Consider that LinkedIn has now captured the status of 50 thousand datasets (with a storage footprint of more than 15 petabytes across multiple Hadoop, Teradata and other clusters), 14 thousand comments, 35 million job executions and related lineage information.

Metadata Around Your Data

Where did each one come from and how did it get there? 

That’s a problem LinkedIn engineers Shirshanka Das, Jianyong Bai, Zhen Chen, Eric Sun and Zhaonan Sun set out to solve. To do so, they created a data discovery and lineage portal which they, rather appropriately, named WhereHows. It helps the company’s data scientists understand the data that they own, find the datasets they need and then share them.

This isn’t difficult to do inside a single product. But when you are working across products, there is no good tool available.

“We looked for commercial solutions and at open source projects and couldn’t find one,” Das told CMSWire. 

So the team at LinkedIn did what it had to do. It built one for its own use and then spent time adding capabilities and refining it. 

Though WhereHows will probably never be a “finished” product, it’s polished enough now that the team wants to share it with the greater community of data engineers and scientists, and LinkedIn’s management has given them the go-ahead.

Now on GitHub

Today, WhereHows becomes available on GitHub under an Apache 2.0 license. The immediate win for companies that are encountering similar data discovery and lineage problems is that they might not need to build their own solutions because there’s one free for the taking.

But in the longer term, something pretty special could happen, especially if the greater data community shows an interest in WhereHows and wants to participate in its growth.

After all, since developers won’t have to spend time building a WhereHows-like solution from scratch, they can put their efforts into refining it and extending its capabilities for the benefit of all. That’s the idea behind open source.

But LinkedIn isn’t open sourcing WhereHows and walking away. It plans to broaden its metadata coverage by integrating with more data systems, such as Kafka and Samza, and to enrich its metadata by integrating with data lifecycle management and provisioning systems such as Gobblin and Nuage.

But enough about the future. Today is a very big day for five LinkedIn engineers who set out to solve a problem they were having and are now donating the results of their efforts to the greater community. They can be proud.