# The Best Way to Manage Unstructured Data Efficiently

![](https://lh6.googleusercontent.com/DkkTi_ms4Y7bsxSa54AAXU6K73GrsZxGqwFyMMWpVj8CUcMHK-e2f17OLVu7zAtPWSKeaCgrwcSIMbEGPUoGoJNHefI6KrPW062YvoTbkwxYmdlECihWzKgUlh1UU5UMf4B_JF6u)

Photo by [jesse orrico](https://unsplash.com/@jessedo81?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)

As much as 90 percent of all data is unstructured, and unstructured data is growing by [55–65 percent each year](https://www.forbes.com/sites/bernardmarr/2019/10/16/what-is-unstructured-data-and-why-is-it-so-important-to-businesses-an-easy-explanation-for-anyone/?sh=2668ad2f15f6). If you have been working in data science for a while, you must have noticed the sheer amount of it. If you aren't familiar with the term "unstructured data", think of data where the structure of the samples is inconsistent: audio files, text files, videos and even images.

The sheer volume of current and expected unstructured data means that data scientists must be fully capable of analyzing and working with many kinds of data to obtain meaningful insights. The papers we read most commonly use tabular data or some other sort of labeled data but, in the real world, marked and labeled data rarely exists. If you want to work in data science, you will need to be excellent at analyzing unstructured data.

One of the latest technologies growing popular, and proving tremendously useful, in the area of handling unstructured data is [**object storage**](http://lakefs.io/object-storage). It has solved almost every issue that one faces with traditional structured storage systems:

1. Scalability issues
2. Retrieval speeds that degrade as the consumed storage grows
3. Limitations imposed by the standard file hierarchy

## What is object storage?

A huge portion of modern tech companies is moving towards a data-driven approach: they want to collect huge amounts of data about their users so that they can base decisions on that data rather than on hunches and guesses. This comes with challenges, though, and the first major one is scalability. A standard storage system like file storage or block storage scales quite badly. Think about it this way: if you have a huge number of files, you probably have a deeply nested folder hierarchy, and when a user requests one of those files, the file system has to keep digging through that hierarchy to find it. The more files you have, the slower this process becomes, until it starts offering a terrible user experience, especially for impatient users like myself.

![](https://lh3.googleusercontent.com/8vnMN-Z1ECYpD_Fy0zVBZ8w9Zn458e1LM3G331cm8p3XZnddS-yvKXqAnrEqO-xylAkp-PMx4jIlHytUQNbFpd_EXST3wnkBR5JRN8CGqh5Jtmn9BUiOKswO0qSZkTjoK4iGxEcD)

Photo by [USDC Technologies](https://usdc.vn/object-storage-vs-traditional-storage/)

Object storage solves this problem by offering a flat structure, meaning that all of the data is stored on the same level. You might be wondering how object storage manages fast navigation without a hierarchy, particularly since navigation is supposed to be the main advantage of a file system.

## Is it better than traditional forms of storage?

A typical object store keeps three things for each data sample:

1. The data itself.
2. Metadata: the size of the data, the date of modification/upload and so on. One of the best things about object storage is that this is a customizable field, meaning that you can store whatever metadata you want for each data sample; that isn't the case for standard file systems.
3. A unique identifier for each data sample.

The trick to the quick navigation and retrieval speeds of modern object storage lies in that unique identifier. I think of it as the magic of [indexing](https://stackoverflow.com/questions/1108/how-does-database-indexing-work).
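To make those three components concrete, here is a minimal sketch of storing and retrieving an object through an S3-compatible API with Python's `boto3`. The bucket name, key and metadata fields are all hypothetical, and the client setup assumes your credentials are already configured:

```python
import boto3

# Any S3-compatible object store exposes the same three components:
# the data itself, customizable metadata, and a unique key.
s3 = boto3.client("s3")

# Store an object under a unique key, together with custom metadata.
with open("interview-0001.wav", "rb") as f:
    s3.put_object(
        Bucket="my-unstructured-data",   # hypothetical bucket
        Key="audio/interview-0001.wav",  # the unique identifier
        Body=f,                          # the data itself
        Metadata={                       # whatever fields you like
            "source": "field-recording",
            "duration-seconds": "312",
            "region": "eu",
        },
    )

# Retrieve it directly by its key: no directory tree to walk.
obj = s3.get_object(Bucket="my-unstructured-data", Key="audio/interview-0001.wav")
print(obj["Metadata"])  # the custom metadata comes back with the object
```

Note that the slash in the key is purely cosmetic: the store is flat, and the key is just one opaque identifier that maps straight to the object.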
### Scalability and retrieval

Object storage doesn't rely on a file structure; it simply stores the data samples as nodes with the three components mentioned above and uses the ID to find them. This means the data samples don't need to live on the same device or in the same physical location, which is a great advantage because they can be split across multiple locations. So, for instance, if you have a collection of data points that you know will only be accessed from a certain country, you can store them on the server closest to that country. This gives quicker retrieval times at no extra cost. Also, because the metadata is customizable, you can easily use it to encode policies and regulations around the data and avoid legal conflicts: some countries may not allow the presence of certain types of data within their borders, and the same goes for export-controlled products, for example.

## With tons of data and large object storage comes data science potential

![](https://lh6.googleusercontent.com/WDBErC5EIvRbBUlvHWSDmBREJKEuQtynAUkPmOx_7QH32rg4KaEqABOWp_M0va4BA1bDXRKPfD1Wr3pW1AdgDcebWJw3mOk53wgaFNuuEmuk8m2ORW5Qn9ZfhcimmFwZBDD4Uv0B)

Photo by [Stephen Phillips - Hostreviews.co.uk](https://unsplash.com/@hostreviews?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)

When building a high-quality ML production pipeline, a lot of people place most of their focus on data analysis techniques and machine learning models. What many of them miss is that storage is one of the most important aspects of the pipeline. A pipeline has three main components: collecting data, storing it and consuming it, and an effective storage method doesn't just boost storage capabilities; it also makes collection and consumption more efficient. The ease of searching over customizable metadata that object storage offers helps with both.

Not only do you want to choose the correct storage technology, you also want to choose the correct provider. [AWS](https://aws.amazon.com/) comes to mind as one of the best object storage providers, mainly because its infrastructure provides smooth service and easy scaling.

Furthermore, for effective consumption of the data, there must be a software layer running on top of this storage for data aggregation and collection purposes. This is also an important choice, and it deserves an article dedicated to the topic. Essentially, you want data version control (DVC) that versions and manages the data. Although more data is generally better for machine learning models, there are sometimes anomalous data points that throw the model off, meaning the model doesn't perform well on a specific anomaly or group of anomalies. If that happens, you want to be able to go back to the last best version quickly and investigate those anomalies; quite often, those anomalies are key to improving machine learning models.

On top of the versioning software, you will probably need a data aggregation framework such as a data lake management platform or [Apache Spark](https://databricks.com/spark/about). These layers provide essential basic data operations that significantly ease the consumption of data by machine learning models. I used to think that adding these layers wasn't that effective, but separating these groups of operations into different layers saves you tons of time debugging your models in the long run. This layering is actually the backbone of most high-quality web applications and software projects.
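To make the versioning idea concrete, here is a minimal sketch of going back to a known-good dataset version using DVC's Python API. The repository URL, file path and revision name are all hypothetical; any Git revision (tag, branch or commit) that DVC tracks would work:

```python
import dvc.api

# "last-best" is a hypothetical Git tag marking the last data version
# the model performed well on.
with dvc.api.open(
    "data/training_set.csv",             # hypothetical DVC-tracked file
    repo="https://github.com/org/repo",  # hypothetical repository
    rev="last-best",
) as f:
    baseline = f.read()

# Read the current version of the same file for comparison, to hunt
# for the anomalous data points that threw the model off.
current = dvc.api.read(
    "data/training_set.csv",
    repo="https://github.com/org/repo",
)
```

The same workflow exists in other data versioning tools; the point is that a bad model run becomes a diff between two dataset versions rather than a mystery.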
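And for the aggregation layer, here is a minimal PySpark sketch that reads raw events straight out of object storage and turns them into model-ready aggregates. The `s3a://` paths and column names are made up, and the example assumes the cluster has the S3 connector configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-aggregation").getOrCreate()

# Spark reads directly from object storage; no file hierarchy involved,
# just a key prefix.
events = spark.read.json("s3a://my-unstructured-data/events/")

# A basic aggregation that turns raw events into per-user features.
features = (
    events.groupBy("user_id")
          .agg(
              F.count("*").alias("event_count"),
              F.avg("session_length").alias("avg_session_length"),
          )
)

# Write the aggregates back to object storage for the models to consume.
features.write.mode("overwrite").parquet("s3a://my-unstructured-data/features/")
```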
## Final thoughts and takeaway

Finally, you want the integration between your storage system, versioning platform and aggregation platform to be as seamless as possible. This is one of the most important considerations: not all of the options are easily compatible with one another, and a rough configuration is usually a nightmare to live with.

I believe that every data scientist should start looking into harnessing the power of effectively analyzing and storing unstructured data, since it makes up most of the data we have today and will make up even more of it in the future.