Let’s be honest for a moment: you want to employ fewer people and get more done in your company. Or at least have them do things that actually matter. If you have to employ people to do work that you could do but really don’t want to, then they probably don’t want to do it either. It would be much better to give them creative and challenging tasks. This way both they and your company would grow.
In that case you would probably like to have software that does these boring tasks for you. With data you can optimize your life and work. But to actually improve anything, you have to make this data work for you. The first step is always data extraction: from all the junk you have in your databases, you have to find the part that is actually useful. Without it you cannot start your data project, because you will not have your building blocks. In this article we will show you what methods are used to achieve it.
What is data and why do we use it?
Right now, it does not matter whether you run your own business or just use social media for your own purposes: you cannot avoid accumulating enormous amounts of data. It can be anything, from text messages and pictures with your friends to statistical measurements of the market. You may ask what the big deal is. Let’s assume you have a web application that sells your company’s products. On the main page, you would like to show the products that are most interesting to a given customer. You can solve this problem with data. By applying data extraction to your web page you can gather information about your customers: for example their age, their location (in this case, the place from which the customer connected to the Internet), and the products they have clicked on. With some statistics, you can then build intelligent advertising for your web page, showing different, specific content to different people in order to boost your profits.
While reading this, you have probably realised how common this is. You can see it every day on social media, news pages and many other sites. But these are not all the possibilities. Data is everywhere, and it may sound silly, but you are limited mostly by your imagination. Does that sound impossible, or at least a little exaggerated? You are not wrong. But you are not right either. Right now, Data Scientists are solving problems connected with language processing, medicine, art, history, business and even politics. You are probably not interested in most of these topics, at least not as far as your business is concerned. But data really does matter there too. The most important things that data can help you with at work are:
- Decision making
- Finding new customers
- Improving customer service
- Predicting sales trends
- Optimising your company’s costs
Here is a good example of what data can optimize. The need for better and more sophisticated software is increasing day by day. As this need grows, some software development teams have to change their approach to creating software. When you are creating an app for a specific group of people, you not only have to write code, but also understand this group. Of course, it is extremely hard to keep your developers in contact with potential users all the time. A more practical and forward-looking approach is to create a demo for the users that collects data about how it is used. Next, you can visualize this data and use it to optimize your project. The visualized data can be shown to your development team to make the needs of your customers clearer.
The topic just raised is called ‘custom software development’; you can read about it in the next section.
Custom Software Development
There are two main kinds of software development. The first one, which everybody is familiar with, is commercial software. We can download it from app stores, use it as web apps or buy a licence for programs like Microsoft Office. Apart from this, however, there is a second kind: custom software development. It means that the software is prepared specifically for a particular group of people to solve their problem. That group could be a team in a corporation, or even the whole corporation itself. The most popular example is a system that automates, or at least optimizes, some recurring tasks. Let’s look at an example. Imagine that there was no such thing as software for cashiers in grocery stores. Every cashier would have to take pen and paper and write down every apple and sweet doughnut you buy, and do the same for every person before and after you. We would spend half of our lives in queues.
As we discussed in the introduction, each one of us produces data. So it is obvious that big companies produce data on a much bigger scale. Successful CEOs are aware that not using this data would mean wasting potential profit. That is why big companies put it to use.
Good examples of such data are the working hours of employees or the availability of rooms in the company’s building. This data makes it possible to create software that helps with, or even autonomously takes care of, routine tasks. It’s easy to come up with examples. One is preparing meetings and schedules for employees based on which rooms are available. Knowing how big the group is, the software can propose a room of appropriate size. Data about the character of a meeting is also important: if the meeting is a presentation, the software would look for a room with a projector, dimmable lights and windows with curtains. A more business-centred approach would be to use customer data from specific locations to make advertising more intelligent and personalised. For example, if our company sells more of product A in one location and more of product B in another, then advertising in each location can be targeted accordingly. In this case the software can suggest the optimal approach.
This way, we can use data not only to maximize our gains but also to minimize our losses, which is sometimes even more important.
If you want to start working with data, you have to get it. But what does that actually mean?
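The room-proposal idea can be sketched in a few lines of Python. This is only an illustration under assumed data: the `Room` fields, the sample rooms and the `propose_room` function are hypothetical and do not come from any real system.

```python
from dataclasses import dataclass


@dataclass
class Room:
    # Hypothetical room record; field names are assumptions for illustration.
    name: str
    capacity: int
    has_projector: bool
    is_free: bool


def propose_room(rooms, group_size, needs_projector=False):
    """Return the smallest free room that fits the group, or None."""
    candidates = [
        room for room in rooms
        if room.is_free
        and room.capacity >= group_size
        and (room.has_projector or not needs_projector)
    ]
    # Picking the smallest suitable room leaves bigger rooms for bigger groups.
    return min(candidates, key=lambda room: room.capacity, default=None)


ROOMS = [
    Room("Small", 4, has_projector=False, is_free=True),
    Room("Medium", 10, has_projector=True, is_free=True),
    Room("Large", 30, has_projector=True, is_free=False),
]

print(propose_room(ROOMS, 3))                        # the "Small" room
print(propose_room(ROOMS, 3, needs_projector=True))  # the "Medium" room
print(propose_room(ROOMS, 20))                       # None, "Large" is taken
```

A real scheduler would also need time slots and more attributes, such as dimmable lights, but the selection logic stays the same: filter by the known constraints, then pick the best remaining candidate.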
Gathering data or, to put it in a more sophisticated and professional manner, data extraction, is the process of obtaining information from various sources. Scraping is a good example because it enables you to use data you do not have in your database. There are two main types of scraping: web scraping and screen scraping. Web scraping, also called web harvesting or web data extraction, is the process of retrieving information from websites, including elements that are hidden in their source code. While web scraping can be done manually by a user, the term typically refers to automated processes using a bot or a program called a web crawler. It is a form of copying in which specific data is gathered from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. While building a web scraper you deal with a ton of patterned text data that can be cleaned in many different and often easy ways. On the other hand, there is screen scraping, a technique in which a computer program extracts data from the human-readable output of another program. In this situation we already have the data, but we cannot process it because of its form, and we are forced to use special methods like Machine Learning.
After all this boring stuff, you are probably asking yourself whether this is really so important for your company. Let’s say you are building an application that analyses messages sent by users of some social media platform, for example Facebook. It uses Facebook’s application programming interface (API) to connect and retrieve this data from the servers. The result you get is in HTML format, and all other data, like photos, is stored in a separate folder. Only the links in the main HTML file tell you which message a specific photo belongs to. To analyse the data you have to ‘clean’ it by means of web scraping, as HTML is a web format. Only then, after you organise and arrange it, can you perform operations to get analytical results.
WASKO S.A. has its own approach to the subject of data extraction. Right now we are working on a project called REGOS.
REGOS is a project based on screen scraping. Its main purpose is to extract and clean data from various logs of network devices. Let’s say you have a ton of devices connected to the network and, as the server owner, you want to get all of their IP addresses. You could do it manually by searching the logs yourself. But why would you? REGOS is software that helps you extract and clean the data you want by using scraping techniques. There are two main approaches. The first one, which is already implemented, is regular expressions (regex), a powerful tool for cleaning text data. The way it works is pretty straightforward: we define the patterns we are interested in and search the text for data that matches them. An easy example is a PESEL number, the Polish equivalent of an SSN. As it always has the same form, regexes are a good solution, since they can find anything that corresponds to a pattern. We only have to tell them which pattern.
Regular expressions have one big drawback in this project: every time a new, previously unknown set of data arrives, we have to create a new expression. This is where the second approach comes in. We plan to use machine learning techniques to create a model that can work on logs it has never seen before. One serious problem we have faced is data being sent as ASCII tables (if you are a network device developer reading this, please stop this trend). Analysing such a table is pretty easy when we know in advance that we have to deal with it. But let’s assume we have about 1000 devices connected to a network and each of them has its own way of structuring data. That is why we have decided to completely change our point of view. What we are trying to build is a natural language processing model, which means that our software will learn what typical logs look like and try to understand the ‘language’ of the logs. To be more precise, we want a model that recognises the context of the desired data, that is, what surrounds it, just by looking at the template given by the user. In other words, what we are planning to create is an intelligent regex. If we achieve that, every time the model sees a line of text containing something that seems to be data, it will return information about what type of message it thinks it has found. It is still more of an idea that is slowly reaching the point of usefulness. But when it gets there, it could change the way we work with network devices.
We have to remember one thing. These days there is no way to avoid creating more and more data. It is almost like the dust accumulating in our apartments but, unlike dust, data can be used to help us, and there is absolutely no reason not to do so.
To summarize: data is just a fancy word for information that is stored somewhere. Most of us have heard the term “Big Data”. It comes from the fact that data accumulates almost everywhere and contains very useful information, but its sheer quantity makes that information hard to retrieve. This work is performed by Data Scientists, Data Engineers and Data Analysts using many different techniques, such as data scraping. The retrieved data can then be processed and used for making strategic decisions, building software and many other things, like optimizing current solutions. Last but not least, there is the current situation at WASKO S.A.: in the REGOS project we want to move from our current solution to a more intelligent one. We will be using so-called “NLP” techniques, which stands for “Natural Language Processing”. This way we will not have to handle each device, command and client individually, as our system will adapt to them by itself.
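To make web scraping less abstract, here is a minimal sketch that uses only Python's standard `html.parser` module. The page content, the `product`, `name` and `price` class names and the `scrape_products` helper are assumptions invented for this example. A real scraper would first download the page and would have to follow the markup that the site actually uses.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would fetch this from a website.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li class="product"><span class="name">Apple</span><span class="price">1.20</span></li>
    <li class="product"><span class="name">Doughnut</span><span class="price">2.50</span></li>
  </ul>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the name and price of every <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return (name, price) tuples ready to be stored in a table or spreadsheet."""
    parser = ProductParser()
    parser.feed(html)
    return [(row["name"], float(row["price"])) for row in parser.rows]


print(scrape_products(SAMPLE_PAGE))  # [('Apple', 1.2), ('Doughnut', 2.5)]
```

The output is a list of clean rows, which is the "copying into a local database or spreadsheet" step in miniature.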
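The cleaning step for such an export could look roughly like the sketch below. The HTML layout is invented for illustration (a `div` with class `message`, the text in a `p` element, attachments as `a` links), and `parse_messages` is a hypothetical helper. A real export will use a different structure, so treat this as the shape of the solution, not as a parser for Facebook data.

```python
from html.parser import HTMLParser

# Invented export fragment: each message may link to a file in a separate folder.
EXPORT = """
<div class="message"><p>Look at this!</p><a href="photos/1.jpg">photo</a></div>
<div class="message"><p>Nice one</p></div>
"""


class MessageParser(HTMLParser):
    """Pairs the text of every message with the files it links to."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._in_text = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "message":
            self.messages.append({"text": "", "attachments": []})
        elif tag == "p" and self.messages:
            self._in_text = True
        elif tag == "a" and self.messages and attrs.get("href"):
            # The link is the only thing tying the photo to its message.
            self.messages[-1]["attachments"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_text = False

    def handle_data(self, data):
        if self._in_text:
            self.messages[-1]["text"] += data


def parse_messages(html):
    parser = MessageParser()
    parser.feed(html)
    return parser.messages


for message in parse_messages(EXPORT):
    print(message["text"], message["attachments"])
```

Once the messages are in this organised form, the analytical part can work on plain text and file paths without touching the HTML again.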
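As an illustration of the regex approach (not the actual REGOS code), the sketch below pulls IPv4 addresses out of a few made-up log lines using Python's `re` module. The log text and the `extract_ips` helper are assumptions for this example. The second pattern shows the PESEL case: it only checks the shape of the number (11 digits) and does not validate the date or the checksum.

```python
import re

# One octet is a number from 0 to 255; an address is four octets joined by dots.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

# Shape-only pattern for a PESEL number: exactly 11 digits.
PESEL = re.compile(r"\b\d{11}\b")

# Made-up log lines, only loosely modelled on what network devices print.
LOG = """\
Jan 10 12:00:01 router1 dhcpd: DHCPACK on 192.168.1.15 to aa:bb:cc:dd:ee:ff
Jan 10 12:00:05 switch2 sshd: Accepted password for admin from 10.0.0.7 port 5022
Jan 10 12:00:09 router1 kernel: link up, firmware 999.1.2.3000
"""


def extract_ips(text):
    """Return the unique IPv4 addresses found in the text, sorted as strings."""
    return sorted(set(IPV4.findall(text)))


print(extract_ips(LOG))  # ['10.0.0.7', '192.168.1.15']
print(PESEL.findall("PESEL: 44051401359, phone: 123456789"))  # ['44051401359']
```

Note that the firmware string in the last log line is not reported, because its parts do not form four valid octets. Getting such details right for every new log format is exactly the manual effort that regular expressions demand.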
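The "easy when we know about it" case can be shown with a short sketch. The table below is invented, and `parse_ascii_table` is a hypothetical helper that assumes one specific layout: border lines starting with `+`, data lines starting with `|`, and the first data line acting as the header. A device that formats its table differently would need different code, which is the problem the learning model is meant to address.

```python
# Invented example of device output formatted as an ASCII table.
TABLE = """\
+-----------+--------------+--------+
| Interface | IP address   | Status |
+-----------+--------------+--------+
| eth0      | 192.168.1.15 | up     |
| eth1      | 10.0.0.7     | down   |
+-----------+--------------+--------+
"""


def parse_ascii_table(text):
    """Turn a '|'-delimited ASCII table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders such as +----+ and empty lines
        # Drop the outer pipes, then split the rest into trimmed cells.
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


for record in parse_ascii_table(TABLE):
    print(record["Interface"], record["IP address"], record["Status"])
```

This simple version does not handle cells that contain a `|` character or tables without a header row.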
Krzysztof Kramarz – A Machine Learning developer and Data Scientist at WASKO S.A. and a student at the Silesian University of Technology. A big fan of American and Asian culture, he would one day like to write an AI that creates hip hop/jazz songs and to own a bar where you could listen to them and drink some good whisky.
Damian Kucharski – Works on data engineering at WASKO S.A. and studies Computer Engineering at the Silesian University of Technology. Privately an NLP enthusiast. Intends to make a bot that mimics other people based on data from messaging apps like WhatsApp.