What is a big data developer?

Big data

Driven by new storage technologies based on approaches such as in-memory computing, column-oriented databases, and distributed programming models (MapReduce), the topic of big data has become increasingly relevant, especially in larger companies. Top managers of large corporations on the business and IT side have to deal with this megatrend and evaluate how the new technological possibilities can best be used in their area of responsibility.
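
The distributed programming model mentioned above can be illustrated with a minimal, single-machine sketch of the MapReduce idea, using the classic word-count example. The function names are illustrative and not tied to any particular framework:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (key, value) pair for every word in the document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big tools", "data drives decisions"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, the map and reduce phases run in parallel on many machines and the shuffle moves data between them over the network; the structure of the program, however, stays the same.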

While companies used to work almost exclusively with data from their own applications, a large number of external sources such as social media or networked devices in the Internet of Things have been added in recent years. This has led to new job profiles: the term "data scientist" has been appearing more and more frequently of late. The data scientist seems to be the kind of "wizard" every company needs to bring the marvel of big data to life. Like a many-headed hydra, he appears to be the solution to every problem: something different for everyone, but always a fit. New degree programs are emerging that train their students to become a "Master of Data Science", and not just since the Harvard Business Review named the role the "Sexiest Job of the 21st Century". But who is this hero of the present, whose job description is not actually all that new?

To provide some insight and a more nuanced picture, the following sections describe terms and roles within companies that are often associated with the data scientist profession.

(Big) data engineer

The data engineer is essentially responsible for bringing data together. From the available data and technologies, he creates the landscape in which the data scientist can operate. His knowledge is not limited to the data available in the company and where it is stored; he also knows how best to integrate this data into a central analysis infrastructure, which technologies are suitable for doing so, and which additional external data can be used for enrichment.

He becomes a big data engineer when he works with amounts of data so large that they require big data technologies for storage and processing. Where big data begins is not strictly defined, but large amounts of data might be, for example, one million sales transactions at an online retailer, one million connected phone calls at a telecommunications provider, or a sensor that produces 50 megabytes of data every two nanoseconds. His work begins with understanding the technical requirements and planning and building a robust and flexible big data infrastructure (in this role he is also known as a big data architect), continues with connecting internal and external data sources via batch, real-time, and streaming interfaces, and extends to ensuring smooth operation and keeping the data up to date. He is, in effect, the stadium architect, greenkeeper, and kit manager of the soccer team. The (big) data engineer is the master of the data supply.

Management scientist

The management scientist, to stay with the soccer metaphor, is more like the manager or head coach. He is the first on site, analyzes the situation, and discusses the business problems that are to be solved with the help of data analyses. With the growing popularity of data-driven decision support, there is hardly a business area or industry today in which data analysis is not used.

The management scientist's contribution consists in translating the language of the domain expert, who may know little about analytics or data, into the language of the data scientist. It starts with specifying the actual business problem and translating and sharpening the underlying analytical question, continues with identifying the required data and managing the operational analysis, and extends to communicating analytical results and recommendations for action. For this job, the management scientist needs a good understanding of analytical methods and procedures as well as of business processes and their effects. He needs enough familiarity with the specialist departments to understand the domain expert and to explain the problem to the data scientist, as well as the ability to evaluate analytical results and to present the approach and results to the domain expert convincingly in his own language. The management scientist is the mediator between two worlds.

(Big) data scientist

The main task of the data scientist is to generate answers to analytical questions from data, using analytical methods from statistics, machine learning, or operations research. He becomes a big data scientist when he works with large amounts of data and generates insights using analysis methods built on big data technologies such as Hadoop. His task begins with understanding the business problem, continues with selecting the necessary building blocks such as data, technologies, and methods, and leads into an idea-generation phase.

A good data scientist is characterized above all by the breadth of this toolkit and by quick access to the appropriate resources. The stereotype of the data scientist is a bearded, T-shirt-wearing nerd who, with the help of freely available tools and technologies, extracts ingenious insights from a small amount of data within a very short time and then visualizes and explains them clearly. He is the playmaker who can make all the difference in the decisive match. The data scientist works closely with the data engineer and the management scientist in every phase, because only together can they solve the task at hand. The fewer of the data engineer's or management scientist's tasks the data scientist has to take on, the more time he has for his actual work - just as in professional soccer, the players do not set up the goals or mark the lines on the pitch themselves. But the smaller the company, the more often the data scientist takes on the tasks of the data engineer and the management scientist as well.
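
To make "generating answers from data with analytical methods" a little more concrete, here is a minimal sketch in plain Python: an ordinary least-squares trend fit. The monthly sales figures are invented toy numbers for illustration only:

```python
def fit_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data: sales per month (illustrative numbers, not from the article).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 120, 125, 140, 150, 165]
slope, intercept = fit_trend(months, sales)
print(f"trend: +{slope:.1f} per month")  # trend: +12.3 per month
```

In practice the data scientist would reach for established libraries and far richer models; the point here is only the pattern of turning raw numbers into a quantified, communicable statement.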

  1. 15 analytics tools for web, mobile and social at a glance
    Modern analytics tools from the cloud enable companies to better understand their customers and to plan and evaluate their marketing initiatives more efficiently. The following is a presentation of professional alternatives for analyzing websites, mobile apps and social media profiles.
  2. Mixpanel
    Mixpanel is a sophisticated analytics tool for web and mobile apps. Software manufacturers and website operators who want to understand their users better can benefit from this.
  3. Intercom
    A lesser-known but promising alternative to Mixpanel is Intercom. The SaaS service, also from San Francisco, addresses software providers who not only want to know who their users are and how they use their products, but also want to get in touch with them.
  4. Kissmetrics
    While Google Analytics focuses on page impressions, visitor numbers, and similar statistics, Kissmetrics shows which people are behind the clicks. The highlight: thanks to sophisticated "user tracking" methods, the service, launched in California in 2008, is able to record the activities of site visitors across various online channels.
  5. Woopra
    In direct competition with Kissmetrics is Woopra. This solution, which is specially tailored to the needs of sales and marketing teams, also focuses on personal customer profiles.
  6. GoSquared
    Professional analytics tools don't necessarily have to come from the United States. The software provider GoSquared, based in England, is proof of this. Its analytics platform of the same name is primarily aimed at e-commerce providers and scores with professional features in the areas of social, real-time statistics, and trends.
  7. Chartbeat
    Chartbeat is a useful tool that focuses on the analysis of real-time data. What is happening on my website right now? How many visitors are currently active on this or that page? Which countries do they come from?
  8. App Annie
    App providers who want to professionally measure the success of their mobile apps will find an analytics service in App Annie that provides detailed app store statistics.
  9. Flurry Analytics
    Flurry Analytics is something like Google Analytics, but specifically for app providers. The service from California is also used for efficient data traffic analysis, not just from websites, but from mobile apps.
  10. Apsalar
    A notable alternative to Flurry Analytics, which is also developed in San Francisco and is compatible with iOS and Android, is Apsalar. In this case, the user must also incorporate an SDK (Software Development Kit) into his app, which ensures that the user data is automatically recorded.
  11. App Figures
    App developers want to know not only how their own app is received by users and how it is used in practice, but also how it compares to the competition - this is where App Figures comes in.
  12. Mopapp
    App providers who are interested in tools such as Apsalar and App Figures but want to evaluate more than just the iOS and Android stores will find what they are looking for in Mopapp.
  13. AppTrace
    With AppTrace, the Berlin software house Adjust offers another online service that also provides many interesting store statistics and is free of charge. As the provider explains, public data from 155 countries are evaluated.
  14. SocialBench
    SocialBench is a sophisticated social marketing tool that brings community management and numerous analytical tools to a common denominator.
  15. Sprout Social
    A good alternative to SocialBench is Sprout Social. Founded in Chicago in 2010, the cloud service also serves as a holistic social media management dashboard that has numerous analytics and monitoring functions.
  16. Quintly
    The online service Quintly enables the effective analysis and control of your own company presence in the most important social networks. The solution developed in Cologne supports Facebook, Twitter, Youtube, Google+, LinkedIn and Instagram.

Incidentally, data scientists already existed in the past. They were called data miners or data analysts, or they held specialized roles with titles such as marketing analyst, actuary, or logistics planner. What has changed, and what now feeds into the new job description of the data scientist, is a form of creativity and cleverness. The data scientist is the master of data evaluation.

Data steward

The data steward is a role that is also often mentioned in this context. Compared to the other three job profiles, he has nothing to do directly with the game itself: he is responsible for monitoring the quality and correctness of data. He often shares this task with other data stewards, each responsible for a different data area, or works with colleagues in the IT department known as "data custodians". The data areas are structured along business entities, for example customers, products, transactions, payments, partners, or campaigns.

The tasks of the data steward range from defining the data areas under his authority, through setting general rules and guidelines on the content and use of these areas, to checking compliance with the applicable rules. By ensuring adherence to these standards, he safeguards the lasting quality of the data, its processing, its evaluation, and ultimately its use in the operational control of business processes and in decision-making. Malicious tongues claim that the data steward is like a sports official: you don't know exactly what he is for, but somehow you need him. (sh)
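
The rule-and-compliance part of the data steward's work can be sketched in a few lines of Python. The data area, field names, and rules below are invented for illustration:

```python
# Hypothetical quality rules for a "customer" data area.
rules = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "country": lambda v: v in {"DE", "AT", "CH"},
}

def check_record(record):
    """Return the names of all fields that violate a rule."""
    return [field for field, is_valid in rules.items()
            if not is_valid(record.get(field))]

records = [
    {"customer_id": 1, "email": "a@example.com", "country": "DE"},
    {"customer_id": -3, "email": "no-at-sign", "country": "DE"},
]
violations = [check_record(r) for r in records]
print(violations)  # [[], ['customer_id', 'email']]
```

Real data-quality tooling adds scheduling, reporting, and escalation on top, but the core of the steward's checks is exactly this: explicit rules per data area, applied systematically to every record.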

  1. The terms around big data
    Big data - what is it actually? Everyone talks about it, and everyone understands something different by it. Click through our glossary of the most important and most frequently used terms (some would say "buzzwords") to understand exactly what is meant by each of them.

    Compiled by Kriemhilde Klippstätter, freelance author and coach (SE) in Munich
  2. Ad targeting
    Trying to attract the potential customer's attention, mostly through "tailor-made" advertising.
  3. Algorithm
    A mathematical procedure, implemented in software, with which a data set is analyzed.
  4. Analytics
    With the help of software-based algorithms and statistical methods, data is interpreted. This requires an analytical platform that consists of software or software plus hardware and that provides the tools and computing power to be able to carry out various analytical queries. There are a number of different forms and uses, which are described in more detail in this glossary.
  5. Automatic Identification and Capture (AIDC)
    Any method of automatic identification and data collection about a given condition and subsequent storage in a computer system. For example, the information from an RFID chip that a scanner reads out.
  6. Behavioral Analytics
    Behavioral analytics uses information about human behavior to understand intentions and predict future behavior.
  7. Business Intelligence (BI)
    The general term for the identification, sourcing, and analysis of data.
  8. Call Detail Record (CDR) analysis
    A call detail record contains the data that telecommunications companies collect about mobile phone calls, such as the time and duration of the calls; CDR analysis evaluates this data.
  9. Cassandra
    A distributed database management system for very large structured databases (“NoSQL” database system) on an open source basis (Apache).
  10. Clickstream Analytics
    Describes the analysis of a user's web activities by evaluating their clicks on a website.
  11. Competitive monitoring
    Tables that automatically store the activities of the competition on the web.
  12. Complex Event Processing (CEP)
    A process in which all activities in an organization's systems are monitored and analyzed so that, if necessary, action can be taken immediately in real time.
  13. Data aggregation
    The gathering of data from different sources for the preparation of a report or for analysis.
  14. Data analytics
    Software that is used to extract information from a data set. The result can be a report, a status, or an action that is triggered automatically.
  15. Data Architecture and Design
    Explains how company data is structured. This usually takes place in three process steps: Conceptual mapping of the business units, logical mapping of the relationships within the business unit and the physical construction of a system that supports the activities.
  16. Data exhaust
    The data that a person generates "on the side" during their Internet activity.
  17. Data virtualization
    The process of abstraction of various data sources through a single layer of access to the data.
  18. Distributed Object
    A piece of software that allows you to work with distributed objects on another computer.
  19. De-identification
    The removal of all data that associates a person with specific information.
  20. Distributed processing
    The execution of a process across different networked computers.
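
Two of the glossary entries above, data aggregation and distributed processing, can be illustrated together with a small Python sketch. A thread pool on a single machine stands in for the networked computers of a real distributed system; the data and chunk size are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # Work that each node would perform independently on its partition.
    return sum(chunk)

data = list(range(1, 101))                                  # the full data set
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # four partitions
# In a real distributed system each chunk would be processed on a
# different networked computer; here worker threads stand in for them.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, chunks))
total = sum(partials)  # final aggregation of the partial results
print(partials, total)  # [325, 950, 1575, 2200] 5050
```

The pattern - partition the data, process the partitions independently, then aggregate the partial results - is the same one that frameworks such as Hadoop apply across entire clusters.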