Today, there are 6,500 people on LinkedIn who call themselves data engineers, according to stitchdata.com. In San Francisco alone, there are 6,600 job listings for this same title. The number of data engineers has doubled in the past year, but engineering leaders still find themselves faced with a significant shortage of data engineering talent. So is it really the future of data warehousing? What is data engineering? These questions and many more I want to answer in this blog post.

Data engineers are mostly found in companies like Facebook, Google and Apple, where data is the fuel of the company, mostly in America. In Europe, the job title barely exists outside unicorn startup meccas like Berlin and Munich; the work is instead folded into jobs like software engineer, big data engineer, business analyst, data analyst, data scientist and also business intelligence engineer. I myself started as a business intelligence engineer and spent more and more time on the engineering rather than the business part, which is why I start this blog from the data warehousing angle.

What is a Data Warehouse?

What is data warehousing, what is a business intelligence engineer doing, and why do they use a data warehouse? To use the analogy of a physical retail warehouse: you want to sell very structured products in the most efficient way to your customers. In a data warehouse (DWH) you typically have structured data, optimised for business users to query. If you dig a little deeper: in a physical shop you offload goods from the trucks in the back, where they get sorted and structured before they reach the shelves for the customers to buy. In a DWH you do basically the same, just with data. As you see in the DWH architecture below, the offloading area in the back of the store is your stage area, where you store the source data from your operational systems or external sources.

A traditional data warehouse architecture (source: Wikipedia)

The physical sales floor where the customers buy the articles is, in a DWH, normally the so-called data mart. The data processing between each layer seen in the architecture above is called ETL (Extract Transform Load). This is not to be confused with ELT (Extract Load Transform), which is the common methodology for data lakes (more in my recent post). In a DWH you always transform to get data as clean and structured as possible.

Why have a Data Warehouse?

Besides the obvious reasons from the shop analogy above, a data warehouse gives you big advantages.

In theory: integration and transformation of raw data from an organisation's multiple sources (mostly very structured, like SAP, CRM, Excel, etc.) into meaningful, useful and historically stored information.

In practice: similar to a cockpit in an aeroplane, all measures and KPIs are in one place in order to steer the plane and take the right decisions.

+ It allows businesses to make better decisions by accessing well-structured data
+ It helps to make informed decisions based on facts
+ It improves the efficiency of business operations
+ It enhances the quality of customer service

The visual products from business intelligence built on top of a data warehouse are largely:

+ Ad hoc reporting
+ Standard reporting
+ Dashboards / cockpits

What is Data Engineering?

Data engineering is the less famous sibling of data science. Data science is growing like there is no tomorrow, and so is data engineering, though it is heard of much less. Compared to existing roles, it would be a software engineer plus business intelligence engineer, including big data abilities such as the Hadoop ecosystem, streaming and computation at scale. Business creates more reporting artefacts themselves, but with more data that needs to be collected, cleaned and updated near real-time, complexity is expanding every day. With that said, more programmatic skills are needed, similar to software engineering.
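To make the ETL idea above concrete, here is a minimal sketch in Python: it extracts raw rows, cleans them in a transform step, and loads them into SQLite, which stands in for the warehouse. The table, columns and values are all invented for illustration.

```python
import sqlite3

# Hypothetical raw rows from an operational source system
source_rows = [
    {"order_id": 1, "amount": "19.25", "country": "ch"},
    {"order_id": 2, "amount": "5.00",  "country": "DE"},
    {"order_id": 3, "amount": "12.50", "country": "ch"},
]

def extract():
    """Extract: read the raw records (here simply an in-memory list)."""
    return source_rows

def transform(rows):
    """Transform: fix types and normalise values before loading."""
    return [
        (row["order_id"], float(row["amount"]), row["country"].upper())
        for row in rows
    ]

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # SQLite stands in for the DWH
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 36.75
```

A real pipeline would pull from operational databases or APIs into a stage area first, but the extract-transform-load shape stays the same.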
The emerging language at the moment is Python (more in the chapter below), used in engineering with tools like Apache Airflow as well as in data science with its powerful libraries. Today, as a BI-engineer, you use SQL for almost everything, except when consuming external data from an FTP-server, for example, where you would use bash or PowerShell in the nightly batch jobs. But this is no longer sufficient, and because it becomes a full-time job to develop and maintain all these requirements and rules (called pipelines), the data engineer is needed.

The role of a data engineer

In order to get high-quality and frequently updated data sets, it is important to distinguish between the data pipelines that are done and cleaned by data engineers and all the others that are mostly exploratory. At Airbus we use a folder called "cleaned", and all data sets produced there are constantly updated, documented and of the highest quality. Based on these data sets you create your own. We use the data lake solution Palantir Foundry (brand name at Airbus: Skywise), which provides you with a map where you easily see the data lineage. Documentation and metadata for each data set are crucial, as otherwise you lose the overview of your data; keeping that overview is also one main task of a data engineer.

Services that a data engineer provides

Another important service that a data engineer provides is the automation of tasks that data scientists or data analysts would otherwise do manually.
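A small, hypothetical example of automating such manual work: instead of an analyst eyeballing yesterday's row counts every morning, a script can flag load days whose volume deviates strongly from the norm. The numbers and the sigma threshold below are purely illustrative.

```python
import statistics

def find_anomalies(daily_counts, threshold=2.0):
    """Flag days whose row count deviates more than `threshold`
    standard deviations from the mean across all days."""
    counts = list(daily_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.stdev(counts)
    return [
        day for day, count in daily_counts.items()
        if abs(count - mean) > threshold * stdev
    ]

# Hypothetical row counts per load day; the last load failed silently
counts = {
    "2018-03-01": 1000, "2018-03-02": 1020, "2018-03-03": 980,
    "2018-03-04": 1010, "2018-03-05": 990,  "2018-03-06": 0,
}
print(find_anomalies(counts))  # ['2018-03-06']
```

In practice such a check would run on a schedule and send an alert (mail, chat, pager) instead of printing.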
A good overview of the tasks this includes is provided by Maxime Beauchemin, the creator of Apache Airflow, a tool that helps a data engineer lift the majority of the tasks mentioned:

"
+ data ingestion: services and tooling around 'scraping' databases, loading logs, fetching data from external stores or APIs
+ metric computation: frameworks to compute and summarise engagement, growth or segmentation related metrics
+ anomaly detection: automating data consumption to alert people when anomalous events occur or when trends are changing significantly
+ metadata management: tooling around allowing generation and consumption of metadata, making it easy to find information in and around the data warehouse
+ experimentation: A/B testing and experimentation frameworks are often a critical piece of a company's analytics, with a significant data engineering component to it
+ instrumentation: analytics starts with logging events and attributes related to those events; data engineers have vested interests in making sure that high-quality data is captured upstream
+ sessionization: pipelines that are specialized in understanding series of actions in time, allowing analysts to understand user behaviours

While the nature of the workflows that can be automated differs depending on the environment, the need to automate them is common across the board."

By Maxime Beauchemin

When is a data engineer needed?

I believe that not every company is in need of data engineers. Their skills are mostly required if the company either:

+ already has a product that is fully web-based and therefore data-driven, or
+ has the need or desire to analyse lots of data (volume) from all kinds of sources (variety) and fast (velocity) to get the insights (value and veracity) (see more about the five V's of Big Data)

Data engineer job description and skills

If English is the language of business, SQL is the language of data and Python the language of engineering.
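To give a taste of that SQL, the sketch below runs an aggregate and a window function through Python's built-in sqlite3 module (window functions require SQLite 3.25 or newer; the sales table is invented for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, revenue REAL);
    INSERT INTO sales VALUES
      ('EU', '2018-01', 100), ('EU', '2018-02', 150),
      ('US', '2018-01', 200), ('US', '2018-02', 120);
""")

# Aggregate function: total revenue per region
totals = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EU', 250.0), ('US', 320.0)]

# Window function: running total per region, keeping one row per month
running = conn.execute("""
    SELECT region, month,
           SUM(revenue) OVER (PARTITION BY region ORDER BY month)
               AS running_total
    FROM sales ORDER BY region, month
""").fetchall()
print(running)
```

Unlike the GROUP BY, the window function keeps every input row, which is exactly what you want for running totals, rankings or period-over-period comparisons.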
While technologies often disappear, SQL is still here. This means you need a reliable understanding of:

+ SQL to a high level of complexity, e.g. using window and aggregate functions, changing the structure of a database with DDL or changing the data with DML, fine-tuning queries or reading database execution plans, and knowing how indices work
+ Data modelling techniques: ERDs and dimensional modelling
+ Solid ETL understanding
+ Architectural projections: a high-level understanding of most of the tools, platforms, libraries and other resources at one's disposal, and the ability to connect the dots from source to destination with whatever programming language does the job best (probably Python at the moment)

Stitchdata.com anticipated that as company size increased, so would the focus on scaling-related skills. However, that's not the story the data told. Instead, data engineers at larger companies tend to be more focused on "enterprise" skills like ETL, BI, and data warehousing, whereas data engineers at smaller companies focus more on core technologies.

The tool language, Python

The growth of major programming languages

Programming languages have always come and gone, but in the last couple of years Python has risen to the top in popularity. The question is why. One valid reason for sure is the rise of the data engineer, but also the use of its libraries for data science and data analytics. According to Codecademy and their source data from Stack Overflow, it's connected to the rise of data science. This and machine learning were the biggest trends in tech in 2017. Additionally, Python has become a go-to language for data analysis. With data-focused libraries like pandas, NumPy, and matplotlib, anyone familiar with Python's syntax and rules can deploy it as a powerful tool to process, manipulate, and visualise data.

Related to the rise of data science and data engineering, it's clear to me that Python is here to stay and is becoming the Swiss Army Knife of programming languages.

Python for data engineers

But what can you use Python for in data engineering? For example, you use it for data wrangling (reshaping, aggregating, joining disparate sources, etc.), which is mostly done with the library pandas; for small-scale ETL; for API interaction (our presentation layer usually happens in Tableau, which has Python APIs); and for automation with Apache Airflow, which is also written natively in Python.

"Apache Airflow has several building blocks that allow Data Engineers to easily piece together pipelines to and from different sources. Because it is written in Python, Data Engineers find it easy to create ETL pipelines by just extending classes of Airflow's DAG and Operator objects. And this allows us to write our own Python code to create any ETL we wish, with the structure given by Airflow. Airflow uses several packages already mentioned to do the job: boto for S3 handling, pandas for obvious advantages with data frames, psycopg2 for popular integrations with Postgres and Redshift, and several more." Said by David Dalisay.

Facebook data engineer

According to a job description for a data engineer at Facebook in Menlo Park or Seattle, the candidate needs to have the following qualifications and responsibilities.

Minimum Qualifications

+ 2+ years of Java and/or Python development experience
+ 2+ years of SQL (Oracle, Vertica, Hive, etc.) experience
+ 2+ years of LAMP stack development experience
+ 2+ years of experience in custom or structured (i.e. Informatica/Talend/Pentaho) ETL design, implementation and maintenance
+ Experience working with either a MapReduce or an MPP system on any size/scale
Responsibilities

+ Build data expertise and own data quality for the awesome pipelines you build
+ Architect, build and launch new data models that provide intuitive analytics to your customers
+ Design, build and launch extremely efficient and reliable data pipelines to move data (both large and small amounts) to our ridiculously large data warehouse
+ Design and develop new systems and tools to enable folks to consume and understand data faster
+ Use your expert coding skills across a number of languages such as Python, Java and PHP
+ Develop applications within the LAMP stack environment
+ Work across multiple teams in high-visibility roles and own the solution end-to-end

A picture of such a persona could look like this (source: quora.com):

Data Engineer Salary

The salary of a data engineer is hard to pin down, as the role is very new, especially in Switzerland, where the following salary report is from. But in the following table you see all the jobs that are related or close by (unfortunately only in German, sorry for that) and their salary for a full year in Swiss Francs (CHF), created by Robert Half.ch:

Business Intelligence Engineer vs Data Engineer

As already mentioned further up, the data engineer has the skill set of a business intelligence engineer plus solid programming and big data skills. I believe BI-engineers will transition to data engineers anyway, depending on the size of the company. But why? What is changing or has changed?

What has changed

As the power of computers and especially the speed of the internet grows, more data can be collected and needs to be analysed. Therefore many parameters around the data warehouse environment have changed or will change.
Below are the points that I see having the most influence:

+ Computers get faster, and reports can be built directly on source data; no persistent layer is needed for performance optimisation, a semantic layer is enough
+ Business does not have the time to wait for ETL to finish during the night; data needs to show up immediately (almost real-time)
+ Keeping up with customer demand through new BI deployments; older BI deployments cannot keep pace with success
+ "The main factors that drive development and deployment of new data warehouses are being agile, leveraging the cloud and the next generation of data (as it relates to real-time data, streaming data and data from IoT devices)", as Brian E. Thomas says at cio.com (data warehouse automation can help here; read more in my recent posts)
+ Structure needs to be put in place after you collect the data (ELT instead of ETL)
+ Easy collaboration and exchange of data is needed in order to have the full overview of the data, its timeframe and quality
+ Governance is getting more important as more data is stored
+ New trends, tools, technologies and competitors are arising all the time; companies must stay on top of that

ETL is changing

Furthermore, the way we do ETL is changing as well, as Maxime Beauchemin, data engineer at Airbnb, puts it: "Product know-how on platforms like Informatica, IBM Datastage, Cognos, AbInitio or Microsoft SSIS isn't common amongst modern data engineers, and is being replaced by more generic software engineering skills along with understanding of programmatic or configuration driven platforms like Airflow, Oozie, Azkaban or Luigi. It's also fairly common for engineers to develop and manage their own job orchestrator/scheduler." He is also saying that code is the best abstraction there is for software, rather than using drag-and-drop ETL tools. Most important, as I see it as well: transformation logic is of higher value and shouldn't be locked away exclusively for BI developers.

Data Modelling is changing

As you can't change ETL without modelling differently, this is changing too:

+ Maintaining surrogate keys for performance gains is mostly compensated by faster database engines or cloud solutions. Furthermore, denormalisation in dimensions can be tricky and not human-friendly, as we prefer business keys.
+ With the popularity of document storage and cheap blobs in cloud storage, it is becoming easier to create and develop schemas dynamically without writing DDL statements.
+ Systematically snapshotting dimensions is a way to simplify change tracking in a DWH, compared to handling complex and maybe counter-intuitive slowly changing dimensions (SCD). It is also easy and relatively cheap to denormalise a dimension's attributes directly on the fact table, to keep important information at the moment of the transaction.
+ Conformed dimensions and conformance in general are extremely important in today's data warehouse and data environments. But to be more collaborative and work on the same objects, it is a necessary trade-off to loosen this up.
+ Not only are more people working on the same project within data warehousing, but more people from business and other departments are getting more data-savvy than ever before. In that sense, data needs to get more real-time; rather than batch processing and precomputed calculations, this can be done ad hoc with new, fast technologies like Spark that run complex jobs ad hoc and on-demand.

Education is changing

Facebook, Airbnb and other companies are taking it a step further with so-called "Data Camps" or "Data Universities" to educate internal employees about data and make them more data-savvy.

Conclusion

So after all, is the data engineer the new business intelligence engineer? I would say, in the long run, yes.
I imagine that data warehouses, in one form or another, will always be needed by the business, as the place where data is fully structured and easily accessible. But how we build DWHs, or something similar to them, will change, and therefore more engineering, and more data engineers, are needed.

Originally published at www.sspaeti.com on March 8, 2018.