An analysis of working agile with cloud-based microservices – Part 4


This post, “How Can We Adjust to Working Agile with Cloud-based Microservices in CI/CD”, is my software testing degree project, written during the final course of a software test engineer education. Not all information from the thesis is included, and some text has been changed. The original document is available as a PDF: Link to pdf here. This is part 4 of 5.

I recommend reading the first posts in this series before this one, since it builds upon the findings and conclusions of the previous posts.

Reasons and thoughts

First, I want to explain the thoughts and reasons behind why this research diverged from my original plan.

In my analysis I came to a crossroads. With a background as a developer, I have worked with cloud-based systems and microservices, and agile teams are familiar territory. Adding years in technology professions such as mobile devices, TV and other media, together with an educational background that includes assistive devices, most topics I researched were familiar at some level. The lack of knowledge of this topic among developers was also one of the things that got me to start researching it before I decided to write this degree project.

I could not ignore what I found, even if I wanted to; it didn't feel right. The research shows a somewhat unexpected analysis result that leads the thesis into the theme of organizational structure alongside system structure. It therefore became a higher-level research project with only the basics of experimental research.

Path 1 – Technical detail level:
  • Focus on test techniques
  • Experimental research on a full-size application

Path 2 – System and organization:
  • Combine found themes
  • Higher-level research
  • Experimental research on selected test techniques

Analysis

The analysis of the empirical data is categorized based upon the different research questions. The data are listed in sub-categories to give a controlled overview of each theme and to more easily show how the data relate to the main research question.

Research question 1, Difficulties

What are the main problems companies are facing today within microservice testing when working agile in a cloud-based environment?

Perspective & Focus on problem areas

Here I divided the topic into two sections to more easily define a pattern in the analysis. The first section is an analysis of the participants' focus on and perspective of problem areas.

The second covers the opinions they had about problems faced when working agile with cloud-based microservices.

Focus & Perspectives:

In this topic most answers focused on non-technical subjects. This study was made with the decision to contact people who are working or have worked as test leads. The type of test lead or test field was less important; what mattered was that the experience was within agile and microservices. This influenced their answers, with a somewhat clear difference. Participants who mostly work within manual testing answered only within the area of the team's work environment. One example is a participant who clearly states this: “the problem in any environment is not about access or running tests, it is about access to the information”.

In these answers there is also a focus on test planning more than test execution; documentation and the goal of the testing were mentioned. Some mentioned short-staffed teams with too high a workload.

One respondent has worked both in larger companies and as an educator on the topic of cyber security testing. Their opinion on testing was that understanding is the most important focus. They stated it like this:

“To take time to go through every step of the system, gaining all knowledge needed before going into the topic of testing, is the way we can make an application safe”.

The participants with a more technical profession, some with a background in system development, mentioned more technical focus areas in the discussions. Frameworks, parts of the development pipeline, and types of applications were discussed more.

Tools within different areas were another focus: testing, monitoring and management. Participants of all roles and backgrounds focused on this topic at some point during the discussion.

Problem areas:

This question is one of the main questions of this thesis, along with best practices, so it was important to make the conversations as in-depth as possible. This also gave results that varied on various levels, but two themes were repeated and stood out: overview and communication. These two are covered in more depth in the sections below. End-to-end testing was pointed out as a problem in every single discussion. In my own research, end-to-end testing was also repeatedly found to be a complex problem when testing microservices. In the research where monolithic applications were presented, end-to-end testing seemed to be a more reliable solution. On this topic I asked follow-up questions about why end-to-end testing was used and whether its reliability was questioned in their development process. The answers were basically the same: convenience and a “do as we have always done” mindset in management.

There were also answers that went into the details of security testing. I will not include those answers in this thesis, with the argument that security testing is a topic of its own. It is a complex area, and I felt that it wouldn't do the topic justice to research it within the time limit.

One participant mentioned that the most significant problem in their view was the differences between test and production environments. The clarification here was that we cannot predict all services' behavior until they are up and running in production.

Knowledge outside of the team

This topic runs as a connecting thread through all discussions, regardless of knowledge or profession. But the participants give different viewpoints on how and why. These are some I could find in my analysis of their answers:

  • Agile teams decide on tools, frameworks etc. within the team and do not communicate their decisions outside of the team.
  • Developer teams do not have testers in the team, which causes late communication about when and what to test (for example: what has been updated, is regression testing needed).
  • Documentation is non-existent.
  • Goals and priorities might not be clear for all teams, causing an unmotivating work environment.
  • Developers and testers do not share the same perspective on what is important: fast deployment or tested reliability.
  • The difficulty for test leads to get the full picture of the application, and to collect all information needed for tracing, observability and monitoring.

System Complexity

It is a known fact that large applications with microservices, containers, IaaS, PaaS, FaaS, native development and more form a complex architecture. The answers mentioned this in a couple of different ways; I tried to identify and connect the ones that highlighted the same problem area.

Knowing that we have covered the most critical areas and not missed anything significant was mentioned both in the end-to-end testing discussions and in the discussion of how many different test techniques we need. Here the opinions differ a bit; in my analysis I can't pinpoint exactly why, since there is no clear pattern connecting these opinions. This will be explained in section 6.2, Research question 2, Approaches, below. Both in my own research and in answers from participants there are arguments that too little test automation is done (without automation we cannot achieve acceptable test coverage) and arguments that too much test automation is done (that automation is seen as the ultimate answer for cloud and service testing, not as a tool).

Testing too much with no clear test strategy was the answer from one participant; the explanation of the statement was that test strategies tend to be high-level. This causes them to be interpreted as if as many different test methods and techniques as possible are needed to cover the testing areas.

A lack of knowledge of frameworks, coding and cloud architecture among testers, causing problems in understanding where in the development process to start testing and in deciding on the correct type and technique for specific areas, was mentioned by a participant with a role within DevOps and Scrum.

Performance and priorities: here I combine several topics that were mentioned, as my analysis says that all these answers in the end pointed to the same things. There was the risk of performance issues if too much testing is included, both before deploying and in production. Also mentioned was prioritizing the setup of a full monitoring architecture to reach full observability, instead of logging sprawl. Logging sprawl is defined as logging included in the code without any documentation. (A side note: I assume that by documentation this refers to a diagram, application map or architecture connection.)

Research question 2, Approaches

What could be a best practice approach to reach most effective testability and test coverage for a microservice environment?

Shift Left

The shift-left practice is the approach mentioned most often as a best practice; there is agreement that we need to test early in the development process. One participant mentioned this as the best practice to avoid bottlenecks in both the test process and in bug fixes. Testing early gives an overview of critical areas and makes it easier to set up a low-level test plan. It was also mentioned in the topic of shifting from waterfall to agile: implementing shift-left testing makes it easier to get the teams to work side by side in the development pipeline. From a security perspective, one participant mentioned shifting left and integrating with the DevOps team, dropping tools into the pipeline for a better test flow.
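To make the idea concrete, here is a minimal sketch of what "shifting left" looks like at the code level: a fast, isolated unit-level check that can run on every commit, long before any deployment step. All function and test names are hypothetical, invented for illustration.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical service function: apply a percentage discount to a price."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount():
    # Runs in milliseconds in an early pipeline stage, catching errors
    # before the code ever reaches a shared test or production environment.
    assert apply_discount(100.0, 25) == 75.0
    assert apply_discount(19.99, 0) == 19.99
    try:
        apply_discount(10.0, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for invalid percent")


if __name__ == "__main__":
    test_apply_discount()
    print("shift-left checks passed")
```

In a CI/CD setup, a test like this would typically run in the earliest pipeline stage, so a failing check blocks the change minutes after the commit rather than days later in an integrated environment.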

Low-level Test

Testing often and focusing on the services' functions is another approach mentioned both by participants and in most research material I came across. There is a difference in the how, depending upon when the material was written; with new architecture ideas, more knowledge, and added resources, the opinions have changed.

In my analysis I followed up on two well-known companies that I am familiar with and looked at their test approaches from around 2014–2016 and today. Company 1, 2014: First, I want to mention that this company has testing as a high priority, and in my analysis the conclusion is that their discussions of the development process focus more on testing than on coding in most material I read. They had a structure of unit tests, integration tests and end-to-end integrated tests, combined with logging and monitoring. Today they still use unit tests, but with caution, explaining that unit tests limit code changes and are time-consuming. End-to-end testing is something they try to avoid completely. The approach they have today is:

  • Cross-functional teams, with test expertise areas
  • Unit tests, but not seen as testing; it is quality checking
  • Implementation testing only on isolated, complex parts of the application
  • Integration testing
  • Contract testing, consumer-driven
  • Testing in production, internal testing
  • Monitoring
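The consumer-driven contract testing in the list above can be sketched in a few lines. This is a hand-rolled illustration of the principle, not a specific tool such as Pact: the consumer publishes the response shape it relies on, and the provider verifies its own responses against that contract in its test suite. All names are hypothetical.

```python
# Contract written from the consumer's point of view: only the fields the
# consumer actually uses, with their expected types.
ORDER_CONTRACT = {
    "order_id": str,
    "status": str,
    "total_cents": int,
}


def provider_get_order(order_id: str) -> dict:
    """Hypothetical provider endpoint handler, returning a canned response."""
    return {"order_id": order_id, "status": "shipped", "total_cents": 2499,
            "internal_flag": True}  # extra fields are allowed by the contract


def verify_contract(response: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


# Provider-side verification: a breaking change (renaming or retyping a field
# the consumer depends on) fails here, without any end-to-end environment.
violations = verify_contract(provider_get_order("ord-1"), ORDER_CONTRACT)
assert violations == [], violations
```

The design point is that each service pair can be verified in isolation, which is one reason this style is often proposed as a replacement for broad end-to-end suites.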

Company 2, 2013: Where company 1 had testing as a high priority, company 2 is the opposite. They leave a lot to the developers, making the developers responsible for code quality and reliability. Therefore, they also put a lot of focus on unit testing and, secondly, automated testing. Test automation is seen as testing and is most used. From the automation they rely upon logs to trace issues. Today they have the approach:

  • Developer code-ownership
  • Unit tests
  • Test automation; automate all the tests
  • Testing in production, with dummy data / test doubles
  • Beta testing

Test in production

From shifting left we get test coverage of functions and of communication between services, but we still do not know how our system will behave in production. Here there is a clear consensus among the participants and in my research, where I couldn't find many differences. Testing in production is a high priority; my analysis shows that it is mentioned just as much as test automation in the deployment pipeline as it is in discussions of testing in a test environment.

A side theme spanning the three sections above felt worth mentioning in the context of best practice, instead of or together with end-to-end testing. While the results and analysis show end-to-end testing to be flaky and a problem, load testing is mentioned just as often. Load testing is mentioned as an obviously needed test, both in integration testing and end-to-end. Both load and stress testing will give us answers about the application's overall behavior and critical areas, both within the services and in the communication between them.
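A minimal load-test sketch, using only the Python standard library: fire concurrent requests at a target and summarize the latency. In a real setup the target would be an HTTP call against the service; here a simulated handler stands in so the sketch is self-contained, and all names are illustrative.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_endpoint() -> None:
    time.sleep(0.01)  # stand-in for service work / network latency


def run_load(target, total_requests: int, concurrency: int) -> dict:
    """Run total_requests calls against target with the given concurrency."""
    latencies = []

    def one_call():
        start = time.perf_counter()
        target()
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(total_requests):
            pool.submit(one_call)
        # leaving the with-block waits for all submitted calls to finish

    ordered = sorted(latencies)
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": ordered[int(len(ordered) * 0.95) - 1],
    }


result = run_load(simulated_endpoint, total_requests=50, concurrency=10)
print(result)
```

Dedicated tools (Locust, k6, JMeter and similar) add ramp-up profiles, distributed workers and reporting, but the core loop is the same: concurrent calls plus latency percentiles.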

Monitoring, Observability

In this section I collect all types of monitoring (logging, synthetic monitoring, tracing) for an easier overview.

Monitoring of some sort is mentioned in every discussion and in every research material on microservice architecture. The least mentioned is logging; I tried to find a theme of why but couldn't find any specifics. If we look at some answers from the participants, logging is mentioned: one wants to use logging data for testing, another to measure data to find potential issues. Therefore, I connect this with monitoring more than with basic logging. The reason for this is one participant's answer that they do have lots of logs but no real structure for monitoring what and why, and that the data is difficult to use for more than specific, already-surfaced issues, and then it mostly serves the developers.

The research for the best approach lands on MELT:

  • Metrics
  • Events
  • Logs
  • Traces  
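As a sketch of what the four MELT signal types look like in practice, here is a minimal example emitting each as a structured JSON record, using only the standard library. Real systems would ship these to an observability backend (for example via OpenTelemetry); the record shape and field names here are illustrative assumptions, not a standard schema.

```python
import json
import time
import uuid


def emit(signal_type: str, payload: dict) -> str:
    """Serialize one observability record as a JSON line."""
    record = {"type": signal_type, "ts": time.time(), **payload}
    return json.dumps(record)


trace_id = uuid.uuid4().hex  # shared id links the log and the trace span

lines = [
    # Metric: a numeric measurement sampled over time.
    emit("metric", {"name": "checkout.latency_ms", "value": 182}),
    # Event: a discrete, significant occurrence.
    emit("event", {"name": "deploy.finished", "service": "checkout"}),
    # Log: a detailed record of a single happening.
    emit("log", {"level": "error", "msg": "payment declined", "trace_id": trace_id}),
    # Trace: spans tied together by a shared trace id across services.
    emit("trace", {"span": "charge-card", "trace_id": trace_id, "duration_ms": 97}),
]

for line in lines:
    print(line)
```

The shared `trace_id` is the detail that turns separate logs into observability: it lets a single request be followed across service boundaries, which is exactly the "full picture" the participants said was hard to get.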

Communication

This topic has become a critical part of this research; participants mention communication repeatedly. I also had the opportunity to do this research at a time when I had closer insight into a company, where I could both study and ask questions about their internal architecture, within both the organization and the system. This gave me data to study some areas more closely and compare with other parts of my research. First, I want to mention that this company has a well-covered communication approach; it was noticeably clear that they put a lot of planning into this. Employees are onboarded with information on how to communicate with all areas of the organization, using several different channels like Slack, video meetings, and daily, weekly, and monthly updates within and outside of the teams. Test engineers work in the teams but also have their own meetups. In my analysis this gives the reliability of knowing that testing solves issues in the application and that developers are quickly informed of them. However, communication is lacking in areas where it affects the test engineers. My conclusion is that all professions have different priorities, knowledge and understanding.

As one participant mentioned, in any environment, not just agile, developers' understanding and data availability are not enough.

A finding in this topic is that different test engineers have different perspectives, and this seems to cause misunderstandings between testers. Testing is seen as one group, working towards the same goals. The differing perspectives seem to depend on the testers' backgrounds and focus areas, but they are not discussed among the testers, therefore causing misunderstandings.

Organizational structure

I will start this section with two insights into one company's problems and changes in their organizational structure:

“… fragmented ecosystem of developer tooling where the only way to find out how to do something was to ask your colleague. ‘Rumor-driven development’, we endearingly called it” (Spotify, 2020)

On the same topic, within a new client-side architecture build, they made changes by running testing and development in parallel, adding more people to the test groups as the feature they were working on became more stable. They went from 30 employees to 1000 in four months, and they state that this gave them control over the testing process. (Spotify, 2020.1)

The text above summarizes my analysis of this topic rather well. It shows that no matter how many tools and how much automation, expertise, IaaS or PaaS we use, if there isn't a well-working structure with enough resources, it becomes an issue. I draw a parallel to "putting out fires" in development, where we tend to fix an issue without thinking ahead. In the end the codebase becomes fragile, costly, and time-consuming to fix.

From the participants' answers, there is a shortage of test engineers within the organizations, which might cause critical areas in testing a complex system to be overlooked. A best practice is to invest in the resources and make sure testing is a priority, with good test plans, strategy, and knowledge coverage. From the themes above we can also see that expertise in different areas is needed: a test engineer team with organizational knowledge, development knowledge, and manual and regression knowledge, plus specific specialist areas such as security, platform, accessibility and more.

Research question 3, Priority

What are the impressions of the priority of testing microservices/container applications among test leads?

The analysis of this question followed a somewhat connected thread through all answers; the majority started their answer to this question with: “No, …”. The clarifications that followed are divided into the theme sections below for easier summarizing.

We and them

There is a clear opinion of a division between test engineers and developers, and between test engineers and management. For developers and testers, the cause is mentioned to be that developers do not provide enough information and have too little knowledge and understanding of testing. There was also a difference depending on the test engineer's role in these answers: manual testers with a more theoretical background focused their answers on developers' tendency to give overly technical information instead of documentation. The opposite held for the participants with roles and backgrounds in development, security, or test automation.

Regarding management and test engineers, it was the down-prioritization of testing in favor of production that was mentioned: more importance placed on building and releasing new features than on application performance.

Continuous Deployment

This theme continues from the theme above, we and them. I will use one participant's statement for explanation: the “push to prod” trap. This was said in other ways by other participants: it is easy to overlook testing minor changes, since they are just slight changes. If test engineers are left out of these minor changes, it becomes a bottleneck in the end when issues are detected later in production. Here I also connect management's decisions on new features and fast changes with the opinion that test engineers do not need to be involved unless there are issues.

Test or Tool opinions

Two things are connected to this theme in my analysis: test automation and unit testing. Unit testing is mostly done by developers, and judging by the answers from the participants there is a problem with the developers' view of it. Developers see unit tests as good enough assurance for the definition of done, while test engineers do not. Another theme found in the analysis is the opinion on whether test automation is testing or just a tool in testing. From the participants' answers, most see test automation as a tool: it can help the testing process but not replace manual testing. (A side note: I use the term manual testing for clarity but am aware that some test engineers do not agree with the term and think it should be called just testing.)

Resource shortage

In my own research I studied opinions about team structure with two main questions: 1. How many test engineers are in each team/in the development? 2. How does that differ from the number of developers for the same workload? The subject was not easy to find information about; as mentioned above, most material about cloud-based systems and microservices is aimed at development, or at management in agile. Testing and test engineers are mentioned on the sidelines or grouped together as "test lead" or "test engineers" in the descriptions and diagrams seen, while developers are divided into DevOps, backend, front-end, Android, iOS and more, and management is mentioned in specific roles, like financial, operations manager, domain expert, or senior management.

As for the main questions, exact numbers were not possible, since it varies depending on the company and on the size of the system and application. However, there was a pattern of one test engineer per team/service. This did not seem to change with larger systems or with deployment frequency. Where there was a need for more resources, it was less likely that more test engineers were added to a team than extra developers. This I connect to the push-to-prod theme above.

This is all for this post. Next part will be the final part with conclusions.

Links to previous posts in this series can be found here:
Link to part 1
Link to part 2
Link to part 3

Link to the next and final part can be found here:
Link to part 5


