Asset Reliability Management: From General to Specific

Risk-Oriented Operation and Maintenance Management: Premises, Initial Results and Recommendations

Russian industrial companies are increasingly interested in risk-oriented, individual approaches to the operation and maintenance (O&M) of technological equipment in place of standard periodic measures. Such a transition changes processes: the company has to consider more risk factors and choose from a variety of O&M strategies. This inevitably complicates repair decisions at all levels. How justified is it for each company to take a unique path and revise its approach to every piece of equipment or technological system? This article reflects on these swings of the "seesaw" between the universal and the individual.

Premises for the Transition from Planned Preventive Maintenance to a Risk-Oriented Approach

The growing interest of companies in a risk-oriented approach and in organized equipment reliability management is evident to anyone involved in managing production assets. Here are a few of our observations:

  • Key terms from the field of production asset reliability management are spreading and becoming familiar.
  • The number of exhibitions, seminars, and training courses on reliability management is growing rapidly, and there are more consultants and specialists in the field.
  • Successful projects are announced regularly: both regulatory and methodological projects and comprehensive ones that combine information technology with business process changes. Such projects are promoted not only by contractors but also by the customers themselves.
  • The number of players developing and implementing information systems for reliability management is increasing. Interestingly, the developers include companies for which software development is not the core business.

In our opinion, the reasons for the high interest in reliability management are not new, but the problems underlying it have become critical enough to cause significant changes in the market.

Market participants have accumulated a critical mass of complaints about the classic organization of repairs, which is driven by regulatory intervals and operating hours (so-called planned preventive maintenance, PPM). The main complaints are:

  • It is expensive and ineffective (failures and damage are still frequent);
  • It ignores the actual technical condition of assets and how that condition changes (we leave unrepaired what needs repair and break what was working);
  • It is not "friendly" with the real risks of equipment failure (repairs saved us from penny losses while we missed the real problem).

Complaints about PPM are exacerbated by internal and external factors that are beyond the control of the business.

Internal factors:
  • Wear-out of assets (sometimes aggravated by a "Stakhanovite" approach to operation) and reduced investment in them;
  • Loss of personnel and their experience (those who remain do not understand how things worked before);
  • Shrinking planning horizons ("we live as if on a volcano!") and a shrinking margin of economic strength (saving everywhere and right now).

External factors:
  • Departure of suppliers (there used to be someone to ask for help);
  • Collapse of the service maintenance concept (we have to take on what was previously handled by suppliers or service contractors);
  • Collapse of supply chains (even when you know what to do and how, you cannot do it because there are no tools or spare parts);
  • Lack of qualified personnel.

As a result, over the past few years almost all asset-intensive industries have seen a deliberate shift from PPM to reliability management within the RCM (reliability-centered maintenance) concept. This methodology for planning the O&M of engineering systems is based on analyzing the possible failures of systems and their elements and the consequences of those failures.

Following this methodology, enterprises have begun to pay more attention to recording failures and analyzing their causes, to assessing the "health" of each specific piece of equipment, its operating conditions, and the associated risks, and to developing individual ways of preventing failures. A need has also emerged to support this process with information systems in order to preserve knowledge and to reduce the negative impact of the human factor and of experienced specialists leaving the company.

In the rest of the article, we try to make sense of the first practical results of such a transition. We do not claim that the study is comprehensive or complete, but we will try to analyze some "local excesses" that we consider characteristic and important.

The Current Practice of the Transition

The first attempts to move from averaged, norm-based O&M management to risk-oriented O&M management were made in Russia many years ago, driven by the interest of individual top managers in experiments rather than by the real needs of enterprises. The trend has since matured, and today these ideas are fully supported "from below", because the theoretical provisions of RCM methodologies meet the real needs of chief mechanics and of operation and repair departments.

Familiar electric motors, pumps, and gearboxes were suddenly joined by complex concepts such as "function", "functional failure", "root cause of failure", and "unmitigated risk matrix". New career paths have emerged and taken hold: "APM manager" and "reliability engineer". The departure of foreign software suppliers in this field, along with the related business practices, in turn triggered a "boom" in import-substituting development.

We will not venture to assess the success of each project, since this is a very individual matter. Instead, we will try to identify what can be called "growth reserves".

In almost all of the transition projects available to us for analysis, we noticed a common logic. Briefly, it looks like this:

  • Reliance on the individual experience of specific employees, even within the same group of companies (GC). Specialists in different branches of a GC often analyze technological systems or assets of identical purpose and characteristics in different ways.
  • Fragmented analysis that does not cover aspects the specialists have not encountered themselves. In other words, only the failure statistics and experience available to a particular specialist or working group are analyzed.
  • Obsession with uniqueness at the expense of standardization. This is a clear overcorrection: we used to do everything by averages, it did not work, so now we do everything uniquely.
  • Isolation of reliability specialists from O&M planners, both in regulatory and reference information and in approaches to O&M within general business planning.

What are the problems with this logic? In our opinion, the following "growth reserves" are hidden here:

1. Inefficient use of reliability specialists' time and knowledge. The same risk factor may be analyzed more than once and interpreted differently each time, which lengthens the analysis and reduces its quality. In the worst case, this unbalances the work of the services that have to implement the resulting recommendations.

2. Narrow horizons, when specialists rely only on their own knowledge, ignoring or simply not knowing the experience of others. The quality of the analysis then depends directly on the experience of a small team rather than on the extensive experience of the market. Such isolation is definitely not in the best interests of the enterprise.

3. Difficulty in seeing problems common to similar equipment, in identifying patterns, causes, and a unified ownership strategy, and in starting to accumulate representative statistics suitable for analysis.

4. The problem of adapting the recommendations of "reliability experts" to the realities of the enterprise. The reliability specialist formulates a recommendation and its economic justification, but in the end it conflicts with production capabilities. The frequency of interventions proposed by the expert is hard to maintain in the real production schedule because other production tasks take priority. As a result, the desired level of reliability will most likely become unattainable for the enterprise.

All this can be called "overregulation": we moved from the general to the specific but got carried away, and now we have to return to a general view in order to approach the specific again with a revised method.

The Main Issue During the Transition

So, we consider "overregulation" to be the key problem of the transition. To reveal what we mean by this term in the context of this article, consider two extreme viewpoints on equipment unreliability: "All similar units of equipment 'get sick' with the same thing at the same frequency" and "Let's monitor technological systems and equipment units individually, as everything is different." In our opinion, the truth, as always, lies somewhere in the middle.

Different systems and equipment have both unique features and common patterns. Moreover, some unreliability factors become noticeable only when many similar systems and equipment units are analyzed together.

Unique Equipment Features:

  • Consequences of failures, as they are determined by the application of the system. Even here there is room for standardization if we are talking about systems with a similar place and method of application.
  • Technical condition of the system elements, history of its operation, repairs and "diseases." Here, uniqueness is associated with the sequence of events in the life cycle of the equipment, but not with the types of "diseases" and their causes.
  • Methods of risk assessment, determined by the geography, ecology, demographics and social aspects of the region where the enterprise is located. However, here too there is scope for regional standardization.

Common Equipment Features:

  • "Diseases" of different types of equipment. The breakdown map depends on the design and on the physical and chemical processes that occur during operation. The description of a "disease" of one pump applies fully to any other pump of the same type, regardless of where it is installed. Different operating conditions affect the estimated frequency and severity of breakdowns far more than they change the map itself.
  • "Medicines" for equipment. The same breakdowns are repaired by the same means. The differences lie in the dosage of interventions and in the duration and cost of work, which depend on the location and surroundings of the equipment. Different ways of accessing the equipment and other technological specifics are usually captured in individual technical cards and repair regulations.

Conclusions and Recommendations for Transitioning to a Risk-Oriented Individual Approach to O&M

  • The trend towards individual strategies does not negate the importance of classifying and typifying equipment. This helps to determine the map of possible malfunctions, identify their causes, and even categorize the consequences and ways to solve them. We call the totality of this information the "reliability model".

The reliability model (RCM model) is an analytical structure consisting of interconnected definitions: function — functional failures — mechanisms and types of failure / causes — consequences (unmitigated risk) — recommendations (actions and mitigated risk) — strategies (packages of actions over time).

  • Any specific problem that has been recognized should be considered in the context of a single reliability model for the entire class of similar objects. Even if the problem has not yet been identified on other objects, it is only a matter of time.

  • The consequences of breakdowns (even identical ones) are individual. We can and should talk about common types of consequences, but the amount of material damage undoubtedly depends on the specific location, conditions, and period of equipment operation. Damage assessment should be addressed at the stage of reliability analysis and/or risk calculation for a specific system or equipment unit. Better still, entrust the assessment to an information system that collects the necessary data from different sources.

  • Methods of "prevention and treatment" are unified for each specific risk factor. Differences relate to the technological specifics of O&M for a specific object and the associated regulations. When determining the repair strategy in a specific case, the "reliability specialist" is obliged, by default, to use the existing reliability model for similar objects, if necessary clarifying the details of the technical card and regulations.

  • It is necessary to ensure close communication between the "reliability specialist" and technologists. Before making any recommendations, the "reliability specialist" should know the capabilities of the enterprise. When calculating the effectiveness of the proposed strategies, both the cost of the interventions and the cost of organizing their implementation must be taken into account. To this end, we propose to confine the "reliability specialist's" imagination to existing technologies and typical O&M technical cards, and to accept new recommendations only with justification.
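
The reliability model chain described above (function, functional failures, failure types and causes, consequences, recommendations) can be sketched as a nested data structure. This is only an illustration under our own assumptions: the class names, the `ground` helper, the text risk levels, the pump content, and the location tag `P-101` are all hypothetical and are not taken from any particular RCM tool. Strategies (packages of actions over time) are left out for brevity.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Recommendation:
    action: str           # what to do, e.g. an inspection
    mitigated_risk: str   # assumed risk level once the action is in place


@dataclass
class FailureMode:
    mode: str               # type of failure
    cause: str              # cause, ideally taken from a reference book
    consequence_type: str   # standardized type of consequence
    unmitigated_risk: str   # assumed risk level if nothing is done
    recommendations: list[Recommendation] = field(default_factory=list)


@dataclass
class FunctionalFailure:
    description: str
    failure_modes: list[FailureMode] = field(default_factory=list)


@dataclass
class Function:
    description: str
    functional_failures: list[FunctionalFailure] = field(default_factory=list)


@dataclass
class ReliabilityModel:
    equipment_class: str
    functions: list[Function] = field(default_factory=list)
    location: Optional[str] = None   # None marks a typical (class-level) model

    def chains(self):
        """Yield one flat tuple per function-to-recommendation chain."""
        for fn in self.functions:
            for ff in fn.functional_failures:
                for fm in ff.failure_modes:
                    for rec in fm.recommendations:
                        yield (fn.description, ff.description, fm.mode,
                               fm.cause, fm.unmitigated_risk, rec.action)


def ground(typical, location, unmitigated_risk_by_mode):
    """Copy a typical model for one location and refine only its risk levels."""
    unit = deepcopy(typical)
    unit.location = location
    for fn in unit.functions:
        for ff in fn.functional_failures:
            for fm in ff.failure_modes:
                fm.unmitigated_risk = unmitigated_risk_by_mode.get(
                    fm.mode, fm.unmitigated_risk)
    return unit


# Hypothetical typical model for a class of pumps.
pump_model = ReliabilityModel(
    equipment_class="centrifugal pump",
    functions=[Function(
        "pump liquid with a pressure rise from inlet to outlet",
        [FunctionalFailure(
            "does not pump liquid",
            [FailureMode("impeller wear", "abrasive particles in the liquid",
                         "production loss", "medium",
                         [Recommendation("inspect impeller", "low")])],
        )],
    )],
)

# The same model applied to a hypothetical technical location.
unit_model = ground(pump_model, "P-101", {"impeller wear": "high"})
```

In this sketch the unit-level model reuses the map of "diseases" and "medicines" unchanged and overrides only the risk assessment, which mirrors the idea that consequences are individual while failure types and remedies are common to the class.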

Effective RCM Analysis Process

We believe that the reliability model should be defined at several levels.

1. Equipment type (class) level — typical (template) model, containing:

  • Typical functions, defined without reference to the actual technological process and its KPIs. For example, the typical function of any pump is to pump liquid with a certain pressure rise from inlet to outlet. What kind of liquid it is, where it comes from, where it goes, in what volume it flows, and what happens if the function is interrupted are all determined at the level of the technological system in which this type of pump is used. It is assumed that every pump of this type implements this typical function.
  • Typical functional failures, which are likewise determined by the very nature of the object type being analyzed. Any pump, in any application, will "cause harm" in the same ways: pump no liquid or too little, leak and foul everything around it, vibrate, and make excessive noise. The specific characteristics and consequences of such failures need to be assessed individually, but the ways of "causing harm" are common to all pumps in the world.
  • Typical types of failures and causes. Today this is the most standardized area, with standards for failure type and cause codes and with statistics on them. It is this part of the model that determines the map of "diseases", or breakdowns, and their causes. A fatal mistake of any "reliability specialist" is to invent a new cause of failure without first looking in the reference books.
  • Typical consequences of failures. Here, only the types of consequences can be standardized, while real material damage is purely individual. But the standardization of types of consequences is a good practice, which, in particular, will allow the formation of correct risk matrices.
  • Typical recommendations. As we wrote above, when forming an O&M strategy for a specific piece of equipment, it is necessary to clarify the technical card and the "treatment" regulations. The recommendations should clearly indicate whether the company is ready to implement the proposed measures.

2. Technical location (equipment unit) level. For a specific unit of equipment operating in a known place of the technological process, reliability analysis boils down to two steps:

Step 1. Because the equipment belongs to a certain type or class, it automatically inherits the reliability model of that class (see above). The specialist receives a blank copy of this model at the start of the work and proceeds to analyze it.
Step 2. The task of the "reliability specialist" is to clarify the specifics of the application of typical functions and functional failures, to link the estimates of consequences (unmitigated risk) to the realities of using this equipment, and to clarify the recommendations and assessment of mitigated risk. Only in the case of extreme specificity should this model be supplemented.

3. Technological system level. Analysis at this level requires the following steps:

  • Step 1. System definition. A system is defined as a sequential set of technical locations that implements the main function: participation in the production of final products. Additionally, any auxiliary or ESG functions of this system must be defined (ESG stands for "Environmental, Social, and Governance"; here it covers ecology, safety, and legislation).
  • Step 2. Assembling the system model from the models of its equipment units. The reliability model of the system is the sum of the reliability models of the technical locations included in it. When a piece of equipment is included in the analyzed system, its functions must be linked to one of the functions of the system. The rest of the system's reliability model should consist of links to the models of its elements.
  • Step 3. Adapting the assembled model to the production indicators of the system. When determining the consequences of failures and the matrices of unmitigated and mitigated risk for all elements of the system, the volume-based damage indicators must be refined according to the purpose of the system.
  • Step 4. Forming a single O&M strategy for all elements of the system. Here, in addition to individual recommendations on the intervals between interventions on the elements, the O&M work must be treated as a whole: all interventions should be grouped into packages with a common implementation interval and minimal planned downtime.
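
The packaging idea in Step 4 can be illustrated with a small sketch. It assumes one simple rule that is our own choice, not something prescribed by RCM: each intervention is moved to the longest allowed package interval that does not exceed its individually recommended interval. The function name, task names, and day values are hypothetical.

```python
from collections import defaultdict


def package_interventions(recommended_days, package_intervals):
    """Assign each intervention to the longest package interval that does
    not exceed its individually recommended interval (both in days)."""
    allowed = sorted(package_intervals)
    packages = defaultdict(list)
    for name, days in recommended_days.items():
        fitting = [p for p in allowed if p <= days]
        if not fitting:
            raise ValueError(f"no package interval is short enough for {name!r}")
        packages[fitting[-1]].append(name)
    # Return plain dict, ordered by interval, with task names sorted.
    return {interval: sorted(names)
            for interval, names in sorted(packages.items())}


# Hypothetical recommendations for the elements of one system.
recommended = {
    "pump: vibration check": 35,
    "pump: seal inspection": 100,
    "motor: bearing lubrication": 95,
    "gearbox: oil change": 400,
}

packages = package_interventions(recommended, [30, 90, 180, 360])
```

Rounding intervals down keeps every intervention at least as frequent as the specialist recommended, at the price of some extra work. Whether that trade-off is acceptable depends on the cost of interventions and of planned downtime at the specific enterprise.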

In our opinion, during the process of reliability analysis and strategy development, the following simple rules should be followed:

1. When analyzing a new object, it is necessary to clarify the typical model defined in accordance with the classification of the object. For any critical system or equipment unit, by default, there should be a model and strategy that requires its "grounding" to a specific location and operating conditions.

2. New typical models can be created from the specific models of individual objects, provided that all aspects of reliability are studied, including those that have not yet been encountered in practice. There is a whole world of resources for this: technical standards, associations of reliability engineers, experience sharing, and consultations with manufacturers.

3. All elements of the individual and typical models (functions, functional failures, types and causes of failures, consequences, recommendations, and strategy) must be entries in the corresponding reference books, whether in an information system or on paper. A new entry in such a reference book must not appear spontaneously; it should be treated as an "event" with its own specific rituals.
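
Rule 3 can be sketched as a reference book that only accepts known codes and turns every new entry into a recorded "event". The sketch assumes a minimal "ritual" of our own invention, a justification plus an approver; the class, its methods, and the codes `C-001` and `C-002` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceBook:
    name: str
    entries: dict = field(default_factory=dict)     # code -> description
    change_log: list = field(default_factory=list)  # one record per new entry

    def lookup(self, code):
        """Return the description for a code, or fail if it is not registered."""
        if code not in self.entries:
            raise KeyError(f"{code!r} is not in reference book {self.name!r}")
        return self.entries[code]

    def register(self, code, description, justification, approved_by):
        """Add a new entry only with a justification and a named approver."""
        if code in self.entries:
            raise ValueError(f"{code!r} already exists in {self.name!r}")
        if not justification.strip() or not approved_by.strip():
            raise ValueError("a new entry needs a justification and an approver")
        self.entries[code] = description
        self.change_log.append((code, justification, approved_by))


# Hypothetical reference book of failure causes.
causes = ReferenceBook("failure causes",
                       {"C-001": "abrasive particles in the liquid"})
causes.register("C-002", "shaft misalignment",
                "confirmed by two root cause analyses", "chief mechanic")
```

A model element would then store only a code and call `lookup` for it, so a cause that is not in the reference book cannot enter a model unnoticed.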

All of the above can quite fairly be called a description of how to create a Reliability Knowledge Base. Creating a Knowledge Base today inevitably leads to the need to support the process with an information system. If (or when) you decide to do so, check whether the information system you choose or build follows the rules described above. If it does not, it is easier and cheaper to use the familiar MS Excel.

Thank you for your attention, and we wish you a successful journey "there and back" on the path to reliability!