Avoid code duplication

by **Ursego** » 21 Feb 2013, 21:05

Link to this posting

by Ursego » 21 Feb 2013, 21:05

Reuse same code many times rather than create its multiple copies.

Whenever evidence of repetition is found in the code base, a good programmer should seek to refactor, so that the code to tackle a given problem or enforce a given rule is implemented in one place only. In other words, common logic should be refactored into a single reusable code unit (a generic fragment within a same function, or a brand new generic function). This proper form of careful code reuse reduces the possibility of bugs, greatly improves the robustness of our code, and is a vitally important part of defensive programming.

The dangers of copy-and-paste.

Copy-and-paste as a means of solving a set of similar problems leads to code duplication. In turn, this means that we need to maintain multiple copies of essentially the same code, but with each copy subtly modified to suit a particular need. That will lead to multiple, inconsistent versions of almost the same logic being scattered throughout your code base, and a maintenance nightmare.

Often, we have code that works perfectly well for a particular purpose, and then we find that we need to implement some very similar functionality in another place. It is all too tempting to just copy the code, adapt it to meet the new requirements, and then deploy this slightly modified code. However, every time we copy and paste code in this manner, we are exposed to the following two real dangers:

1. When the requirements change, we need to make sure that the change is reflected, not just in the original code, but in all the subsequent copies. When we cut and paste fragments, we expose our code to the possibility of bugs if we fail to reflect the requirements change in each of the multiple implementations of one and the same logic in exactly the same way. And that absolutely can happen since the developer can even be unaware, that a duplication exists somewhere, burred in kilometers of code in hundreds of functions.

2. Let's say, a bug is found in the logic, which is duplicated somewhere. We fix the code fragment, executed in the reported test case (so, formally, the bug is fixed). But we forget (or are not aware of existence of) other similar code fragments, leaving other business flows with the same bug (which can be potentially hidden and, if not found, survive through many years).

But if the enhancement (or the fix) is done in one central place, which serves different modules of the application, then your change is automatically propagated to all these modules.

There is one more benefit of implementing the logic in one place only: the overall length of code decreases, functions are kept smaller. Let me quote the topic Keep functions simple:

Divide your program into short methods that perform one identifiable task ("separation of concerns"). The main function should call well-named sub-functions which call sub-sub-functions (and so on, and so on...), so a developer can easily "travel" up and down between different levels of abstraction through hierarchies of any depth, each time concentrating only on one level. That will result in simple overall code even if the whole system is extremely complex, and dramatically facilitate understanding, debugging and enhancement of serious enterprise applications.

In short, copy-and-paste coding is a direct violation of the DRY (Don't Repeat Yourself) principle, which is so fundamental in software engineering.

Avoid code duplication within a function.

Don't write similar code more than once. Instead, handle the differences in one, generic fragment. The following example was found in one of the old stored procedures I was enhancing:

*** BAD code: ***

Code: Select all: IF @lang = 'FR' SET @brok_dtl = 'Courtier: ' + @prov_cd + ' ' + right ('0000' + convert (varchar, @brok_no), 4) + '-' + right ('00' + convert (varchar, @brok_ofc_no), 3) + ' ' + @name + ' - ' + @street_addr_1 + ' ' + @city + ', ' + @province ELSE SET @brok_dtl = 'Broker: ' + @prov_cd + ' ' + right ('0000' + convert (varchar, @brok_no), 4) + '-' + right ('00' + convert (varchar, @brok_ofc_no), 3) + ' ' + @name + ' - ' + @street_addr_1 + ' ' + @city + ', ' + @province

If I would be its author, I would write this way:

*** GOOD code: ***

Code: Select all: SET @brok_dtl = CASE WHEN @lang = 'FR' THEN 'Courtier: ' ELSE 'Broker: ' END + @prov_cd + ' ' + right ('0000' + convert (varchar, @brok_no), 4) + '-' + right ('00' + convert (varchar, @brok_ofc_no), 3) + ' ' + @name + ' - ' + @street_addr_1 + ' ' + @city + ', ' + @province

Extract duplicated code fragments to a brand-new function and call it from the locations the code has been moved from.

If duplicated fragments are in classes, inherited from a same ancestor, then create the new function in the ancestor and call it from the descendants. If the classes are not inherited from one ancestor then create a generic function in a third class (even if you have to create that class only for one function!). If you are working with a not object oriented language (like C or Transact-SQL), simply extract the code to to a brand new function / stored procedure.

Extract similar code fragments (and merge functions with similar functionality) into a brand-new smart generic function even if they are not absolutely identical. Call that function from the locations where the algorithm result is consumed, each time supplying the unique (location-specific) data.

In general, there are 3 ways to supply the unique data: as an argument, by returning from a dedicated function, and via an instance variable:

Supplying unique data as an argument.

The specific (different) data can be supplied as argument(s) of the new generic function - that is the most common way. For example, we have two original functions in two different classes (Car and Motorcycle) which have a same ancestor (Transport):

*** BAD code: ***

printCarInfo() of the class Car:

Code: Select all: [fragment 1] String transportDesc = "Car"; [fragment 2]

printMotorcycleInfo() of the class Motorcycle:

Code: Select all: [fragment 1] // same as [fragment 1] in printCarInfo() of class Car String transportDesc = "Motorcycle"; // ooops, this line is different... [fragment 2] // same as [fragment 2] in printCarInfo() of class Car

*** GOOD code: ***

Generic function printTransportInfo(), created in the ancestor class, Transport (into which printCarInfo() and printMotorcycleInfo() were refactored):

Code: Select all: [fragment 1] // same as [fragment 1] in printCarInfo() of class Car and printMotorcycleInfo() of class Motorcycle String transportDesc = aTransportDesc; // aTransportDesc is the function's argument; "Car" or "Motorcycle" can be passed [fragment 2] // same as [fragment 2] in printCarInfo() of class Car and printMotorcycleInfo() of class Motorcycle

Supplying unique data by returning from a dedicated function.

In the next example, getTransportDesc() was created only for the sake of returning the circumstances-dependent data to one universal algorithm. Of course, in such a simple case as returning a hardcoded value, you should use the previous method (an argument), but sometimes the unique data is obtained or calculated in different ways for different classes, so a dedicated function can be a good solution:

** GOOD code: ***

getTransportDesc() implemented in the class Car:

Code: Select all: return "Car";

getTransportDesc() implemented in the class Motorcycle:

Code: Select all: return "Motorcycle";

printTransportInfo(), created in the ancestor class, Transport:

Code: Select all: [fragment 1] // same as [fragment 1] in printCarInfo() of class Car and printMotorcycleInfo() of class Motorcycle String transportDesc = this.getTransportDesc(); [fragment 2] // same as [fragment 2] in printCarInfo() of class Car and printMotorcycleInfo() of class Motorcycle

If the original duplicated fragments were in classes, inherited from a same super-class, then create getTransportDesc() in that super-class as protected. Make it an abstract method if your programming language supports them. If doesn't (so you are forced to implement the function on the ancestor level, which doesn't make any sense) then throw an exception which will let developers know that the ancestor's version of the function must never be called since it's only intended to be overridden.

Supplying unique data via an instance variable.

Finally, there is one more way to achieve the goal - we can populate an instance variable while initializing the instance (for example, in its constructor) - of course, if the value is know in that moment and will not change later. In our example, we could declare an object field final String transportDesc;, populate it in the constructor of each descendant, and use directly in printTransportInfo() instead of the local variable with the same name. printTransportInfo() should validate the field and throw an exception if it's empty (i.e. has been forgotten to be populated in a descendant). This method is good only if the specific data is static by its nature (like an entity description), i.e. will not change after initialization. In other circumstances it can be confusing and bugs-prone (we need to do our best to use variables with as small scope as possible), so it's better to use other methods.

------------------------------------------------------------

From the book "Clean Code":

Duplication may be the root of all evil in software. Many principles and practices have been created for the purpose of controlling or eliminating it. Consider, for example, that all of Codd's database normal forms serve to eliminate duplication in data. Consider also how object-oriented programming serves to concentrate code into base classes that would otherwise be redundant. Structured programming, Aspect Oriented Programming, Component Oriented Programming, are all, in part, strategies for eliminating duplication. It would appear that since the invention of the subroutine, innovations in software development have been an ongoing attempt to eliminate duplication from our source code.

From the book "97 Things Every Programmer Should Know":

Don't Repeat Yourself

Of all the principles of programming, Don't Repeat Yourself (DRY) is perhaps one of the most fundamental. The principle ... underlies many other wellknown software development best practices and design patterns. The developer who learns to recognize duplication, and understands how to eliminate it through appropriate practice and proper abstraction, can produce much cleaner code than one who continuously infects the application with unnecessary repetition.

Duplication Is Waste

Every line of code that goes into an application must be maintained, and is a potential source of future bugs. Duplication needlessly bloats the codebase, resulting in more opportunities for bugs and adding accidental complexity to the system. The bloat that duplication adds to the system also makes it more difficult for developers working with the system to fully understand the entire system, or to be certain that changes made in one location do not also need to be made in other places that duplicate the logic they are working on. DRY requires that "every piece of knowledge must have a single, unambiguous, authoritative representation within a system".

A Matter of Principle

Other software principles are also related to DRY. The Once and Only Once principle, which applies only to the functional behavior of code, can be thought of as a subset of DRY. The Open/Closed Principle, which states that "software entities should be open for extension, but closed for modification", only works in practice when DRY is followed. Likewise, the well-known Single Responsibility Principle, which requires that a class have "only one reason to change", relies on DRY. When followed with regard to structure, logic, process, and function, the DRY principle provides fundamental guidance to software developers and aids the creation of simpler, more maintainable, higher-quality applications.

CAUTION!

Did you pay attention that I wrote "to write a piece of code twice using copy-paste... or move it to a third place..."? I was talking about the case when you are just creating that fragment from scratch. But the situation is different if you think whether to duplicate an existing fragment (or move it to a third place), i.e. refactor an existing code to prevent duplication by you, or even refactor your all-brand-new code which will be called from such parts of the application that unwanted dependencies will be created. Please read a quote from the book "97 Things Every Programmer Should Know":

Beware the Share

It was my first project at the company. I'd just finished my degree and was anxious to prove myself, staying late every day going through the existing code. As I worked through my first feature, I took extra care to put in place everything I had learned - commenting, logging, pulling out shared code into libraries where possible, the works. The code review that I had felt so ready for came as a rude awakening - reuse was frowned upon!

How could this be? Throughout college, reuse was held up as the epitome of quality software engineering. All the articles I had read, the textbooks, the seasoned software professionals who taught me - was it all wrong?

It turns out that I was missing something critical.

Context.

The fact that two wildly different parts of the system performed some logic in the same way meant less than I thought. Up until I had pulled out those libraries of shared code, these parts were not dependent on each other. Each could evolve independently. Each could change its logic to suit the needs of the system’s changing business environment. Those four lines of similar code were accidental - a temporal anomaly, a coincidence. That is, until I came along.

The libraries of shared code I created tied the shoelaces of each foot to the other. Steps by one business domain could not be made without first synchronizing with the other. Maintenance costs in those independent functions used to be negligible, but the common library required an order of magnitude more testing.

While I'd decreased the absolute number of lines of code in the system, I had increased the number of dependencies. The context of these dependencies is critical - had they been localized, the sharing may have been justified and had some positive value. When these dependencies aren’t held in check, their tendrils entangle the larger concerns of the system, even though the code itself looks just fine.

These mistakes are insidious in that, at their core, they sound like a good idea.When applied in the right context, these techniques are valuable. In the wrong context, they increase cost rather than value. When coming into an existing codebase with no knowledge of where the various parts will be used, I’m much more careful these days about what is shared.

Beware the share. Check your context. Only then, proceed.

Avoid code duplication

Link to this posting

Ketones are a more high-octane fuel for your brain than glucose. Become a biohacker and upgrade yourself to version 2.0!