What is boxing and unboxing and what are the trade offs?


Language Agnostic Problem Overview


I'm looking for a clear, concise and accurate answer.

Ideally as the actual answer, although links to good explanations are welcome.

Language Agnostic Solutions


Solution 1 - Language Agnostic

Boxed values are data structures that are minimal wrappers around primitive types*. Boxed values are typically stored as pointers to objects on the heap.

Thus, boxed values use more memory and take at minimum two memory lookups to access: once to get the pointer, and another to follow that pointer to the primitive. Obviously this isn't the kind of thing you want in your inner loops. On the other hand, boxed values typically play better with other types in the system. Since they are first-class data structures in the language, they have the expected metadata and structure that other data structures have.

In Java and Haskell generic collections can't contain unboxed values. Generic collections in .NET can hold unboxed values with no penalties. Where Java's generics are only used for compile-time type checking, .NET will generate specific classes for each generic type instantiated at run time.
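
To make the Java half of that comparison concrete, here is a small sketch (the class and method names are made up for illustration). It shows that two differently parameterised lists share a single runtime class, and that an int added to a List<Integer> ends up stored as a boxed Integer.

```java
import java.util.ArrayList;
import java.util.List;

final class ErasureDemo {
    // True if both lists are instances of the very same runtime class.
    static boolean sameRuntimeClass(List<?> a, List<?> b) {
        return a.getClass() == b.getClass();
    }

    // Adds a primitive int to a generic list and returns what was actually stored.
    static Object storedElement(int value) {
        List<Integer> list = new ArrayList<>();
        list.add(value);     // autoboxing: the int is converted to an Integer
        return list.get(0);  // the element is a reference to an Integer object
    }

    public static void main(String[] args) {
        // The type arguments exist only at compile time, so this prints "true".
        System.out.println(sameRuntimeClass(new ArrayList<Integer>(), new ArrayList<String>()));
        // Prints "java.lang.Integer": the list holds a box, not a raw int.
        System.out.println(storedElement(42).getClass().getName());
    }
}
```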

Java and Haskell have unboxed arrays, but they're distinctly less convenient than the other collections. However, when peak performance is needed it's worth a little inconvenience to avoid the overhead of boxing and unboxing.
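
As a rough Java illustration of that trade-off (names are hypothetical, and no timing is claimed here), the sketch below sums the same numbers from an unboxed int[] and from a boxed List<Integer>. The results are identical; the difference is that the list version has to follow a reference and unbox on every iteration.

```java
import java.util.ArrayList;
import java.util.List;

final class SumDemo {
    // Sums a primitive array: the ints are stored inline, so nothing is unboxed.
    static long sumPrimitive(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Sums a boxed list: each element is a reference to an Integer on the heap.
    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer v : values) {
            total += v; // auto-unboxing happens here on every iteration
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000;
        int[] primitive = new int[n];
        List<Integer> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            primitive[i] = i;
            boxed.add(i); // autoboxing on insertion
        }
        System.out.println(sumPrimitive(primitive) + " " + sumBoxed(boxed));
    }
}
```

Measuring the actual cost needs a proper benchmark harness, since the JIT can sometimes remove the boxing entirely.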

* For this discussion, a primitive value is any that can be stored on the call stack, rather than stored as a pointer to a value on the heap. Frequently that's just the machine types (ints, floats, etc), structs, and sometimes static sized arrays. .NET-land calls them value types (as opposed to reference types). Java folks call them primitive types. Haskellions just call them unboxed.

** I'm also focusing on Java, Haskell, and C# in this answer, because that's what I know. For what it's worth, Python, Ruby, and JavaScript all have exclusively boxed values. This is also known as the "Everything is an object" approach***.

*** Caveat: A sufficiently advanced compiler / JIT can in some cases actually detect that a value which is semantically boxed when looking at the source, can safely be an unboxed value at runtime. In essence, thanks to brilliant language implementors your boxes are sometimes free.

Solution 2 - Language Agnostic

from C# 3.0 In a Nutshell:

> Boxing is the act of casting a value type into a reference type:

int x = 9; 
object o = x; // boxing the int

> unboxing is... the reverse:

// unboxing o
object o = 9; 
int x = (int)o; 

Solution 3 - Language Agnostic

Boxing and unboxing are the processes of converting a primitive value into an object-oriented wrapper class (boxing) and converting a value from such a wrapper class back to the primitive value (unboxing).

For example, in Java, you may need to convert an int value into an Integer (boxing) if you want to store it in a Collection, because primitives can't be stored in a Collection, only objects. But when you want to get it back out of the Collection, you may want the value as an int and not an Integer, so you would unbox it.
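
A minimal sketch of that round trip, written once with the conversions spelled out and once relying on autoboxing (the class and method names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

final class CollectionBoxingDemo {
    // Stores an int in a collection and reads it back, with both conversions explicit.
    static int roundTrip(int primitive) {
        List<Integer> numbers = new ArrayList<>();

        Integer boxed = Integer.valueOf(primitive); // boxing: int -> Integer
        numbers.add(boxed);

        Integer fetched = numbers.get(0);
        return fetched.intValue();                  // unboxing: Integer -> int
    }

    // The same round trip, letting the compiler insert the conversions.
    static int roundTripAuto(int primitive) {
        List<Integer> numbers = new ArrayList<>();
        numbers.add(primitive); // the compiler boxes here
        return numbers.get(0);  // the compiler unboxes here
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(5) + " " + roundTripAuto(5));
    }
}
```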

Boxing and unboxing is not inherently bad, but it is a tradeoff. Depending on the language implementation, it can be slower and more memory intensive than just using primitives. However, it may also allow you to use higher level data structures and achieve greater flexibility in your code.

These days, it is most commonly discussed in the context of Java's (and other languages') "autoboxing/autounboxing" feature. Here is a Java-centric explanation of autoboxing.

Solution 4 - Language Agnostic

In .NET:

Often you can't rely on knowing what type of variable a function will consume, so you need to use a variable of the lowest common denominator type that everything derives from. In .NET this is object.

However, object is a class and stores its contents as a reference.

List<int> notBoxed = new List<int> { 1, 2, 3 };
int i = notBoxed[1]; // this is the actual value

List<object> boxed = new List<object> { 1, 2, 3 };
int j = (int) boxed[1]; // this is an object that can be 'unboxed' to an int

While both these hold the same information the second list is larger and slower. Each value in the second list is actually a reference to an object that holds the int.

This is called boxed because the int is wrapped by the object. When it's cast back, the int is unboxed, i.e. converted back to its value.

For value types (i.e. all structs) this is slow, and potentially uses a lot more space.

For reference types (i.e. all classes) this is far less of a problem, as they are stored as a reference anyway.

A further problem with a boxed value type is that it's not obvious that you're dealing with the box, rather than the value. When you compare two structs then you're comparing values, but when you compare two classes then (by default) you're comparing the reference - i.e. are these the same instance?

This can be confusing when dealing with boxed value types:

int a = 7;
int b = 7;

bool sameInts = (a == b); // true, because a and b have the same value

object c = (object) 7;
object d = (object) 7;

bool sameBoxes = (c == d); // false, because c and d are different instances

It's easy to work around:

bool viaEquals = c.Equals(d); // true, because it calls the underlying int's Equals

bool viaUnboxing = ((int) c) == ((int) d); // true once the values are cast back to int

However it is another thing to be careful of when dealing with boxed values.
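
Java has the same trap with its wrapper classes. The sketch below is only an illustration (class and method names are made up): == on two Integer references compares identity, while equals() or an explicit unboxing compares values. The outcome of == is deliberately left unasserted, because it depends on whether the JVM happens to hand out the same cached box.

```java
final class BoxedEqualityDemo {
    // Compares two boxes by value; this is the reliable question to ask.
    static boolean sameValue(Integer x, Integer y) {
        return x.equals(y);
    }

    // Compares after unboxing, the Java analogue of casting back to int.
    static boolean sameUnboxed(Integer x, Integer y) {
        return x.intValue() == y.intValue();
    }

    public static void main(String[] args) {
        Integer c = 1000;
        Integer d = 1000;

        // == compares references here. Outside the -128..127 range a default JVM
        // usually creates two separate boxes, but do not rely on either result.
        System.out.println("c == d: " + (c == d));

        System.out.println("c.equals(d): " + sameValue(c, d));
        System.out.println("unboxed ==: " + sameUnboxed(c, d));
    }
}
```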

Solution 5 - Language Agnostic

Boxing is the conversion of a value type into a reference type, whereas unboxing is the conversion of a reference type back into a value type.

Example:

int i = 123;
object o = i;      // Boxing
int j = (int)o;    // Unboxing

Value types are: int, char, structures and enumerations. Reference types are: classes, interfaces, arrays, strings and objects.

Solution 6 - Language Agnostic

The .NET FCL generic collections:

List<T>
Dictionary<TKey, TValue>
SortedDictionary<TKey, TValue>
Stack<T>
Queue<T>
LinkedList<T>

were all designed to overcome the performance issues of boxing and unboxing in previous collection implementations.

For more, see chapter 16, CLR via C# (2nd Edition).

Solution 7 - Language Agnostic

The language-agnostic meaning of a box is just "an object that contains some other value".

Literally, boxing is the operation of putting a value into a box. More specifically, it is the operation of creating a new box that contains the value. After boxing, the boxed value can be accessed from the box object by unboxing.

Note that objects (not in the OOP-specific sense) in many programming languages are about identity, while values are not. Two objects are the same iff their identities cannot be distinguished in the program semantics. Values can also be the same (usually under some equality operator), but we do not distinguish them as "one" or "two" unique values.

Boxes exist mainly to distinguish side effects (typically mutation) from object state that would otherwise probably be invisible to users.

A language may limit the allowed ways to access an object and hide the identity of the object by default. For example, typical Lisp dialects have no explicit distinction between objects and values. As a result, the implementation is free to share the underlying storage of objects until a mutation operation occurs on an object (so the object must be "detached" from the shared instance after the operation to make the effect visible, i.e. the mutated value stored in the object may differ from that of the other objects still holding the old value). This technique is sometimes called object interning.

Interning makes the program more memory efficient at runtime if the objects are shared without frequent need for mutation, at the cost that:

  • The users cannot distinguish the identity of the objects.
    • There is no way to identify an object and ensure that its state is explicitly independent of other objects in the program before some side effect has actually occurred (and the implementation does not aggressively perform the interning concurrently; this should be the rare case, though).
  • There may be more problems with interoperation that requires identifying different objects for different operations.
  • There is a risk that such assumptions are false, so that applying interning actually makes performance worse.
    • This depends on the programming paradigm. Imperative programming that mutates objects frequently certainly would not work well with interning.
  • Implementations depending on COW (copy-on-write) to ensure interning can incur serious performance degradation in concurrent environments.
    • Even local sharing for just a few internal data structures can be bad. For example, ISO C++11 disallowed sharing of the internal elements of std::basic_string for exactly this reason, even at the cost of breaking the ABI on at least one mainstream implementation (libstdc++).
  • Boxing and unboxing incur performance penalties. This is especially obvious when these operations are trivial to avoid by hand but not easy for the optimizer to eliminate. The concrete cost depends on the implementation, or even on the individual program, though.

Mutable cells, i.e. boxes, are well-established facilities that resolve exactly the problems in the 1st and 2nd bullets listed above. Additionally, there can be immutable boxes for implementing assignment in a functional language. See SRFI-111 for a practical instance.

Using mutable cells as function arguments under a call-by-value strategy makes the visible effects of mutation shared between the caller and the callee. The object contained in a box is effectively "called by sharing" in this sense.

Sometimes boxes are referred to as references (which is technically false), so the shared semantics are named "reference semantics". This is not correct, because not all references can propagate visible side effects (e.g. immutable references). References are more about exposing access through indirection, while boxes aim to expose only minimal details of the access, such as whether there is indirection at all (which is uninteresting to users and better avoided by the implementation).

Moreover, "value semantics" is irrelevant here. Values are opposed neither to references nor to boxes. All of the discussion above assumes a call-by-value strategy. Under others (like call-by-name or call-by-need), no boxes are needed to share object contents in this way.
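
A small Java sketch of this idea, under the assumption of a hand-written mutable cell (the `Box` class and both methods are hypothetical, not a standard API): reassigning a by-value parameter is invisible to the caller, while mutating the contents of a shared box is visible.

```java
final class MutableBoxDemo {
    // A minimal mutable cell; the name Box and its methods are illustrative only.
    static final class Box<T> {
        private T value;

        Box(T value) {
            this.value = value;
        }

        T get() {
            return value;
        }

        void set(T newValue) {
            value = newValue;
        }
    }

    // The parameter is a copy of the caller's int, so this change is lost on return.
    static void incrementPlain(int n) {
        n = n + 1;
    }

    // The parameter is a copy of the reference to the same box, so the
    // mutation of the box's contents is visible to the caller.
    static void incrementBoxed(Box<Integer> cell) {
        cell.set(cell.get() + 1);
    }

    public static void main(String[] args) {
        int plain = 1;
        incrementPlain(plain);

        Box<Integer> cell = new Box<>(1);
        incrementBoxed(cell);

        System.out.println(plain + " " + cell.get()); // prints "1 2"
    }
}
```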

Java is probably the first programming language to make these features popular in the industry. Unfortunately, there seem to be many bad consequences related to this topic:

  • The overall programming paradigm does not fit the design.
  • Practically, interning is limited to specific objects like immutable strings, and the costs of (auto-)boxing and unboxing are often blamed.
  • Fundamental PL knowledge, like the definition of the term "object" (as "instance of a class") in the language specification and the descriptions of parameter passing, became biased compared to the original, well-known meanings during the adoption of Java by programmers.
    • At least the CLR languages follow similar parlance.
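
For reference, the limited interning that Java does offer can be observed directly. The sketch below (class and method names are made up) relies on two documented behaviours: Integer.valueOf always caches values in the range -128 to 127, and String.intern() returns the canonical pooled instance.

```java
final class InterningDemo {
    // Both calls return the same cached object, so the references are identical.
    static boolean smallIntegersShared() {
        return Integer.valueOf(100) == Integer.valueOf(100);
    }

    // A string built at run time is a fresh object, distinct from the pooled literal.
    static boolean freshStringIsDistinct() {
        String built = new String(new char[] {'b', 'o', 'x'});
        return built != "box";
    }

    // intern() maps the fresh string to the pooled instance that the literal also uses.
    static boolean internedStringShared() {
        String built = new String(new char[] {'b', 'o', 'x'});
        return built.intern() == "box";
    }

    public static void main(String[] args) {
        System.out.println(smallIntegersShared());   // true
        System.out.println(freshStringIsDistinct()); // true
        System.out.println(internedStringShared());  // true
    }
}
```

Sharing is safe in both cases only because Integer and String are immutable, which matches the point above that interning fits poorly with frequent mutation.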

Some more tips on implementations (and comments to this answer):

  • Whether objects are put on the call stack or the heap is an implementation detail, and irrelevant to the implementation of boxes.
    • Some language implementations do not maintain a contiguous storage as the call stack.
    • Some language implementations do not even make the (per thread) activation records a linear stack.
    • Some language implementations do allocate stacks on the free store ("the heap") and transfer slices of frames between the stacks and the heap back and forth.
    • These strategies have nothing to do with boxes. For instance, many Scheme implementations have boxes while using different activation record layouts, including all the ways listed above.
  • Besides the technical inaccuracy, the statement "everything is an object" is irrelevant to boxing.
    • Python, Ruby, and JavaScript all use latent typing (by default), so all identifiers referring to some objects will evaluate to values having the same static type. So does Scheme.
    • Some JavaScript and Ruby implementations use so-called NaN-boxing to allow inline allocation of some objects. Some others (including CPython) do not. With NaN-boxing, a normal double object needs no unboxing to access its value, while a value of some other type can be boxed in a host double object, and there is no reference for either the double or the boxed value. With the naive pointer approach, a host object pointer value like PyObject* is an object reference holding a box whose boxed value is stored in dynamically allocated space.
    • At least in Python, objects are not "everything". They are also not known as "boxed values" unless you are talking about interoperability with specific implementations.
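
The NaN-boxing idea mentioned above can be sketched in a few lines. This is a toy encoding written for illustration only, not the layout of any real engine: the tag value and all names are assumptions. Every value is a 64-bit word; a double is stored as its own bit pattern, and an int is stored in the low 32 bits of a NaN pattern that ordinary arithmetic never produces.

```java
final class NanBoxDemo {
    // Hypothetical tag: the upper 32 bits of a quiet NaN with one payload bit set,
    // so it never equals the canonical NaN pattern 0x7FF8_0000_0000_0000.
    private static final long INT_TAG = 0x7FF8_0001L;

    // A double is stored as its own bits; no separate box is allocated.
    static long fromDouble(double d) {
        return Double.doubleToLongBits(d); // collapses every NaN to the canonical one
    }

    // An int is stored in the low 32 bits of the tagged NaN pattern.
    static long fromInt(int i) {
        return (INT_TAG << 32) | (i & 0xFFFF_FFFFL);
    }

    // A word holds an int exactly when its upper half matches the tag.
    static boolean isInt(long bits) {
        return (bits >>> 32) == INT_TAG;
    }

    static double asDouble(long bits) {
        return Double.longBitsToDouble(bits);
    }

    static int asInt(long bits) {
        return (int) bits; // the low 32 bits carry the payload
    }

    public static void main(String[] args) {
        long a = fromInt(-5);
        long b = fromDouble(1.5);
        System.out.println(isInt(a) + " " + asInt(a));    // true -5
        System.out.println(isInt(b) + " " + asDouble(b)); // false 1.5
    }
}
```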

Solution 8 - Language Agnostic

Boxing and unboxing allow value types to be treated as objects. Boxing means converting a value to an instance of the object reference type; unboxing, on the other hand, converts the object type back to a value type. For example, in Java, Integer is a class and int is a primitive data type: converting int to Integer is an example of boxing, whereas converting Integer to int is unboxing. A boxed value is an ordinary heap object, so it is managed by the garbage collector.

int i=123;
object o=(object)i; //Boxing

o=123;
i=(int)o; //Unboxing.

Solution 9 - Language Agnostic

Like anything else, autoboxing can be problematic if not used carefully. The classic case is to end up with a NullPointerException and not be able to track it down, even with a debugger. Try this:

public class TestAutoboxNPE
{
	public static void main(String[] args)
	{
		Integer i = null;
		
		// .. do some other stuff and forget to initialise i
		
		i = addOne(i);           // Whoa! NPE!
	}
	
	public static int addOne(int i)
	{
		return i + 1;
	}
}
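
The exception comes from the hidden unboxing step at the call site, not from anything inside addOne. A small sketch that makes this observable and shows one possible guard (the class name, helper names and the fallback parameter are all illustrative assumptions):

```java
final class NullUnboxingDemo {
    static int addOne(int i) {
        return i + 1;
    }

    // Returns true if passing the boxed value to addOne(int) throws a NullPointerException.
    static boolean unboxingThrows(Integer boxed) {
        try {
            addOne(boxed); // auto-unboxing runs before addOne is entered
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // One possible guard: substitute a caller-chosen default when the box is null.
    static int addOneOrDefault(Integer boxed, int fallback) {
        return boxed == null ? fallback : addOne(boxed);
    }

    public static void main(String[] args) {
        System.out.println(unboxingThrows(null));     // true
        System.out.println(unboxingThrows(1));        // false
        System.out.println(addOneOrDefault(null, 0)); // 0
        System.out.println(addOneOrDefault(4, 0));    // 5
    }
}
```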

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Keith | View Question on Stackoverflow |
| Solution 1 - Language Agnostic | Peter Burns | View Answer on Stackoverflow |
| Solution 2 - Language Agnostic | Christian Hagelid | View Answer on Stackoverflow |
| Solution 3 - Language Agnostic | Justin Standard | View Answer on Stackoverflow |
| Solution 4 - Language Agnostic | Keith | View Answer on Stackoverflow |
| Solution 5 - Language Agnostic | vani | View Answer on Stackoverflow |
| Solution 6 - Language Agnostic | Jonathan Webb | View Answer on Stackoverflow |
| Solution 7 - Language Agnostic | FrankHB | View Answer on Stackoverflow |
| Solution 8 - Language Agnostic | Sanjay Kumar | View Answer on Stackoverflow |
| Solution 9 - Language Agnostic | PEELY | View Answer on Stackoverflow |