Millions of 3D points: How to find the 10 of them closest to a given point?

AlgorithmGraphGraphicsKnnNearest Neighbor

Algorithm Problem Overview


A point in 3-d is defined by (x,y,z). Distance d between any two points (X,Y,Z) and (x,y,z) is d= Sqrt[(X-x)^2 + (Y-y)^2 + (Z-z)^2]. Now there are a million entries in a file, each entry is some point in space, in no specific order. Given any point (a,b,c) find the nearest 10 points to it. How would you store the million points and how would you retrieve those 10 points from that data structure.

Algorithm Solutions


Solution 1 - Algorithm

Million points is a small number. The most straightforward approach works here (code based on KDTree is slower (for querying only one point)).

Brute-force approach (time ~1 second)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point = numpy.random.uniform(0, 100, NDIM) # choose random point
print 'point:', point
d = ((a-point)**2).sum(axis=1)  # compute distances
ndx = d.argsort() # indirect sort 

# print 10 nearest points to the chosen one
import pprint
pprint.pprint(zip(a[ndx[:10]], d[ndx[:10]]))

Run it:

$ time python nearest.py 
point: [ 69.06310224   2.23409409  50.41979143]
[(array([ 69.,   2.,  50.]), 0.23500677815852947),
 (array([ 69.,   2.,  51.]), 0.39542392750839772),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  51.]), 0.9272357402197513),
 (array([ 70.,   2.,  50.]), 1.1088022980015722),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   3.,  51.]), 1.801031260062794),
 (array([ 69.,   1.,  51.]), 1.8636121147970444)]

real	0m1.122s
user	0m1.010s
sys	0m0.120s

Here's the script that generates million 3D points:

#!/usr/bin/env python
import random
for _ in xrange(10**6):
    print ' '.join(str(random.randrange(100)) for _ in range(3))

Output:

$ head million_3D_points.txt

18 56 26
19 35 74
47 43 71
82 63 28
43 82 0
34 40 16
75 85 69
88 58 3
0 63 90
81 78 98

You could use that code to test more complex data structures and algorithms (for example, whether they actually consume less memory or faster then the above simplest approach). It is worth noting that at the moment it is the only answer that contains working code.

Solution based on KDTree (time ~1.4 seconds)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point =  [ 69.06310224,   2.23409409,  50.41979143] # use the same point as above
print 'point:', point


from scipy.spatial import KDTree

# find 10 nearest points
tree = KDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query([point], k=10)

# print 10 nearest points to the chosen one
print a[ndx]

Run it:

$ time python nearest_kdtree.py  

point: [69.063102240000006, 2.2340940900000001, 50.419791429999997]
[[[ 69.   2.  50.]
  [ 69.   2.  51.]
  [ 69.   3.  50.]
  [ 69.   3.  50.]
  [ 69.   3.  51.]
  [ 70.   2.  50.]
  [ 70.   2.  51.]
  [ 70.   2.  51.]
  [ 70.   3.  51.]
  [ 69.   1.  51.]]]
    
real	0m1.359s
user	0m1.280s
sys	0m0.080s
Partial sort in C++ (time ~1.1 seconds)

// $ g++ nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>
#include <iostream>
#include <vector>

#include <boost/lambda/lambda.hpp>  // _1
#include <boost/lambda/bind.hpp>    // bind()
#include <boost/tuple/tuple_io.hpp>

namespace {
  typedef double coord_t;
  typedef boost::tuple<coord_t,coord_t,coord_t> point_t;

  coord_t distance_sq(const point_t& a, const point_t& b) { // or boost::geometry::distance
    coord_t x = a.get<0>() - b.get<0>();
    coord_t y = a.get<1>() - b.get<1>();
    coord_t z = a.get<2>() - b.get<2>();
    return x*x + y*y + z*z;
  }
}

int main() {
  using namespace std;
  using namespace boost::lambda; // _1, _2, bind()

  // read array from stdin
  vector<point_t> points;
  cin.exceptions(ios::badbit); // throw exception on bad input
  while(cin) {
    coord_t x,y,z;
    cin >> x >> y >> z;    
    points.push_back(boost::make_tuple(x,y,z));
  }
  
  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;  // 1.14s

  // find 10 nearest points using partial_sort() 
  // Complexity: O(N)*log(m) comparisons (O(N)*log(N) worst case for the implementation)
  const size_t m = 10;
  partial_sort(points.begin(), points.begin() + m, points.end(), 
               bind(less<coord_t>(), // compare by distance to the point
                    bind(distance_sq, _1, point), 
                    bind(distance_sq, _2, point)));
  for_each(points.begin(), points.begin() + m, cout << _1 << "\n"); // 1.16s
}

Run it:

g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
point: (69.0631 2.23409 50.4198)
(69 2 50)
(69 2 51)
(69 3 50)
(69 3 50)
(69 3 51)
(70 2 50)
(70 2 51)
(70 2 51)
(70 3 51)
(69 1 51)

real	0m1.152s
user	0m1.140s
sys	0m0.010s
Priority Queue in C++ (time ~1.2 seconds)

#include <algorithm>           // make_heap
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/range.hpp>     // boost::begin(), boost::end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  template<class T>
  class less_distance : public std::binary_function<T, T, bool> {
    const T& point;
  public:
    less_distance(const T& reference_point) : point(reference_point) {}

    bool operator () (const T& a, const T& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours+1];

  // populate `points`
  for (size_t i = 0; getpoint(cin, points[i]) && i < nneighbours; ++i)
    ;
  
  less_distance<point_t> less_distance_point(point);
  make_heap  (boost::begin(points), boost::end(points), less_distance_point);

  // Complexity: O(N*log(m))
  while(getpoint(cin, points[nneighbours])) {
    // add points[-1] to the heap; O(log(m))
    push_heap(boost::begin(points), boost::end(points), less_distance_point); 
    // remove (move to last position) the most distant from the
    // `point` point; O(log(m))
    pop_heap (boost::begin(points), boost::end(points), less_distance_point);
  }

  // print results
  push_heap  (boost::begin(points), boost::end(points), less_distance_point);
  //   O(m*log(m))
  sort_heap  (boost::begin(points), boost::end(points), less_distance_point);
  for (size_t i = 0; i < nneighbours; ++i) {
    cout << points[i] << ' ' << distance_sq(points[i], point) << '\n';  
  }
}

Run it:

$ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )

point: (69.0631 2.23409 50.4198)
(69 2 50) 0.235007
(69 2 51) 0.395424
(69 3 50) 0.766819
(69 3 50) 0.766819
(69 3 51) 0.927236
(70 2 50) 1.1088
(70 2 51) 1.26922
(70 2 51) 1.26922
(70 3 51) 1.80103
(69 1 51) 1.86361

real	0m1.174s
user	0m1.180s
sys	0m0.000s
Linear Search -based approach (time ~1.15 seconds)

// $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>           // sort
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/foreach.hpp>
#include <boost/range.hpp>     // begin(), end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

#define foreach BOOST_FOREACH

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b);

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out);    

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  class less_distance : public std::binary_function<point_t, point_t, bool> {
    const point_t& point;
  public:
    explicit less_distance(const point_t& reference_point) 
        : point(reference_point) {}
    bool operator () (const point_t& a, const point_t& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;
  less_distance nearer(point);

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours];

  // populate `points`
  foreach (point_t& p, points)
    if (! getpoint(cin, p))
      break;

  // Complexity: O(N*m)
  point_t current_point;
  while(cin) {
    getpoint(cin, current_point); //NOTE: `cin` fails after the last
                                  //point, so one can't lift it up to
                                  //the while condition

    // move to the last position the most distant from the
    // `point` point; O(m)
    foreach (point_t& p, points)
      if (nearer(current_point, p)) 
        // found point that is nearer to the `point` 

        //NOTE: could use insert (on sorted sequence) & break instead
        //of swap but in that case it might be better to use
        //heap-based algorithm altogether
        std::swap(current_point, p);
  }

  // print results;  O(m*log(m))
  sort(boost::begin(points), boost::end(points), nearer);
  foreach (point_t p, points)
    cout << p << ' ' << distance_sq(p, point) << '\n';  
}

namespace {
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }
}

Measurements shows that most of the time is spent reading array from the file, actual computations take on order of magnitude less time.

Solution 2 - Algorithm

If the million entries are already in a file, there's no need to load them all into a data structure in memory. Just keep an array with the top-ten points found so far, and scan over the million points, updating your top-ten list as you go.

This is O(n) in the number of points.

Solution 3 - Algorithm

You could store the points in a k-dimensional tree (kd-tree). Kd-trees are optimized for nearest-neighbor searches (finding the n points closest to a given point).

Solution 4 - Algorithm

I think this is a tricky question that tests if you don't try to overdo things.

Consider the simplest algorithm people already have given above: keep a table of ten best-so-far candidates and go through all the points one by one. If you find a closer point than any of the ten best-so-far, replace it. What's the complexity? Well, we have to look at each point from the file once, calculate it's distance (or square of the distance actually) and compare with the 10th closest point. If it's better, insert it in the appropriate place in the 10-best-so-far table.

So what's the complexity? We look at each point once, so it's n computations of the distance and n comparisons. If the point is better, we need to insert it in the right position, this requires some more comparisons, but it's a constant factor since the table of best candidates is of a constant size 10.

We end up with an algorithm that runs in linear time, O(n) in the number of points.

But now consider what is the lower bound on such an algorithm? If there is no order in the input data, we have to look at each point to see if it's not one of the closest ones. So as far as I can see, the lower bound is Omega(n) and thus the above algorithm is optimal.

Solution 5 - Algorithm

No need to calculate the distance. Just the square of the distance should serve your needs. Should be faster I think. In other words, you can skip the sqrt bit.

Solution 6 - Algorithm

This isn't a homework question, is it? ;-)

My thought: iterate over all points and put them into a min heap or bounded priority queue, keyed by distance from the target.

Solution 7 - Algorithm

This question is essentially testing your knowledge and/or intuition of space partitioning algorithms. I'd say that storing the data in an octree is your best bet. It's commonly used in 3d engines that handle just this kind of problem (storing millions of vertices, ray tracing, finding collisions, etc.). The lookup time will be on the order of log(n) in the worst case scenario (I believe).

Solution 8 - Algorithm

Straightforward algorithm:

Store the points as a list of tuples, and scan over the points, computing the distance, and keeping a 'closest' list.

More creative:

Group points into regions (such as the cube described by "0,0,0" to "50,50,50", or "0,0,0" to "-20,-20,-20"), so you can "index" into them from the target point. Check which cube the target lies in, and only search through the points in that cube. If there are less than 10 points in that cube, check the "neighboring" cubes, and so on.

On further thought, this isn't a very good algorithm: if your target point is closer to the wall of a cube than 10 points, then you'll have to search into the neighboring cube as well.

I'd go with the kd-tree approach and find the closest, then remove (or mark) that closest node, and re-search for the new closest node. Rinse and repeat.

Solution 9 - Algorithm

For any two points P1 (x1, y1, z1) and P2 (x2, y2, z2), if the distance between the points is d then all of the following must be true:

|x1 - x2| <= d 
|y1 - y2| <= d
|z1 - z2| <= d

Hold the 10 closest as you iterate over your entire set, but also hold the distance to the 10th closest. Save yourself a lot of complexity by using these three conditions before calculating the distance for every point you look at.

Solution 10 - Algorithm

basically a combination of the first two answer above me. since the points are in a file there's no need to keep them in memory. Instead of an array, or a min heap, I would use a max heap, since you only want to check for distances less than the 10th closest point. For an array, you would need to compare each newly calculated distance with all 10 of the distances you kept. For a min heap, you have to perform 3 comparisons with every newly calculated distance. With a max heap, you only perform 1 comparison when the newly calculated distance is greater than the root node.

Solution 11 - Algorithm

Calculate the distance for each of them, and do Select(1..10, n) in O(n) time. That would the naive algorithm I guess.

Solution 12 - Algorithm

This question needs further definition.

The decision regarding the algorithms that pre-index data varies very much depending if you can hold the whole data in the memory or not.

With kdtrees and octrees you will not have to hold the data in memory and the performance benefits from that fact, not only because the memory footprint is lower but simply because you will not have to read the whole file.

With bruteforce you will have to read the whole file and recompute distances for each new point you will be searching for.

Still, this might not be important to you.

Another factor is how many times will you have to search for a point.

As stated by J.F. Sebastian sometimes bruteforce is faster even on large data sets, but take care to take into account that his benchmarks measure reading the whole data set from disk (which is not necessary once kd-tree or octree is built and written somewhere) and that they measure only one search.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionKazoomView Question on Stackoverflow
Solution 1 - AlgorithmjfsView Answer on Stackoverflow
Solution 2 - AlgorithmWillView Answer on Stackoverflow
Solution 3 - AlgorithmmipadiView Answer on Stackoverflow
Solution 4 - AlgorithmKrisView Answer on Stackoverflow
Solution 5 - AlgorithmAgnel KurianView Answer on Stackoverflow
Solution 6 - AlgorithmDavid ZView Answer on Stackoverflow
Solution 7 - AlgorithmKaiView Answer on Stackoverflow
Solution 8 - AlgorithmJeff Meatball YangView Answer on Stackoverflow
Solution 9 - AlgorithmKirk BroadhurstView Answer on Stackoverflow
Solution 10 - AlgorithmYilingView Answer on Stackoverflow
Solution 11 - AlgorithmRubysView Answer on Stackoverflow
Solution 12 - AlgorithmUnreasonView Answer on Stackoverflow