Welcome to Software Development on Codidact!
Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.
Queries to count points lying on arbitrary line
Suppose we have N points on XY plane, ie. (x, y) and x, y are integers and multiple queries where each query is of the form y = mx + c and m, c are integers.
Is it possible to count number of points that lie on the given line more efficiently than O(N) per query?
I believe some preprocessing might help or some data structure like QuadTree that counts points inside arbitrary rectangle.
The queries are offline so if it's possible using some batch processing technique or even SQRT optimization it would be better than naive brute force.
2 answers
Since you didn't mention memory, I'll assume O(N^2) is acceptable.
Setup pseudocode:
for every p1, p2:
ln = line(p1, p2)
points_on_line[ln.m, ln.c] += [p1, p2]
If there are at least 2 colinear points on your line, you can get them in O(1) with points_on_line[m, c]
. If not, you'll have to do brute force.
The initialization takes O(N^2) so it is practical if the number of searches you need to do is much more than N^2
(eg. 5 points, 1000 searches). Note that you can add points:
def add_point(p):
for m, c in points_on_line:
if p.y == m * p.x + c:
points_on_line[m, c] += [p]
This will be O(N), so adding new points is okay so long as you are doing a lot of searches for each point you add.
0 comment threads
Some answer-specific terminology to make describing things simpler:
- I will refer to the N points as candidate points.
- I will refer to the integer points inside the bounding box of the N points as lattice points.
Two approaches
There are two ways to approach this question:
- Check whether each of the N candidate points is on the line.
- Check whether each of the lattice points on the line is one of the N candidate points.
Which approach is most efficient will depend on the specific details of each case, and how much can be precomputed.
Example
As an example, consider the case where N = 50 and all 50 candidate points lie in the region where:
- -100 <= x <= 100
- -100 <= y <= 100.
How many lattice points in this region need to be checked to see if they are one of the 50 candidate points? There are 201 * 201 = 40,401 lattice points, but most of these do not need to be checked as the line does not pass through them. This affects which of the two approaches is more efficient:
- If the line is y = x + 1 then it will pass through 200 lattice points in this region.
It will be more efficient to check by asking whether each of the 50 candidate points satisfies the equation (rather than having to check all 200 lattice points). - If the line is y = 100x + 1 then it will only pass through 2 lattice points in this region.
It will be more efficient to check by asking whether each of the 2 lattice points is present in the set of 50 candidate points (rather than having to check all 50 candidate points).
This is simplified for the sake of example. In reality the check for presence in a set will not necessarily take the same time as the check for satisfying the equation. The point is that there will be some threshold where one approach is replaced by the other as the most efficient.
Improving the check for whether a candidate point satisfies the equation
In those cases where you are checking whether each of the N candidate points satisfies the equation, a data structure such as your suggestion of a quadtree (or perhaps a k-d tree) will allow you to exclude some of the N points, as they are in a region that the line does not pass through, so you will only have to check a subset of the N points.
Improving the check for whether a lattice point is in the set of candidate points
In those cases where you are checking whether each of the lattice points that the line passes through are in the set of N candidate points, you could avoid checking each of the N candidate points (checking for presence is O(N)) by putting them in a data structure such as a hashmap (checking for presence is O(1)).
Deciding which approach to take
Once you have made each of the two approaches as efficient as you can, you can have some method of estimating which one is best for a given line and set of candidate points. Alternatively you could start both approaches and take the answer from the approach that finishes first, terminating the other.
Which approach is more efficient will also depend on how efficiently you can determine the set of lattice points that a given line passes through within the bounding box of the N candidate points. If either the set of lines or the set of candidate points is available in advance for precomputing the lattice points or the bounding box then the resulting improved efficiency is likely to affect which approach is best for some cases.
2 comment threads