Find the intersection of two sorted arrays.

Let’s called array1 as A and array2 as B, each with size m and n.

The obvious brute-force solution is to scan through each element in A, and for each element in A, scan if that element exist in B. The running time complexity is O(m*n). Not good! Can we do better? Absolutely!

First, we know that both arrays are sorted. Can we somehow use this information to our advantage?

We can apply binary search to search if an element of A exist in B. So, the only modification from the brute-force approach is modifying linear search to binary search. This seems like a good improvement, we manage to reduce the complexity to O(m*lg(n)).

Of course, you know you can trade space for running time by using a hash table. Is it really useful? We can definitely hash each element in B to an array index (takes O(n) time). Therefore, to find if an element of A exist in B, it would require just O(1) time. The complexity improves to O(m+n).

But there is a problem, what if n is very big? (ie, n is one billion!). We have a problem here. The hash table will either requires a large amount of memory space, or there will be lots of collision in the table, which makes access time no longer O(1) time. Therefore, using a hash table is not a good general solution to this problem. Besides, using hash table DO NOT require that the array being sorted in the first place.

Here is the most important observation in order to solve this problem. Both arrays ARE sorted. This provides a very important clue. We must make full use of this information that they ARE in fact sorted.

We can have two index, which both starts at zero. Compare the two first elements of A and B. If A[0] is greater than B[0], we increase index of B by one. If B[0] is greater than A[0], we increase index of A by one. If they are equal, we know an intersection has occurred, so add it to the list and increment index of A and B by one. Once either index reaches the end of A or B, we have found all the intersections of A and B.

The complexity of this approach is still O(m+n), but it does not requires any extra space that a hash table requires. The complexity is O(m+n) because in the worse case, there would be no intersection between the two arrays, and we need to increment first index a total of m times and increment second index a total of n times, which is a total of m+n times.

Below is the C++ code for this approach:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
vector<int> findIntersection(vector<int> A, vector<int> B) { vector<int> intersection; int n1 = A.size(); int n2 = B.size(); int i = 0, j = 0; while (i < n1 && j < n2) { if (A[i] > B[j]) { j++; } else if (B[j] > A[i]) { i++; } else { intersection.push_back(A[i]); i++; j++; } } return intersection; } |

Do you think that this approach always work better? Not necessarily… Think what happens when n is very large, say one billion…

Compare this approach with the binary search approach.

O(m+n) and O(m*lg(n))

lg(n) is much smaller than n when n is very big. However, this does not necessarily means binary search is better in this case. In fact, binary search approach is only better when m << n (m is very small compared to n). For example, when m = 1 and n = one billion, which method will you choose? Binary search is definitely the winner here.
All of our above approaches assume that we have enough space to load both arrays to the memory. Here are some interesting questions to ponder about:
i) What if elements of array B is stored on disk, and the memory is limited such that you cannot load all elements into the memory at once?

ii) How will the complexity change in this case? Are there any factors you need to consider?

iii) How do you change your solution to adapt to this situation?

The questions mentioned at the end of the article are quite interesting.

Wondering are there any solutions to those?

============================================

i) What if elements of array B is stored on disk, and the memory is limited such that you cannot load all elements into the memory at once?

ii) How will the complexity change in this case? Are there any factors you need to consider?

iii) How do you change your solution to adapt to this situation?

===========================================

Thanks!

0Well, the questions are quite open-ended, so there are no the 'absolute' correct solution. The solution I provided is to divide array B into segments and load it into memory.

Then you would need to worry about dividing the segments into what size, as this would affect the complexity. You would also need to derive the complexity in this case. There are also some optimization you could do, since array B is sorted, you can apply a binary search across all segments to help finding the only needed segments to load into memory.

0The limited memory case rules out the binary search solution because you dont want to read disk back and forth.

0back and forth will not happen actually. As the first array are also sorted. Each segment of the second array will be loaded at most once.

0What is the solution for these kind of large amount data problem? If there only have limited memory.

Segment the data in hard disk could be a solution.

is there any other solution for these kind of problems? Thanks.

0I am not sure of the solution too. Anyone mind sharing some insights?

It seems like "Programming pearls" would be the perfect book to read, as it discusses solutions for large scale problems.

I should have my copy of the book soon, if I get the answer I would update my post.

0If you have a limited number range. marking bits for each number u discovered in both the arrays and just counting the set bits in logical AND of these two should also be a good solution.

1.) You read from disk each array just once.

and given a range of 1.. 1 million will still only use 250K(approx) for each array.

0Back to the original solution, if the arrays can contain duplicate numbers, you might be counting same numbers multiple times. If that is really what you wanted, and if the arrays contains different counts of same number, you are missing some combinations. These must be good questions to ask your interviewer to clarify. They might give some extra points.

+1Correct me if I'm wrong, but the given code would have a complexity of O(min(m, n)) < O(m + n)

(Your post I believe indicates that this is O(m+n))

+2Hi Debajit, consider this case: what if all the elements in array A (whose size is m, relatively small) are all very big, close to those elements in array B (whose size is n, say, 1 billion), it is not O( min(m,n)) i think

0sorry, forgot to finish the sentence, * close to those elements in the end of Array B *

0@Debajit: A = [2,4,6] B=[1,100], it's in this case O(m+n), not O(min(m,n))

0@Anonymous : i think its O(min(m,n)) as we are only finding the intersection of the elements.

for example A= [4, 5, 10] and B=[5,10,10,16,18,90], in this case intersection would be =[5,10].

so once array A is finished scanning, you dont need to check for rest of the array B, so its O(3) in this case. Correct me if i am missing anything

0if A = {0, 2, 4, 6, 8, 10} B = {1, 3, 10} It is O(m+n)

0Pingback: Intersection of two arrays. | I Live to Seek…

for i) I think B+ tree should be a good choice.

0original prob can be still optimised,

it does binary search on both the arrays and finds the intersection.

o(logn +logm)

int func(int a[],i1,j1, int b[],i2,j2)

{

while(i1<j1)

{

int m1 = j1-i1/2;

int m2 = j2-i2/2;

if(j1-i1 ==1)

{

if( (a[i1]==b[i2]) || (a[i1]==b[i2+1]) )

return a[i1];

if( (a[j1]==b[i2]) || (a[j1]==b[i2+1]) )

return a[j1];

return -1;

}

if(a[m1] == b[m2])

return a[m1]

else if(a[m1]<b[m2])

{

i1 = m1;

j2 = m2;

}

else

{

j1=m1;

i2=m2;

}

}

}

+2When using binary search we can use the fact that the arrays are sorted to narrow down the search. The key thing about binary search being it points to the index where the element being search ‘should’ be present if it was present in the searched array.

So for example if we have A = {100, 200, 300} B= {1, 10, 100,1000}. We pick A[0] = 100 and find it is present in B[2]. Now all elements in the range A[1…] would be greater than A[0] and since B is sorted too they can occur only in B[2…]. This narrows down the segment of array we are looking into. But then again the worst case scenario still remains the same – two non intersecting arrays. In such a case we would search for all nodes.

But there is a minor optimization to such non intersecting cases too. Lets say A = {0, 4, 5, 6} and B = {1, 2, 3}. When you search for A[0] = 0 you won’t find it but when you search for A[1]=4 inside B using binary search it says the element is not present AND it should be present next to B[2] but since the size of B is 3 itself you can deduce rest of the elements of A are greater than B hence there is no intersection possible any more. Then again this is an optimization which depends on actual values present in the arrays. One can always create a worst case scenario.

I observed there is a pattern which I call the ‘merge pattern’ between problems such as what you have given here, merge sort, adding two polynomials where a polynomial is given as a linked list and each node represents a term in the polynomial by a pair of co-efficient and power (so 10 + 9x + 4 x^2 + 3 x^10) is represented by (10,0)->(9,1)->(4,2)->(3,10). Or if given a number as a linked list you have to add two numbers. All of them iterate through two lists

In your post you said: “All of our above approaches assume that we have enough space to load both arrays to the memory”

This is not the case about the algorithm you wrote. You only read one record at a time and you only append to the result file in each iteration without any random access so this is actually a good solution for the huge data.

0@thequark:Interesting comments!

000merge sort or hash merge sort is the solution for problems in the end. It’s used for database query optimizations. if you have limited memory, you just read the head of two arrays one by one. Or you can just hash the smaller array if m<<n.

0I think you got wrong idea, you use a vector to store the common data, but the vector also will take space

0Pingback: Finding intersection of two sorted arrays | TechInterviewPrep

I think we can combine these two solutions. We suppose that the tow arrays are sorted INCRE. First, O(m+n) solution, we found that a[i]=b[j]. Then next time, bisearch for a[] is from a[i+1] to a[n].

Thank you ðŸ˜€

0nice approach, step by step

0External Sorting

http://en.wikipedia.org/wiki/External_sorting

0Pingback: LeetCode | Technical interview solutions

There is some ambiguity here. If there are duplicated values in the sorted array, and it happens to be the intersection point, do we count as one? Or do we need a little linear search for all the match?

If we can suppose in the each sorted array, the value is unique, (we can label a count under each value to achieve this), I think thequark’s method is good, however, we can further improve it:

First we need a binary search function that can return index where the searching value should be insert, no matter if there is an exact match. (bsearch in C would not return index if not found).

For A[m] and B[n]; start from i = j = 0;

if A[i] == B[j], it is an intersection, i++ and j++, then restart this step.

else, we choose the larger value, binary search it in the other array (the starting index should be i, so that we have shorter array), it will return index j’

update index so that j = j’. go back to step 1 to check if A[i] == B[j];

This would cause the bsearch happens not only in one array, why? Say

A[] = { 1 2 3 4 5 6 7 8 30 }

B[] = { 0 2 20 24 25 26 30 ….. }

if we keep search A[i] in B, we find 2, then loop through 3 4 5 6 7 8, which are a waste

The modified steps are:

1. i = 0, j = 0, A[0] != B[0]

2. A[0] > B[0], therefore search A[0] in B, among index range [0, n], find it should be inserted into index j’ = 1. let j = 1 now.

3. A[0] < B[1], therefore search B[1] in A, among index range [0, m], find it should be inserted into index i' = 1. let i = 1

4. A[1] = B[1] = 1, find intersection, i ++, j ++

5. A[2]

B[2], search A[8] in B, among index range [2,n], find it should be inserted into j’ = 67 Compare A[8] B[6], find a match, now update i = 9, j =7, continue

0This solution may contain duplicate entries in it. What if I want a set as the intersection output?

0write a program to produce two

modified timetables, one for Jai Bus Company and one for Veeru Bus

Company, each satisfying the following requirements:

Jai 10:15 11:10

Jai 10:10 11:00

Veeru 10:10 11:00

Veeru 16:30 18:45

Jai 12:05 12:30

Veeru 12:30 13:25

Veeru 12:45 13:25

Jai 17:25 18:01

this is the input and what will bw the output and source code for this

0