Primitive root mod p

Transire suum pectus mundoque potiri

Deleting duplicate lines from file 5 January, 2008

Filed under: Programming — Nikolas Karalis @ 10:27 pm

I’ve been fighting with a computational problem for many days, but i haven’t even come close to an acceptable solution.

The problem statement :

Given a file of n lines, return the index number of the duplicate lines (where index line is 0 for the first line, 1 for the 2nd etc.)

It may sound trivial, even silly, but i can assure you it is not. It can be relatively simple for small number of lines.

But as n raises, the problem becomes exponentially harder.

When i first faced the problem, i had to deal with a few hundred thousands of lines. So, i came up with a simple python code to do it.

When the first difficulties appeared, i came up with a faster solution.

  def duplicates(sequence):
	visited={}
	dupl=[]
	for x in sequence:
		if x in visited: dupl.append(x)
		visited[x]=1
	return dupl

However, now that i have to find the duplicate lines in files with 15 million. 26 million and more, it is impossible to use this code, since it returns memory errors.

So i found another idea, which is REALLY slow for now.

  1. Sort the file with the windows or unix command : sort
  2. Use unix command uniq -d, to get a list of the duplicated lines.
  3. Use unix command grep -n, on the unsorted file for every line in the previous list, to get which lines are duplicated.
  4. Use a simple python script to parse the result and get only the the integers we are interested in.

However, the grep part is REALLY slow for huge files. So, my problem remains. However, i reduced the problem from removing duplicated from a file, to simply getting fast the index of a line in a file or equivalently, fast iteration and comparison of the file lines.

After extended digging in the Internet, i was not able to find any efficient algorithm or implementation.

 

Leave a Reply