# The shortest possible length of the compressed representation of a string

*Edit: This is a variant of the original question, in which instead of removing K number of characters we replace them with any character.*

Last week I’ve practiced some codility tests, which turned out a bit more difficult than I expected. For those of you who never had a chance to attend such exams, they consist of 3 tasks of varying difficulty, for which you’re given 90 minutes to complete. In the last test I received tasks which all would be marked as ‘hard’ by a leetcode/hackerrank standard. Please have a look at one of them, which I believe was the most tricky.

Strings with long blocks of repeating characters take much less space if kept in a compressed representation. To obtain the compressed representation, we replace each segment of equal characters in the string with the number of characters in the segment followed by the character(e.x we replace segment “CCCC” with 4C). To avoid increasing the size, we leave the one-letter segments unchanged(the compressed representation of “BC” is the same string – “BC”). For example, the compressed representation of the string “ABBBCCDDCCC” is “A3B2C2D3C”, and the compressed representation of the string “AAAAAAAAAAABXXAAAAAAAAAA” is “11AB2X10A”. Observe that, in the second example, if we replaced the “BXX” segment from the middle of the word with “A”, we would obtain a much shorter compressed representation – “24A”. In order to take advantage of this observation, we decided to modify the compression algorithm. Now before compression, we replace exactly K letters from the input string with letters which would be optimized for the compression algorithm. We would like to know the shortest compressed form that we can generate this way.

Write a function:

```
class Solution {
public int solution(String S, int K)
{
}
}
```

that given a string S of length N and an integer K, returns the shortest possible length of the compressed representation of S after removing exactly K characters from S.

Example:

Given S = “AAAAAAAAAAABXXAAAAAAAAAA” and K=3, the function should return 3, because after removing “BXX” from S, we are left with “AAAAAAAAAAAAAAAAAAAAA”, which compresses to a representation of length 3 – “21A”.

Requirement:

```
N is an integer within the range [1...100,000];
K is an integer within the range [0...100,000];
K <= N;
string S consists only of uppercase letters (A-Z)
```

First of all, I would prefer to avoid starting a discussion on whether or not such tests are good measurement for programming skills. Leaving all doubts aside, the truth is if you want to land a job at one of the top companies, you have to practice a lot, even though writing moderately difficult algorithms probably won’t be a part of your daily routine.

So how to approach this task? I usually start by looking at the boundary conditions:

```
N is an integer within the range [1...100,000];
K is an integer within the range [0...100,000];
K <= N;
string S consists only of uppercase letters (A-Z)
```

Ok, we don’t have to consider case sensitivity in char comparisons, fair enough. We may also notice that when given string is less than 3 characters, the result is always going to be equal to its length:

original | compressed | result |

“” | “” | 0 |

a | a | 1 |

ab | ab | 2 |

bb | 2b | 2 |

Another thing worth noticing is that when K is equal or 1 less than the string length, the result will be equal to the length of the string expressed as a string + 1:

original | k | compressed | result |

aaaaa | 0 | 5a | 2 |

abcde | 4 | 5a | 2 |

abcdefghij | 10 | 10a | 3 |

abcdefghijk | 10 | 11a | 3 |

So we start with:

```
public static int findCompressedLength(string S, int K)
{
if (S.Length < 3)
return S.Count();
else if (K >= S.Count()-1)
return S.Count().ToString().Length + 1;
}
```

Boundary cases are done, so now we can focus on the main problem. First, we need to find such permutation of letters, where the sequence of the repeating char is the longest. Possibly we could use a basic brute force solution here with the O(N*K) complexity to calculate K long ranges for all N letters in the string, however, this would surely fail the codility performance criteria. There’s another approach to this type of problems called sliding window or moving window with linear complexity.

How does it work? You start by using two pointers (current position of the char in string and the window start). Then with each loop, you store the char along with its current frequency in a dictionary or hashmap.

`if (charsWithFrequencies.ContainsKey(S[i]))`

` charsWithFrequencies[S[i]]++;`

`else`

` charsWithFrequencies.Add(S[i], 1);`

`repetitions = charsWithFrequencies[S[i]];`

Then you add a condition which checks of the distance from current index to the beginning of the window – number of the current letter repetitions + 1 (because length is denoted in 0 based convention) is bigger than the number of allowed letter replacement (K). Please take your time to wrap your mind around it.

`if (i - windowStart - repetitions + 1 > K)`

`{`

` charsWithFrequencies[S[windowStart]]--;`

` windowStart++;`

`}`

If so, you move the window forward and decrease the frequency of the letter, which was placed in a string and the window beginning index.

Great, half of the task is done already. Now, since we know that the algorithm is going to return the longest sequence, each time it’s saving this sequence, let’s save few more properties:

```
if (i - windowStart + 1 > longestSequence)
{
longestSequence = i - windowStart + 1;
longestSequenceInstance.EndingIndex = i;
longestSequenceInstance.StartingIndex = i - longestSequence + 1;
longestSequenceInstance.Letter = S[i];
longestSequenceInstance.Occurences = longestSequence;
}
```

I’ve created a simple struct just for legibility, in which we’ll store the letter with the longest sequence, it’s ending and starting indiced and the number of occurences:

```
public struct longestSequenceInstance
{
public char Letter { get; set; }
public int EndingIndex { get; set; }
public int StartingIndex { get; set; }
public int Occurences { get; set; }
}
```

The rest is quite simple, we replace part of the original string with the longest subsequence:

```
var workingString = new StringBuilder(S);
var replaced = workingString.Remove(longestSequenceInstance.StartingIndex, longestSequenceInstance.Occurences - 1)
.Insert(longestSequenceInstance.StartingIndex, String.Concat(Enumerable.Repeat(longestSequenceInstance.Letter, longestSequenceInstance.Occurences-1)))
.ToString();
```

```
int occurences = 1;
char currentChar;
var strBuilder = new StringBuilder("");
for (var i = 1; i < replaced.Length; i++)
{
currentChar = replaced[i];
if (replaced[i - 1] == replaced[i])
occurences++;
else
{
if (occurences > 1)
strBuilder.Append(occurences);
strBuilder.Append(replaced[i - 1]);
occurences = 1;
}
if (i == replaced.Length - 1)
{
if (occurences > 1)
strBuilder.Append(occurences);
strBuilder.Append(currentChar);
}
}
var stringFinal = strBuilder.ToString();
return stringFinal.Count();
```

Calculation complexity equals to O(N) + O(1) + O(N). Space complexity equals to 2N + 2logN.

Here’s the full listing (C#). Please let me know in comments if you found a more efficient way of solving this task.

```
public static int findCompressedLength(string S, int K)
{
if (S.Length < 3)
return S.Count();
else if (K >= S.Count()-1)
return S.Count().ToString().Length + 1;
Dictionary<char, int> charsWithFrequencies = new Dictionary<char, int>();
var windowStart = 0;
int longestSequence = 0;
var repetitions = 0;
var longestSequenceInstance = new longestSequenceInstance();
for (var i = 0; i < S.Length; i++)
{
if (charsWithFrequencies.ContainsKey(S[i]))
charsWithFrequencies[S[i]]++;
else
charsWithFrequencies.Add(S[i], 1);
repetitions = charsWithFrequencies[S[i]];
if (i - windowStart - repetitions + 1 > K)
{
charsWithFrequencies[S[windowStart]]--;
windowStart++;
}
if (i - windowStart + 1 > longestSequence)
{
longestSequence = i - windowStart + 1;
longestSequenceInstance.EndingIndex = i;
longestSequenceInstance.StartingIndex = i - longestSequence + 1;
longestSequenceInstance.Letter = S[i];
longestSequenceInstance.Occurences = longestSequence;
}
}
var workingString = new StringBuilder(S);
var replaced = workingString.Remove(longestSequenceInstance.StartingIndex, longestSequenceInstance.Occurences - 1)
.Insert(longestSequenceInstance.StartingIndex, String.Concat(Enumerable.Repeat(longestSequenceInstance.Letter, longestSequenceInstance.Occurences-1)))
.ToString();
int occurences = 1;
char currentChar;
var strBuilder = new StringBuilder("");
for (var i = 1; i < replaced.Length; i++)
{
currentChar = replaced[i];
if (replaced[i - 1] == replaced[i])
occurences++;
else
{
if (occurences > 1)
strBuilder.Append(occurences);
strBuilder.Append(replaced[i - 1]);
occurences = 1;
}
if (i == replaced.Length - 1)
{
if (occurences > 1)
strBuilder.Append(occurences);
strBuilder.Append(currentChar);
}
}
var stringFinal = strBuilder.ToString();
return stringFinal.Count();
}
}
public struct longestSequenceInstance
{
public char Letter { get; set; }
public int EndingIndex { get; set; }
public int StartingIndex { get; set; }
public int Occurences { get; set; }
}
```

# Points lying inside a circle – duplicated point variation

Given three parameters of equal size:

- string S – points names (duplicates allowed)
- int[] X – point coordinate
- int[] Y – point coordinate

Calculate the number of points lying within the circle ranging to the closest duplicated point excluding points crossing the edge of the circle.

Looking at the chart, the points we are interested in: B (1, 1), A (1, 2) and C (-1,-1). Point A (-2, -2) is excluded because it lies on the circle’s edge. Circle radius is determined upon the closest duplicated point. There are two sets of duplicated points on the above chart: B – ([2, -3], [1,1] and A – ([-2, -2], [1, 2], [-4, -4]).

Solution:

This task is quite easy once you understand that the sign of the coordinates doesn’t matter.

First, let’s start by adding a check if we have any duplicated point groups:

```
public static int getResult(string S, int[] X, int[] Y)
{
if (S.Distinct().Count() == S.Count())
return 0;
}
```

Secondly, we need to calculate the distance of each point from the center of the plane by using Pythagorean theorem:

Let’s extract it to another method:

```
private static decimal getPointHypotenuse(int x, int y)
{
return (decimal)Math.Sqrt((Math.Pow(x, 2) + Math.Pow(y, 2)));
}
```

```
public static int getResult(string S, int[] X, int[] Y)
{
if (S.Distinct().Count() == S.Count())
return 0;
List<Tuple<char, decimal>> list = new List<Tuple<char, decimal>>();
for (int i = 0; i < S.Length; i++)
{
list.Add(new Tuple<char, decimal>(S[i], getPointHypotenuse(X[i], [i])));
}
}
```

Next, since we can have N number of any type of point, let’s group them together in a sorted list:

```
var aggregated = list.GroupBy(x => x.Item1)
.Where(x => x.Count() > 1)
.Select(g => Tuple.Create(g.Key, g.Min(t => t.Item2)))
.OrderBy(x => x.Item2)
.ToList();
```

The last step is to pick the second lowest number from any of the groups and filter the list of points with their distances to include only points below the limit.

Here’s the full listing:

```
public static class PointsWithinACircle
{
public static int getResult(string S, int[] X, int[] Y)
{
if (S.Distinct().Count() == S.Count())
return 0;
List<Tuple<char, decimal>> list = new List<Tuple<char, decimal>>();
for (int i = 0; i < S.Length; i++)
{
list.Add(new Tuple<char, decimal>(S[i], getPointHypotenuse(X[i], Y[i])));
}
var aggregated = list.GroupBy(x => x.Item1)
.Where(x => x.Count() > 1)
.Select(g => Tuple.Create(g.Key, g.Min(t => t.Item2)))
.OrderBy(x => x.Item2)
.ToList();
var result = (from s in list
join a in aggregated on s.Item1 equals a.Item1
where s.Item2 > a.Item2
orderby s.Item2
select (s.Item1, s.Item2)).First();
var numberOfPoints = list.Where(x => x.Item2 < result.Item2).Count();
return numberOfPoints;
}
private static decimal getPointHypotenuse(int x, int y)
{
return (decimal)Math.Sqrt((Math.Pow(x, 2) + Math.Pow(y, 2)));
}
}
```