Threaded Mode | Linear Mode

komame · 09-18-2023, 06:48 AM

It's not that simple. Everything you're suggesting would be applicable for the English language dictionary. In that case, even binary encoding (using 5 bits per character, assuming a maximum word length of 12 characters) would be an excellent solution, and there would be no need for any conversions.

However, in this algorithm, I had to consider diacritic characters, and this greatly complicates the whole matter. Notice that one of the main assumptions of this algorithm was to search for similarities while taking into account the Latin equivalents of diacritic characters.
So, when searching for words with the pattern "DE?" in the result, you will find a list like this:
{"DER","DES","DEY","DÈS","DÉC","DÉP","DÉS"}.

As you can see, even though the second character is specified as "E", the algorithm also matched "È" and "É" (and vice versa) in this case. In such a situation, for searching, I need a dictionary without diacritic characters (otherwise, binary search wouldn't work), but when I find a match, I return the result with diacritic characters, so I also need it.

(09-17-2023 09:13 PM)jte Wrote: My first thought with these sorts of things is that there are distinct phases. Phase 1 is setting things up: getting the dictionary, getting it into the program [writing the program…], setting up search data structures, etc. — the performance clock isn’t engaged yet! Phase 2 is benchmarking: running the code against a performance meter. (So, with the example above, the “converting text to binary” isn’t part of “benchmark time”; it would not be done with each run, but be part of “set up”.)

That's why I mentioned that to apply your suggestion with binary encoding, two dictionaries would need to be maintained (even before the program is run). In my current solution, I have only one dictionary, and the second one (without diacritic characters) is generated during program execution using the REPLACE function for strings, which is low-level and very fast (just as many calls as there are different diacritic characters, and the entire dictionary is converted => see 'convAccents'). In the case of binary encoding, we would have to perform this conversion by iterating over individual words and converting letter by letter, additionally performing many bitwise operations, which would significantly impact the algorithm's performance even before reaching binary search.

(09-17-2023 09:13 PM)jte Wrote: One possibility I was going to mention, when reading the following, was that multiple sort orders could be used so that binary search could be employed even when the search pattern started with “?”. (N sort orders for N-letter words would be the obvious one, although letter pairs come next to mind [although only half of those would be needed — N(N-1)/2 instead of N(N-1). If that’s too much precomputation, letter clumps are another approach – this could still help limit the linear traversal.)

How would you like to perform this sorting in PPL? By what method? Regardless of the method you choose (even for letter clumps), this sorting would still take several times longer than the full search algorithm in its current form. We cannot use lists here because they have a limit of 10,000 elements, and we need to sort 50,000. Additionally, after this sorting, diacritic character conversion needs to be performed, which will result in additional iterations.

Currently, my algorithm operates by first converting the dictionary with diacritic characters into a dictionary with their Latin equivalents using REPLACE, which, in most typical queries, takes up to 100ms. In the worst-case scenario for a whole 9-letter dictionary (when the pattern begins with '?'), it involves over 470,000 characters, and such conversion in 13 iterations takes approximately 500ms (at that point, it often accounts for more than half of the total time required to return the result)
Next, binary searches are performed twice because, for the given pattern, it searches for both the initial and final words in the dictionary. This process takes 4-5 ms at worst (basically negligible). Then it starts scanning from the initial word and iterates through subsequent words, checking their compliance with the pattern until it reaches the final word, at which point it finishes its operation. Therefore, the scanning time is directly related to the number of words to scan, but in the worst case, when scanning over 5000 words, it takes about ~850ms (if only first letter is specified). However, it often involves only 100-300 words, which takes less than 50ms. In practice, in most cases, the entire algorithm, from the moment it starts running, performs around 30-50 comparisons during binary search and around 300-800 comparisons during scanning. So, summarizing everything together, these times are not that terrible.

However, if I were to apply any of your suggestions (binary encoding or sorting), and I wanted it to work efficiently, I would have to have this dictionary in the program at least twice (or multiple times) to avoid conversions. If I didn't want to double the dictionary's space usage, I would have to perform the conversion during program execution without using REPLACE, which would require iterating through 50000 9-letter words, letter by letter, and converting them one by one (almost half a million iterations for individual characters in total). Such a solution would probably work tens of times slower compared to the 1000-2000 comparisons I currently perform, taking into account the constant time for dictionary conversion.

Switching to an English dictionary would simplify things a lot Wink