Welcome to Telelogic Product Support
  Home Downloads Knowledgebase Case Tracking Licensing Help Telelogic Passport
Telelogic DOORS (steve huntington)
Decrease font size
Increase font size
Topic Title: Search Algorithm
Topic Summary:
Created On: 30-Oct-2008 15:42
Status: Post and Reply
Linear : Threading : Single : Branch
Search Topic Search Topic
Topic Tools Topic Tools
Quick Reply Quick Reply
Subscribe to this topic Subscribe to this topic
E-mail this topic to someone. E-mail this topic
Bookmark this topic Bookmark this topic
View similar topics View similar topics
View topic in raw text format. Print this topic.
 30-Oct-2008 15:42
User is offline View Users Profile Print this message


Louie Landale

Posts: 2070
Joined: 12-Sep-2002

I imagine that everyone that has complicated DXL has written a search routine, finding the location of the search string in the in string or buffer. The easy way to do this is to turn the search string into a regular expression, then let it do the search. Something like this:

RegExp re = regexp(SearchString)
if (re InBuff)
return(match 0))
else return(-1)

However, my tests years ago showed that the 'regexp' command was extremely time consuming, and running 1000s of searches was extremely slow. So I wrote an actual fGetOffset(Buffer bufIn, string SearchString, int StartLoc) as follows:

[1] using 'contain' look for the first character of SearchString in the buffer, starting at StartLoc.
[2] If found, look individually for the first 4 characters of SearchString in the corresponding buffer positions
[3] If found, then extract that string portion of the buffer and compare that to the complete SearchString.

It occured to me this week end (at the UGC), that perhaps I could use the RegExp method without resorting to 1000s of regexp commands. Basically storing all the RegExp used so far in a Skip, and searching the skip as follows:

Skip g_skpGetOffsetREs = createString() // Global skip list, used only by fGetOffset()
// KEY: 'string' SearchString; DATA: 'RegExp' corresponding to SearchString.
int fGetOffset(Buffer buffIn, string SearchString, int StartLoc)
{ RegExp re
if (!find(g_skpGetOffsetREs, SearchString, re))
re = regexp(SearchString) // First time searching for this, define re
else{} // Have searched for this before, used stored re
....
}

I think this lets us have our cake and eat it too. What do you think?

- Louie
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 03:08
User is offline View Users Profile Print this message


Carsten Winckler

Posts: 3
Joined: 18-Aug-2008

no good idea - think it over!
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 14:59
User is offline View Users Profile Print this message


Reik Schroeder

Posts: 361
Joined: 28-Jul-2003

Hi Louie,

I also think, that is not an good idea to reuse the regulars ... It looks strange :D
Attached, you will find a more efficient way of "contains".

While working on strings, I've found a very strange behaviour:
Extrating chars from string ([]) needs linear effort - that means it needs 10 times the time to get 10th char than the first one. So looping through large strings cahr by char might be very slow.
Text buffers do not have this behaviour!


Hope that will be intersting for you
Greetings
Reik

-------------------------
Evosoft GmbH
for Siemens Industry Sector


Berlin, Germany
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 15:12
User is offline View Users Profile Print this message


Louie Landale

Posts: 2070
Joined: 12-Sep-2002

Your algorithm fails when the searched string contains some prefix of the search string; try this: stringContains("DoDont", "Dont"); it should find it but it doesn't.

- Louie
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 15:18
User is offline View Users Profile Print this message


Reik Schroeder

Posts: 361
Joined: 28-Jul-2003

Oh, sorry, you are right ...
small correction solves this problem.


Greetings
Reik

-------------------------
Evosoft GmbH
for Siemens Industry Sector


Berlin, Germany
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 18:25
User is offline View Users Profile Print this message


Louie Landale

Posts: 2070
Joined: 12-Sep-2002

Still fails, try this:
print "[" (stringContains("DoDoDnt", "DoDnt")) "]\n"

You need two loops, inner loop J for each value of I.
- Louie
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 18:52
User is offline View Users Profile Print this message


ron lewis

Posts: 650
Joined: 20-Sep-2004

Don't know if findPlainText function is faster or slower but it is lot easier than algorithms proposed here.


See demo

//Demo
/*Demo finds substring*/

int iOffset=-1
int iLength
string sInput="DoDoDnt"
string sFind ="DoDnt"
bool matchCase=true
findPlainText(sInput,sFind, iOffset, iLength, matchCase)
if(iOffset>-1) print iOffset ""
else print "Not Found"

Edited: 4-Nov-2008 at 19:04 by ron lewis
Report this to a Moderator Report this to a Moderator
 4-Nov-2008 20:16
User is offline View Users Profile Print this message


Carsten Winckler

Posts: 3
Joined: 18-Aug-2008

Yes, native functions should be prefered due to much better runtime.
Especially if the function will be called as often as "stringContains" or "stringFind".
Just make your own by wrapping in case of error-catching or interface-simplification.
In case of regExps you must know how to use them - then this is also an option for efficient stringContains-functionality.

Always keep in mind how the interpreter is working.
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 07:03
User is offline View Users Profile Print this message


Reik Schroeder

Posts: 361
Joined: 28-Jul-2003

Hi Louie,

it seems that you are the best tester alive


Another small chnage - hopefully you won't found an bad case again

Greetings
Reik

-------------------------
Evosoft GmbH
for Siemens Industry Sector


Berlin, Germany
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 07:30
User is offline View Users Profile Print this message


Reik Schroeder

Posts: 361
Joined: 28-Jul-2003

Hi Ron, Hi Louie,

small test shows, that Ron's method seems to be the fastet one.
It will be much faster than the other ones ....

Unfortunately my method is the lamest one

Greetings
Reik

-------------------------
Evosoft GmbH
for Siemens Industry Sector


Berlin, Germany
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 14:25
User is offline View Users Profile Print this message


ron lewis

Posts: 650
Joined: 20-Sep-2004

Logical extension of this discussion is a replacement function that is fast -- attached is one I use.
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 14:53
User is offline View Users Profile Print this message


Louie Landale

Posts: 2070
Joined: 12-Sep-2002

When I started my DXL carreer I was tasked with changing the company accronym in all the requirement specs. I don't recall exactly the accronyms or issues, but discovered the hard way about finding and replacing strings that were partial sub-sets of each other. I ended up coming up with a now-dead function 'fReplaceStringSafe' that would replace all occurances of A with B, so long as the position of A didn't already contain B; perhaps it would replace "ITT" with "ITTN" so long as "ITTN" wasn't already there; thus preventing "ITTN" from becoming "ITTNN". Anyway, that experience was in the back of my mind reading your code.

Cannot find an example where your code fails, but [1] Modifying the main loop variable inside the loop give me major HeeBeeJeeBees. No, use nested loops where the inner J loop has a break statement. [2] Your second 'if' statement should be an 'elseif' statement. [3] you don't need to search the entire length of s1; only the first so-many characters: if both strings are the same length you need to search from only the first character since the sub-string cannot start in the 2nd character. Therefore you main loop should perhaps look like this: for (i=0; i<Len1-Len2 && j < Len2; i++). [4] you can drastically improve performance by using global but private Buffers for this function, creating them only once (in the main code) and never deleting them, just re-using them. That is, move the 'Buffer b1 = create()' statement outside the function. Since its global you'll need to give it a clever name; I would use 'gl_buf1StringContains'.

Digressing, when I started coding PL1 using punch cards in college, I punched up 1000s of cards all with a semi-colon in the last column, since I hated having to put a semi-colon at the end of a line that was OBVIOUSLY the end of the statement. I see you put semi-colons, even though DXL is clever enough to know the end of the statement. Not sure why I said that.

I'll do some testing to find the 'best' algorithm, but suspect that checking buffers as you have done will prove faster than checking strings with findPlainText. This is very important to my code since my library functions fGetOffset and fReplaceString are used 100s of 1000s of times for my big scripts.

- Louie
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 15:47
User is offline View Users Profile Print this message


ron lewis

Posts: 650
Joined: 20-Sep-2004

Here is an alternate that uses buffers -- don't know if it is faster.

Perhaps Reik may want to write a test fixture to test.
Report this to a Moderator Report this to a Moderator
 5-Nov-2008 15:47
User is offline View Users Profile Print this message


ron lewis

Posts: 650
Joined: 20-Sep-2004

Here are two more alternates that uses buffers -- don't know which of the three I presented in this forum is fasts.

Perhaps Reik may want to write a test fixture to test.

Edited: 5-Nov-2008 at 16:36 by ron lewis
Report this to a Moderator Report this to a Moderator
 6-Nov-2008 07:22
User is offline View Users Profile Print this message


Reik Schroeder

Posts: 361
Joined: 28-Jul-2003

Hi Ron, Hi Louie,

first an answer to Louie's question about "semicolon" - we are using Doxygen to generate Code documentation. Doxygen needs the semicolons to detect the end of line - otherwise the resuling documention is not as expected

I testet a little bit more with the first three functions:

  • Regular expression cached in Skip list
  • linear search algorithm on Buffer
  • internal findPlainText function from Telelogic


After implementing Louie's suggestion to use global text Buffer my function was much faster than before - I do not like to have global dynamic variables but here the performance is so much faster that I could make an exception

Ron's replace functions I did not testet because the original post was about searching a string, not replacing it.

With longer strings the runing times become another ratio:
Louie's method becomes the fastest, but get problems with characters with special meaning in regular expression, e.g. .+*
My method is the second
And Ron's findPlainText is the slowest.

If you want to try your own - the code is attached to this message...

Greetings
Reik

-------------------------
Evosoft GmbH
for Siemens Industry Sector


Berlin, Germany

Edited: 6-Nov-2008 at 07:25 by Reik Schroeder
Report this to a Moderator Report this to a Moderator
 7-Nov-2008 16:10
User is offline View Users Profile Print this message


ron lewis

Posts: 650
Joined: 20-Sep-2004

Reik, Either of the following two attached function appears to be faster if you want to test.

int FastestFind (Buffer b1, string s2)
{ if(matches(s2,tempStringOf(b1))) return start 0
return -1
}

int FastFind (string s1, s2)
{ if(matches(s2,s1)) return start 0
return -1
}
Report this to a Moderator Report this to a Moderator
Statistics
20925 users are registered to the Telelogic DOORS forum.
There are currently 1 users logged in.
The most users ever online was 15 on 15-Jan-2009 at 16:36.
There are currently 0 guests browsing this forum, which makes a total of 1 users using this forum.
You have posted 0 messages to this forum. 0 overall.

FuseTalk Standard Edition v3.2 - © 1999-2009 FuseTalk Inc. All rights reserved.