Administrating DS 4
Indexing
Indexing/URLs
This page lets you register URLs to be indexed, URLs not to be
indexed, external filters, and file extensions to be indexed.
- Indexing URLs
URLs to be added/modified
Type the new or modified URLs of the sites you wish to index in
the box, then choose from the following options.
Options
- Bypass web robot rules (1): Check this box if you want DeepSearch
to index URLs even if web robot rules (e.g., robots.txt) apply
to them.
Example: 1http://www.namo.com
- Index CGI pages (2): Check this box if you want DeepSearch to follow
and index CGI pages or files (e.g., bulletin boards, guest books).
Example: 2http://www.namo.com
- Do not Index (4): Check this box if you do not wish to index the site.
Example: 4http://www.namo.com
If you check all the boxes and then add or modify the URL, the new
or modified URL will have the value 7 (1+2+4) assigned to it
(e.g., 7http://www.namo.com).
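The way the option values add up can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not DeepSearch code: the function names are hypothetical, and leaving a URL unprefixed when no box is checked is an assumption, since the page does not say what happens in that case.

```python
import re

# Option values taken from the list above.
BYPASS_ROBOT_RULES = 1
INDEX_CGI_PAGES = 2
DO_NOT_INDEX = 4


def add_option_value(url, bypass_robot_rules=False, index_cgi_pages=False,
                     do_not_index=False):
    """Prepend the sum of the checked option values to the URL."""
    value = 0
    if bypass_robot_rules:
        value |= BYPASS_ROBOT_RULES
    if index_cgi_pages:
        value |= INDEX_CGI_PAGES
    if do_not_index:
        value |= DO_NOT_INDEX
    # Assumption: with no option checked, the URL is stored without a prefix.
    return f"{value}{url}" if value else url


def split_option_value(entry):
    """Split an entry such as '7http://...' into its options and its URL."""
    match = re.match(r"(\d*)(.*)", entry)
    value = int(match.group(1) or 0)
    options = {
        "bypass_robot_rules": bool(value & BYPASS_ROBOT_RULES),
        "index_cgi_pages": bool(value & INDEX_CGI_PAGES),
        "do_not_index": bool(value & DO_NOT_INDEX),
    }
    return options, match.group(2)


print(add_option_value("http://www.namo.com", True, True, True))
```

Because the three values are distinct powers of two, every combination of checked boxes produces a different sum, so the prefix can always be decoded back into the individual options.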
Specify CGI Handler
If you wish to index non-HTML files stored in a database, you
need to specify the URL of a CGI program that connects DeepSearch
to the database to be indexed. In most cases, the program will be
in the /cgi-bin/ directory and the URL will be
http://www.namo.com/cgi-bin/. If you are using DeepSearch with a
web hosting service, your CGI URL may take the form
http://xxx.xxx.xxx/~userid/cgi-bin/.
Refer to the CGI-Handler link at http://www.namo.com/deepsearch/
for more information on what CGI handlers are and how to make
them.
- Use Exclusion Rules
Type the URLs, directories, or files you do not wish to index,
using the appropriate variables and operators. For example, to
exclude http://www.namo.com/download/index.html from the URLs to
be indexed, type the following exclusion rule in the box:
$SITE=www.namo.com&$PATH=/download/&$FILE=index.html
DeepSearch treats uppercase and lowercase letters as the same.
If you prefer to use only the $URL variable, type the following
exclusion rule in the box instead:
$URL=www.namo.com/download/index.html
If you wish to exclude two or more sites, directories, or files,
use operators such as & (AND), | (OR), * (wildcard), and
^ (substring).
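How the two example rules relate to a URL can be sketched as follows. This is a simplified illustration, not DeepSearch's own matcher: the function names are hypothetical, only `&`-joined `$VARIABLE=value` terms are handled, and the |, * and ^ operators are left out because this page does not spell out their exact behavior.

```python
import posixpath
from urllib.parse import urlsplit


def url_variables(url):
    """Break a URL into the parts the rule variables refer to (lowercased)."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return {
        "$SITE": parts.netloc.lower(),
        "$PATH": directory.lower(),
        "$FILE": filename.lower(),
        "$URL": (parts.netloc + parts.path).lower(),
    }


def is_excluded(url, rule):
    """Return True if every '&'-joined term of the rule matches the URL."""
    variables = url_variables(url)
    for term in rule.split("&"):
        name, _, value = term.partition("=")
        # Unknown variable names never match, so the rule does not apply.
        if variables.get(name.strip().upper()) != value.strip().lower():
            return False
    return True


rule = "$SITE=www.namo.com&$PATH=/download/&$FILE=index.html"
print(is_excluded("http://www.namo.com/download/index.html", rule))
```

Both sides of each comparison are lowercased, which mirrors the case-insensitive behavior described above.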
- External Filters
By assigning external filters, you can make non-HTML files
viewable in web browsers as well. DeepSearch already assigns
filter programs for common document formats such as DOC, XLS, PPT,
and PDF files. If there is a file format you wish to add to
the list, type its filename extension in the box and assign
a filter program from the drop-down menu. If a specific document
needs a specific filter program, use “File Manager” to upload it.
- File Extensions to be Indexed
In this section, you can add, delete, or modify file extensions
to be indexed.
Indexing/Options
Indexing/Options is composed of Basic, Advanced, Level Check,
and Automatic Indexing options. Setting these options appropriately
lets DeepSearch perform its search operations more accurately and
swiftly.
Basic Options
Indexer
Choose the type of indexer: either the default indexer or the
advanced indexer (if installed).
Web Robot Speed
Choose the speed at which the robot retrieves pages from the
target website for indexing. If the target website and DeepSearch
are on the same computer (server), choose 0 for the fastest speed.
If the target website is on a remote server, choose 2 or greater
(0 = fastest, 9 = slowest).
Indexing Method
Full indexing takes a long time. You can therefore choose
incremental indexing (Incremental) if you want DeepSearch to index
only added or modified pages. You can also have DeepSearch index
only the pages that are visited most often, selected using
artificial intelligence (Smart).
Robot Threads
Set the number of robots you wish to run during indexing.
Increasing the number of threads increases the indexing speed,
but it also increases the traffic on the target websites.
HTTP Header Information (name:value)
Assign a header value that grants DeepSearch permission to index
sites that require authentication. Site administrators can
regulate permissions by checking for the same value as the
DeepSearch HTTP Header value in their authentication algorithm.
CGI Handler Page
Set the time limit and the number of attempts for following
CGI pages.
Advanced Options
Index Numbers
Check this box if you wish to index numerical characters as well.
Follow Commented Links
Check this box if you wish to follow and index links that are
commented out, such as <!-- <a href="aa.html">xxxxx</a> -->.
Index Words in Stop Lists
Check this box if you want DeepSearch to index very common words
such as "is", "and", and "or".
Follow Javascript/Flash Links
Check this box if you wish to follow and index links found in
JavaScript and Flash.
Number of DB files
DeepSearch saves indexed files in the form of a database. Here
you assign the number of database files you wish to use. In most
cases one DB file is sufficient, but if there are a lot of pages
to be indexed, increase the number of DB files. If you choose two
or more DB files when there aren't many pages to be indexed,
indexing or searching may slow down.
Detect Image Sizes
To display the original size (width and height) of an image file
on the result page, you can record image sizes while indexing,
detect them while searching, or not record them at all.
Size of Indexing Buffer
Assign the number of words you wish to hold temporarily in buffer
memory (RAM) during indexing. A large number makes indexing faster
but uses more memory. A small number makes indexing slower but
uses less memory.
Directory Depth Limit
DeepSearch indexing works by following trails of links starting
from the target URL. However, the indexing operation will never
stop if there are circular links or if a CGI program keeps
producing incorrect links indefinitely. To prevent this, you can
limit the directory depth that DeepSearch will index. The default
value of 10 stops the indexing operation after links have been
followed 10 times.
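The effect of a depth limit can be illustrated with a small breadth-first crawl. This is a sketch under assumptions, not DeepSearch's indexer: `crawl` and `get_links` are hypothetical names, pages are plain strings rather than fetched documents, and depth is counted as the number of links followed from the target URL.

```python
from collections import deque


def crawl(start, get_links, depth_limit=10):
    """Visit pages breadth-first, following links at most depth_limit times."""
    seen = {start}                 # guards against circular links
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == depth_limit:   # do not follow links any deeper
            continue
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


def endless_links(page):
    """Stand-in for a CGI program that always produces one more new link."""
    return [page + "next/"]


print(crawl("/", endless_links, depth_limit=3))
```

Remembering visited pages is enough to stop circular links, but not a program that generates a new link every time. Only the depth limit ends the crawl in that case.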
Number of Batch Files
Assign how many documents DeepSearch should process at once for
indexing. A smaller value decreases the processing speed but uses
less hard disk space. If your web server is short of disk space,
it is highly recommended to decrease this value.
Maximum File Size
Here you can limit the size of files to index. Web documents are
usually fairly small. If an HTML file is very large (10 megabytes,
for example), it is probably a non-HTML file, such as a multimedia
file, that has been given the .html extension. If DeepSearch tries
to index such files, an error will occur.
** Performing only incremental indexing can lead to inaccurate
search results. To prevent this, you need to perform a full
indexing after every X incremental indexing runs.
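The note above amounts to a simple repeating schedule, sketched below. The function name and the choice to start with a full run are assumptions made for illustration. X is whatever number of incremental runs you decide on.

```python
def indexing_mode(run_number, incremental_runs_between_full):
    """Return 'full' or 'incremental' for the given run (counting from 0).

    Assumption: the very first run is a full indexing, followed by
    X incremental runs, then another full indexing, and so on.
    """
    cycle_length = incremental_runs_between_full + 1
    return "full" if run_number % cycle_length == 0 else "incremental"


# With X = 3: one full run, three incremental runs, then a full run again.
print([indexing_mode(run, 3) for run in range(5)])
```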
Level Check
Some sites restrict access by requiring authorization or a login.
DeepSearch supports 16 authorization levels, from level 0 to
level 15. To access these pages, you must use the CGI handler
(*.php, *.asp, *.jsp, etc.) to find out the cookie level value
and save this information in the DeepSearch indexing database.
- Cookie Name: Type the name of the variable in the CGI handler
that provides the cookie level information.
For example, the following CGI handler, login_proc.php,
will have the cookie name cookie_level:

- Program Name: Type the file name (without extension) of a
program that decodes cookie level values, if the values are
encoded.
An example source of a decoding program:

Automatic Indexing
To keep search results up to date, it is highly recommended to
index as often as possible. Because starting indexing manually
every time is inconvenient, Namo DeepSearch provides automatic
indexing.
- Specify Time of Day: Specify the desired day of the week
and the time of day.
- Specify Interval: Specify the desired interval in minutes
between indexing jobs.
Indexing/Categories
This option lets you structure the contents of your website
into categories, so that users can better understand how your
website is organized.