FEATURES

  • Realtime indexing with HTTP calls and JSON data format
  • Redis request buffering and asynchronous statement execution to increase throughput and reduce application latency
  • Batch insert for faster index rebuilding
  • Searching with a simplified JSON interface for specifying filters, sorting and grouping
  • Highlighted excerpts that comply with Sphinx query syntax
  • Manage your index and searchd configuration in a relational MySQL schema
  • High concurrency with the Nginx web server
  • Ability to extend by writing and attaching scripts to the Python Django Web framework
  • Lightweight with simple deployment and installation

HOW IT WORKS

The logic behind Techu is pretty straightforward, as you can see in the flow diagram to the right. Application code sends an HTTP request that corresponds to a specific action. There are two major groups of operations: indexing and searching. Indexing involves inserting a document into the index, deleting a document from it, or modifying a document's attributes or text fields. Searching, on the other hand, involves performing full-text searches and retrieving highlighted excerpts.

In the current beta version, most request data is passed via a single data parameter with a JSON-formatted value. The exceptions usually involve requests that handle Sphinx configurations, but in the future all requests will follow this protocol for simplicity and uniformity. Oh, and yes, now you can keep your Sphinx configurations in order by storing them in Techu's MySQL DB schema (although this feature can also be bypassed). On each regeneration command, Techu automatically restarts the corresponding searchd.
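As a rough illustration of the single-parameter protocol described above, the sketch below builds the form-encoded body a client would send: one data field whose value is a JSON document. The helper name encode_request is an assumption made for this example and is not part of Techu.

```python
import json
from urllib.parse import urlencode, parse_qs


def encode_request(payload):
    """Return a form-encoded body with the payload as JSON in a single 'data' field."""
    return urlencode({"data": json.dumps(payload)})


body = encode_request({"q": "(mysql php issue) | python"})

# Decoding the body again recovers the original payload.
decoded = json.loads(parse_qs(body)["data"][0])
print(decoded["q"])
```

This is the same encoding that curl performs with --data-urlencode data='...' in the examples further down.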

After the application dispatches a request, Nginx receives it and the Django Python Web framework processes it. As you can see in the diagram, indexing operations can optionally be queued for asynchronous execution. In that case, a Redis key is returned as the response, and the request is later converted to SphinxQL and sent to Sphinx by a script referred to as the applier (we probably should find a better name for this).
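The queue-then-apply flow can be sketched roughly as follows. This is not Techu's applier: a collections.deque stands in for a Redis list so the example runs without a server, and the key format and the function names enqueue and apply_pending are hypothetical.

```python
import json
import uuid
from collections import deque

# In-memory stand-in for a Redis list; a real deployment would push and pop
# on the Redis server instead (RPUSH / LPOP).
queue = deque()


def enqueue(action, index_id, payload):
    """Queue an indexing request and return the key handed back to the client."""
    key = "techu:queue:" + uuid.uuid4().hex  # hypothetical key format
    queue.append(json.dumps({"key": key, "action": action,
                             "index_id": index_id, "data": payload}))
    return key


def apply_pending(execute):
    """Drain the queue in FIFO order, passing each request to `execute`."""
    applied = 0
    while queue:
        request = json.loads(queue.popleft())
        execute(request)
        applied += 1
    return applied


key = enqueue("delete", 28, {"id": 16528355})
executed = []
count = apply_pending(executed.append)  # `executed.append` plays the role of Sphinx
```

The client gets its key back immediately, while the statement itself runs only when the queue is drained.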

If no queueing is required, the data is converted to a SphinxQL statement (or several statements if you are batch inserting documents) and executed immediately. For a searching operation, either a full-text search or highlighted excerpts, the response can originate directly from Redis (cache). If there is no cache entry, the attribute filters and the query are converted to SphinxQL and the results are retrieved from Sphinx directly.
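To make the conversion step more concrete, here is a minimal sketch of turning one or more documents into a single SphinxQL INSERT with multiple VALUES rows. It assumes every document has the same keys and uses %s placeholders in the style of MySQLdb; the function name build_insert is hypothetical and the statements Techu actually generates may differ.

```python
def build_insert(index, documents):
    """Build one parameterized SphinxQL INSERT covering every document.

    Assumes all documents share the same keys and that `index` comes from a
    trusted source, since it is interpolated directly. Returns (sql, params).
    """
    columns = sorted(documents[0])
    row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        index, ", ".join(columns), ", ".join([row] * len(documents)))
    # Flatten the values row by row, in the same column order as the statement.
    params = [doc[col] for doc in documents for col in columns]
    return sql, params


sql, params = build_insert("so_posts_rt", [
    {"id": 1, "title": "first", "score": 3},
    {"id": 2, "title": "second", "score": 0},
])
print(sql)
```

Sending one statement with many rows rather than one statement per document is what makes batch inserts faster when rebuilding an index.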

WHY REDIS?

Redis is a key-value store, optionally persistent, with a very large value length limit (512 MB), so it can hold a lot of text. Redis list and hash structures are key components of the caching and queueing subsystems.

WHY SPHINXQL?

It is faster than the API.

WHY NGINX?

Nginx is a fast web server that ensures high concurrency and low latency.

WHY THESE COMPONENTS OVERALL?

Every component is well-established software and excels in its area. They are also commonly found in most stacks. We wouldn't like to reinvent the wheel, plus there is no exotic configuration for you to learn or set up!

INSTALLATION

UBUNTU PACKAGES

  • 1.
    apt-get install python-setuptools build-essential
  • 2.
    apt-get install mysql-server mysql-client
  • 3.
    apt-get install redis-server
  • 4.
    apt-get install nginx
  • 5.
    apt-get install python-mysqldb python-flup
  • 6.
    apt-get install git

PYTHON PACKAGES (REQUIRED)

  • 1.
    easy_install redis
  • 2.
    easy_install django
  • 3.
    easy_install django_graceful

PYTHON PACKAGES (OPTIONAL)

  • 1.
    easy_install hiredis
  • 2.
    easy_install beautifulsoup4

SPHINX

  • 1.
    wget http://sphinxsearch.com/files/sphinx-2.1.1-beta.tar.gz
  • 2.
    tar -zxvf sphinx-2.1.1-beta.tar.gz
  • 3.
    cd sphinx-2.1.1-beta
  • 4.
    ./configure && make && make install

CONFIGURATION

  • 1.
    git clone https://github.com/georgepsarakis/techu-search-server.git
  • 2.
    vim /etc/nginx/sites-available/techu
Add the domain techu (or techu.local) to your /etc/hosts file, pointing to your server's internal IP rather than localhost, so that you can connect to Sphinx searchd over the MySQL protocol on a specific port.
server {
    listen 81;
    listen 443 ssl;
    server_name techu;
    access_log /var/log/nginx/techu.access.log;
    error_log /var/log/nginx/techu.error.log;
    
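    # "ssl on;" is left out on purpose: it would force SSL on port 81 as well.
    # The "ssl" parameter on the 443 listener above is enough.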
    ssl_certificate /etc/nginx/ssl/server.crt;
    ssl_certificate_key /etc/nginx/ssl/server.key;
     
    location / {
        include fastcgi_params;
        fastcgi_pass unix:/home/techu-search-server/run/fastcgi.socket;
        fastcgi_split_path_info ^()(.*)$;
    }

    location /admin/static/ {
      autoindex on;
      alias /home/techu-search-server/techu/admin/static/;
    }
}

  • 3.
    vim techu-search-server/techu/settings.py
Change the database settings in Django and set up the MySQL schema:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', 
        'NAME': 'techu',                              
        'USER': 'root',
        'PASSWORD': '',
        'HOST': 'localhost',                  
        'PORT': '3306',                       # Set to empty string for default.
    },
}
  • 4.
    cat techu-search-server/techu/sql/schema.latest.sql | mysql -uroot -p
  • 5.
    cd techu-search-server && ./manage.py update #Start FastCGI processes for Nginx

EXAMPLES

CREATE A NEW CONFIGURATION
  • 1.
    curl -XPOST 'http://techu:81/configuration/' --data-urlencode name='stackoverflow' --data-urlencode description='StackOverflow Posts Indexing Configuration'
{{ new-configuration.result.json }}
CREATE A SEARCHD AND ASSOCIATE WITH THE NEW CONFIGURATION
  • 1.
    curl -XPOST 'http://techu:81/searchd/' --data-urlencode name='stackoverflow' --data-urlencode configuration_id='25'
{{ new-searchd.result.json }}
SEARCHD OPTIONS
  • 1.
    curl 'http://techu:81/option/searchd/6/' --data-urlencode data='{
      "listen" : [ 
        "9312", 
        "9306:mysql41" 
        ], 
      "workers" : "threads", 
      "pid_file" : "/var/run/stackoverfow_searchd.pid", 
      "max_matches" : 1000 
      }'
[
    {
        "fields": {
            "date_inserted": "2013-05-15T20:24:25", 
            "date_modified": "2013-05-15T20:24:25", 
            "is_active": null, 
            "sp_option_id": 138, 
            "sp_searchd_id": 6, 
            "value": "9312", 
            "value_hash": "b6dfd41875bc090bd31d0b1740eb5b1b"
        }, 
        "model": "techu.searchdoption", 
        "pk": 14
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-15T20:24:25", 
            "date_modified": "2013-05-15T20:24:25", 
            "is_active": null, 
            "sp_option_id": 138, 
            "sp_searchd_id": 6, 
            "value": "9306:mysql41", 
            "value_hash": "73da79255eec41caa827f711f4d287a0"
        }, 
        "model": "techu.searchdoption", 
        "pk": 15
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-15T20:24:25", 
            "date_modified": "2013-05-15T20:24:25", 
            "is_active": null, 
            "sp_option_id": 148, 
            "sp_searchd_id": 6, 
            "value": "1000", 
            "value_hash": "a9b7ba70783b617e9998dc4dd82eb3c5"
        }, 
        "model": "techu.searchdoption", 
        "pk": 16
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-15T20:24:25", 
            "date_modified": "2013-05-15T20:24:25", 
            "is_active": null, 
            "sp_option_id": 147, 
            "sp_searchd_id": 6, 
            "value": "/var/run/stackoverfow_searchd.pid", 
            "value_hash": "c3a29d1b495b5f6a29a2f84f51bb935c"
        }, 
        "model": "techu.searchdoption", 
        "pk": 17
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-15T20:24:25", 
            "date_modified": "2013-05-15T20:24:25", 
            "is_active": null, 
            "sp_option_id": 165, 
            "sp_searchd_id": 6, 
            "value": "threads", 
            "value_hash": "0919fe44fdbbd233e5e2e8587006b7b2"
        }, 
        "model": "techu.searchdoption", 
        "pk": 18
    }
]
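The response above is a list of Django-serialized rows, one per stored value, so a multi-valued option such as listen shows up as several rows with the same sp_option_id. A small sketch of grouping such rows on the client side follows. The helper name values_by_option is hypothetical, and the sample rows are trimmed to the fields the function reads.

```python
from collections import defaultdict


def values_by_option(rows):
    """Map each sp_option_id to the list of values stored for it."""
    grouped = defaultdict(list)
    for row in rows:
        fields = row["fields"]
        grouped[fields["sp_option_id"]].append(fields["value"])
    return dict(grouped)


rows = [
    {"pk": 14, "model": "techu.searchdoption",
     "fields": {"sp_option_id": 138, "value": "9312"}},
    {"pk": 15, "model": "techu.searchdoption",
     "fields": {"sp_option_id": 138, "value": "9306:mysql41"}},
    {"pk": 16, "model": "techu.searchdoption",
     "fields": {"sp_option_id": 148, "value": "1000"}},
]
grouped = values_by_option(rows)
print(grouped)
```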
CREATE AN INDEX AND ASSOCIATE WITH THE NEW CONFIGURATION
  • 1.
    curl 'http://techu:81/index/' --data-urlencode name='so_posts_rt' --data-urlencode configuration_id='25'
[
    {
        "fields": {
            "date_inserted": "2013-05-15T18:21:11",
            "date_modified": "2013-05-15T18:21:11",
            "index_type": 1,
            "is_active": 1,
            "name": "so_posts_rt",
            "parent_id": 0
        },
        "model": "techu.index",
        "pk": 28
    }
]
INDEX OPTIONS
  • 1.
    curl 'http://techu:81/option/index/28/' --data-urlencode data='{
      "rt_field" : [ 
        "title", 
        "body"
        ], 
      "rt_attr_timestamp" : [ 
        "creation_date", 
        "last_activity_date"
        ], 
      "rt_attr_uint" : [ 
        "is_answer", 
        "user_id"
        ], 
      "rt_attr_bigint" : "score",
      "type" : "rt", 
      "path" : "/usr/local/sphinx/data/so_posts_rt"
    }'
[
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 3, 
            "value": "/usr/local/sphinx/data/so_posts_rt", 
            "value_hash": "911f381580aa9d7f3b92ba810a7b0603"
        }, 
        "model": "techu.indexoption", 
        "pk": 39
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 59, 
            "value": "creation_date", 
            "value_hash": "8424d087ffe39bb2ee8db173c7e07ba5"
        }, 
        "model": "techu.indexoption", 
        "pk": 40
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 59, 
            "value": "last_activity_date", 
            "value_hash": "bf84b3567d0bdc4db28495bbbd52f728"
        }, 
        "model": "techu.indexoption", 
        "pk": 41
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 54, 
            "value": "is_answer", 
            "value_hash": "ee945c9fbab205b74e166342a3f53218"
        }, 
        "model": "techu.indexoption", 
        "pk": 42
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 54, 
            "value": "user_id", 
            "value_hash": "e8701ad48ba05a91604e480dd60899a3"
        }, 
        "model": "techu.indexoption", 
        "pk": 43
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 55, 
            "value": "score", 
            "value_hash": "ca1cd3c3055991bf20499ee86739f7e2"
        }, 
        "model": "techu.indexoption", 
        "pk": 44
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 53, 
            "value": "title", 
            "value_hash": "d5d3db1765287eef77d7927cc956f50a"
        }, 
        "model": "techu.indexoption", 
        "pk": 45
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 53, 
            "value": "body", 
            "value_hash": "841a2d689ad86bd1611447453c22c6fc"
        }, 
        "model": "techu.indexoption", 
        "pk": 46
    }, 
    {
        "fields": {
            "date_inserted": "2013-05-17T20:58:28", 
            "date_modified": "2013-05-17T20:58:28", 
            "is_active": null, 
            "sp_index_id": 28, 
            "sp_option_id": 1, 
            "value": "rt", 
            "value_hash": "822050d9ae3c47f54bee71b85fce1487"
        }, 
        "model": "techu.indexoption", 
        "pk": 47
    }
]
GENERATE CONFIGURATION FILE AND (RE)START SEARCHD
  • 1.
    curl -XPOST 'http://techu:81/generate/25/'
{
    "configuration": "index so_posts_rt {
      type                           = rt
      path                           = /usr/local/sphinx/data/so_posts_rt
      rt_field                       = body
      rt_field                       = title
      rt_attr_uint                   = user_id
      rt_attr_uint                   = is_answer
      rt_attr_bigint                 = score
      rt_attr_timestamp              = creation_date
      rt_attr_timestamp              = last_activity_date
    }
    searchd {
      listen                         = 9306:mysql41
      listen                         = 9312
      pid_file                       = /var/run/stackoverfow_searchd.pid
      max_matches                    = 1000
      workers                        = threads
    }
    ",
    "started": {
        "command": "searchd --config /home/techu-search-server/techu/sphinx-conf/stackoverflow.conf --iostats --cpustats",
        "status": true
    },
    "stopped": {
        "command": "searchd --config /home/techu-search-server/techu/sphinx-conf/stackoverflow.conf --stopwait",
        "status": true
    }
}
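For intuition about the generation step, the sketch below renders a configuration section from a dictionary of options, repeating the key once per value for list options, as in the output above. It is an illustration only: render_section is a hypothetical name, the column width is a guess, and Techu's own generator and option ordering may differ.

```python
def render_section(header, options):
    """Render a Sphinx configuration section; list values repeat the key."""
    lines = [header + " {"]
    for key, value in options.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            # Pad the key so the "=" signs line up in a column.
            lines.append("  {:<30} = {}".format(key, item))
    lines.append("}")
    return "\n".join(lines)


text = render_section("searchd", {
    "listen": ["9312", "9306:mysql41"],
    "workers": "threads",
    "max_matches": 1000,
})
print(text)
```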
INSERT A DOCUMENT TO THE INDEX
  • 1.
    curl -XPOST 'http://techu:81/indexer/insert/28/' --data-urlencode data='{
      "body": "I have in my Symfony 2.1 RC app a simple Comment model (using Doctrine 2). Every comment has a user and a message.
        Currently, the CommentBundle manages comments on articles. I'\''d like it to be more generic to be able to comment any kind of entity without copying code across different bundles dedicated to comments...
        For this to work, I also need a way to reference any entity from the comment one. I think having two fields entity_type and entity_id can be a nice solution. However, I can'\''t get the object from these without mapping entity_type to classes manually and using the find method.
        So how do I reference an entity from a comment ? And how can I create generic behavior working  on several entities ?", 
        "user_id": 893390, 
        "title": "Generic comment system in Symfony2", 
        "last_activity_date": 1368868178, 
        "creation_date": 1346167729, 
        "score": 1, 
        "is_answer": 0, 
        "id": 12162609
        }'
{ 
    "searchd" : "ok" 
}
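Building the same request from Python avoids the shell quoting needed for apostrophes in the document body. The sketch below only prepares the request and does not send it. The URL comes from the example above, while insert_request is a hypothetical helper and the document body is shortened for brevity.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://techu:81"  # host and port from the Nginx setup above


def insert_request(index_id, document):
    """Prepare (but do not send) the POST request that inserts a document."""
    body = urlencode({"data": json.dumps(document)}).encode("utf-8")
    return Request("{}/indexer/insert/{}/".format(BASE_URL, index_id),
                   data=body, method="POST")


request = insert_request(28, {
    "id": 12162609,
    "title": "Generic comment system in Symfony2",
    "body": "I'd like it to be more generic...",  # shortened sample text
    "user_id": 893390,
    "score": 1,
    "is_answer": 0,
})
# Passing `request` to urllib.request.urlopen would send it to a running server.
```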
DELETE A DOCUMENT FROM THE INDEX
  • 1.
    curl 'http://techu:81/indexer/delete/28/16528355'
{
  "searchd" : "ok" 
}
UPDATE A DOCUMENT IN THE INDEX
  • 1.
    curl -XPOST 'http://techu:81/indexer/update/28/' --data-urlencode data='{ 
      "title" : "Generic comment system in Symfony2 framework", 
      "last_activity_date" : 1368868278 
    }'
{ 
    "searchd" : "ok" 
}
SEARCH
  • 1.
    curl -XPOST 'http://techu:81/search/28/' --data-urlencode data='{
      "q" : "(mysql php issue) | python" 
    }'
{
    "meta": [
        {
            "Value": "40", 
            "Variable_name": "total"
        }, 
        {
            "Value": "40", 
            "Variable_name": "total_found"
        }, 
        {
            "Value": "0.001", 
            "Variable_name": "time"
        }, 
        {
            "Value": "1.284", 
            "Variable_name": "cpu_time"
        }, 
        {
            "Value": "0.000", 
            "Variable_name": "agents_cpu_time"
        }, 
        {
            "Value": "0.000", 
            "Variable_name": "io_read_time"
        }, 
        {
            "Value": "0", 
            "Variable_name": "io_read_ops"
        }, 
        {
            "Value": "0.0", 
            "Variable_name": "io_read_kbytes"
        }, 
        {
            "Value": "0.000", 
            "Variable_name": "io_write_time"
        }, 
        {
            "Value": "0", 
            "Variable_name": "io_write_ops"
        }, 
        {
            "Value": "0.0", 
            "Variable_name": "io_write_kbytes"
        }, 
        {
            "Value": "0.000", 
            "Variable_name": "agent_io_read_time"
        }, 
        {
            "Value": "0", 
            "Variable_name": "agent_io_read_ops"
        }, 
        {
            "Value": "0.0", 
            "Variable_name": "agent_io_read_kbytes"
        }, 
        {
            "Value": "0.000", 
            "Variable_name": "agent_io_write_time"
        }, 
        {
            "Value": "0", 
            "Variable_name": "agent_io_write_ops"
        }, 
        {
            "Value": "0.0", 
            "Variable_name": "agent_io_write_kbytes"
        }, 
        {
            "Value": "mysql", 
            "Variable_name": "keyword[0]"
        }, 
        {
            "Value": "28", 
            "Variable_name": "docs[0]"
        }, 
        {
            "Value": "78", 
            "Variable_name": "hits[0]"
        }, 
        {
            "Value": "php", 
            "Variable_name": "keyword[1]"
        }, 
        {
            "Value": "104", 
            "Variable_name": "docs[1]"
        }, 
        {
            "Value": "467", 
            "Variable_name": "hits[1]"
        }, 
        {
            "Value": "issue", 
            "Variable_name": "keyword[2]"
        }, 
        {
            "Value": "89", 
            "Variable_name": "docs[2]"
        }, 
        {
            "Value": "115", 
            "Variable_name": "hits[2]"
        }, 
        {
            "Value": "python", 
            "Variable_name": "keyword[3]"
        }, 
        {
            "Value": "39", 
            "Variable_name": "docs[3]"
        }, 
        {
            "Value": "88", 
            "Variable_name": "hits[3]"
        }
    ], 
    "results": [
        {
            "creation_date": 1360030771, 
            "id": 14699117, 
            "is_answer": 0, 
            "last_activity_date": 1368863555, 
            "score": 0, 
            "user_id": 1248745
        }, 
        {
            "creation_date": 1287060242, 
            "id": 3933197, 
            "is_answer": 0, 
            "last_activity_date": 1368867016, 
            "score": 0, 
            "user_id": 412528
        }, 
        {
            "creation_date": 1368864670, 
            "id": 16622092, 
            "is_answer": 0, 
            "last_activity_date": 1368865202, 
            "score": 0, 
            "user_id": 637888
        }, 
        {
            "creation_date": 1368827484, 
            "id": 16618456, 
            "is_answer": 0, 
            "last_activity_date": 1368859423, 
            "score": 1, 
            "user_id": 974369
        }, 
        {
            "creation_date": 1368861382, 
            "id": 16621724, 
            "is_answer": 0, 
            "last_activity_date": 1368861382, 
            "score": 0, 
            "user_id": 2396228
        }, 
        {
            "creation_date": 1368866680, 
            "id": 16622305, 
            "is_answer": 0, 
            "last_activity_date": 1368866680, 
            "score": 0, 
            "user_id": 1119216
        }, 
        {
            "creation_date": 1253101777, 
            "id": 1432480, 
            "is_answer": 0, 
            "last_activity_date": 1368868136, 
            "score": 18, 
            "user_id": 130758
        }, 
        {
            "creation_date": 1330010318, 
            "id": 9415785, 
            "is_answer": 0, 
            "last_activity_date": 1368865096, 
            "score": 2, 
            "user_id": 842837
        }, 
        {
            "creation_date": 1368830820, 
            "id": 16618945, 
            "is_answer": 0, 
            "last_activity_date": 1368866858, 
            "score": 0, 
            "user_id": 2386518
        }, 
        {
            "creation_date": 1368858021, 
            "id": 16621351, 
            "is_answer": 0, 
            "last_activity_date": 1368866236, 
            "score": 2, 
            "user_id": 2395938
        }, 
        {
            "creation_date": 1368865881, 
            "id": 16622217, 
            "is_answer": 0, 
            "last_activity_date": 1368866056, 
            "score": 2, 
            "user_id": 1742632
        }, 
        {
            "creation_date": 1368861397, 
            "id": 16621726, 
            "is_answer": 0, 
            "last_activity_date": 1368863461, 
            "score": 0, 
            "user_id": 728286
        }, 
        {
            "creation_date": 1368865418, 
            "id": 16622171, 
            "is_answer": 1, 
            "last_activity_date": 1368865418, 
            "score": 3, 
            "user_id": 2096752
        }, 
        {
            "creation_date": 1368866802, 
            "id": 16622317, 
            "is_answer": 1, 
            "last_activity_date": 1368866802, 
            "score": 0, 
            "user_id": 2389851
        }, 
        {
            "creation_date": 1368867520, 
            "id": 16622423, 
            "is_answer": 0, 
            "last_activity_date": 1368867520, 
            "score": 0, 
            "user_id": 2351696
        }, 
        {
            "creation_date": 1368856133, 
            "id": 16621152, 
            "is_answer": 1, 
            "last_activity_date": 1368863075, 
            "score": 0, 
            "user_id": 320726
        }, 
        {
            "creation_date": 1368859602, 
            "id": 16621526, 
            "is_answer": 0, 
            "last_activity_date": 1368859784, 
            "score": 0, 
            "user_id": 2125893
        }, 
        {
            "creation_date": 1368863067, 
            "id": 16621926, 
            "is_answer": 0, 
            "last_activity_date": 1368863067, 
            "score": 0, 
            "user_id": 1460235
        }, 
        {
            "creation_date": 1368865015, 
            "id": 16622132, 
            "is_answer": 0, 
            "last_activity_date": 1368865015, 
            "score": 0, 
            "user_id": 1578927
        }, 
        {
            "creation_date": 1368806615, 
            "id": 16613501, 
            "is_answer": 0, 
            "last_activity_date": 1368866968, 
            "score": 0, 
            "user_id": 1044110
        }
    ]
}
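The meta section of the response is a list of Variable_name/Value pairs, which is awkward to query directly. A client could flatten it as sketched below. The helper names meta_to_dict and keyword_stats are hypothetical, and the sample is a small excerpt of the response above.

```python
def meta_to_dict(meta):
    """Flatten the list of Variable_name/Value pairs into a plain dict."""
    return {entry["Variable_name"]: entry["Value"] for entry in meta}


def keyword_stats(meta):
    """Collect per-keyword docs and hits counts from the flattened meta."""
    flat = meta_to_dict(meta)
    stats = {}
    i = 0
    # Keywords are numbered keyword[0], keyword[1], ... with matching docs/hits.
    while "keyword[{}]".format(i) in flat:
        stats[flat["keyword[{}]".format(i)]] = {
            "docs": int(flat["docs[{}]".format(i)]),
            "hits": int(flat["hits[{}]".format(i)]),
        }
        i += 1
    return stats


meta = [
    {"Value": "40", "Variable_name": "total_found"},
    {"Value": "mysql", "Variable_name": "keyword[0]"},
    {"Value": "28", "Variable_name": "docs[0]"},
    {"Value": "78", "Variable_name": "hits[0]"},
]
stats = keyword_stats(meta)
print(stats)
```

Note that all values arrive as strings, so numeric fields need an explicit conversion.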

WHAT'S NEXT

  • Automatic batching of queued operations &
    cancelling of overlapping statements.
  • Cache structures in Redis
    (index names, fields & attributes, connection data)
  • Search & Highlighted Excerpts cache invalidations and TTL
  • Faceting
  • Scripting
  • Streaming responses (aggregated data or search results)
  • Clustering & distributed indexing/searching
  • Request Hooks
  • Server Statistics Web Page
    (Cache hit/miss ratios, Requests/sec, Requests in queue, Executed Statements/sec)
  • Much more elaborate documentation of each request's parameters!