Planet SMC

September 20, 2021

Rajeesh K Nambiar

A new set of OpenType shaping rules for Malayalam script

TL;DR: research and development of a completely new set of OpenType layout rules for Malayalam traditional orthography.

Writing OpenType shaping rules is hard. Writing OpenType shaping rules for advanced (complex) scripts is harder. Writing OpenType shaping rules without causing any undesired ligature formations is even harder.


The shaping rules for SMC fonts abiding by v2 of the Malayalam OpenType specification (the mlm2 script tag) were written and polished in large part by me over many years, fixing shaping errors and undesired ligature formations. Even so, some hard-to-fix bugs remained. Driven by the desire to fix such difficult bugs in RIT fonts, and by the copyright fiasco, I set out to write a simplified set of OpenType shaping rules for Malayalam from scratch. Two major references helped in that quest: (1) a radically different approach I had tried a few years ago, but failed, with the mlym script tag (aka Windows XP era shaping); (2) a manuscript by R. Chithrajakumar of Rachana Aksharavedi, who culled and compiled the ‘definitive character set’ for Malayalam script. The idea of the ‘definitive character set’ is that it contains all the valid characters in a script and none of the (invalid) characters not in the script. By that definition, I wanted to create the new shaping rules in such a way that they do not generate any invalid characters (e.g. one with a detached u-kar). In short: it shouldn’t be possible to accidentally generate broken reformed-orthography forms.

Fig. 1. Samples of Malayalam definitive character set listing by R. Chithrajakumar, circa 1999. Source: K.H. Hussain.

“Simplify, simplify, simplify!”

Henry David Thoreau

It is my opinion that a lot of the complexity in Malayalam shaping comes from the fact that the Indic OpenType shaping specification largely follows Devanagari, which in turn was adapted from ISCII, which has (in my limited understanding) its roots in the component-wise metal type design of ligature glyphs. Many half, postbase and other shaping rules have their lineage there. I have also heard similar concerns about this complexity expressed by others, including Behdad Esfahbod, the FreeFont maintainer, et al.


As K.H. Hussain once rightly noted, the shaping rules were creating many undesired/unnecessary ligature glyphs by default, and additional shaping rules (complex contextual lookups) were written to avoid/undo those. A better, alternative approach would be: simply don’t generate undesired ligatures in the first place.

“Invert, always invert.”

Carl Gustav Jacob Jacobi

Around December 2019, I set out to write a definitive set of OpenType shaping rules for the traditional script set of Malayalam. Instead of relying on many different lookup types such as pref, pstf, blwf, pres, psts and a myriad of complex contextual substitutions, the only type of lookup required was akhn — because the definitive character set contains all the ligatures of Malayalam, and those glyphs are designed in the font as single glyphs — no component-based design.
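As a rough sketch of what this looks like in AFDKO feature-file syntax (the glyph names below are illustrative placeholders, not the actual names used in RIT fonts), each valid conjunct becomes a single akhn ligature substitution, and nothing outside that list can be formed:

```fea
languagesystem mlm2 dflt;

# Every valid conjunct in the definitive character set is one ligature
# substitution to a single precomposed glyph; no half/postbase forms.
feature akhn {
    # ka + virama + ssa -> the single glyph for ക്ഷ
    sub ka virama ssa by kssa;
    # na + virama + rra -> the single glyph for ന്റ
    sub na virama rra by nrra;
} akhn;
```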

The draft rules were written in tandem with the RIT-Rachana redesign effort and tested against different shaping engines such as HarfBuzz, Allsorts, XeTeX, LuaHBTeX and DirectWrite/Uniscribe for Windows. Windows, being Windows (and also the maintainer of the OpenType specification), indeed did not work as expected per the specification: its implementation clearly special-cases the pstf forms of യ (Ya, 0D2F) and വ (Va, 0D35). To make a single set of shaping rules work with all these shaping engines, the draft rules were slightly amended, et voilà — it worked in all applications and OSes that use any of these shaping engines. It was decided to drop support for the mlym script tag, which was deprecated many years ago, and support only the mlm2 specification, which fixed many irreparable shortcomings of mlym. One notable shaping engine which doesn’t work with these rules is the Adobe text engine (Lipika?), but Adobe applications have recently switched to HarfBuzz. That covers all major typesetting applications.

Testing fonts developed with this new set of shaping rules for Malayalam confirmed that they do not generate any undesired ligatures in the first place. In addition, compared to the previous shaping rules, the new set gets rid of 70+ lines of complex contextual substitutions and other rules, while remaining easy to read and maintain.

Old vs new shaping rules in Rachana
Fig. 3. Old vs new shaping rules in RIT Rachana.

Application support

This new set of OpenType layout rules for Malayalam is tested to work 100% with the following shaping engines:

  1. HarfBuzz
  2. Allsorts
  3. DirectWrite/Uniscribe (Windows shaping engine)

And GUI toolkits/applications:

  1. Qt (KDE applications)
  2. Pango/GTK (GNOME applications)
  3. LibreOffice
  4. Microsoft Office
  5. XeTeX
  6. LuaHBTeX
  7. Emacs
  8. Adobe InDesign (with HarfBuzz shaping engine)
  9. Adobe Photoshop
  10. Firefox, Chrome/Chromium, Edge browsers


In addition, the advantages of the new shaping rules are:

  1. Adheres completely to the concept of the ‘definitive character set’ of the language/script: it generates all valid conjunct characters and does not generate any invalid conjunct character.
  2. The same set of rules works fine, without adjustments/reprogramming, for ‘limited character set’ fonts. A ‘limited character set’ may not contain conjunct characters as extensively as the ‘definitive character set’; yet it would always have characters with reph and u/uu-kars formed correctly.
  3. Reduced complexity and maintenance (no complex contextual lookups, reverse chaining etc.). Write once, use in any fonts.
  4. Open source, libre software.

This new OpenType shaping rules program was released to the public along with RIT Rachana a few months ago, and it is also used in all other fonts developed by RIT. It is licensed under the Open Font License for anyone to use and integrate into their fonts; please ensure the copyright statements are preserved. The shaping rules are maintained at the RIT GitLab repository. Please create an issue in the tracker if you find any bugs, or send a merge request if you make any improvement.

by Rajeesh at September 20, 2021 05:30 AM

May 08, 2021

Rajeesh K Nambiar

Letsencrypt certificate renewal: Nginx with reverse-proxy

Let’s Encrypt revolutionized SSL certificate management for websites in a short span of time — it directly improved the security of users of the world wide web by (1) making it very simple for administrators to deploy SSL certificates to websites and (2) making the certificates available free of cost. To appreciate their effort, compare it with the hoops one had to jump through to obtain a certificate from a certificate authority (CA) earlier, and how much money and energy one would have to spend on it.

I make use of Let’s Encrypt in all the servers I maintain(ed), and in the past I used the certbot tool to obtain & renew certificates. Recent versions of certbot are only available as a snap package, which is not something I’d want to, or be able to, set up in many cases.

Enter acme. It is a shell script that works great. Installing acme also sets up a cron job, which automatically renews the certificate for the domain(s) near its expiration. I recently set up nginx as a reverse proxy to a Lexonomy service, with acme for certificate management. The cron job is supposed to renew the certificate on time.

Except it didn’t. A few days ago I received a notification about the imminent expiry of the certificate. I searched the interweb quite a bit, but didn’t find a simple enough solution (“make the proxy service redirect the request”…). What follows is the troubleshooting and a solution; maybe someone else will find it useful.


acme was unable to renew the certificate, because the HTTP-01 authentication challenge requests were not answered by the proxy server to which all traffic was being redirected. In short: how does one renew Let’s Encrypt certificates on an nginx reverse-proxy server?

Certificate renewal attempt by acme would result in errors like:

# --cron --home "/root/" -w /var/www/html/
[Sat 08 May 2021 07:28:17 AM UTC] ===Starting cron===
[Sat 08 May 2021 07:28:17 AM UTC] Renew: ''
[Sat 08 May 2021 07:28:18 AM UTC] Using CA:
[Sat 08 May 2021 07:28:18 AM UTC] Single domain=''
[Sat 08 May 2021 07:28:18 AM UTC] Getting domain auth token for each domain
[Sat 08 May 2021 07:28:20 AM UTC] Getting webroot for domain=''
[Sat 08 May 2021 07:28:21 AM UTC] Verifying:
[Sat 08 May 2021 07:28:24 AM UTC] Verify error:Invalid response from https://my.domain.org/.well-known/acme-challenge/Iyx9vzzPWv8iRrl3OkXjQkXTsnWwN49N5aTyFbweJiA [NNN.NNN.NNN.NNN]:
[Sat 08 May 2021 07:28:24 AM UTC] Please add '--debug' or '--log' to check more details.
[Sat 08 May 2021 07:28:24 AM UTC] See:
[Sat 08 May 2021 07:28:25 AM UTC] Error renew my.domain.org.


The key error to notice is

Verify error:Invalid response from [NNN.NNN.NNN.NNN]

Sure enough, the resource .well-known/acme-challenge/… is not accessible. Let us make it accessible without going through the proxy server.


First, create the directory if it doesn’t exist, assuming the web root is /var/www/html:

# mkdir -p /var/www/html/.well-known/acme-challenge

Then, edit /etc/nginx/sites-enabled/ and, before the proxy_pass directive, add the .well-known/acme-challenge/ location pointing to the correct location in the web root. Do this in both the HTTPS and HTTP server blocks (doing it in only one didn’t work for me).

server {
    listen 443 default_server ssl;
    server_name;
    location /.well-known/acme-challenge/ {
        root /var/www/html/;
    }
    location / {
        proxy_pass http://myproxyserver;
        proxy_redirect off;
    }
}

server {
    listen 80;
    listen [::]:80;
    server_name;
    location /.well-known/acme-challenge/ {
        root /var/www/html/;
    }
    # Redirect everything else to HTTPS
    location / {
        return 301 https://$server_name$request_uri;
    }
}

Make sure the configuration is valid, and reload the nginx configuration:

nginx -t && systemctl reload nginx.service
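Before retrying the renewal, it helps to confirm that a token file dropped into the webroot is actually served at the challenge URL. A minimal sketch, using a temporary directory as a stand-in for /var/www/html so it can be tried anywhere (the commented curl line is what you would run on the real server, with a hypothetical domain):

```shell
# Create a throwaway token the way acme does, under a stand-in webroot.
webroot=$(mktemp -d)
mkdir -p "$webroot/.well-known/acme-challenge"
echo ok > "$webroot/.well-known/acme-challenge/test-token"
cat "$webroot/.well-known/acme-challenge/test-token"
# On the real server, fetch it through nginx to make sure the
# acme-challenge location block answers (hypothetical domain):
#   curl -i http://my.domain.org/.well-known/acme-challenge/test-token
```

If curl returns the token over plain HTTP without being redirected to the proxy, the HTTP-01 challenge will succeed.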

Now, try to renew the certificate again:

# --cron --home "/root/" -w /var/www/html/
[Sat 08 May 2021 07:45:01 AM UTC] Your cert is in  /root/ 
[Sat 08 May 2021 07:45:01 AM UTC] Your cert key is in  /root/ 
[Sat 08 May 2021 07:45:01 AM UTC] v2 chain.
[Sat 08 May 2021 07:45:01 AM UTC] The intermediate CA cert is in  /root/ 
[Sat 08 May 2021 07:45:01 AM UTC] And the full chain certs is there:  /root/ 
[Sat 08 May 2021 07:45:02 AM UTC] _on_issue_success


by Rajeesh at May 08, 2021 10:22 AM

January 01, 2021

Rajeesh K Nambiar

Panmana: new Malayalam body text font

Rachana Institute of Typography starts the new year 2021 with the release of a new body-text Malayalam Unicode font named ‘Panmana’.

Fig. 1: ‘Panmana’ font specimen.

The font is named after and dedicated to Prof. Panmana Ramachandran Nair, who steadfastly championed the original script of Malayalam. It is designed by K.H. Hussain with inputs from Ashok Kumar and CVR, with font engineering by Rajeesh (your correspondent); it is maintained by RIT.

‘Panmana’ is released under the Open Font License, free to use and share. TrueType and web fonts can be downloaded from the website. A flyer about the font is available. If you spot any issues, please report them in the source repository.

by Rajeesh at January 01, 2021 04:02 AM

December 26, 2020

Rajeesh K Nambiar

RIT Rachana: a classic typeface reimagined

It was around 2006 that I started reading and writing Malayalam (my native language) text widely on the computer, thanks to Unicode and the proliferation of Malayalam blogs. It was also at the same time that I noticed Malayalam text was not ‘shaped’ correctly in many cases on my primary operating system — GNU/Linux. A number of Unicode fonts were available under libre licenses, of which I liked Rachana the most.

To cut to the chase: a few years later, I ended up co-maintaining Rachana, trying to fix all the known bugs and succeeding to a large extent; among many other things.

In 2020, with new insights into the design metrics of Malayalam fonts, the designer of Rachana — K.H. Hussain — redrew all the glyphs of Rachana, completely overhauled the Bold variant and freshly designed the Italic & BoldItalic styles. All fonts in the new typeface contain more than 1100 glyphs, covering the entire set of Malayalam characters encoded in Unicode version 13.0 and all conjuncts/ligatures in the definitive character set of Malayalam traditional orthography. The Latin glyphs are adapted from TeX Gyre Schola with express permission from GUST. This makes the fonts suitable for typesetting contemporary text, novels, poetry, scholarly works, Sanskrit text, the Bible, archaic books and everything in between.

Fig. 1: RIT Rachana glyph redesign samples.

Not satisfied with the existing solutions for some remaining shaping bugs in Rachana, I researched and ventured to try a radically different approach to complex text shaping rules for traditional-script Malayalam fonts. Following the v2 version of the Indic OpenType specification (mlm2), a completely new set of shaping rules was written from scratch. Though it was a bit of a struggle to get Uniscribe/Windows with its idiosyncrasies to shape correctly, and Adobe InDesign needs this fix, it proved to be a great success. The new set of rules fixes all known shaping bugs to my knowledge. The development of this shaping rule program is a blog entry for another time.

A comparison of problematic shaping combinations can be found in Fig. 2. Note that the two “സ്വാതന്ത്ര്യം” differ in the order of their code points (“ര്യ” vs “യ്ര”), and their shaping should accordingly be different.

Fig. 2: RIT Rachana improved shaping of problematic conjuncts.

In the process, the build script and test cases were also written from scratch.

The result is a new font named RIT Rachana, released under libre Open Font License free to download and use by individuals, designers, organizations, institutions, government departments and media houses.

Fig. 3: RIT Rachana variants/styles.

All four variants of RIT Rachana can be downloaded from the website for desktop and web usage. If you notice any issues, report them at the source repository.

The typeface is the fruit of months of labour by many, including the designers and developers, and the early users and testers and their feedback: especially Ashok Kumar, CVR, and the Sayahna typesetters and sysadmins.

by Rajeesh at December 26, 2020 07:25 AM

November 11, 2020

Rajeesh K Nambiar

New packages in Fedora/EPEL: screenkey & python-secure_cookie

TL;DR: Two software packages — screenkey and python-secure_cookie — are available in the Fedora and EPEL repositories.


Screenkey is a tool that displays the keys one types on the screen. It is quite useful for screen recording/casting for video tutorials and such. I use it particularly to record tutorial sessions on Vim, where keystrokes are important.

Fig. 1: Screenkey in action. Source: screenkey.


The Python module secure-cookie — which provides secure session and cookie management — was split out of the Werkzeug WSGI module as of version 1.0. Odoo depends on python-werkzeug and currently keeps a vendored copy of the functionality in 14.0; they haven’t migrated to secure-cookie mostly because many distros, including Arch and Fedora — which have a reputation for shipping the latest software — hadn’t packaged secure-cookie yet.

I have packaged both pieces of software for Fedora & EPEL, and they will be hitting the release repositories soon.

by Rajeesh at November 11, 2020 10:50 AM

November 09, 2020

Rajeesh K Nambiar

HarfBuzz shaping engine in InDesign

Since the release of Ezhuthu, I have received a few reports that in Adobe InDesign/Photoshop some matras appear outside the margin and the ു‘u’/ൂ‘uu’ matras appear disjoint from the base conjunct/consonant. The first issue is worked around in the font (in version 1.1); but the second issue cannot be worked around per se.

Fig. 1: InDesign shaping issues. Source: Abdul Azeez Vengara, CC-BY-SA.

Adobe products use their own shaping engine known as ‘Lipika’ for advanced text layout. The ‘world ready composer’ uses it by default when text in complex scripts such as Malayalam (and other Indic scripts) is used.

Lipika has various issues in properly shaping advanced conjunct forms and has its own quirks. Certain issues can be worked around in fonts, but others cannot. There were reports that Adobe products might eventually integrate the gold standard of shaping engines — the libre software HarfBuzz.

Since mid July 2020, HarfBuzz shaping engine can be used instead of lipika shaper in Adobe InDesign. To enable it, follow these steps:

  1. Download this file: HarfbuzzOverride.js
  2. Copy it to the ../Scripts/Scripts Panel folder under the InDesign root folder
  3. Close InDesign first. Open InDesign and go to Window → Utilities → Scripts
  4. Double-click on HarfbuzzOverride.js to enable the HarfBuzz shaper
  5. Use the traditional script Malayalam fonts from RIT with perfect advanced text shaping.
  6. If you have already laid out text, you may need to reapply the style/font to see the effect.

Fig. 2: Ezhuthu font with perfect advanced text shaping.

by Rajeesh at November 09, 2020 04:36 AM

November 01, 2020

Rajeesh K Nambiar

Announcing ‘Ezhuthu/എഴുത്ത് ’ — a handwriting/script style font for Malayalam

November 1st marks the birth of Kerala, the southernmost state in India, with around 35 million people speaking its language, Malayalam. To celebrate it, Rachana Institute of Typography is announcing the release of ‘Ezhuthu/എഴുത്ത്’ — a handwriting/script style Unicode font with traditional orthography.

Fig. 1. Ezhuthu font specimen.

The glyphs are drawn by the famed calligrapher Narayana Bhattathiri. The hand-drawn characters were turned into vector graphics and then transformed into font shapes; this typography work was done by Hussain KH, who led the Rachana Aksharavedi movement and designed popular fonts such as Rachana, Meera, Meera Inimai etc. The fine-tuned typeface needs OpenType shaping to correctly shape and render advanced conjunct character formations in Malayalam. The OpenType feature development, integration and technical infrastructure were worked on by me, Rajeesh. Ashok Kumar and CVR made key contributions to the font development.

A PDF document specimen is available at

Ezhuthu is made available under libre license — Open Font License, which makes it free to download, use, distribute and enhance without restrictions by the general public, designers, institutions and government departments. We are proud to add a unique font to the collection of freely available Malayalam fonts. If you find any issues, report those in the source repository.

by Rajeesh at November 01, 2020 06:29 AM

October 26, 2020

Rajeesh K Nambiar

Malayalam fonts: Beyond Latin font metrics

This year’s annual international conference organized by the TeX Users Group — TUG2020 — was held completely online due to the raging pandemic. At TUG2020, I presented a talk on some important Malayalam typeface design factors and considerations.

The idea of the talk and its articulation originated with K.H. Hussain, designer of well-known fonts such as Rachana, Meera, Meera Inimai, TNJoy etc. In a number of discussions that ensued, the idea was developed, and it was later presented at TUG2020.

The opening keynote of TUG2020 was delivered by Steve Matteson, about the design of the Noto fonts. He mentioned that Noto was originally envisaged as a single font containing all Unicode scripts; but that was changed for a couple of reasons: (1) the huge size of the resulting font and (2) the design of many South/South-East Asian characters not fitting well within its Latin font metrics.

This second point set up the stage nicely for my talk, in which we argued that a paradigm shift from established Latin font metrics is necessary in designing and choosing font metrics for Indic scripts, in particular with Malayalam as a case study.

Indic scripts have abundant conjunct characters (basic characters combined to form a distinct shape). The same characters may join ‘horizontally’ (e.g. ത്സ/thsa) or ‘vertically/stacked’ (e.g. സ്ത/stha); and the Malayalam script in particular has plenty of stacked conjuncts even in comparison with other Indic scripts. This peculiarity also makes the glyph design of fonts challenging — balancing aesthetics, legibility/readability and leading/line spacing. Specifically, following the usual x-height/cap-height/ascender/descender metrics used in Latin fonts puts a lot of constraints on the design of stacked conjuncts. We propose to break away from these conventional metrics and adopt different proportions for the above- and below-base glyphs (even if they are the same character, e.g. സ in the double conjunct സ്സ), still conforming to the aesthetics of the script while managing legibility and leading.

Fig. 1: Malayalam stacked conjuncts beyond conventional Latin font metrics.

Details of this study, argument and proposal can be found in the slides of the presentation available at the program details as well as the recorded talk now available on TUG YouTube channel.

TUG2020 presentation.

The conference paper, edited by Barbara Beeton and Karl Berry, will be published in the next issue of the TUGboat journal.

by Rajeesh at October 26, 2020 10:21 AM

September 20, 2020

Rajeesh K Nambiar

Okular 20.08 — redesigned annotation tools

Last year I wrote about some enhancements made to Okular’s annotation tool, and in one of those posts Simone Gaiarin commented that he was working on redesigning the annotation toolbar altogether. I was quite interested and had also been thinking of ‘modernizing’ the tool — only, I had no idea how much work it would be.

The existing annotation tool works, but it had some quirks and many advanced options which were documented pretty well in the Handbook but not obvious to a casual user. For instance, if the user wanted to highlight some part of the text, she selected (single-clicked) the highlighter tool and applied it to a block of text. When another part of the text was to be highlighted, you’d expect the highlighter tool to apply directly; but it didn’t ‘stick’ — the tool was unselected after highlighting the first block of text. There is an easy way to make an annotation tool ‘stick’: instead of a single click to select the tool, simply double-click, and it persists. Another instance is the ‘Strikeout’ annotation, which is not displayed by default but can be added to the tools list.

Simone, with lots of input, testing and reviews from David Hurka, Nate Graham, Albert Astals Cid et al., has pulled off a magnificent rewrite of Okular’s annotation toolbar. To get an idea of the amount of work that went into this, see this Phabricator task and this Invent code review. The result of many months of hard work is truly modern, easy-to-explore-and-use annotation support. I am not aware of any other libre PDF reader with such good annotation features.

Annotation toolbar in Okular 20.08.

Starting from the left, default tools are: Highlight (brush icon), Underline (straight line) and Squiggle (wobbly line), Strike out, Insert text (Typewriter), Inline note, Popup note, Freehand drawing and Shapes (arrows, lines, rectangles etc.). The line thickness, colour, opacity and font of the tools can be customized easily from the drawer. Oh, and the selected annotation tool ‘sticks’ by default (see the ‘pin’ icon at the right end of toolbar).

When upgrading to Okular 20.08 from a previous version, it preserves the customized annotation tools created by the user and makes them available under ‘Quick annotations’; these can be quickly applied using Alt+n (Alt+1, Alt+2 etc.) shortcuts. It did reset my custom shortcut keys for navigation (I use the Vim keys gg to go to the first page and G to go to the last page), which had to be added back manually.

Custom tools (Quick annotations) can be applied with short cuts.

Here is the new toolbar in action.

by Rajeesh at September 20, 2020 08:27 AM

May 19, 2020

Rajeesh K Nambiar

Complex text shaping fixed in Konsole 20.08

Konsole was one of the few terminal emulators with proper complex text shaping support. Unfortunately, complex text (including Malayalam) shaping was broken around KDE Applications release 18.08 (see upstream bug 401094 for details).

Broken Malayalam text shaping in Konsole 20.04

Mariusz Glebocki fixed the code in January this year, and I tested it to work correctly. There’s a minor issue of glyphs with deep vertical components being cut off (notice the rendering of “സ്കൂ”), but otherwise the shaping and rendering are good. The patches are merged upstream and will be part of the KDE Applications Bundle 20.08.

Proper Malayalam text shaping in Konsole 20.04 with shaping fixes.

If you don’t want to wait that long, I have made a 20.04 release with the fixes on top, available for Fedora 31 & 32 in this COPR.

by Rajeesh at May 19, 2020 09:44 AM

February 17, 2020

Sreenadh T C

Eulogy to my best friend from childhood

This probably is a very late eulogy. It also means it took me this long to find the nerve to put together words without breaking down or losing my composure.

So here goes the story of two little friends who “are” brothers (from two close families) for a lifetime.

Photo by sudip paul from Pexels

I was almost a year old when he arrived (Hari Krishnan, referred to as Kichus from here on).

We both grew up sharing toys, getting new dresses together for Onam, buying crackers together for Vishu, fighting for penalties and 6s (and of course making amends the very next day).

I took for granted that this kid is going to be with me forever, see me graduate high school, class 10th, class 12th, see me become an Engineer. But destiny as we call it had other plans.

Kichu was diagnosed with Blood Cancer at the age of 16.

Little did we know that, we started losing each other way before he even turned 16.

Let me tell you how I remained helpless while everything around me pulled me down into a rabbit hole.

I was off for school that morning, to attend my last half-yearly exam. Something was off that day from the time I woke up. My mom was acting weird and she was in a hurry to push me off for school. Given this was my exams, I felt this urge was justified. So I walk away from the gate and I could see my mom peeking through the kitchen window, making sure I was not being stopped by anyone to tell me what had happened. I somehow reach the bus-stop, and the bus that comes on time every day was nowhere to be seen.

I could see my mom outside my house now, with her neck extended, looking out for me, checking if I safely got into the bus. At this point I knew something wasn’t right. We all knew Kichu wasn’t gonna make it, coz I saw him a month before this day.

He had his fair share of chemo done by then, and had lost all his hair. It was hard for me to face him and look him in the eyes that were searching for a bit of hope, coz he knew it all the way.

How did he know you may ask.

Coz he had seen his elder sister take the same path to death when he was probably 6.

Every time I saw him, he was holding on to that smile making sure his mom never saw him suffer the pain he had within. Kichu was strong and he asked me to be strong alongside him and keep my shit together.
He wanted to do so much thing, and he had very little time.
We started swimming lessons, we went for painting class, he got himself a gaming PC and we played NFS all day or till he was tired.
He couldn’t play any more for his heart was weak, but he watched me score goals. Even when he was cheering from the sidelines, I was hoping for that one day when I could celebrate another goal with him.

So the bus finally arrived and I am off for school. Mom is relieved for the time being.

I write my exam, thinking about what had happened in the morning. I walk back home in the afternoon, and I open the front door.

I could see my mom had cried the whole day, and her eyes were so red and dry. I could see my grandma numb and looking at me with those helpless eyes.

Mom finally said: “ശ്രീ, നമ്മടെ കിച്ചു പോയെടാ / Kichu is no longer with us”

I don’t remember anything but just one answer. I asked mom if she was hiding this from me in the morning.


I felt so much anger and pain, I wanted to smash the front window glass. I went straight to my room upstairs, shut the door from behind, grabbed a pillow, and bit it like an angry dog and screamed for a long time as far as I can remember.

I had to be strong.

How can I be strong, when I hear the friend I thought I had for a lifetime, had gone away for ever.

How can I be strong when the last image of him I had in my head was of the kid who hoped to live a healthy funny life.

How can I be strong when I could not even say a final goodbye. I couldn’t even see his body for one last time.

Days, weeks, and months pass by. I wanted to accept the reality, but till this very day, I wake up on most days empty and have all these thoughts about how we grew up as brothers.

Chemistry paper was out after valuation, and I still remember the then chemistry teacher asking me in-front of the whole class about what was wrong with me. She wasn’t expecting me to do this poorly in the exam for she knew my mom who also happens to teach the same subject.

As much as I wanted to shout to the whole class that I had just lost my best friend, I kept quiet with my head down. I felt so much pain that day that I tore up the answer paper and threw it away on my way back home. I don’t think any of my classmates knew about the whole Kichu scene.

I was scared to talk about it and have only told this to a close friend of mine, once. I am still scared and it hurts hell to write this draft, which I don’t know if I would be able to publish.

Almost 10 years have gone by, and when I look back at my childhood, at least I can still picture the little kid with a bright smile who always had my back.

This is a Eulogy for you buddy:

You have shown me the courage to fight with hope. You are my brother, I miss you very badly and I’ll always carry you with me. I wish you could see me now. I wasn’t ready for you to go yet, but you left me with no other choice. Growing up into adulthood without you was hard, and I’m still finding it hard to believe it’s been 10 years since you left.

I feel so proud and honored to have shared the kind of brotherhood and love we had for each other for 16 years, but how I wish I could get more of those.

I know you tried hard and I understand why you had to give up. I know you faked a lot of smiles towards the end, but I know you did it for a reason. If there is something life has taught me from all that you went through, it’s that “there are some people who always find a reason to make others happy, even when they know that they are dying”. I don’t really believe in afterlife and stuff. All the people who know me should now know why I gave up on the concept of God, for God wasn’t there when I needed him. I don’t trust someone who doesn’t show up when you need them to. So God, for me, died with Kichu.

Love you, my brother.


NB. This post is for remembering my friend and also to help me let go of some of the lingering pain and heaviness of heart. This post doesn’t really tell half the pain I still have, and I could never write something that does. For people who have been in my shoes or are currently in them, please find strength by holding on to the good memories of the ones you lost.

by Sreenadh T C at February 17, 2020 09:14 PM

January 24, 2020

Rajeesh K Nambiar

Odoo in a root-less container

The main workstation, now running Fedora 31 and devoid of any trace of python2, meant I had to either spin up a virtual machine (which I happily did in the past using qemu and kvm [no libvirt or GNOME Boxes]) or get my hands dirty with containers this time, to develop on Odoo [1] version 10, which depends on python2. Faced with the challenge^Wopportunity, I started to learn to use containers.

I had never tried docker, even though I am familiar with the technology and at times wanted hands-on experience with it. Fast forward: podman and buildah came along with the possibility to run root-less containers, and they’re available in Fedora.


Install and setup podman, optionally buildah. Consult documentation at Red Hat developer blog [2] posts [3].

$ su -c "dnf install -y podman buildah"
#Make sure your user is present in subuid and subgid
$ su -c "usermod --add-subuids 10000-75535 $(whoami); \
  usermod --add-subgids 10000-75535 $(whoami)"
#Log out and log back in for the changes to take effect for the normal user.

Setup and run postgresql using podman. The documentation [4] on docker hub and at Red Hat [5] will help, also Dan Walsh’s post [6]. You’d want persistent storage for database and the application.

#Create a persistent storage location for DB
$ su -c "mkdir -p /var/container/pgsql10/data"
#Make sure to give ownership of the directory to the 'postgres' user
#as seen from the _container_. Inside the container, 'postgres' has
#id '26', which maps to id '10025' on the host.
$ su -c "chown -R 10025:10025 !$"  # directory created in previous step
#As normal user
$ podman run -d --name pg10 -e POSTGRESQL_USER=odoo \
  -e POSTGRESQL_PASSWORD=odoopassword -e POSTGRESQL_ADMIN_PASSWORD=postgrespassword \
  -e POSTGRESQL_DATABASE=postgres -p 9432:5432 \
  -v /var/container/pgsql10/data:/var/lib/pgsql/data \
  -m=1g rhscl/postgresql-10-rhel7
#Check logs
$ podman logs -f pg10
#Connect the database and grant privileges to 'odoo' user
$ psql -U postgres -h <host-ip> -p 9432 -d postgres

These steps warrant some comments. To set up persistent storage for the database, create a directory and give ownership of it to the corresponding user in the container (refer to [6] for details). The user id for chown should be the host-mapped id of the user within the container: for example, if the id of the postgres user inside the container is 26, that same user will usually appear as id 10025 on the host.
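To see this mapping on a particular system, the following commands can help (a sketch; the exact ranges depend on the entries added to /etc/subuid and /etc/subgid earlier):

```shell
# Show the subordinate uid/gid ranges allotted to the current user
grep "$(whoami)" /etc/subuid /etc/subgid

# Enter the root-less user namespace and print its uid map:
# container uid 0 maps to your own uid, and container uids 1..65536
# map into the subordinate range, so container uid 26 ('postgres')
# appears as host uid 10025 when the range starts at 10000.
podman unshare cat /proc/self/uid_map
```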

Next you can pull the postgresql docker image and run it. The environment variables given with the -e option are passed into the container. The -p option creates a port mapping between host and container. The -v option provides a volume (persistent storage) mapping between a host directory and a container directory. The -m option provides the memory restriction required for postgres to auto-tune. If everything goes well, a container named pg10 is created and runs as a daemon process. Check the status using podman ps -a or the logs using podman logs -f pg10.

We are running postgresql with user odoo, and this user should be able to create databases. Log in to the database as the superuser postgres with the password specified in POSTGRESQL_ADMIN_PASSWORD, connecting to the IP address of the host machine (localhost doesn’t work) on host port 9432. Then grant the CREATEDB privilege to the odoo user.
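A sketch of that grant step, using the port mapping and credentials from above (the host IP is whatever address your machine has; localhost does not work here):

```shell
# Allow the 'odoo' role to create databases; CREATEDB is a role
# attribute in PostgreSQL, granted via ALTER USER / ALTER ROLE.
psql -U postgres -h <host-ip> -p 9432 -d postgres \
  -c 'ALTER USER odoo WITH CREATEDB;'
```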

Once postgresql is running successfully, let us create another container to run odoo. We also want the odoo container to use the database server running in the pg10 container, connected using the same POSTGRESQL_USER. There are a couple of ways to connect to another container — one is using host port mapping and the other is using a pod; see [7] for details. I chose the first option. Before running the container, create volume mappings for the configuration and addons directories.

$ mkdir -p $HOME/odoo10_conf
$ cat > $HOME/odoo10_conf/odoo.conf << EOF 
; podman postgresql communication using port mapping
db_host =
db_port = 9432
db_user = odoo
db_password = odoopassword
addons_path = /mnt/extra-addons
data_dir = /var/lib/odoo
EOF
#Create and run odoo container
$ podman run -d -v $HOME/odoo10_conf:/etc/odoo \
  -v /opt/odoo/addons/odoo10:/mnt/extra-addons -p 9010:8069 \
  --name odoo10  odoo:10

We need to be able to control and pass the Odoo configuration from the host system. Create a directory, place the configuration file in it, and map it to /etc/odoo/odoo.conf in the container. Similarly, you would do addons development on your host machine, so map the addons directory, which the container expects at /mnt/extra-addons. The host port 9010 is mapped to container port 8069 used by Odoo.

That’s it.

Connect to Odoo by going to localhost:9010 and build your next application.

Oh — you can stop the container using podman stop pg10 and start using podman start odoo10 etc.

Update (15-Feb-2020)

What if you would like to run both these containers in a pod to provide easier network access between them? This might be desired for various reasons: for example, the IP address of the host machine may change, and yet you want to access the database from the Odoo container without adjusting the IP in db_host.

The solution is to put both database and application (Odoo) containers in a single “pod”. A new pod can be created using podman pod create --infra -p hostport:containerport <podname> and while creating containers using podman run, pass the <podname> as podman run --pod <podname> .... It is important to specify all the ports you need to access from the host while creating this pod — it is not possible to add port mappings afterwards. Since I’d only need to access Odoo from the host, it would suffice to specify the port mapping -p 7069:8069. In short:

$ podman pod create --infra -p 7069:8069 --name odb
$ podman run -d --pod odb --name pg10 -e ...
$ # Make following changes to odoo.conf file
# db_host = localhost
# db_port = 5432
$ podman run -d --pod odb --name odoo10 ...


  1. Odoo
  2. Red Hat developer blog, Introduction to podman
  3. Red Hat developer blog, Podman and buildah for docker users
  4. Docker hub, PostgreSQL 10 on CentOS 7
  5. Red Hat documentation, Software collections docker images — PostgreSQL
  6. Dan Walsh, Does root-less podman make sense?
  7. Red Hat, Configuring container networking with podman

by Rajeesh at January 24, 2020 12:29 PM

January 01, 2020

Balasankar C

FOSS contributions in 2019


I have been interested in the concept of Freedom - both in the technical and social ecosystems - for almost a decade now. Even though I am not a hardcore contributor or anything, I have been involved in it for a few years now - as an enthusiast, a contributor, a mentor, and above all an evangelist. Since 2019 is coming to an end, I thought I would note down what all I did last year as a FOSS person.


My job at GitLab is that of a Distribution Engineer. In simple terms, I have to deal with anything that a user/customer may use to install or deploy GitLab. My team maintains the omnibus-gitlab packages for various OSs, docker image, AWS AMIs and Marketplace listings, Cloud Native docker images, Helm charts for Kubernetes, etc.

My job description is essentially the above mentioned tasks only, and as part of my day job I don’t usually have to write any backend Rails/Go code. However, I also find GitLab a good open source project and have been contributing a few features to it over the year. A few main reasons I started doing this are

  1. An opportunity to learn more Rails. GitLab is a pretty good project to do that, from an engineering perspective.
  2. Most of the features I implemented are the ones I wanted from GitLab, the product. The rest are technically simpler issues with less complexity (related to the point above, regarding getting better at Rails).
  3. I know the never-ending dilemma our Product team goes through to always maintain the balance of CE v/s EE features in every release, and to prioritize appropriate issues from a mountain of backlog in each milestone. In my mind, it is easier for both them and me if I just implement something rather than ask them to schedule it to be done by a backend team, so that I can enjoy the feature. To note, most of the issues I tackled already had the Accepting Merge Requests label on them, which meant Product was in agreement that the feature was worth having, but there were issues with more priority to be tackled first.

So, here are the features/enhancements I implemented in GitLab, as an interested contributor in the selfish interest of improving my Rails understanding and to get features that I wanted without much waiting:

  1. Add number of repositories to usage ping data
  2. Provide an API endpoint to get GPG signature of a commit
  3. Add ability to set project path and name when forking a project via API
  4. Add predefined CI variable to provide GitLab FQDN
  5. Ensure changelog filenames have less than 99 characters
  6. Support notifications to be fired for protected branches also
  7. Set X-GitLab-NotificationReason header in emails that are sent due to explicit subscription to an issue/MR
  8. Truncate recommended branch name to a sane length
  9. Support passing CI variables as push options
  10. Add option to configure branches for which emails should be sent on push

Swathanthra Malayalam Computing

I have been a volunteer at Swathanthra Malayalam Computing for almost 8 years now. Most of my contributions are towards various localization efforts that SMC coordinates. Last year, my major contributions were improving our fonts build process to help various packaging efforts (well, selfish reason - I wanted my life as the maintainer of Debian packages to be easier), implementing CI based workflows for various projects and helping in evangelism.

  1. Ensuring all our fonts build with Python3
  2. Ensuring all our fonts have proper appstream metadata files
  3. Add an FAQ page to Malayalam Speech Corpus
  4. Add release workflow using CI for Magisk font module


I have been a Debian contributor for almost 8 years, became a Debian Maintainer 3 years after my first stint with Debian, and have been a Debian Developer for 2 years. My activities as a Debian contributor this year are:

  1. Continuing maintenance of fonts-smc-* and hyphen-indic packages.
  2. Packaging of the gopass password manager. This has been going on very slowly.
  3. Reviewing and sponsoring various Ruby and Go packages.
  4. Help GitLab packaging efforts, both as a Debian Developer and a GitLab employee.

Other FOSS projects

In addition to the main projects I am a part of, I contributed to a few FOSS projects last year, either out of personal interest or as part of my job. They are:

  1. Calamares - I initiated and spearheaded the localization of Calamares installer to Malayalam language. It reached 100% translated status within a month.
  2. Chef
    1. Fix openSUSE Leap and SLES detection in Chef Ohai 14
    2. Make runit service’s control commands configurable in Chef Runit cookbook
  3. Mozilla - Being one of the Managers for Malayalam Localization team of Mozilla, I helped coordinate localizations of various projects, interact with Mozilla staff for the community in clarifying their concerns, getting new projects added for localization etc.


I also gave a few talks during 2019 on various FOSS topics that I am interested/knowledgeable in. The list and details can be found on the talks page.

Overall, I think 2019 was a good year for the FOSS person in me. Next year, I plan to be more active in Debian because from the above list I think that is where I didn’t contribute as much as I wanted.

January 01, 2020 06:00 AM

November 25, 2019

Rajeesh K Nambiar

Public statement by Rachana Institute of Typography on the copyright/credit issue of SMC and RIT fonts

About us

We — KH Hussain, CV Radhakrishnan, PK Ashok Kumar and KV Rajeesh — are the copyright holders of TN Joy font. Many of us have worked on free/libre/open source software for years in our spare time and contributed code, design, fonts, documentation, localization and financial support to various free software projects. Our contributions can be found easily on the Web and elsewhere.

A copyright/‘credit’ issue

Immediately after the font ‘TN Joy’ was released to the public by Rachana Institute of Typography (RIT) on 2-Oct-2019, Santhosh Thottingal raised a question in a forum with a large enough number of participants to qualify as a public discussion:

@rajeeshknambiar there are lot (sic) of contributions from me, Kavya in the build scripts, tests, and feature files in Consider giving credit.

On 14-Oct-2019, Santhosh followed up again.

@rajeeshknambiar did not reply to my request for giving credits in their font.


Ask hussain sir to give credits for font testing and building framework. Crediting anivar alone is not enough.

To which Rajeesh responded on 19-Oct-2019, agreeing to discuss the issue with all the copyright holders of TN Joy:

“Noted. I will try to take it up for discussion and let you know.”

On 29-Oct-2019, Santhosh again followed up:

അങ്ങനെ എഴുതുകയും ചെയ്യുകയും ചെയ്ത ഫോണ്ടിന്റെ കാര്യങ്ങൾക്ക് ക്രെഡിറ്റ് കിട്ടിയില്ലെന്നാണ് പറയുന്നത് അനിവർ:) sundar, and janayugam fonts. ഇതിൽ രാജാജിയുടെ ഹെൽപ്പൊന്നും വേണ്ട. even @rajeeshknambiar can just fix it

[Translation: What Anivar is saying is that credit was not given for the fonts that were thus written and made :) sundar, and janayugam fonts. Rajaji’s help is not needed for this. even @rajeeshknambiar can just fix it]


During the first week of Nov-2019, at the summit organized by Kerala Media Academy, all the copyright holders of the TN Joy font met and discussed the issue raised by Santhosh.

As free software developers and users, it was never our intention to violate copyright or appropriate credit of another free software developer’s work. Not only in intention: we have strived to achieve that in all our projects through our acts. So this accusation came as a surprise to us, and we decided to take a deeper look at how this issue originated and what the root cause is, in order to address it properly.

We did a detailed analysis and documented the following details.

Technical background

  1. A Malayalam Unicode font has two essential parts — the Glyphs (അക്ഷരരൂപങ്ങൾ) and the OpenType shaping lookup rules. Unlike Latin fonts, both of these are necessary for proper shaping. The final TTF/OTF/WOFF2 file contains both the Glyphs and the OpenType shaping rules, making a Malayalam Unicode font usable software. Without either, such software is not usable.
Figure 1: Malayalam text without shaping (left) and with correct shaping (right).
  2. The Malayalam OpenType features (GSUB and GPOS ‘lookup rules’) used in the font ‘TN Joy’ developed by Rachana Institute of Typography (RIT) are adapted from those of the font ‘Sundar’, which in turn are adapted from the feature file of ‘Rachana’.
  3. Many have contributed over the years to the development of the feature file of Rachana, including the original author Hussain KH, Suresh P, Santhosh Thottingal, Rajeesh KV, Kavya Manohar et al. [1].
  4. Hussain KH invented and implemented the glyph naming conventions (‘k1’ for ‘ക’, ‘xx’ for ‘്’ etc., instead of names like ‘uni0D15’), which made font feature coding highly comprehensible and much easier to maintain. This naming scheme is followed by all fonts maintained by Swathanthra Malayalam Computing (SMC) and RIT. This was also the naming scheme in the fonts developed by ATPS, and when it was pointed out that those fonts were derived from SMC’s, the immediate change made was renaming the glyphs and lookup rules [2, 3, 4].
  5. Rajeesh is the original author of the lookup rules of SMC’s fonts for the revised ‘mlm2’ OpenType specification for Malayalam, and made it possible to support both the ‘mlym’ and ‘mlm2’ specifications in a single font. This resulted in a single font working well both with Windows XP and Pango/Qt4 era applications and with Uniscribe and HarfBuzz era applications [5].
  6. In 2015, Santhosh split the comprehensive lookup rules from the Fontforge SFD file of Rachana into a separate feature file, but the copyright statements were not preserved [6]. It is our opinion that removing copyright statements is a violation of copyright law (hence a crime) and immoral in the free software world. This is also the root cause of the missing copyright in the OpenType lookup rules and build script of the fonts in question.
  7. These same lookup rules are used and adapted by subsequent fonts developed by SMC and RIT, such as Chilanka, Manjari, Sundar, Gayathri, TN Joy etc. Rajeesh did not claim credit or copyright when Manjari or Gayathri was released.

RIT’s statement

With this background,

  1. Fonts developed, maintained and distributed by both SMC and RIT, specifically their OpenType lookup rules + fontforge based build tool + test cases, are at the heart of this issue. This was caused by the change introduced by Santhosh in [6].
  2. The copyright holders of the TN Joy font were made aware of such a ‘credit’ issue — the definition of which Santhosh has not clarified, and which in RIT’s understanding is sufficient and limited to ‘copyright’. Thanks for bringing to light such a potential legal and moral risk that affects the users and organizations using these fonts.
  3. RIT would like to acknowledge the copyright of Santhosh Thottingal and Kavya Manohar for the development of ‘Sundar’ and ‘TN Joy’ in the areas of the lookup rules, the ‘build script’ and the comprehensive ‘test file’. RIT is willing to add the missing copyright notices to these files;

and RIT asked Santhosh to consider:

  1. Preserve the copyright of the original authors of the ‘lookup rules’ and Naming convention (notation for Glyphs) in all these fonts. The copyright and license statement should read:

“Copyright: Digitized data copyright (c) 2004–2005 Rachana Akshara Vedi (Chitrajakumar R, Hussain KH, Gangadharan N, Vijayakumaran Nair, Subash Kuraiakose), (c) 2006–2016 Hussain KH, Suresh P, Santhosh Thottingal, Rajeesh K Nambiar, Swathanthra Malayalam Computing ( This file is licensed under OFL  1.1.”

  2. The Fontforge based ‘build script’ added by Santhosh, used to generate TTF/OTF/WOFF/WOFF2 files, is adapted from that of the Amiri font by Khaled Hosny [7] without preserving copyright or attribution. RIT requests crediting the original author[s] of this tool. It is our opinion that removing copyright statements from free software code is illegal and immoral. It is also hypocritical when a person who asserts his own credit does this to another well-known and respected free software developer.
  3. Test cases in the ‘test file’ were contributed by various contributors; RIT requests adding the attribution of such contributors to the extent possible (Kavya Manohar, Santhosh Thottingal, Rajeesh KV). Santhosh has responded to this request saying “test cases were mainly prepared by Kavya and no need to have attribution”, but RIT firmly believes the copyright statements of the contributors must be added.
  4. The original author of the ‘mlym.sty’ file [8] to typeset Unicode Malayalam using XeTeX is Suresh P; it was enhanced by Rajeesh KV with inputs from Hussain KH. Due to frequent requests on how to typeset Unicode Malayalam, in 2013 Rajeesh wrote a wiki page [9] with basic details, which was later extended by other developers with instructions to install and set up XeTeX packages. This wiki article was later extended by Santhosh by adding material from Wikipedia. The article was then copied and published on Santhosh’s blog [10] without attributing the authors, and [10] is frequently given by Santhosh as the first response to the general public asking for documentation on how to typeset Malayalam using XeTeX. It is shockingly hypocritical that plagiarism is practised by a well known free software developer who asserts his own credit without any respect to others’ copyright or credit. RIT would like Santhosh to either: (a) redact [10] and redirect to [9] instead, or (b) credit the original authors in [10].

RIT  stopped the analysis and investigation of Santhosh’s claim at this point, as we have identified the root cause of missing copyrights and these are the important topics directly affecting RIT  developers.


RIT tried to resolve the issue in private discussion with Santhosh Thottingal, but unfortunately it did not succeed. Santhosh has not agreed to reinstate the copyright statements of the original authors. Santhosh did not respond to many of the pointed questions we raised and deflected others. Santhosh also refused to clarify what he means by ‘credit’ despite repeated pointed questions. Santhosh withdrew his claim for credit in one of the emails; it is possible that he could change his mind at any time and the issue could resurface. This shrouds the fonts by SMC and RIT in Fear, Uncertainty and Doubt (which corporate proprietary companies successfully used against free software for years) and puts all the individual users, organizations and developers using these fonts under legal risk and moral ambiguity.


  1. RIT has added proper copyright statements to all the software used in building its fonts, viz. ‘Sundar’ and ‘TN Joy’ [11,12].
  2. RIT believes that our primary responsibility is towards the individual and institutional users of our fonts and the developers depending on our tools; they should be able to use our fonts and tools without any legal risk or moral ambiguity. RIT, to the best of its knowledge, has fulfilled that responsibility and strives to keep doing so.
  3. RIT also understands that, as with any issue in the free software world, the community will be divided, and that is a painful thing. RIT requests the community to carefully consider all the facts before making a choice.

This will be the final public statement of RIT on the copyright issue raised by Santhosh Thottingal.


  • KH Hussain
  • CV Radhakrishnan
  • PK Ashok Kumar
  • KV Rajeesh


  1. Rachana font commit history, URL…
  2. Kathir font licensing issue (1), 2014, URL…
  3. Kathir font licensing issue (2), 2014, URL…
  4. ATPS  fonts licensing issue, 2015, URL…
  5. Introducing and integrating ‘mlm2’ OpenType shaping rules, 2013, URL…
  6. Split Glyphs and OpenType shaping rules, 2015, URL…
  7. Amiri font build tool, URL…
  8. XeTEX Malayalam style file for ‘Logbook of an Observer’, 2012, URL…
  9. Typesetting Malayalam using XeTEX, SMC  Wiki page history, 2013, URL…
  10. 2014,…
  11. Sundar font, reinstate copyright and license statements, 2019, URL…
  12. TN Joy font, reinstate copyright and license statements, 2019, URL…

Profile of the signatories

  • KH Hussain
    Library and information scientist by training and profession; font designer and developer of several fonts including Rachana, Meera, Meera Inimai, TN Joy, RSugathan, Janayugom, Keraleeyam, Uroob, etc.; free software activist who released all his fonts under the Open Font License. Played an important role in the migration of the Janayugom daily to free software based production technologies.
  • CV Radhakrishnan
    Free software activist and TeX programmer, one of the founders of the Free Software Foundation of India and Indian TeX Users Group. Organized two annual meetings of the TeX Users Group in Trivandrum in 2002 and 2011. Wrote several packages (libraries) in LaTeX and released under free license (LPPL) at Comprehensive TeX Archive Network (CTAN).
  • PK Ashok Kumar
    Typesetter by profession and training, has four decades of extensive experience in typesetting right from the age of metal typefaces through digitized typesetting including TeX and LaTeX. Free content activist and principal tester for fonts developed by RIT, played a major role in the migration of production of Janayugom daily using free software.
  • KV Rajeesh
    Free software developer and user. Fedora project developer since 2008 and KDE  developer since 2011. Font maintainer and language computing contributor to Swathanthra Malayalam Computing since 2008. Member of Indic testing team for HarfBuzz. Google Summer of Code mentor. Contributes to various free software projects including Qt, GNOME, VLC, Odoo, Fontforge, SILE, ConTeXt, Okular, etc.

by Rajeesh at November 25, 2019 04:48 AM

November 15, 2019

Rajeesh K Nambiar

On data encoding and complex text shaping

As part of the historical move of the Janayugom newspaper migrating to a completely libre software based workflow, Kerala Media Academy organized a summit on self-reliant publishing on 31-Oct-2019. I was invited to speak about Malayalam Unicode fonts.

The summit was inaugurated by Fahad Al-Saidi of Scribus fame, who was instrumental in implementing complex text layout (CTL). Prior to the talks, I got to meet the team who made it possible to switch Janayugom’s entire publishing process to a free software platform — Kubuntu based ThengOS, Scribus for page layout, Inkscape for vector graphics, GIMP for raster graphics, CMYK color profiling for print, new Malayalam Unicode fonts with traditional orthography, etc. It was impressive to see that the entire production fleet was transformed, the team was trained, and the newspaper is printed every day without delay.

I also met Fahad later and was pleasantly surprised to realize that he already knew me from my open source contributions. We had a productive discussion about Scribus.

My talk was on data encoding and text shaping in Unicode Malayalam. The publishing industry in Malayalam is by and large still trapped in ASCII, which causes numerous issues now, and many are still not aware of Unicode and its advantages. I tried to address that in my presentation with examples; the preface of my talk thus filled half of the session, while the second half focused on font shaping. Many in the industry seem to be aware that Unicode and traditional Malayalam orthography can be used in computers now; but many in academia still have not realized it, as was evident from the talk of the moderator of the discussion, who is the director of the school of Indian languages. There was a lively discussion with the audience in the Q&A session. After the talk, a number of people gave me feedback and requested that the slides be made available.

Slides on data encoding and complex text shaping are available under CC-BY-NC license here.

by Rajeesh at November 15, 2019 07:37 AM

October 07, 2019

Rajeesh K Nambiar

WatchData PROXKey digital signature using emSigner in Fedora 30

TL;DR — go to Howto section to make WatchData PROXKey work with emSigner in GNU/Linux system.


Hardware tokens with digital signatures are used for filing various financial documents on Govt of India portals. The major tokens supported by eMudhra are WatchData ProxKey, ePass 2003, Aladdin, Safenet, TrustKey, etc. Many of these hardware tokens come (in CDROM image mode) with drivers and utilities to manage the signatures, unfortunately only for the Windows platform.

Failed attempts

Sometime in 2017, I tried to make these tokens work for signing GST returns under GNU/Linux using the de-facto pcsc tool. I got a WatchData PROXKey, which doesn’t work out-of-the-box with pcsc. Digging further brings up this report, and it seems the driver is a spinoff of the upstream (LGPL licensed) one, but no source code is made available, so there is no hope of using these hardware tokens with upstream tools. The only option is depending on vendor provided drivers, unfortunately. There are some instructions by a retailer to get this working under Ubuntu.

Once you download and install that driver (ProxKey_Redhat.rpm), it does a few things — installs a separate pcsc daemon named pcscd_wd, the driver CCID bundles and certain supporting binaries/libraries. A drawback of such custom driver implementations is that different drivers clash with each other, as each one provides a different pcscd_wd binary and their installation scripts silently overwrite existing files! To avoid any clashes with this pcscd_wd daemon, disable the standard pcscd daemon with systemctl stop pcscd.service.

Plug in the USB hardware token and, to your dismay, observe that it spews the following error messages in journalctl:

Oct 06 09:16:51 athena pcscd_wd[2408]: ifdhandler.c:134:IFDHCreateChannelByName() failed
Oct 06 09:16:51 athena pcscd_wd[2408]: readerfactory.c:1043:RFInitializeReader() Open Port 0x200001 Failed (usb:163c/0417:libhal:/org/freedesktop/Hal/devices/usb_device_163c_0417_serialnotneeded_if1)
Oct 06 09:16:51 athena pcscd_wd[2408]: readerfactory.c:335:RFAddReader() WD CCID UTL init failed.

This prompted me to try different drivers, mostly from the eMudhra repository — including eMudhra Watchdata, Trust Key and even ePass (there were no *New* drivers at this time) — but none of them seemed to work. Many references were to Ubuntu, so I tried various Ubuntu versions from 14.04 to 18.10; they didn’t yield a different result either. At this point, I put the endeavour on the back burner.

A renewed interest

Around September 2019, KITE announced that they would start supporting government officials using digital signatures under GNU/Linux, as most Kerala government offices now run on libre software. KITE have made the necessary drivers, signing tools and manuals available.

I tried this on a (recommended) Ubuntu 18.04 system, but the pcscd_wd errors persisted and the NICDSign tool couldn’t recognize the PROXKey digital token. However, their installation methods gave me a better idea of how these drivers are supposed to work with the signing middleware.

A couple of days ago, with a better understanding of how these drivers work, I figured they should also work on a Fedora 30 system (which is my main OS), and set out for another attempt.

How to

  1. Remove all the wdtokentool-proxkey, wdtokentool-trustkey, wdtokentool-eMudhra, ProxKey_Redhat and suchlike drivers, if installed, to start from a clean slate.
  2. Download WatchData ProxKey (Linux) *New* driver from eMudhra.
  3. Unzip and install the wdtokentool-ProxKey-1.1.1 RPM/DEB package. Note that this package installs the TRUSTKEY driver (under /usr/lib/WatchData/TRUSTKEY/lib/), not the ProxKey driver (under /usr/lib/WatchData/ProxKey/lib/) — and it seems the ProxKey token only works with the TRUSTKEY driver!
  4. Start pcscd_wd.service by systemctl start pcscd_wd.service (only if not auto-started)
  5. Plug in your PROXKey token. (journalctl -f would still show the error message, but — lesson learned — this error can be safely ignored!)
  6. Download emsigner from GST website and unzip it into your ~/Documents or another directory (say ~/Documents/emSigner).
  7. Ensure port 1585 is open in the firewall settings: firewall-cmd --add-port=1585/tcp --zone=FedoraWorkstation (adjust the firewall zone if necessary). Repeat the same command with --permanent added to make the change persist across reboots.
  8. Go to ~/Documents/emSigner in shell and run ./ (make sure to chmod 0755, or double-click on this script from a file browser).
  9. Login to GST portal and try to file your return with DSC.
  10. If you get the error Failed to establish connection to the server. Kindly restart the Emsigner when trying to sign, open another tab in the browser window, go to https://localhost:1585, and then try signing again.
  11. You should be prompted for the digital signature PIN and signing should succeed.

It is possible to use this digital token in Firefox too (via Preferences → Privacy & Security → Certificates → Security Devices → Load, with the Module filename set to the driver library under /usr/lib/WatchData/TRUSTKEY/lib/), as long as the key is plugged in. Here again, you can safely ignore the error message unable to load the module.
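If OpenSC is installed, pkcs11-tool offers a quick way to check that the token is reachable through the vendor module (a sketch; the exact .so filename under the TRUSTKEY lib directory depends on the driver version, so <module>.so below is a placeholder):

```shell
# List the slots exposed by the WatchData PKCS#11 module; the plugged-in
# token should show up with its label. Replace <module>.so with the
# library installed under /usr/lib/WatchData/TRUSTKEY/lib/.
pkcs11-tool --module /usr/lib/WatchData/TRUSTKEY/lib/<module>.so --list-slots
```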

by Rajeesh at October 07, 2019 08:19 AM

August 21, 2019

Sreenadh T C

How Dockup tracks online status of remote agents using Phoenix Presence

“Is our agent online? Let’s ask Phoenix Presence!”

Dockup is a tool that helps engineering teams spin up on-demand environments. We have a UI that talks to several agents which are installed on remote servers. The UI sends commands to these agents over WebSocket connections using Phoenix channels.

What if agent went down?

The commands to spin up and manage environments are sent over to agents running on remote servers. For this to work, we need to make sure our agents are online and ready to receive the commands. In order to do this, we need to keep track of agents assigned to our users and also show the agent’s online status in the UI.

Our first implementation

In the UI, we show whether the agent for the organization is online and ready to receive commands. The first implementation was an old school synchronous “ping” to the agent behind a Retry module, where we ask the agent for a “pong” and relay that back to our UI. This has a problem.

Suppose the agent went down due to some unexpected error on the remote server, or suppose the organization has not yet been configured with a proper agent. If the user now opens the page that shows the agent status, the request would block until the “ping” to the agent times out. Unfortunately this takes some time and makes for terrible UX. No user wants to stare at an empty loading screen, only to find out that their agent is actually down!

Using Phoenix Presence

Phoenix Presence is a feature which allows you to register process information on a topic and replicate it transparently across a cluster. It’s a combination of both a server-side and client-side library which makes it simple to implement. A simple use-case would be showing which users are currently online in an application.

If we can track the online statuses of users in chat rooms, it should be possible to track the online statuses of our agents too. That’s exactly what we did, and here’s a step-by-step guide on how to do it.

Firstly, we need to add Phoenix Presence under the App supervision tree as explained in the official docs.

We then configure our agents channel to use Presence to track the agents that connect to Dockup. After this, we can simply ask Presence whether an agent is present in our app!

Let’s add a function that tells the user if their agent is up, and call that function in our template to render the status in the UI.
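The shape of this setup can be sketched in Python purely for illustration (Dockup’s actual implementation uses Phoenix Presence in Elixir; the class and method names below are invented for the sketch):

```python
# Toy stand-in for Phoenix Presence: agents register themselves when
# their socket connects, so "is it online?" becomes a dictionary lookup
# instead of a synchronous ping with a timeout.
class PresenceRegistry:
    def __init__(self):
        self._online = {}

    def track(self, agent_id, meta=None):
        # Called when an agent joins its channel topic.
        self._online[agent_id] = meta or {}

    def untrack(self, agent_id):
        # Called when the agent's connection drops.
        self._online.pop(agent_id, None)

    def online(self, agent_id):
        # O(1) lookup: no network round-trip, no long timeout.
        return agent_id in self._online


registry = PresenceRegistry()
registry.track("agent-1", {"org": "acme"})
print(registry.online("agent-1"))  # True
print(registry.online("agent-2"))  # False
```

The key design point is that the agent pushes its presence once, on connect, and the UI only ever reads local state.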

Why this is great

By the time the user actually visits the settings page, Presence already knows whether that specific agent has joined the topic. Since this is a very basic key-value lookup, it is going to be super quick. We no longer need to play ping-pong with the agent to know its presence!

Earlier, this page would, in the worst-case scenario, take around 30–50 seconds to render, simply because the agent was down.

[info] Received GET /settings
[info] Sent 200 response in 40255.44ms

Using Presence, the response time came down to around 40–50ms, or even lower.

[info] Received GET /settings
[info] Sent 200 response in 17.86ms
[info] Received GET /settings
[info] Sent 200 response in 49.65ms
[info] Received GET /settings
[info] Sent 200 response in 65.62ms
[info] Received GET /settings
[info] Sent 200 response in 31.12ms
[info] Received GET /settings
[info] Sent 200 response in 51.73ms

The most interesting thing about solving this issue for us was that the PR that went in was tiny (just +40/-1), but the impact it had was significant, something we’ve seen time and again with Elixir!

How Dockup tracks online status of remote agents using Phoenix Presence was originally published in Dockup on Medium, where people are continuing the conversation by highlighting and responding to this story.

by Sreenadh T C at August 21, 2019 06:42 PM

July 31, 2019

Sreenadh T C

How to run E2E tests on on-demand environments

On-demand environments for running end-to-end tests

Be more confident about your code changes by adding end-to-end tests that run for each deployment you create on Dockup.

End-to-end testing is a technique used to verify the correctness of an application’s behavior when it works in integration with all its dependencies.
Running end-to-end tests has become increasingly complicated over time as companies embrace service-oriented architecture and monoliths turn into microservices.

In this blog post, we’ll see how to use Dockup to automatically spin up on-demand environments to run end-to-end tests for every pull request.

We will be explaining this based on Cypress, but you can follow similar steps to configure your favorite E2E tool to run alongside Dockup deployments.

For ease of understanding, let’s use a simple VueJS app that implements a TodoMVC.

We will keep this source under a common project folder, say todomvc-app/, and also create another folder, say todomvc-app/e2e/, where we will write our tests. Cypress test specs are kept under a sub-directory called "cypress", and we will have a cypress.json file inside our e2e folder. See more on how Cypress tests are written in their docs.

Once you have your test specs ready, we need to add a Dockerfile and the whole directory structure would look something like this:

|---- src/
| |---- index.html
| |---- app.js
|---- e2e/
| |----cypress/
| | |---- fixtures/
| | |---- integration/
| | | |---- add_todo_spec.js
| | | |---- mark_todo_spec.js
| | |---- plugins/
| | |---- support/
| |
| |---- cypress.json
| |---- Dockerfile
|---- package.json
|---- Dockerfile

Since the test cases will run against several deployments, we will keep the baseUrl config value for Cypress set to an initial dummy URL, and then override it with environment variables. This is documented by Cypress here.
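For illustration, a minimal cypress.json could carry such a placeholder baseUrl (the URL here is a made-up dummy):

```json
{
  "baseUrl": "http://localhost:8080"
}
```

At run time, Cypress gives environment variables prefixed with CYPRESS_ precedence over the config file, so each deployment can point the tests at its own public endpoint.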

Container for the actual todo-app

Assuming that you have already added the container for the actual app while creating a Dockup Blueprint for your todomvc-app (as shown in the figure above), we will add a new container holding the image source details. If you are new to Dockup Blueprints, head over here to read more about creating one.

Take care with the Dockerfile path here, as this is the one that resides inside our e2e folder.

We will also have to add the CYPRESS_BASE_URL env variable for Cypress to receive a public endpoint for the deployment. This can be done using the Environment Variable Substitution feature (refer to DOCKUP_PORT_ENDPOINT_ ) in Dockup.

The Cypress container would exit with the overall number of failed tests as its exit code.

Container form for e2e

That is all you need to do to have a working cypress end-to-end test running alongside each of the deployments.

Since containers inside a Dockup deployment spin up when they are ready and not sequentially, you will need a shell script that waits for the UI endpoint to be live before you start to run tests. The script can simply fail when the endpoint is not live, upon which the Dockup container would restart.
set -x
set -e
echo "Checking if the endpoint for testing is ready..."
response=$(curl --write-out "%{http_code}" --silent --output /dev/null "$CYPRESS_BASE_URL")
if [[ $response != 200 ]]; then
  # Endpoint not live yet: exit so Dockup restarts the container
  exit 1
fi
# Endpoint is up: run the test suite
cypress run

Cypress has its own Docker images configured to run on several CI tools, which you can use on Dockup as well without many changes. All you have to do is put the cypress/ folder at the same level as the Dockerfile, since their images look for it in the root directory; as soon as the container spins up, the cypress run command would run. However, this is not recommended on Dockup for the reason mentioned above. Instead, have the script take care of running the cypress command once the endpoint is up.

Now you can go ahead and deploy this blueprint and have it run the E2E tests for you. Your containers should spin up and the e2e tests should start running. While you wait for them to complete, you can also take a look at the logs.

Image builds are ready

A successful deployment with your e2e tests passed would look something like this.

E2E test has passed, and hence the container has a success check

Checks also send updates to GitHub if the deployments are triggered by PRs.

An example of how Dockup sends updates to GitHub PRs

In the case of Cypress, it exits with a non-zero exit code when there are failures, and thus the container also fails, indicating that there are failed test cases.

Not using Dockup already? Click here to start for free.

How to run E2E tests on on-demand environments was originally published in Dockup on Medium, where people are continuing the conversation by highlighting and responding to this story.

by Sreenadh T C at July 31, 2019 08:43 AM

How to create on-demand environments for WordPress

Spin up on-demand staging environment to test out your custom plugins and themes for WordPress

Setting up a staging environment and maintaining it for every WordPress theme or plugin project can be very daunting. Quite often, when website developers work on design implementations or content creators try to add articles to their website, they seek approval from team members more often than one can imagine. This can be a tedious amount of work and also time-consuming if the team is limited by the availability of staging environments.

Dockup helps you mitigate this problem by providing on-demand staging environments for your WordPress site. Your changes are automatically made available across your team as and when you update them, while letting you concentrate on the design or the article.

In this article, we’ll see how you can use Dockup to automatically spin up on-demand copies of your WordPress site.

How can Dockup help?

Dockup can automatically spin up a staging environment every time you open a PR for your WordPress site. This way, you will have an environment ready at your disposal, with all the changes from the PR. All you have to do is push code, test your changes to the theme, and perhaps show it to your team.

You can also deploy your branches manually on Dockup. This can be super useful when a non-tech team member wants to test how the site looks for any commit or branch. Let’s see how to set this up.

Setting up Dockup

Assuming you have prior knowledge of how and where themes and plugins fit in WordPress, let me quickly set up a sample plugin for the sake of this documentation. If you don’t have a current project, follow along with the next step to get a simple source code base we can deploy on Dockup to test things.

This WordPress plugin will append a line to each post that we create. This can be used to add a thanks or goodbye message at the end of each post.

Setup the project folder as below:

  • We have a root project folder called “ending-line-wp-plugin”
  • A file with the plugin code, “ending-line-wp-plugin/ending-line/ending-line.php”
  • Dockerfile for building images for Dockup
Project structure

Copy paste the following code in the ending-line.php file:

Now let’s dockerise this. It’s pretty straightforward: all you have to do is copy the plugin folder into a plugins folder inside “wp-content”.

FROM wordpress:php7.3-apache
WORKDIR /var/www/html
COPY ending-line/ wp-content/plugins/ending-line/

Note that we are not using any scripts to start a MySQL server before running the actual WordPress server. Dockup lets us spin up both containers separately, and we will connect them using environment variables.

Let’s create a Dockup Blueprint for this project and see how we can stage our plugin project.

We will need two containers here:

  1. MariaDB
  2. The GitHub source from which we will build the image.

Make sure that you have set the env variables for the database container. Refer to the ones below for a start.

Container for database

Note that we are using GitHub as the source here, but you can also use a pre-built Docker image of the plugin or theme you are developing. Also, double-check the env variables your project might need; for this one, we need three. To know which env variables are supported by WordPress, head over here.

Some important env variables are:


In case you are wondering what the DOCKUP_SERVICE is, please read more about Environment Variable Substitution available in Dockup.

Container for WordPress server

Looks good, let’s try deploying this environment.

Successfully deployed WordPress

Now, you have a staging environment ready for you to test the plugin you just wrote!

Don’t forget to activate your plugin from the WordPress admin panel. If you are following this sample app, remember to add the text to be appended to every post via Settings > Ending Line Plugin.

Want Dockup on-demand environments for your WordPress sites? Click here to get started.

How to create on-demand environments for WordPress was originally published in Dockup on Medium, where people are continuing the conversation by highlighting and responding to this story.

by Sreenadh T C at July 31, 2019 08:41 AM

How to create on-demand environments for Jekyll Blogs

How to create on-demand environments for Jekyll sites

“See how your Jekyll site and articles look like before you publish them.”

Want to see how your site will turn out before you publish? Just open a PR on your repo and Dockup will spin up a live site for you!

Assuming that you have a Jekyll blog in place, let’s see how we can dockerise it and create a Dockup Blueprint. Here’s the Jekyll site we’ll use: Minima.

We’ll add a couple of files to the root directory:

  1. Dockerfile to build the docker image of our site.
  2. nginx.conf to serve the static site using Nginx.

Create Blueprint

Now that we have a Dockerfile added to the source, let’s create a Dockup Blueprint.

Make sure you have configured your GitHub account with Dockup. If you haven’t done so yet, head over to Dockup Settings.
Container for Jekyll

And that’s all! Wasn’t that easy?

Now every time you open a pull request, Dockup will stage that branch and give you a new deployment as shown below:

Deployed successfully

You can follow similar steps to create on-demand environments for other static site generators, say for e.g. Hugo.

Would you like to test it out on your blog? Click here to get started.

How to create on-demand environments for Jekyll Blogs was originally published in Dockup on Medium, where people are continuing the conversation by highlighting and responding to this story.

by Sreenadh T C at July 31, 2019 08:40 AM

July 03, 2019

Rajeesh K Nambiar

SMC Malayalam fonts updated in Fedora 30

The Fedora package smc-fonts has a set of Malayalam fonts (AnjaliOldLipi, Kalyani, Meera, Rachana, RaghuMalayalamSans and Suruma) maintained by SMC. We used to package all these fonts from a single zip file. These fonts were last updated in 2014 for Fedora, leaving them at version 6.1.

Since then, a lot of improvements were made to these fonts: glyph additions and corrections, OpenType layout changes, a fontTools-based build system, separate source repositories for each font, etc. There were lengthy discussions on the release management of the fonts, which was partially the reason the fonts were not updated in Fedora. Once it was agreed to use a separate version number for each font, and a continuous build+release system was put in place at GitLab, we could ensure that fonts downloaded from the SMC website were always the latest version.

To reflect the updates in Fedora, we had to decide how to handle the monolithic source package at version 6.1 versus the new individual releases (e.g. Rachana is at version 7.0.1 as of this writing). In a discussion with Pravin Satpute, we agreed to obsolete the existing fonts package and give each font its own package.

Vishal Vijayaraghavan kindly stepped up and did the heavy lifting of creating the new packages, and we now even build the ttf font file from the source. See RHBZ#1648825 for details.

With all that in place, in Fedora 30 all these fonts are at their latest versions; for instance, see the Rachana package. The old package smc-fonts no longer exists; instead, each individual package, such as smc-rachana-fonts or smc-meera-fonts, can be installed. Our users will now be able to enjoy the improvements made over the years, including updated Unicode coverage, new glyphs, improved existing glyphs, and much better OpenType shaping.

by Rajeesh at July 03, 2019 07:02 AM

June 08, 2019

Santhosh Thottingal

Markov chain for Malayalam

I have been trying to generate a Markov chain for Malayalam content. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (Wikipedia). For natural language, it represents a probabilistic model of words: the probability that one word can come after another. Such a model can be prepared by feeding a large amount of text to a system that learns the probability of each word transition.

For Malayalam, I used the SMC Malayalam corpora. I used the markovchain python library as the tool to build the model. I had to do some bug fixes and customization to get it working for Malayalam, but the developer of the library was generous to merge my pull requests.

A Markov chain is not interesting to a general user since, as such, it does not provide any direct benefit. But it is a foundation for many applications like speech recognition, handwriting recognition, automatic text generation etc. Mainly, it is used as a tool that predicts the next word given a prompt word. So I built a web application and a web API that predict the next Malayalam word.
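The core idea can be sketched in a few lines of plain Python (this is an illustrative toy, not the markovchain library used for the actual model; the tiny corpus below is made up):

```python
import random
from collections import defaultdict

def build_chain(words):
    """First-order Markov chain: map each word to the words that follow it."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def predict_next(chain, word):
    """Candidate next words for a prompt word, most frequent first."""
    candidates = chain.get(word, [])
    return sorted(set(candidates), key=candidates.count, reverse=True)

def generate(chain, start, length=10):
    """Generate text by repeatedly sampling a next word."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# A made-up two-sentence corpus; the real model was trained on the SMC corpora.
corpus = "നാളെ മഴ പെയ്യും നാളെ വെയിൽ ഉണ്ടാകും".split()
chain = build_chain(corpus)
print(predict_next(chain, "നാളെ"))  # the two words seen after നാളെ in the corpus
```

Prediction is just a table lookup over observed transitions, which is why the web API can answer instantly.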

The application and its source code are available online.

Another interesting application is automatic text generation. Some sample texts generated:

നാളെ വീണ്ടും ഉപേക്ഷിയ്ക്കപ്പെടുകതന്നെയായിരിക്കില്ലേ അവരുടെ ഉല്പന്നങ്ങളെക്കുറിച്ചുള്ള വിശദാംശങ്ങൾ പ്രസിദ്ധീകരിക്കാനായി കമ്പനിയെ സമ്മതിപ്പിക്കാൻ നമുക്കാകുന്നുണ്ടു്. ചിലപ്പോൾ സമൂഹവുമായി സഹകരിച്ചും നമ്മുടെ കമ്പ്യൂട്ടറുകളിലും ഡിജിറ്റൽ.’

നാളെ കാലത്തു കുറച്ചു വെള്ളം കോരിയൊഴിച്ചു കുടം നിറച്ചു കഞ്ഞിയുണ്ടായി. അതു വരുമ്പോൾ കുട്ടികളുടെ ഒന്നും ചേർന്നു തന്നെ. വരികളോർമ്മിച്ച് ആസ്വദിച്ച് കൊണ്ടുള്ള കഞ്ഞിയോ പുഴുക്കോ ആയിരുന്നു വലിയൊരു കൂട്ടത്തിന്റെ വിലാപത്തിന്റെ സംഗീതികതന്നെയായി മാറുകയാണ് ഈ.

ഇനിയും വല്ലതും തിന്നുകയും ചെയ്തതിന്റെശേഷം കൊട്ടാരംവക ആനയെ അതുവരെ ഇവിടെ വന്നു തുടങ്ങി. എങ്കിലും നിന്റെ കമ്പ്യൂട്ടറിനെ അനുഗ്രഹിക്കുന്നു കുട്ടീ. വികസിപ്പിക്കാവുന്ന ടെക്സ്റ്റ് ബുക്കായി ഉപയോഗിക്കാവുന്ന തരത്തിൽ അതിനെപറ്റി സങ്കൽപ്പിക്കാൻ സാധ്യമല്ല. അതുകൊണ്ട്, ആസന്നമായിരിക്കുന്നുവെന്ന് എല്ലാ ജില്ലകളിലും കളക്ടർമാരുടെ നേതൃത്വത്തിൽ നടത്തിയ നിക്ഷേപവുമാണു്, അല്ലാതെ മറ്റൊരു സുഖം. സ്കൂൾജീവിതം കഴിഞ്ഞപ്പോൾ അതു് നിങ്ങളുടെ പിന്തുണ ഉറപ്പാക്കാനായിട്ടില്ല. ഇതു് നിസ്സാരകാര്യമല്ല. ഭാരതി എയർടെൽ സീറോയുടെ ഭാഗമായി ബിയർ പാർലറിന്റെ ചുമരിടിച്ചു തകർത്താണ് സർജെന്റ് ഐസക്കും, കൂട്ടാളികളും ചെക്കോസ്ലോവാക്യൻ മണ്ണിൽ പിറക്കണമെന്ന് ജനിക്കാനിരിക്കുന്ന പെൺകുഞ്ഞ് ഭീതി കലർന്ന വാർത്തകൾ വിശ്വസിച്ച് ഈ വിവരങ്ങൾ നിങ്ങൾക്കു നശിപ്പിക്കാം, തോല്പിക്കാനാവില്ല എന്ന ചോദ്യം 3: പ്രോലിറ്റേറിയന്മാർ എക്കാലത്തുമുണ്ടായിരുന്നില്ലെന്നല്ലേ ഇതിന്റെ ഏഴിരട്ടിയുണ്ടെന്നോർക്കുക. ചുറ്റോടുചുറ്റുമുള്ള കടലോരങ്ങളുടെ ചാരുത മുതൽ അവസാനംവരെ അവന്റെ കചക്കയറിന്മേൽ കെട്ടി ചിലപ്പോഴൊക്കെ നമ്മളെ ഭയപ്പെടുത്തുന്നതാണെന്നു് നാം ഭൂമിക്കുചുറ്റും മണിക്കൂറിൽ 1600 – ൽ കൂടുതൽ.

Have fun!

by Santhosh Thottingal at June 08, 2019 04:23 AM

Updated web interface for mlmorph

The web interface of the Malayalam morphology analyser (mlmorph) has been updated. The new web application is written in Vue.js using the Vuetify UI framework; the backend is Flask. Source code is available in the project repository. The updated interface covers:

Morphology analysis
Morphology generator
Named entity recognition
Number spellout

by Santhosh Thottingal at June 08, 2019 03:48 AM

June 05, 2019

Santhosh Thottingal

Chilanka version 1.400 released

A new version of the Chilanka typeface is available now. Version 1.400 is available for download from SMC’s font download and preview site.

For users, there are not many changes, but the source and build system got a major upgrade.

  • Source code updated to UFO format from the FontForge SFD format. This allows working with modern font editors.
  • Use cubic beziers for the master design and generate OTF along with TTF. The original drawings for Chilanka used cubic beziers.
  • fontmake is used for building the TTF and OTF, similar to the latest font projects by SMC.
  • fontbakery is used for tests; all tests pass now.
  • Added a few important missing Latin glyphs, reported by fontbakery.

by Santhosh Thottingal at June 05, 2019 02:58 PM

May 26, 2019

Santhosh Thottingal

Lexicon Curation for Mlmorph

One of the key components of mlmorph is its lexicon. The lexicon contains the root words categorized as nouns, verbs, adjectives, adverbs etc. These are combined with morphological rules to generate the vocabulary of Malayalam. I collected an initial lexicon of about 100,000 words from various sources such as Wikipedia, CLDR and many targeted web crawls. One problem with such collected words is that they often contain spelling mistakes. Secondly, classifying these words is not possible without the tedious task of a person going through each and every word.

So, I was thinking of a solution which consists of:

  • A crawler or multiple targeted crawlers looking for candidate words. For example, I can write a script to go through the entire Malayalam Wikipedia dump and look for words that are most probably nouns, inflected nouns, or words derived from nouns. This is possible with some kind of pattern matching. For example, a word ending with -യുടെ, -ിന്റെ, -ിൽ, or -യെ is most probably a noun (we don’t know whether it is a pronoun, place name or person name; that requires human curation). A word ending with -ക്കുക, -ച്ചു, -ട്ട്, or -ിരുന്നു is most probably a verb.
  • A database and an application that help a person quickly approve the prediction, remove misspelled words, edit words to correct mistakes, and choose the correct POS tag.
  • A set of scripts that take the curated words into the mlmorph lexicon. Also, as mlmorph learns new root words, the database will require a refresh, since mlmorph starts recognizing words related to the newly learned words.
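A first cut of such suffix-based pattern matching can be sketched with plain Python string operations (the suffix lists come from the post; the tag labels and function names are my own):

```python
import re

# Suffix patterns from the post: words ending in these are most probably
# (inflected) nouns or verbs. The "?" marks them as guesses needing curation.
NOUN_SUFFIXES = ("യുടെ", "ിന്റെ", "ിൽ", "യെ")
VERB_SUFFIXES = ("ക്കുക", "ച്ചു", "ട്ട്", "ിരുന്നു")

def guess_pos(word):
    """Guess a candidate POS tag for a word based on its suffix."""
    if word.endswith(NOUN_SUFFIXES):
        return "noun?"
    if word.endswith(VERB_SUFFIXES):
        return "verb?"
    return None

def candidates(text):
    """Yield (word, guessed tag) pairs from a text, for human curation."""
    for word in re.findall(r"[\u0d00-\u0d7f]+", text):  # Malayalam block
        tag = guess_pos(word)
        if tag:
            yield word, tag

print(guess_pos("മരത്തിന്റെ"))  # noun?
print(guess_pos("പഠിച്ചു"))  # verb?
```

Anything this guesser flags still goes through the curator application; the tag is only a prediction to speed up human review.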

Over the last few days, I have been working to implement this system. Interestingly, I was also learning and practicing Vue.js. I was amazed by the productivity it gives for quickly building clean and fast modern web applications, so I decided to use it for my curator application. For the database, I found Firebase with Vuefire to be a perfect fit. Vuetify helped with quick UI styling. Without writing any specific code for database management, I got the whole system working.

Screenshot of the lexicon curator application. The two words shown here are misspelled, so I can quickly remove them. The prediction for both words is Verb.

The mobile-friendly application allows me to do this otherwise tedious task as a leisure activity. After adding some user authentication, I will make it public and share it with some friends. The source code for the curator and the mlmorph scripts is available in their repositories.

by Santhosh Thottingal at May 26, 2019 09:18 AM

May 19, 2019

Rajeesh K Nambiar

Okular: another improvement to annotation

Continuing with the addition of line terminating style for the Straight Line annotation tool, I have added the ability to select the line start style also. The required code changes are committed today.

Line annotation with circled start and closed arrow ending.

Currently it is supported only for PDF documents (and poppler version ≥ 0.72), but that will change soon — thanks to another change by Tobias Deiminger under review to extend the functionality for other documents supported by Okular.

by Rajeesh at May 19, 2019 01:40 PM

May 07, 2019

Rajeesh K Nambiar

Okular: improved PDF annotation tool

Okular, KDE’s document viewer has very good support for annotating/reviewing/commenting documents. Okular supports a wide variety of annotation tools out-of-the-box (enable the ‘Review’ tool [F6] and see for yourself) and even more can be configured (such as the ‘Strikeout’ tool) — right click on the annotation tool bar and click ‘Configure Annotations’.

One of the annotation tools my colleagues and I frequently wanted to use is a line with an arrow to mark an indent. Many PDF annotation programs have this tool, but Okular was lacking it.

So a couple of weeks ago I started looking into the source code of Okular and Poppler (the PDF library used by Okular) and noticed that both of them already have support for the ‘Line Ending Style’ of the ‘Straight Line’ annotation tool (internally called the TermStyle). Skimming through the source code for a few hours and adding a few hooks, I could add an option to configure the line ending style for the ‘Straight Line’ annotation tool. Many line ending styles are provided out of the box, such as open and closed arrows, circle, diamond etc.

An option to the ‘Straight Line’ tool configuration is added to choose the line ending style:

New ‘Line Ending Style’ for the ‘Straight Line’ annotation tool.

Here’s the review tool with ‘Open Arrow’ ending in action:

‘Arrow’ annotation tool in action.

Once happy with the outcome, I’ve created a review request to upstream the improvement. A number of helpful people reviewed and commented. One of the suggestions was to add icon/shape of the line ending style in the configuration options so that users can quickly preview what the shape will look like without having to try each one. The first attempt to implement this feature was by adding Unicode symbols (instead of a SVG or internally drawn graphics) and it looked okay. Here’s a screen shot:

‘Line End’ with symbols preview.

But it had various issues: some symbols are not available in Unicode, and localizing these strings without some context would be difficult. So, for now, it was decided to drop the symbols.

For now, this feature works only on PDF documents. The patch is committed today and will be available in the next version of Okular.

by Rajeesh at May 07, 2019 01:40 PM

March 28, 2019

Rajeesh K Nambiar

Meera font updated to fix issue with InDesign

I have worked to make sure that the fonts maintained at SMC work with both the mlym (Pango/Qt4/Windows XP era) OpenType specification and the mlm2 (Harfbuzz/Windows Vista+ era) specification in the same font. These have also been tested in the past (around 2016) with Adobe applications, which use their own shaping engine (neither Harfbuzz nor Uniscribe, though the internet tells me there are plans to use Harfbuzz in the future).

Some time ago, I received reports that typesetting articles in Adobe InDesign using Meera font has some serious issues with Chandrakkala/Halant positioning in combination with conjuncts.

When the Samvruthokaram/Chandrakkala ് (U+0D4D) follows a consonant or conjunct, it should be placed at the ‘right shoulder’ of the consonant/conjunct. But in InDesign (CC 2019), it appears incorrectly on the ‘left shoulder’. This incorrect rendering is highlighted in the figure below.

Wrong chandrakkala position before consonant in InDesign.

The correct rendering should have the Chandrakkala appearing at the right of the consonant, as in the figure below.

Correct chandrakkala position after consonant.

This issue manifested only in Meera, not in other fonts like Rachana or Uroob. Digging deeper, I found that only Meera has a Mark-to-Base GPOS positioning lookup rule for the Chandrakkala. This was done (instead of adjusting the left bearing of the Chandrakkala glyph) to make it appear correctly on the ‘right shoulder’ of the consonant. Unfortunately, InDesign seems to get this wrong.

To verify, shaping involving the Dot Reph ൎ (U+0D4E), which is also engineered as a Mark-to-Base GPOS lookup, was checked. And sure enough, InDesign gets it wrong as well.

Dot Reph position (InDesign on left, Harfbuzz/Uniscribe on right)

The issue has been worked around by removing the GPOS lookup rules for the Chandrakkala, and the fix was tested with Harfbuzz, Uniscribe and InDesign. I have tagged a new version 7.0.2 of Meera, which is available for download from the SMC website. As this issue affected many users of InDesign, hopefully this update brings them much joy and lets them use Meera again. Windows/InDesign users should make sure that previous versions of the font are uninstalled before installing this version.

by Rajeesh at March 28, 2019 08:38 AM

March 14, 2019

Rajeesh K Nambiar

New package in Fedora: python-xlsxwriter

XlsxWriter is a Python module for creating files in xlsx (MS Excel 2007+) format. It is used by certain python modules some of our customers needed (such as OCA report_xlsx module).

This module is available on PyPI but was not packaged for Fedora. I decided to maintain it in Fedora and created a package review request, which was helpfully reviewed by Robert-André Mauchin.

The package, providing python3 compatible module, is available for Fedora 28 onwards.

by Rajeesh at March 14, 2019 09:42 AM

March 10, 2019

Santhosh Thottingal

LibreOffice Malayalam spellchecker using mlmorph

A few months back, I wrote about the spellchecker based on the Malayalam morphology analyser. I have also been trying to integrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help with its further development, please read on.

Blog post on the spellchecker approach and plan

Current status

The LibreOffice spellchecker for Malayalam is available in its source repository. You can get the code using git clone or download the master version as a zip file.

You need LibreOffice 4.1 or later. Latest version is recommended. In the source code directory, run make install to install the extension.

Open libreoffice writer, add some Malayalam text. Make sure to select the language as Malayalam by choosing it from the menu or bottom status bar. You should see the spelling check in action… if everything goes as expected 😉

LibreOffice language settings, You can see mlmorph listed.
Spellchecker in action- libreoffice writer.

How can you help?

Theoretically, the extension should work on non-Linux platforms as well, but I have not tested it. The extension needs python3 and python-hfst on the operating system, but python-hfst is not available for 64-bit Python installations on Windows. If you test and get the extension working, please add documentation, and let me know if anything is missing that would make the installation easier.

As the mlmorph project gets wider support for Malayalam vocabulary, the quality of the spellchecker improves automatically.

by Santhosh Thottingal at March 10, 2019 10:16 AM

Malayalam Named Entity Recognition using morphology analyser

Named Entity Recognition, the task of identifying and classifying real-world objects such as persons, places, and organizations in a given text, is a well-known NLP problem. For Malayalam, several research papers have been published on this topic, but none is functional or reproducible research.

The morphological characteristics of Malayalam have always been a challenge in solving this problem. When named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words.

As the Malayalam morphology analyser is progressing well, I attempted to build a first version of Malayalam NER on top of it. Since mlmorph gives POS tagging and analysis, there is not much left to do in NER: we just need to look for tags corresponding to proper nouns and report them.
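The approach can be sketched as follows; the analyse callable and the "<proper-noun>" tag below are stand-ins for illustration (mlmorph’s real API and tag names may differ):

```python
def find_named_entities(words, analyse, entity_tag="<proper-noun>"):
    """Report words whose morphological analysis carries a proper-noun tag.

    `analyse` is any callable mapping a word to a list of analysis strings.
    The tag string is an assumption; mlmorph's actual tag names may differ.
    """
    entities = []
    for word in words:
        if any(entity_tag in analysis for analysis in analyse(word)):
            entities.append(word)
    return entities

# A toy analyser standing in for mlmorph, for illustration only.
fake_analyses = {
    "കേരളത്തിൽ": ["കേരളം<proper-noun><locative>"],
    "മഴ": ["മഴ<noun>"],
}
analyse = lambda word: fake_analyses.get(word, [])
print(find_named_entities(["കേരളത്തിൽ", "മഴ"], analyse))  # ['കേരളത്തിൽ']
```

Note how the analyser does the heavy lifting: even the inflected form കേരളത്തിൽ is reported, because its analysis reduces it to the proper-noun root.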

You can try the system online.

Malayalam named entity recognition example using mlmorph

Known Limitations

  • The recognition is limited by the current lexicon of mlmorph. To recognize out-of-lexicon entities, a POS guesser would be needed. But this is a general problem not limited to NER: a morphology analyser should also have a POS guesser. In other words, as mlmorph improves, this system also improves automatically.
  • Currently the recognition is at the word level. But sometimes entities are written as multiple consecutive words. To resolve that, we will need to write a wrapper on top of the word-level detection system.
  • The current system is a JavaScript wrapper on top of the mlmorph analyse API. I think NER deserves its own API.

by Santhosh Thottingal at March 10, 2019 09:25 AM

March 02, 2019

Santhosh Thottingal

Scribus gets hyphenation support for 11 Indian languages

Support for hyphenation in 11 Indian languages is now available in Scribus, the desktop publishing system. Two years back I wrote about how Malayalam hyphenation support was added to Scribus. Later, I filed a bug to add support for more Indian languages. That is now fixed.

Scribus has a new way to download and use these hyphenation dictionaries. You can now use this feature right away in your installed scribus. The languages with hyphenation support are the following:

  • Malayalam
  • Tamil
  • Telugu
  • Kannada
  • Marathi
  • Hindi
  • Bengali
  • Gujarati
  • Assamese
  • Panjabi
  • Odia

How to Add Hyphenation Dictionary?

Navigate to Windows -> Resources in the menu bar. You will see a window like the one below. You may want to press “Update Available List”. Then you can see all the languages with hyphenation dictionaries available. Select the download checkbox and press the “Download” button. The dictionary will be installed on your system.

Scribus Resource Manager

How to use?

  • Start a new document. Add text frames and content. You may need narrow columns to create word-breaking contexts.
  • Select the text and set an appropriate (Unicode) font for your language. Make sure the language is set to your preferred language.
  • In the hyphenation properties, set the hyphenation character as blank; otherwise visible hyphens will appear.
  • Set the text justified.
  • From the menu, Extras -> Hyphenate text. Done.
Hyphenated two column content

How does it work?

The resource manager based hyphenation dictionaries are an easier way to add new hyphenation support. Earlier, these files had to be added to the Scribus source code. Now these files are defined on the Scribus server, which maps each language to the file to download. So if I update the dictionaries in the GitHub repo, a new installation will pick up the updated files.

Reporting issues

If you find any issues in the hyphenation rules, you can file at

by Santhosh Thottingal at March 02, 2019 04:49 AM

February 21, 2019

Santhosh Thottingal

Gayathri – New Malayalam typeface

Swathanthra Malayalam Computing is proud to announce Gayathri – a new typeface for Malayalam. Gayathri is designed by Binoy Dominic, with OpenType engineering by Kavya Manohar and project coordination by Santhosh Thottingal.

This typeface was financially supported by Kerala Bhasha Institute, a Kerala government agency under the cultural department. This is the first time SMC has worked with the Kerala Government to produce a new Malayalam typeface.

Gayathri is a display typeface, available in Regular, Bold and Thin variants. It is licensed under the Open Font License. The source code, including the SVG drawings of each glyph, is available in the repository. Gayathri is available for download from

Gayathri has soft, rounded terminals, strokes of varying thickness and good horizontal packing. Gayathri has a large glyph set supporting Malayalam traditional orthography, which is the new trend in contemporary Malayalam. With a total of 1124 glyphs, Gayathri also has basic Latin coverage. All Malayalam characters defined up to Unicode 11 are supported.

There are not many Malayalam typefaces designed for titles and large displays. We hope Gayathri will fill that gap.

This is also the first typeface by Binoy Dominic. He has proved his lettering skills in his profession as a graphic designer, working on branding with Malayalam content for his clients.

Binoy prepared all glyphs as SVGs, and our scripts converted them to UFO sources. TruFont was used for small edits. Important glyph information like bearings and names was defined in a YAML configuration. Build scripts generated valid UFO sources, and fontmake was used to build the OTF output. Of course, there were a lot of cycles of design fine-tuning. GitLab CI was used for running the build chain and testing. Fontbakery was used for quality assurance. The UFO Normalizer and UFO Lint tools were also part of the build system.

by Santhosh Thottingal at February 21, 2019 06:40 AM

February 08, 2019

Santhosh Thottingal

How to setup DNS over TLS using systemd-resolved

DNS over TLS is a security protocol that forces all connections with DNS servers to be made securely using TLS. This effectively keeps ISPs from seeing which websites you access.

For GNU/Linux distributions using systemd, you can set this up easily by following the steps below.

First, edit /etc/systemd/resolved.conf and change the value of DNSOverTLS as follows:


Now, configure your DNS servers. You need to use DNS servers that support DNS over TLS; Cloudflare DNS and Google DNS, for example, both support it. To configure them you can use the Network Manager graphical interface.
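For reference, the relevant part of /etc/systemd/resolved.conf would look roughly like this (a sketch; the DNS= addresses shown are Cloudflare's public resolvers and are only an example, substitute your preferred provider):

```ini
[Resolve]
# Servers that support DNS over TLS (example: Cloudflare public resolvers)
DNS=1.1.1.1 1.0.0.1
# "opportunistic" uses TLS when the server supports it; "yes" enforces it
DNSOverTLS=opportunistic
```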

Then restart the systemd-resolved using:

sudo systemctl restart systemd-resolved

You are done. To check whether the settings are correctly applied, you can try:

$ resolvectl status
       LLMNR setting: no
MulticastDNS setting: no
  DNSOverTLS setting: opportunistic

If you really want to see how the DNS resolution requests happen, you can use Wireshark and inspect port 53, the usual DNS port. You should not see any traffic on that port. Instead, if you inspect port 853, you will see the DNS over TLS requests.

by Santhosh Thottingal at February 08, 2019 05:36 AM

January 15, 2019

Santhosh Thottingal

Wikipedia turns eighteen. And four hundred thousand translations

Today is Wikipedia’s eighteenth birthday. English Wikipedia, with 5.8 million articles, and Malayalam Wikipedia, with around sixty thousand articles, continue their journey amid many limitations and challenges.

Wikipedia exists in 292 languages, but the proportion of content is not the same across them. For the last four years at the Wikimedia Foundation, my main job has been leading the technology behind the system that translates articles between languages with the help of machine translation and other tools.

Yesterday, the number of new articles added with the help of this system reached four hundred thousand.

by Santhosh Thottingal at January 15, 2019 06:57 AM

January 13, 2019

Santhosh Thottingal

Swanalekha input method now available for Windows and Mac

The Swanalekha transliteration based Malayalam input method is now available on Windows and Mac platforms. Thanks to Ramesh Kunnappully, who wrote the Keyman implementation.

I wrote this input method in 2008. In those days, SCIM was the popular input method framework for Linux. Later it was rewritten for m17n and used with either IBus or Fcitx. A few years later, this input method was made available on Android using Indic Keyboard. Last year, due to requests from Windows and Mac users, Chrome and Firefox extensions were prepared. Thanks to SIL Keyman, we have now made it available on those operating systems as well.

With this, Swanalekha Malayalam becomes an input method you can use on all operating systems and phones.

Detailed documentation and downloads are available on the Swanalekha website. Source code: A small video illustrating the installation, configuration and use on Windows 10 is given below.

Update: The keyboard is now served by Keyman from their website, and the number of supported platforms has also increased.

Download options from

by Santhosh Thottingal at January 13, 2019 04:22 AM

January 09, 2019

Rajeesh K Nambiar

Smarter tabular editing with Vim

I happen to edit tabular data in LaTeX format quite a bit. Being scientific documents, the table columns are (almost) always left-aligned, even for numbers. That warrants carefully crafted decimal and digit alignment on such columns containing only numbers.

I also happen to edit the text (almost) always in Vim, and just selecting/changing a certain column is not easily doable (like in a spreadsheet). If there are tens of rows that need manual digit/decimal alignment, it gets even more tedious. There must be another way!

Thankfully, smarter people already figured out better ways (h/t MasteringVim).

With that neat trick, it is much more palatable to look at the tabular data and edit it. Even then, though, it is not possible to search & replace only within a column using Visual Block selection. The Visual Block (^v) sets marks from the column of the first row to the column of the last row, so any :'<,'>s/.../.../g would replace any matching text in between (including in other columns).

To solve that, I’ve figured out another way. It is possible to copy the Visual Block alone and paste other content over it (though cutting it and pasting would not work as you might think). Thus, the plan is:

  • Copy the required column using Visual Block (^v + y)
  • Open a new buffer and paste the copied column there
  • Edit/search & replace to your need in that buffer, so nothing else would be unintentionally changed
  • Select the modified content as Visual Block again, copy/cut it and come back to the main buffer/file
  • Re-select the required column using Visual Block again and paste over
  • Profit!
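The steps above can be sketched as the following keystroke sequence (the substitute command in step 3 is only a hypothetical example of a column-local edit):

```vim
" 1. In the main file, block-select the column: CTRL-V, move, then y to yank
" 2. Open a scratch buffer and paste:  :new  followed by  p
" 3. Edit freely; any substitution is now confined to this buffer, e.g.:
"      :%s/foo/bar/g
" 4. Block-select the edited column again: CTRL-V, move, then y
" 5. Go back to the main buffer (CTRL-W p), re-select the original
"    column with CTRL-V, and press p to paste over it
```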

Here’s a short video of how to do so. I’d love to hear if there are better ways.

Column editing in Vim
Demo of column editing in Vim

by Rajeesh at January 09, 2019 11:44 AM

December 23, 2018

Santhosh Thottingal

Ten years of code

It has been ten years since I started engaging in free software development related to language computing. It was around 2008 that I became active in this area and started setting aside time for various projects. Here I have visualized my contributions over the last ten years, based on the code on GitHub.

Generated using for my github username santhoshtr

Each cell here is a day. On days with a green cell, I did some kind of activity: writing code, filing bug reports, reviewing others’ code and so on. The count increases as the colour goes from light green to dark green.

Like a diary, I can read the sweet and the bitter of my life in it. The long gaps seen at various points are travels, or personal breaks, good or bad. 2016 was quite bad in this respect. The gap in April 2013 marks my wedding. At one point I also did a challenge of doing something for 100 days without a break (github streak); it is visible from September 2014 onwards.

One thing I can be proud of is that as my career progresses, I am able to contribute more and more to engineering. People working in IT generally know that after the first ten years, most move from engineering-type work to management-type work. I did not choose that path.

After I joined the language technology team at the Wikimedia Foundation in 2011, the amount of code I wrote for the public increased greatly. At the same time, I engaged in Malayalam language related work on weekends and in other free time. That is why Saturdays and Sundays also appear green in this graph.

Another thing to be proud of is that in my profession, whenever I had to write code for the public, I was able to do it as free software. That is, I have not hidden even a single line of code. Every contribution I made is laid open, with its rationale, in a way anyone can inspect, learn from and use at any time. That is free software.

If you are a Malayali, you probably use the results of at least some of this work in your daily life in some way. At the same time, several things I wrote in the early days failed to move beyond a technology experiment into useful software. But all of that naturally became lessons for later.

by Santhosh Thottingal at December 23, 2018 03:11 PM

December 19, 2018

Balasankar C

DebUtsav Kochi 2018


It has been quite some time since I wrote about anything. This time, it is DebUtsav. When it comes to full-fledged FOSS conferences, I am usually an attendee or at most a speaker. I have given some sporadic advice and suggestions to a few in the past, but that was it. This time, however, I played the role of an organizer.

DebUtsav Kochi is the second edition of Debian Utsavam, the celebration of Free Software by the Debian community. We didn’t name it MiniDebConf because we wanted the conference to be not just Debian-specific but to include general FOSS topics too. This is specifically because our target audience isn’t yet Debian-aware enough for a Debian-only event. So DebUtsav Kochi had three tracks: one for general FOSS topics, one for Debian talks and one for hands-on workshops.

As a disclaimer, the descriptions of the talks below are what I gained from my interactions with the speakers and attendees; I wasn’t able to attend as many talks as I would’ve liked because I was busy with organizing.

The event was organized by the Free Software Community of India, whom I represented, along with the Democratic Alliance for Knowledge Freedom (DAKF) and Student Developer Society (SDS). Cochin University of Science and Technology was generous enough to be our venue partner, providing us with the necessary infrastructure for conducting the event as well as accommodation for our speakers.

The event spanned two days, with a registration count of around 150 participants. Day 1 started with a keynote session by Aruna Sankaranarayanan, affiliated with OpenStreetMap. She has also been associated with the GNOME Project, Wikipedia and Wikimedia Commons, and was a lead developer of the Chennai Flood Map that was widely used during the floods that struck the city of Chennai.

Sruthi Chandran, a Debian Maintainer from Kerala, gave a brief introduction to the Debian project: its ideologies and philosophies, the people behind it, the process involved in the development of the operating system, etc. An intro to DebUtsav, how it came to be, and the planning and organization involved in conducting the event, was given by SDS members.

After these common talks, the event split into two parallel tracks: FOSS and Debian.

In the FOSS track, the first talk was by Prasanth Sugathan of the Software Freedom Law Centre, about the need for Free Software licenses and ensuring license compliance in projects. In parallel, Raju Devidas discussed the process behind becoming an official Debian Developer, what it means, and why it matters to have more and more developers from India.

After lunch, Ramaseshan S introduced the audience to Project Vidyalaya, a free software solution for educational institutions to manage and maintain their computer labs using FOSS rather than conventional proprietary solutions. Shirish Agarwal shared a general idea of the various teams in Debian and how everyone can contribute to these teams based on their interests and abilities.

Subin S introduced some nifty little tools and tricks that make the Linux desktop cool and improve the productivity of users. Vipin George talked about the possibility of using Debian as a forensic workstation, and how it can be made more efficient than its proprietary counterparts.

Ompragash V from Red Hat talked about using Ansible for automation tasks and its advantages over similar tools. Day 1 ended with Simran Dhamija talking about Apache Sqoop and how it can be used for data transformation and other related use cases.

In the afternoon session of Day 1, two workshops were conducted in parallel with the talks. The first one was by Amoghavarsha on reverse engineering, followed by an introduction to machine learning using Python by Ditty.

We also had an informal discussion with a few of the speakers and participants about the Free Software Community of India, the services it provides, how to make more people aware of those services, and how to get more maintainers for them. We also discussed the necessity of self-hosted services, onboarding users to them smoothly, and evangelizing these services as alternatives to their proprietary, privacy-abusing counterparts.

Day 2 started with a keynote session by Todd Weaver, founder and CEO of Purism, which develops privacy-focused laptops and phones. Purism also develops PureOS, a Debian derivative that consists of Free Software only, with further privacy-enhancing modifications.

On day 2, the Debian track focused on a hands-on packaging workshop by Pirate Praveen and Sruthi Chandran that covered the basic packaging workflow, the flow of packages through the various suites like Unstable, Testing and Stable, and the structure of packages. It then moved on to the actual process of packaging, guiding the participants through packaging a JavaScript module used by the GitLab package in Debian. Participants were introduced to tools like npm2deb, lintian and sbuild/pbuilder, and to the various Debian-specific files and their functions.

In the FOSS track, Biswas T shared his experience in developing a website that was heavily used during the Kerala floods for effective collaboration between authorities, volunteers and the public. It was followed by Amoghavarsha’s talk on his journey from Dinkoism to Debian. Abhijit AM of COEP talked about how Free Software may be losing against Open Source and why that may be a problem. Ashish Kurian Thomas shared some *nix tools and tricks that can be productivity boosters for GNU/Linux users. Raju and Shivani introduced Hamara Linux to the audience, along with the development process and the focus of the project.

The event ended with a panel discussion on how Debian India should organize itself to conduct more events, spread awareness about Debian and other FOSS projects, and prepare for a potential DebConf in India in the near future.

The number of registrations and the enthusiasm of the attendees give positive signs of the probability of having a proper MiniDebConf in Kerala, followed by a possible DebConf in India, for which we have bid. Thanks to all the participants and speakers for making the event a success.

Thanks to FOSSEE, Hamara Linux and GitLab for sponsoring the event and thus enabling us to actually do this. And also to all my co-organizers.

A very special thanks to Kiran S Kunjumon, who literally did 99% of the work needed for the event to happen (as you may recall, I am good at sitting in a chair and planning, not actually doing anything. :D).

Group photo

December 19, 2018 06:00 AM

November 25, 2018

Santhosh Thottingal

Malayalam morphology analyser – First release

I am happy to announce the first version of the Malayalam morphology analyser.

After two years of development, I tagged version 1.0.0

In this release

In this release, mlmorph can analyse and generate Malayalam words using the defined morpho-phonotactical rules, based on a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser.

A Python interface is released to make using the library very easy for developers. The library is available in – Installing it is very easy:

pip install mlmorph

It avoids all the difficulties of compiling the SFST formalism and installing the required hfst and sfst packages.

For detailed Python API documentation and the command line utility, refer


There are a lot of known limitations in the current release. I plan to address them in future releases.

  • Expand the lexicon further: The current lexicon was compiled by testing various texts and adding the missing words found in them. Preparing the coverage test corpus also helped to increase the lexicon. But it still needs more improvement.
  • Many commonly used language-specific constructs consisting of multiple conjunctions and adjectives are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
  • Optimizing the weight calculation: As the lexicon grows, many rarely used words can become alternate parts in the agglutination of words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട്. Even though this is grammatically correct, it should get less preference than പാലക്കാട്<proper noun>.
  • Standardization of POS tags: mlmorph has its own POS tag definitions. These tags need documentation with examples. I tried to use Universal Dependencies as much as possible, but it is not enough to cover all the tags required for Malayalam.
  • Documentation of the formalism and tutorials for developers: So far I am the only developer on the project, which I am not happy about. The learning curve for this project is too steep to attract new developers, and an above-average understanding of Malayalam grammar is a difficult requirement too. I am planning to write some tutorials to help new developers join.


The project is meaningful only when practical applications are built on top of it.

by Santhosh Thottingal at November 25, 2018 10:55 AM

October 24, 2018

Rajeesh K Nambiar

Powerline git dirty status without powerline_gitstatus

With git-prompt it is possible to display the dirty state (when a tracked file is modified) by setting the env variable GIT_PS1_SHOWDIRTYSTATE=true. Powerline can display the status of a git repository, such as the number of commits ahead/behind, the number of modified files etc., using the powerline_gitstatus module. Unfortunately, Fedora doesn’t have it packaged. I did some digging and found that there is colour highlighting for branch_dirty, and that the powerline.segments.common.vcs.branch function (which displays the current branch name) takes two parameters to modify its behaviour. Modify the shell theme /etc/xdg/powerline/themes/shell/default.json under the left segment (because only left works in the shell) as follows:
    {
        "function": "powerline.segments.common.vcs.branch",
        "args": {"ignore_statuses": ["U"], "status_colors": true},
        "priority": 20
    }
The branch will now be highlighted if a tracked file is modified (ignore_statuses = ["U"] causes untracked files to be ignored). Clean repository:
Clean repo
Once a tracked file is modified:
Dirty repo

by Rajeesh at October 24, 2018 05:22 AM

September 27, 2018

Santhosh Thottingal

Malayalam Script LGR rules for public review

The Malayalam and Tamil Root Zone Label Generation Rules for Internationalized Domain Names have been released for public comments. See the announcement from ICANN. This was drafted by the Neo-Brahmi Script Generation Panel (NBGP), of which I am also a member.

Your comments on the proposal for the Malayalam Script Label Generation Rules for the Root Zone (LGR [XML, 18 KB] and supporting documentation [PDF, 998 KB]) can be submitted at the feedback form till Nov 7 2018.

My earlier blog post on Internationalized Top Level Domain Names in Indian Languages has some detailed information about this.

by Santhosh Thottingal at September 27, 2018 11:53 AM

September 08, 2018

Santhosh Thottingal

Malayalam spellchecker – a morphology analyser based approach

My first attempt to develop a spellchecker for Malayalam was in 2007, using Hunspell and a word-list based approach. It was not successful because of the rich morphology of Malayalam. Even though I prepared a manually curated list of 150K words, it was nowhere near covering the practically infinite words of Malayalam. For languages with productive morphological processes in compounding and derivation, capable of generating dictionaries of infinite length, a morphology analysis and generation system is required. Since my efforts towards building such a morphology analyser are progressing well, I am proposing a finite state transducer based spellchecker for Malayalam. In this article, I will first analyse the characteristics of Malayalam spelling mistakes and then explain how an FST can be used to implement the solution.

What is a spellchecker?

A spellchecker is an application that tells whether a given word is spelled correctly in the language or not. If the word is not spelled correctly, the spellchecker often gives possible alternatives as suggestions to correct the misspelled word. A word can be spellchecked independently or in the context of a sentence. For example, in the sentence “അസ്തമയസൂര്യൻ കടലയിൽ മുങ്ങിത്താഴ്ന്നു”, the word “കടലയിൽ” is spelled correctly if considered independently. But in the context of the sentence, it is supposed to be “കടലിൽ”.

The correctness of a word is tested by checking whether that word is in the language model. The language model can simply be a list of all known words in the language. Or it can be a system which knows what a word in the language looks like and tells whether a given word is such a word. In the case of Malayalam, we saw that a finite dictionary is not possible. So we will need a system which is ‘aware’ of all words in the language. We will see how a morphology analyser can be such a system.

If the word is misspelled, the system needs to give corrections. To generate correctly spelled words from a misspelled word form, an error model is needed. The most common error model is Levenshtein edit distance. In the edit distance algorithm, the misspelling is assumed to be a finite number of operations applied to the characters of a string: deletion, insertion, change, or transposition. The number of operations is known as the ‘edit distance’. Any word from the known list of words in the language within a minimal distance is a candidate for suggestion. Peter Norvig explains such a functional spellchecker in his article “How to Write a Spelling Corrector?”.

There are multiple problems with the edit distance based correction mechanism:

  • For a query word, we can calculate the number of candidate words we need to generate and test after applying the four operations. For a word of length n, an alphabet of size a and an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 terms at search time. In the case of Malayalam, a is 117 if we consider all encoded characters in Unicode version 11. If we remove all archaic characters, we still need about 75 characters. So, for edit distance d=1, a=75, for a word with 10 characters: 2*10+2*75*10+75-1 = 1594, and much larger for larger d. So you will need to do 1594 lookups (spellchecks) in the language model to get possible suggestions.
  • The concept that the 4 edit operations are the cause of all spelling mistakes is not accurate for Malayalam. There are many common spelling mistakes in Malayalam that are 3 or 4 edit distance from the original word. Usually, edit distance based corrections won’t go beyond d=2, since the number of candidates increases.
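To make the 1594 figure concrete, here is a small self-contained Python sketch (mine, not from the original article) that generates the edit-distance-1 candidate list in the style of Norvig's spelling corrector and counts the terms:

```python
def edits1(word, alphabet):
    """All strings at edit distance 1 from `word` (duplicates included)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]                       # n
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]  # n-1
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]     # a*n
    inserts = [L + c + R for L, R in splits for c in alphabet]               # a*(n+1)
    return deletes + transposes + replaces + inserts

# A stand-in alphabet of size 75 and any 10-character word:
alphabet = [chr(0x0D00 + i) for i in range(75)]
word = "0123456789"
n, a = len(word), len(alphabet)
# n + (n-1) + a*n + a*(n+1) == 2n + 2an + a - 1 == 1594
assert len(edits1(word, alphabet)) == 2 * n + 2 * a * n + a - 1
```

Each of those 1594 candidates would need a language-model lookup, which is what makes this approach costly for a large alphabet like Malayalam's.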

The problems with hunspell based spellchecker and Malayalam

Hunspell has compounding support, but it is limited to two levels. Malayalam can have more than 2 levels of compounding, and sometimes the agglutinated word is also inflected. Hunspell has an affix dictionary and suffix mapping system, but it is too limited to support complex morphology like Malayalam’s. With the help of Németh László, the Hunspell developer, I had explored this path, but abandoned it due to many limitations of Hunspell and the lack of programmatic control over the morphological rules.

Nature of Malayalam spelling mistakes

Malayalam uses an alphasyllabary writing system. Each letter you write corresponds to the grapheme representation of a phoneme. In a broader sense, Malayalam can be considered a language with one-to-one grapheme-to-phoneme correspondence. Whereas in English and similar languages, letters might represent a variety of sounds, or the same sounds can be written in different ways. The way a person learns to write a language strongly depends on the writing system.

In Malayalam, since there is one and only one set of characters that can correspond to a syllable, confusion of letters does not happen. For example, in English, Education, Ship, Machine and Mission all have the sh sound [ʃ], so a person can mix up these combinations. But in Malayalam, if it is the sh sound [ʃ], then it is always ഷ.

Because of this, the spelling mistakes produced by the four edit operations (deletion, insertion, change, or transposition) may not be an accurate classification of errors in Malayalam. Let us try to classify and analyse the spelling mistake patterns of Malayalam.

  1. Phonetic approximation: The 1:1 grapheme-to-phoneme correspondence is the theory. But because of this, inaccurate utterance of syllables will cause incorrect spellings. For example, ബൂമി is a relaxed way of reading ഭൂമി, since it is relatively effortless. Since the relaxed pronunciation is normal, sometimes people think that they are writing the wrong way and will try to correct it unnecessarily; പീഢനം->പീഡനം is one such example.
    • Consonants: Each consonant in Malayalam has aspirated, unaspirated, voiced and unvoiced variants. It is very usual to get mixed up between them.
      • Aspirated and unaspirated mix-up: An aspirated consonant can be mistakenly written as an unaspirated consonant, for example ധ -> ദ, ഢ -> ഡ. Similarly, an unaspirated consonant can be mistakenly written as an aspirated consonant, for example ദ -> ധ, ഡ -> ഢ.
      • Voiced and voiceless mix-up: Voiced consonants like ഗ, ഘ can be mistakenly written as the voiceless forms ക, ഖ. And vice versa.
      • Gemination of consonants is often relaxed or skipped in speech, hence it appears in writing too. Gemination in Malayalam script is done by combining two consonants using virama. നീലതാമര/നീലത്താമര is an example of this kind of mistake. There are a few debatable words too, like സ്വർണം/സ്വർണ്ണം, പാർടി/പാർട്ടി. Another way of indicating consonant stress is by using unaspirated consonant + virama + aspirated consonant. The pairs അദ്ധ്യാപകൻ/അധ്യാപകൻ, തീർഥം/തീർത്ഥം, വിഡ്ഡി/വിഡ്ഢി are examples.
      • Hard and soft variant confusion. Examples: ശ/ഷ, ര/റ, ല/ള
    • Vowels: Vowel elongation or shortening, gliding vowels and semi-vowels are the cause of vowel related mistakes in writing.
      • Each vowel in Malayalam can be a short vowel or a long vowel. The local dialect can lead people to use one for the other; ചിലപ്പൊൾ/ചിലപ്പോൾ is one example. Since many input tools place the short and long vowel forms on very close keystrokes, errors are easy to make. In the Inscript keyboard, short and long vowels are in the normal and shift positions. In transliteration based input methods, a long vowel is often typed by repeating keys (i, ii for ി, ീ).
      • The vowel ഋ is close to റി or റു in pronunciation. Example: ഋതു/റിതു. The vowel sign of ഋ, while appearing with a consonant, is close to ്ര. Examples: ഗൃഹം/ഗ്രഹം, ഹൃദയം/ഹ്രുദയം.
      • The gliding vowels ഐ, ഔ get confused with their constituent vowels. കൈ/കഇ/കയ്, ഔ/അഉ/അവ് are examples.
      • In Malayalam, there is a tendency to use എ instead of ഇ because of the reduced effort. Examples: ചിലവ്/ചെലവ്, ഇല/എല, തിരയുക/തെരയുക. Due to the wide usage of these variants, it is sometimes very difficult to say that one word is wrong. See the discussion about ‘Standard Malayalam’ at the end of this essay.
    • Chillus: Chillus are pure consonants. A consonant + virama sequence sometimes has no phonetic difference from a chillu, for example the കല്പന/കൽപന, നിൽക്കുക/നില്ക്കുക combinations. The chillu ർ is sometimes confused with the ഋ sign; examples are: പ്രവർത്തി/പ്രവൃത്തി. The chillu form of മ, namely ം, can appear as anuswara or as ma + virama forms. Examples: പംപ, പമ്പ. But it is not rare to see പംമ്പ for this. Sometimes the anuswara gets confused with ന്, and പമ്പ becomes പന്പ. There were a few buggy fonts that used ന്+പ for the മ്പ ligature too.
  2. Weak phoneme-grapheme correspondence: Due to the historic or evolutionary nature of the script, Malayalam also has some phonemes which have a weak relationship with their graphemes.
    • ഹ്മ/മ്മ as in ബ്രഹ്മം/ബ്രമ്മം, ന്ദ/ന്ന as in നന്ദി/നന്നി, ഹ്ന/ന്ന as in ചിഹ്നം/ചിന്നം are some examples where what you pronounce is not exactly the same as what you write.
    • റ്റ, ന്റ – these two highly used conjuncts deviate heavily from their letters and pronunciation. While writing with a pen, people don’t make many mistakes, since they just draw the shape of these ligatures; but while typing, one needs to know the exact key sequence, and they get confused. Common mistakes for these conjuncts are ററ, ൻറ, ൻറ്റ, ൻററ.
  3. Visual similarity: While using visual input methods, such as handwriting based ones or some onscreen keyboards, either the users or the input tool makes mistakes due to visual similarity.
    • ൃ and ്യ often get confused.
    • ജ്ഞ, ഞ്ജ is one very common sequence where people are confused: ആദരാജ്ഞലി/ആദരാഞ്ജലി.
    • ത്സ, ഝ is another combination.
    • Handwriting based input methods like the Google handwriting tool are known for recognizing the anuswara ം as zero, English o, O etc.
    • When people don’t know how to insert the visarga ഃ, they use the very similar colon key : on the keyboard instead. Example: ദുഃഖം/ദു:ഖം
    • ള്ള, the geminated form of ള, is very similar to two adjacent ള. This kind of mistake is very frequent among people who studied Malayalam inputting informally. Two adjacent റ is a similar mistake for റ്റ.
    • The informal, trial-and-error based Malayalam inputting training also introduced some other mistakes such as using open parenthesis ‘(‘ for àµ�à´°, closing parenthesis ‘)’ for à´¾ sign.
  4. Ambiguity due to regional dialect: A good example is the insertion of യ് in verbs: കുറക്കുക/കുറയ്ക്കുക, ചിരിക്കുക/ചിരിയ്ക്കുക; also in nominal inflections: പൂച്ചയ്ക്ക്/പൂച്ചക്ക്. The usage of the samvruthokaram to distinguish between a pure consonant and a stressed consonant at the end of a word is another highly debated topic. For example, അവന്/അവനു്/അവനു are all common forms, even though the usage of ു് has declined after the script reformation. But since the script reformation was not an absolute transformation, it still exists in usage.
  5. Spaces: Malayalam is an agglutinative language. Words can be agglutinated, but nothing prevents people from inserting spaces and writing the parts as separate words. This has to be done carefully, though, since it can alter the meaning. An example is "ആന പുറത്ത് കയറി", "ആനപ്പുറത്ത് കയറി", "ആനപ്പുറത്തുകയറി", "ആനപ്പുറത്തു കയറി". Another example: "മലയാള ഭാഷ", "മലയാളഭാഷ". Here there is no valid word "മലയാള": the anuswara at the end gets deleted only when the word joins with ഭാഷ as an adjective. A morphology analyser can correctly parse "മലയാളഭാഷ" as മലയാളം<proper-noun><adjective>ഭാഷ<noun>. But since the language already broke this rule and many people use spaces liberally, a spellchecker needs to handle these cases.
  6. Slip of the finger: Accidental insertion or omission of key presses is the most common reason for spelling mistakes. For alphabetic languages this type of error is mostly addressed already, and the same accidental slips happen in Malayalam too. For Latin-based languages we can analyse the QWERTY layout and do optimized checks for these issues; but since Malayalam input uses another level of mapping on top of QWERTY (InScript, phonetic, transliteration), such analysis is not easy. So, in general, we can expect random characters, or the omission of some characters, in the query word. An accidental space insertion poses an extra challenge: it splits the word in two, and if spellchecking is done one word at a time, we will miss it.

I must add that the above classification is not based on a systematic study of any test data that I can share. Ideally, such a classification should be done with a real sample of Malayalam written on paper and on computers, manually checked for spelling mistakes, with the mistakes listed and their patterns analysed. That exercise would be very beneficial for spellcheck research. In my case, ever since I released my word-list based spellchecker, noticing spelling errors on the internet (mainly social media) has been an obsession, and sometimes I even pointed out spelling mistakes to the authors, which was not always a pleasant experience. The above list is based on my observations of such patterns.

Malayalam spelling checker

To check whether a word is a valid, known, correctly spelled word, a simple lookup using the morphology analyser is enough: if the analyser can parse the word, it is correctly spelled. Note that the word can be agglutinated at arbitrary levels and inflected at the same time.

Out of lexicon words

Compared to a finite word list, the FST-based morphology analyser and generator covers a much larger number of words through its generation system based on morpho-phonotactics. For a discussion of this, see my previous blog post about the coverage test. Since every language's vocabulary is a dynamic system, it is still impossible to cover 100% of the words all the time: new words enter the language every now and then, and there are nouns for places, people, products and so on that are not in the analyser's lexicon. These words will be reported as unknown by the spellchecker, and an unknown word is interpreted as a misspelled word. This is a known problem; but since a spellchecker is usually driven by a human user, its severity depends on whether the spellchecker is ignorant of many commonly used words. Most spellcheckers provide an "add to dictionary" option to mitigate this.
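The "add to dictionary" escape hatch can be sketched as a thin overlay on the analyser lookup. This is a minimal illustration, not the real implementation: `analyser_knows` and the sample words are stand-ins for the actual FST-based analyser.

```python
# Sketch of "add to dictionary": a user dictionary overlaid on the
# analyser lookup. analyser_knows() is a toy stand-in for the real
# morphology analyser.
user_dictionary = set()

def analyser_knows(word):
    # Stand-in lexicon; the real system would parse the word instead.
    return word in {"മലയാളം", "ഭാഷ"}

def is_correct(word):
    return analyser_knows(word) or word in user_dictionary

print(is_correct("കൊച്ചി"))      # a place name missing from the lexicon
user_dictionary.add("കൊച്ചി")    # the user's "add to dictionary" action
print(is_correct("കൊച്ചി"))      # no longer flagged
```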

As part of the morphology analyser, expanding the lexicon is a never-ending task. As the lexicon grows, the spellchecker improves automatically.

Malayalam spelling correction

To provide spelling suggestions, the FST-based morphology analyser can be used. This is a three-step process:

  1. Generate a list of candidate words from the query word. The words in this list may themselves be incorrect. The candidates are generated from patterns we defined based on the nature of spelling mistakes: we scan the query word for common error patterns and apply the fix for each pattern that matches. Since there are dozens of patterns, we get many candidate words.
  2. From the candidate list, find the correctly spelled words using the spellcheck method described above. This results in a very small number of words; these are the probable replacements for the misspelled query word.
  3. Sort the candidate words so that the most probable suggestion comes first. For this we can rank the suggestion strategies: a very common error pattern gets a high priority in step 1, so its suggestions appear first in the candidate list. A more sophisticated approach would use a frequency model for the words, so that candidates which are very frequent in the language appear first.
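The three steps above can be sketched as follows. The confusion patterns, their priorities and the toy lexicon are illustrative stand-ins for the real error model and the FST-based analyser:

```python
# A sketch of the three-step suggestion pipeline.
CONFUSION_PATTERNS = [
    # (wrong, right, priority) - a lower priority value means a more common error
    ("ന്ന", "ന്ദ", 0),   # നന്നി -> നന്ദി
    ("മ്മ", "ഹ്മ", 1),   # ബ്രമ്മം -> ബ്രഹ്മം
    ("ന്ന", "ഹ്ന", 2),   # ചിന്നം -> ചിഹ്നം
]

KNOWN_WORDS = {"നന്ദി", "ബ്രഹ്മം", "ചിഹ്നം"}  # stand-in for the analyser lexicon

def is_known(word):
    """Step 2: in the real system this is a morphology-analyser parse."""
    return word in KNOWN_WORDS

def suggest(word):
    """Steps 1 and 3: generate candidates per pattern, keep the known
    ones, and rank them by the priority of the producing pattern."""
    candidates = []
    for wrong, right, priority in CONFUSION_PATTERNS:
        start = word.find(wrong)
        while start != -1:
            candidates.append((priority, word[:start] + right + word[start + len(wrong):]))
            start = word.find(wrong, start + 1)
    ranked, seen = [], set()
    for _, candidate in sorted(candidates):
        if is_known(candidate) and candidate not in seen:
            seen.add(candidate)
            ranked.append(candidate)
    return ranked

print(suggest("നന്നി"))  # the highest-priority pattern yields നന്ദി
```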

One thing I observed with the above approach is that, in reality, the candidate list left after all the steps is most of the time just one or two words for Malayalam. This makes step 3 less relevant. At the same time, an edit-distance based approach would have generated more than five candidate words for each misspelled word, and those candidates would be very diverse, meaning they need not be related to the intended word at all. The following images illustrate the difference.
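To see why edit-distance candidates are so diverse, consider how many distance-1 variants even a short word has. This is an illustration, not part of the original system:

```python
# Count the distinct edit-distance-1 variants of a word: deletions,
# transpositions, replacements and insertions over a given alphabet.
def edits1(word, alphabet):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}

# Even a tiny 10-letter toy alphabet produces over a hundred variants
# for a 5-letter word; Malayalam's full character inventory is far bigger.
print(len(edits1("abcde", "abcdefghij")))
```

Almost all of these variants are unrelated junk that still has to be filtered, which is why the pattern-based generator stays so much more focused.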

Spelling suggestion from the morphology analyser based system.
Spelling suggestions from edit distance based candidates

Context sensitive spellchecking

Usually, spellchecking and suggestion are done one word at a time. But if we know the context of the word, the spellchecking becomes more useful. The context is usually the words before and after the word in question. An example from English is "I am in Engineer": the word "in" is correct on its own, but within this context it is wrong. To mark "in" wrong and provide "an" as a suggestion, one approach is an n-gram model of parts of speech for the language: in simple words, a model of what kind of word can appear between words of known kinds. If we build this model for a language, it will surely tell us that the locative "in" before "Engineer" is rare or was never seen before.
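A minimal sketch of the idea, with a hypothetical POS lexicon and made-up trigram counts (not the actual implementation):

```python
# Context checking with a part-of-speech n-gram model. The POS tags
# and the trigram counts below are toy, hypothetical data.
POS = {"I": "PRON", "am": "AUX", "in": "ADP", "an": "DET", "Engineer": "NOUN"}

TRIGRAM_COUNTS = {                 # toy counts "observed in a corpus"
    ("PRON", "AUX", "DET"): 120,   # "I am an ..."
    ("AUX", "DET", "NOUN"): 95,    # "... am an Engineer"
}

def context_ok(prev_word, word, next_word):
    """True if the POS trigram around `word` was seen in the corpus."""
    trigram = (POS[prev_word], POS[word], POS[next_word])
    return TRIGRAM_COUNTS.get(trigram, 0) > 0

print(context_ok("am", "in", "Engineer"))  # (AUX, ADP, NOUN) unseen: flag it
print(context_ok("am", "an", "Engineer"))  # (AUX, DET, NOUN) seen: accept it
```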

The Standard Malayalam or lack thereof

How do you determine which is the "correct" or "standard" way of writing a word? Malayalam has lots of orthographic variants for words: some were introduced to the language as genuine mistakes that later became common words (രാപ്പകൽ/രാപകൽ, ചിലവ്/ചെലവ്), some are phonetic simplifications (അദ്ധ്യാപകൻ/അധ്യാപകൻ, സ്വർണ്ണം/സ്വർണം), some are old spellings (കർത്താവ്/കൎത്താവു്), and so on. A debate about the correctness of these words will hardly reach a conclusion. For our case, this is more an issue of selecting the words in the lexicon: which ones to include, which to exclude? It is easy to treat these debates as a blocker for the progress of the project and give up: "well, these things have not been decided by the academics so far, so we cannot do anything until they make up their minds".

I did not want to end up in that deadlock, so I decided to be liberal about the lexicon: if people are commonly using a word, it is a valid word the project needs to recognize as far as possible. That is the very liberal definition I have. I leave the standardization discussion to the linguists who care about it.

The news report from Mathrubhumi daily in 2007 about my old spelling checker

Back in 2007, when I developed the old Malayalam spellchecker, these debates came up. Dr. P Somanathan, who helps me a lot these days with this project, wrote about the issue of Malayalam spelling inconsistencies: "ചരിത്രത്തെ വീണ്ടെടുക്കുക" and "വേണം നമുക്ക് ഏകീകൃതമായ ഒരെഴുത്തുരീതി".


  1. A Data-Driven Approach to Checking and Correcting Spelling Errors in Sinhala. Asanka Wasala, Ruvan Weerasinghe, Randil Pushpananda, Chamila Liyanage and Eranga Jayalatharachchi [pdf]. This paper discusses phonetic-similarity based strategies to create a word list, instead of the edit-distance approach.
  2. Finite-State Spell-Checking with Weighted Language and Error Models—Building and Evaluating Spell-Checkers with Wikipedia as Corpus. Tommi A Pirinen, Krister Lindén [pdf]. This paper outlines the use of finite-state transducer techniques to address the issue of the infinite dictionary of morphologically rich languages, with Finnish as the example language.
  3. My own Malayalam morphology analyser project is the foundation of the spellchecker.
  4. The common Malayalam spelling mistakes and confusables are presented in great depth by the renowned linguist and author Panmana Ramachandran Nair in his books 'തെറ്റില്ലാത്ത മലയാളം', 'തെറ്റും ശരിയും', 'ശുദ്ധ മലയാളം' and 'നല്ല മലയാളം'.
  5. Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams. Tommi A Pirinen, Miikka Silfverberg and Krister Lindén [pdf]. This paper discusses the context-sensitive spellchecker approach.

Where can I try the spellchecker?

If you are curious about the implementation of this approach, please refer to the project. Since the implementation is not complete, I will write a new article about it later. Thanks for reading!

A screenshot of the Malayalam spellchecker in action. Along with incorrect words, some correct words are marked as misspelled too. This is because the morphology analyser is incomplete; as it improves, more words will be covered.

by Santhosh Thottingal at September 08, 2018 09:41 AM

August 11, 2018

Santhosh Thottingal

Malayalam morphology analyser – status update

For the last several months, I have been actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good start. I was always skeptical about the approach, and the whole project looked very ambitious. But now I am fairly confident that the approach is viable. I am making good progress, so here are some updates.

Analyser coverage statistics

Recently I added a large corpus so that I can frequently monitor the percentage of words the analyser can parse. The corpus was selected from two large chapters of ഐതിഹ്യമാല, some news reports, an essay on art, and my own technical blog posts, to get some diversity in the vocabulary.

Total words: ~16000
Analysed words: 10532
Time taken: 0.443 seconds

This is very encouraging. Achieving 66% coverage for a morphologically rich language like Malayalam is no small task. From my reading, Turkish and Finnish, languages with a similar complexity of morphology, achieved about 90% coverage. Increasing the coverage further may be harder than getting this far, so I am planning some frequency analysis on the words the analyser cannot parse, to find patterns to improve.
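The coverage figure is simply the analysed fraction of the corpus; the total here is taken as roughly 16000, per the timing note below:

```python
analysed = 10532          # words the analyser could parse
total = 16000             # approximate corpus size
coverage = analysed / total * 100
print(f"coverage: {coverage:.0f}%")
```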

The performance aspect is also notable: once the automaton is loaded into memory, analysis or generation is very fast. You can see that ~16000 words were analysed in under half a second.


From the very beginning, the project has been test driven. I now have 740 test cases for various word forms.

The transducer

The compiled transducer is now 6.2 MB. The transducer is written in SFST-PL and compiled using SFST. It used to be compiled using HFST, but HFST is now severely broken for SFST-PL compilation, so I switched to SFST. The compiled transducer is, however, read using the HFST Python binding.

Fst type: SFST
Arc type: SFST
Number of states:
Number of arcs:
Number of final states:

The Lexicon

The POS-tagged lexicon I prepared comes from various sources like Wiktionary, Wikipedia (based on categories) and CLDR. While developing, I had to improve the lexicon several times, since none of these sources are fully accurate. Wiktionary in particular introduced a large number of archaic or Sanskrit terms into the lexicon. As of today, the following table illustrates the lexicon status:

Person names
Place names
English borrowed nouns
Language names(nouns)
Affirmations and negations

As you can see, the lexicon is not that big; it is especially limited for proper nouns such as personal and place names. I think the verb lexicon is in much better shape. I need to find a way to expand this further.

POS Tagging

There is no agreed standard on the POS tagging schema to be used for Malayalam, but I refused to let this become a blocker for the project: I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is trivial to fix using a tag-name mapper. The other issue is the classification of features, where I found no elaborate schema that can cover Malayalam.

I started referring to the Universal Dependencies (UD) tag set and provided links to its pages from the web interface. But UD is also missing several tags that Malayalam requires. So far I have defined 85 tags.


The main challenge I am facing is not technical but linguistic. I am often challenged by my limited understanding of Malayalam grammar, especially the grammatical classifications: I find it very difficult to reach an agreement after reading several grammar books. These books were written over a span of 100 years, and I miss a common thread in their approach to Malayalam grammar analysis. Sometimes a logical classification was not even the author's purpose. Thankfully, I get help from Malayalam professors whenever I am stuck.

The other challenge is that I have hardly got any contributors to the project beyond some bug reports. There is a big entry barrier to this kind of project: SFST-PL is not something everybody is familiar with. I need to write some simple examples for others to practice with and join.

I found that practical applications built on top of the morphology analyser attract more people. For example, the number spellout application I wrote caught the attention of many. I am excited about the upcoming spellchecker I have been working on recently; I will write about its theory soon.

by Santhosh Thottingal at August 11, 2018 12:43 PM

August 10, 2018

Santhosh Thottingal

How to customize Malayalam fonts in Linux

Nowadays GNU/Linux distributions like Ubuntu, Debian and Fedora come with pre-configured fonts for Malayalam: for the sans-serif family it is Meera, and for serif it is Rachana. If you would like to change these fonts, there is no easy way to do so with the configuration tools in GNOME or KDE: they provide a general font selector for the whole desktop, but not for a given language.

The advantage of setting these preferences at the system level is that you don't need to choose the fonts at the application level: you don't need to set them for Firefox, Chrome, etc., since all of them will follow the system preferences. We will use fontconfig for this.

First, create a file named ~/.config/fontconfig/conf.d/50-my-malayalam.conf. If the folders for this file do not exist, just create them. Add the following content to the file.

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
<!-- Malayalam (ml) -->
<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <test name="family">
                <string>sans-serif</string>
        </test>
        <edit name="family" mode="prepend" binding="strong">
                <string>Manjari</string>
        </edit>
</match>

<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <test name="family">
                <string>serif</string>
        </test>
        <edit name="family" mode="prepend" binding="strong">
                <string>Rachana</string>
        </edit>
</match>
<!-- Malayalam (ml) ends -->
</fontconfig>


Save the file and you are done. You can check whether the default font for Malayalam has changed using the following command:

$ LANG=ml_IN fc-match

It should list Manjari. The code we added to the file is not complicated: you can see that we are setting the sans-serif font preference for the ml (Malayalam) language to Manjari, and the serif preference to Rachana. You are free to change these to whatever fonts you prefer.

Note that you may need to close and reopen your applications for the preference to apply.

You may choose any of the SMC fonts, download and install it, and use the above configuration with it.

by Santhosh Thottingal at August 10, 2018 04:09 PM

July 29, 2018

Santhosh Thottingal

Young people's dignity of labour, and labour societies (യുവാക്കളുടെ തൊഴിലഭിമാനവും തൊഴിൽ സൊസൈറ്റികളും)

This is a note about a crisis faced by the young people of our land, and about an idea that could be a solution to it.

In our land there are plenty of young people engaged in jobs that demand special skills: various kinds of wage labour, driving, farm work, painting, construction, mechanic work and so on. Almost all of them are in the unorganized sector. Most are young men who have not obtained a government or private job, or who lack the education needed to obtain one. Young women, on the other hand, typically continue their education until marriage and then settle into family life. Those aged between twenty and thirty-five now face a new challenge, about which the Samakalika Malayalam weekly recently published a detailed study report ("നിത്യഹരിത വരൻമാർ", രേഖാചന്ദ്ര, Samakalika Malayalam, July 16). The study finds that, widely across the Malabar region, young men of this kind remain unmarried.

The reason is the cultural lack of interest among young women's families in the workers mentioned above. Nobody is willing to give young women in marriage to men without a government or private-company job. The article has details of new phenomena such as the "കുടക് കല്യാണം". Caste and horoscopes stand in the way more than ever before, and in rural areas the moral police leave little room for love marriages. Even when these young men take up such jobs and try to give the young women of their own households more education, those women then seek only men with better jobs, putting the men in crisis once again.

The problem described above feeds a growing aversion to manual labour. The ego called social status is slowly leading to a situation where nobody can be found for these essential jobs, and as the general level of education in society rises, that ego only grows stronger. I fear that an unhealthy social order will gradually emerge from this. Young women in particular, under pressure from their families, are confined to a very narrow selection space of job possibilities; our social situation has reached a point where it does not allow them to take up the jobs mentioned above. This is where the migrant workers found their opportunities.

In a society like ours, with few secular public platforms, the challenge of keeping this youthful workforce politically conscious is growing. The possibility of apoliticism becoming the default choice among the young must be resisted by all means.

The problems summarized so far call for a social movement among these young people. The aims would be:

  • Build social recognition for every kind of unorganized work, manual or otherwise. Do not shackle the human potential of the young with misconceptions and social constraints.
  • Bring such workers into the organized sector and make them politically conscious. Organize secular spaces.
  • Provide vocational training, and encourage healthy reforms in the existing occupations. Make these jobs attractive.
  • Extend the social mobilization that Kudumbashree achieved further into the young generation.

The idea I would like to propose for this is "labour societies" (തൊഴിൽ സൊസൈറ്റികൾ). The rough outline is as follows:

  • The societies act as a meeting point between workers and those who need workers.
  • Young people register there, along with their skills.
  • Those registered with such societies wear uniforms and name tags and carry proper work-safety clothing and equipment (to overcome the social stigma).
  • Anyone can look for workers through these societies; there is no need to go and ask around in person. With a little help from technology these connections can be made quickly. Overall, with an appointment system and the like, the aim is to rewrite the feudal-era master-labourer relation, and with it the notion of high and low jobs.
  • The societies can fix wage rates, and their members will be aware of labour rights.

In Western countries the capitalist system has already begun implementing this idea; Amazon Services is an example. Such "online apps", like Uber and Airbnb, will soon arrive in our land too. But they will have no goals beyond the exploitation built into the employer-worker relationship. My hope is that the people of Kerala enter that space early, with social and political goals.

by Santhosh Thottingal at July 29, 2018 10:08 AM

July 15, 2018

Santhosh Thottingal

The many forms of ചിരി ☺️

This is an attempt to list all the forms of the Malayalam word ചിരി (meaning: ☺, smile, laugh). For those unfamiliar with Malayalam: it is a highly inflectional Dravidian language. I am actively working on a morphology analyser (mlmorph) for the language, as outlined in one of my previous blog posts.

I prepared this list as a test case for the mlmorph project to evaluate its grammar-rule coverage, so I thought of listing it here as well with brief comments.
1. ചിരി
ചിരി is a noun, so it can take all the nominal inflections.

2. ചിരിയുടെ
3. ചിരിക്ക്
4. ചിരിയ്ക്ക്
5. ചിരിയെ
6. ചിരിയിലേയ്ക്ക്
7. ചിരികൊണ്ട്
8. ചിരിയെക്കൊണ്ട്
9. ചിരിയിൽ
10. ചിരിയോട്
11. ചിരിയേ

There is a plural form
12. ചിരികൾ

A number of agglutinations can happen at the end of the word with affirmatives, negations, interrogatives, etc.; for example ചിരിയുണ്ട്, ചിരിയില്ല, ചിരിയോ. For now I am ignoring all agglutinations and listing only the inflections.
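The nominal inflections above can be sketched as stem + ending, with the glide യ് inserted before vowel-initial endings after an ി-final stem like ചിരി. This is a simplified, hypothetical rule set; mlmorph's actual sandhi rules are much richer:

```python
# Toy nominal inflection for an ി-final stem: a glide യ is inserted
# before endings that begin with a vowel sign. The suffix list is a
# simplified stand-in for the full nominal paradigm.
NOMINAL_SUFFIXES = ["ുടെ", "െ", "ിൽ", "ോട്", "കൾ"]  # genitive, accusative, locative, sociative, plural
VOWEL_SIGNS = "ുെിോ"

def inflect(stem):
    forms = []
    for suffix in NOMINAL_SUFFIXES:
        if suffix[0] in VOWEL_SIGNS:
            forms.append(stem + "യ" + suffix)  # ചിരി + ുടെ -> ചിരിയുടെ
        else:
            forms.append(stem + suffix)        # ചിരി + കൾ -> ചിരികൾ
    return forms

print(inflect("ചിരി"))
```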

ചിരിക്കുക is the verb form of ചിരി.
13.  ചിരിക്കുക

It can have the following tense forms
14. ചിരിച്ചു
15. ചിരിക്കുക
16. ചിരിക്കും

A concessive form for the word
17. ചിരിച്ചാലും

This verb has the following aspects
18. ചിരിക്കാറ്
19. ചിരിച്ചിരുന്നു
20. ചിരിച്ചിരിയ്ക്കുന്നു
21. ചിരിച്ചിരിക്കുന്നു
22. ചിരിച്ചിരിക്കും
23. ചിരിച്ചിട്ട്
24. ചിരിച്ചുകൊണ്ടിരുന്നു
25. ചിരിച്ചുകൊണ്ടേയിയിരുന്നു
26. ചിരിച്ചുകൊണ്ടേയിരിക്കുന്നു
27. ചിരിച്ചുകൊണ്ടിരിക്കുന്നു
28. ചിരിച്ചുകൊണ്ടിരിക്കും
29. ചിരിച്ചുകൊണ്ടേയിരിക്കും

There are a number of mood forms for the verb ചിരിക്കുക
30. ചിരിക്കാവുന്നതേ
31. ചിരിച്ചേ
32. ചിരിക്കാതെ
33. ചിരിച്ചാൽ
34. ചിരിക്കണം
35. ചിരിക്കവേണം
36. ചിരിക്കേണം
37. ചിരിക്കേണ്ടതാണ്
38. ചിരിക്ക്
39. ചിരിക്കുവിൻ
40. ചിരിക്കൂ
41. ചിരിക്ക
42. ചിരിച്ചെനെ
43. ചിരിക്കുമേ
44. ചിരിക്കട്ടെ
45. ചിരിക്കട്ടേ
46. ചിരിക്കാം
47. ചിരിച്ചോ
48. ചിരിച്ചോളൂ
49. ചിരിച്ചാട്ടെ
50. ചിരിക്കാവുന്നതാണ്
51. ചിരിക്കണേ
52. ചിരിക്കേണമേ
53. ചിരിച്ചേക്കാം
54. ചിരിച്ചോളാം
55. ചിരിക്കാൻ
56. ചിരിച്ചല്ലോ
57. ചിരിച്ചുവല്ലോ

There are a few inflections with adverbial participles
58. ചിരിക്കാൻ
59. ചിരിച്ച്
60. ചിരിക്ക
61. ചിരിക്കിൽ
62. ചിരിക്കുകിൽ
63. ചിരിക്കയാൽ
64. ചിരിക്കുകയാൽ

The verb can also act as an adjectival (relative) clause. Examples
65. ചിരിച്ച
66. ചിരിക്കുന്ന
67. ചിരിച്ചത്
68. ചിരിച്ചതു്
69. ചിരിക്കുന്നത്

The above two forms act as nominal forms. Hence they have all nominal inflections too
70. ചിരിച്ചതിൽ
71. ചിരിക്കുന്നതിൽ
72. ചിരിക്കുന്നതിന്
73. ചിരിച്ചതിന്
74. ചിരിച്ചതിന്റെ
75. ചിരിക്കുന്നതിന്റെ
76. ചിരിച്ചതുകൊണ്ട്
77. ചിരിക്കുന്നതുകൊണ്ട്
78. ചിരിച്ചതിനോട്
79. ചിരിക്കുന്നതിനോട്
80. ചിരിക്കുന്നതിലേയ്ക്ക്

Now, a few voice forms for the verb ചിരിക്കുക
81. ചിരിക്കപ്പെടുക
82. ചിരിപ്പിക്കുക

These voice forms are again just verbs, so they can go through all the inflections of ചിരിക്കുക listed above. I am not writing them here, since it would mostly repeat what is already listed. ചിരിക്കപ്പെടുക has all the inflections of the verb പെടുക. You can see them listed in my test case file, though.

A noun can be derived from the verb ചിരിക്കുക too. That is
83. ചിരിക്കൽ

Since it is a noun, all nominal inflections apply.
84. ചിരിക്കലേ
85. ചിരിക്കലിനോട്
86. ചിരിക്കലിൽ
87. ചിരിക്കലിന്റെ
88. ചിരിക്കലിനെക്കൊണ്ട്
89. ചിരിക്കലിലേയ്ക്ക്
90. ചിരിക്കലിന്

My test file has 164 entries, including the ones I skipped here. As of today, the morphology analyser can parse 74% of the items. You can check the test results here:

A native Malayalam speaker may point out the variation of this word, ചിരിയ്ക്കുക, with യ് before ക്കുക. My intention is to support that variation as well; obviously that word will also have all the inflected forms listed above.

Now that I have written this list, I think a rough English translation of each item would be nice, but it is too tedious for me.

by Santhosh Thottingal at July 15, 2018 12:11 PM

July 03, 2018

Santhosh Thottingal

How to type Malayalam using Keyman 10 and Mozhi

This is a quick tutorial on installing Mozhi input method in Windows 10.

Mozhi is a transliteration-based keyboard for Malayalam: you can type "malayaalam" to get മലയാളം, for example. We will use Keyman as the input tool. Keyman is an open-source input mechanism now developed by SIL. It supports a lot of languages, and Mozhi Malayalam is one of them.

Step 1: Download Keyman desktop with Mozhi Malayalam keyboard

Go to the Keyman Desktop download page. There you will see the following options to download; select the first one as shown below. Download the installer to your computer. It is a file of about 20 MB.

Keyman 10 Desktop download page.

Step 2: Installation

Double click the downloaded file to start installation. The installer will be like this:

Keyman 10 Desktop installer

Click on the Install Keyman Desktop button. You will see the below screen.

Keyman 10 Desktop welcome page.


Press the “Start Keyman” button. The installation will run and the keyboard will start.

Step 3: Choose Mozhi input method

You will see a small icon at the bottom of your screen, near where the time is displayed.

Click on that to choose Mozhi.

Keyboard selection

Once you choose Mozhi, you can type in Manglish anywhere and you will see Malayalam. To learn typing, click on “Keyboard Usage” as shown above.

Step 4: Start typing in Malayalam

You can directly type Malayalam in any application without copy-paste. Just start typing, like in English. Make sure to use a good Malayalam font; you can get one from the SMC fonts collection.

Using Mozhi in LibreOffice. Notice the font used is Manjari. What I typed is “ippOL enikk malayaalam ezhuthaanaRiyaam”.


by Santhosh Thottingal at July 03, 2018 02:41 PM

July 01, 2018

Santhosh Thottingal

Kindle supports custom fonts

I was pleasantly surprised to see that Amazon Kindle now supports installing custom fonts. This is a big step towards supporting non-Latin content on their devices. I can now read Malayalam ebooks on my Kindle with my favorite fonts.

Content rendered in the Manjari font. Note that I installed the Bold, Regular and Thin variants so that Kindle can pick the right one.

This feature was introduced in the Kindle software version released in June 2018. Once updated to that version, all you need to do is connect the device to your computer using the USB cable, copy your fonts to the fonts folder there, and remove the USB cable. You will then see the fonts listed in the font selector.

Kindle added Malayalam rendering support back in 2016, but the default font provided was one of the worst Malayalam fonts: it had wrong glyphs for certain conjuncts and only a minimal glyph set.

I tried some of the SMC Malayalam fonts in the new version of Kindle. Screenshots are given below.

Custom fonts selection screen. These fonts were copied to the device

Select a font other than the default one

Content in Rachana.

Make sure to check the version: the version released in June 2018 is the latest one, and it supports custom fonts.

by Santhosh Thottingal at July 01, 2018 04:15 AM

May 03, 2018

Rajeesh K Nambiar

Adventures in upgrading to Fedora 27/28 using ‘dnf system-upgrade’

[This post was drafted on the day Fedora 27 was released, about half a year ago, but was not published. The issue bit me again with Fedora 28, so I am documenting it for reference next time.]

UPDATE: The issue occurred in Fedora 28 because I had exclude=grub2-tools in /etc/dnf/dnf.conf, which is why the error “nothing provides grub2-tools” was coming up. Removing that previously added and since forgotten line fixes the issue with updating grub2 packages.

With fedup, and subsequently dnf, improving the upgrade experience of Fedora for power users, the last few system upgrades have been smooth, quiet, even unnoticeable. That actually speaks volumes about the maturity and user-friendliness achieved by these tools.

Upgrading from Fedora 25 to 26 was similarly uneventful and smooth (by the way: I have installed and used every version of Fedora since its inception, and the default wallpaper of Fedora 26 was the most elegant of them all!).

With that, on the release day I set out to upgrade the main workstation from Fedora 26 to 27 using dnf system-upgrade as documented. Before downloading the packages, dnf warned that the upgrade could not be done because of package dependency issues with grub2-efi-modules and grub2-tools.

Things go wrong!

I simply removed both the offending packages and their dependencies (assuming they were probably installed for the grub2-breeze-theme dependency; but grub2-tools actually provides grub2-mkconfig) and proceeded with dnf upgrade --refresh and dnf system-upgrade download --refresh --releasever=27. If you are attempting this, don’t remove the grub2 packages yet, but read on!

Once the download and check completed, running dnf system-upgrade reboot should reboot the system into the upgrade target and perform the actual upgrade.

Except, I was greeted with the EFI MOK (Machine Owner Key) screen on reboot. Now that the grub2 bootloader was broken, thanks to the removal of grub2-efi-modules and the other related packages, a recovery had to be attempted.


It is important to have a (possibly UEFI-enabled) live media you can boot from. Boot into the live media and try to reinstall grub. Once booted in, mount the root filesystem under /mnt/sysimage, and the EFI boot partition at /mnt/sysimage/boot/efi. Then chroot /mnt/sysimage and try to reinstall the grub2-efi-x64 and shim packages. If there’s no network connectivity, don’t despair: nmcli is at your rescue; connect to wifi using nmcli device wifi connect <ssid> password <wifi_password>. Generate the boot configuration using grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg, followed by the actual install: grub2-install --target=x86_64-efi /dev/sdX (the --target option ensures a correct host installation even if the live media was booted via legacy BIOS). You may now reboot and proceed with the upgrade.

But this again failed at the upgrade stage because of the grub package clash that dnf had warned about earlier.


Once booted into the old installation, take a backup of the /boot/ directory, remove the conflicting grub-related packages, and copy back the backed-up /boot/ directory contents, especially /boot/efi/EFI/fedora/grubx64.efi. Now rebooting (using dnf system-upgrade reboot) had the grub contents intact, and the upgrade worked smoothly.

For more details on the package conflict issue, follow this bug.

by Rajeesh at May 03, 2018 07:16 AM

March 25, 2018

Balasankar C

FOSSAsia 2018 - Singapore


So I attended my first international FOSS conference: FOSSAsia 2018 at the Lifelong Learning Institute, Singapore. I presented a talk titled “Omnibus - Serve your dish on all the tables” (slides, video) about Chef Omnibus, a tool I use on a daily basis for my job at GitLab.

The conference was four days long, and my main aim was to network with as many people as I could. Well, I planned to attend sessions too, but unlike earlier times when I attended everything, these days I am more focussed on certain topics and technologies and tend to attend sessions on those (for example, devops is an area I focus on; blockchain isn’t).

One additional task I had was to staff the Debian booth at the exhibition from time to time. It was mainly handled by Abhijith (who is a DM). I also met two other Debian Developers there: Andrew Lee (alee) and Héctor Orón Martínez (zumbi).

I also met some other wonderful people at FOSSAsia, like Chris Aniszczyk of CNCF, Dr Graham Williams of Microsoft, Frank Karlitschek of NextCloud, Jean-Baptiste Kempf and Remi Denis-Courmont of VideoLan, Stephanie Taylor of Google, Philip Paeps (trouble) of FreeBSD, Harish Pillai of RedHat, Anthony, Christopher Travers, Vasudha Mathur of KDE, Adarsh S of CloudCV (who is from MEC College, which is quite familiar to me), Tarun Kumar of Melix, Roy Peter of Go-Jek (with whom I am familiar, thanks to the Ruby conferences I attended), Dias Lonappan of Serv, and many more. I also met some whom I previously knew only digitally, like Sana Khan, who was (yet another :D) Debian contributor from COEP. And I met some friends like Hari, Cherry, Harish and Jackson.

My talk went OK without too much stuttering, and I am kinda satisfied with it. The only thing I forgot was to mention during the talk that I had stickers (well, I later placed them on the sticker table and they disappeared within minutes, so that was OK ;)).

PS: Well, I had to cut down quite a lot of my explanation and drop my demo due to limited time. This caused me to miss many important topics like omnibus-ctl or the cookbooks that we use at GitLab. But a few participants came up and met me after the talk with questions about Omnibus: its similarity to Flatpak, its relevance in the times of Docker, etc., which was good.

Some photos are here:

Abhijith in Debian Booth

Abhijith with VLC folks

Andrew's talk

With Anthony and Harish: two born-and-brought-up-in-SG Malayalees

With Chris Aniszczyk

At Debian Booth

With Frank Karlitschek

With Graham Williams

MOS Burgers, our breakfast place

Premas Cuisine, the Kerala taste

The joy of seeing Malayalam

With Sana

Well, Tamil, ftw

Zumbi's talk

March 25, 2018 05:00 AM

February 09, 2018

Rajeesh K Nambiar

Sundar — a new traditional orthography ornamental font for Malayalam

There is a dearth of good Unicode fonts for the Malayalam script. Most publishing houses and desktop publishing agencies still rely on outdated ASCII-era fonts. This not only causes issues when typesetting with present-day technologies; it also makes the ‘document’ or ‘data’ created using these fonts and tools absolutely useless, because the underlying ‘document/data’ is still encoded as Latin, not Malayalam.

Rachana Institute of Typography has designed and published a new traditional-orthography ornamental Unicode font for the Malayalam script, for use in headings, captions and titles. It is named after Sundar, who was a relentless advocate of open fonts, open standards and open publishing. He dreamed of making several good-quality Malayalam fonts, particularly those created by Narayana Bhattathiri with his unique calligraphic and typographic signature, freely and openly available to users. The font is licensed under the OFL.

The font follows the traditional orthography of Malayalam, rather than the unpleasing reformed orthography which was introduced solely due to the technical limitations of typewriters in the ’70s. Such restrictions do not apply to computers and present-day technology, so it is possible to render the classic beauty of the Malayalam script using Unicode and OpenType technologies.

‘Sundar’ is designed by K.H. Hussain, known for his work on the Rachana and Meera fonts which come pre-installed with most Linux distributions, and Narayana Bhattathiri, known for his beautiful calligraphy and lettering in the Malayalam script. Graphic engineers of STM Docs did the vectoring and glyph creation. Yours truly took care of the OpenType feature programming. The font can be freely downloaded from

The source code of ‘Sundar’, licensed under the OFL, is available at

by Rajeesh at February 09, 2018 07:53 AM

January 17, 2018

Balasankar C

Introduction to Git workshop at CUSAT


It has been long since I wrote anything here. In the last year I attended some events, like FOSSMeet, DeccanRubyConf and GitLab’s summit, and didn’t write anything about them. The truth is, I forgot I used to write about all these and never got the motivation to do it.

Anyway, last week I conducted a workshop on Git basics for the students of CUSAT. My real plan, as always, was to do a bit of FOSS evangelism too. Since the timespan of the workshop was limited (10:00 to 13:00), I decided to keep everything to the bare basics.

I started with an introduction to what a VCS is and how it became necessary. As a prerequisite, I talked about FOSS, the concept of collaborative development, the open source development model, etc. It wasn’t easy, as my audience was not only CS/IT students but also those from other departments like Photonics and Physics. I am not sure if I was able to help them understand the premise clearly. I then went on to talk about what Git does and how it helps developers across the world.

IIRC, this was the first talk/workshop I did without a slide show. I was too lazy and busy to create one; I just had one page saying “Git Workshop” and my contact details. So guess what? I used a whiteboard! I went over the basic concepts like repositories, commits, the staging area, etc., and started the hands-on session. In short, I talked about the following:

  1. Initializing a repository
  2. Adding files to it
  3. Adding files to the staging area
  4. Committing
  5. Viewing commit logs
  6. Viewing what a specific commit did
  7. Viewing a file’s contents at a specific commit
  8. Creating a GitLab account (Well, use all opportunity to talk about your employer. :P)
  9. Creating a project in GitLab
  10. Adding it as a remote repository to your local one
  11. Pushing your changes to remote repository

I wanted to talk about clone, fork, branch and MRs, but time didn’t permit. We wound up the session with Athul and Kiran talking about how they need the students to join the FOSSClub of CUSAT and help organize similar workshops, and how it can help them as well. I too did a bit of a “motivational talk” about how community activities can help them get a job, based on my personal experience.

Here are a few photos, courtesy of Athul and Kiran:

January 17, 2018 06:00 AM

September 07, 2016

Balasankar C

SMC/IndicProject Activities- ToDo List


So, M.Tech is coming to an end and I should probably start searching for a job soon. Still, it seems I will have a bit of free time from mid-September. I have some plans about the areas I should contribute to in SMC/Indic Project. As of now, the bucket list is as follows:

  1. Properly tag versions of fonts in SMC GitLab repo - I had taken over the package fonts-smc from Vasudev, but haven’t done any update on it yet. The main reason was fontforge being old in Debian. Also, I was waiting for some kind of official release of the new versions by SMC. Since the new versions are already available on the SMC Fonts page, I assume I can go ahead with my plans. So, as a first step, I have to tag the versions of the fonts in the corresponding GitLab repos. Need to discuss whether to include the TTF files in the repo or not.
  2. Restructure LibIndic modules - Those who were following my GSoC posts will know that I made some structural changes to the modules I contributed in LibIndic. (Those who don’t can check this mail I sent to the list). I plan to do this for all the modules in the framework, and to co-ordinate with Jerin to get REST APIs up.
  3. GNOME Localization - GNOME localization has been dead for almost two years now. Ashik has shown interest in re-initiating it, and I plan to help with that. I first have to get my committer access back.
  4. Documentation - Improve documentation about SMC and IndicProject projects. This will be a troublesome and time-consuming task, but I would still like our tools to have proper documentation.
  5. High Priority Projects - Create a static page about the high priority projects so that people can know where and how to contribute.
  6. Die Wiki, Die - Initiate porting Wiki to a static site using Git and Jekyll (or any similar tool). Tech people should be able to use git properly.

Knowing me better than anyone else does, I understand there is every chance of this becoming a “never-implemented plan” (അതായത് ആരംഭശൂരത്വം, i.e. all enthusiasm at the start :D), but I still intend to work through this in an easy-first order.

September 07, 2016 04:47 AM

August 29, 2016


GSoC — Final Report!

So, it is finally over. Today is the last date for submission of the GSoC project. This entire ride was very informative as well as a great experience. I thank the Indic Project organisation for accepting my GSoC project, and my mentors Navaneeth K N and Jishnu Mohan for helping me out fully throughout this project.

The project kicked off with the aim of incorporating the native libvarnam shared library by writing JNI wrappers. Unfortunately, that method came to a stall when we were unable to import the libraries correctly, due to the lack of sufficient official documentation. So my mentor suggested an alternative approach: making use of the Varnam REST API. This has been successfully incorporated for 13 languages, with the app requiring an internet connection. Along with it, the suggestions which come up are the ones returned by Varnam, in priority order. I will keep contributing to Indic Project to make the library method work. Apart from that, see the useful links below:

  • this and this is related to adding a new keyboard with “qwerty” layout.
  • this is adding a new SubType value and a method to identify TransliterationEngine enabled keyboards.
  • this is adding the Varnam class and setting the TransliterationEngine.
  • this and this deals with applying the transliteration by Varnam and returning it back to the keyboard.
  • this is the patch to resolve the issue, program crashes on switching keyboards.
  • this makes sure that after each key press, the displayed word is refreshed and the transliteration of the entire word is shown.
  • this makes sure that on pressing deletion, the new word is displayed.
  • this creates a template such that more keyboards can be added easily.
  • this makes sure that the suggestions appearing are directly from the Varnam engine and not from the inbuilt library.
  • The lists of the commits can be seen here which includes the addition of layouts for different keyboards and nit fixes.

Add Varnam support into Indic Keyboard

The project as a whole is almost complete. The only thing left to do is to incorporate the libvarnam library into the APK, so that we can call it instead of the Varnam class given here. The ongoing work for that can be seen below:


varnamc -s ml -t "Adutha ThavaNa kaaNaam" # See you next time

by Vishnu H Nair at August 29, 2016 08:18 AM

August 23, 2016

Anwar N

GSoC 2016 IBus-Braille-Enhancement Project - Summary

   First of all, my thanks to Indic Project and Swathanthra Malayalam Computing (SMC) for accepting this project. Hats off to my mentors Nalin Sathyan and Samuel Thibault. The project was awesome, and I believe I did my best on it despite having no prior experience.

Project Blog :

Now let me outline what we have done during this period.

Braille-Input-Tool (The on-line version)
  Just like Google transliteration or Google Input Tools online. This is required because it is completely operating-system independent, and it is a modern method which never forces the user to install an additional plugin or a specific browser. The user might use this from temporary places like an internet cafe. It is written using jQuery and HTML, and works well on GNU/Linux, Microsoft Windows, Android, etc.

See All Commits :
Test with following link :

IBus-Braille enhancements
See All Commits :

1 IBus-Braille integrated with Liblouis : The Liblouis software suite provides an open-source braille translator, back-translator and formatter for a large number of languages and braille codes. Maintaining and shipping separate braille maps (located at /share/ibus-sharada-braille/braille) with ibus-braille was therefore a bad idea, so we fully adapted IBus-Braille to use Liblouis. The conversion is done one entire word at a time instead of letter by letter, i.e. the conversion happens after writing direct braille Unicode and pressing space.
Commit 1 :
Commit 2 :
Commit 3 :

See the picture of the IBus-Braille preferences given below

2 8-dot braille enabled : Yes, there are languages having more than 64 characters, which can’t be handled with the 64 combinations of 6 dots. Music notations like “Abreu” and LAMBDA (Linear Access to Mathematics for Braille Device and Audio Synthesis) use the 8-dot braille system, and Unicode supports 8-dot braille.
Commit 1 :

See the picture of the ISB preferences key/shortcut page dot settings

3 Dot-4 issue solved : In IBus-Braille, when typing in Bharati braille for languages such as Malayalam, Hindi, etc., we had to use 13-4-13 to get the letter ക്ക (kka). But according to the braille standard, one should press 4-13-13 to get ക്ക, and this made beginners do extra learning before they could start typing. Through this project we solved this issue, and a conventional-braille-mode switch is provided in the preferences to switch between the two behaviours.

Commit :

4 Facility to write direct braille Unicode added : Now one can use IBus-Braille to type braille dot notation directly with the key combination. The output may be sent to a braille embosser, an impact printer that renders text in braille characters as tactile braille cells.

Commit :

5 Three to six, for people with one hand : A three-key implementation which uses the delay between key presses. For example, 13 followed by 13 with a delay less than the delay factor (e.g. 0.2 s) will give X; if more, the output will be KK. If one wants to type a letter whose combination uses only dots 4, 5 and 6, they have to press the "t" key first. The key and the conversion delay can be adjusted from the preferences.

Commit :
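The delay-based combination described above can be sketched like this. This is a hedged illustration, not the actual IBus-Braille code: the function name, the set-based dot representation and the 0.2 s default are assumptions for the sketch.

```python
# Sketch of the one-hand "three to six" mode: the physical keys produce
# dots 1-3; a second press within the delay window supplies dots 4-6.
DELAY = 0.2  # seconds; configurable in the preferences of the real tool

def combine_presses(first_dots, second_dots, gap):
    """Combine two presses of the dot 1-3 keys into braille cells.

    first_dots, second_dots: sets of dots from {1, 2, 3}
    gap: seconds elapsed between the two presses
    """
    if gap < DELAY:
        # quick succession: the second press is shifted to dots 4-6,
        # giving one 6-dot cell (e.g. 13 + 13 -> dots 1346, the letter X)
        return [first_dots | {d + 3 for d in second_dots}]
    # slow: two independent cells (e.g. 13 + 13 -> K, K)
    return [first_dots, second_dots]
```

So pressing the dots-13 chord twice within the delay yields the single cell {1, 3, 4, 6} (X), while a longer gap yields two separate {1, 3} cells (KK), exactly as described above.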

6 Arabic language added
Commit :

7 Many bugs solved
Commit :
others are implied

Project Discourse :
IBus-Sharada-Braille (GSoC 2014) :

Plugins for firefox and chrome
    This plugin, once installed, will work with every text entry on web pages; no need for copy-paste. The extensions are written in JavaScript.
See All Commits :

Modifications yet desirable are as follows:

1 Announce extra information through the screen reader : When the user expands an abbreviation, or a contraction having more than 2 letters is substituted, the screen reader does not announce it. We have to write an Orca (screen reader) plugin for IBus-Braille.

2 A UI for Creating and Editing Liblouis Tables

3 Add support for more Indic languages and mathematical operators via Liblouis

Braille-input-tool (online version)
                       Liblouis integration
Conventional Braille, Three Dot mode and Table Type selection 
Chrome Extension

Direct braille unicode typing
 Eight dot braille enabled

by Anonymous ( at August 23, 2016 04:39 AM

August 22, 2016

Sreenadh T C

It’s a wrap!

“To be successful, the first thing to do is to fall in love with your work — Sister Mary Lauretta”

Well, the Google Summer of Code 2016 is reaching its final week as I get ready to submit my work. It has been one of the best three-four months of serious effort and commitment. To be frank, this has been one of those works to which I was fully motivated and gave my 100%.

Well, at first the results of training weren’t that promising, and I was actually let down. But then my mentor and I had a series of discussions on submitting, during which she suggested I retrain the model excluding the data sets (audio files) of the speakers which produced the most errors. After completing the batch test, I noticed that four of the data sets had the worst accuracy, shockingly below 20%. This was dragging the overall accuracy down.

So, I decided to delete those four data sets and retrain the model. It was not that big a deal, so I thought it was not going to be a drastic change from the current model. But the result put me into a state of shock for about 2-3 seconds. It said:

TOTAL Words: 12708 Correct: 12375 Errors: 520
TOTAL Percent correct = 97.38% Error = 4.09% Accuracy = 95.91%
TOTAL Insertions: 187 Deletions: 36 Substitutions: 297
SENTENCE ERROR: 9.1% (365/3993) WORD ERROR RATE: 4.1% (519/12708)
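As a sanity check, the percentages in the scorer output above follow directly from the raw counts. A small sketch using the numbers from the report (note that insertions count towards the error rate but not against "percent correct", which is why the two figures do not sum to exactly 100):

```python
# Reproduce the scoring summary above from the raw word counts.
total_words = 12708
correct = 12375
insertions, deletions, substitutions = 187, 36, 297

errors = insertions + deletions + substitutions   # 520
percent_correct = 100 * correct / total_words     # ~97.38
error_rate = 100 * errors / total_words           # ~4.09
accuracy = 100 - error_rate                       # ~95.91

print(f"Percent correct = {percent_correct:.2f}% "
      f"Error = {error_rate:.2f}% Accuracy = {accuracy:.2f}%")
```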

Now, this looks juicy and near perfect. But the thing is, the sentences are tested as they were trained. So, if we change the structure of the sentences we ultimately give it to recognize, it will still have issues putting out the correct hypothesis. Nevertheless, it was far better than the previous model.

So I guess I will settle with this for now, as the aim of the GSoC project was to start the project and show proof that this can be done; I will keep training better models in the near future.

Google Summer of Code 2016 — Submission

  1. Since the whole project was carried out under my personal GitHub repository, I will link the commits in it here : Commits
  2. Project Repository : ml-am-lm-cmusphinx
  3. On top of that, we (the organization and I) had a series of discussions regarding the project over here: Discourse IndicProject

Well, I have been documenting my way through the project over at Medium, starting from the month of May. The blogs can be read here.

What can be done in near future?

Well, this model is still in its early stage and is not yet error-free, let alone ready to be applied in applications.

The data set is still buggy and has to be improved with better, cleaner audio data and a more tuned language model.

Speech recognition development is rather slow and is obviously community based. All of this is possible with collaborative work towards achieving a user-acceptable level of practical accuracy, rather than quoting a statistical, theoretical accuracy.

All necessary steps and procedures have been documented in the README sections of the repository.

puts "thank you everyone!"

by Sreenadh T C at August 22, 2016 07:01 AM

August 21, 2016

Arushi Dogra

GSoC Final Report

It is almost the end of the GSoC internship. From zero knowledge of Android to writing a proposal, the proposal getting selected, and finally three months working on the project: it was a great experience for me! I have learned a lot, and I am really thankful to Jishnu Mohan for mentoring me throughout.

Contributions include:

All the tasks mentioned in the proposal were discussed and worked upon.

I started with making the designs of the layouts. The task was to make Santali Olchiki and Soni layouts for the keyboard. I looked at the code of the other layouts to get a basic understanding of how phonetic and InScript layouts work. A snapshot of one view of the Santali keyboard:

Screen Shot 2016-08-21 at 6.53.03 PM

Language support feature
While configuring languages, the user is warned about locales that might not be supported by the phone.

Screen Shot 2016-08-21 at 6.33.25 PM

Adding theme feature
A feature was added to the setup to enable the user to select the keyboard theme.

Screen Shot 2016-08-21 at 6.49.21 PM

Merging AOSP code
After finishing everything mentioned in the proposal, Jishnu gave me the job of merging the AOSP source code into the keyboard, as the current keyboard doesn’t have the changes that were released along with the Android M code drop, because of which the target SDK is not 23. There are a few errors yet to be resolved, and I am working on that 😀

Overall, it was a wonderful journey, and I will always want to be a contributor to this organisation, as it introduced me to the world of open source and opened up a whole new area to work on and learn more about.
Link to the discourse topic :

Thank You!  😀

by arushidogra at August 21, 2016 01:29 PM

August 17, 2016

Balasankar C

GSoC Final Report


It is finally time to wind up the GSoC work in which I have been buried for the past three months. First of all, let me thank Santhosh, Hrishi and Vasudev for their help and support. I have implemented, or at least proved, the concepts that I mentioned in my initial proposal: a spell checker that can handle inflections in the root word, generate suggestions in the same inflected form, and differentiate between spelling mistakes and intended modifications. My major contributions were to

  1. Improve LibIndic’s Stemmer module. - My contributions
  2. Improve LibIndic’s Spell checker module - My contributions
  3. Implement relatively better project structure for the modules I used - My contributions on indicngram

1. Lemmatizer/Stemmer


My initial work was on improving the existing stemmer available as part of LibIndic. The existing implementation was rule based and capable of handling only a single level of inflection. The main problems of this stemmer were

  1. General incompleteness of rules - plurals (പശുക്കൾ), numerals (പതിനാലാം) and verbs (കാണാം) are missing.
  2. Inability to handle multiple levels of inflection - (പശുക്കളോട്)
  3. Unnecessary stemming of root words that look like inflected words - (ആപത്ത് -> ആപം, following the rule of എറണാകുളത്ത് -> എറണാകുളം)

The above-mentioned issues were fixed. The remaining category is verbs, which need more detailed analysis.

I decided to maintain the rule-based approach for the lemmatizer (actually, what we are designing is halfway between a stemmer and a lemmatizer; since it is more inclined towards a lemmatizer, I am going to call it that), mainly because implementing any ML or AI technique requires sufficient training data, without which the efficiency would be very poor. It felt better to gain higher efficiency with the available rules than to try ML techniques with no guarantee (known-devil-is-better logic).

The basic logic behind the multi-level inflection handling lemmatizer is iterative suffix stripping. At each iteration, a suffix is identified in the word and transformed to something else based on predefined rules. When no more suffixes are found that match the rule set, we assume all the levels of inflection have been handled.

To prevent root words that look like inflected words (hereafter called ‘exceptional words’) from being stemmed unnecessarily, we obviously have to use a root-word corpus. I used the Datuk dataset, made openly available by Kailash, as the root-word corpus. A corpus comparison is performed before the iterative suffix stripping starts, so as to handle root words without any inflection. Thus, the word ആപത്ത് gets handled even before the iteration begins. However, what if the input word is an inflected form of an exceptional word, like ആപത്തിലേക്ക്? This makes it necessary to repeat the corpus comparison after each iteration.

Lemmatizer Flowchart

At each iteration, suffix stripping happens from left to right. The initial suffix has the 2nd character as its starting point and the last character as its end point. At each inner iteration, the starting point moves rightwards, making the suffix shorter and shorter. Whenever a suffix is obtained that has a transformation rule defined in the rule set, it is replaced with the corresponding transformation. This continues until the suffix becomes empty.

Multi-level inflection is handled on the logic that each match in the rule set raises the hope that one more inflection is present. So, before each iteration, a flag is set to False; whenever a rule-set match occurs during that iteration, it is set to True. If the flag is True at the end of an iteration, the loop repeats; otherwise, we assume all inflections have been handled.

Since this lemmatizer is also used along with a spell checker, we need a history of the inflections identified, so that the lemmatization process can be reversed. For this purpose, I tagged the rules unambiguously. Each time an inflection is identified, that is, when the extracted suffix finds a match in the rule set, the associated tag is pushed to a list in addition to applying the transformation. As the result, the stem is returned to the user along with this list of tags. The list of tags can be used to reverse the lemmatization, for which I wrote an inflector function.
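The iterative suffix stripping with tagging described above can be sketched roughly as follows. This is a minimal illustration, not LibIndic's actual code: the rule set, tags and root-word list are toy stand-ins for the real rule tables and the Datuk corpus.

```python
# Toy sketch of iterative suffix stripping with rule tags.
# RULES maps a suffix to (replacement, tag); both entries are illustrative.
RULES = {
    "ോട്": ("", "SOC"),   # sociative case marker (hypothetical rule)
    "ക്കള": ("", "PL"),    # plural marker (hypothetical rule)
}
ROOT_WORDS = {"പശു", "ആപത്ത്"}  # stand-in for the root-word corpus

def lemmatize(word):
    """Strip suffixes iteratively, recording a tag for each rule applied."""
    tags = []
    # corpus check before (and, via the loop condition, after) each iteration
    while word not in ROOT_WORDS:
        matched = False  # flag: did any rule fire during this iteration?
        # try suffixes starting from the 2nd character, moving rightwards
        for start in range(1, len(word)):
            suffix = word[start:]
            if suffix in RULES:
                replacement, tag = RULES[suffix]
                word = word[:start] + replacement
                tags.append(tag)
                matched = True
                break
        if not matched:  # no rule fired: all inflections handled
            break
    return word, tags
```

With these toy rules, lemmatize("പശുക്കളോട്") strips the sociative marker first and then the plural marker, returning the stem പശു together with the tag list ["SOC", "PL"], which an inflector can replay in reverse.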

A demo screencast of the lemmatizer is given below.

So, compared with the existing stemmer algorithm in LibIndic, the one I implemented as part of GSoC shows considerable improvement.

Future work

  1. Add more rules to increase grammatical coverage.
  2. Add more grammatical details - Handling Samvruthokaram etc.
  3. Use this to generate sufficient training data that can be used for a self-learning system implementing ML or AI techniques.

2. Spell Checker


The second phase of my GSoC work involved making the existing spell checker module better. The problems I could identify in the existing spell checker were

  1. It could not handle inflections in an intelligent way.
  2. It used a corpus that needed inflections in them for optimal working.
  3. It used only levenshtein distance for finding out suggestions.

As part of GSoC, I incorporated the lemmatizer developed in phase one into the spell checker, which handles the inflection part. Three metrics were used to find suggestion words: Soundex similarity, Levenshtein distance and Jaccard index. The inflector module developed along with the lemmatizer was used to generate suggestions in the same inflected form as the original word.

There were some general assumptions and facts which I inferred and collected while working on the spell checker. They are

  1. Malayalam is a phonetic language, where a word is written just as it is pronounced. This is the opposite of English, where letters have different pronunciations in different words; an example is the English letter “a”, which is pronounced differently in “apple” and “ate”.
  2. Spelling mistakes in Malayalam, hence, are also phonetic. The mistakes occur by a character with similar pronunciation, usually from the same varga. For example, അദ്ധ്യാപകൻ may be written mistakenly as അദ്യാപകൻ, but not as അച്യാപകൻ.
  3. A spelling mistake does not simply mean a word that is not present in the dictionary. The user has to be considered intelligent and should be trusted not to make mistakes; a word not present in the dictionary can also be an intentional modification. A "mistake" is something which is not in the dictionary AND which is very similar to a valid word. If a word is not found in the dictionary and no similar words are found, it has to be considered an intentional change the user introduced, and hence should be deemed correct. This often solves the issue of foreign words being deemed incorrect.
  4. Spelling mistakes in inflected words usually happen in the lemma of the word, not the suffix. This is because the most commonly used suffixes are pronounced distinctly, so mistakes have a smaller chance of being present there.

Spell checker architecture

The first phase, obviously, is a corpus comparison to check whether the input word is a valid word. If it is not, suggestions are generated. For this, a range of words has to be selected. From the logic of Malayalam having phonetic spelling mistakes, the words selected are the ones starting with the characters that are the linguistic successor and predecessor of the first character of the input word. That is, for the input word ബാരതം, which has ബ as its first character, the words selected will be the ones starting with ഫ and ഭ. Out of these, the top N (defaulting to 5) words most similar to the input word have to be found.
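The neighbour-character selection described above can be sketched as follows. The alphabet fragment and function name here are purely illustrative (only the 'pa' varga is shown); the real implementation works over the full Malayalam alphabet.

```python
# Toy sketch: candidate suggestion words start with the linguistic
# predecessor or successor of the input word's first character.
VARGA = ["പ", "ഫ", "ബ", "ഭ", "മ"]  # illustrative fragment of the alphabet

def neighbour_initials(first_char):
    """Return the characters whose dictionary sections should be scanned."""
    i = VARGA.index(first_char)
    return [VARGA[j] for j in (i - 1, i + 1) if 0 <= j < len(VARGA)]
```

For the document's example, neighbour_initials("ബ") yields ഫ and ഭ, matching the ബാരതം case above.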

Three metrics were used for finding the similarity between two words. For Malayalam, a phonetic language, Soundex similarity was given top priority. To handle words that are similar but not phonetically similar, because of a difference in a single character that defines phonetic similarity, Levenshtein distance was also used; it gives the distance between two words, i.e. the number of operations needed to transform one word into the other. To handle the remaining words, the Jaccard index was also used. The priority was assigned as soundex > levenshtein > jaccard. A weight was assigned to each possible suggestion based on the values of these three metrics, using the following logic:

If soundex == 1, weight = 100
Elseif levenshtein <= 2, weight = 75 + (1.5 * jaccard)
Elseif levenshtein < 5, weight = 65 + (1.5 * jaccard)
Else, weight = 0
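The weighting scheme above translates directly into code. A minimal sketch, assuming the three metric values have already been computed by separate Soundex, Levenshtein and Jaccard routines (not shown here):

```python
def suggestion_weight(soundex_same, levenshtein, jaccard):
    """Weight a candidate suggestion, mirroring the scheme above.

    soundex_same: True when the Soundex codes of the two words match
    levenshtein:  edit distance between input word and candidate
    jaccard:      Jaccard index (0..1) between the two words
    """
    if soundex_same:
        return 100
    if levenshtein <= 2:
        return 75 + 1.5 * jaccard
    if levenshtein < 5:
        return 65 + 1.5 * jaccard
    return 0
```

A candidate with a matching Soundex code scores 100 outright; otherwise the edit-distance band sets the base weight and the Jaccard index breaks ties within the band.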

To differentiate between spelling “mistakes” and intended modifications, the logic used was that if a word does not have N suggestions with weight > 50, it is most probably an intended word and not a spelling mistake. So, such words were deemed correct.

A demo screencast of the spell checker is given below.

3. Package structure

The existing modules of LibIndic had an inconsistent package structure that gave no visibility to the project. Also, the package names were too general and didn’t convey the fact that they were meant for Indic languages. So, I suggested and implemented the following changes:

  1. Package names (of the ones I used) were changed to the libindic-<module> pattern. Examples would be libindic-stemmer, libindic-ngram and libindic-spellchecker. So, users will easily understand that the package is part of the LibIndic framework, and thus meant for Indic text.
  2. Namespace packages (PEP 420) were used, so that import statements of LibIndic modules are of the form from libindic.<module> import <language>. This increases the visibility of the ‘libindic’ project considerably.

August 17, 2016 04:47 AM

August 16, 2016

Anwar N

IBus-Braille Enhancement - 3

 A hard week passed!

1 Conventional braille mode enabled : Through this we solved the dot-4 issue, and now one can type braille without any extra knowledge.

commit 1 :

2 Handle configure-parser exceptions : A corrupted ISB configuration file could prevent it from starting; I solved this with proper exception handling.

commit 2 :

3 Liblouis integration : I think our dream is about to come true! But we are still struggling with vowel substitution in the middle of words.
commit 3 :
commit 4 :
commit 5 :

by Anonymous ( at August 16, 2016 08:35 PM