Planet SMC

June 08, 2019

Santhosh Thottingal

Markov chain for Malayalam

I have been trying to generate a Markov chain for Malayalam content. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (Wikipedia). For natural language, it is a probabilistic model of words: the probability that one word follows another. The model is prepared by feeding a large amount of text to a system that learns these word-to-word probabilities.

For Malayalam, I used the SMC Malayalam corpora, with the markovchain Python library as the tool to build the model. I had to make some bug fixes and customizations to get it working for Malayalam, and the developer of the library was generous enough to merge my pull requests.

A Markov chain by itself is not interesting to a general user, since it does not provide any direct benefit. But it is a foundation for many applications such as speech recognition, handwriting recognition and automatic text generation. Mainly, it is used as a tool that predicts the next word given a prompt word. So I built a web application and web API that predict the next Malayalam word.

This application is available at https://predict.smc.org.in/ and the source code is at https://gitlab.com/smc/mlpredict

Another interesting application is automatic text generation. Some sample texts generated:

നാളെ വീണ്ടും ഉപേക്ഷിയ്ക്കപ്പെടുകതന്നെയായിരിക്കില്ലേ അവരുടെ ഉല്പന്നങ്ങളെക്കുറിച്ചുള്ള വിശദാംശങ്ങൾ പ്രസിദ്ധീകരിക്കാനായി കമ്പനിയെ സമ്മതിപ്പിക്കാൻ നമുക്കാകുന്നുണ്ടു്. ചിലപ്പോൾ സമൂഹവുമായി സഹകരിച്ചും നമ്മുടെ കമ്പ്യൂട്ടറുകളിലും ഡിജിറ്റൽ.’

നാളെ കാലത്തു കുറച്ചു വെള്ളം കോരിയൊഴിച്ചു കുടം നിറച്ചു കഞ്ഞിയുണ്ടായി. അതു വരുമ്പോൾ കുട്ടികളുടെ ഒന്നും ചേർന്നു തന്നെ. വരികളോർമ്മിച്ച് ആസ്വദിച്ച് കൊണ്ടുള്ള കഞ്ഞിയോ പുഴുക്കോ ആയിരുന്നു വലിയൊരു കൂട്ടത്തിന്റെ വിലാപത്തിന്റെ സംഗീതികതന്നെയായി മാറുകയാണ് ഈ.

ഇനിയും വല്ലതും തിന്നുകയും ചെയ്തതിന്റെശേഷം കൊട്ടാരംവക ആനയെ അതുവരെ ഇവിടെ വന്നു തുടങ്ങി. എങ്കിലും നിന്റെ കമ്പ്യൂട്ടറിനെ അനുഗ്രഹിക്കുന്നു കുട്ടീ. വികസിപ്പിക്കാവുന്ന ടെക്സ്റ്റ് ബുക്കായി ഉപയോഗിക്കാവുന്ന തരത്തിൽ അതിനെപറ്റി സങ്കൽപ്പിക്കാൻ സാധ്യമല്ല. അതുകൊണ്ട്, ആസന്നമായിരിക്കുന്നുവെന്ന് എല്ലാ ജില്ലകളിലും കളക്ടർമാരുടെ നേതൃത്വത്തിൽ നടത്തിയ നിക്ഷേപവുമാണു്, അല്ലാതെ മറ്റൊരു സുഖം. സ്കൂൾജീവിതം കഴിഞ്ഞപ്പോൾ അതു് നിങ്ങളുടെ പിന്തുണ ഉറപ്പാക്കാനായിട്ടില്ല. ഇതു് നിസ്സാരകാര്യമല്ല. ഭാരതി എയർടെൽ സീറോയുടെ ഭാഗമായി ബിയർ പാർലറിന്റെ ചുമരിടിച്ചു തകർത്താണ് സർജെന്റ് ഐസക്കും, കൂട്ടാളികളും ചെക്കോസ്ലോവാക്യൻ മണ്ണിൽ പിറക്കണമെന്ന് ജനിക്കാനിരിക്കുന്ന പെൺകുഞ്ഞ് ഭീതി കലർന്ന വാർത്തകൾ വിശ്വസിച്ച് ഈ വിവരങ്ങൾ നിങ്ങൾക്കു നശിപ്പിക്കാം, തോല്പിക്കാനാവില്ല എന്ന ചോദ്യം 3: പ്രോലിറ്റേറിയന്മാർ എക്കാലത്തുമുണ്ടായിരുന്നില്ലെന്നല്ലേ ഇതിന്റെ ഏഴിരട്ടിയുണ്ടെന്നോർക്കുക. ചുറ്റോടുചുറ്റുമുള്ള കടലോരങ്ങളുടെ ചാരുത മുതൽ അവസാനംവരെ അവന്റെ കചക്കയറിന്മേൽ കെട്ടി ചിലപ്പോഴൊക്കെ നമ്മളെ ഭയപ്പെടുത്തുന്നതാണെന്നു് നാം ഭൂമിക്കുചുറ്റും മണിക്കൂറിൽ 1600 – ൽ കൂടുതൽ.
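Both next-word prediction and text generation come down to counting which word follows which. The following is a minimal sketch of that idea in plain Python. It does not use the markovchain library or the SMC corpora; the function names and the toy English corpus are made up for illustration, and the same whitespace-based counting would apply to Malayalam text.

```python
import random
from collections import Counter, defaultdict

# Toy corpus for illustration only; a real model needs a large text collection.
TOY_CORPUS = "the cat sat on the mat the cat ran"


def build_chain(text):
    """Count, for every word, how often each other word follows it."""
    chain = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain


def predict_next(chain, word, k=3):
    """Return up to k most frequent successors of `word` with probabilities."""
    followers = chain.get(word)
    if not followers:
        return []
    total = sum(followers.values())
    return [(w, count / total) for w, count in followers.most_common(k)]


def generate(chain, start, length=10, seed=None):
    """Random walk over the chain, choosing successors by observed frequency."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = chain.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        candidates = list(followers.keys())
        weights = list(followers.values())
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)


if __name__ == "__main__":
    chain = build_chain(TOY_CORPUS)
    print(predict_next(chain, "the"))
    print(generate(chain, "the", length=6, seed=1))
```

Because each step looks only at the previous word, the generated text is locally plausible but drifts in meaning, which is what the samples above show.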

Have fun!

by Santhosh Thottingal at June 08, 2019 04:23 AM

Updated web interface for mlmorph

The web interface of the Malayalam morphology analyser (mlmorph) has been updated. You can see the new interface at https://morph.smc.org.in/. The new web application is written in Vue.js using the Vuetify UI framework. The backend is Flask. Source code is available at https://gitlab.com/smc/mlmorph-web

Morphology analysis
Morphology generator
Named entity recognition
Spellchecker
Number spellout

by Santhosh Thottingal at June 08, 2019 03:48 AM

June 05, 2019

Santhosh Thottingal

Chilanka version 1.400 released

A new version of the Chilanka typeface is available now. Version 1.400 can be downloaded from SMC’s font download and preview site, smc.org.in/fonts

For users, there are not many changes, but the source code and the build system got a major upgrade.

  • Source code updated from the fontforge sfd format to the UFO format. This makes it possible to work with modern font editors.
  • Cubic beziers are used for the master design, and OTF is generated along with TTF. The original drawings for Chilanka used cubic beziers.
  • fontmake is used for building the TTF and OTF, similar to the latest font projects by SMC.
  • fontbakery is used for tests, and all tests are passing now.
  • Added a few important Latin glyphs that were missing, as reported by fontbakery.

by Santhosh Thottingal at June 05, 2019 02:58 PM

May 26, 2019

Santhosh Thottingal

Lexicon Curation for Mlmorph

One of the key components of Mlmorph is its lexicon. The lexicon contains the root words categorized as nouns, verbs, adjectives, adverbs etc. These are the components used with morphological rules to generate the vocabulary of Malayalam. I collected an initial lexicon of about 100,000 words from various sources such as Wikipedia, CLDR and many targeted web crawls. One problem with words collected this way is that they often contain spelling mistakes. Secondly, classifying these words is not possible without the tedious task of a person going through each and every word.

So, I was thinking of a solution which consists of:

  • A crawler or multiple targeted crawlers looking for candidate words. For example, I can write a script that goes through the entire Malayalam Wikipedia dump and looks for words that are most probably nouns, inflected nouns or words derived from nouns. This is possible with some kind of pattern matching. For example, a word ending with -യുടെ, -ിന്റെ, -ിൽ or -യെ is most probably a noun (we don’t know whether it is a pronoun, place name or person name; that requires human curation). A word ending with -ക്കുക, -ച്ചു, -ട്ട് or -ിരുന്നു is most probably a verb.
  • A database and an application that help a person to quickly approve the prediction, remove a misspelled word, edit a word to correct mistakes, or choose the correct POS tag.
  • A set of scripts that take the curated words into the lexicon of mlmorph. Also, as mlmorph learns new root words, the database will require a refresh, since mlmorph starts recognizing words related to the newly learned words.
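The suffix-based guess from the first item of the list can be sketched as below. This is only an illustration, not the actual crawler code: the function name is hypothetical, the suffix lists contain just the endings mentioned in the list, and a real script would need many more patterns plus the human curation step.

```python
# Endings taken from the examples in the text; far from a complete list.
NOUN_SUFFIXES = ("യുടെ", "ിന്റെ", "ിൽ", "യെ")
VERB_SUFFIXES = ("ക്കുക", "ച്ചു", "ട്ട്", "ിരുന്നു")


def guess_pos(word):
    """Return 'noun', 'verb' or None based only on the word ending.

    This is a guess for a human curator to confirm, not a classification.
    """
    if word.endswith(VERB_SUFFIXES):
        return "verb"
    if word.endswith(NOUN_SUFFIXES):
        return "noun"
    return None


def candidate_words(words):
    """Map each word that matches a pattern to its guessed POS."""
    candidates = {}
    for word in words:
        pos = guess_pos(word)
        if pos is not None:
            candidates[word] = pos
    return candidates
```

Words for which `guess_pos` returns `None` are simply skipped, so the curator only sees candidates that already carry a prediction.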

Over the last few days, I was working to implement this system. At the same time, I was learning and practicing Vue.js. I was amazed by how productive it is for quickly building clean and fast modern web applications, so I decided to use it for my curator application. For the database, I found Firebase with Vuefire to be a perfect fit. Vuetify helped with quick UI styling. Without writing any specific code for database management, I got the whole system working.

Screenshot of the lexicon curator application. The two words shown here are misspelled, and I can quickly remove them. The prediction for these two words is Verb.

The mobile-friendly application allows me to do this otherwise tedious task as a leisure activity. After adding some user authentication, I will make it public and share it with some friends. Source code: https://gitlab.com/santhoshtr/lexicon-curator/. The mlmorph scripts are at https://gitlab.com/smc/mlmorph

by Santhosh Thottingal at May 26, 2019 09:18 AM

May 19, 2019

Rajeesh K Nambiar

Okular: another improvement to annotation

Continuing with the addition of line terminating style for the Straight Line annotation tool, I have added the ability to select the line start style also. The required code changes are committed today.

Line annotation with circled start and closed arrow ending.

Currently it is supported only for PDF documents (and poppler version ≥ 0.72), but that will change soon — thanks to another change by Tobias Deiminger under review to extend the functionality for other documents supported by Okular.

by Rajeesh at May 19, 2019 01:40 PM

May 07, 2019

Rajeesh K Nambiar

Okular: improved PDF annotation tool

Okular, KDE’s document viewer, has very good support for annotating, reviewing and commenting on documents. Okular supports a wide variety of annotation tools out of the box (enable the ‘Review’ tool [F6] and see for yourself) and even more can be configured (such as the ‘Strikeout’ tool): right click on the annotation tool bar and click ‘Configure Annotations’.

One of the annotation tools my colleagues and I frequently wanted to use is a line with an arrow to mark an indent. Many PDF annotation programs have this tool, but Okular lacked it.

So a couple of weeks ago I started looking into the source code of Okular and poppler (the PDF library used by Okular) and noticed that both of them already have support for the ‘Line Ending Style’ of the ‘Straight Line’ annotation tool (internally called the TermStyle). After skimming through the source code for a few hours and adding a few hooks in the code, I could add an option to configure the line ending style for the ‘Straight Line’ annotation tool. Many line end styles are provided out of the box, such as open and closed arrows, circle, diamond etc.

An option to the ‘Straight Line’ tool configuration is added to choose the line ending style:

New ‘Line Ending Style’ for the ‘Straight Line’ annotation tool.

Here’s the review tool with ‘Open Arrow’ ending in action:

‘Arrow’ annotation tool in action.

Once happy with the outcome, I created a review request to upstream the improvement. A number of helpful people reviewed and commented. One of the suggestions was to add an icon or shape of the line ending style in the configuration options so that users can quickly preview what the shape will look like without having to try each one. The first attempt to implement this feature was by adding Unicode symbols (instead of an SVG or internally drawn graphics) and it looked okay. Here’s a screenshot:

‘Line End’ with symbols preview.

But it had various issues: some symbols are not available in Unicode, and localizing these strings without some context would be difficult. So, for now, it was decided to drop the symbols.

For now, this feature works only on PDF documents. The patch is committed today and will be available in the next version of Okular.

by Rajeesh at May 07, 2019 01:40 PM

March 28, 2019

Rajeesh K Nambiar

Meera font updated to fix issue with InDesign

I have worked to make sure that fonts maintained at SMC work with the mlym (Pango/Qt4/Windows XP era) opentype specification as well as the mlm2 (Harfbuzz/Windows Vista+ era) specification, in the same font. These have also been tested in the past (2016ish) with Adobe software, which uses its own shaping engine (neither Harfbuzz nor Uniscribe, although the internet tells me there are plans to use Harfbuzz in the future).

Some time ago, I received reports that typesetting articles in Adobe InDesign using Meera font has some serious issues with Chandrakkala/Halant positioning in combination with conjuncts.

When the Samvruthokaram/Chandrakkala ് (U+0D4D) follows a consonant or conjunct, it should be placed at the ‘right shoulder’ of the consonant/conjunct. But in InDesign (CC 2019), it appears incorrectly on the ‘left shoulder’. This incorrect rendering is highlighted in the figure below.

Wrong chandrakkala position before consonant in InDesign.

The correct rendering should have the Chandrakkala appearing at the right of the consonant, as in the figure below.

Correct chandrakkala position after consonant.

This issue manifested only in Meera, but not in other fonts like Rachana or Uroob. Digging deeper, I found that only Meera has a Mark-to-Base positioning GPOS lookup rule for Chandrakkala. This was done (instead of adjusting the left bearing of the Chandrakkala glyph) to make it appear correctly on the ‘right shoulder’ of the consonant. Unfortunately, InDesign seems to get this wrong.

To verify, I checked shaping involving the Dot Reph ൎ (U+0D4E), which is also opentype engineered as a Mark-to-Base GPOS lookup. And sure enough, InDesign gets it wrong as well.

Dot Reph position (InDesign on left, Harfbuzz/Uniscribe on right)

The issue has been worked around by removing the GPOS lookup rules for Chandrakkala, and the fix was tested with Harfbuzz, Uniscribe and InDesign. I have tagged a new version 7.0.2 of Meera and it is available for download from the SMC website. As this issue has affected many users of InDesign, hopefully this update makes it a joy for them to use Meera again. Windows/InDesign users: make sure that previous versions of the font are uninstalled before installing this version.

by Rajeesh at March 28, 2019 08:38 AM

March 14, 2019

Rajeesh K Nambiar

New package in Fedora: python-xlsxwriter

XlsxWriter is a Python module for creating files in xlsx (MS Excel 2007+) format. It is used by certain Python modules that some of our customers needed (such as the OCA report_xlsx module).

This module is available on PyPI but it was not packaged for Fedora. I decided to maintain it in Fedora and created a package review request, which was helpfully reviewed by Robert-André Mauchin.

The package, providing a Python 3 compatible module, is available for Fedora 28 onwards.

by Rajeesh at March 14, 2019 09:42 AM

March 10, 2019

Santhosh Thottingal

LibreOffice Malayalam spellchecker using mlmorph

A few months back, I wrote about the spellchecker based on the Malayalam morphology analyser. I was also trying to integrate that spellchecker with LibreOffice. It is not yet ready for any serious usage, but if you are curious and would like to help me in its further development, please read on.

Blog post on spellchecker approach and pla

Current status

The LibreOffice spellchecker for Malayalam is available at https://gitlab.com/smc/mlmorph-libreoffice-spellchecker. You need to get the code using git or download the master version as a zip file.

You need LibreOffice 4.1 or later. The latest version is recommended. In the source code directory, run make install to install the extension.

Open LibreOffice Writer and add some Malayalam text. Make sure to select the language as Malayalam by choosing it from the menu or the bottom status bar. You should see the spelling check in action… if everything goes as expected 😉

LibreOffice language settings, You can see mlmorph listed.
Spellchecker in action- libreoffice writer.

How can you help?

Theoretically, the extension should work on non-Linux platforms as well, but I have not tested it. The extension needs python3 and python-hfst for the operating system. However, python-hfst is not available for 64-bit Python installations on Windows. If you test it and get the extension working, please add documentation, and let me know if anything is missing that would make the installation easier.

As the mlmorph project covers more of the Malayalam vocabulary, the quality of the spellchecker improves automatically.

by Santhosh Thottingal at March 10, 2019 10:16 AM

Malayalam Named Entity Recognition using morphology analyser

Named Entity Recognition, the task of identifying and classifying real world objects such as persons, places and organizations in a given text, is a well known NLP problem. For Malayalam, several research papers have been published on this topic, but none of them offer functional or reproducible results.

The morphological characteristics of Malayalam have always been a challenge in solving this problem. When the named entities appear in an inflected or agglutinated complex word, the first step is to analyse such words and arrive at the root words.

As the Malayalam morphology analyser is progressing well, I attempted to build a first version of Malayalam NER on top of it. Since mlmorph gives the POS tagging and analysis, there is not much to do in NER. We just need to look for tags corresponding to proper nouns and report.
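The "look for proper noun tags and report" step can be sketched as follows. This is an illustration under assumptions, not the code behind the demo: it does not call mlmorph, and it assumes each analysis is a string of the form `root<tag><tag>` in which a tag named `np` marks a proper noun. The tag name, the function name and the transliterated sample data are all hypothetical.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")

# Assumption: tag(s) that mark proper nouns in the analysis string.
PROPER_NOUN_TAGS = {"np"}


def find_entities(analysed_words):
    """Pick out words whose analysis carries a proper noun tag.

    `analysed_words` is a list of (surface word, analysis string) pairs.
    Returns a list of (surface word, root word) pairs.
    """
    entities = []
    for surface, analysis in analysed_words:
        tags = TAG_RE.findall(analysis)
        if PROPER_NOUN_TAGS.intersection(tags):
            root = analysis.split("<", 1)[0]
            entities.append((surface, root))
    return entities


if __name__ == "__main__":
    # Hypothetical, transliterated analyses for illustration only.
    sample = [
        ("Keralathinte", "Keralam<np><genitive>"),
        ("vannu", "varuka<v><past>"),
    ]
    print(find_entities(sample))
```

Reporting the root alongside the surface form is what lets an inflected word be reported under the name of the entity itself.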

You can try the system at https://morph.smc.org.in/ner

Malayalam named entity recognition example using https://morph.smc.org.in/ner

Known Limitations

  • The recognition is limited by the current lexicon of mlmorph. To recognize out-of-lexicon entities, a POS guesser would be needed. But this is a general problem not limited to NER: a morphology analyser should also have a POS guesser. In other words, as mlmorph improves, this system also improves automatically.
  • Currently the recognition is at word level. But sometimes, entities are written as multiple consecutive words. To resolve that, we will need to write a wrapper on top of the word level detection system.
  • The current system is a JavaScript wrapper on top of the mlmorph analyse API. I think NER deserves its own API.

by Santhosh Thottingal at March 10, 2019 09:25 AM

March 02, 2019

Santhosh Thottingal

Scribus gets hyphenation support for 11 Indian languages

Support for hyphenation in 11 Indian languages is now available in Scribus, the desktop publishing system. Two years back I wrote about how Malayalam hyphenation support was added to Scribus. Later, I filed a bug to add support for more Indian languages. That is now fixed.

Scribus has a new way to download and use these hyphenation dictionaries. You can now use this feature right away in your installed scribus. The languages with hyphenation support are the following:

  • Malayalam
  • Tamil
  • Telugu
  • Kannada
  • Marathi
  • Hindi
  • Bengali
  • Gujarati
  • Assamese
  • Panjabi
  • Odia

How to Add Hyphenation Dictionary?

Navigate to Windows -> Resources in the menu bar. You will see a window as given below. You may want to press “Update Available List”. Then you can see all the languages with hyphenation dictionaries available. Select the download checkbox and press “Download” button. The dictionary will get installed to your system.

Scribus Resource Manager

How to use?

  • Start a new document. Add text frames and content. You may need narrow columns to have wordbreaking contexts.
  • Select the text and set an appropriate (Unicode) font for your language. Make sure the language is set to your preferred language.
  • In Hyphenation properties, set hyphenation character as blank, otherwise visible hyphens will appear.
  • Set the text justified.
  • From menu Extras->Hyphenate text. Done.
Hyphenated two column content

How does it work?

The resource manager is an easier way to add new hyphenation dictionaries. Earlier, these files needed to be added to the Scribus source code. Now they are defined on the Scribus server at http://services.scribus.net/scribus_hyph_dicts.xml, which maps languages to the files to download. So if I update the dictionaries in the GitHub repo, a new installation will pick up the updated file.

Reporting issues

If you find any issues in the hyphenation rules, you can file at https://github.com/smc/hyphenation/

by Santhosh Thottingal at March 02, 2019 04:49 AM

February 21, 2019

Santhosh Thottingal

Gayathri – New Malayalam typeface

Swathanthra Malayalam Computing is proud to announce Gayathri – a new typeface for Malayalam. Gayathri is designed by Binoy Dominic, opentype engineering by Kavya Manohar and project coordination by Santhosh Thottingal.

This typeface was financially supported by Kerala Bhasha Institute, a Kerala government agency under the cultural department. This is the first time SMC has worked with the Kerala Government to produce a new Malayalam typeface.

Gayathri is a display typeface, available in Regular, Bold, Thin style variants. It is licensed under Open Font License. Source code, including the SVG drawings of each glyph is available in the repository. Gayathri is available for download from smc.org.in/fonts#gayathri

Gayathri has soft, rounded terminals, strokes with varying thickness and good horizontal packing. Gayathri has a large glyph set to support Malayalam traditional orthography, which is the new trend in contemporary Malayalam. With a total of 1124 glyphs, Gayathri also has basic Latin coverage. All Malayalam characters defined up to Unicode 11 are supported.

There are not many Malayalam typefaces designed for titles and large displays. We hope Gayathri will fill that gap.

This is also the first typeface by Binoy Dominic. He had proved his lettering skills in his profession as a graphic designer, working on branding with Malayalam content for his clients.

Binoy prepared all glyphs as SVGs, and our scripts converted them to UFO sources. Trufont was used for small edits. Important glyph information such as bearings and names was defined in a YAML configuration. Build scripts generated valid UFO sources and fontmake was used to build the OTF output. Of course, there were a lot of cycles of design fine tuning. GitLab CI was used for running the build chain and testing. Fontbakery was used for quality assurance. The UFO Normalizer and UFO Lint tools were also part of the build system.

by Santhosh Thottingal at February 21, 2019 06:40 AM

February 08, 2019

Santhosh Thottingal

How to setup DNS over TLS using systemd-resolved

DNS over TLS is a security protocol that encrypts the connections between your machine and DNS servers using TLS. This keeps ISPs and others on the network path from reading the DNS queries that reveal which websites you are looking up.

On GNU/Linux distributions using systemd, you can set this up easily by following the steps below.

First, edit /etc/systemd/resolved.conf and change the value of DNSOverTLS to:

DNSOverTLS=opportunistic

Now, configure your DNS servers. You need to use a DNS server that supports DNS over TLS. Examples are Cloudflare DNS 1.1.1.1 or 1.0.0.1. Google DNS 8.8.8.8 also supports it. To configure this, you can use the Network Manager graphical interface.

Then restart the systemd-resolved using:

sudo systemctl restart systemd-resolved

You are done. To check whether settings are correctly applied, you can try:

$ resolvectl status
Global
       LLMNR setting: no
MulticastDNS setting: no
  DNSOverTLS setting: opportunistic

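If you really want to see how DNS resolution requests are happening, you may use Wireshark and inspect port 53, the usual DNS port. You should not see any traffic on that port. Instead, if you inspect port 853, you can see DNS over TLS requests. Note that in opportunistic mode systemd-resolved falls back to plain DNS when a server does not support TLS, so traffic on port 53 would indicate such a fallback.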
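As a complementary check of the configuration file itself, the small sketch below reads resolved.conf-style text and reports the DNSOverTLS value. It is only an illustration: the function name is made up, the sample text stands in for the real /etc/systemd/resolved.conf, and it relies on the file being simple enough for Python's configparser, which is an assumption rather than something systemd guarantees.

```python
import configparser


def dns_over_tls_mode(conf_text):
    """Return the DNSOverTLS value from resolved.conf-style text, or None."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(conf_text)
    return parser.get("Resolve", "DNSOverTLS", fallback=None)


# Sample content for illustration; the real file has more commented options.
SAMPLE = """\
[Resolve]
#DNS=
DNSOverTLS=opportunistic
"""

if __name__ == "__main__":
    print(dns_over_tls_mode(SAMPLE))
```

To check the real file, the text could be read with `open("/etc/systemd/resolved.conf").read()` and passed to the same function.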

by Santhosh Thottingal at February 08, 2019 05:36 AM

January 15, 2019

Santhosh Thottingal

Wikipedia turns eighteen, with four hundred thousand translations

Today is Wikipedia's eighteenth birthday. With 5.8 million articles in the English Wikipedia and about sixty thousand articles in the Malayalam Wikipedia, the journey continues amid many limitations and challenges.

Although Wikipedia exists in 292 languages, the amount of content is not proportionate across them. For the last four years at the Wikimedia Foundation, my main job has been to lead the technology behind the system that translates articles between languages with the help of machine translation and other tools.

Yesterday, the number of articles newly added with the help of this system reached four hundred thousand.

by Santhosh Thottingal at January 15, 2019 06:57 AM

January 13, 2019

Santhosh Thottingal

Swanalekha input method now available for Windows and Mac

The Swanalekha transliteration-based Malayalam input method is now available on the Windows and Mac platforms. Thanks to Ramesh Kunnappully, who wrote the Keyman implementation.

I wrote this input method in 2008. In those days, SCIM was the popular input method framework for Linux. Later it was rewritten for M17N and used with either IBus or FCITX. A few years later, this input method was made available on Android using Indic Keyboard. Last year, due to requests from Windows and Mac users, Chrome and Firefox extensions were prepared. Thanks to SIL Keyman, we have now made it available on those operating systems as well.

With this, Swanalekha Malayalam becomes an input method you can use on all operating systems and phones.

Detailed documentation and downloads are available on the Swanalekha website. Source code: gitlab.com/smc/swanalekha. A small video illustrating the installation, configuration and use in Windows 10 is given below.

Update: The keyboard is now served by keyman from their website. And the supported platforms also increased.

Download options from https://keyman.com/keyboards/swanalekha_malayalam

by Santhosh Thottingal at January 13, 2019 04:22 AM

January 09, 2019

Rajeesh K Nambiar

Smarter tabular editing with Vim

I happen to edit tabular data in LaTeX format quite a bit. Being scientific documents, the table columns are (almost) always left-aligned, even for numbers. That warrants carefully crafted decimal and digit alignment on such columns containing only numbers.

I also happen to edit the text (almost) always in Vim, where selecting or changing only a certain column is not easily doable (unlike in a spreadsheet). If there are tens of rows that need manual digit/decimal alignment, it gets even more tedious. There must be another way!

Thankfully, smarter people already figured out better ways (h/t MasteringVim).

With that neat trick, it is much more palatable to look at the tabular data and edit it. Even then, though, it is not possible to search & replace only within a column using Visual Block selection. The Visual Block (^v) sets marks from the column on the first row to the column on the last row, so any :'<,'>s/.../.../g would replace matching text anywhere on those lines (including in other columns).

To solve that, I’ve figured out another way. It is possible to copy the Visual Block alone and paste other content over it (though cutting it and pasting would not work as you might think). Thus, the plan is:

  • Copy the required column using Visual Block (^v + y)
  • Open a new buffer and paste the copied column there
  • Edit/search & replace to your need in that buffer, so nothing else would be unintentionally changed
  • Select the modified content as Visual Block again, copy/cut it and come back to the main buffer/file
  • Re-select the required column using Visual Block again and paste over
  • Profit!
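For comparison, the same "search & replace in one column only" idea can be sketched outside Vim. This is a hypothetical Python helper, not part of the Vim workflow above: it assumes the rows are LaTeX table lines with cells separated by `&`, and the function name and sample rows are made up for illustration.

```python
import re


def replace_in_column(rows, column, pattern, replacement, sep="&"):
    """Apply a regex substitution to one column of separator-delimited rows.

    `column` is a zero-based index; other columns are left untouched.
    """
    result = []
    for row in rows:
        cells = row.split(sep)
        if column < len(cells):
            cells[column] = re.sub(pattern, replacement, cells[column])
        result.append(sep.join(cells))
    return result


if __name__ == "__main__":
    rows = ["x & 1.5 & 1.5", "y & 2.5 & 2.5"]
    # Pad the decimals in the second column only.
    for line in replace_in_column(rows, 1, r"\.5", ".50"):
        print(line)
```

Splitting on the separator plays the role of the Visual Block selection: the substitution only ever sees the text of the chosen column.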

Here’s a short video of how to do so. I’d love to hear if there are better ways.

Column editing in Vim
Demo of column editing in Vim

by Rajeesh at January 09, 2019 11:44 AM

December 23, 2018

Santhosh Thottingal

Ten years of code

It has been ten years since I started taking part in free software development related to language computing. It was around 2008 that I became active in this field and started setting aside time for various projects. My contributions over the last ten years are pictured here, based on the code on GitHub.

Generated using https://github-contributions.now.sh/ for my github username santhoshtr

Each cell here is a day. A green cell means that on that day I did some kind of activity: code, bug reports, reviewing other people's code and so on. The count increases as the colour goes from light green to dark green.

Like a diary, I can read the sweet and the bitter of my life in this. The long gaps that appear from time to time are travels, or personal absences for good or bad reasons. You can see that 2016 was very bad in this respect. The gap in April 2013 marks my wedding. In between, I also did a challenge of doing something for 100 days without a break (a github streak); you can see that from September 2014.

One thing to be proud of is that, as my career moves forward, I am able to make more contributions in engineering. Those who work in the IT field know that, in general, by the time the first ten years have passed, most people will have moved from engineering work to management work. I did not choose that path.

After I joined the language technology team at the Wikimedia Foundation in 2011, the amount of code I wrote for the public increased a lot. At the same time, on weekends and in other free time, I was also involved in work related to the Malayalam language. That is why Saturdays and Sundays show green in this graph.

Another thing to be proud of is that, in my profession, whenever I had to write code for the public, I was able to do it as free software. That is, I have not hidden even a single line of code. Every contribution I made is kept in the open, along with its reasons, so that anyone can inspect, study and use it at any time. That is free software.

If you are a Malayali, you probably use the results of at least some of this work in your daily life in some way. At the same time, many things written in the early days failed to move beyond a technology experiment and become useful software. But those were, naturally, lessons for later.

by Santhosh Thottingal at December 23, 2018 03:11 PM

December 19, 2018

Balasankar C

DebUtsav Kochi 2018

Heya,

Been quite some time since I wrote about anything. This time, it is DebUtsav. When it comes to full-fledged FOSS conferences, I usually am an attendee or at most a speaker. I have given some sporadic advice and suggestions to a few people in the past, but that was it. However, this time I played the role of an organizer.

DebUtsav Kochi is the second edition of Debian Utsavam, the celebration of Free Software by the Debian community. We didn’t name it MiniDebConf because we wanted the conference to be not just Debian-specific but to include general FOSS topics too. This is specifically because our target audience isn’t yet Debian-aware enough for a Debian-only event. So, DebUtsav Kochi had three tracks - one for general FOSS topics, one for Debian talks and one for hands-on workshops.

As a disclaimer, the descriptions of the talks below are what I gathered from my interactions with the speakers and attendees, since I was busy with organizing and wasn’t able to attend as many talks as I would’ve liked.

The event was organized by Free Software Community of India, whom I represented along with Democratic Alliance for Knowledge Freedom (DAKF) and Student Developer Society (SDS). Cochin University of Science and Technology was generous enough to be our venue partner, providing us with the necessary infrastructure for conducting the event as well as accommodation for our speakers.

The event spanned two days, with around 150 registered participants. Day 1 started with a keynote session by Aruna Sankaranarayanan, affiliated with OpenStreetMap. She has also been associated with the GNOME Project, Wikipedia and Wikimedia Commons, and was a lead developer of the Chennai Flood Map that was widely used during the floods that struck the city of Chennai.

Sruthi Chandran, a Debian Maintainer from Kerala, gave a brief introduction to the Debian project, its ideologies and philosophies, the people behind it, the process involved in the development of the operating system etc. SDS members gave an intro to DebUtsav: how it came to be, and the planning and organization process involved in conducting the event.

After these common talks, the event was split to two parallel tracks - FOSS and Debian.

In the FOSS track, the first talk was by Prasanth Sugathan of Software Freedom Law Centre, about the need for Free Software licenses and ensuring license compliance by projects. In parallel, Raju Devidas discussed the process of becoming an official Debian Developer, what it means, and why it matters to have more and more developers from India.

After lunch, Ramaseshan S introduced the audience to Project Vidyalaya, a free software solution for educational institutions to manage and maintain their computer labs using FOSS solutions rather than the conventional proprietary ones. Shirish Agarwal shared a general idea of the various teams in Debian and how everyone can contribute to these teams based on their interest and ability.

Subin S introduced some nifty little tools and tricks that make the Linux desktop cool and improve the productivity of users. Vipin George talked about the possibility of using Debian as a forensic workstation, and how it can be made more efficient than its proprietary counterparts.

Ompragash V from RedHat talked about using Ansible for automation tasks, its advantages over other similar tools etc. Day 1 ended with Simran Dhamija talking about Apache SQOOP and how it can be used for data transformation and other related use cases.

In the afternoon session of Day 1, two workshops were also conducted parallel to the talks. First one was by Amoghavarsha about reverse engineering, followed by an introduction to machine learning using Python by Ditty.

We also had an informal discussion with a few of the speakers and participants about Free Software Community of India, the services it provides, how to make more people aware of these services and how to get more maintainers for them. We also discussed the necessity of self-hosted services, onboarding users smoothly to them, and evangelizing these services as alternatives to their proprietary and privacy-abusing counterparts.

Day 2 started with a keynote session by Todd Weaver, founder and CEO of Purism, which aims at developing privacy-focused laptops and phones. Purism also develops PureOS, a Debian derivative that consists of Free Software only, with further privacy-enhancing modifications.

On day 2, the Debian track focused on a hands-on packaging workshop by Pirate Praveen and Sruthi Chandran that covered the basic workflow of packaging, the flow of packages through suites like Unstable, Testing and Stable, and the structure of packages. Then it moved to the actual process of packaging by guiding the participants through packaging a JavaScript module that is used by the GitLab package in Debian. Participants were introduced to tools like npm2deb, lintian and sbuild/pbuilder, and to the various Debian-specific files and their functions.

In the FOSS track, Biswas T shared his experience in developing keralarescue.in, a website that was heavily used during the Kerala floods for effective collaboration between authorities, volunteers and the public. It was followed by Amoghavarsha’s talk on his journey from Dinkoism to Debian. Abhijit AM of COEP talked about how Free Software may be losing against Open Source and why that may be a problem. Ashish Kurian Thomas shared some knowledge of a few *nix tools and tricks that can be a productivity booster for GNU/Linux users. Raju and Shivani introduced Hamara Linux to the audience, along with the development process and the focus of the project.

The event ended with a panel discussion on how Debian India should move forward to organize itself properly to conduct more events, spread awareness about Debian and other FOSS projects out there, prepare for a potential DebConf in India in the near future etc.

The number of registrations and the enthusiasm of the attendees are positive signs for the prospect of having a proper MiniDebConf in Kerala, followed by a possible DebConf in India, for which we have bid. Thanks to all the participants and speakers for making the event a success.

Thanks to FOSSEE, Hamara Linux and GitLab for being sponsors of the event and thus enabling us to actually do this. And also to all my co-organizers.

A very special thanks to Kiran S Kunjumon, who literally did 99% of the work needed for the event to happen (as you may recall, I am good at sitting on a chair and planning, not actually doing anything. :D ).

Group photo

December 19, 2018 12:00 AM

November 25, 2018

Santhosh Thottingal

Malayalam morphology analyser – First release

I am happy to announce the first release of the Malayalam morphology analyser.

After two years of development, I tagged version 1.0.0.

In this release

In this release, mlmorph can analyse and generate Malayalam words using the defined morpho-phonotactical rules and a lexicon. We have a test corpus of fifty thousand words, and 82% of the words in it are recognized by the analyser.

A Python interface is released to make the library very easy for developers to use. The library is available on pypi.org: https://pypi.org/project/mlmorph/

Installing it is very easy:

pip install mlmorph

It avoids all the difficulties of compiling the SFST formalism and installing the required hfst and sfst packages.

For detailed Python API documentation and the command line utility, refer to https://pypi.org/project/mlmorph/

Next

There are a lot of known limitations in the current release. I plan to address them in future releases.

  • Expand the lexicon further: The current lexicon was compiled by testing various texts and adding the missing words found in them. Preparing the coverage test corpus also helped to grow the lexicon. But it still needs more improvement.
  • Many commonly used language specific constructs that consist of multiple conjunctions and adjectives are not well covered. Some examples are മറ്റൊരു, പിന്നീട്, അതുപോലെത്തന്നെ, എന്നതിന്റെ etc.
  • Optimizing the weight calculation: As the lexicon grows, many rarely used words can become alternate parts in the agglutination of words. For example, പാലക്കാട് can have an analysis of പാല്, അക്ക്, ആട്. Even though this is grammatically correct, it should get less preference than പാലക്കാട്<proper noun>.
  • Standardization of POS tags: mlmorph has its own POS tag definitions. These tags need documentation with examples. I tried to use Universal Dependencies as much as possible, but it is not enough to cover all the tags required for Malayalam.
  • Documentation of the formalism and tutorials for developers. So far I am the only developer of the project, which I am not happy about. The learning curve for this project is too steep to attract new developers. An above average understanding of Malayalam grammar is a difficult requirement too. I am planning to write some tutorials to help new developers join.

Applications

The project is meaningful only when practical applications are built on top of this.



by Santhosh Thottingal at November 25, 2018 10:55 AM

October 24, 2018

Rajeesh K Nambiar

Powerline git dirty status without powerline_gitstatus

With git-prompt it is possible to display the dirty state (when a tracked file is modified) by setting the env variable GIT_PS1_SHOWDIRTYSTATE=true. Powerline can display the status of a git repository, such as the number of commits ahead/behind and the number of modified files, using the powerline_gitstatus module. Unfortunately, Fedora doesn’t have it packaged. I did some digging and found that there is colour highlighting for branch_dirty, and that the powerline.segments.common.vcs.branch function (which displays the current branch name) takes two parameters that modify its behaviour. Modify the shell theme /etc/xdg/powerline/themes/shell/default.json under the left segment (because only left works in shell) as follows:
...
    {   "function": "powerline.segments.common.vcs.branch",
        "args": {"ignore_statuses": ["U"], "status_colors": true},
        "priority": 20
    }
...
The branch will now be highlighted if a tracked file is modified (ignore_statuses = ["U"] causes untracked files to be ignored). Clean repository:
Clean repo
Once a tracked file is modified:
Dirty repo

by Rajeesh at October 24, 2018 05:22 AM

September 27, 2018

Santhosh Thottingal

Malayalam Script LGR rules for public review

The Malayalam and Tamil Root Zone Label Generation Rules for Internationalized Domain Names have been released for public comments. See the announcement from ICANN. This was drafted by the Neo-Brahmi Script Generation Panel (NBGP), of which I am also a member.

Your comments on the proposal for the Malayalam Script Label Generation Rules for the Root Zone (LGR [XML, 18 KB] and supporting documentation [PDF, 998 KB]) can be submitted at the feedback form till Nov 7 2018.

My earlier blog post on Internationalized Top Level Domain Names in Indian Languages has some detailed information about this.

by Santhosh Thottingal at September 27, 2018 11:53 AM

September 08, 2018

Santhosh Thottingal

Malayalam spellchecker – a morphology analyser based approach

My first attempt to develop a spellchecker for Malayalam was in 2007. I was using hunspell and a word list based approach. It was not successful because of the rich morphology of Malayalam. Even though I prepared a manually curated list of 150K words, it was nowhere near covering the practically infinite words of Malayalam. For languages with productive morphological processes in compounding and derivation that are capable of generating dictionaries of infinite length, a morphology analysis and generation system is required. Since my efforts towards building such a morphology analyser are progressing well, I am proposing a finite state transducer (FST) based spellchecker for Malayalam. In this article, I will first analyse the characteristics of Malayalam spelling mistakes and then explain how an FST can be used to implement the solution.

What is a spellchecker?

The spellchecker is an application that tells whether a given word is spelled correctly according to the language. If the word is not spelled correctly, the spellchecker often suggests possible alternatives to correct it. A word can be spellchecked independently or in the context of a sentence. For example, in the sentence “അസ്തമയസൂര്യൻ കടലയിൽ മുങ്ങിത്താഴുന്നു”, the word “കടലയിൽ” is spelled correctly if considered independently. But in the context of the sentence, it is supposed to be “കടലിൽ”.

The correctness of the word is tested by checking if that word is in the language model. The language model can simply be a list of all known words in the language. Or it can be a system which knows what a word in the language looks like and can tell whether the given word is such a word. In the case of Malayalam, we saw that a finite dictionary is not possible. So we will need a system which is ‘aware’ of all words in the language. We will see how a morphology analyser can be such a system.

If the word is misspelled, the system needs to offer corrections. To generate correctly spelled words from a misspelled word form, an error model is needed. The most common error model is Levenshtein edit distance. In the edit distance algorithm, the misspelling is assumed to be the result of a finite number of operations applied to the characters of a string: deletion, insertion, change, or transposition (strictly speaking, including transposition makes it the Damerau-Levenshtein variant). The number of operations is known as the ‘edit distance’. Any word from the known list of words in the language with a minimal distance is a candidate for suggestion. Peter Norvig explains such a functional spellchecker in his article “How to Write a Spelling Corrector”.

There are multiple problems with the edit distance based correction mechanism:

  • To generate all candidates for a query word by applying the four operations, we can calculate the number of words we need to generate and check. For a word of length n, an alphabet of size a and an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 candidates at search time. In the case of Malayalam, a is 117 if we consider all encoded characters in Unicode version 11. If we remove all archaic characters, we still need about 75 characters. So, for edit distance d=1 and a=75, a word with 10 characters gives 2*10+2*75*10+75-1 = 1594 candidates, and many more for larger d. So you will need to do 1594 lookups (spellchecks) in the language model to get possible suggestions.
  • The concept that the 4 edit operations are the cause of all spelling mistakes is not accurate for Malayalam. There are many common spelling mistakes in Malayalam that are at an edit distance of 3 or 4 from the original word. Usually edit distance based corrections won’t go beyond d=2 since the number of candidates increases rapidly.
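The count in the first bullet can be checked with a short Python sketch. It is only an illustration: the function names are made up here, and `edits1` follows the general shape of the generator in Norvig's article, counting every generated string including duplicates. Word length is measured in Unicode code points.

```python
def edit1_candidate_count(n, a):
    """Number of strings generated at edit distance 1 for a word of
    length n over an alphabet of size a (duplicates included)."""
    deletions = n
    transpositions = n - 1
    alterations = a * n
    insertions = a * (n + 1)
    return deletions + transpositions + alterations + insertions


def edits1(word, alphabet):
    """Generate every string one edit away from word (with duplicates)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return deletes + transposes + replaces + inserts


# The figures used in the text: a 10 character word, 75 letters.
print(edit1_candidate_count(10, 75))  # 1594

# The closed form agrees with brute-force generation on a tiny alphabet.
assert len(edits1("abc", "ab")) == edit1_candidate_count(3, 2)
```

For d=2 the same generator is applied again to each of these strings, which is why the candidate set grows so quickly.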

The problems with hunspell based spellchecker and Malayalam

Hunspell has compounding support, but it is limited to two levels. Malayalam can have more than two levels of compounding, and sometimes the agglutinated word is also inflected. Hunspell has an affix dictionary and a suffix mapping system, but it is too limited to support a complex morphology like that of Malayalam. With the help of Németh László, the Hunspell developer, I had explored this path, but abandoned it due to the many limitations of Hunspell and the lack of programmatic control over the morphological rules.

Nature of Malayalam spelling mistakes

Malayalam uses an alphasyllabary writing system. Each letter you write corresponds to the grapheme representation of a phoneme. In a broader sense, Malayalam can be considered a language with one-to-one grapheme to phoneme correspondence, whereas in English and similar languages, letters might represent a variety of sounds, or the same sounds can be written in different ways. The way a person learns to write a language strongly depends on the writing system.

In Malayalam, since there is one and only one set of characters that can correspond to a syllable, the confusion of letters does not happen. For example, in English, Education, Ship, Machine and Mission all have the sh sound [ʃ]. So a person can mix up these combinations. But in Malayalam, if it is the sh sound [ʃ], then it is always ഷ.

Because of this, classifying spelling mistakes by the four edit operations (deletion, insertion, change, or transposition) may not be accurate for Malayalam. Let us try to classify and analyse the spelling mistake patterns of Malayalam.

  1. Phonetic approximation: The 1:1 grapheme to phoneme correspondence is the theory. But because of it, inaccurate utterance of syllables causes incorrect spellings. For example, ബൂമി is a relaxed way of reading ഭൂമി since it is relatively effortless. Since the relaxed pronunciation is normal, people sometimes think that they are writing the wrong way and try to correct it unnecessarily. പീഢനം->പീഡനം is one such example.
    • Consonants: Each consonant in Malayalam has aspirated, unaspirated, voiced and unvoiced variants. It is very common to mix them up.
      • Aspirated and unaspirated mix-up: An aspirated consonant can be mistakenly written as an unaspirated consonant. For example, ധ -> ദ, ഢ -> ഡ. Similarly, an unaspirated consonant can be mistakenly written as an aspirated consonant. For example, ദ -> ധ, ഡ -> ഢ.
      • Voiced and voiceless mix-up: Voiced consonants like ഗ, ഘ can be mistakenly written as the voiceless forms ക, ഖ, and vice versa.
      • Gemination of consonants is often relaxed or skipped in speech, hence it appears in writing too. Gemination in the Malayalam script is done by combining two consonants using virama. നീലതാമര/നീലത്താമര is an example of this kind of mistake. There are a few debatable words too, like സ്വർണം/സ്വർണ്ണം, പാർടി/പാർട്ടി. Another way of indicating consonant stress is by using Unaspirated Consonant + Virama + Aspirated Consonant. The pairs അദ്ധ്യാപകൻ/അധ്യാപകൻ, തീർഥം/തീർത്ഥം, വിഡ്ഡി/വിഡ്ഢി are examples.
      • Hard and soft variant confusion. Examples: ശ/ഷ, ര/റ, ല/ള
    • Vowels: Vowel elongation or shortening, gliding vowels and semi vowels are the cause for vowel related mistakes in writing.
      • Each vowel in Malayalam can be a short vowel or a long vowel. Local dialect can lead people to use one for the other. ചിലപ്പൊൾ/ചിലപ്പോൾ is one example. Since many input tools place the short and long vowel forms on very close keystrokes, errors are likely. In the Inscript keyboard, short and long vowels are in the normal and shift positions. In transliteration based input methods, a long vowel is often typed with repeated keys (i, ii for ി, ീ).
      • The vowel ഋ is close to റി or റു in pronunciation. Example: ഋതു/റിതു. The vowel sign of ഋ, when it appears with a consonant, is close to ്ര. Examples: ഗൃഹം/ഗ്രഹം, ഹൃദയം/ഹ്രുദയം.
      • The gliding vowels ഐ, ഔ get confused with their constituent vowels. കൈ/കഇ/കയ്, ഔ/അഉ/അവ് are examples.
      • In Malayalam, there is a tendency to use എ instead of ഇ, since it takes less effort. Examples: ചിലവ്/ചെലവ്, ഇല/എല, തിരയുക/തെരയുക. Due to the wide usage of these variants, it is sometimes very difficult to say that one word is wrong. See the discussion about ‘Standard Malayalam’ at the end of this essay.
    • Chillus: Chillus are pure consonants. A consonant + virama sequence sometimes has no phonetic difference from a chillu, as in the combinations കല്പന/കൽപന and നിൽക്കുക/നില്ക്കുക. The chillu ർ is sometimes confused with the ഋ sign. An example is പ്രവർത്തി/പ്രവൃത്തി. The chillu form of മ, ം, can appear as anuswara or in ma+virama form. Examples: പംപ, പമ്പ. But it is not rare to see പംമ്പ for this. Sometimes the anuswara gets confused with ന്, and പമ്പ becomes പന്പ. There were a few buggy fonts that used ന്+പ for the മ്പ ligature too.
  2. Weak Phoneme-Grapheme correspondence: Due to the historic or evolutionary nature of the script, Malayalam also has some phonemes which have a weak relationship with the graphemes.
    • ഹ്മ/മ്മ as in ബ്രഹ്മം/ബ്രമ്മം, ന്ദ/ന്ന as in നന്ദി/നന്നി, and ഹ്ന/ന്ന as in ചിഹ്നം/ചിന്നം are some examples where what you pronounce is not exactly the same as what you write.
    • റ്റ, ന്റ: These two highly used conjuncts deviate heavily from their letters and pronunciation. While writing with a pen, people don’t make many mistakes since they just draw the shape of these ligatures, but while typing, one needs to know the exact key sequence, and people get confused. Common mistakes for these conjuncts are ററ, ൻറ, ൻറ്റ, ൻററ
  3. Visual similarity: While using visual input methods such as handwriting based tools or some onscreen keyboards, either the users or the input tools make mistakes due to visual similarity.
    • ൃ, ്യ often get confused.
    • ജ്ഞ, ഞ്ജ is one very common sequence where people are confused. ആദരാജ്ഞലി/ആദരാഞ്ജലി.
    • ത്സ, ഝ is another combination.
    • Handwriting based input methods like the Google handwriting tool are known for recognizing anuswara ം as zero, English o, O etc.
    • When people don’t know how to insert visarga ഃ, they use the colon (:) instead, since it is a very similar key on the keyboard. Example: ദുഃഖം/ദു:ഖം
    • ള്ള, the geminated form of ള, is very similar to two adjacent ള. This kind of mistake is very frequent among people who learned Malayalam inputting informally. Two adjacent റ is another mistake, for റ്റ.
    • The informal, trial-and-error based Malayalam inputting training also introduced some other mistakes, such as using the open parenthesis ‘(‘ for ്ര and the closing parenthesis ‘)’ for the ാ sign.
  4. Ambiguity due to regional dialect: A good example of this is the insertion of യ് in verbs: കുറക്കുക/കുറയ്ക്കുക, ചിരിക്കുക/ചിരിയ്ക്കുക. It also appears in nominal inflections: പൂച്ചയ്ക്ക്/പൂച്ചക്ക്. The usage of Samvruthokaram to distinguish between a pure consonant and a stressed consonant at the end of a word is a highly debated topic. For example, അവന്/അവനു്/അവനു. All these forms are common, even though the usage of നു് is less common after the script reformation. But since the script reformation was not an absolute transformation, it still exists in usage.
  5. Spaces: Malayalam is an agglutinative language. Words can be agglutinated, but nothing prevents people from putting in spaces and writing simple words. But this should be done carefully since it can alter the meaning. An example is “ആന പുറത്തു കയറി”, “ആനപ്പുറത്തു കയറി”, “ആനപ്പുറത്തുകയറി”, “ആനപ്പുറത്ത് കയറി”. Another example: “മലയാള ഭാഷ”, “മലയാളഭാഷ”. Here, there is no valid word “മലയാള”. The anuswara at the end gets deleted only when it joins with ഭാഷ as an adjective. A morphology analyser can correctly parse “മലയാളഭാഷ” as മലയാളം<proper-noun><adjective>ഭാഷ<noun>. But since the language has already broken this rule and many people use spaces liberally, a spellchecker would need to handle these cases.
  6. Slip of finger: Accidental insertion or omission of key presses is a common reason for spelling mistakes. For alphabetic languages, this is the type of error that is mostly addressed. For Malayalam too, this type of accidental slip of the finger can happen. For Latin based languages, we can do some analysis since we know the QWERTY keyboard layout and do optimized checks for this kind of issue. Since Malayalam uses another level of mapping on top of QWERTY for input (Inscript, phonetic, transliteration), it is not easy to analyse these errors. So, in general, we can expect random characters or the omission of some characters in the query word. An accidental space insertion poses the challenge that it splits the word into two words, and if the spellchecking is done one word at a time, we will miss it.

I must add that the above classification is not based on a systematic study of any test data that I can share. Ideally, this classification should be done with real samples of Malayalam written on paper and on computer. The samples should then be manually checked for spelling mistakes, and the mistakes listed and analysed for patterns. This exercise would be very beneficial for spellcheck research. In my case, ever since I released my word list based spellchecker, noticing spelling errors on the internet (social media, mainly) has been my obsession. Sometimes I also tried to point out spelling mistakes to authors, and that was not a very pleasant experience for me. The above list is based on my observation of such patterns.

Malayalam spelling checker

To check if a word is a valid, known, correctly spelled word, a simple lookup using the morphology analyser is enough. If the morphology analyser can parse the word, it is correctly spelled. Note that the word can be agglutinated at arbitrary levels and inflected at the same time.

Out of lexicon words

Compared to a finite word list, the FST based morphology analyser and generator system covers a large number of words using its generation system based on morpho-phonotactics. For a discussion on this, see my previous blog post about the coverage test. Since the vocabulary of every language is a dynamic system, it is still impossible to cover 100% of the words in a language all the time. New words get added to the language every now and then. There are nouns such as place names, people's names and product names that are not in the lexicon of the morphology analyser. So these words will be reported as unknown words by the spellchecker, and an unknown word is interpreted as a misspelled word too. This is a known problem. But since a spellchecker is usually used by a human user, the severity of the issue depends on whether the spellchecker fails to recognize a lot of commonly used words. Most spellcheckers provide an ‘add to dictionary’ option to avoid this issue.

As part of the Morphology analyser, the expansion of the lexicon is a never ending task. As the lexicon grows, the spellchecker improves automatically.

Malayalam spelling correction

To provide spelling suggestions, the FST based morphology analyser can be used. This is a three step process:

  1. Generate a list of candidate words from the query word. The words in this list may be incorrect too. The words are generated based on the patterns we defined from the nature of spelling mistakes. We scan the query word for common error patterns and apply the fix for each pattern. Since there are dozens of patterns, we will have many candidate words.
  2. From the candidate list, find the correctly spelled words using the spellcheck method. This will result in a very small number of words. These words are the probable replacements for the misspelled query word.
  3. Sort the candidate words to provide the most probable suggestion first. For this, we can rank the suggestion strategies. A very common error pattern gets high priority at step 1, so the suggestions from it appear first in the candidate list. A more sophisticated approach would use a frequency model for the words, so that candidate words that are very frequent in the language appear first.
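The three steps can be sketched in a few lines of Python. This is a toy illustration, not the mlmorph implementation: the pair list, the word set and all function names are assumptions made for this example, and a small set of known words stands in for the morphology analyser lookup.

```python
# Hypothetical confusable pairs taken from the error classification
# earlier in this article, listed in priority order.
CONFUSABLES = [
    ("ബ", "ഭ"), ("ദ", "ധ"), ("ഡ", "ഢ"),  # unaspirated/aspirated
    ("ശ", "ഷ"), ("ര", "റ"), ("ല", "ള"),  # hard/soft variants
]

# Toy stand-in for the analyser: a real system would ask the FST
# whether the word can be parsed.
KNOWN_WORDS = {"ഭൂമി", "പീഡനം"}


def is_known(word):
    return word in KNOWN_WORDS


def candidates(word):
    """Step 1: apply each substitution at each position, in rule order."""
    out = []
    for a, b in CONFUSABLES:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                cand = word[:start] + dst + word[start + len(src):]
                if cand not in out:
                    out.append(cand)
                start = word.find(src, start + 1)
    return out


def suggest(word):
    """Steps 2 and 3: keep known words; rule order doubles as ranking."""
    if is_known(word):
        return []
    return [c for c in candidates(word) if is_known(c)]


print(suggest("ബൂമി"))    # ['ഭൂമി']
print(suggest("പീഢനം"))  # ['പീഡനം']
```

Because candidates are produced in the order of the rule list, putting the most common error patterns first gives a crude ranking without any frequency model.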

One thing I observed from the above approach is that, in reality, the candidate words remaining after all the above steps for Malayalam are most of the time only one or two. This makes step 3 less relevant. In contrast, an edit distance based approach would have generated more than 5 candidate words for each misspelled word. The candidates from the edit distance based suggestion mechanism would be very diverse, meaning they may not be related to the intended word at all. The following images illustrate the difference.

Spelling suggestion from the morphology analyser based system.
Spelling suggestions from edit distance based candidates

Context sensitive spellchecking

Usually spellchecking and suggestion are done one word at a time. But if we know the context of the word, the spellchecking becomes more useful. The context is usually the words before and after the word. An example from English is “I am in Engineer”. Here the word “in” is a correct word, but within the context, it is wrong. To mark the word “in” as wrong and provide ‘an’ as a suggestion, one approach is an n-gram model of parts of speech for the language. In simple words, it models what kind of word can appear between known kinds of words. If we build this model for a language, it will tell us that a locative POS such as “in” before Engineer is rare or not seen before.
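A minimal sketch of this idea in Python is shown below. Everything in it is an assumption made for illustration: the tiny tag dictionary, the four training sentences and the function names are invented here, and a real system would use a proper tagger and a large corpus with smoothed probabilities rather than a seen/unseen test.

```python
from collections import Counter

# Hypothetical word-to-POS lookup; a real system would use a tagger.
POS = {
    "I": "PRON", "she": "PRON", "am": "VERB", "is": "VERB",
    "lives": "VERB", "an": "DET", "a": "DET", "the": "DET",
    "in": "ADP", "of": "ADP", "engineer": "NOUN", "Engineer": "NOUN",
    "teacher": "NOUN", "book": "NOUN", "poems": "NOUN", "Kerala": "PROPN",
}

# Tiny made-up training corpus.
TRAINING = [
    ["I", "am", "an", "engineer"],
    ["she", "is", "a", "teacher"],
    ["she", "lives", "in", "Kerala"],
    ["the", "book", "of", "poems"],
]


def tag(tokens):
    return ["<s>"] + [POS[t] for t in tokens] + ["</s>"]


def trigram_counts(sentences):
    """Count POS trigrams, padded with sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tags = tag(sentence)
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts


def suspicious_positions(tokens, counts):
    """Indexes of tokens whose (previous, own, next) POS trigram was
    never seen in the training data."""
    tags = tag(tokens)
    return [i for i in range(len(tokens))
            if counts[(tags[i], tags[i + 1], tags[i + 2])] == 0]


counts = trigram_counts(TRAINING)
print(suspicious_positions(["I", "am", "in", "Engineer"], counts))  # [2]
print(suspicious_positions(["I", "am", "an", "Engineer"], counts))  # []
```

Once a position is flagged, the usual candidate generation can be run on that word, and candidates that produce a trigram seen in training (such as ‘an’) can be ranked first.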

The Standard Malayalam or lack thereof

How do you determine which is the “correct” or “standard” way of writing a word? Malayalam has a lot of orthographic variants for words, which were introduced to the language as genuine mistakes that later became common words (രാപ്പകൽ/രാപകൽ, ചിലവ്/ചെലവ്), as phonetic simplifications (അദ്ധ്യാപകൻ/അധ്യാപകൻ, സ്വർണ്ണം/സ്വർണം), or as old spellings (കർത്താവ്/കൎത്താവു്) and so on. A debate about the correctness of these words will hardly reach a conclusion. For our case, this is more of an issue of selecting words for the lexicon. Which ones to include, which ones to exclude? It is easy to consider these debates a blocker for the progress of the project and give up: “well, these things are not decided by academics so far, so we cannot do anything about it till they make up their mind”.

I did not want to end up in that deadlock. I decided to be liberal about the lexicon. If people are using some words commonly, they are valid words that the project needs to recognize as much as possible. That is the very liberal definition I have. I leave the standardization discussion to linguists who care about it.

The news report from Mathrubhumi daily in 2007 about my old spelling checker

Back in 2007, when I developed the old Malayalam spellchecker, these debates came up. Dr. P Somanathan, who helps me a lot nowadays with this project, wrote about the issue of Malayalam spelling inconsistencies: “ചരിത്രത്തെ വീണ്ടെടുക്കുക:” and “വേണം നമുക്ക് ഏകീകൃതമായ ഒരെഴുത്തുരീതി”.

References

  1. A Data-Driven Approach to Checking and Correcting Spelling Errors in Sinhala. Asanka Wasala, Ruvan Weerasinghe, Randil Pushpananda, Chamila Liyanage and Eranga Jayalatharachchi [pdf]. This paper discusses phonetic similarity based strategies to create a wordlist, instead of the edit distance approach.
  2. Finite-State Spell-Checking with Weighted Language and Error Models: Building and Evaluating Spell-Checkers with Wikipedia as Corpus. Tommi A Pirinen and Krister Lindén [pdf]. This paper outlines the use of the finite state transducer technique to address the issue of the infinite dictionary of morphologically rich languages. They use Finnish as the example language.
  3. The Malayalam morphology analyser project by myself, https://gitlab.com/smc/mlmorph, is the foundation for the spellchecker.
  4. The common Malayalam spelling mistakes and confusables were presented in great depth by the renowned linguist and author Panmana Ramachandran Nair in his books ‘തെറ്റില്ലാത്ത മലയാളം’, ‘തെറ്റും ശരിയും’, ‘ശുദ്ധ മലയാളം’ and ‘നല്ല മലയാളം’.
  5. Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams. Tommi A Pirinen, Miikka Silfverberg and Krister Lindén [pdf]. This paper discusses the context sensitive spellchecker approach.

Where can I try the spellchecker?

If you are curious about the implementation of this approach, please refer to https://gitlab.com/smc/mlmorph and https://gitlab.com/smc/mlmorph/wikis/Spellchecker-Plan. Since the implementation is not complete, I will write a new article about it later. Thanks for reading!

A screenshot of Malayalam spellchecker in action. Along with incorrect words, some correct words are marked as misspelled too. This is because of the incomplete morphology analyser. As it improves, more words will be covered.

by Santhosh Thottingal at September 08, 2018 09:41 AM

August 11, 2018

Santhosh Thottingal

Malayalam morphology analyser – status update

For the last several months, I have been actively working on the Malayalam morphology analyser project. In case you are not familiar with the project, my introduction blog post is a good start. I was always skeptical about the approach, and the whole project as such looked very ambitious. But now I am almost confident that the approach is viable. I am making good progress in the project, so here are some updates on that.

Analyser coverage statistics

Recently I added a large corpus to frequently monitor the percentage of words the analyser can parse. The corpus was selected from two large chapters of ഐതിഹ്യമാല, some news reports, an art related essay and my own technical blog posts, to have some diversity in the vocabulary.

Total words: 15808
Analysed words: 10532
Coverage: 66.62%
Time taken: 0.443 seconds

This is very encouraging. Achieving 66% for a morphologically rich language like Malayalam is no small task. From my reading, Turkish and Finnish, languages with similar morphological complexity, achieved about 90% coverage. Increasing the coverage further may be more difficult than achieving this much so far. So I am planning some frequency analysis of the words that are not parsed by the analyser, to find patterns to improve.

The performance aspect is also notable. Once the automaton is loaded into memory, the analysis or generation is super fast. You can see that ~16000 words were analysed in under half a second.
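The figures above can be reproduced with simple arithmetic. The sketch below is only an illustration: `coverage` and its `can_parse` argument are names invented here, with `can_parse` standing for any function that reports whether the analyser returns at least one analysis for a word.

```python
def coverage(words, can_parse):
    """Return (analysed, total, percentage) for a list of words."""
    total = len(words)
    analysed = sum(1 for word in words if can_parse(word))
    return analysed, total, 100.0 * analysed / total


# Checking the reported numbers directly.
total_words = 15808
analysed_words = 10532
seconds = 0.443

print(round(100.0 * analysed_words / total_words, 2))  # 66.62
print(int(total_words / seconds))  # words per second, roughly 35 thousand
```

Tracking this percentage on a fixed corpus after every lexicon or rule change gives a quick regression signal alongside the test cases.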

Tests

From the very beginning the project was test driven. It now has 740 test cases for various word forms.

The transducer

The compiled transducer is now 6.2 MB. The transducer is written in SFST-PL and compiled using SFST. It used to be compiled using hfst, but hfst is now severely broken for SFST-PL compilation, so I switched to SFST. The compiled transducer is still read using the hfst Python binding.

FST type: SFST
Arc type: SFST
Number of states: 200562
Number of arcs: 732268
Number of final states: 130

The Lexicon

The POS tagged lexicon I prepared is from various sources like Wiktionary, Wikipedia (based on categories) and CLDR. While developing, I had to improve the lexicon several times since none of the above sources is accurate. Wiktionary also introduced a large number of archaic or Sanskrit terms to the lexicon. As of today, the following table illustrates the lexicon status:

Nouns: 64763
Person names: 505
Place names: 2031
Postpositions: 85
Pronouns: 33
Quantifiers: 57
Abbreviations: 27
Adjectives: 18
Adverbs: 14
Affirmatives: 6
Conjunctions: 75
Demonstratives: 9
English borrowed nouns: 657
Interjections: 36
Language names (nouns): 639
Affirmations and negations: 8
Verbs: 3844

As you can see, the lexicon is not that big. It is especially limited for proper nouns like names and places. I think the verb lexicon is much better. I need to find a way to expand this further.

POS Tagging

There is no agreement or standard on the POS tagging schema to be used for Malayalam. But I refused to treat this as a blocker for the project. I defined my own POS tagging schema and worked on the analyser. The general disagreement is about naming, which is trivial to fix using a tag name mapper. The other issue is the classification of features, for which I found no elaborate schema that can cover Malayalam.

I started referring to http://universaldependencies.org/ and provided links to its pages from the web interface. But UD is also missing several tags that Malayalam requires. So far I have defined 85 tags.

Challenges

The main challenge I am facing is not technical, it is linguistic. I am often challenged by my limited understanding of Malayalam grammar. Especially about the grammatical classifications, I find it very difficult to come up with an agreement after reading several grammar books. These books were written in a span of 100 years and I miss a common thread in the approach for Malayalam grammar analysis. Sometimes a logical classification is not the purpose of the author too. Thankfully, I am getting some help from Malayalam professors whenever I am stuck.

The other challenge is that I have hardly had any contributors to the project, apart from some bug reports. There is a high entry barrier to this kind of project. SFST-PL is not something everybody is familiar with. I need to write some simple examples for others to practice with and join.

I found that practical applications built on top of the morphology analyser attract more people. For example, the number spellout application I wrote caught the attention of many people. I am excited to present the upcoming spellchecker that I have been working on recently. I will write about the theory behind it soon.

by Santhosh Thottingal at August 11, 2018 12:43 PM

August 10, 2018

Santhosh Thottingal

How to customize Malayalam fonts in Linux

Nowadays GNU/Linux distributions like Ubuntu, Debian and Fedora come with pre-configured fonts for Malayalam. For the sans-serif family it is Meera, and for serif it is Rachana. If you would like to change these fonts, there is no easy way to do so with the configuration tools in Gnome or KDE. They provide a general font selector for the whole desktop, but not for a given language.

The advantage of setting these preferences at system level is that you don’t need to choose the fonts at application level. For example, you don’t need to set them for Firefox, Chrome etc. All of them will follow the system preferences. We will use fontconfig for this.

First, create a file named ~/.config/fontconfig/conf.d/50-my-malayalam.conf. If the folders for this file do not exist, just create them. Add the following content to this file.

<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
<!-- Malayalam (ml) -->
<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <alias>
                <family>sans-serif</family>
                <prefer>
                        <family>Manjari</family>
                </prefer>
        </alias>
</match>

<match target="font">
        <test name="lang" compare="contains">
                <string>ml</string>
        </test>
        <alias>
                <family>serif</family>
                <prefer>
                        <family>Rachana</family>
                </prefer>
        </alias>
</match>

<!-- Malayalam (ml) ends -->

</fontconfig>

Save the file and you are done. You can check whether the default font for Malayalam has changed using the following command:

$ LANG=ml_IN fc-match

It should list Manjari. The configuration we added to the file is not complicated: it sets the sans-serif font preference for the ml (Malayalam) language to Manjari, and the serif preference to Rachana. You are free to change the fonts to whatever you prefer.
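To check both generic families rather than only the default one, the same fc-match tool can be asked about a pattern that names the family and the language. The sketch below wraps that call in Python. The function names are my own, and it assumes fontconfig's fc-match is on the PATH; whether it reports Manjari and Rachana depends on the configuration having taken effect on your system.

```python
import subprocess


def fc_match_command(generic_family, lang="ml"):
    """Build an fc-match call asking which font serves a family for a language."""
    pattern = f"{generic_family}:lang={lang}"
    # --format with %{family} prints only the family name of the best match.
    return ["fc-match", "--format", "%{family}\n", pattern]


def resolved_family(generic_family, lang="ml"):
    """Run fc-match (needs fontconfig installed) and return the family it reports."""
    result = subprocess.run(
        fc_match_command(generic_family, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling resolved_family("sans-serif") and resolved_family("serif") after saving the file gives a quick way to see which fonts fontconfig picks for Malayalam text.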

Note that you may need to close and reopen your applications for this preference to take effect.

You may choose one of the fonts available at smc.org.in/fonts, download and install it, and use the above configuration with it.

by Santhosh Thottingal at August 10, 2018 04:09 PM

July 29, 2018

Santhosh Thottingal

Young people's pride in their work, and labour societies

This is a note about a crisis faced by the young people of our land, and about an idea that might help resolve it.

In our land there are plenty of young people engaged in various kinds of wage labour that require special skills: driving, farm work, painting, building construction, mechanic work and so on. Almost all of them are in the unorganized sector too. Most of them are young men who did not get a government or private-sector job, or who lack the education needed to get one. Young women, however, continue their education at most until marriage and then end up in family life. These young men, aged between twenty and thirty-five, are facing a new challenge. Samakalika Malayalam weekly recently published a detailed study report about it (Nithyaharitha Varanmar ["Evergreen Grooms"], Rekhachandra, Samakalika Malayalam, July 16). The study finds that young men of this kind widely remain unmarried in the Malabar region.

The reason is a cultural lack of interest in men with the above-mentioned jobs on the part of young women's families. Nobody is willing to marry off young women to men who do not have a government or private company job. That article has details of new phenomena such as "Kodagu weddings" (കുടക് കല്യാണം). Caste, horoscopes and the like stand in the way even more than they used to. In rural areas, the moral police mostly leave no room for love marriages either. Young men take up such work and try to give the young women in their own families a little more education, but since those women later look only for young men with better jobs, the men end up in the same crisis again.

This problem contributes to a growing aversion to manual labour. The ego called social status is also slowly leading to a shortage of people for the important jobs mentioned above. As the general level of education in society rises, this ego will grow considerably. I fear that an unhealthy social order will gradually emerge from this. Young women in particular, because of pressure from their families, are pushed into a very narrow selection space of job possibilities. Our social conditions are reaching a state where they are not allowed to take up the jobs mentioned above. This is where guest workers found their opportunities.

In our society, where secular public platforms have diminished, the challenge of keeping this youth force politically enlightened is also growing. The possibility of apoliticism becoming a default choice among young people must be resisted by all means.

To address the problems summarized so far, a social movement is needed among the young people mentioned above. The aims are these:

  • Build social acceptance for all kinds of unorganized-sector jobs, whether or not they involve manual labour. Do not let the human resource potential of young people be shackled by misconceptions and social constraints.
  • Bring such workers into the organized sector and make them politically enlightened. Organize secular spaces.
  • Provide job training, and both the impetus for healthy reforms in existing occupations and the reforms themselves. Make the occupations attractive.
  • Extend the social driving force that Kudumbashree brought further to young people.

One idea I would like to propose for this is “labour societies” (തൊഴിൽ സൊസൈറ്റികൾ). A rough outline is as follows.

  • These societies act as a meeting point between workers and those who need workers.
  • Young people register there, along with their skills.
  • Those registered in such societies have uniforms, name tags and occupational safety clothing/equipment (to overcome social stigma, this is important).
  • Anyone can look for workers at these societies. There is no need to go and inquire in person. With a little help from technology, these connections can be made quickly. Overall, the intent is to use things like an appointment system to rewrite the master-worker relation of the old feudal era, and thereby also to break down the notion that any job ranks higher or lower than another.
  • Societies can fix wage rates. Their members will be aware of labour rights.

Capitalist systems in Western countries have already started implementing this idea. Amazon Services is an example. Such “online apps”, like Uber and Airbnb, will soon reach our land too. But they will have no goals beyond the exploitation that comes from the employer-worker relationship. My wish is that the people of Kerala enter that space early, with socio-political goals.

by Santhosh Thottingal at July 29, 2018 10:08 AM

July 15, 2018

Santhosh Thottingal

The many forms of ചിരി ☺️

This is an attempt to list all forms of the Malayalam word ചിരി (meaning: ☺, smile, laugh). For those unfamiliar with it, Malayalam is a highly inflectional Dravidian language. I am actively working on a morphology analyser (mlmorph) for the language, as outlined in one of my previous blog posts.

I prepared this list as a test case for mlmorph project to evaluate the grammar rule coverage. So I thought of listing it here as well with brief comments.
1. ചിരി
ചിരി is a noun. So it can have all nominal inflections.

2. ചിരിയുടെ
3. ചിരിക്ക്
4. ചിരിയ്ക്ക്
5. ചിരിയെ
6. ചിരിയിലേയ്ക്ക്
7. ചിരികൊണ്ട്
8. ചിരിയെക്കൊണ്ട്
9. ചിരിയിൽ
10. ചിരിയോട്
11. ചിരിയേ

There is a plural form
12. ചിരികൾ

A number of agglutinations can happen at the end of the word using affirmatives, negations, interrogatives etc. For example, ചിരിയുണ്ട്, ചിരിയില്ല, ചിരിയോ. For now I am ignoring all agglutinations and listing only the inflections.

ചിരിക്കുക is the verb form of ചിരി.
13.  ചിരിക്കുക

It can have the following tense forms
14. ചിരിച്ചു
15. ചിരിക്കുന്നു
16. ചിരിക്കും

A concessive form for the word
17. ചിരിച്ചാലും

This verb has the following aspects
18. ചിരിക്കാറ്
19. ചിരിച്ചിരുന്നു
20. ചിരിച്ചിരിയ്ക്കുന്നു
21. ചിരിച്ചിരിക്കുന്നു
22. ചിരിച്ചിരിക്കും
23. ചിരിച്ചിട്ട്
24. ചിരിച്ചുകൊണ്ടിരുന്നു
25. ചിരിച്ചുകൊണ്ടേയിരുന്നു
26. ചിരിച്ചുകൊണ്ടേയിരിക്കുന്നു
27. ചിരിച്ചുകൊണ്ടിരിക്കുന്നു
28. ചിരിച്ചുകൊണ്ടിരിക്കും
29. ചിരിച്ചുകൊണ്ടേയിരിക്കും

There are a number of mood forms for the verb ചിരിക്കുക
30. ചിരിക്കാവുന്നതേ
31. ചിരിച്ചേ
32. ചിരിക്കാതെ
33. ചിരിച്ചാൽ
34. ചിരിക്കണം
35. ചിരിക്കവേണം
36. ചിരിക്കേണം
37. ചിരിക്കേണ്ടതാണ്
38. ചിരിക്ക്
39. ചിരിക്കുവിൻ
40. ചിരിക്കൂ
41. ചിരിക്ക
42. ചിരിച്ചെനെ
43. ചിരിക്കുമേ
44. ചിരിക്കട്ടെ
45. ചിരിക്കട്ടേ
46. ചിരിക്കാം
47. ചിരിച്ചോ
48. ചിരിച്ചോളൂ
49. ചിരിച്ചാട്ടെ
50. ചിരിക്കാവുന്നതാണ്
51. ചിരിക്കണേ
52. ചിരിക്കേണമേ
53. ചിരിച്ചേക്കാം
54. ചിരിച്ചോളാം
55. ചിരിക്കാൻ
56. ചിരിച്ചല്ലോ
57. ചിരിച്ചുവല്ലോ

There are a few inflections with adverbial participles
58. ചിരിക്കാൻ
59. ചിരിച്ച്
60. ചിരിക്ക
61. ചിരിക്കിൽ
62. ചിരിക്കുകിൽ
63. ചിരിക്കയാൽ
64. ചിരിക്കുകയാൽ

The verb can act as an adverb clause. Examples
65. ചിരിച്ച
66. ചിരിക്കുന്ന
67. ചിരിച്ചത്
68. ചിരിച്ചതു്
69. ചിരിക്കുന്നത്

The above two forms act as nominal forms. Hence they have all nominal inflections too
70. ചിരിച്ചതിൽ
71. ചിരിക്കുന്നതിൽ
72. ചിരിക്കുന്നതിന്
73. ചിരിച്ചതിന്
74. ചിരിച്ചതിന്റെ
75. ചിരിക്കുന്നതിന്റെ
76. ചിരിച്ചതുകൊണ്ട്
77. ചിരിക്കുന്നതുകൊണ്ട്
78. ചിരിച്ചതിനോട്
79. ചിരിക്കുന്നതിനോട്
80. ചിരിക്കുന്നതിലേയ്ക്ക്

Now, a few voice forms for the verb ചിരിക്കുക
81. ചിരിക്കപ്പെടുക
82. ചിരിപ്പിക്കുക

These voice forms are again just verbs, so they can go through all the inflections listed above for the verb ചിരിക്കുക. I am not writing them here, since it would mostly repeat what is already listed. ചിരിക്കപ്പെടുക has all the inflections of the verb പെടുക. You can see them listed in my test case file though.

A noun can be derived from the verb ചിരിക്കുക too. That is
83. ചിരിക്കൽ

Since it is a noun, all nominal inflections apply.
84. ചിരിക്കലേ
85. ചിരിക്കലിനോട്
86. ചിരിക്കലിൽ
87. ചിരിക്കലിന്റെ
88. ചിരിക്കലിനെക്കൊണ്ട്
89. ചിരിക്കലിലേയ്ക്ക്
90. ചിരിക്കലിന്

My test file has 164 entries, including the ones I skipped here. As of today, the morphology analyser can parse 74% of the items. You can check the test results here: https://paste.kde.org/pn5z0oh7g
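A coverage figure like this can be computed with a few lines of code. The sketch below is not the mlmorph test runner; it assumes only that the analyser is available as a callable returning an empty result for words it cannot parse. The toy_analyse function is a hypothetical stand-in so that the example runs without mlmorph installed.

```python
def coverage(words, analyse):
    """Return (fraction parsed, list of unparsed words).

    `analyse` is any callable that returns a non-empty result for words it
    can parse and an empty one otherwise.
    """
    failed = [word for word in words if not analyse(word)]
    parsed = len(words) - len(failed)
    ratio = parsed / len(words) if words else 0.0
    return ratio, failed


def toy_analyse(word):
    """Stand-in analyser for demonstration: 'parses' only a fixed set of words."""
    known = {"ചിരി", "ചിരിച്ചു", "ചിരിക്കും"}
    return ["analysis"] if word in known else []
```

Returning the list of failed words alongside the ratio is deliberate: the unparsed forms are what point to the missing grammar rules.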

A native Malayalam speaker may point out the variation of this word, ചിരിയ്ക്കുക, with യ് before ക്കുക. My intention is to support that variation as well. Obviously that word will also have the inflected forms listed above.

Now that I have written this list here, I think having a rough English translation of each item would be cool, but it is too tedious for me.

by Santhosh Thottingal at July 15, 2018 12:11 PM

July 03, 2018

Santhosh Thottingal

How to type Malayalam using Keyman 10 and Mozhi

This is a quick tutorial on installing the Mozhi input method on Windows 10.

Mozhi is a transliteration-based keyboard for Malayalam. For example, you can type malayaalam to get മലയാളം. We will use Keyman as the input tool. Keyman is an open source input mechanism now developed by SIL. It supports a lot of languages, and Mozhi Malayalam is one of them.

Step 1: Download Keyman desktop with Mozhi Malayalam keyboard

Go to https://keyman.com/keyboards/mozhi_malayalam. There you will see the following download options. Select the first one as shown below and download the installer to your computer. It is a file of about 20MB.

Keyman 10 Desktop download page.

Step 2: Installation

Double click the downloaded file to start the installation. The installer will look like this:

Keyman 10 Desktop installer

Click on the Install Keyman Desktop button. You will see the screen below.

Keyman 10 Desktop welcome page.

 

Press the “Start Keyman” button. The installation will complete and the keyboard will start.

Step 3: Choose Mozhi input method

You will see a small icon at the bottom of your screen, near where the time is displayed.

Click on that to choose Mozhi.

Keyboard selection

Once you choose Mozhi, you can type in Manglish anywhere and you will see Malayalam. To learn how to type, click on “Keyboard Usage” as shown above.

Step 4: Start typing in Malayalam

You can type Malayalam directly in any application, without copy-pasting. Just start typing, as you would in English. Make sure to use a good Malayalam font. You can get one from https://smc.org.in/fonts/

Using Mozhi in LibreOffice. Notice the font used is Manjari. What I typed is “ippOL enikk malayaalam ezhuthaanaRiyaam”

 

by Santhosh Thottingal at July 03, 2018 02:41 PM

July 01, 2018

Santhosh Thottingal

Kindle supports custom fonts

I am pleasantly surprised to see that Amazon Kindle now supports installing custom fonts. This is a big step towards supporting non-Latin content on their devices. I can now read Malayalam ebooks on my Kindle with my favorite fonts.

Content rendered in Manjari font. Note that I installed Bold, Regular, Thin variants so that Kindle can pick up the right one

This feature was introduced in Kindle version 5.9.6.1, released in June 2018. Once you have updated to that version, all you need to do is connect the device to your computer using the USB cable, copy your fonts to the fonts folder there, and remove the USB cable. You will then see the fonts listed in the font selector.
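If you copy fonts often, the copy step can be scripted. The following is a minimal sketch under stated assumptions: the Kindle is already mounted as a USB drive, its mount point is passed in by the caller, and only .ttf and .otf files are treated as fonts. The function name and the suffix list are mine, not something defined by Kindle.

```python
import shutil
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf"}  # assumption: font file types worth copying


def copy_fonts(source_dir, kindle_root):
    """Copy font files from source_dir into the 'fonts' folder of the mounted Kindle."""
    fonts_dir = Path(kindle_root) / "fonts"
    fonts_dir.mkdir(exist_ok=True)  # the folder normally exists on the device already
    copied = []
    for font in sorted(Path(source_dir).iterdir()):
        if font.is_file() and font.suffix.lower() in FONT_SUFFIXES:
            shutil.copy2(font, fonts_dir / font.name)
            copied.append(font.name)
    return copied
```

A call might look like copy_fonts("/home/user/fonts/Manjari", "/media/user/Kindle"), where both paths are hypothetical and depend on your system.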

Kindle added Malayalam rendering support back in 2016, but the default font provided was one of the worst Malayalam fonts. It had wrong glyphs for certain conjuncts and only a minimal set of glyphs.

I tried some of the SMC Malayalam fonts in the new version of Kindle. Screenshots given below

Custom fonts selection screen. These fonts were copied to the device

Select a font other than the default one

Content in Rachana.

Make sure to check the version. 5.9.6.1 is the latest version and it supports custom fonts

by Santhosh Thottingal at July 01, 2018 04:15 AM

May 03, 2018

Rajeesh K Nambiar

Adventures in upgrading to Fedora 27/28 using ‘dnf system-upgrade’

[This post was drafted on the day Fedora 27 was released, about half a year ago, but was not published. The issue bit me again with Fedora 28, so I am documenting it for future reference.]

UPDATE: The issue occurred in Fedora 28 because I had exclude=grub2-tools in /etc/dnf/dnf.conf, which is why the error “nothing provides grub2-tools” was coming up. Removing that previously added and then forgotten line fixes the issue with updating the grub2 packages.
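A forgotten exclude line like this is easy to check for before starting an upgrade. The sketch below reads the text of a dnf.conf file and lists any excluded grub2 packages. It assumes the usual INI layout with a [main] section and looks at both the exclude and excludepkgs option names; the function names are mine.

```python
import configparser


def excluded_packages(dnf_conf_text):
    """Return the package names listed under exclude/excludepkgs in [main]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(dnf_conf_text)
    excluded = []
    for key in ("exclude", "excludepkgs"):
        value = parser.get("main", key, fallback="")
        # Entries may be separated by spaces or commas.
        excluded.extend(value.replace(",", " ").split())
    return excluded


def grub_excludes(dnf_conf_text):
    """Return only the excluded entries that start with 'grub2'."""
    return [name for name in excluded_packages(dnf_conf_text) if name.startswith("grub2")]
```

Passing in the contents of /etc/dnf/dnf.conf and getting a non-empty list back would be a hint to clean up the file before running dnf system-upgrade.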

With fedup and subsequently dnf improving the upgrade experience of Fedora for power users, the last few system upgrades have been smooth, quiet, even unnoticeable. That speaks volumes about the maturity and user-friendliness achieved by these tools.

Upgrading from Fedora 25 to 26 was uneventful and smooth (btw: I have installed and used every version of Fedora from its inception and the default wallpaper of Fedora 26 was the most elegant of them all!).

With that, on the release day I set out to upgrade the main workstation from Fedora 26 to 27 using dnf system-upgrade as documented. Before downloading the packages, dnf warned that the upgrade could not be done because of package dependency issues with grub2-efi-modules and grub2-tools.

Things go wrong!

I simply removed both the offending packages and their dependencies (assuming they were probably installed for the grub2-breeze-theme dependency, but grub2-tools actually provides grub2-mkconfig) and proceeded with dnf upgrade --refresh and dnf system-upgrade download --refresh --releasever=27. If you are attempting this, don’t remove the grub2 packages yet, but read on!

Once the download and check are completed, running dnf system-upgrade reboot reboots the system into the upgrade target, where the actual upgrade happens.

Except, I was greeted with EFI MOK (Machine Owner Key) screen on reboot. Now that the grub2 bootloader is broken thanks to the removal of grub2-efi-modules and other related packages, a recovery must be attempted.

Rescue

It is important to have a (possibly UEFI enabled) live media that you can boot from. Boot into the live media and try to reinstall grub. Once booted in, mount the root filesystem under /mnt/sysimage, and the EFI boot partition at /mnt/sysimage/boot/efi. Then chroot /mnt/sysimage and try to reinstall the grub2-efi-x64 and shim packages. If there’s no network connectivity, don’t despair, nmcli is to your rescue. Connect to wifi using nmcli device wifi connect <ssid> password <wifi_password>. Generate the boot configuration using grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg followed by the actual install, grub2-install --target=x86_64-efi /dev/sdX (the --target option ensures correct host installation even if the live media is booted via legacy BIOS). You may now reboot and proceed with the upgrade.

But this again failed at the upgrade stage because of the grub package clash that dnf had warned about earlier.

Solution

Once booted into the old installation, take a backup of the /boot/ directory, remove the conflicting grub related packages, and copy over the backed up /boot/ directory contents, especially /boot/efi/EFI/fedora/grubx64.efi. Now rebooting (using dnf system-upgrade reboot) kept the grub contents intact and the upgrade worked smoothly.

For more details on the package conflict issue, follow this bug.

by Rajeesh at May 03, 2018 07:16 AM

March 25, 2018

Balasankar C

FOSSAsia 2018 - Singapore

Heya,

So I attended my first international FOSS conference - FOSSAsia 2018 at the Lifelong Learning Institute, Singapore. I presented a talk titled “Omnibus - Serve your dish on all the tables” (slides, video) about the tool Chef Omnibus, which I use on a daily basis for my job at GitLab.

The conference was 4 days long, and my main aim was to network with as many people as I could. Well, I planned to attend sessions, but unlike earlier times when I attended all the sessions, these days I am more focused on certain topics and technologies and tend to attend sessions on those (for example, devops is an area I focus on, blockchain isn’t).

One additional task I had was to attend the Debian booth at the exhibition from time to time. It was mainly handled by Abhijith (who is a DM). I also met two other Debian Developers there - Andrew Lee(alee) and Héctor Orón Martínez(zumbi).

I also met some other wonderful people at FOSSAsia, like Chris Aniszczyk of CNCF, Dr Graham Williams of Microsoft, Frank Karlitschek of NextCloud, Jean-Baptiste Kempf and Remi Denis-Courmont of VideoLan, Stephanie Taylor of Google, Philip Paeps(trouble) of FreeBSD, Harish Pillai of RedHat, Anthony, Christopher Travers, Vasudha Mathur of KDE, Adarsh S of CloudCV (and who is from MEC College, which is quite familiar to me), Tarun Kumar of Melix, Roy Peter of Go-Jek (with whom I am familiar, thanks to the Ruby conferences I attended), Dias Lonappan of Serv and many more. I also met some people whom I had known only digitally, like Sana Khan, who was (yet another, :D) Debian contributor from COEP. I also met some friends like Hari, Cherry, Harish and Jackson.

My talk went OK without too much stuttering, and I am kinda satisfied with it. The only thing I forgot was to mention during the talk that I had stickers (well, I later placed them on the sticker table and they disappeared within minutes. So that was ok. ;))

PS: Well, I had to cut down quite a lot of my explanation and drop my demo due to limited time. This caused me to miss many important topics like omnibus-ctl or the cookbooks that we use at GitLab. But a few participants came up and met me after the talk with questions about omnibus and its similarity to flatpak, its relevance in the times of Docker etc., which was good.

Some photos are here:

Abhijith in Debian Booth

Abhijith with VLC folks

Andrew's talk

With Anthony and Harish: Two born-and-brought-up-in-SG-Malayalees

With Chris Aniszczyk

At Debian Booth

With Frank Karlitschek

With Graham Williams

MOS Burgers - Our breakfast place

Premas Cuisine - The Kerala taste

The joy of seeing Malayalam

With Sana

Well, Tamil, ftw

Zumbi's talk

March 25, 2018 12:00 AM

February 09, 2018

Rajeesh K Nambiar

Sundar — a new traditional orthography ornamental font for Malayalam

There is a dearth of good Unicode fonts for Malayalam script. Most publishing houses and desktop publishing agencies still rely on outdated ASCII era fonts. This not only causes issues with typesetting using present technologies, it makes the ‘document’ or ‘data’ created using these fonts and tools absolutely useless — because the ‘document/data’ is still Latin, not Malayalam.

Rachana Institute of Typography (rachana.org.in) has designed and published a new traditional orthography ornamental Unicode font for Malayalam script, for use in headings, captions and titles. It is named after Sundar, who was a relentless advocate of open fonts, open standards and open publishing. He dreamed of making available several good quality Malayalam fonts, particularly created by Narayana Bhattathiri with his unique calligraphic and typographic signature, freely and openly to the users. The font is licensed under OFL.

The font follows traditional orthography for Malayalam, rather than the unpleasing reformed orthography which was solely introduced due to the technical limitations of typewriters in the ’70s. Such restrictions do not apply to computers and present technology, so it is possible to render the classic beauty of Malayalam script using Unicode and Opentype technologies.

‘Sundar’ is designed by K.H. Hussain (known for his work on the Rachana and Meera fonts, which come pre-installed with most Linux distributions) and Narayana Bhattathiri (known for his beautiful calligraphy and lettering in Malayalam script). Graphic engineers of STM Docs (stmdocs.in) did the vectoring and glyph creation. Yours truly took care of the Opentype feature programming. The font can be freely downloaded from rachana.org.in.

The source code of ‘Sundar’, licensed under OFL, is available at https://gitlab.com/rit-fonts/Sundar.

by Rajeesh at February 09, 2018 07:53 AM

January 17, 2018

Balasankar C

Introduction to Git workshop at CUSAT

Heya,

It has been long since I wrote anything anywhere. In the last year I attended some events, like FOSSMeet, DeccanRubyConf and GitLab’s summit, and didn’t write anything about them. The truth is, I forgot that I used to write about all these and never got the motivation to do so.

Anyway, last week, I conducted a workshop on Git basics for the students of CUSAT. My real plan, as always, was to do a bit of FOSS evangelism too. Since the timespan of workshop was limited (10:00 to 13:00), I decided to keep everything to bare basics.

I started with an introduction to what a VCS is and why it became necessary. As a prerequisite, I talked about FOSS, the concept of collaborative development, the open source development model etc. It wasn’t easy, as my audience included not only CS/IT students but also those from other departments like Photonics and Physics. I am not sure if I was able to help them understand the premise clearly. However, I then went on to talk about what Git does and how it helps developers across the world.

IIRC, this was the first talk/workshop I did without a slide show. I was too lazy and busy to create one. I just had one page saying “Git Workshop” and my contact details. So guess what? I used a whiteboard! I went over the basic concepts like repositories, commits, staging area etc and started the hands-on session. In short, I talked about the following:

  1. Initializing a repository
  2. Adding files to it
  3. Adding files to the staging area
  4. Committing
  5. Viewing commit logs
  6. Viewing what a specific commit did
  7. Viewing a file’s contents at a specific commit
  8. Creating a GitLab account (Well, use every opportunity to talk about your employer. :P)
  9. Creating a project in GitLab
  10. Adding it as a remote repository to your local one
  11. Pushing your changes to remote repository
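For readers who want to retrace the session, the steps above map onto git commands roughly as follows. This is a sketch, not a record of what was typed in the workshop: the file name, commit message and remote URL are hypothetical, and steps 8 and 9 happen in the GitLab web interface rather than on the command line. The commands are kept as data so they can be read, printed or run one by one.

```python
import subprocess


def workshop_commands(remote_url, filename="notes.txt"):
    """Return the git commands for the workshop steps, in order."""
    return [
        ["git", "init"],                                 # 1. initialize a repository
        ["git", "add", filename],                        # 2-3. add a file to the staging area
        ["git", "commit", "-m", "Add " + filename],      # 4. commit
        ["git", "log", "--oneline"],                     # 5. view commit logs
        ["git", "show", "HEAD"],                         # 6. see what a specific commit did
        ["git", "show", "HEAD:" + filename],             # 7. file contents at a specific commit
        ["git", "remote", "add", "origin", remote_url],  # 10. add the GitLab project as a remote
        ["git", "push", "-u", "origin", "HEAD"],         # 11. push the current branch
    ]


def run_all(commands, cwd):
    """Run each command inside cwd, stopping at the first failure (needs git installed)."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

The file named by filename has to exist in the working directory before run_all is called, and the push only succeeds against a real project you have access to.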

I wanted to talk about clone, fork, branch and MRs, but time didn’t permit. We wound up the session with Athul and Kiran talking about how they need the students to join the FOSSClub of CUSAT and help organize similar workshops, and how that can help them as well. I too did a bit of “motivational talk” regarding how community activities can help them get a job, based on my personal experience.

Here are a few photos, courtesy of Athul and Kiran:

January 17, 2018 12:00 AM

September 07, 2016

Balasankar C

SMC/IndicProject Activities- ToDo List

Heya,

So, M.Tech is coming to an end and I should probably start searching for a job soon. Still, it seems I will have a bit of free time from mid-September. I have some plans about the areas where I should contribute to SMC/Indic Project. As of now, the bucket list is as follows:

  1. Properly tag versions of fonts in SMC GitLab repo - I had taken over the package fonts-smc from Vasudev, but haven’t done any update on that yet. The main reason was fontforge being old in Debian. Also, I was waiting for some kind of official release of new versions by SMC. Since the new versions are already available in the SMC Fonts page, I assume I can go ahead with my plans. So, as a first step I have to tag the versions of fonts in the corresponding GitLab repo. Need to discuss whether to include TTF file in the repo or not.
  2. Restructure LibIndic modules - Those who were following my GSoC posts will know that I made some structural changes to the modules I contributed in LibIndic. (Those who don’t can check this mail I sent to the list). I plan to do this for all the modules in the framework, and to co-ordinate with Jerin to get REST APIs up.
  3. GNOME Localization - GNOME Localization has been dead for almost two years now. Ashik has shown interest in re-initiating it and I plan to do that. I first have to get my committer access back.
  4. Documentation - Improve documentation about SMC and IndicProject projects. This will be a troublesome and time-consuming task, but I would still like our tools to have proper documentation.
  5. High Priority Projects - Create a static page about the high priority projects so that people can know where and how to contribute.
  6. Die Wiki, Die - Initiate porting Wiki to a static site using Git and Jekyll (or any similar tool). Tech people should be able to use git properly.

Knowing myself better than anyone else does, I understand there is every chance of this becoming a “never-to-be-implemented plan” (that is, ആരംഭശൂരത്വം, enthusiasm only at the start :D), but I still intend to do this in an easy-first order.

September 07, 2016 04:47 AM

August 29, 2016

malayaleecoder

GSoC — Final Report!

So finally it’s over. Today is the last date for submission of the GSoC project. The entire ride was very informative and full of new experiences. I thank the Indic Project organisation for accepting my GSoC project, and my mentors Navaneeth K N and Jishnu Mohan for helping me throughout this project.

The project kicked off with the aim of incorporating the native libvarnam shared library by writing JNI wrappers. Unfortunately, that approach stalled when we were unable to import the libraries correctly, due to the lack of sufficient official documentation. So my mentor suggested an alternative approach: making use of the Varnam REST API. This has been successfully incorporated for 13 languages, with the caveat that the app requires an internet connection. The suggestions that come up are also the ones returned by Varnam, in priority order. I will be contributing further to Indic Project to make the library approach work. Apart from that, see the useful links below:

  • this and this are related to adding a new keyboard with “qwerty” layout.
  • this is adding a new SubType value and a method to identify TransliterationEngine enabled keyboards.
  • this is adding the Varnam class and setting the TransliterationEngine.
  • this and this deal with applying the transliteration by Varnam and returning it back to the keyboard.
  • this is the patch to resolve the issue where the program crashes on switching keyboards.
  • this makes sure that after each key press, the displayed word is refreshed and the transliteration of the entire word is shown.
  • this makes sure that on pressing delete, the new word is displayed.
  • this creates a template such that more keyboards can be added easily.
  • this makes sure that the suggestions appearing are directly from the Varnam engine and not from the inbuilt library.
  • The list of commits can be seen here, which includes the addition of layouts for different keyboards and nit fixes.

Add Varnam support into Indic Keyboard

https://medium.com/media/30df9a95b2ac8d2171a7e7a1d00fe0ad/href

The project as a whole is almost complete. The only thing left to do is to incorporate the libvarnam library into the apk and then we can call that instead of the Varnam class given here. The ongoing work for that can be seen below,

malayaleecoder/libvarnam-Android

# Varnam
varnamc -s ml -t "Adutha ThavaNa kaaNaam"  # See you next time

by Vishnu H Nair at August 29, 2016 08:18 AM

August 23, 2016

Anwar N

GSoC 2016 IBus-Braille-Enhancement Project - Summary

Hi,
   First of all, my thanks to Indic Project and Swathanthra Malayalam Computing (SMC) for accepting this project. Hats off to my mentors Nalin Sathyan and Samuel Thibault. The project was awesome, and I believe I did my best despite having no prior experience.

Project Blog : http://ibus-braille-enhancement.blogspot.in/


Now let me outline what we have done during this period.

Braille-Input-Tool (The on-line version)
  Just like Google transliteration or Google Input Tools online. This is needed because it is completely operating system independent, and it is a modern method that never forces the user to install an additional plugin or a specific browser. The user might use it from temporary places like an internet cafe. It is written using jQuery and HTML, and works well on GNU/Linux, Microsoft Windows, Android etc.

See All Commits : https://github.com/anwar3746/braille-input/commits/gh-pages
Test with following link : http://anwar3746.github.io/braille-input/


IBus-Braille enhancements
See All Commits : https://gitlab.com/anwar3746/ibus-braille/activity

1 IBus-Braille integrated with Liblouis : The Liblouis software suite provides an open-source braille translator, back-translator and formatter for a large number of languages and braille codes. So maintaining and shipping separate braille maps (located at /share/ibus-sharada-braille/braille) with ibus-braille is a bad idea. With this change we completely adapted IBus-Braille to use Liblouis. The conversion is done a whole word at a time instead of letter by letter, i.e. the conversion happens after the word is written in direct braille Unicode and space is pressed.
Commit 1 : https://gitlab.com/anwar3746/ibus-braille/commit/6826982fa39cbd2e155bfb389658e16cc57b0dae
Commit 2 : https://gitlab.com/anwar3746/ibus-braille/commit/7032cf7b0c8cea7ce6c619c39750f5110effcfa3
Commit 3 : https://gitlab.com/anwar3746/ibus-braille/commit/46ec83a1caab75b2b25bbd06e1156d927b33c211

See Picture of Ibus-Braille preferences given below

2 8-Dot braille Enabled : Some languages have more than 64 characters, which cannot be handled with the 64 combinations of 6 dots. Music notations like “Abreu” and LAMBDA (Linear Access to Mathematics for Braille Device and Audio Synthesis) use the 8-dot braille system, and Unicode supports 8-dot braille.
Commit 1 : https://gitlab.com/anwar3746/ibus-braille/commit/54d22c0acbf644709d72db076bd6de00af0e20b9
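
To make the 6-dot versus 8-dot difference concrete, here is a minimal sketch of how dot numbers map onto the Unicode braille block. It is not taken from the IBus-Braille source; the function name `dots_to_braille` is made up for illustration. It relies only on the Unicode layout, where the block starts at U+2800 and dot n sets bit n-1.

```python
# Unicode braille patterns start at U+2800; dot n (1..8) sets bit n-1.
BRAILLE_BASE = 0x2800


def dots_to_braille(dots):
    """Return the Unicode braille character for an iterable of dot numbers (1-8)."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        mask |= 1 << (dot - 1)
    return chr(BRAILLE_BASE + mask)


print(dots_to_braille([1, 3]))        # six-dot cell using dots 1 and 3
print(dots_to_braille([1, 3, 7, 8]))  # eight-dot cell that also uses dots 7 and 8
print(2 ** 6, 2 ** 8)                 # number of cells with 6 dots and with 8 dots
```

Six dots give 2^6 = 64 cells, while eight dots give 2^8 = 256, which is why the extra two dots matter for larger character sets.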

See key/shortcut page picture of ISB preferences dot setting

3 Dot 4 issue Solved : In IBus-Braille, when we type in Bharati braille languages such as Malayalam, Hindi, etc., we have to use 13-4-13 to get the letter ക്ക (Kka). But according to the braille standard, in order to get EKKA one should press 4-13-13. This forces beginners to do extra learning before they can start typing. Through this project we solved this issue, and a conventional-braille-mode switch is provided in preferences to switch between the two.

Commit : https://gitlab.com/anwar3746/ibus-braille/commit/089edca78d31355c3ab0e08559f0d9fe79929de6

4 Add Facility to write direct Braille Unicode : Now one can use IBus-Braille to type braille dot notation directly with the key combination. The output may be sent to a braille embosser, an impact printer that renders text as tactile braille cells.

Commit : https://gitlab.com/anwar3746/ibus-braille/commit/4c6d2e3c8a2bbe86e08ca8820412201a52117ad1


5 Three to Six for disabled people with one hand : A three-key implementation which uses a delay factor between key presses. For example, 13 followed by 13 with a delay less than the delay factor (e.g. 0.2) will give X. If the delay is longer, the output will be KK. To type a letter whose combination uses only dots 4, 5 and 6, one has to press the "t" key first. The key and the Conversion-Delay can be adjusted from preferences.

Commit : https://gitlab.com/anwar3746/ibus-braille/commit/dda2bd83ba69fb0a0f6b526a940bc878bf230485
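
The delay rule can be sketched roughly as below. This is only an illustration, not the code from the commit: the function name `combine_three_key`, the (dots, timestamp) input format and the assumption that the delay is measured in seconds are all hypothetical, and the "t" prefix key is left out.

```python
def combine_three_key(presses, delay=0.2):
    """Merge three-key chords into six-dot cells.

    presses: list of (dots, timestamp) pairs, where dots is a set drawn from {1, 2, 3}.
    A chord arriving less than `delay` after the previous one is shifted to
    dots 4-6 and merged into the same cell; otherwise it starts a new cell.
    """
    cells = []
    last_time = None
    open_cell = False  # True while the last cell can still receive a right half
    for dots, when in presses:
        if open_cell and when - last_time < delay:
            cells[-1] = cells[-1] | {d + 3 for d in dots}
            open_cell = False
        else:
            cells.append(set(dots))
            open_cell = True
        last_time = when
    return cells


# 13 followed quickly by 13 gives dots 1,3,4,6 (X); a slow second press gives two K cells.
print(combine_three_key([({1, 3}, 0.0), ({1, 3}, 0.1)]))
print(combine_three_key([({1, 3}, 0.0), ({1, 3}, 0.5)]))
```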

6 Arabic language added
Commit : https://gitlab.com/anwar3746/ibus-braille/commit/bd0af5fcfabf891f0b0e6649a3a6c647b0d5e336

7 Many bugs solved
Commit : https://gitlab.com/anwar3746/ibus-braille/commit/da0f0309edb4915ed770e9ab41e4355c2bd2c713
others are implied

Project Discourse : https://docs.google.com/document/d/16v-BMLLzWmzbo1n5S-wDTnUmFV-cwhoon1PeJ0mDM64/edit?usp=sharing
IBus-Sharada-Braille (GSoC 2014) : http://ibus-sharada-braille.blogspot.in/

Plugins for firefox and chrome
    This plugin can be installed and will work with every text entry on web pages, with no need for copy and paste. The extensions are written in JavaScript.
See All Commits : https://github.com/anwar3746/braille-browser-addons/commits/master


Modifications still desirable are as follows

1 Announce extra information through Screen Reader: When the user expands an abbreviation, or a contraction having more than 2 letters is substituted, the screen reader does not announce it. We have to write an Orca (screen reader) plugin for IBus-Braille

2 A UI for Creating and Editing Liblouis Tables

3 Add support for more Indic Languages and Mathematical Operators via liblouis

Braille-input-tool (online version)
Liblouis integration
Conventional Braille, Three Dot mode and Table Type selection
Chrome Extension
Direct braille unicode typing
Eight dot braille enabled

by Unknown (noreply@blogger.com) at August 23, 2016 04:39 AM

August 22, 2016

Sreenadh T C

It’s a wrap!

“To be successful, the first thing to do is to fall in love with your work — Sister Mary Lauretta”

Well, the Google Summer of Code 2016 is reaching its final week as I get ready to submit my work. It has been one of the best three-four months of serious effort and commitment. To be frank, this has to be one of those projects for which I was fully motivated and gave my 100%.

Well, at first, the results of training weren’t that promising and I was actually let down. But then my mentor and I had a series of discussions on submitting, during which she suggested that I retrain the model excluding the data sets (audio files) of those speakers which produced the most errors. So after completing the batch test, I noticed that four of the data sets had the worst accuracy, which was shockingly below 20%. This was causing the overall accuracy to dip.

So, I decided to delete those four data sets and retrain the model. It was not that big a deal, so I thought it’s not gonna be a drastic change from the current model. But the result put me into a state of shock for about 2–3 seconds. It said

TOTAL Words: 12708 Correct: 12375 Errors: 520
TOTAL Percent correct = 97.38% Error = 4.09% Accuracy = 95.91%
TOTAL Insertions: 187 Deletions: 36 Substitutions: 297
SENTENCE ERROR: 9.1% (365/3993) WORD ERROR RATE: 4.1% (519/12708)
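
For readers wondering how these figures relate to each other, the sketch below recomputes them from the insertion, deletion and substitution counts using the usual word-error-rate formulas. The function name `recognition_scores` is made up for illustration; it is an assumption that the Sphinx alignment tool uses the same formulas, although they do reproduce the numbers printed above.

```python
def recognition_scores(words, insertions, deletions, substitutions):
    """Recompute the summary figures from the raw error counts."""
    errors = insertions + deletions + substitutions
    correct = words - deletions - substitutions  # insertions do not consume reference words
    return {
        "correct": correct,
        "errors": errors,
        "percent_correct": 100.0 * correct / words,
        "error_rate": 100.0 * errors / words,
        "accuracy": 100.0 * (words - errors) / words,
    }


scores = recognition_scores(words=12708, insertions=187, deletions=36, substitutions=297)
for name, value in scores.items():
    print(name, round(value, 2))
```

Note that "percent correct" ignores insertions, which is why it comes out higher than the accuracy figure.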

Now, this looks juicy and near to perfect. But the thing is, the sentences are tested as they were trained. So, if we change the structure of the sentence that we ultimately give it to recognize, it will still have issues putting out the correct hypothesis. Nevertheless, it was far better than it was when I was using the previous model.

So I guess I will settle with this for now, as the aim of the GSoC project was to start the project and show proof that this can be done, but I will keep training better ones in the near future.

Google Summer of Code 2016 — Submission

  1. Since the whole project was carried under my personal Github repository, I will link the commits in it here : Commits
  2. Project Repository : ml-am-lm-cmusphinx
  3. On top of that, we (me and the organization) had a series of discussions regarding the project over here: Discourse IndicProject

Well, I have been documenting my way through the project over here at Medium starting from the month of May. The blogs can be read from here.

What can be done in near future?

Well, this model is still in its early stage and is not yet one that can be used error free, let alone be applied in applications.

The data set is still buggy and has to be improved with better, cleaner audio data and a more tuned Language Model.

Speech Recognition development is rather slow and is obviously community based. All these are possible with collaborative work towards achieving a user-acceptable level of practical accuracy rather than quoting a statistical, theoretical accuracy.

All necessary steps and procedure have been documented in the README sections of the repository.

puts "thank you everyone!"

by Sreenadh T C at August 22, 2016 07:01 AM

August 21, 2016

Arushi Dogra

GSoC Final Report

It’s almost the end of the GSoC internship. From zero knowledge of Android to writing a proposal, getting the proposal selected and finally working 3 months on the project, it was a great experience for me! I have learned a lot and I am really thankful to Jishnu Mohan for mentoring me throughout.

Contributions include :-

All the tasks mentioned in the proposal were discussed and worked upon.

Layouts 
I started with making the designs of the layouts. The task was to make Santali Olchiki and Soni layouts for the keyboard. I looked at the code of the other layouts to get a basic understanding of how phonetic and inscript layouts work. Snapshot of one of the views of the Santali keyboard :

Screen Shot 2016-08-21 at 6.53.03 PM

Language Support Feature 
While configuring languages, the user is prompted about the locales that might not be supported by the phone.

Screen Shot 2016-08-21 at 6.33.25 PM

Adding Theme Feature
A feature is added at setup to let the user select the keyboard theme

Screen Shot 2016-08-21 at 6.49.21 PM

Merging AOSP code
After looking at everything mentioned in the proposal, Jishnu gave me the job of merging AOSP source code into the keyboard, as the current keyboard doesn’t have the changes that were released along with the Android M code drop, because of which the target SDK is not 23. There are a few errors yet to be resolved and I am working on that 😀

Overall, it was a wonderful journey and I will always want to be a contributor to the organisation as it introduced me to the world of open source and opened a whole new area to work upon and learn more.
Link to the discourse topic : https://discourse.indicproject.org/t/indic-keyboard-project/45

Thank You!  😀

by arushidogra at August 21, 2016 01:29 PM

August 17, 2016

Balasankar C

GSoC Final Report

Heya,

It is finally time to wind up the GSoC work in which I have been buried for the past three months. First of all, let me thank Santhosh, Hrishi and Vasudev for their help and support. I seem to have implemented, or at least proved, the concepts that I mentioned in my initial proposal. A spell checker that can handle inflections in the root word, generate suggestions in the same inflected form, and differentiate between spelling mistakes and intended modifications has been implemented. The major contributions that I made were to

  1. Improve LibIndic’s Stemmer module. - My contributions
  2. Improve LibIndic’s Spell checker module - My contributions
  3. Implement relatively better project structure for the modules I used - My contributions on indicngram

1. Lemmatizer/Stemmer

TLDR

My initial work was on improving the existing stemmer that was available as part of LibIndic. The existing implementation was a rule based one that was capable of handling single levels of inflections. The main problems of this stemmer were

  1. General incompleteness of rules - Plurals (പശുക്കൾ), Numerals(പതിനാലാം), Verbs (കാണാം) are missing.
  2. Unable to handle multiple levels of inflections - (പശുക്കളോട്)
  3. Unnecessarily stemming root words that look like inflected words - (ആപത്ത് -> ആപം following the rule of എറണാകുളത്ത് -> എറണാകുളം)

The above mentioned issues were fixed. The remaining category is verbs which need more detailed analysis.
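
The third problem (stemming a root word that merely looks inflected) can be illustrated with a toy sketch. This is not the LibIndic implementation: the single rule and the one-entry root list below are hypothetical, built only from the examples mentioned above, and the sketch handles a single level of inflection.

```python
# Hypothetical rule table and root list, for illustration only.
SUFFIX_RULES = [("ത്ത്", "ം")]   # e.g. എറണാകുളത്ത് -> എറണാകുളം
ROOT_WORDS = {"ആപത്ത്"}          # words that must never be stemmed


def stem(word):
    """Apply the first matching suffix rule unless the word is a known root."""
    if word in ROOT_WORDS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


inflected = "എറണാകുളത്ത്"
print(stem(inflected))   # the rule applies
print(stem("ആപത്ത്"))    # left untouched because it is listed as a root
```

Checking against a root list before applying rules is one simple way to avoid producing forms like ആപം.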

Long Version

A demo screencast of the lemmatizer is given below.

So, comparing with the existing stemmer algorithm in LibIndic, the one I implemented as part of GSoC shows considerable improvement.

Future work

  1. Add more rules to increase grammatical coverage.
  2. Add more grammatical details - Handling Samvruthokaram etc.
  3. Use this to generate sufficient training data that can be used for a self-learning system implementing ML or AI techniques.

2. Spell Checker

TLDR

The second phase of my GSoC work involved making the existing spell checker module better. The problems I could identify in the existing spell checker were

  1. It could not handle inflections in an intelligent way.
  2. It used a corpus that needed inflections in them for optimal working.
  3. It used only levenshtein distance for finding out suggestions.

As part of GSoC, I incorporated the lemmatizer developed in phase one to the spell checker, which could handle the inflection part. Three metrics were used to detect suggestion words - Soundex similarity, Levenshtein Distance and Jaccard Index. The inflector module that was developed along with lemmatizer was used to generate suggestions in the same inflected form as that of original word.
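
Two of the three metrics are easy to sketch in a few lines. The code below is a generic illustration and not the libindic-spellchecker code; in particular, computing the Jaccard index over character bigrams is an assumption, and Soundex is left out because it needs language-specific tables.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a, b, n=2):
    """Jaccard index over character n-grams (bigrams by default)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein("kitten", "sitting"))
print(jaccard("night", "nacht"))
```

A lower Levenshtein distance and a higher Jaccard index both point to a closer suggestion, so the two can be combined when ranking candidates.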

Long Version

A demo screencast of the lemmatizer is given below.

3. Package structure

The existing modules of libindic had an inconsistent package structure that gave no visibility to the project. Also, the package names were too general and didn’t convey the fact that they were used for Indic languages. So, I suggested and implemented the following changes

  1. Package names (of the ones I used) were changed to libindic-<module>. Examples would be libindic-stemmer, libindic-ngram and libindic-spellchecker. So, users will easily understand that a package is part of the libindic framework, and thus meant for Indic text.
  2. Namespace packages (PEP 420) were used, so that import statements of libindic modules will be of the form from libindic.<module> import <language>. So, the visibility of the project ‘libindic’ is increased quite a bit.

August 17, 2016 04:47 AM

August 16, 2016

Anwar N

IBus-Braille Enhancement - 3

Hi,
 A hard week passed!

1 Conventional Braille Mode enabled : Through this we solved the dot-4 issue, and now one can type using braille without any extra knowledge

commit 1 : https://gitlab.com/anwar3746/ibus-braille/commit/089edca78d31355c3ab0e08559f0d9fe79929de6

2 Handle configuration parser exceptions : A corrupted ISB configuration file could prevent it from starting, so I solved this with proper exception handling

commit 2 : https://gitlab.com/anwar3746/ibus-braille/commit/da0f0309edb4915ed770e9ab41e4355c2bd2c713

3 Liblouis integration : I think our dream is about to come true! But we are still struggling with vowel substitution in the middle of words.
commit 3 : https://gitlab.com/anwar3746/ibus-braille/commit/6826982fa39cbd2e155bfb389658e16cc57b0dae
commit 4 : https://gitlab.com/anwar3746/ibus-braille/commit/46ec83a1caab75b2b25bbd06e1156d927b33c211
commit 5 : https://gitlab.com/anwar3746/ibus-braille/commit/7032cf7b0c8cea7ce6c619c39750f5110effcfa3

by Unknown (noreply@blogger.com) at August 16, 2016 08:35 PM

August 09, 2016

Sreenadh T C

What now?

“Now that the basic aim was fulfilled, what more can we work on, given there is almost half a month to GSoC Submission!”

Well, as of now the phoneme transcription was done purely based on the way the word was written and not completely based on the speech pattern. What I mean is that there are some exceptions where we write a word one way and pronounce it differently. This was pointed out by Deepa mam. She also asked if I could possibly convert some of the existing linguistic rules (algorithms) that were made with Malayalam TTS in mind, so that they could be used to re-design the phoneme transcription. This could also turn out to be helpful in future, for example for a fully intelligent Phoneme Transcriber for Malayalam Language Modeling.

This is what we are working on right now, and I am literally scratching my head over some loops in Python!

juzzzz jokinnn
The basic idea is to iterate over each line in the ‘ml.dic’ file and validate the transcription I made earlier against the set of rules, correcting entries (if found invalid) as it goes.

Seems pretty straight forward! Will see how it goes!

Update — 4th August

Wew! This is going nuts! OK, so I first tried using Lists to classify the different types of phones. It was all good until I reached a point in the algorithm where I have to check if the current phoneme in the transcription is a member of a particular class of phoneme ( now, when I say class of phoneme, I just mean the classification and not a Python class ). Of course I can search a List for the presence of the element, and that is quite sufficient for small comparisons. Our case is different. We are talking about around 7000 words in a file, on top of which each line will go through a significant number of if-elif clauses.

This could slow down things and make the script less efficient ( will eventually see the difference ). So I went back to Python documentation and read about the Set Types ( set and frozenset )

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (So said the Python doc.)

This is exactly what I wanted. I mean, I don’t have to do any manipulation of the phoneme classes, so there is no real point in using a List. Furthermore, the Set supports the ‘in’ operator, with which membership can be checked without writing any additional search procedure. How cool is that!
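
A minimal sketch of the idea, assuming a dictionary line has the form "word ph1 ph2 ...". The phoneme classes and function names below are hypothetical placeholders; the real classes live in the project's script.

```python
# Hypothetical phoneme classes; frozenset because they are never modified.
VOWELS = frozenset({"a", "aa", "i", "ii", "u", "uu", "e", "o"})
NASALS = frozenset({"m", "n", "ng"})


def classify(phoneme):
    """Return the class of a phoneme using constant-time set membership tests."""
    if phoneme in VOWELS:
        return "vowel"
    if phoneme in NASALS:
        return "nasal"
    return "other"


def classify_entry(line):
    """Split a dictionary line 'word ph1 ph2 ...' and classify each phoneme."""
    word, *phones = line.split()
    return word, [classify(p) for p in phones]


print(classify_entry("amma a m m a"))
```

With a List, each `in` test scans the elements one by one; with a set it is a hash lookup, which adds up over thousands of lines and many if-elif checks.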

here!

Update — 9th August

So, after some tests on the script, I generated the dictionary file once again, this time applying some of the TTS rules. Now SphinxTrain is running with this dictionary file. Hopefully, there should be some change in the accuracy!

left panel with new dictionary, right panel with old dictionary

This might as well be the last development phase update if all goes well. Then it is submission time.

puts 'until then ciao'

by Sreenadh T C at August 09, 2016 01:56 PM

Anwar N

IBus-Braille Enhancement - 2

Hi, this week I was fighting with my final semester exams, and they are over! Also within this week I added the facility for typing direct braille Unicode.

https://gitlab.com/anwar3746/ibus-braille/commit/4c6d2e3c8a2bbe86e08ca8820412201a52117ad1

Instead of converting to Unicode, I added it as a new language so that one can later edit and use it.

by Unknown (noreply@blogger.com) at August 09, 2016 03:40 AM

July 31, 2016

malayaleecoder

GSoC Progress — Week 8 & 9

Awesome, something good is happening :)

CMake was giving me some trouble in the beginning. After clearing all the dependency issues with the CMake example, I was able to run endless-tunnel on my phone. Following the same pattern in which modules are incorporated in the CMake app, we tried to incorporate the varnam module. The code for the attempt is given here.

Now there comes a problem :| I have documented the issue here,

Adding a new native module using CMake yields "Error: exception during working with external system:"

After 9 days, there has still not been a single response :( So as an alternative we have decided to use the varnam API. I have completed the class for it and have yet to link it to the keyboard input from the Indic Keyboard app. This part is the agenda for the next week.

//Pascal
program HelloWorld(output);
begin
writeln('That''s all for now, see you next time!')
end.

by Vishnu H Nair at July 31, 2016 04:53 PM

GSoC Progress — Week 6 & 7

Why doesn’t it work!!!!!

Alright, for the past two weeks, me and my mentor have been trying a lot to call the varnam library in Java. First we went on trying to load the prebuilt library onto Android Studio and then use the methods in Java, which didn’t work :(

Now we are on a different route of compiling varnam during runtime. For this we are following the cmake example given here. Another thing to note is that cmake requires the canary build of Android Studio, which can be downloaded here. It all started off well until it was seen that OSX has a problem running that.

Now I am getting it all set up on Linux as well as Windows ( just in case :P ). Sorry for not writing any technical details, will make it up next week.

//Rust
fn main() {
println!("That's all for now, see you next time!");
}

by Vishnu H Nair at July 31, 2016 04:51 PM

GSoC Progress — Week 4 & 5

Ooh boy, halfway through GSoC and a lot to be done. Finally we decided to do the entire project in Android Studio so that the later integration with Indic Keyboard would be easier. As said in the last post, I was in the middle of completing the wrappers of varnam_init() and the rest of the functions when a queue of challenges popped up.

First of all, since we are moving out of the regular “PC” kind of architecture, storing the scheme files in a specific directory is still a problem. First we decided to store them in the internal storage of the mobile, which eventually caused a lot of problems because varnam_set_symbols_dir() required a string path to the directory, which was not possible. Then we decided to store them in the external storage of the device. This decision is temporary because once the user removes the external SD card, Varnam keyboard would not be functional :P

Then came the problem of build architectures. Since my work machine is a Mac, all the built libraries are in the form of .dylib files. Android accepts only .so files as the jniLibs. After generating the binary in my dual-boot Ubuntu, it turned out that Android accepts only 32-bit architecture libraries. Then, using VirtualBox, I finally managed to get the desired files. Now out of nowhere the thrown error is,

"Cannot find: libpthread.so.0"

I have currently written wrappers for most of the required methods, but have to resolve these errors to get the testing going smoothly. I will upload a list of the references I have gone through (there are tons of ’em) in the next post so that anyone working on this topic may find it useful.

//Scala
object Bye extends App {
println("That's all for now, see you next time!")
}

by Vishnu H Nair at July 31, 2016 04:51 PM

Sreenadh T C

‘He’ just recognized what I said!

Yipeeee!!
Well, the title says it all. The computer just recognized what I said in my Mother Tongue! A major step in the right direction.

For this to happen, I had to complete the Acoustic Model training. So then!

What is Acoustic Model!

Well, it is a set of statistical parameters used to learn the language by representing the relation between the audio signal and the corresponding linguistic features that make up that speech or audio ( phoneme, and transcription! ).

To produce these we need to set up a database structure as documented by the CMU SphinxTrain team. Some of these files, like the phoneme transcription file, were common to the Language Model preparation. After setting up the database it should look like this, irrespective of the language!

The training is straightforward if you get the database error free, which was not my case! Thank you! ( ** if you get it error free on the first run, you are probably doing it wrong! ** )

I had to solve two issues ( 1 and 2 ) before I could run the training without any hiccups! It took a day to make the patch work in the files. The documentation didn’t mention that the phone set should contain a maximum of 255 phones due to a practical limitation, though theoretically it had no problems ( I found this out myself from the CMU help forums ). That was the Issue : Reduce phoneset to a max of 255 #31. I successfully reduced it to what is found in the current repository version.
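
A quick way to catch this limit before training is to count the distinct phones used in the pronunciation dictionary. The sketch below assumes each dictionary line has the form "word ph1 ph2 ..."; the sample lines and the function name `phone_set` are hypothetical, and it does not replace whatever checks SphinxTrain itself performs.

```python
MAX_PHONES = 255  # limit quoted in the post


def phone_set(dictionary_lines):
    """Collect the distinct phones used across all dictionary entries."""
    phones = set()
    for line in dictionary_lines:
        parts = line.split()
        phones.update(parts[1:])  # the first field is the word itself
    return phones


sample = ["amma a m m a", "mala m a l a"]  # made-up entries
phones = phone_set(sample)
print(len(phones), "phones; within limit:", len(phones) <= MAX_PHONES)
```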

Update — July 27

Acoustic Model is ready for testing!

How??!!
$ sphinxtrain -t ml setup

This command will set up the ‘etc’ and ‘wav’ folders as mentioned above. Now we need to set up the sphinx_train.cfg config file, which is excellently documented by the team.

Once that is out of the way, run the training.

$ cd ml

and,

$ sphinxtrain run

and,

wait!!

..

….

still waiting!!

.

..

Finally it’s done! Took quite a lot of time!

Not only that, my Zenbook finally started showing heating and fan noise. That sixth gen Intel needed some extra air! ( ** nice! ** ).

Update — July 29

Well, this means the GSoC 2016 aim has been achieved, which was to develop the Language Model and Acoustic Model. The only thing left is to keep testing it.

The discussion with Deepa mam helped in bringing out a possibility of improving the accuracy, which I am working on as a branch in parallel to the testing.

With that in mind for the coming week, that’s it for this week

puts "until then ciao!"

by Sreenadh T C at July 31, 2016 07:46 AM

July 30, 2016

Anwar N

IBus-Braille Enhancement - 1

Hi,
  This week I forked the IBus-Braille project from the SMC GitLab repository and added two things.

1 Eight-Dot braille enabled. Now one can add languages with 8 dots. The default keys are Z for dot 7 and period for dot 8. These can be remapped using preferences.
https://gitlab.com/anwar3746/ibus-braille/commit/54d22c0acbf644709d72db076bd6de00af0e20b9


2 Arabic Language added and tested with users
https://gitlab.com/anwar3746/ibus-braille/commit/bd0af5fcfabf891f0b0e6649a3a6c647b0d5e336

See commits : https://gitlab.com/anwar3746/ibus-braille/commits/master

by Unknown (noreply@blogger.com) at July 30, 2016 03:23 AM

July 26, 2016

Arushi Dogra

Updates on work

My next task was to show layouts filtered by language instead of showing all of them. My first option was to filter based on locale. So instead of ACTION_INPUT_METHOD_SUBTYPE_SETTINGS we can use ACTION_LOCALE_SETTINGS, but the problem was that it gave a list of all the locales in the system instead of the locales in our app. So I skipped this idea. I then decided to create a list and enable user selection on that, but there was no way to connect that to the enabled system subtypes. I was stuck on this for quite some time. We ditched the plan and moved on to the “Theme selection” task.

I am currently working on the Theme Selection task. I have successfully added the step, but now I am working on adding a fragment instead of the whole activity. After I am done with this, I will move on to adding the images of the themes. I will hopefully complete this task by the weekend.

Also, after a meeting with the mentor, it was decided that after this task I will work on merging AOSP source code into the keyboard, as the current keyboard doesn’t have the changes that were released along with the Android M code drop, because of which the target SDK is not 23. So my next task will be merging the AOSP code, which will give the benefit of run-time permissions. 😀

by arushidogra at July 26, 2016 12:34 AM

July 25, 2016

Balasankar C

4 Days. 22 Hours. LaTeX.

Heya folks,

One of the things I love doing is teaching what I know to others. Though it is a cliché, I know from experience that when we teach others our knowledge expands. From 10 students, you often get 25 different doubts, and at least 5 of them would be ones you hadn’t even thought of yourself. In that way, teaching drives our curiosity to find out more.

I was asked to take a LaTeX training for B.Tech students as a bridge course (happening during their semester breaks. Poor kids!). The usual scenario is faculty taking the class and we PG students assisting them. But, since all the faculty members were busy with their own subjects’ bridge courses, and LaTeX was more of an additional skill that the students need for report preparation in their next semesters, I was asked to take it with the assistance of my classmates. At first, I was asked to take a two-day session for third year IT students. But later, the HOD decided that both CS and IT students should have that class, and guess what - I had to teach for four days. Weirdly, the IT class was split across two non-consecutive dates - Monday and Wednesday. So, I didn’t have to take class for four consecutive days, but only three. :D

The syllabus I followed is as follows:

  • Basic LaTeX – Session I
    1. Brief introduction about LaTeX, general document structure, packages etc.
    2. Text Formatting
    3. Lists – Bullets and Numbering
  • Graphics and Formulas – Session II
    1. Working with Images
    2. Tables
    3. Basic Mathematical Formulas
  • Academic Document Generation (Reports and Papers) – Session III
    1. Sectioning and Chapters
    2. Header and Footer
    3. Table of Contents
    4. Adding Bibliography and Citations
    5. IEEETran template
  • Presentations using Beamer – Session IV

As (I, not the faculty) expected, only half of the students came (classes on semester breaks, I was surprised when even half came!). Both the workshops, for CS and IT, went smoothly without many issues or hindrances. Students didn’t hesitate much to ask doubts or tips on how to do stuff that I didn’t teach (unfortunately, I didn’t have time to go off-syllabus, so I directed them to the Internet. :D). Comparing the students, the CS students were more lively and interactive but took some time to grasp the concepts. The IT students, even though kind of silent, learned stuff fast.

By Friday, I had completed 4 days, around 22 hours of teaching and that too non-stop. I was tired each day after the class, but it was still fun to share the stuff I know. I would love to get this chance again.

IT Batch

CSE Batch

July 25, 2016 12:00 AM

July 24, 2016

Sreenadh T C

Developing the Language Model

Finally, I can start the work towards Milestone — 2, which is completing the development of Language Model for Malayalam. Time to completely switch to Ubuntu from here on. Why?

Well, all the forums related to CMU Sphinx keep saying that they won’t monitor reports from Windows anyway, and since all the commands and code mentioned in the documentation are more inclined to Linux, let’s just stick to it as well. After all, when it comes to Open-Source, why should I develop using Microsoft Windows? (** Giggle **)

What is a Statistical Language Model?

Statistical language models describe more complex language, which in our case is Malayalam. They contain probabilities of the words and word combinations. Those probabilities are estimated from sample data ( the sentence file ) and automatically have some flexibility.

This means, every combination from the vocabulary is possible, though probability of such combination might vary.

Let’s say you create a statistical language model from a list of words, which is what I did for my Major Project work; it will still allow decoding word combinations ( phrases or sentences for that matter ) though that might not be our intent.

Overall, statistical language models are recommended for free-form input where the user could say anything in a natural language, and they require far less engineering effort than grammars: you just list the possible sentences using the words from the vocabulary.

Let me explain this with a traditional Malayalam example:

Suppose we have these two sentences “ ഞാനും അവനും ഭക്ഷണം കഴിച്ചു ” and “ ചേട്ടൻ ഭക്ഷണം കഴിച്ചില്ലേ ”.

If we use the statistical language model of this set of sentences, then it is possible to derive more sentences from the words( vocabulary ).

ഞാനും (1) , അവനും (1) , ഭക്ഷണം (2) , കഴിച്ചു (1) , ചേട്ടൻ (1) , കഴിച്ചില്ലേ (1)

That is, we can have sentences like “ ഞാനും കഴിച്ചു ഭക്ഷണം ” or maybe “ഭക്ഷണം കഴിച്ചില്ലേ ”, or “ അവനും കഴിച്ചില്ലേ ” and so on. It’s like the Transitive Property of Equality but in a more complex manner. Here it's related to the probability of occurrence of a given word after another word. This is calculated using the sample data that we provide as the database.

Now, you might be wondering what the numbers inside the parentheses mean. Those are nothing but the number of occurrences of each word in the complete set of sentences. This is calculated by the set of C libraries provided by a toolkit that I will introduce shortly.
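
The counting itself is simple enough to sketch in Python. This is only an illustration of the idea, not what the toolkit does internally: the helper name `bigram_probability` is made up, and the estimate is a plain unsmoothed ratio, whereas a real language model also reserves some probability for unseen combinations.

```python
from collections import Counter

sentences = [
    "ഞാനും അവനും ഭക്ഷണം കഴിച്ചു",
    "ചേട്ടൻ ഭക്ഷണം കഴിച്ചില്ലേ",
]

# Number of occurrences of each word (the numbers in parentheses above).
counts = Counter(word for s in sentences for word in s.split())

# Number of occurrences of each adjacent word pair.
bigrams = Counter()
for s in sentences:
    words = s.split()
    bigrams.update(zip(words, words[1:]))


def bigram_probability(prev, word):
    """Unsmoothed estimate of P(word | prev) from the sample sentences."""
    return bigrams[(prev, word)] / counts[prev]


for word, n in counts.items():
    print(word, n)
print(bigram_probability("ഭക്ഷണം", "കഴിച്ചു"))
```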

Update — July 18

Okay!

Let’s start building. If you remember my previous blog posts/articles, you may recall me writing about extracting words and then transcribing those to a phonetic representation. Those words are nothing but the vocabulary that I just showed.

For building a language model of such a large-scale vocabulary, you will need to use specialized tools or algorithms. One such set of algorithms is provided as C libraries by the name “CMU-Cambridge Statistical Language Modeling Toolkit”, or in short CMU-CLMTK. You can head over to their official page to know more about it. I have already installed it. So we are ready to go.

So according to the documentation,

The first step is to find out the number of occurrences. (text2wfreq)

cat ml.txt | text2wfreq > ml.wfreq

Next we need to convert the .wfreq file to a .vocab file, without the numbers and stuff. Just the words.

cat ml.wfreq | wfreq2vocab -top 20000 > ml.vocab

Oops, there are some issues with the generated vocab file: repetitions, and additional words here and there which are not required. This might have happened because I filtered the sentences file but forgot to update, or skipped updating, the transcription file. So there is some delay in the further process. It's already late night! I need to sleep!

Update — July 19

‘Meld’. Thank you StackExchange

With this guy, it’s easy to compare everything and make changes simultaneously. It should be done by today!

.

.

Done!

Okay, now that the issue has been handled, we are getting somewhere. It should be pretty much straightforward now.

Next we need to find the list of every id n-gram which occurred in the text, along with its number of occurrences, i.e. generate a binary id 3-gram of the training text ( ml.txt ), based on this vocabulary ( ml.vocab ).

By default, the id n-gram file is written out as a binary file, unless the -write_ascii switch is used in the command.

The -temp ./ switch can be used if you want to run the command without root permission and use the current working directory as the temp folder. Or you can just run it as root without the switch, which by default will use /usr/tmp as the temp folder.

cat ml.txt | text2idngram -vocab ml.vocab -temp ./ > ml.idngram
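To give a feel for what an "id n-gram" is, here is a rough Python sketch: each vocabulary word gets an integer id, and every run of three consecutive ids is counted. This is a simplification made for illustration; the id numbering, the use of 0 for out-of-vocabulary words, the per-line counting and the function names are all assumptions, not the toolkit's actual file format or behaviour.

```python
from collections import Counter


def build_word_ids(vocab, unk_id=0):
    """Map each vocabulary word to an integer id; `unk_id` is kept for unknown words."""
    return {word: i for i, word in enumerate(sorted(vocab), start=unk_id + 1)}


def count_id_ngrams(lines, word_ids, n=3, unk_id=0):
    """Count every n-gram of word ids that occurs within each line."""
    counts = Counter()
    for line in lines:
        ids = [word_ids.get(token, unk_id) for token in line.split()]
        for i in range(len(ids) - n + 1):
            counts[tuple(ids[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    word_ids = build_word_ids({"a", "b", "c"})
    for ngram, count in count_id_ngrams(["a b c a b c"], word_ids).items():
        print(ngram, count)
```

The resulting counts are the raw material from which the language model's probabilities are estimated in the next step.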

Finally, we can generate the Language Model. This can either be an ARPA model or a Binary.

idngram2lm -idngram ml.idngram -vocab ml.vocab -binary ml.bin

or

idngram2lm -idngram ml.idngram -vocab ml.vocab -arpa ml.arpa

Even though ARPA is available, using the binary format of the language model is recommended for faster operations.

Here is the basic work-flow.

as provided by the Toolkit Documentation.

That’s it. The Language Model is complete. I can now move on to the next step, that is, building and training the Acoustic Model.

by Sreenadh T C at July 24, 2016 06:13 AM

July 21, 2016

Balasankar C

Kerala State IT Policy - A Stakeholder Consultation

Heya folks,

Last Saturday, that is 16th July, I attended a meeting regarding the upcoming Kerala State IT Policy. It was a stakeholder consultation organized by DAKF, Software Freedom Law Centre and Ernakulam Public Library Infopark branch. The program was presided over by Prasanth Sugathan of SFLC (I had met him during Swatanthra, when I helped Praveen in the Privacy track) and was inaugurated by M. P Sukumaran Nair, advisor to the Minister of Industries. The agenda of the meeting was to discuss the suggestions that need to be submitted to the Government before they draft the official IT policy, which will be in effect for the next few years. I attended the meeting representing Swathanthra Malayalam Computing. Even though the meeting had a small audience, some of the key topics were brought into the mix.

Professor Jyothi John, retired principal of Model Engg. College, discussed MOOCs as a way to improve the education standard of the State. He also talked about improving the industry-academia-research relationship, which is in a pathetic state as of now. I was asked to say a few words. But since SMC hadn’t taken any official stand or points for the meeting, I actually talked about my own views on the issue. Obviously, my topics were more focused on Language Computing, digital empowerment of the language, as well as how FOSS should be the keystone of the IT policy. I also mentioned the E-Waste problem that Anivar had discussed the other day on the Whatsapp group.

Me Talking | PC: Sivahari

Mr. Joseph Thomas, the president of FSMI, also talked about the importance of FOSS in IT policy (Kiran Thomas had some pretty strong disagreements with it. :D ). Following that, Babu Dominic from BSNL talked about their success stories with FOSS and how the project was scrapped by the government. There were some brilliant insights from Satheesh, who is a Social Entrepreneur now and once ran an IT-based company.

Following that, the meeting took the form of a round table discussion where interesting points regarding E-Waste and the money-saving nature of FOSS (Microsoft has been targeting institutions for pirated copies, not home users) were raised by Mr. Bijumon, Asst Professor of Model Engg College. Mr. Jayasreekumar, who is a journalist, raised the important issue that the downtrodden, or the people in the lower socio-economic belt, were not part of the discussion, and spoke about the picture of a digital divide that this paints. We have to seriously increase the diversity of participants in these meetings, as a large part of the population has no representation in them. Such meetings will be fruitful only if the sidelined communities, who should also benefit from this policy, are brought in to participate in them.

The general theme of the meeting pointed towards how the IT policy should focus more on the internal market, and how it should help entrepreneurs compete with foreign competitors, at least in the domestic market.

News Coverage | PC: Deshabhimani

More and more meetings of this nature are a must, if the state is to advance in the domain of IT.

July 21, 2016 12:00 AM

July 20, 2016

Anwar N

work progress in browser-addon

Hi,
About two months have passed. We did a lot of testing on the online braille-input tool, and some widgets were rearranged for user comfort. In recent weeks we made good progress on both the Firefox and Chrome browser addons. But we still suffer from a great problem with these addons: the plugins do not work in the Google chat and Facebook chat entries. We are seeking a solution...

by Unknown (noreply@blogger.com) at July 20, 2016 08:15 PM

July 19, 2016

Balasankar C

GSoC Update: Week #7 and #8

Heya,

The last two weeks saw less coding and more polishing. I was fixing the LibIndic modules to utilize the concept of namespace packages (PEP 420) to obtain the libindic.module structure. In the stemmer module, I introduced the namespace package concept and it worked well. I also made the inflector a part of the stemmer itself. Since the inflector's functionality was heavily dependent on the output of the stemmer, it made more sense to make it a part of the stemmer rather than an individual package. Also, I made the inflector language-agnostic, so that it accepts a language parameter as input during initialization and selects the appropriate rules file.

In the spellchecker also, I implemented the namespace concept and removed the bundled packages of stemmer and inflector. Other modifications were needed to make the tests run with this namespace concept, such as fixing coverage to pick up the change. On the coding side, I added weights to the three metrics so as to generate suggestions more efficiently. I am thinking about formulating an algorithm to make the comparison of suggestions and root words more efficient. Also, I may try handling spelling mistakes in the suffixes.

This week, I met with Hrishi and discussed the project. He is yet to go through the algorithm and comment on it. However, he made a suggestion to split out the languages into separate files and make __init__.py cleaner (just importing these split language files). He was ok with the work so far, as he tried out the web version of the stemmer.

Hrishi testing out the spellchecker

July 19, 2016 12:47 AM

July 16, 2016

Sreenadh T C

Mentioning the huge contributions

“ In open source, we feel strongly that to really do something well, you have to get a lot of people involved. — Linus Torvalds ”

I have always loved the idea of Open Source and have been fortunate enough to be participating in one of the world’s best platforms for a student to develop, grow, and learn. Google Summer of Code 2016 has gone past its mid-term evaluation, and so have I. The last couple of weeks have gone at a slower pace compared to the weeks in June.

contribution graph — May-June-July

This is simply because I was ahead of my schedule during the mid-term evaluation period, and I didn’t want to rush things and screw them up. But I thought this is the right time to mention the contributions that have been made towards this Open Source project.

Gathering recordings or speech data for training would mean that a lot of people have to individually record their part, and then send it to me. Now this might seem simple enough to some of you out there, but believe me, recording 250 lines or sentences in Malayalam with all its care is not going to be that interesting.

Nonetheless, pull requests have been piling up on my Repository since the early days of the project. The contribution has been really awesome.

What more can you ask for when your little brother, who has absolutely no idea what the heck I am doing, decides to record 250 sentences in his voice so that I can successfully complete the project! (** aww… you little prankster… **)

And he did all this without making much of a mistake or even complaining about the steps I instructed him to follow. He was so careful that he decided to save only after I had confirmed each and every sentence as he recorded them. (** giggles **). For those who are interested in knowing what he contributed, take a look at this commit and this. Oh and by the way, he is just 11 years old :) .

To not mention other friends along with this, would be unfair.

So here is a big shout out to all 18 other guys and gals without whom this would not have reached this far.

I know this blog post was not much about the project from one angle, but looked at from another point of view, this is one of the most important parts of my GSoC work.

With the final evaluation coming up in 4.5 weeks or so, it is time to start wrapping up my work and put together a final submission in such a way that someone with the same enthusiasm, or even more, can take it up and continue to improve it.

I guess that’s it for this week’s update. More to follow as I near the completion of this awesome experience.

puts "until then ciao!"

by Sreenadh T C at July 16, 2016 06:23 AM

July 11, 2016

Arushi Dogra

Update on work

The week started with continuing the task of detecting supported locales. I was facing some problems initially. I first tried to change the contents of a static file at runtime, which I later realised couldn’t be done. So, as directed by my mentor, I changed the approach and decided to prompt the user at setup time about which languages might not be supported by the phone.
It looks something like this:

Screenshot_2016-07-12-00-24-21

Unfortunately my system crashed, and the later part of my time went into formatting the laptop, taking a backup, installing the OS and setting up the project again. Then I went home for my parents' wedding anniversary for 3 days.

My next task: improving the setup wizard. Since the user might not be interested in all the languages, instead of showing all the layouts at once, we are planning to first ask the user to choose the language and then the corresponding layout in it. I have to discuss this task more with Jishnu.

by arushidogra at July 11, 2016 07:16 PM

July 06, 2016

Anwar N

Braille-Input-Tool : The final touch

Hi,

In these two weeks we did a lot of testing with users and made many additions according to their needs. The first one is key reassigning. As you know, there are many keyboard variants, and users also like to set their own keys instead of using f, d, s, j, k and l. But this makes it necessary to save user preferences, so we did this using jStorage. It's working fine.
https://github.com/anwar3746/braille-input/commit/9e8bb0b5ef9a54d61dfa5081d0966ec9d10f01a0


Key reassigning can be done by clicking the "Configure Keys" button, which pops up a set of entries where users can remap their keys. A Restore option is also provided there.
https://github.com/anwar3746/braille-input/commit/3d3469ab8a68711ba0189d61f02c7231297ded3a


New and Save are the basic things that should be provided by an online editor.
https://github.com/anwar3746/braille-input/commit/074829d2f4be81b7fa984931a90a108e3bac03ab

Changing font color, font size and background color is very important for partially sighted people. To keep the page accessible, we chose a combobox containing a list of major colors instead of providing a graphical color picker.
https://github.com/anwar3746/braille-input/commit/f1f6d3de308386d08977f40bc417c4c1ac0b3eb9

Various bugfixes
https://github.com/anwar3746/braille-input/commit/9b8cbc8d54051e9cb330514aacc6d8e6066cf7c6
https://github.com/anwar3746/braille-input/commit/d8127ceb3dc567bfb1778a437d29c2cfe989b24f
https://github.com/anwar3746/braille-input/commit/d3a01c17db64d4fabbad29b18d605992b633270f
https://github.com/anwar3746/braille-input/commit/f34104bfb55c3e4e7735a23016ee913311444702

Braille-Input-Tool : http://anwar3746.github.io/braille-input/
See all commits : https://github.com/anwar3746/braille-input/commits/gh-pages


by Unknown (noreply@blogger.com) at July 06, 2016 09:09 PM

July 05, 2016

Balasankar C

GSoC Update: Week #5 and #6

Heya,

The last two weeks were spent mostly on getting a basic spellchecker module to work. In the first week, I tried to polish the stemmer module by organizing tags for different inflections in an unambiguous way. These tags were to be used in the spellchecker module to recreate the inflected forms of the suggestions. For this purpose, an inflector module was added. It takes the output of the stemmer module and reverses its operations. Apart from that, I spent time testing out the stemmer module and made many tiny modifications like converting everything to a single encoding, using Unicode always, and above all changed the library name to an unambiguous one - libindic-stemmer (the old name was stemmer, which was way too general).

In the second week, I forked out the spellchecker module, converted the directory structure to match the one I've been using for other modules and added a basic building-testing-integration setup with the pbr-testtools-Travis combination. Also, I implemented the basic spell checking and suggestion generation system. Like the stemmer, it uses marisa_trie to store the corpus. Three metrics were used to generate suggestions - Soundex similarity, Levenshtein Distance and Jaccard's Index. With that, I got my MVP (Minimum Viable Product) up and running.
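As an illustration of how two of these metrics can be combined to rank suggestions, here is a small Python sketch. It is not the spellchecker's actual code: the function names, the bigram-based Jaccard index and the equal weights are assumptions made for this example, and Soundex is left out because it depends on language-specific tables. The sketch also compares Unicode code points, which for Indic scripts is not the same as comparing visible letters.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard_index(a, b, n=2):
    """Jaccard index over the sets of character n-grams of two strings."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_suggestions(word, candidates, w_lev=0.5, w_jac=0.5):
    """Sort candidates by a weighted mix of the two similarity scores."""
    def score(candidate):
        longest = max(len(word), len(candidate)) or 1
        lev_similarity = 1 - levenshtein(word, candidate) / longest
        return w_lev * lev_similarity + w_jac * jaccard_index(word, candidate)
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    print(rank_suggestions("helo", ["world", "hello", "help"]))
```

In a real spellchecker the candidates would come from the stored corpus rather than a hand-written list, and the weights would need tuning.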

So, as of now, spell checking and suggestion generation works. But, it needs more tweaking to increase efficiency. Also, I need to formulate another comparison algorithm, one tailored for Indic languages and spell checking.

On a side note, I also touched the indicngram module, ported it to support Python3 and reformatted it to match the proposed directory structure that I have been using for other modules. A PR has been created and I am waiting for someone to accept it.

July 05, 2016 01:57 PM

June 26, 2016

Sreenadh T C

Hours of data piling up!

drifting along, calm and composed! *wink*

Howdy everyone, well, it's exactly midway through Google Summer of Code 2016, and everything has been going as per the schedule and plan, as I type this looking at the matte screen of the Asus Zenbook that just arrived. No more criticizing the electricity and rain, which I have been doing in my previous posts ( **giggle** ), but the internet connectivity still haunts me.

The week started off with spending a day setting up the new Zenbook with dual boot, installing dependencies on Ubuntu (sudo apt-get install blah-blah ), setting up git and the repo, and on the other hand hoping that Windows will finish updating… … …one day… …! Ultimately, I decided to turn every automatic thing off ( **duh** ) so that I can squeeze some speed out of my Broadband connection ( -___- ).

Anyways, the completion of transcribing the dictionary to its phonetic representation means I can now concentrate on collecting the training voices from all the contributors. Almost 12 of the speakers have completed their quota of sentences and around 8 speakers are remaining. Once this is completed, I can actually begin the reorganizing of database and then start the training using that database.

In the meantime, there are other files to set up: the file containing the ‘phones’ alone ( ml.PHONE ), the file that contains the relative paths to the audio files in the wav directory ( ml.FILEIDS ), for example “wav/speaker1/file_1.wav”, and the filler file that contains the phonetic representation of sounds and disturbances for a more accurate recognition ( ml.FILLER ).

Talking about making the ml.FILEIDS file, mapping 4993 sentences from 15+ folders, with each one having exactly 250 wav files, is not going to be easy. But there is a catch: Notepad++ is there to the rescue. Column edit mode ( Alt + Shift + up/down ) and column replace with the increment decimal option are available, which will save the time of writing down each file name.

Note: the column edit will only work as long as the character we want to replace is in the same column. Now, since the file id is of the form speaker/file_# , I can easily select the # column and replace it with the decimal increment option: 1,2,3,4…
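For anyone who would rather script this than use an editor, the same list can be generated in a few lines of Python. This is only a sketch of an alternative, not what was done for the project; the speaker folder names and the function name are placeholders, and the speaker/file_# form is taken from the note above.

```python
def fileids(speakers, per_speaker=250):
    """Return one 'speaker/file_#' entry per recording, numbered from 1."""
    return [
        f"{speaker}/file_{number}"
        for speaker in speakers
        for number in range(1, per_speaker + 1)
    ]


if __name__ == "__main__":
    # Placeholder folder names; the real ones depend on the repository layout.
    speakers = [f"speaker{i}" for i in range(1, 4)]
    entries = fileids(speakers, per_speaker=250)
    print(len(entries))
    print("\n".join(entries[:3]))
```

Writing the joined entries to ml.FILEIDS is then a single file write, and the numbering cannot drift out of its column the way it can in an editor.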

So, that’s how the week has panned out, and I am hoping to continue this good run of form ( * That’s the football side of me typing. Euro 2016 commentary style * ).

puts "until then ciao!"

by Sreenadh T C at June 26, 2016 09:59 AM

Srihari,

Are you referring to the bash command used in the experiment I described, or to the ruby scripts from my previous posts? I used the scripts to extract the sentences and words from the subtitle file. The same script proved useful in many related situations during the course.

I didn’t have to sit for long time to figure out the script and was not sure if np++ had option for extraction :)

by Sreenadh T C at June 26, 2016 05:56 AM

June 25, 2016

Arushi Dogra

Weekly Blog

I was given the task of detecting whether a language is supported by the keyboard or not. On my phone Punjabi is not supported, so I did all the testing with that. Whenever a language is not supported, it is displayed as blank, which gave me an idea of how to work on this issue. So I created the bitmap for all the characters of the language and compared it with an empty bitmap. If the language was not supported, it had an empty bitmap, and I declared it as not supported.

What I have to improve: currently it checks every time the keyboard opens. So I will change it so that it checks all languages during the setup wizard and stores the info.

My task for next week is to run the check for all languages in the setup wizard and to mark the unsupported languages in the list as not supportable, so that the user knows.

by arushidogra at June 25, 2016 01:35 PM

June 23, 2016

Balasankar C

GSoC Update: Week #3 and #4

Heya,

[Sorry for the delay in the post]

I spent the last two weeks mainly testing out the stemmer module and the defined rules. During that, I found that a rule-based model has many issues, because different types of inflections on different parts of speech can yield the same inflected form. This can be solved only by a machine learning algorithm that incorporates a morphological analyzer, and is hence out of the scope of my proposal. So I decided to move forward with the stemmer.

I tried to incorporate handling of inflections of verb - like tense change - using rules and was able to do a subset of them. Rest of the forms need more careful analysis and I've decided to get the system working first and then optimize it.

I've also decided to tag the rules so that a history of stemming can be preserved. The stemmer will now generate the stem as well as the tags of rules applied. This metadata can be useful to handle the problem of same letter being inflected to different forms that I faced while developing VibhakthiGenerator.
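The idea of keeping a history of applied rules can be sketched as follows. This is a toy illustration only: the rules are made-up English suffix rules, and the rule format, tag names and function name are assumptions for this example, not the ones used in the indicstemmer codebase.

```python
# (suffix, replacement, tag) -- toy English rules, purely hypothetical
RULES = [
    ("ies", "y", "PLURAL"),
    ("s", "", "PLURAL"),
    ("ing", "", "PROGRESSIVE"),
]


def stem_with_tags(word, rules=RULES, min_stem=2):
    """Strip suffixes repeatedly and record the tag of every rule applied."""
    tags = []
    changed = True
    while changed:
        changed = False
        for suffix, replacement, tag in rules:
            # Only apply a rule if enough of the word is left afterwards.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: len(word) - len(suffix)] + replacement
                tags.append(tag)
                changed = True
                break
    return word, tags


if __name__ == "__main__":
    print(stem_with_tags("readings"))
```

Because the tags record which rules fired and in what order, a later step (such as an inflector) can in principle replay them in reverse to rebuild an inflected form from a stem.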

I spent some time cleaning up the code more and setting up local testing tools like a CLI and a web interface.

The PR was accepted by Vasudev and the changes are currently a part of the indicstemmer codebase.

BTW, it is time for the Midterm evaluations of GSoC 2016, where the mentors evaluate the progress of the students and give a pass/fail grade to them. Also, the students get to evaluate the mentors, communication with them and their inputs. I have already completed this and am waiting for my mentor to finish it. Hopefully, everything will go well.

June 23, 2016 03:04 PM