Today’s interview is with Mark Hall, one of the original core developers of the WEKA (Waikato Environment for Knowledge Analysis) data mining software. Mark currently works at Pentaho Corporation as technical lead for the WEKA project, among other duties. We go into WEKA’s history and cover a lot of resources for the machine learning enthusiast. Do not miss the video tutorial at the end of the interview. Enjoy!
F4S: Hello Mark. Please give us a brief introduction about yourself.
I am one of the original core developers of the WEKA data mining software and I currently work at Pentaho Corporation as technical lead for the WEKA project, data mining consultant and software engineer. Prior to joining Pentaho, I held teaching and postdoctoral fellowship positions at the University of Waikato in New Zealand. I have 15 years of experience as an academic researcher in computer science and have published extensively in machine learning and data mining conferences and journals.
F4S: What is WEKA?
WEKA is a toolbox of machine learning algorithms for data mining tasks. It comprises many state-of-the-art algorithms for supervised and unsupervised learning and provides a framework for performing principled experimental comparison of the results of learning.
F4S: Why and when did WEKA come to be?
WEKA grew out of a project funded by the New Zealand government in the early 1990s for investigating the application of machine learning to agricultural domains. The software came about through the perceived need for a unified workbench that would allow researchers easy access to state-of-the-art techniques in machine learning. At the time of the project’s inception in 1992, learning algorithms were available in various languages, for use on different platforms, and operated on a variety of data formats.
F4S: In which language(s) and platform(s) is the project developed?
Originally, the workbench was developed to run under Unix systems and was written primarily in C, with some Prolog-based evaluation routines and a Tcl/Tk-based user interface. However, by the late 1990s it was becoming difficult to maintain and difficult for users to install and get working. This was due to factors such as changes to supporting libraries, management of dependencies and complexity of configuration. In 1999 the software was completely rewritten in Java.
F4S: Does WEKA have sponsors?
The University of Waikato remains the home of the open source WEKA software. In 2006 Pentaho became a major sponsor of the project.
F4S: How are the sponsors supporting the project?
Pentaho provides development resources and promotes the software in the open source community.
F4S: How many users do you estimate WEKA has?
It’s hard to say, but I’d guess there are many thousands of users. The WEKA mailing list has over 3,500 members and the software is downloaded from Sourceforge on average 10,000 times each week.
F4S: Do you know where WEKA is used?
There is, of course, a large academic following of WEKA. However, many corporations are using (or have used) WEKA as well. Some examples include Ford Motor Company, NASA and the UK NHS.
F4S: How many team members does the project have?
The core development team has always been small. At present there are four core developers. Over the years there have been many contributors to the software, both internal and external to the University of Waikato.
F4S: In what areas of WEKA development do you currently need help?
One area in which WEKA has been criticized is its graphical user interfaces, which are often perceived as being a bit dated now. I’ve made some progress in modernizing WEKA’s “Knowledge Flow” UI, but it would be good to get some input and ideas on sprucing up the “Explorer” and “Experimenter” interfaces.
F4S: How can people get involved with the project?
In WEKA 3.7.2 we introduced a package management system (similar to the one in the R statistical software) that makes it much easier for folks to contribute. Contributors can create and maintain their own projects for their contribution(s) and host the corresponding installable packages wherever they wish. Information on the package management system and contributing can be found at:
F4S: What features are in the roadmap?
Improving support for the PMML standard, in particular the ability to export PMML models from WEKA. Connectivity for NoSQL databases (HBase, Cassandra, MongoDB, etc.). Support for data stream mining for text classification in the Knowledge Flow environment. Ongoing UI improvements.
F4S: Which projects, blogs or sites related to open source software for science can you recommend?
Bernhard Pfahringer at the University of Waikato maintains a good list of machine learning/data mining-related blogs:
KDNuggets is one of my favorite sites for information on all things related to knowledge discovery and data mining. This site maintains a list of open source (and proprietary) analytic tools:
F4S: Why do you consider free/libre open source software important for the advancement of your field?
Open source software is crucial in machine learning for encouraging the development of new techniques and facilitating experimental comparison. Having an active user community with access to the source code helps to grow and improve the software faster.
F4S: Where can people contact you and learn more about WEKA (website, blog, email, twitter, identi.ca, Facebook, etc.)?
The best place to get started and learn about WEKA is at the main WEKA site at the University of Waikato:
There are Wikis for WEKA at wikispaces and Pentaho:
There is also a book that accompanies the software:
I can be contacted via the WEKA mailing list or the forums at Pentaho:
F4S: Thank you Mark for sharing with us more about you and the WEKA project.
Video: WEKA Data Mining Tutorial for First Time and Beginner Users