ACQUA: Automated Community-based Question Answering through the Discretisation of Shallow Linguistic Features


  • George Gkotsis, Knowledge Media Institute, Open University
  • Maria Liakata, Department of Computer Science, University of Warwick
  • Carlos Pedrinaci, Knowledge Media Institute, Open University
  • Karen Stepanyan, London School of Business and Management
  • John Domingue, Knowledge Media Institute, Open University



This paper addresses the problem of determining the best answer in Community-based Question Answering (CQA) websites by focussing on the content. In particular, we present a novel system, ACQUA, which can be installed as a plugin on the majority of browsers. The service offers a seamless and accurate prediction of the answer to be accepted. Our system is based on a new approach to processing answers in CQAs. Previous research on this topic relies on community feedback on the answers, which involves rating either users (e.g., reputation) or answers (e.g., scores manually assigned to answers). We propose a technique that instead leverages the content and textual features of answers. Our approach delivers better results than related linguistics-based solutions and manages to match rating-based approaches. More specifically, the gain in performance is achieved by rendering the values of these features into a discretised form. We also show that our technique delivers equally good results in real-time settings, without having to rely on information that is not always readily available, such as user ratings and answer scores. We ran an evaluation on 21 StackExchange websites covering around 4 million questions and more than 8 million answers. We obtain 84% average precision and 70% recall, which shows that our technique is robust, effective, and widely applicable.
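To make the idea of discretising shallow features concrete, the sketch below shows one possible reading of it. The abstract does not specify the discretisation scheme, so the choice made here (replacing each raw feature value with its rank among the answers to the same question) is an assumption for illustration only, as are the feature set and the function names `shallow_features` and `discretise_by_rank`.

```python
import re
from typing import Dict, List


def shallow_features(text: str) -> Dict[str, float]:
    """Compute a few illustrative shallow features of an answer.

    The feature set is hypothetical; the abstract does not list the features used.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "length": float(len(text)),
        "word_count": float(len(words)),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


def discretise_by_rank(answers: List[str]) -> List[Dict[str, int]]:
    """Replace each raw feature value with its rank within the question thread.

    Assumed scheme: rank 1 is the largest value, and tied values share a rank.
    """
    raw = [shallow_features(a) for a in answers]
    ranked: List[Dict[str, int]] = [dict() for _ in answers]
    if not raw:
        return ranked
    for name in raw[0]:
        # Distinct values sorted from largest to smallest define the ranks.
        distinct = sorted({r[name] for r in raw}, reverse=True)
        rank_of = {value: i + 1 for i, value in enumerate(distinct)}
        for i, r in enumerate(raw):
            ranked[i][name] = rank_of[r[name]]
    return ranked


if __name__ == "__main__":
    thread = [
        "Use a list.",
        "You can use a list here. It keeps insertion order and supports indexing.",
        "No idea",
    ]
    for answer, features in zip(thread, discretise_by_rank(thread)):
        print(features, "<-", answer)
```

Under this assumed scheme, a classifier sees each answer relative to its competitors in the same thread rather than in absolute terms, and the ranks can be computed as soon as the answers are posted, without waiting for user ratings or answer scores.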

