Wednesday, March 20, 2013

SQL Server 2012 FileTables, Text Files and Full-text Search

In July 2012, fellow SolidQ Mentor and Microsoft MVP, Davide Mauri (@mauridb), discovered an unexpected behavior of the SQL Server 2012 Full-text Search (FTS), and/or Statistical Semantic Search (SSS), while indexing documents placed in a SQL Server 2012 FileTable as TXT files. Instead of the language set in the definition of the full-text index (see CREATE FULLTEXT INDEX in SQL Server Books Online for details) the language determined by the Windows System Locale was used for full-text indexing.

In March 2013 the alleged bug was resolved by Microsoft as "Won't Fix", with the following explanation:

Posted by Microsoft on 6.3.2013 at 8:22
This is by design. In the attached script the Text IFilter would emit the System Locale for the work chunks. And we go by whatever the IFilter emits. The full-text index language will be used when we have a plain text column without the doctype column.

Regardless of the language set in the full-text index definition, the language emitted by the IFilter is used – both for the full-text index as well as for the statistical semantic indexes. Because TXT files do not store any additional metadata, such as the language used in the text, the IFilter resorts to the System Locale in order to "determine" the language. This means that if the language used in the files is different from the one used as the System Locale, the resulting full-text index might not contain correct, or complete, data, and if the language of the System Locale is not supported by the Statistical Semantic Search then the statistical semantic indexes will also not be populated. In other words: the actual language must be determined by other means.



The Solution

To allow multi-lingual documents, or simply documents written in a language different from the one set by the System Locale, that are stored in a FileTable, to be indexed in accordance with the language(s) used in them, these documents need to be stored using a format that supports additional metadata, such as the language of the text, which the corresponding IFilter must be able to retrieve during the full-text index population. For instance, Microsoft Word files, PDF files, or even XML files, allow you to set the language of either the entire document, or of its individual parts. The corresponding IFilters will respect these settings, prompting the use of appropriate word breakers and stemmers for FTS/SSS processing.

In the above Connect item, Davide used a Transact-SQL script to create a couple of text files, insert them into a FileTable, and then index them using both the "regular" full-text indexing as well as the statistical semantic indexing. The files contained English text, the language set for the full-text index was also English; however, in his environment the System Locale was set to Italian. The latter was then actually used for full-text index population, which is not what Davide (or me, for that matter) would have expected.

In the amended version of Davide's script I have used the XML format to store the documents as files, and inside the XML I specified the correct language of each document by using the xml:lang attribute. Download the script (right-click, then Save target as...) and try it out yourself.



ML

No comments: