Searching through PDF and DOC files with TWiki KinoSearch and IndigoPERL

Wikis are great but something that has been bothering me is that I wasn't able to search through the contents of Adobe PDF or Microsoft Word DOC files.   IMHO, one of the biggest benefits of wikis is the ability to search.  If there are lots of documents that are unsearchable then there is a lot of information I can't find when I may need it.

 

TWiki runs well on Windows.  It's entirely written in PERL (in this case, IndigoPERL).  There are several TWiki extensions and addons for searching PDF and DOC files but I couldn't get them to work because they all depend on PERL modules that wouldn't install on IndigoPERL (via the CPAN shell).

 

After trial and error I've found that a module will sometimes fail to install from the CPAN shell (i.e., via the CPAN PERL module) but will install without problems by hand.  In some cases you have to download an earlier version of the module and build it yourself because the most recent version won't pass its build tests.  The CPAN Search Site is an absolute godsend when it comes to manually installing CPAN modules.

 

Anyway, these are the steps I took to get the SearchEngineKinoSearch TWiki plugin working for TWiki (4.0.5) on Windows (Server 2003 R2) using IndigoPerl (5.8.6):


  1. Install VC++ from Visual Studio .NET 2003 into c:\vs2003 (or another directory whose name has no spaces and is shorter than 9 characters).  Don't forget to include the Windows SDK tools.
  2. Find and download a copy of pl2bat.pl.  Save it to your IndigoPERL bin directory.
  3. When manually installing PERL modules always use the Visual Studio 2003 command prompt (or run vcvars32.bat first) so that IndigoPERL can find the VC++ compiler and nmake.  Manually installing is usually as simple as extracting the module into a directory, changing into that directory and running the following commands:

    1. perl Build.PL (for older modules replace Build.PL with Makefile.PL)
    2. Build (for older modules replace this with nmake)
    3. Build test (for older modules replace this with nmake test)
    4. Build install (for older modules replace this with nmake install)

  4. The PERL CPAN shell (perl -MCPAN -e shell) expects to find several GNU tools, so install the Windows versions of the following: GnuPG, grep, gzip, tar and unzip.
  5. KinoSearch relies on other programs to extract the text from PDF and DOC files so install xpdf and antiword.
  6. Manually install the following CPAN modules (the version is important): ExtUtils-CBuilder (version 0.21), Spreadsheet-ParseExcel (version 0.32), KinoSearch (version 0.161).  These can be downloaded from the CPAN Search Site.
  7. After manually extracting SearchEngineKinoSearchAddOn.zip into the twiki dir, modify DOC_antiword.pm and PDF.pm (both in twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/) as follows:

    1. In PDF.pm change system("pdftotext", $filename, $tmp_file, "-q") to system("c:\\bin\\pdftotext.exe", $filename, $tmp_file, "-q"); where c:\\bin\\pdftotext.exe is the fully qualified path of your pdftotext executable (this is a part of xpdf for Windows).  The \\ are necessary since \ is the escape character in double-quoted PERL strings.
    2. In DOC_antiword.pm replace "antiword" with the fully qualified path for your antiword.exe (e.g., c:\\bin\\antiword.exe).

  8. [optional] On my system I've had to modify twiki/lib/TWiki/Sandbox.pm by changing normalizeFileName() to return @result; instead of return join '/', @result; just to get TWiki to work.

 

Converting a DirectX Surface to a GDI+ Bitmap

DirectX surfaces stored in video memory can be rendered a lot faster than images rendered through GDI+ (the object-oriented Graphics Device Interface API on Windows).  GDI+, on the other hand, supports saving images to a lot more formats.

 

I wanted to save a DirectX surface as a JPEG.  There's a Direct3D extension function that supports this: D3DXSaveSurfaceToFile().  Unfortunately, in my experience it only works with surfaces that are square (width = height).  So to save a non-square JPEG I needed to convert from an IDirect3DSurface9 to a GDI+ Bitmap object and then use GDI+'s encoders to save it as a JPEG.

 

To do this, immediately after the call to IDirect3DDevice9::Present() do the following:


  • Get the current render target via GetRenderTarget(0), then get the dimensions of the render target via GetDesc().
  • Create a buffer in system memory (not video memory) to hold a copy of the current display.  This is done with CreateOffscreenPlainSurface() with D3DPOOL_SYSTEMMEM.
  • Use GetRenderTargetData() to copy (blit) the current display to the offscreen buffer.
  • Get a pointer to the offscreen pixel buffer via LockRect().
  • Create a GDI+ Bitmap class using the constructor that takes width, height, pitch, pixel format and a pointer to the pixel buffer.

    • Use the pitch of the offscreen pixel buffer (it is returned in the D3DLOCKED_RECT filled in by LockRect()), not the pitch of the original surface.
    • The pixel format of the bitmap must match the pixel format of the surface; this was specified when the IDirect3DDevice9 was created.

  • Use the Bitmap->Save() method (defined in Image, which Bitmap inherits from) to write the file as a JPEG.
  • Unlock the rect via UnlockRect() and release any COM interface pointers that were either obtained directly via QueryInterface() or obtained indirectly by calling a function that returns an interface.

The GDI+ API docs have an excellent description of using the encoders along with a very handy GetEncoderClsid() function.
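
The point about pitch in the steps above is easy to get wrong, so here is a small, portable sketch of what pitch means. It is not Direct3D code: packRows() and its parameter names are made up for illustration, and it only assumes that each row of the locked buffer starts pitch bytes after the previous one, with pitch possibly larger than width times bytes per pixel.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of Direct3D or GDI+): copy `height` rows of
// `rowBytes` meaningful bytes each out of a buffer whose rows start `pitch`
// bytes apart, into a tightly packed vector with no padding between rows.
std::vector<std::uint8_t> packRows(const std::uint8_t* src, int pitch,
                                   int rowBytes, int height)
{
    assert(src != nullptr);
    assert(rowBytes > 0 && height > 0);
    assert(pitch >= rowBytes); // rows may be padded, never shorter than the data

    std::vector<std::uint8_t> out(static_cast<std::size_t>(rowBytes) * height);
    for (int y = 0; y < height; ++y) {
        // Row y starts at y * pitch in the source, but at y * rowBytes in the
        // packed copy. Using rowBytes for both would skew every row after the
        // first whenever the source has padding.
        std::memcpy(out.data() + static_cast<std::size_t>(y) * rowBytes,
                    src + static_cast<std::size_t>(y) * pitch,
                    static_cast<std::size_t>(rowBytes));
    }
    return out;
}
```

Passing the real pitch to the GDI+ Bitmap constructor as its stride argument avoids this copy altogether. The sketch is only meant to show why the stride has to be the pitch of the buffer actually being read.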

Copying one big drive/disk/folder to 2 or 3 smaller drives/disks

I needed to copy one big drive (~700GB) to 3 smaller drives.  Since this was going to take a while I didn't want to have to deal with any interactive input.  I basically wanted to copy files to the first drive until it was full then start copying to the 2nd until it was full and, lastly, to copy to the 3rd drive.

 

A batch file to do this follows.  It needs to run from a Windows XP or later command shell with delayed expansion enabled (e.g., cmd /v:on) as well as command extensions enabled (the default on Windows Server 2003).

 

This isn't terribly well tested but it worked for my purposes.  If the script encounters a file too big to be stored on the first drive it tries copying it to the second; if it's too big for the second it tries copying it to the third.  If that fails then it prints a line in a copyspan_errors.txt file with the name of the file.

 

Also, it uses the archive attribute to keep track of what should and should not be copied - as long as no other programs are modifying this bit then it should be safe to stop and restart the script with the same arguments.  I ran it with "copyspan.bat d:\ f: g: h:" so that everything on the very large d:\ drive would be backed up to f:, g: and h:. 

 

The batch file's contents are:

 

@echo off

REM This batch must run from a command prompt that has
REM delayed expansion enabled (/V:on)

set SRC=%1
set DEST1=%2
set DEST2=%3
set DEST3=%4


REM set the archive bit on the source
REM echo %DATE% %TIME% Setting archive bit in %SRC%
attrib +A "%SRC%*" /S /D


echo %DATE% %TIME% Copying files
rem /M = only copy archive files then clear archive bit
set CPCMD=xcopy /M /F /H /K /Y

if exist copyspan_errors.txt del /q copyspan_errors.txt

set ATTR=
for /R "%SRC%" %%F IN (*) DO (
  for /F "tokens=1*" %%i in ('attrib "%%F"') DO set ATTR=%%i

  if "!ATTR!" EQU "A" (
    %CPCMD% "%%F" "%DEST1%%%~pF"

    if !ERRORLEVEL! NEQ 0 %CPCMD% "%%F" "%DEST2%%%~pF"

    if !ERRORLEVEL! NEQ 0 %CPCMD% "%%F" "%DEST3%%%~pF"

    if !ERRORLEVEL! NEQ 0 echo "!DATE! !TIME! ## error copying %%F" >> copyspan_errors.txt
  ) ELSE (
    echo Skipping already archived file "%%F"
  )
)

:end

echo %DATE% %TIME% Copy complete
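
For anyone who would rather not read batch syntax, the try-the-next-drive logic in the loop above boils down to the following C++ sketch. The names copyWithFallback and tryCopy are made up for illustration. The caller supplies tryCopy (for example a wrapper around whatever copy routine is in use) and it is assumed to return false when a destination is full, which plays the role of xcopy's non-zero ERRORLEVEL.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical signature: attempt to copy `file` to `dest`, return true on success.
using TryCopy = std::function<bool(const std::string& file,
                                   const std::string& dest)>;

// Try each destination in order and stop at the first one that accepts the
// file. Returns the index of that destination, or -1 if every attempt failed
// (the case the batch file logs to copyspan_errors.txt).
int copyWithFallback(const std::string& file,
                     const std::vector<std::string>& dests,
                     const TryCopy& tryCopy)
{
    assert(!dests.empty());
    for (std::size_t i = 0; i < dests.size(); ++i) {
        if (tryCopy(file, dests[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Like the batch file, this always starts from the first destination, so a small file can still land on an earlier drive after a large one has spilled over to a later drive.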

Turning COM Errors and HRESULTs into .Net Exceptions

.NET will automatically convert an HRESULT into a specific type of Exception based on the HRESULT.  This is fine for many of the built-in HRESULTs that are closely tied to Win32-style function calls.  In my case though, I wanted to turn an application-specific error that occurs in an ATL COM server (ATL 3.0 no less!) into a pretty Exception so that managed code would get a useful error message.

To do this:
  1. Create an empty error object via CreateErrorInfo( &pcerrInfo )
  2. Set the description via pcerrInfo->SetDescription().  CreateErrorInfo allocates a pointer to an object that implements ICreateErrorInfo.  SetDescription() is a method defined on ICreateErrorInfo.  Whatever gets set here will be assigned to the .NET Exception.Message property.
  3. Call SetErrorInfo(0, perrInfo) where perrInfo is a pointer to IErrorInfo.
    1. To get the perrInfo pointer, call pcerrInfo->QueryInterface(IID_IErrorInfo, (LPVOID*) &perrInfo);
    2. SetErrorInfo() sets the error object for the current thread (actually it clears the current error object then sets it to the new one).
  4. After Release()-ing both pointers, return an error HRESULT (e.g., return E_FAIL).
    1. Don't make any other COM calls before returning the HRESULT as these might overwrite the error object just set.
  5. In .NET this will show up as a COMException where COMException.Message = IErrorInfo->GetDescription()

Since this will be used a lot, I put it into a single function that takes a string.  In practice, I immediately return an error HRESULT after calling the function.


void
CSimpleTestObj::AppSetErrorInfo(LPWSTR pszErr)
{
    ICreateErrorInfo *pcerrinfo;
    IErrorInfo *perrinfo;
    HRESULT hr;

    hr = CreateErrorInfo(&pcerrinfo); // create generic error object
    if (SUCCEEDED(hr))
    {
        // set the text - in .NET this will map to Exception.Message
        pcerrinfo->SetDescription(pszErr);

        // need to get IErrorInfo because this is what the .NET code will see
        hr = pcerrinfo->QueryInterface(IID_IErrorInfo, (LPVOID FAR*) &perrinfo);
        if (SUCCEEDED(hr))
        {
            // SetErrorInfo sets the error object for the current thread
            // .Net will take it, make an exception,
            // then set COMException.Message = IErrorInfo.GetDescription()
            SetErrorInfo(0, perrinfo);
            perrinfo->Release();
        }
        pcerrinfo->Release();
    }
}