August 10, 2006
@ 09:57 AM
Recently I have been working on a lot of data migration to XML - of all kinds - PDF documents to XML, Word documents to XML, SGML to XML. My love for the angle brackets is obvious. For the kind of data that I migrate, there aren't any suitable off-the-shelf tools available. The approach is to come up with a customized migration engine that performs an incremental migration on the input.

PDF to XML - the rules
  • PDF doesn't contain any structure information. MS Word and other word processors all hold information in objects - paragraphs, pictures, tables, lists. This information is usually accessible through an object model that the application exposes via an API. PDF, on the other hand, is like a canvas - with text and images painted on a flat white page.
  • There are tools that help you create a PDF file from an object-based structure consisting of paragraphs, images etc., but none of them can parse a PDF back into an object structure well enough (some of them, like PDFBox, can extract text, but still no structure). If you extract information from a PDF file, what you get is a dump of all the text with positional (X and Y coordinates on the page) and font information. I have used PDF2HTML for this before - it works well with single-column PDF documents.
  • PDF does have some information in an object model – the bookmarks, TOC etc. – that can be extracted using some of the available libraries, but that information is rarely of any use.
PDF to XML – what do you want to do with it?
  • The objective is to produce an XML file that is usually hierarchical (sublists within lists, images/paragraphs/tables inside the sublist items) – like an MS Word document with outline numbering.
PDF to XML – approach
  • The migration engine we developed at Imfinity was a template based application, reverse-MSWord. When you are writing a Word document, you write text and apply formatting to it by selecting a style.
  • The same was done on the PDF document, but in the reverse direction. Using a divide-and-conquer approach, the document is first divided into the highest-level templates (sections, or perhaps the lists separated out). This can be done by marking a range of page numbers as one section, or as a top-level item in the target DTD. Top-level lists can be isolated using indentation or heading formatting.
  • Going further down the hierarchy, within each high-level element, we mark out a certain font, indentation, spacing – as a template. As soon as one is marked, all text items with the same formatting (based on indentation and font) can be separated and tagged as a particular simple XML element.
  • The above approach incrementally produces a simple structured XML document. This XML document doesn't need to comply with any particular DTD, but it should capture all the fields and text that need to be mapped into the target DTD. A simple structure like the image below might be sufficient.
  • Once we have this basic XML document, we write advanced XSLT scripts to transform the data into the target DTD.
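To make the template idea a bit more concrete, here is a minimal Python sketch. It is not the Imfinity engine. The TextItem fields, the template table and the element names (heading, para, item, unknown) are all made up for illustration, and it assumes the text items have already been dumped from the PDF with their X position and font information.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """One run of text dumped from a PDF page (hypothetical shape)."""
    text: str
    x: float      # left indentation on the page
    font: str
    size: float


# A "template" maps a formatting signature (font, size, indentation)
# to a simple XML element name. Values here are illustrative assumptions.
TEMPLATES = {
    ("Arial-Bold", 14.0, 72.0): "heading",
    ("Times", 10.0, 72.0): "para",
    ("Times", 10.0, 90.0): "item",
}


def signature(item):
    """Formatting key used to match an item against the templates."""
    # Indentation is rounded so tiny layout jitter does not break a match.
    return (item.font, item.size, float(round(item.x)))


def tag_items(items):
    """Tag each item by its matching template; unmatched text becomes <unknown>."""
    root = ET.Element("document")
    for item in items:
        name = TEMPLATES.get(signature(item), "unknown")
        element = ET.SubElement(root, name)
        element.text = item.text
    return root


items = [
    TextItem("1. Scope", 72.0, "Arial-Bold", 14.0),
    TextItem("This section describes the scope.", 72.0, "Times", 10.0),
    TextItem("first list entry", 90.2, "Times", 10.0),
]
print(ET.tostring(tag_items(items), encoding="unicode"))
```

The output is deliberately flat. Anything left as unknown is a hint that another template needs to be marked, and nesting the items into the target hierarchy is left to the XSLT stage.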

 

*Legally, I am not allowed to use the above logo, yet, but I will have the certs soon :)*

I took 70-536 on Monday and 70-526 on Wednesday. Passed! There is very little material available on these new-generation tests. So here I am, trying to compile whatever material I found and used for the little preparation I did.

I haven't used dumps before, so I don't know how that works (and it's illegal ;)).

But the following guides will help. Preparation link for 70-536. Michael's guide provides links to most of the topics listed in the MS-prescribed syllabus. I suggest going through all these links - some briefly and some in detail. The questions come very much from these sections. You will pass if you have some experience working with .NET 2.0 and if you know the basic details of the specific classes and interfaces listed in the syllabus.

Similar guides are available for 70-526 - here and here. But these are not complete. So I took the syllabus as the starting point. With a google firefox window open in parallel, I ran through a lot of these topics. Have tried to add to PublicJoe's good work. The updated 70-526 guide is available - 70-526 Preparation Guide. (Initial links provided by PublicJoe)


 

September 24, 2005
@ 05:34 PM

Does anyone use the built-in CD burning wizard in Windows XP? If you add 600 MB of data to the CD, one cached image is made in some folder on the C drive. OK, fine. But when you start the wizard to begin writing to the CD, XP makes another copy - for some weird reason. By then it has already taken around 20 minutes of your time and 1.2 GB of your system drive. When everything is already on the hard disk, why can't they burn directly from the hard disk? One cached image of the CD is still fine for performance and to prevent buffer underrun, but I don't understand two copies!

So I found this. Wonderful. An interface similar to Ahead Nero Burning ROM, and with all the features. Burns DVDs, writes ISO files, creates boot disks. Oh, and did I mention - it's written in .NET and is absolutely FREE.


 

September 18, 2004
@ 03:33 PM

Cool Java Concept Map in Flash

PInvoke.NET Wiki – for calling unmanaged Win32 api functions …

LookOut – Acquired by MS – super fast email search

Effective Java – Book to read

The Taligent Effect

The Taligent effect is what happens when a group of people put adherence to a software trend first and lose sight of the value of shipping software that people will actually use.

BullShit Generator – Really cool … use this while writing proposals and in presentations :D

Clemens Vasters on Open Source – this started a hot debate on Slashdot … [via Don Box]

Dear Aiden,

I think you remember the conversation we had recently at this software conference in Dublin. You came up to me and told me how the stuff I was talking about was mostly useless, because it is closed-source, people need to pay for it and that companies charging for software are evil anyways – especially Microsoft. Unfortunately I don’t have your email, but I am sure this will reach you.

That was in 1990 – let’s fast forward to 2004 and you. All software that you and your father could possibly be interested in has already been written. That’s probably not true, but it’s hard to think of something, right? Ok, the software may not run on your favorite operation system and may cost money, but what you can immediately think of is likely there. So where do you put all your energy? Into this absolutely amazing open-source project you co-coordinate. I mean, really, the stuff that you and your buddies are doing there is truly impressive. There are a couple of things I’d probably do differently in terms of design and architecture, but it works well and that’s mostly what matters. And you do make an impact as well. I know that hundreds of people and dozens of companies use your stuff. That’s great.

If someone installs your work from disc 3 of some Linux distro, they couldn’t care less who you are. The whole fame thing you are telling me only works amongst geeks. The good looking, intelligent girl over there at the bar that you’d really like to talk to doesn’t care much whether you are famous amongst a group of geeks and neither does she even remotely fathom why you’d be famous for that stuff in the first place. I mean – get real here.

.NET Report Card – nice article on InfoWorld

What is a haiku: Haiku is a form of poetry popular in Japan, which is becoming more widely appreciated around the world in this century. Haiku writers are challenged to convey a vivid impression in only 17 Japanese characters.

Avalon is cool:

Spellchecking is enabled either in XAML by writing

 <TextBox IsSpellCheckEnabled="True" /> 

or in code by writing

 TextBox.IsSpellCheckEnabled = true; 

 

December 3, 2003
@ 06:01 AM

Online news aggregator. Rocks. Bloglines is a free service that makes it easy to keep up with your favorite blogs and newsfeeds. With Bloglines, you can subscribe to the RSS feeds of your favorite blogs, and Bloglines will monitor updates to those sites. You can read the latest entries easily within Bloglines.

And it's FREE. You can share subscriptions and put a "Subscribe to Bloglines" link on your blog. It has options for a web-based or Windows notifier.

 

December 3, 2003
@ 06:01 AM

Chip India's article on which technologies can keep you afloat in the job market - the most fundamental requirements of the market. It says

"In today’s job economy, management seeks resilient people with flair in multiple technologies, which indicates that they can cope with current and future development practices."
Surprisingly, C Programming still leads the pack with 20,400 jobs with Java a distant second at 12,600.

[Via: {Sudhakar's .NET Dump Yard;}]

 

December 3, 2003
@ 05:57 AM

VoIP fighting back ?? Vonage just raised $35M

the time wheel

The time wheel.  I like this...  [via Ottmar Liebert ]

[Via: Critical Section]


 

December 3, 2003
@ 05:56 AM

Intel Processor lineup for 2004-05

The article covers Intel's move to processors with several cores and the expansion of wireless capabilities in its chipsets. Otellini said Thursday that desktop processors will include this feature beginning in 2005.

"Tanglewood" processor for servers. "Grantsdale" desktop chipset, due in 2004 - will contain a capability to turn the PC into a wireless access point through the use of software - Digital Home strategy. "LaGrande" which will create a secure "vault" to store data; and "Vanderpool" a technology to allow virtual OS processes to run on the same system. In 2004, Intel expects a "quick toggle" to code-names Dothan and Prescott, Intel's first 90-nanometer processors for the mobile and desktop markets. The Sonoma chipset will also include a connector for a light meter OEMs can install on the motherboard. In daylight, for example, the chipset will dim the LCD backlight to save power. In wireless, Intel will sample a chip that combines Bluetooth and 802.11 wireless in 2004, and ship WiMAX silicon, which will provide wireless "last mile" access to the home. In 2004, Intel will also ship a version of its 32-bit Prescott chip as part of the Xeon family, and add a version of its 64-bit Madison processor with 9MB of cache.


 

November 21, 2003
@ 10:49 PM

Tiger vs Longhorn in 2006 - Steve Gillmor talks to Jonathan Schwartz

SG: He sees the opportunity to build apps on top of that infrastructure.
Schwartz: No company has ever monetized Microsoft's infrastructure in the history of Microsoft.

SG: What are the desktop killer apps, not in 5, but 2 years, that will seed that market, and force a migration off Office?
Schwartz: The killer app for this desktop is price, because China and India and El Salvador and Brazil can't afford a hundred dollars per desktop from Microsoft.

SG: How do you combat the Longhorn vision in a time frame that's going to make some difference?
Schwartz: It's called "Tiger." J2SE 1.5 will deliver lightning performance on that desktop. We've already provided a rich client called Java, but Microsoft wasn't so interested in helping us with our deployment. So we've done our own now – we've signed up over half of the PC industry to ship our J2SE. And as we fold 1.5 seamlessly integrated into Mozilla, that will give us not only an optimized Web services execution environment on the client, it will give us a beautiful portability story onto a much cheaper desktop called the Java Desktop.

SG: But so what. If Microsoft goes off and bakes its stuff into 100 million desktops…
Schwartz: I don't look at Longhorn and say "Oh, my god, they've architected a better automobile." I look at them and say "You're trying to improve on a buggy whip." If you're just another end node on the network, what are you going to deliver to it? Office productivity is just a feature. We're over it, done with that. The real issue is: what are you going to do with peer-to-peer streaming of video?

Name me a software business last year that was $6 billion. It's tough to do – database, maybe. And this year, ring tones will be $8-10 billion. Maybe an enterprise app is worth that kind of money. All of the high value systems going forward are going to be consumer systems.

[Via: Microsoft Watch from Mary Jo Foley]


 

November 14, 2003
@ 05:52 AM

Another attempt at preventing music piracy. Let's see how long this one lasts. Wired News: Top Stories.


 

November 14, 2003
@ 04:36 AM

A code smell is a hint that something has gone wrong somewhere in your code.


 

November 14, 2003
@ 04:07 AM

Paul Hounshell has this article about how to develop applications for which plugins can later be written.


 

Excellent primer on what regular expressions are. Complete regex reference in the appendix. Roy Osherove.


 

November 5, 2003
@ 02:30 AM
All sessions from PDC 2003 ...downloadable slides and code.
 

October 29, 2003
@ 04:35 AM
Great idea here. Instead of shuffling between Yahoo! Briefcase, FTP servers and file splitters, just use this to email your big files. When you email a file using a token, the Token software makes your computer a server. I am not sure how well it works across firewalls, but seems like a good idea altogether.

Joel Spolsky's post also announces this as something that we have always needed. The idea is so simple that it's surprising we have overlooked it for so long.

[Update: I guess good things always come at a price. Even though redeeming a token is free (you still need to download the Redeemer - this should be eliminated somehow to make it more usable), creating a token is not. The basic stuff costs $49 and setting up a Token Server is $1000]
 


Never noticed this before ... but it is quite a useful list of tests on various MS technologies.
 

October 23, 2003
@ 04:01 AM

Hanselman links to this post by Philip Greenspun:

"Our students this semester in 6.171, Software Engineering for Internet Applications have divided themselves into roughly three groups.  One third has chosen to use Microsoft .NET, building pages in C#/ASP.NET connecting to SQL Server.  One third has chosen to use scripting languages such as PHP connecting to PostgreSQL and sometimes Oracle.  The final third, which seems to be struggling the most, is using Java Server Pages (JSP) with Oracle on Linux.  JSP is fantastically simpler than "J2EE", which is the recommended-by-Sun way of building applications, but still it seems to be too complex for seniors and graduate students in the MIT computer science program , despite the fact that they all had at least one semester of Java experience in 6.170.

<snip/>But the programmers and managers using Java will feel good about themselves because they are using a tool that, in theory, has a lot of power for handling problems of tremendous complexity.  Just like the suburbanite who drives his SUV to the 7-11 on a paved road but feels good because in theory he could climb a 45-degree dirt slope." [Philip Greenspun's Blog]


 

Joel Spolsky wrote this article in August 2000 about software cost estimates and schedules. Some interesting extracts.

[Spolsky] Testosterone-crazed game companies like to brag on their web sites that the next game will ship "when it's ready". Schedule? We don't need no stinkin' schedule! We're cool game coders! Most companies don't get that luxury. Ask Lotus. When they first shipped 123 version 3.0, it required an 80286 computer, which wasn't very common then. They delayed the product by 16 months while they worked to shoehorn it into the 640K memory limit of the 8086. By the time they were done, Microsoft had a 16 month lead in developing Excel, and, in a great karmic joke, the 8086 was obsolete anyway!

He refers to this story of the rise and fall of Netscape by one of Netscape's employees. Extracts from there.

[Zawinski] Why? Because the company stopped innovating. The company got big, and big companies just aren't creative. There exist counterexamples to this, but in general, great things are accomplished by small groups of people who are driven, who have unity of purpose. The more people involved, the slower and stupider their union is.

And there's another factor involved, which is that you can divide our industry into two kinds of people: those who want to go work for a company to make it successful, and those who want to go work for a successful company. Netscape's early success and rapid growth caused us to stop getting the former and start getting the latter.

Make a schedule

13 silver bullets about scheduling (Spolsky)

  1. Use MS Excel (MS Project has too many dependencies)
  2. Keep it simple (column and task list based)
  3. Features consist of multiple tasks
  4. Let the programmer make the schedule
  5. Pick fine-grained tasks: As a rule of thumb, each task should be from 2 to 16 hours. If you have a 40 hour (one week) task on your schedule, you're not breaking it down enough. Time it in hours. Only then can it be considered well defined.
  6. Keep track of the original and current estimate for a task: Helps you learn from mistakes
  7. Update every day
  8. Put in line items for vacations: Then add up the Remaining hours column to estimate the ship date
  9. Add debugging time: In principle, developers debug code as they write it. A programmer should never, ever work on new code if they could instead be fixing bugs. The bug count must stay as low as possible at all times
  10. Add integration time: Invariably there will be repeated code and parts that need to be cleaned up for the system overall.
  11. Add a buffer to the schedule
  12. Don't let managers change the estimated time
  13. Features are like blocks of wood: You can't shrink features to accommodate them in the time you have. Either include them, or just leave them out for the next release.
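The spreadsheet arithmetic behind a schedule like this is tiny. Here is a rough Python sketch of it, not Spolsky's actual sheet: the Task fields, the elapsed column and the sample tasks are my own assumptions, and the 16-hour limit is the rule of thumb from point 5.

```python
from dataclasses import dataclass


@dataclass
class Task:
    feature: str
    name: str
    original_estimate: float  # hours, kept so you can learn from mistakes
    current_estimate: float   # hours, updated as you learn more
    elapsed: float            # hours already spent (assumed column)

    @property
    def remaining(self):
        """Hours still to go; never negative, even if a task overran."""
        return max(self.current_estimate - self.elapsed, 0.0)


def remaining_hours(tasks):
    """Sum of the Remaining column: the basis for a ship date estimate."""
    return sum(task.remaining for task in tasks)


def too_coarse(tasks, limit=16.0):
    """Names of tasks above the rule-of-thumb limit that need breaking down."""
    return [task.name for task in tasks if task.current_estimate > limit]


tasks = [
    Task("Spell checker", "Add menu item", 4.0, 4.0, 4.0),
    Task("Spell checker", "Main dialog", 8.0, 12.0, 6.0),
    Task("Spell checker", "Dictionary work", 40.0, 40.0, 0.0),
    Task("Overhead", "Vacation", 16.0, 16.0, 0.0),
]

print("hours remaining:", remaining_hours(tasks))
print("break these down:", too_coarse(tasks))
```

Vacation, debugging, integration and buffer time go in as ordinary rows, so they show up in the total without any special handling.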

 

Google and MapPoint have teamed up to research a new search-by-location technology. Other projects are running at Google Labs. Wish they had more frequent updates and a feed for the Zeitgeist.
 

October 13, 2003
@ 10:44 PM
The Law of Leaky Abstractions explained by Joel Spolsky with very real real-world examples. His example of TCP over IP shows that even though TCP is supposed to hide IP completely, making sure that all transmission from one computer to another takes place reliably without the user ever having to know about IP, sometimes problems in IP leak through the abstraction and don't let TCP work. One such situation: “your pet snake has chewed through the network cable leading to your computer”.
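Here is a toy Python model of that leak. It is nothing like how TCP is really implemented. The class names, the retry count and the drop pattern are invented for illustration. A reliable layer hides dropped packets by retrying, until the link underneath is gone and the failure has to surface to the caller.

```python
class CableChewedError(Exception):
    """Raised when the reliable layer has to admit the lower layer is gone."""


class UnreliableLink:
    """Stand-in for IP: delivers a packet or silently drops it."""

    def __init__(self, drop_pattern):
        # drop_pattern: iterable of booleans, True means "drop this packet".
        self._drops = iter(drop_pattern)
        self.delivered = []

    def send(self, packet):
        # Once the pattern runs out, every packet is dropped (cable is cut).
        dropped = next(self._drops, True)
        if not dropped:
            self.delivered.append(packet)
        return not dropped


class ReliableChannel:
    """Stand-in for TCP: hides drops by retrying, up to a point."""

    def __init__(self, link, max_retries=5):
        self.link = link
        self.max_retries = max_retries

    def send(self, packet):
        for _ in range(self.max_retries):
            if self.link.send(packet):
                return
        raise CableChewedError(
            f"gave up on {packet!r} after {self.max_retries} tries"
        )


# Two drops are hidden from the caller: the abstraction holds.
flaky = ReliableChannel(UnreliableLink([True, True, False]))
flaky.send("hello")

# A dead link cannot be hidden: the abstraction leaks as an exception.
dead = ReliableChannel(UnreliableLink([]))
try:
    dead.send("hello")
except CableChewedError as err:
    print("abstraction leaked:", err)
```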
 

October 11, 2003
@ 03:15 AM
Eric Maino is going to hold this competition in February.
 

October 11, 2003
@ 02:51 AM

eweek.com has this interesting article on Ellison and his role in Oracle. The future of Oracle and why he is still not ready to name a second in command.

So Oracle's stakeholders—customers, partners, shareholders, and employees—are at the mercy of a nearing-sixty CEO who indulges in high-risk behavior and whose interest in his company is fitful. In late 2002, Ellison spent weeks at a time anchored off the coast of New Zealand, while the eighty-foot yacht that he paid for, Oracle, participated in the America's Cup trials. Early on in the trials, Ellison was a crew member, but the captain, Chris Dickson, yanked him for a more veteran sailor. Ironically, Ellison had elevated Dickson to captain to replace someone else. The Oracle head capitulated meekly to being thrown off the boat, conceding that the captain must prevail.


 

October 11, 2003
@ 02:47 AM

Colt Kwong posted this picture from Hong Kong's attempt to set a new Guinness Book record with hundreds of people assembling PCs at the same time, in the same place.

Do I see Windows XP being installed on all those PCs?? :)


 

October 9, 2003
@ 12:37 AM

Bugs are a part of every programmer's life. Here's what The Jargon Dictionary has to say about bugs.

First the definition of bug

bug: n. An unwanted and unintended property of a program or piece of hardware, esp. one that causes it to malfunction. Jargon Dictionary Reference

Common types of bugs

heisenbug: /hi:'zen-buhg/ n. [from Heisenberg's Uncertainty Principle in quantum physics]. A bug that disappears or alters its behavior when one attempts to probe or isolate it.

Bohr bug: /bohr buhg/ n. [from quantum physics]. A repeatable bug; one that manifests reliably under a possibly unknown but well-defined set of conditions.

mandelbug: /man'del-buhg/ n. [from the Mandelbrot set]. A bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic.

schroedinbug: /shroh'din-buhg/ n. [MIT: from the Schroedinger's Cat thought-experiment in quantum physics]. A design or implementation bug in a program that doesn't manifest until someone reading source or using the program in an unusual way notices that it never should have worked, at which point the program promptly stops working for everybody until fixed.

aliasing bug (stale pointer bug): n. A class of subtle programming errors that can arise in code that does dynamic allocation.

A more detailed list of bugs


 

September 30, 2003
@ 12:24 AM

Fresh from Computex 2003, courtesy The Inquirer, we have two new rumors surrounding Intel's short-range plans for the successors to its Pentium 4 line and Prescott (wherever it fits in). First, there's news that Intel will be using AMD's 64bit extensions and not their own due to Microsoft:

If you think Intel would eat its own shoes before it adopted AMD64, guess again, it will be compatible with AMD64. If you doubt this, think about one thing, why this is happening. It is not Intel's doing, not by a long shot. While it may use a different name, like the old x86-64, or even Extended x86, it will run all the software that AMD does. Mmmmm, shoes are tasty. So, why is this again? Simple, MS. Microsoft will not support a different 64 bit platform, and frankly I don't blame it, it costs a lot of money to do that. MS gave Intel the choice, support AMD's instruction set, or do without Windows. MS won that battle pretty handily.

This was followed-up two hours later with news about the successors to Pentium 4 - Pentium V/5 and Pentium 6, as well as news that Microsoft's 64 bit OS will be called Windows Elements:

DETAILS HAVE EMERGED of the future design of Intel's Tejas/Pentium V processor, and of how the chip firm will present it to the world. The chip will sample internally at Intel in January 2004 and will take between four to six months to get to market. The Pentium 6 will follow a very similar schedule. The Pentium V is likely to fly along at between 5GHz to 7GHz, have 2MB plus of level two cache, be built on a 90 nanometer process, and have a stackable design. The processor we believe, sits in the LGA 775 pin socket, and above it is a very thin heatsink. But, according to sources close to the firm's plans, another permeable heatsink can sit between this and another microprocessor module, giving a stackable design. The final design of this arrangement is not set in stone. According to this source, and the details have not been confirmed, a module sitting on top could provide 64-bit extensions. And the source claimed, Microsoft is ready to launch a version of Windows called Elements with 64-bit extensions.

What's missing is where this leaves Prescott. Will Prescott be a pre-game Pentium 5 like the Pentium III Katmai was, or will it become a workstation/server fork like the Pentium III Tulatin? My take is that Prescott will be a pre-game Pentium 5, the "coprocessor" 64bit idea won't fly, and that Pentium 6 debut a year after Tejas sounds about right if, and only if, Prescott launches as a Pentium 5 this year in December. As to Windows Elements and 64 bit on the desktop... well, I own a Sun Blade 100. I can tell you from personal experience that a 64 bit CPU means nothing without proper cache and motherboard resources to back it up. ~write-up by IceStorm


 

September 27, 2003
@ 04:37 AM

Very interactive documentation website. Which engine does it use ? The active glossary is annoying though. Can switch it off.


 

September 12, 2003
@ 03:27 AM

Press Win+L to lock your workstation.


You can rename multiple files all at once: Select a group of files, right-click the first file, and select "Rename". Type in a name for the first file, and the rest will follow.
Hold down the shift key when switching to thumbnail view to hide the file names. Do it again to bring them back.
From the View Menu, select "Choose Details" to select which file properties should be shown in the Explorer window. To sort by a file property, check its name in the "Choose Details" in order to make that property available in the "Arrange Icons by" menu.
To arrange two windows side-by-side, switch to the first window, then hold the Control key while right-clicking the taskbar button of the second window. Select "Tile Vertically".
To close several windows at once, hold down the Control key while clicking on the taskbar buttons of each window. Once you have selected all the windows you want to close, right-click the last button you selected and pick "Close Group".
You can turn a folder into a desktop toolbar by dragging the icon of the desired folder to the edge of the screen. You can then turn it into a floating toolbar by dragging it from the edge of the screen into the middle of the screen. (It helps if you minimize all application windows first.)
To organize your Favorites in Explorer instead of using the Organize Favorites dialog, hold the shift key while selecting "Organize Favorites" from the Favorites menu of an Explorer window.
In Internet Explorer, hold the Shift key while turning the mouse wheel to go forwards or backwards.
In some applications (such as Internet Explorer), holding the Control key while turning the mouse wheel will change the font size.


 

September 12, 2003
@ 03:12 AM

Have you ever needed an email .. NOW? Have you ever gone to a website that asks for your email for no reason (other than they are going to sell your email address to the highest bidder so you get spammed forever) ?

Welcome to Mailinator

It's no-signup, instant email. Here is how it works: You are on the web, at a party, or talking to your favorite insurance salesman. Wherever you are, someone (or some webpage) asks for your email. You know if you give it, you'll be spammed. On the other hand, you do want at least one email from that person. The answer is to give them a mailinator address. You don't need to sign up. You just make it up on the spot. Pick jonesy@mailinator.com or bipster@mailinator.com - pick anything you want (up to 15 characters before the @ sign).

Later, come to this site and check the email for that account. It's that easy. Mailinator accounts are created when mail arrives for them. No signup, no personal information, and when you're done - you can walk away. The emails will automatically be deleted for you after a few hours.
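If you would rather not think up a name each time, the "make it up on the spot" step is easy to script. A small Python sketch: the function name and the choice of lowercase letters and digits are my own assumptions, and the only rule taken from the text above is the 15-character limit before the @ sign.

```python
import secrets
import string

MAX_LOCAL_PART = 15  # limit quoted above: up to 15 characters before the @ sign


def throwaway_address(length=10):
    """Make up a random mailinator.com address on the spot."""
    if not 1 <= length <= MAX_LOCAL_PART:
        raise ValueError(f"local part must be 1..{MAX_LOCAL_PART} characters")
    # Assumed alphabet: lowercase letters and digits only.
    alphabet = string.ascii_lowercase + string.digits
    local = "".join(secrets.choice(alphabet) for _ in range(length))
    return f"{local}@mailinator.com"


print(throwaway_address())
```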