What is TextPipe Pro?
TextPipe™ is an industrial-strength text transformation, conversion, cleansing and extraction workbench.
What does TextPipe Pro do?
TextPipe Pro is the ideal means to convert text between Unicode and 145 language encodings, or between Unicode and 151 code page encodings.
TextPipe Pro can convert Mainframe EBCDIC files to ASCII format; you can even paste in your Mainframe copybook and TextPipe Pro will do most of the work for you!
TextPipe Pro can even handle packed-decimal (COMP-3) and zoned decimal fields.
You can also:
- Convert between Unix, PC, Mac and Mainframe end of line and fixed width record formats. Conversion can detect incoming line feeds, and insert new characters such as line feeds between fixed-length records. Invalid End of Line characters can be automatically removed
- Convert unprintable IBM drawing characters to + and | and -
- Convert tabs to spaces or spaces to tabs
- Convert character case to UPPERCASE, lowercase, tOGGLE cASE, Title Case, Sentence case, rANdoM cASE
- Convert character collating sequence from ASCII to EBCDIC or EBCDIC to ASCII. Expand EBCDIC packed or zoned decimal, compress to EBCDIC packed or zoned decimal. Useful for handling mainframe files
- Convert ASCII (Windows OEM) to ANSI and ANSI to ASCII. Useful with Windows and non-English languages
- Convert CSV to Tab-delimited, CSV to XML
- Convert Tab-delimited to CSV, Tab-delimited to XML
- Convert text to a word list
- Convert text to a Hex or Decimal dump (very useful for finding control characters)
- Convert Word documents to text
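To illustrate what the EBCDIC and packed-decimal (COMP-3) conversions above involve, here is a minimal Python sketch. It is not TextPipe's implementation; the function name `unpack_comp3`, the sample record layout and the choice of code page 037 are assumptions made for the example. Only the preferred sign nibbles (C, D, F) are accepted.

```python
from decimal import Decimal

POSITIVE_SIGNS = {0x0C, 0x0F}  # C = positive, F = unsigned
NEGATIVE_SIGNS = {0x0D}        # D = negative


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two BCD digits per byte,
    with the sign held in the low nibble of the last byte."""
    if not raw:
        raise ValueError("empty packed-decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid BCD digit in packed field")
    if sign not in POSITIVE_SIGNS | NEGATIVE_SIGNS:
        raise ValueError(f"unexpected sign nibble {sign:#x}")
    value = int("".join(str(d) for d in digits))
    if sign in NEGATIVE_SIGNS:
        value = -value
    # scale = number of implied decimal places (e.g. PIC S9(3)V99 -> 2)
    return Decimal(value).scaleb(-scale)


# Hypothetical 8-byte record: 5 EBCDIC characters + a 3-byte COMP-3 amount.
record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x12, 0x34, 0x5C])
name = record[:5].decode("cp037")          # EBCDIC (US/Canada) to text
amount = unpack_comp3(record[5:], scale=2)
print(name, amount)
```

In practice the field offsets, lengths and scale would come from the copybook rather than being hard-coded.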
Seven Reasons Why TextPipe Pro is Different
- TextPipe is exceptionally fast. Several unique algorithms speed up processing
- TextPipe handles files of unlimited size, even files larger than 2 Gigabytes! Other applications attempt to load the entire file into memory (grinding your system to a halt).
- TextPipe's unique restrictions control precisely where changes are made. Restrict to a range of lines or columns, to specific Tab or CSV fields, between HTML/XML tags, and inside custom ranges. Restrictions can be combined, for example, to columns 1-10 of lines matching a pattern. Restrictions are essential for extensive but controlled search and replace
- TextPipe performs multiple operations simultaneously. Other applications offer only one operation at a time, allow at most five, or require a slow multi-pass approach
- If TextPipe's 100+ filters don't suit your needs, you can use industry standard VBScript/JScript to write your own. Other applications either don't offer this facility, or force you to learn a proprietary language
- TextPipe is unique in offering the EasyPattern pattern matching language for those not familiar with text pattern matching (regular expressions). EasyPatterns are English-like and very easy to learn
- TextPipe can be scheduled for non-interactive use, and can be controlled by an external program. Other applications provide only a mouse interface.
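The combined restriction mentioned above (columns 1-10 of lines matching a pattern) can be pictured with a short Python sketch. This is only an illustration of the idea, not how TextPipe implements restrictions; the function name `replace_in_columns` and its parameters are hypothetical.

```python
import re


def replace_in_columns(lines, line_pattern, start_col, end_col, find, replacement):
    """Replace `find` with `replacement`, but only inside columns
    start_col..end_col (1-based, inclusive) of lines matching line_pattern."""
    matcher = re.compile(line_pattern)
    result = []
    for line in lines:
        if matcher.search(line):
            head = line[:start_col - 1]
            target = line[start_col - 1:end_col]
            tail = line[end_col:]
            line = head + target.replace(find, replacement) + tail
        result.append(line)
    return result


sample = ["ERR 00-00 keep 00", "OK  00-00 keep 00"]
fixed = replace_in_columns(sample, r"^ERR", 1, 10, "0", "9")
print(fixed)
```

Only the first ten columns of the line starting with ERR are changed; text outside that range, and lines that do not match, pass through untouched.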
TextPipe Pro is Easy
TextPipe will save you time, frustration and money. It will fix text data, regardless of the number of changes required, the size or number of files, and the complexity of the transformations.
TextPipe provides a single point of maintenance for all your text processing tasks. You learn one tool, rather than learning 4 or more - and their associated languages, command line options, debugging schemes, idiosyncrasies and operating system differences and dependencies. TextPipe is far less costly to learn, use, develop with and maintain than cobbling together multiple generic tools and custom scripts to achieve one end. It's a Swiss army knife combining the best of perl, awk, grep, sed, and many other less common text processing tools. You'll be productive with TextPipe in minutes, not days.
TextPipe's unmatched power comes from its arsenal of 100+ manipulation filters, its unique architecture and its tremendous flexibility in combining these filters to suit each task. Intuitive line, column, field, tag and attribute restrictions make fixing data extracts simple. You can extract and then modify data from databases, in delimited, XML and SQL Insert Script formats. You can roll your own custom filters using industry standard VBScript and JScript. With TextPipe you can create your own conversions, and deploy them for execution at remote sites. A single click merges files (even those larger than 10 GB), another click extracts email addresses, and another click sorts and removes duplicates. Try doing that with fewer than 100 lines of code, in less than 10 seconds!
TextPipe makes it fast and easy to convert, transform and re-purpose data in text files, including:
- HTML, XML and other structured documents from the WWW
- Fixed length or delimited files (CSV, Tab, Pipe, etc)
- Unix, Mainframe and PC/Windows end-of-line formats
- Inside Zip files, and the new Microsoft Office 2007 formats DOCX, XLSX, PPTX
- ASCII, ANSI, Unicode and EBCDIC files
- Security log files from firewalls, web servers etc
- EDIFACT, HL7, SWIFT and other structured formats
- Spooled print files
- Structured and unstructured reports of any size or dimension
TextPipe Pro is Fast
In a recent speed trial, TextPipe made 17 million replacements in a 250,000 record, 75 MB file in 1 minute 45 seconds. Other 'popular' text applications took 35 minutes and 72 minutes, and MS Word took over 3 hours.
TextPipe Pro is Powerful
TextPipe Pro is trusted by over 1500 customers in 56 countries to:
- Convert huge files quickly and easily
- Data mine unstructured mainframe reports and web data
- Cleanse and reformat electronic text
- Update web sites
- Perform data warehouse Extract-Transform-Load (ETL) tasks
- Extract from databases to XML, CSV, tab-delimited
- Split and join massive files
- Convert between a variety of mainframe, PC and Unicode data formats and encodings
- Pre-process training data for Statistical Machine Translation (SMT)
What can TextPipe Pro do for you? Download your free trial to find out.
TextPipe Pro Versions
TextPipe Pro Single User License: If you need to run multiple simultaneous copies, you need the server version below.
TextPipe Pro Database Server/Web Server/Email Server License: Required when multiple copies of TextPipe run simultaneously on the one machine. May be installed on a server-class computer such as a database server, email server, web server or application server. Note: May NOT be used for multiple users to access with Terminal Services.
***
TextPipe Pro Unicode and Encoding Conversions
The following is a list of the 145 Unicode conversions and encoding conversions offered by TextPipe Pro.
- Convert Unicode to ANSI
- Convert ANSI to Unicode
- Convert Unicode to ASCII
- Convert ASCII to Unicode
The Unicode conversions are found under Filters Menu\Unicode.
Unicode Normalization filters:
- NFC - Canonical Decomposition, followed by Canonical Composition
- NFD - Canonical Decomposition
- NFKD - Compatibility Decomposition
- NFKC - Compatibility Decomposition, followed by Canonical Composition
- Compose
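As a rough illustration of what the four normalization forms listed above do, the following Python sketch uses the standard library's `unicodedata.normalize`. It is independent of TextPipe and the sample strings are chosen only for the example.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
ligature = "\ufb01"      # "ﬁ" ligature, a compatibility character

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms also replace compatibility characters
# such as ligatures with their plain equivalents.
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature)

# The canonical forms leave the ligature alone.
nfc_ligature = unicodedata.normalize("NFC", ligature)

print([hex(ord(ch)) for ch in nfd], nfkc)
```

Normalizing both sides to the same form before comparing or de-duplicating text avoids treating visually identical strings as different.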
Conversions between Unicode and:
- European languages
- ASCII
- ISO-8859-1 (Western)
- ISO-8859-2 (Central European)
- ISO-8859-3 (South European)
- ISO-8859-4 (Baltic)
- ISO-8859-5 (Cyrillic)
- ISO-8859-7 (Greek)
- ISO-8859-9 (Turkish)
- ISO-8859-10 (Nordic)
- ISO-8859-13 (Baltic)
- ISO-8859-14 (Celtic)
- ISO-8859-15 (Western)
- ISO-8859-16 (Romanian)
- Windows 1250 (Central Europe)
- Windows 1251 (Cyrillic)
- Windows 1252 (Latin 1)
- Windows 1253 (Greek)
- Windows 1254 (Turkish)
- Windows 1255 (Hebrew)
- Windows 1256 (Arabic)
- Windows 1257 (Baltic)
- Windows 1258 (Vietnam)
- CP437, CP737 DOS Greek, CP775 DOS BaltRim, CP850, CP852, CP853, CP855, CP856 Hebrew PC, CP857, CP858, CP860, CP861, CP863, CP865, CP866, CP869, CP1125
- MacRoman, MacCentralEurope, MacIceland, MacCroatian, MacRomaniaCyrillic, MacUkraine, MacGreek, Mac Dingbats, Mac Farsi, Mac Romania
- Semitic languages
- ISO-8859-6 (Arabic)
- ISO-8859-8 (Hebrew Visual)
- CP1255, CP1256
- CP862, CP864
- MacHebrew, MacArabic
- Japanese
- EUC-JP
- SHIFT_JIS
- CP932
- ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, ISO-2022-JP-3
- EUC-JISX0213
- Shift_JISX0213
- Chinese
- EUC-CN
- HZ, GBK
- GB18030 Standard Chinese
- EUC-TW
- BIG5
- CP950
- BIG5-HKSCS
- ISO-2022-CN, ISO-2022-CN-EXT
- Korean
- KOI8-R, KOI8-U, KOI8-RU (Cyrillic)
- EUC-KR
- CP949
- ISO-2022-KR
- JOHAB
- Armenian
- Georgian
- Georgian-Academy
- Georgian-PS
- Tajik
- Thai
- TIS-620
- CP874 Thai
- MacThai
- Laotian
- Vietnamese
- Platform specific/other
- HP-ROMAN8
- NEXTSTEP
- RISCOS-LATIN1
- C99
- JAVA
- IBM424
- IBM437
- IBM850
- IBM852
- IBM855
- IBM857
- IBM860
- IBM861
- IBM862
- IBM863
- IBM864
- IBM865
- IBM866
- IBM869
- JIS_X0201
- TIS-620
- Full Unicode
- UTF-8
- UCS-2, UCS-2BE, UCS-2LE
- UCS-4, UCS-4BE, UCS-4LE
- UTF-16, UTF-16BE, UTF-16LE
- UTF-32, UTF-32BE, UTF-32LE
- UTF-7, UTF-7 Optional Direct Characters
Note:
- UCS-4 is UTF-32 with support for code points beyond U+10FFFF (which are supposed to be unassignable forever).
- UCS-2 is UTF-16 with surrogate support removed (so code points beyond U+FFFF cannot be represented).
- Turkmen
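The note above on UCS-2 and UTF-16 can be made concrete with a small Python sketch. The helper names `to_utf16_code_units` and `fits_in_ucs2` are hypothetical and exist only for this illustration; they show how UTF-16 uses a surrogate pair for a code point beyond U+FFFF, which UCS-2 cannot represent.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit code unit(s) UTF-16 uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0xFFFF:
        return [code_point]              # one unit, same as UCS-2
    offset = code_point - 0x10000        # 20 bits, split into 10 + 10
    high = 0xD800 + (offset >> 10)       # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # low (trail) surrogate
    return [high, low]


def fits_in_ucs2(text: str) -> bool:
    """True if every character is at or below U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)


units = to_utf16_code_units(0x1F600)     # an emoji outside the BMP
print([hex(u) for u in units], fits_in_ucs2("\U0001F600"))
```

A check like `fits_in_ucs2` is one way to decide in advance whether a UCS-2 target would lose characters.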
TextPipe Pro Code Page Conversions
The following list defines the 151 code page conversions (*) (also referred to as character sets) supported by TextPipe Pro.
Code-Page Identifiers
(*) The list of available code pages may be different on your system. You can install additional code pages using Control Panel\Regional Options.
Identifier | Name
037 | IBM EBCDIC - U.S./Canada
437 | OEM - United States
500 | IBM EBCDIC - International
708 | Arabic - ASMO 708
709 | Arabic - ASMO 449+, BCON V4
710 | Arabic - Transparent Arabic
720 | Arabic - Transparent ASMO
737 | OEM - Greek (formerly 437G)
775 | OEM - Baltic
850 | OEM - Multilingual Latin I
852 | OEM - Latin II
855 | OEM - Cyrillic (primarily Russian)
857 | OEM - Turkish
858 | OEM - Multilingual Latin I + Euro symbol
860 | OEM - Portuguese
861 | OEM - Icelandic
862 | OEM - Hebrew
863 | OEM - Canadian-French
864 | OEM - Arabic
865 | OEM - Nordic
866 | OEM - Russian
869 | OEM - Modern Greek
870 | IBM EBCDIC - Multilingual/ROECE (Latin-2)
874 | ANSI/OEM - Thai
875 | IBM EBCDIC - Modern Greek
932 | ANSI/OEM - Japanese, Shift-JIS
936 | ANSI/OEM - Simplified Chinese (PRC, Singapore)
949 | ANSI/OEM - Korean (Unified Hangeul Code)
950 | ANSI/OEM - Traditional Chinese (Taiwan; Hong Kong SAR, PRC)
1026 | IBM EBCDIC - Turkish (Latin-5)
1047 | IBM EBCDIC - Latin 1/Open System
1140 | IBM EBCDIC - U.S./Canada (037 + Euro symbol)
1141 | IBM EBCDIC - Germany (20273 + Euro symbol)
1142 | IBM EBCDIC - Denmark/Norway (20277 + Euro symbol)
1143 | IBM EBCDIC - Finland/Sweden (20278 + Euro symbol)
1144 | IBM EBCDIC - Italy (20280 + Euro symbol)
1145 | IBM EBCDIC - Latin America/Spain (20284 + Euro symbol)
1146 | IBM EBCDIC - United Kingdom (20285 + Euro symbol)
1147 | IBM EBCDIC - France (20297 + Euro symbol)
1148 | IBM EBCDIC - International (500 + Euro symbol)
1149 | IBM EBCDIC - Icelandic (20871 + Euro symbol)
1200 | Unicode UCS-2 Little-Endian (BMP of ISO 10646)
1201 | Unicode UCS-2 Big-Endian
1250 | ANSI - Central European
1251 | ANSI - Cyrillic
1252 | ANSI - Latin I
1253 | ANSI - Greek
1254 | ANSI - Turkish
1255 | ANSI - Hebrew
1256 | ANSI - Arabic
1257 | ANSI - Baltic
1258 | ANSI/OEM - Vietnamese
1361 | Korean (Johab)
10000 | MAC - Roman
10001 | MAC - Japanese
10002 | MAC - Traditional Chinese (Big5)
10003 | MAC - Korean
10004 | MAC - Arabic
10005 | MAC - Hebrew
10006 | MAC - Greek I
10007 | MAC - Cyrillic
10008 | MAC - Simplified Chinese (GB 2312)
10010 | MAC - Romania
10017 | MAC - Ukraine
10021 | MAC - Thai
10029 | MAC - Latin II
10079 | MAC - Icelandic
10081 | MAC - Turkish
10082 | MAC - Croatia
12000 | Unicode UCS-4 Little-Endian
12001 | Unicode UCS-4 Big-Endian
20000 | CNS - Taiwan
20001 | TCA - Taiwan
20002 | Eten - Taiwan
20003 | IBM5550 - Taiwan
20004 | TeleText - Taiwan
20005 | Wang - Taiwan
20105 | IA5 IRV International Alphabet No. 5 (7-bit)
20106 | IA5 German (7-bit)
20107 | IA5 Swedish (7-bit)
20108 | IA5 Norwegian (7-bit)
20127 | US-ASCII (7-bit)
20261 | T.61
20269 | ISO 6937 Non-Spacing Accent
20273 | IBM EBCDIC - Germany
20277 | IBM EBCDIC - Denmark/Norway
20278 | IBM EBCDIC - Finland/Sweden
20280 | IBM EBCDIC - Italy
20284 | IBM EBCDIC - Latin America/Spain
20285 | IBM EBCDIC - United Kingdom
20290 | IBM EBCDIC - Japanese Katakana Extended
20297 | IBM EBCDIC - France
20420 | IBM EBCDIC - Arabic
20423 | IBM EBCDIC - Greek
20424 | IBM EBCDIC - Hebrew
20833 | IBM EBCDIC - Korean Extended
20838 | IBM EBCDIC - Thai
20866 | Russian - KOI8-R
20871 | IBM EBCDIC - Icelandic
20880 | IBM EBCDIC - Cyrillic (Russian)
20905 | IBM EBCDIC - Turkish
20924 | IBM EBCDIC - Latin-1/Open System (1047 + Euro symbol)
20932 | JIS X 0208-1990 & 0212-1990
20936 | Simplified Chinese (GB2312)
21025 | IBM EBCDIC - Cyrillic (Serbian, Bulgarian)
21027 | Extended Alpha Lowercase
21866 | Ukrainian (KOI8-U)
28591 | ISO 8859-1 Latin I
28592 | ISO 8859-2 Central Europe
28593 | ISO 8859-3 Latin 3
28594 | ISO 8859-4 Baltic
28595 | ISO 8859-5 Cyrillic
28596 | ISO 8859-6 Arabic
28597 | ISO 8859-7 Greek
28598 | ISO 8859-8 Hebrew
28599 | ISO 8859-9 Latin 5
28605 | ISO 8859-15 Latin 9
29001 | Europa 3
38598 | ISO 8859-8 Hebrew
50220 | ISO 2022 Japanese with no halfwidth Katakana
50221 | ISO 2022 Japanese with halfwidth Katakana
50222 | ISO 2022 Japanese JIS X 0201-1989
50225 | ISO 2022 Korean
50227 | ISO 2022 Simplified Chinese
50229 | ISO 2022 Traditional Chinese
50930 | Japanese (Katakana) Extended
50931 | US/Canada and Japanese
50933 | Korean Extended and Korean
50935 | Simplified Chinese Extended and Simplified Chinese
50936 | Simplified Chinese
50937 | US/Canada and Traditional Chinese
50939 | Japanese (Latin) Extended and Japanese
51932 | EUC - Japanese
51936 | EUC - Simplified Chinese
51949 | EUC - Korean
51950 | EUC - Traditional Chinese
52936 | HZ-GB2312 Simplified Chinese
54936 | Windows XP: GB18030 Simplified Chinese (4 Byte)
57002 | ISCII Devanagari
57003 | ISCII Bengali
57004 | ISCII Tamil
57005 | ISCII Telugu
57006 | ISCII Assamese
57007 | ISCII Oriya
57008 | ISCII Kannada
57009 | ISCII Malayalam
57010 | ISCII Gujarati
57011 | ISCII Punjabi
65000 | Unicode UTF-7
65001 | Unicode UTF-8
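For readers who want to see what a code page conversion amounts to, here is a minimal Python sketch that transcodes bytes between a few of the identifiers listed above. It does not use TextPipe; the `PYTHON_CODECS` mapping and the `transcode` function are assumptions for this example, and only a handful of identifiers are mapped to the corresponding Python codec names.

```python
# Hypothetical mapping from a few code page identifiers in the table
# to the names of Python's built-in codecs.
PYTHON_CODECS = {
    37: "cp037",         # IBM EBCDIC - U.S./Canada
    437: "cp437",        # OEM - United States
    1252: "cp1252",      # ANSI - Latin I
    20866: "koi8_r",     # Russian - KOI8-R
    28591: "iso8859_1",  # ISO 8859-1 Latin I
    65001: "utf_8",      # Unicode UTF-8
}


def transcode(data: bytes, source_id: int, target_id: int) -> bytes:
    """Decode bytes from one code page and re-encode them in another.
    Raises UnicodeEncodeError if the target cannot represent a character."""
    text = data.decode(PYTHON_CODECS[source_id])
    return text.encode(PYTHON_CODECS[target_id])


euro_utf8 = transcode(b"\x80", 1252, 65001)   # 0x80 is the Euro sign in 1252
letter_a = transcode(b"\xc1", 37, 65001)      # 0xC1 is "A" in EBCDIC 037
print(euro_utf8, letter_a)
```

Going through Unicode as the intermediate form is what lets any supported source code page be paired with any target.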